ChatGLM #527
Conversation
Eval results:

```
{'results': {'hellaswag': {'acc': 0.4439354710217088, 'acc_stderr': 0.0049583141142664905, 'acc_norm': 0.5696076478789086, 'acc_norm_stderr': 0.0049411916073179105}}, 'versions': {'hellaswag': 0}, 'config': {'model': 'chatglm', 'batch_size': 1, 'device': 'cuda:0', 'num_fewshot': 0, 'limit': None, 'bootstrap_iters': 100000}}
```

Remaining issue: the end of the training stage hangs (the process gets stuck after trainer.train() finishes), but training itself and saving of all checkpoints work normally.
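For context, the output above matches the result format of EleutherAI's lm-evaluation-harness. Below is a minimal sketch of an equivalent invocation, assuming a "chatglm" model adapter has been registered with the harness (the adapter name and its registration are assumptions; the task and config values are taken from the output above):

```python
from lm_eval import evaluator

# Zero-shot HellaSwag evaluation; parameters mirror the 'config' block above.
results = evaluator.simple_evaluate(
    model="chatglm",        # assumed: a custom model adapter registered with the harness
    tasks=["hellaswag"],
    num_fewshot=0,
    batch_size=1,
    device="cuda:0",
    limit=None,
    bootstrap_iters=100000,
)
print(results["results"]["hellaswag"])
```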
This PR implements fine-tuning of ChatGLM-6B (LoRA fine-tuning and full fine-tuning) as well as inference; the LoRA part can also be ported to other models (a generic sketch of the idea follows the throughput numbers below).

GPU memory usage / throughput:

1. Full finetune
   - 1n4g[1-4] fp16 1dp 4tp 1pp batch_size=1
     [01/09 18:34:36 lb.utils.events]: eta: 17:02:20 iteration: 9/27736 consumed_samples: 80 total_loss: 5.674 time: 2.2448 s/iter data_time: 0.1167 s/iter total_throughput: 3.56 samples/s lr: 1.62e-08
   - 1n4g[1-4] fp16 1dp 1tp 4pp batch_size=1
     [01/09 18:43:45 lb.utils.events]: eta: 10:05:12 iteration: 9/27736 consumed_samples: 80 total_loss: 5.674 time: 1.3446 s/iter data_time: 0.0538 s/iter total_throughput: 5.95 samples/s lr: 1.62e-08

2. LoRA finetune
   - 1n4g[1-4] fp16 1dp 4tp 1pp batch_size=1
     [01/09 18:55:21 lb.utils.events]: eta: 12:51:07 iteration: 9/27736 consumed_samples: 80 total_loss: 5.674 time: 1.7432 s/iter data_time: 0.0278 s/iter total_throughput: 4.59 samples/s lr: 1.62e-08
   - 1n4g[1-4] fp16 1dp 1tp 4pp batch_size=1
     [01/09 19:01:28 lb.utils.events]: eta: 6:29:57 iteration: 9/27736 consumed_samples: 80 total_loss: 5.674 time: 0.8229 s/iter data_time: 0.0110 s/iter total_throughput: 9.72 samples/s lr: 1.62e-08
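As a reference for reproducing the two 4-GPU layouts above (1dp/4tp/1pp versus 1dp/1tp/4pp), here is a minimal sketch of the corresponding overrides in a LiBai-style lazy config; the config file path is illustrative and the field names assume LiBai's usual `train.dist` convention, not this PR's exact config:

```python
from libai.config import LazyConfig

cfg = LazyConfig.load("projects/ChatGLM/configs/chatglm_sft.py")  # illustrative path

# Layout 1: tensor parallel only (1dp x 4tp x 1pp), as in the first run above.
cfg.train.dist.data_parallel_size = 1
cfg.train.dist.tensor_parallel_size = 4
cfg.train.dist.pipeline_parallel_size = 1

# Layout 2: pipeline parallel only (1dp x 1tp x 4pp), as in the second run above.
# cfg.train.dist.tensor_parallel_size = 1
# cfg.train.dist.pipeline_parallel_size = 4

cfg.train.train_micro_batch_size = 1  # batch_size=1 as reported above
```

And since the description notes that the LoRA part can be ported to other models, here is a minimal, generic sketch of the LoRA idea itself (not this PR's implementation): wrap an existing `Linear`, freeze its weights, and learn a low-rank update `B @ A` scaled by `alpha / r`. The class and argument names (`LoRALinear`, `r`, `alpha`) are illustrative.

```python
import math

import oneflow as flow
import oneflow.nn as nn


class LoRALinear(nn.Module):
    """Wrap a frozen nn.Linear with a trainable low-rank update."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        # Freeze the original projection; only the LoRA factors are trained.
        for p in self.base.parameters():
            p.requires_grad = False
        self.lora_a = nn.Parameter(flow.randn(r, base.in_features) / math.sqrt(r))
        self.lora_b = nn.Parameter(flow.zeros(base.out_features, r))  # zero init: no change at step 0
        self.scaling = alpha / r

    def forward(self, x):
        # y = base(x) + scaling * (x A^T) B^T
        lora_out = flow.matmul(flow.matmul(x, self.lora_a.t()), self.lora_b.t())
        return self.base(x) + self.scaling * lora_out
```

A layer is then swapped in place, e.g. `module.dense = LoRALinear(module.dense)`, so that only `lora_a` / `lora_b` receive gradients while the base weights stay frozen.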
```python
import logging
import random

import numpy as np
import oneflow as flow

from libai.config import LazyConfig, default_argument_parser, try_get_key
from libai.engine import DefaultTrainer, default_setup
from libai.utils.checkpoint import Checkpointer

# ChatGLMTrainer is defined in this PR's project code; the import path here is assumed.
from projects.ChatGLM.trainer import ChatGLMTrainer


def main(args):
    # Load the lazy config and apply any command-line overrides.
    cfg = LazyConfig.load(args.config_file)
    cfg = LazyConfig.apply_overrides(cfg, args.opts)
    default_setup(cfg, args)

    # Seed all RNGs with a rank-dependent seed.
    seed_for_rank = cfg.train.seed + flow.env.get_rank()
    flow.manual_seed(seed_for_rank)
    flow.cuda.manual_seed(seed_for_rank)
    np.random.seed(seed_for_rank)
    random.seed(seed_for_rank)

    if args.fast_dev_run:
        # Shrink the schedule for a quick smoke test.
        cfg.train.train_epoch = 0
        cfg.train.train_iter = 20
        cfg.train.evaluation.eval_period = 10
        cfg.train.log_period = 1

    if args.eval_only:
        tokenizer = None
        if try_get_key(cfg, "tokenization") is not None:
            tokenizer = DefaultTrainer.build_tokenizer(cfg)
        model = DefaultTrainer.build_model(cfg)
        Checkpointer(model, save_dir=cfg.train.output_dir).resume_or_load(
            cfg.train.load_weight, resume=args.resume
        )
        if try_get_key(cfg, "graph.enabled", default=False):
            model = DefaultTrainer.build_graph(cfg, model, is_train=False)
        test_loader = DefaultTrainer.build_test_loader(cfg, tokenizer)
        if len(test_loader) == 0:
            logger = logging.getLogger(__name__)
            logger.info("No dataset in dataloader.test, please set dataset for dataloader.test")
        _ = DefaultTrainer.test(cfg, test_loader, model)
        return

    trainer = ChatGLMTrainer(cfg)
    return trainer.train()


if __name__ == "__main__":
    args = default_argument_parser().parse_args()
    main(args)
```

After training finishes (all checkpoints have been saved and trainer.train() in libai has returned), the process hangs for a while and then the following error is raised.
Version information