Multi-GPU ChatGLM2 SFT fails with RuntimeError: expected scalar type Half but found Float #60
Comments
After testing, the error appears to be raised during eval.

with torch.autocast("cuda"):

Yeah.
Hi, I hit the same error when training chatglm-6b on a single GPU. After wrapping both train and eval in the PT stage with `with torch.autocast("cuda"):`, it runs now. But when I then tested bloom, the eval perplexity became abnormal (20k+) with this change. Does adding it hurt model quality during PT-stage training?
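To illustrate what the autocast wrapper changes: below is a hypothetical minimal reproduction (not code from this repo) of the dtype mismatch. The sketch uses bfloat16 on CPU so it runs without a GPU; the thread's actual context is float16 weights on CUDA, where the analogous fix is `with torch.autocast("cuda"):`.

```python
import torch
import torch.nn as nn

# Hypothetical reproduction: a layer whose weights are in reduced precision
# fed a float32 input raises a dtype-mismatch RuntimeError, just like
# "expected scalar type Half but found Float" on CUDA with fp16 weights.
layer = nn.Linear(4, 2).to(torch.bfloat16)
x = torch.randn(1, 4)  # float32 input

raised = False
try:
    layer(x)  # weight dtype (bfloat16) != input dtype (float32)
except RuntimeError:
    raised = True

# Wrapping the call in autocast lets PyTorch insert the casts automatically,
# which is what adding `with torch.autocast("cuda"):` around train/eval does.
with torch.autocast("cpu", dtype=torch.bfloat16):
    out = layer(x)

print(raised, out.dtype)
```

Note that autocast changes which ops run in reduced precision, which is one plausible reason eval metrics (e.g. the bloom perplexity mentioned above) can shift after adding it.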
Yes, I've noticed this too; with it added, the loss also won't come down. Reinstalling my environment fixed the problem. Alternatively, setting eval_step very large so eval never runs also works around it, and I haven't seen any impact on the resulting model.
Hi, this probably isn't environment-related, is it? Did you hit this on a chatglm model? After adding `with torch.autocast("cuda"):`, how large did you set eval_step to work around the problem?
On 4x V100 with the latest libraries it runs fine. On 2x A100 with older libraries I get this error, but that A100 environment is hard to modify for various reasons, so I'm not sure it's a library-version issue. Without `with torch.autocast("cuda"):`, setting eval_step larger than your total training steps works around the error (expected scalar type Half but found Float).
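In HF Trainer terms, the "eval_step larger than total training steps" workaround would look roughly like this. This is a sketch assuming standard `transformers.TrainingArguments` flag names; the project's `supervised_finetuning.py` may expose them under slightly different argument names.

```python
from transformers import TrainingArguments

# Sketch of the workaround: push eval_steps past the total number of
# training steps so the (crashing) evaluation loop never triggers.
args = TrainingArguments(
    output_dir="outputs",
    max_steps=500,                 # total training steps
    evaluation_strategy="steps",
    eval_steps=1_000_000,          # > max_steps, so eval never runs
    save_steps=500,
    fp16=True,
)
```

Setting `evaluation_strategy="no"` (where supported) achieves the same effect more directly.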
Refer to mymusise/ChatGLM-Tuning#179 and #125.
│   2670 │   │   │   labels = None                                                                 │
│ ❱  848 │   │   transformer_outputs = self.transformer(                                           │
│                                                                                                  │
│    164 │   │   else:                                                                             │
│    168 │   module.forward = new_forward                                                          │
│                                                                                                  │
│ /root/.local/lib/python3.9/site-packages/torch/nn/modules/module.py:1501 in _call_impl           │
│                                                                                                  │
│   1498 │   │   if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks   │
│   1499 │   │   │   │   or _global_backward_pre_hooks or _global_backward_hooks                   │
│   1500 │   │   │   │   or _global_forward_hooks or _global_forward_pre_hooks):                   │
│ ❱ 1501 │   │   │   return forward_call(*args, **kwargs)                                          │
│                                                                                                  │
│ /root/.local/lib/python3.9/site-packages/torch/nn/modules/linear.py:114 in forward               │
│                                                                                                  │
│    111 │   │   │   init.uniform_(self.bias, -bound, bound)                                       │
│    112 │                                                                                         │
│    113 │   def forward(self, input: Tensor) -> Tensor:                                           │
│ ❱  114 │   │   return F.linear(input, self.weight, self.bias)                                    │
│    115 │                                                                                         │
│    116 │   def extra_repr(self) -> str:                                                          │
│    117 │   │   return 'in_features={}, out_features={}, bias={}'.format(                         │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
RuntimeError: expected scalar type Half but found Float
The error occurs at step 500 (save_steps=500). Launch command: CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.run --nproc_per_node 2 supervised_finetuning.py
Single-GPU training runs fine.