
Multi-GPU ChatGLM2 SFT fails with RuntimeError: expected scalar type Half but found Float #60

Closed
zhr0313 opened this issue Jul 2, 2023 · 8 comments
Labels
bug Something isn't working

Comments

@zhr0313

zhr0313 commented Jul 2, 2023

│ ❱  848 │   │   transformer_outputs = self.transformer(
│
│ /root/.local/lib/python3.9/site-packages/torch/nn/modules/module.py:1501 in _call_impl
│
│   1498 │   │   if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks
│   1499 │   │   │   │   or _global_backward_pre_hooks or _global_backward_hooks
│   1500 │   │   │   │   or _global_forward_hooks or _global_forward_pre_hooks):
│ ❱ 1501 │   │   │   return forward_call(*args, **kwargs)
│
│ /root/.local/lib/python3.9/site-packages/torch/nn/modules/linear.py:114 in forward
│
│   113 │   def forward(self, input: Tensor) -> Tensor:
│ ❱ 114 │   │   return F.linear(input, self.weight, self.bias)
│   115 │
│   116 │   def extra_repr(self) -> str:
│   117 │   │   return 'in_features={}, out_features={}, bias={}'.format(
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
RuntimeError: expected scalar type Half but found Float

The error occurs at step 500 (save_steps = 500). Launch command: sh CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.run --nproc_per_node 2 supervised_finetuning.py
Single-GPU training works fine.
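For reference, the error in the traceback comes from `F.linear` receiving a float32 input while the weights are half precision. A minimal sketch (shapes and tensors are illustrative, not from the actual model) reproduces the same class of failure:

```python
import torch
import torch.nn.functional as F

# Half-precision weights, as in an fp16-loaded model
weight = torch.randn(4, 8, dtype=torch.half)
bias = torch.randn(4, dtype=torch.half)
# A float32 input, e.g. from a code path that bypassed autocast
x = torch.randn(2, 8, dtype=torch.float)

try:
    F.linear(x, weight, bias)  # dtype mismatch: Half weights vs Float input
except RuntimeError as e:
    print(f"RuntimeError: {e}")
```

The exact message varies by PyTorch version, but mixed-dtype `F.linear` raises a RuntimeError either way.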

@zhr0313 zhr0313 added the bug Something isn't working label Jul 2, 2023
@zhr0313
Author

zhr0313 commented Jul 2, 2023

After further testing, the error appears to occur during eval.

@zhr0313
Author

zhr0313 commented Jul 2, 2023

Adding with torch.autocast("cuda"): around both the train and eval code solves the problem.
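The workaround above can be sketched as follows (a minimal, self-contained illustration, not the repository's actual training loop; the device type is picked at runtime so the sketch also runs on CPU, where autocast uses bfloat16):

```python
import torch

# Stand-in model and batch; in the real run these would be the
# fine-tuned model and the tokenized training/eval batch.
model = torch.nn.Linear(8, 4)
x = torch.randn(2, 8)  # float32 activations

# Wrap the forward pass (both train and eval steps) in autocast so
# inputs are cast to the compute dtype automatically.
device_type = "cuda" if torch.cuda.is_available() else "cpu"
with torch.autocast(device_type):
    out = model(x)

print(out.dtype)  # a lower-precision dtype chosen by autocast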

@daimazz1

Hi, I hit this error too when training chatglm-6b on a single GPU. After adding with torch.autocast("cuda"): to the PT-stage train and eval code it runs, but when I then tested bloom I found the eval perplexity becomes abnormal (20k+ on my data) with it added. Does adding this hurt the model's quality during PT-stage training?

@zhr0313
Author

zhr0313 commented Jul 20, 2023

Yes, I noticed this too: with autocast added, the loss also stops going down. Reinstalling my environment fixed it. Alternatively, setting eval_steps very large so eval never runs also works around it; I haven't seen any impact on the resulting model.

@daimazz1

> Yes, I noticed this too: with autocast added, the loss also stops going down. Reinstalling my environment fixed it. Alternatively, setting eval_steps very large so eval never runs also works around it; I haven't seen any impact on the resulting model.

Hi, this probably isn't related to the environment, is it? Did you hit this on a ChatGLM model? And with with torch.autocast("cuda"): added, how large does eval_steps need to be to work around the problem?

@zhr0313
Author

zhr0313 commented Jul 23, 2023

On a 4×V100 environment with the latest libraries it runs fine. On a 2×A100 environment with older libraries the problem occurs, but that A100 environment is hard to modify for various reasons, so I can't confirm it's a library issue. Without with torch.autocast("cuda"):, setting eval_steps greater than your total training step count works around this problem (expected scalar type Half but found Float).
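The eval_steps workaround can be illustrated with a small sketch (max_steps and should_evaluate are hypothetical names for illustration; the real values come from the training arguments):

```python
# Make the eval interval exceed the total number of training steps,
# so step-based evaluation never triggers during the run.
max_steps = 1000            # total optimizer steps for the run (assumed)
eval_steps = max_steps + 1  # eval interval larger than max_steps

def should_evaluate(step: int) -> bool:
    """Trainer-style check: eval fires every eval_steps steps."""
    return step > 0 and step % eval_steps == 0

# No step in the run reaches the eval interval.
print(any(should_evaluate(s) for s in range(1, max_steps + 1)))  # False
```

This skips the failing eval path entirely rather than fixing the dtype mismatch, so it is a workaround, not a fix.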

@shibing624
Owner

shibing624 commented Jul 28, 2023

Refer to mymusise/ChatGLM-Tuning#179 and #125.
