
Multi-GPU ChatGLM2 SFT fails with RuntimeError: expected scalar type Half but found Float #60

Closed
zhr0313 opened this issue Jul 2, 2023 · 8 comments
Labels
bug Something isn't working

Comments

@zhr0313

zhr0313 commented Jul 2, 2023

│ ❱  848 │   │   transformer_outputs = self.transformer(
│
│ /root/.local/lib/python3.9/site-packages/torch/nn/modules/module.py:1501 in _call_impl
│
│   1498 │   │   if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks
│   1499 │   │   │   │   or _global_backward_pre_hooks or _global_backward_hooks
│   1500 │   │   │   │   or _global_forward_hooks or _global_forward_pre_hooks):
│ ❱ 1501 │   │   │   return forward_call(*args, **kwargs)
│
│ /root/.local/lib/python3.9/site-packages/torch/nn/modules/linear.py:114 in forward
│
│   113 │   def forward(self, input: Tensor) -> Tensor:
│ ❱ 114 │   │   return F.linear(input, self.weight, self.bias)
│   115 │
│   116 │   def extra_repr(self) -> str:
│   117 │   │   return 'in_features={}, out_features={}, bias={}'.format(
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
RuntimeError: expected scalar type Half but found Float

The error occurs at step 500 (save_steps = 500). Launch command: sh CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.run --nproc_per_node 2 supervised_finetuning.py
Single-GPU training works fine.
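For reference, the error in the traceback comes from `F.linear` receiving a float32 input while the weights are half precision. A minimal sketch (shapes and tensors are illustrative, not from the actual model) reproduces the same class of failure:

```python
import torch
import torch.nn.functional as F

# Half-precision weights, as in an fp16-loaded model
weight = torch.randn(4, 8, dtype=torch.half)
bias = torch.randn(4, dtype=torch.half)
# A float32 input, e.g. from a code path that bypassed autocast
x = torch.randn(2, 8, dtype=torch.float)

try:
    F.linear(x, weight, bias)  # dtype mismatch: Half weights vs Float input
except RuntimeError as e:
    print(f"RuntimeError: {e}")
```

The exact message varies by PyTorch version, but mixed-dtype `F.linear` raises a RuntimeError either way.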

@zhr0313 zhr0313 added the bug Something isn't working label Jul 2, 2023
@zhr0313
Author

zhr0313 commented Jul 2, 2023

After further testing, the error appears to occur during eval.

@zhr0313
Author

zhr0313 commented Jul 2, 2023

Adding with torch.autocast("cuda"): around both the train and eval code solves the problem.
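The workaround above can be sketched as follows (a minimal, self-contained illustration, not the repository's actual training loop; the device type is picked at runtime so the sketch also runs on CPU, where autocast uses bfloat16):

```python
import torch

# Stand-in model and batch; in the real run these would be the
# fine-tuned model and the tokenized training/eval batch.
model = torch.nn.Linear(8, 4)
x = torch.randn(2, 8)  # float32 activations

# Wrap the forward pass (both train and eval steps) in autocast so
# inputs are cast to the compute dtype automatically.
device_type = "cuda" if torch.cuda.is_available() else "cpu"
with torch.autocast(device_type):
    out = model(x)

print(out.dtype)  # a lower-precision dtype chosen by autocast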

@daimazz1

Hi, I hit this error too when training chatglm-6b on a single GPU. After adding with torch.autocast("cuda"): to the PT-stage train and eval code it runs, but when I then tested bloom I found the eval perplexity becomes abnormal (20k+ on my data) with it added. Does adding this hurt the model's quality during PT-stage training?

@zhr0313
Author

zhr0313 commented Jul 20, 2023

Yes, I noticed this too: with autocast added, the loss also stops going down. Reinstalling my environment fixed it. Alternatively, setting eval_steps very large so eval never runs also works around it; I haven't seen any impact on the resulting model.

@daimazz1

> Yes, I noticed this too: with autocast added, the loss also stops going down. Reinstalling my environment fixed it. Alternatively, setting eval_steps very large so eval never runs also works around it; I haven't seen any impact on the resulting model.

Hi, this probably isn't related to the environment, is it? Did you hit this on a ChatGLM model? And with with torch.autocast("cuda"): added, how large does eval_steps need to be to work around the problem?

@zhr0313
Author

zhr0313 commented Jul 23, 2023

On a 4×V100 environment with the latest libraries it runs fine. On a 2×A100 environment with older libraries the problem occurs, but that A100 environment is hard to modify for various reasons, so I can't confirm it's a library issue. Without with torch.autocast("cuda"):, setting eval_steps greater than your total training step count works around this problem (expected scalar type Half but found Float).
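The eval_steps workaround can be illustrated with a small sketch (max_steps and should_evaluate are hypothetical names for illustration; the real values come from the training arguments):

```python
# Make the eval interval exceed the total number of training steps,
# so step-based evaluation never triggers during the run.
max_steps = 1000            # total optimizer steps for the run (assumed)
eval_steps = max_steps + 1  # eval interval larger than max_steps

def should_evaluate(step: int) -> bool:
    """Trainer-style check: eval fires every eval_steps steps."""
    return step > 0 and step % eval_steps == 0

# No step in the run reaches the eval interval.
print(any(should_evaluate(s) for s in range(1, max_steps + 1)))  # False
```

This skips the failing eval path entirely rather than fixing the dtype mismatch, so it is a workaround, not a fix.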

@shibing624
Owner

shibing624 commented Jul 28, 2023

Refer to mymusise/ChatGLM-Tuning#179 and #125.
