Can't train the megatron-11b model for finetuning; it fails with "CUDA error: invalid device function". I guess this is related to the PyTorch version, the CUDA version, or the apex version, but I'm not sure what the correct combination is, since getting apex to run is also hard.
Code:
Traceback (most recent call last):
File "/opt/miniconda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
fn(i, *args)
File "/tmp/fairseq/fairseq/distributed/utils.py", line 328, in distributed_main
main(cfg, **kwargs)
File "/tmp/fairseq/fairseq_cli/train.py", line 173, in main
valid_losses, should_stop = train(cfg, trainer, task, epoch_itr)
File "/opt/miniconda/lib/python3.7/contextlib.py", line 74, in inner
return func(*args, **kwds)
File "/tmp/fairseq/fairseq_cli/train.py", line 284, in train
log_output = trainer.train_step(samples)
File "/opt/miniconda/lib/python3.7/contextlib.py", line 74, in inner
return func(*args, **kwds)
File "/tmp/fairseq/fairseq/trainer.py", line 701, in train_step
raise e
File "/tmp/fairseq/fairseq/trainer.py", line 675, in train_step
ignore_grad=is_dummy_batch,
File "/tmp/fairseq/fairseq/tasks/fairseq_task.py", line 475, in train_step
loss, sample_size, logging_output = criterion(model, sample)
File "/opt/miniconda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
result = self.forward(*input, **kwargs)
File "/tmp/fairseq/fairseq/model_parallel/criterions/vocab_parallel_cross_entropy.py", line 42, in forward
net_output = model(**sample["net_input"])
File "/opt/miniconda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
result = self.forward(*input, **kwargs)
File "/opt/miniconda/lib/python3.7/site-packages/training_daemon/utils/hook.py", line 170, in wrapper
return func(*args, **kwargs)
File "/tmp/fairseq/fairseq/models/fairseq_model.py", line 496, in forward
return self.decoder(src_tokens, **kwargs)
File "/opt/miniconda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
result = self.forward(*input, **kwargs)
File "/tmp/fairseq/fairseq/models/transformer.py", line 825, in forward
alignment_heads=alignment_heads,
File "/tmp/fairseq/fairseq/models/transformer.py", line 847, in extract_features
alignment_heads,
File "/tmp/fairseq/fairseq/models/transformer.py", line 951, in extract_features_scriptable
need_head_weights=bool((idx == alignment_layer)),
File "/opt/miniconda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
result = self.forward(*input, **kwargs)
File "/tmp/fairseq/fairseq/modules/transformer_layer.py", line 353, in forward
attn_mask=self_attn_mask,
File "/opt/miniconda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
result = self.forward(*input, **kwargs)
File "/tmp/fairseq/fairseq/model_parallel/modules/multihead_attention.py", line 135, in forward
q = self.q_proj(query)
File "/opt/miniconda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
result = self.forward(*input, **kwargs)
File "/tmp/fairseq/fairseq/model_parallel/megatron/mpu/layers.py", line 243, in forward
output_parallel = F.linear(input_parallel, self.weight, self.bias)
File "/opt/miniconda/lib/python3.7/site-packages/torch/nn/functional.py", line 1678, in linear
output += bias
RuntimeError: CUDA error: invalid device function
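This error at a plain `F.linear` call usually means the CUDA kernels in the installed binaries (PyTorch or an apex extension) were not compiled for the GPU's compute capability. A pure-Python sketch of that compatibility check; `arch_supported` is a hypothetical helper, and in practice you would compare `torch.cuda.get_device_capability()` against `torch.cuda.get_arch_list()`:

```python
def arch_supported(device_capability, compiled_archs):
    """Return True if a binary build covers a GPU.

    device_capability: (major, minor) tuple, e.g. (7, 0) for a V100.
    compiled_archs: list like ["sm_37", "sm_60", "sm_70"], as reported
    by torch.cuda.get_arch_list() for the installed PyTorch build.
    """
    major, minor = device_capability
    return f"sm_{major}{minor}" in compiled_archs

# A build compiled only up to sm_70 cannot serve an sm_80 (A100) GPU;
# the mismatch surfaces at kernel launch as "invalid device function".
print(arch_supported((7, 0), ["sm_37", "sm_60", "sm_70"]))  # True
print(arch_supported((8, 0), ["sm_37", "sm_60", "sm_70"]))  # False
```

If the GPU's `sm_XY` is missing from the compiled list, the fix is a PyTorch/apex build that targets that architecture, not a code change.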
This issue has been automatically marked as stale. If this issue is still affecting you, please leave any comment (for example, "bump"), and we'll keep it open. We are sorry that we haven't been able to prioritize it yet. If you have any new additional information, please include it with your comment!
Closing this issue after a prolonged period of inactivity. If this issue is still present in the latest release, please create a new issue with up-to-date information. Thank you!
Training code:
PREFIX=/blob2/v-lijuwu/fairseq/examples/megatron_11b
What have you tried?
I followed https://github.com/pytorch/fairseq/tree/v0.10.2/examples/megatron_11b to set up the data, the model, the training code, and the environment. Since apex is required, I installed apex accordingly, with one modification: following NVIDIA/apex#323 (comment), I removed the version-checking code.
After installing apex, I tried to run training, reloading the pre-trained model, and the error shown above appeared.
I searched but didn't find a clear answer or solution.
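Removing apex's version check (per NVIDIA/apex#323) silences a real warning: apex compares the bare-metal nvcc CUDA version with the CUDA version the PyTorch binaries were built against, and building extensions across such a mismatch is a common cause of "invalid device function" at run time. A rough sketch of that comparison, assuming version strings like "10.1"; `cuda_versions_match` is a hypothetical helper, not apex's actual function name:

```python
def cuda_versions_match(nvcc_version, torch_cuda_version):
    """Compare major.minor of the bare-metal nvcc toolkit (nvcc --version)
    with PyTorch's bundled CUDA runtime (torch.version.cuda).

    e.g. cuda_versions_match("10.1", "10.1") -> True
    """
    nv = nvcc_version.split(".")[:2]
    tv = torch_cuda_version.split(".")[:2]
    return nv == tv

print(cuda_versions_match("10.1", "10.1"))  # True: safe to build extensions
print(cuda_versions_match("11.0", "10.2"))  # False: extensions built with this
                                            # nvcc may fail at kernel launch
```

When this check fails, rebuilding apex against a matching toolkit (or installing a PyTorch build matching the local nvcc) is usually safer than deleting the check.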
What's your environment?
How you installed fairseq (pip, source): pip install