Can't train the megatron-11b model for finetuning; it fails with "CUDA error: invalid device function". I guess this is related to the PyTorch version, the CUDA version, or the apex version, but I'm not sure what the correct combination is, since getting apex to run is also hard.
Code:
Traceback (most recent call last):
File "/opt/miniconda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
fn(i, *args)
File "/tmp/fairseq/fairseq/distributed/utils.py", line 328, in distributed_main
main(cfg, **kwargs)
File "/tmp/fairseq/fairseq_cli/train.py", line 173, in main
valid_losses, should_stop = train(cfg, trainer, task, epoch_itr)
File "/opt/miniconda/lib/python3.7/contextlib.py", line 74, in inner
return func(*args, **kwds)
File "/tmp/fairseq/fairseq_cli/train.py", line 284, in train
log_output = trainer.train_step(samples)
File "/opt/miniconda/lib/python3.7/contextlib.py", line 74, in inner
return func(*args, **kwds)
File "/tmp/fairseq/fairseq/trainer.py", line 701, in train_step
raise e
File "/tmp/fairseq/fairseq/trainer.py", line 675, in train_step
ignore_grad=is_dummy_batch,
File "/tmp/fairseq/fairseq/tasks/fairseq_task.py", line 475, in train_step
loss, sample_size, logging_output = criterion(model, sample)
File "/opt/miniconda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
result = self.forward(*input, **kwargs)
File "/tmp/fairseq/fairseq/model_parallel/criterions/vocab_parallel_cross_entropy.py", line 42, in forward
net_output = model(**sample["net_input"])
File "/opt/miniconda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
result = self.forward(*input, **kwargs)
File "/opt/miniconda/lib/python3.7/site-packages/training_daemon/utils/hook.py", line 170, in wrapper
return func(*args, **kwargs)
File "/tmp/fairseq/fairseq/models/fairseq_model.py", line 496, in forward
return self.decoder(src_tokens, **kwargs)
File "/opt/miniconda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
result = self.forward(*input, **kwargs)
File "/tmp/fairseq/fairseq/models/transformer.py", line 825, in forward
alignment_heads=alignment_heads,
File "/tmp/fairseq/fairseq/models/transformer.py", line 847, in extract_features
alignment_heads,
File "/tmp/fairseq/fairseq/models/transformer.py", line 951, in extract_features_scriptable
need_head_weights=bool((idx == alignment_layer)),
File "/opt/miniconda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
result = self.forward(*input, **kwargs)
File "/tmp/fairseq/fairseq/modules/transformer_layer.py", line 353, in forward
attn_mask=self_attn_mask,
File "/opt/miniconda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
result = self.forward(*input, **kwargs)
File "/tmp/fairseq/fairseq/model_parallel/modules/multihead_attention.py", line 135, in forward
q = self.q_proj(query)
File "/opt/miniconda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
result = self.forward(*input, **kwargs)
File "/tmp/fairseq/fairseq/model_parallel/megatron/mpu/layers.py", line 243, in forward
output_parallel = F.linear(input_parallel, self.weight, self.bias)
File "/opt/miniconda/lib/python3.7/site-packages/torch/nn/functional.py", line 1678, in linear
output += bias
RuntimeError: CUDA error: invalid device function
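This error at a plain `F.linear` call usually means the CUDA kernels in the installed binaries (PyTorch or an apex extension) were not compiled for the GPU's compute capability. A pure-Python sketch of that compatibility check; `arch_supported` is a hypothetical helper, and in practice you would compare `torch.cuda.get_device_capability()` against `torch.cuda.get_arch_list()`:

```python
def arch_supported(device_capability, compiled_archs):
    """Return True if a binary build covers a GPU.

    device_capability: (major, minor) tuple, e.g. (7, 0) for a V100.
    compiled_archs: list like ["sm_37", "sm_60", "sm_70"], as reported
    by torch.cuda.get_arch_list() for the installed PyTorch build.
    """
    major, minor = device_capability
    return f"sm_{major}{minor}" in compiled_archs

# A build compiled only up to sm_70 cannot serve an sm_80 (A100) GPU;
# the mismatch surfaces at kernel launch as "invalid device function".
print(arch_supported((7, 0), ["sm_37", "sm_60", "sm_70"]))  # True
print(arch_supported((8, 0), ["sm_37", "sm_60", "sm_70"]))  # False
```

If the GPU's `sm_XY` is missing from the compiled list, the fix is a PyTorch/apex build that targets that architecture, not a code change.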
This issue has been automatically marked as stale. If this issue is still affecting you, please leave any comment (for example, "bump"), and we'll keep it open. We are sorry that we haven't been able to prioritize it yet. If you have any new additional information, please include it with your comment!
Closing this issue after a prolonged period of inactivity. If this issue is still present in the latest release, please create a new issue with up-to-date information. Thank you!
Training code:
PREFIX=/blob2/v-lijuwu/fairseq/examples/megatron_11b
What have you tried?
I followed https://github.com/pytorch/fairseq/tree/v0.10.2/examples/megatron_11b to set up the data, the model, the training code, and the environment. Since apex is required, I installed apex accordingly, with one modification: following NVIDIA/apex#323 (comment), I removed the version-checking code.
After installing apex, I tried to run training, reloading the pre-trained model, and the error shown above appeared.
I searched but didn't find a clear answer or solution.
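Removing apex's version check (per NVIDIA/apex#323) silences a real warning: apex compares the bare-metal nvcc CUDA version with the CUDA version the PyTorch binaries were built against, and building extensions across such a mismatch is a common cause of "invalid device function" at run time. A rough sketch of that comparison, assuming version strings like "10.1"; `cuda_versions_match` is a hypothetical helper, not apex's actual function name:

```python
def cuda_versions_match(nvcc_version, torch_cuda_version):
    """Compare major.minor of the bare-metal nvcc toolkit (nvcc --version)
    with PyTorch's bundled CUDA runtime (torch.version.cuda).

    e.g. cuda_versions_match("10.1", "10.1") -> True
    """
    nv = nvcc_version.split(".")[:2]
    tv = torch_cuda_version.split(".")[:2]
    return nv == tv

print(cuda_versions_match("10.1", "10.1"))  # True: safe to build extensions
print(cuda_versions_match("11.0", "10.2"))  # False: extensions built with this
                                            # nvcc may fail at kernel launch
```

When this check fails, rebuilding apex against a matching toolkit (or installing a PyTorch build matching the local nvcc) is usually safer than deleting the check.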
What's your environment?
How you installed fairseq (pip, source): pip install