
ET5 - Training error while running finetune-t5-ynat.py #29

Closed
seuly1203 opened this issue Dec 1, 2021 · 2 comments

Comments

@seuly1203
Contributor

Cell code executed:

!CUDA_VISIBLE_DEVICES=0 python seq2seq_finetune_t5_ynat.py \
--do_train --do_eval --predict_with_generate \
--model_name_or_path /content/drive/MyDrive/cakd3_3차프로젝트_2조/Datasets/ETRI_ET5 \
--data_dir /content/drive/MyDrive/ET5_test/ynat-v1.1 \
--output_dir /content/drive/MyDrive/ET5_test/output \
--overwrite_output_dir \
--save_steps 100000 \
--per_device_train_batch_size 16 \
--gradient_accumulation_steps 1 \
--num_train_epochs 1.0

Error message:

12/01/2021 05:14:34 - WARNING - __main__ -   Process rank: -1, device: cuda:0, n_gpu: 1, distributed training: False, 16-bits training: False
12/01/2021 05:14:34 - INFO - __main__ -   Training/evaluation parameters Seq2SeqTrainingArguments(output_dir='/content/drive/MyDrive/ET5_test/output', overwrite_output_dir=True, do_train=True, do_eval=True, do_predict=False, evaluation_strategy=<EvaluationStrategy.NO: 'no'>, prediction_loss_only=False, per_device_train_batch_size=16, per_device_eval_batch_size=8, per_gpu_train_batch_size=None, per_gpu_eval_batch_size=None, gradient_accumulation_steps=1, eval_accumulation_steps=None, learning_rate=5e-05, weight_decay=0.0, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, max_grad_norm=1.0, num_train_epochs=1.0, max_steps=-1, lr_scheduler_type=<SchedulerType.LINEAR: 'linear'>, warmup_steps=0, logging_dir='runs/Dec01_05-14-34_eb535e39a1b5', logging_first_step=False, logging_steps=500, save_steps=100000, save_total_limit=None, no_cuda=False, seed=42, fp16=False, fp16_opt_level='O1', fp16_backend='auto', local_rank=-1, tpu_num_cores=None, tpu_metrics_debug=False, debug=False, dataloader_drop_last=False, eval_steps=500, dataloader_num_workers=0, past_index=-1, run_name='/content/drive/MyDrive/ET5_test/output', disable_tqdm=False, remove_unused_columns=True, label_names=None, load_best_model_at_end=False, metric_for_best_model=None, greater_is_better=None, ignore_data_skip=False, sharded_ddp=False, deepspeed=None, label_smoothing_factor=0.0, adafactor=False, group_by_length=False, report_to=['tensorboard'], ddp_find_unused_parameters=None, dataloader_pin_memory=True, label_smoothing=0.0, sortish_sampler=False, predict_with_generate=True, encoder_layerdrop=None, decoder_layerdrop=None, dropout=None, attention_dropout=None, lr_scheduler='linear')
[INFO|configuration_utils.py:447] 2021-12-01 05:14:34,267 >> loading configuration file /content/drive/MyDrive/cakd3_3차프로젝트_2조/Datasets/ETRI_ET5/config.json
[INFO|configuration_utils.py:485] 2021-12-01 05:14:34,267 >> Model config T5Config {
  "architectures": [
    "T5ForConditionalGeneration"
  ],
  "d_ff": 3072,
  "d_kv": 64,
  "d_model": 768,
  "decoder_start_token_id": 0,
  "dropout_rate": 0.1,
  "eos_token_id": 1,
  "feed_forward_proj": "gated-gelu",
  "initializer_factor": 1.0,
  "is_encoder_decoder": true,
  "layer_norm_epsilon": 1e-06,
  "model_type": "t5",
  "num_decoder_layers": 12,
  "num_heads": 12,
  "num_layers": 12,
  "pad_token_id": 0,
  "relative_attention_num_buckets": 32,
  "tie_word_embeddings": false,
  "tokenizer_class": "T5Tokenizer",
  "transformers_version": "4.3.2",
  "use_cache": true,
  "vocab_size": 45100
}

[INFO|configuration_utils.py:447] 2021-12-01 05:14:34,269 >> loading configuration file /content/drive/MyDrive/cakd3_3차프로젝트_2조/Datasets/ETRI_ET5/config.json
[INFO|configuration_utils.py:485] 2021-12-01 05:14:34,269 >> Model config T5Config {
  "architectures": [
    "T5ForConditionalGeneration"
  ],
  "d_ff": 3072,
  "d_kv": 64,
  "d_model": 768,
  "decoder_start_token_id": 0,
  "dropout_rate": 0.1,
  "eos_token_id": 1,
  "feed_forward_proj": "gated-gelu",
  "initializer_factor": 1.0,
  "is_encoder_decoder": true,
  "layer_norm_epsilon": 1e-06,
  "model_type": "t5",
  "num_decoder_layers": 12,
  "num_heads": 12,
  "num_layers": 12,
  "pad_token_id": 0,
  "relative_attention_num_buckets": 32,
  "tie_word_embeddings": false,
  "tokenizer_class": "T5Tokenizer",
  "transformers_version": "4.3.2",
  "use_cache": true,
  "vocab_size": 45100
}

[INFO|tokenization_utils_base.py:1688] 2021-12-01 05:14:34,269 >> Model name '/content/drive/MyDrive/cakd3_3차프로젝트_2조/Datasets/ETRI_ET5' not found in model shortcut name list (t5-small, t5-base, t5-large, t5-3b, t5-11b). Assuming '/content/drive/MyDrive/cakd3_3차프로젝트_2조/Datasets/ETRI_ET5' is a path, a model identifier, or url to a directory containing tokenizer files.
[INFO|tokenization_utils_base.py:1721] 2021-12-01 05:14:34,271 >> Didn't find file /content/drive/MyDrive/cakd3_3차프로젝트_2조/Datasets/ETRI_ET5/tokenizer.json. We won't load it.
[INFO|tokenization_utils_base.py:1721] 2021-12-01 05:14:34,271 >> Didn't find file /content/drive/MyDrive/cakd3_3차프로젝트_2조/Datasets/ETRI_ET5/added_tokens.json. We won't load it.
[INFO|tokenization_utils_base.py:1721] 2021-12-01 05:14:34,272 >> Didn't find file /content/drive/MyDrive/cakd3_3차프로젝트_2조/Datasets/ETRI_ET5/special_tokens_map.json. We won't load it.
[INFO|tokenization_utils_base.py:1784] 2021-12-01 05:14:34,273 >> loading file /content/drive/MyDrive/cakd3_3차프로젝트_2조/Datasets/ETRI_ET5/spiece.model
[INFO|tokenization_utils_base.py:1784] 2021-12-01 05:14:34,273 >> loading file None
[INFO|tokenization_utils_base.py:1784] 2021-12-01 05:14:34,273 >> loading file None
[INFO|tokenization_utils_base.py:1784] 2021-12-01 05:14:34,273 >> loading file None
[INFO|tokenization_utils_base.py:1784] 2021-12-01 05:14:34,273 >> loading file /content/drive/MyDrive/cakd3_3차프로젝트_2조/Datasets/ETRI_ET5/tokenizer_config.json
[INFO|modeling_utils.py:1025] 2021-12-01 05:14:34,456 >> loading weights file /content/drive/MyDrive/cakd3_3차프로젝트_2조/Datasets/ETRI_ET5/pytorch_model.bin
[INFO|modeling_utils.py:1143] 2021-12-01 05:14:42,207 >> All model checkpoint weights were used when initializing T5ForConditionalGeneration.

[INFO|modeling_utils.py:1152] 2021-12-01 05:14:42,207 >> All the weights of T5ForConditionalGeneration were initialized from the model checkpoint at /content/drive/MyDrive/cakd3_3차프로젝트_2조/Datasets/ETRI_ET5.
If your task is similar to the task the model of the checkpoint was trained on, you can already use T5ForConditionalGeneration for predictions without further training.
#####	 Reading an input file ...	 /content/drive/MyDrive/ET5_test/ynat-v1.1/train.json
#####	 Create examples ... : 45678it [00:00, 666217.22it/s]
#####	 Get source and target texts ... : 100% 45677/45677 [00:00<00:00, 1432044.61it/s]
#####	 Reading an input file ...	 /content/drive/MyDrive/ET5_test/ynat-v1.1/val.json
#####	 Create examples ... : 9107it [00:00, 694942.72it/s]
#####	 Get source and target texts ... : 100% 9106/9106 [00:00<00:00, 1586826.72it/s]
12/01/2021 05:14:48 - INFO - __main__ -   *** Train ***
/usr/local/lib/python3.7/dist-packages/transformers/trainer.py:705: FutureWarning: `model_path` is deprecated and will be removed in a future version. Use `resume_from_checkpoint` instead.
  FutureWarning,
[INFO|trainer.py:724] 2021-12-01 05:14:48,220 >> Loading model from /content/drive/MyDrive/cakd3_3차프로젝트_2조/Datasets/ETRI_ET5).
[INFO|configuration_utils.py:447] 2021-12-01 05:14:48,222 >> loading configuration file /content/drive/MyDrive/cakd3_3차프로젝트_2조/Datasets/ETRI_ET5/config.json
[INFO|configuration_utils.py:485] 2021-12-01 05:14:48,222 >> Model config T5Config {
  "architectures": [
    "T5ForConditionalGeneration"
  ],
  "d_ff": 3072,
  "d_kv": 64,
  "d_model": 768,
  "decoder_start_token_id": 0,
  "dropout_rate": 0.1,
  "eos_token_id": 1,
  "feed_forward_proj": "gated-gelu",
  "initializer_factor": 1.0,
  "is_encoder_decoder": true,
  "layer_norm_epsilon": 1e-06,
  "model_type": "t5",
  "num_decoder_layers": 12,
  "num_heads": 12,
  "num_layers": 12,
  "pad_token_id": 0,
  "relative_attention_num_buckets": 32,
  "tie_word_embeddings": false,
  "tokenizer_class": "T5Tokenizer",
  "transformers_version": "4.3.2",
  "use_cache": true,
  "vocab_size": 45100
}

[INFO|modeling_utils.py:1025] 2021-12-01 05:14:48,224 >> loading weights file /content/drive/MyDrive/cakd3_3차프로젝트_2조/Datasets/ETRI_ET5/pytorch_model.bin
[INFO|modeling_utils.py:1143] 2021-12-01 05:14:55,663 >> All model checkpoint weights were used when initializing T5ForConditionalGeneration.

[INFO|modeling_utils.py:1152] 2021-12-01 05:14:55,663 >> All the weights of T5ForConditionalGeneration were initialized from the model checkpoint at /content/drive/MyDrive/cakd3_3차프로젝트_2조/Datasets/ETRI_ET5.
If your task is similar to the task the model of the checkpoint was trained on, you can already use T5ForConditionalGeneration for predictions without further training.
[INFO|trainer.py:837] 2021-12-01 05:14:56,744 >> ***** Running training *****
[INFO|trainer.py:838] 2021-12-01 05:14:56,744 >>   Num examples = 45676
[INFO|trainer.py:839] 2021-12-01 05:14:56,744 >>   Num Epochs = 1
[INFO|trainer.py:840] 2021-12-01 05:14:56,744 >>   Instantaneous batch size per device = 16
[INFO|trainer.py:841] 2021-12-01 05:14:56,744 >>   Total train batch size (w. parallel, distributed & accumulation) = 16
[INFO|trainer.py:842] 2021-12-01 05:14:56,744 >>   Gradient Accumulation steps = 1
[INFO|trainer.py:843] 2021-12-01 05:14:56,744 >>   Total optimization steps = 2855
  0% 0/2855 [00:00<?, ?it/s]Traceback (most recent call last):
  File "seq2seq_finetune_t5_ynat.py", line 379, in <module>
    main()
  File "seq2seq_finetune_t5_ynat.py", line 316, in main
    model_path=model_args.model_name_or_path if os.path.isdir(model_args.model_name_or_path) else None
  File "/usr/local/lib/python3.7/dist-packages/transformers/trainer.py", line 940, in train
    tr_loss += self.training_step(model, inputs)
  File "/usr/local/lib/python3.7/dist-packages/transformers/trainer.py", line 1320, in training_step
    loss.backward()
  File "/usr/local/lib/python3.7/dist-packages/torch/tensor.py", line 245, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/usr/local/lib/python3.7/dist-packages/torch/autograd/__init__.py", line 147, in backward
    allow_unreachable=True, accumulate_grad=True)  # allow_unreachable flag
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasSgemmStridedBatched( handle, opa, opb, m, n, k, &alpha, a, lda, stridea, b, ldb, strideb, &beta, c, ldc, stridec, num_batches)`
  0% 0/2855 [00:00<?, ?it/s]
@seuly1203
Contributor Author

Similar issue:
allenai/allennlp#5064 (comment)

@seuly1203
Contributor Author

Cause of the CUDA error: a mismatch between the installed CUDA version and the library versions in use, or malformed input data.
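
Below is a minimal sanity-check sketch for both suspected causes (the model path is taken from the log above; the sample sentence is a made-up placeholder). It prints the installed PyTorch/CUDA versions and verifies that every token id produced by the tokenizer is smaller than the model's vocab_size (45100 in the config), since an out-of-range embedding index on the GPU often surfaces as exactly this kind of cuBLAS failure during backward().

# Hypothetical sanity check -- not part of the original fine-tuning script.
import torch
from transformers import T5Config, T5Tokenizer

MODEL_DIR = "/content/drive/MyDrive/cakd3_3차프로젝트_2조/Datasets/ETRI_ET5"

# 1) Version mismatch: the installed torch build must match the CUDA runtime on the machine.
print("torch:", torch.__version__, "| CUDA build:", torch.version.cuda,
      "| GPU:", torch.cuda.get_device_name(0) if torch.cuda.is_available() else "none")

# 2) Input data: every token id must be < vocab_size, otherwise the embedding
#    lookup fails on the GPU and is reported as a cryptic cuBLAS error.
config = T5Config.from_pretrained(MODEL_DIR)
tokenizer = T5Tokenizer.from_pretrained(MODEL_DIR)

sample = "ynat: 분류할 뉴스 제목 예시"  # placeholder Korean headline, not from the dataset
ids = tokenizer(sample, return_tensors="pt").input_ids
print("max token id:", ids.max().item(), "/ vocab_size:", config.vocab_size)
assert ids.max().item() < config.vocab_size, "token id out of range for the embedding table"

Rerunning the failing cell with CUDA_LAUNCH_BLOCKING=1 (or once on CPU) usually turns the asynchronous cuBLAS message into a plain indexing error, which makes it easier to tell the two causes apart.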
