
ET5 - Training error while running finetune-t5-ynat.py #29

Closed
seuly1203 opened this issue Dec 1, 2021 · 2 comments

Comments

@seuly1203
Contributor

Cell code executed:

!CUDA_VISIBLE_DEVICES=0 python seq2seq_finetune_t5_ynat.py \
--do_train --do_eval --predict_with_generate \
--model_name_or_path /content/drive/MyDrive/cakd3_3차프로젝트_2조/Datasets/ETRI_ET5 \
--data_dir /content/drive/MyDrive/ET5_test/ynat-v1.1 \
--output_dir /content/drive/MyDrive/ET5_test/output \
--overwrite_output_dir \
--save_steps 100000 \
--per_device_train_batch_size 16 \
--gradient_accumulation_steps 1 \
--num_train_epochs 1.0

Error message:

12/01/2021 05:14:34 - WARNING - __main__ -   Process rank: -1, device: cuda:0, n_gpu: 1, distributed training: False, 16-bits training: False
12/01/2021 05:14:34 - INFO - __main__ -   Training/evaluation parameters Seq2SeqTrainingArguments(output_dir='/content/drive/MyDrive/ET5_test/output', overwrite_output_dir=True, do_train=True, do_eval=True, do_predict=False, evaluation_strategy=<EvaluationStrategy.NO: 'no'>, prediction_loss_only=False, per_device_train_batch_size=16, per_device_eval_batch_size=8, per_gpu_train_batch_size=None, per_gpu_eval_batch_size=None, gradient_accumulation_steps=1, eval_accumulation_steps=None, learning_rate=5e-05, weight_decay=0.0, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, max_grad_norm=1.0, num_train_epochs=1.0, max_steps=-1, lr_scheduler_type=<SchedulerType.LINEAR: 'linear'>, warmup_steps=0, logging_dir='runs/Dec01_05-14-34_eb535e39a1b5', logging_first_step=False, logging_steps=500, save_steps=100000, save_total_limit=None, no_cuda=False, seed=42, fp16=False, fp16_opt_level='O1', fp16_backend='auto', local_rank=-1, tpu_num_cores=None, tpu_metrics_debug=False, debug=False, dataloader_drop_last=False, eval_steps=500, dataloader_num_workers=0, past_index=-1, run_name='/content/drive/MyDrive/ET5_test/output', disable_tqdm=False, remove_unused_columns=True, label_names=None, load_best_model_at_end=False, metric_for_best_model=None, greater_is_better=None, ignore_data_skip=False, sharded_ddp=False, deepspeed=None, label_smoothing_factor=0.0, adafactor=False, group_by_length=False, report_to=['tensorboard'], ddp_find_unused_parameters=None, dataloader_pin_memory=True, label_smoothing=0.0, sortish_sampler=False, predict_with_generate=True, encoder_layerdrop=None, decoder_layerdrop=None, dropout=None, attention_dropout=None, lr_scheduler='linear')
[INFO|configuration_utils.py:447] 2021-12-01 05:14:34,267 >> loading configuration file /content/drive/MyDrive/cakd3_3차프로젝트_2조/Datasets/ETRI_ET5/config.json
[INFO|configuration_utils.py:485] 2021-12-01 05:14:34,267 >> Model config T5Config {
  "architectures": [
    "T5ForConditionalGeneration"
  ],
  "d_ff": 3072,
  "d_kv": 64,
  "d_model": 768,
  "decoder_start_token_id": 0,
  "dropout_rate": 0.1,
  "eos_token_id": 1,
  "feed_forward_proj": "gated-gelu",
  "initializer_factor": 1.0,
  "is_encoder_decoder": true,
  "layer_norm_epsilon": 1e-06,
  "model_type": "t5",
  "num_decoder_layers": 12,
  "num_heads": 12,
  "num_layers": 12,
  "pad_token_id": 0,
  "relative_attention_num_buckets": 32,
  "tie_word_embeddings": false,
  "tokenizer_class": "T5Tokenizer",
  "transformers_version": "4.3.2",
  "use_cache": true,
  "vocab_size": 45100
}

[INFO|configuration_utils.py:447] 2021-12-01 05:14:34,269 >> loading configuration file /content/drive/MyDrive/cakd3_3차프로젝트_2조/Datasets/ETRI_ET5/config.json
[INFO|configuration_utils.py:485] 2021-12-01 05:14:34,269 >> Model config T5Config {
  "architectures": [
    "T5ForConditionalGeneration"
  ],
  "d_ff": 3072,
  "d_kv": 64,
  "d_model": 768,
  "decoder_start_token_id": 0,
  "dropout_rate": 0.1,
  "eos_token_id": 1,
  "feed_forward_proj": "gated-gelu",
  "initializer_factor": 1.0,
  "is_encoder_decoder": true,
  "layer_norm_epsilon": 1e-06,
  "model_type": "t5",
  "num_decoder_layers": 12,
  "num_heads": 12,
  "num_layers": 12,
  "pad_token_id": 0,
  "relative_attention_num_buckets": 32,
  "tie_word_embeddings": false,
  "tokenizer_class": "T5Tokenizer",
  "transformers_version": "4.3.2",
  "use_cache": true,
  "vocab_size": 45100
}

[INFO|tokenization_utils_base.py:1688] 2021-12-01 05:14:34,269 >> Model name '/content/drive/MyDrive/cakd3_3차프로젝트_2조/Datasets/ETRI_ET5' not found in model shortcut name list (t5-small, t5-base, t5-large, t5-3b, t5-11b). Assuming '/content/drive/MyDrive/cakd3_3차프로젝트_2조/Datasets/ETRI_ET5' is a path, a model identifier, or url to a directory containing tokenizer files.
[INFO|tokenization_utils_base.py:1721] 2021-12-01 05:14:34,271 >> Didn't find file /content/drive/MyDrive/cakd3_3차프로젝트_2조/Datasets/ETRI_ET5/tokenizer.json. We won't load it.
[INFO|tokenization_utils_base.py:1721] 2021-12-01 05:14:34,271 >> Didn't find file /content/drive/MyDrive/cakd3_3차프로젝트_2조/Datasets/ETRI_ET5/added_tokens.json. We won't load it.
[INFO|tokenization_utils_base.py:1721] 2021-12-01 05:14:34,272 >> Didn't find file /content/drive/MyDrive/cakd3_3차프로젝트_2조/Datasets/ETRI_ET5/special_tokens_map.json. We won't load it.
[INFO|tokenization_utils_base.py:1784] 2021-12-01 05:14:34,273 >> loading file /content/drive/MyDrive/cakd3_3차프로젝트_2조/Datasets/ETRI_ET5/spiece.model
[INFO|tokenization_utils_base.py:1784] 2021-12-01 05:14:34,273 >> loading file None
[INFO|tokenization_utils_base.py:1784] 2021-12-01 05:14:34,273 >> loading file None
[INFO|tokenization_utils_base.py:1784] 2021-12-01 05:14:34,273 >> loading file None
[INFO|tokenization_utils_base.py:1784] 2021-12-01 05:14:34,273 >> loading file /content/drive/MyDrive/cakd3_3차프로젝트_2조/Datasets/ETRI_ET5/tokenizer_config.json
[INFO|modeling_utils.py:1025] 2021-12-01 05:14:34,456 >> loading weights file /content/drive/MyDrive/cakd3_3차프로젝트_2조/Datasets/ETRI_ET5/pytorch_model.bin
[INFO|modeling_utils.py:1143] 2021-12-01 05:14:42,207 >> All model checkpoint weights were used when initializing T5ForConditionalGeneration.

[INFO|modeling_utils.py:1152] 2021-12-01 05:14:42,207 >> All the weights of T5ForConditionalGeneration were initialized from the model checkpoint at /content/drive/MyDrive/cakd3_3차프로젝트_2조/Datasets/ETRI_ET5.
If your task is similar to the task the model of the checkpoint was trained on, you can already use T5ForConditionalGeneration for predictions without further training.
#####	 Reading an input file ...	 /content/drive/MyDrive/ET5_test/ynat-v1.1/train.json
#####	 Create examples ... : 45678it [00:00, 666217.22it/s]
#####	 Get source and target texts ... : 100% 45677/45677 [00:00<00:00, 1432044.61it/s]
#####	 Reading an input file ...	 /content/drive/MyDrive/ET5_test/ynat-v1.1/val.json
#####	 Create examples ... : 9107it [00:00, 694942.72it/s]
#####	 Get source and target texts ... : 100% 9106/9106 [00:00<00:00, 1586826.72it/s]
12/01/2021 05:14:48 - INFO - __main__ -   *** Train ***
/usr/local/lib/python3.7/dist-packages/transformers/trainer.py:705: FutureWarning: `model_path` is deprecated and will be removed in a future version. Use `resume_from_checkpoint` instead.
  FutureWarning,
[INFO|trainer.py:724] 2021-12-01 05:14:48,220 >> Loading model from /content/drive/MyDrive/cakd3_3차프로젝트_2조/Datasets/ETRI_ET5).
[INFO|configuration_utils.py:447] 2021-12-01 05:14:48,222 >> loading configuration file /content/drive/MyDrive/cakd3_3차프로젝트_2조/Datasets/ETRI_ET5/config.json
[INFO|configuration_utils.py:485] 2021-12-01 05:14:48,222 >> Model config T5Config {
  "architectures": [
    "T5ForConditionalGeneration"
  ],
  "d_ff": 3072,
  "d_kv": 64,
  "d_model": 768,
  "decoder_start_token_id": 0,
  "dropout_rate": 0.1,
  "eos_token_id": 1,
  "feed_forward_proj": "gated-gelu",
  "initializer_factor": 1.0,
  "is_encoder_decoder": true,
  "layer_norm_epsilon": 1e-06,
  "model_type": "t5",
  "num_decoder_layers": 12,
  "num_heads": 12,
  "num_layers": 12,
  "pad_token_id": 0,
  "relative_attention_num_buckets": 32,
  "tie_word_embeddings": false,
  "tokenizer_class": "T5Tokenizer",
  "transformers_version": "4.3.2",
  "use_cache": true,
  "vocab_size": 45100
}

[INFO|modeling_utils.py:1025] 2021-12-01 05:14:48,224 >> loading weights file /content/drive/MyDrive/cakd3_3차프로젝트_2조/Datasets/ETRI_ET5/pytorch_model.bin
[INFO|modeling_utils.py:1143] 2021-12-01 05:14:55,663 >> All model checkpoint weights were used when initializing T5ForConditionalGeneration.

[INFO|modeling_utils.py:1152] 2021-12-01 05:14:55,663 >> All the weights of T5ForConditionalGeneration were initialized from the model checkpoint at /content/drive/MyDrive/cakd3_3차프로젝트_2조/Datasets/ETRI_ET5.
If your task is similar to the task the model of the checkpoint was trained on, you can already use T5ForConditionalGeneration for predictions without further training.
[INFO|trainer.py:837] 2021-12-01 05:14:56,744 >> ***** Running training *****
[INFO|trainer.py:838] 2021-12-01 05:14:56,744 >>   Num examples = 45676
[INFO|trainer.py:839] 2021-12-01 05:14:56,744 >>   Num Epochs = 1
[INFO|trainer.py:840] 2021-12-01 05:14:56,744 >>   Instantaneous batch size per device = 16
[INFO|trainer.py:841] 2021-12-01 05:14:56,744 >>   Total train batch size (w. parallel, distributed & accumulation) = 16
[INFO|trainer.py:842] 2021-12-01 05:14:56,744 >>   Gradient Accumulation steps = 1
[INFO|trainer.py:843] 2021-12-01 05:14:56,744 >>   Total optimization steps = 2855
  0% 0/2855 [00:00<?, ?it/s]Traceback (most recent call last):
  File "seq2seq_finetune_t5_ynat.py", line 379, in <module>
    main()
  File "seq2seq_finetune_t5_ynat.py", line 316, in main
    model_path=model_args.model_name_or_path if os.path.isdir(model_args.model_name_or_path) else None
  File "/usr/local/lib/python3.7/dist-packages/transformers/trainer.py", line 940, in train
    tr_loss += self.training_step(model, inputs)
  File "/usr/local/lib/python3.7/dist-packages/transformers/trainer.py", line 1320, in training_step
    loss.backward()
  File "/usr/local/lib/python3.7/dist-packages/torch/tensor.py", line 245, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/usr/local/lib/python3.7/dist-packages/torch/autograd/__init__.py", line 147, in backward
    allow_unreachable=True, accumulate_grad=True)  # allow_unreachable flag
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasSgemmStridedBatched( handle, opa, opb, m, n, k, &alpha, a, lda, stridea, b, ldb, strideb, &beta, c, ldc, stridec, num_batches)`
  0% 0/2855 [00:00<?, ?it/s]
@seuly1203
Contributor Author

Similar issue:
allenai/allennlp#5064 (comment)

@seuly1203
Contributor Author

Cause of the CUDA error: a mismatch between the installed CUDA version and the library versions in use, or malformed input data.
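
Below is a minimal sanity-check sketch for both suspected causes (the model path is taken from the log above; the sample sentence is a made-up placeholder). It prints the installed PyTorch/CUDA versions and verifies that every token id produced by the tokenizer is smaller than the model's vocab_size (45100 in the config), since an out-of-range embedding index on the GPU often surfaces as exactly this kind of cuBLAS failure during backward().

# Hypothetical sanity check -- not part of the original fine-tuning script.
import torch
from transformers import T5Config, T5Tokenizer

MODEL_DIR = "/content/drive/MyDrive/cakd3_3차프로젝트_2조/Datasets/ETRI_ET5"

# 1) Version mismatch: the installed torch build must match the CUDA runtime on the machine.
print("torch:", torch.__version__, "| CUDA build:", torch.version.cuda,
      "| GPU:", torch.cuda.get_device_name(0) if torch.cuda.is_available() else "none")

# 2) Input data: every token id must be < vocab_size, otherwise the embedding
#    lookup fails on the GPU and is reported as a cryptic cuBLAS error.
config = T5Config.from_pretrained(MODEL_DIR)
tokenizer = T5Tokenizer.from_pretrained(MODEL_DIR)

sample = "ynat: 분류할 뉴스 제목 예시"  # placeholder Korean headline, not from the dataset
ids = tokenizer(sample, return_tensors="pt").input_ids
print("max token id:", ids.max().item(), "/ vocab_size:", config.vocab_size)
assert ids.max().item() < config.vocab_size, "token id out of range for the embedding table"

Rerunning the failing cell with CUDA_LAUNCH_BLOCKING=1 (or once on CPU) usually turns the asynchronous cuBLAS message into a plain indexing error, which makes it easier to tell the two causes apart.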
