Getting following error on google colab #667

Open
1 of 2 tasks
smktech9 opened this issue Jan 13, 2025 · 1 comment
@smktech9

System Info

Google Colab

Information

  • The official example scripts
  • My own modified scripts and tasks

Reproduction

Run text-to-video fine-tuning in Google Colab.

Expected behavior

The following values were not passed to accelerate launch and had defaults used instead:
--num_processes was set to a value of 1
--num_machines was set to a value of 1
--mixed_precision was set to a value of 'no'
--dynamo_backend was set to a value of 'no'
To avoid this warning pass in values for each of the problematic parameters or run accelerate config.
2025-01-13 21:54:00.044139: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2025-01-13 21:54:00.282656: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8454] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2025-01-13 21:54:00.353328: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1452] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2025-01-13 21:54:00.753650: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2025-01-13 21:54:02.855868: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
WARNING:root:All CogVideoX models except cogvideox-2b were trained with bfloat16. Using fp16 precision may lead to training instability.
INFO:trainer:Initialized Trainer
INFO:trainer:Accelerator state:
Distributed environment: NO
Num processes: 1
Process index: 0
Local process index: 0
Device: cuda

Mixed precision type: fp16

INFO:trainer:Initializing models
You set add_prefix_space. The tokenizer needs to be converted from the slow tokenizers
Downloading shards: 100% 4/4 [00:00<00:00, 1194.36it/s]
Loading checkpoint shards: 100% 4/4 [00:01<00:00, 2.21it/s]
Fetching 3 files: 100% 3/3 [00:00<00:00, 6250.83it/s]
{'ofs_embed_dim'} was not found in config. Values will be initialized to default values.
Traceback (most recent call last):
  File "/usr/local/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/accelerate_cli.py", line 48, in main
    args.func(args)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 1168, in launch_command
    simple_launcher(args)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 763, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/usr/bin/python3', '/content/drive/MyDrive/train/CogVideo/finetune/train.py', '--model_path', 'THUDM/CogVideoX1.5-5B', '--model_name', 'cogvideox1.5-t2v', '--model_type', 't2v', '--training_type', 'lora', '--output_dir', '/content/drive/MyDrive/train', '--report_to', 'tensorboard', '--data_root', '/content/drive/MyDrive/train/Disney-VideoGeneration-Dataset', '--caption_column', 'prompt.txt', '--video_column', 'videos.txt', '--train_resolution', '49x480x720', '--train_epochs', '10', '--batch_size', '1', '--gradient_accumulation_steps', '1', '--mixed_precision', 'fp16', '--seed', '42', '--num_workers', '2', '--pin_memory', 'False', '--nccl_timeout', '3600', '--checkpointing_steps', '200', '--checkpointing_limit', '10']' died with <Signals.SIGKILL: 9>.
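
A note on reading the last line: `died with <Signals.SIGKILL: 9>` means the `train.py` child process was terminated by signal 9 from outside, not by a Python exception, which is why the traceback only shows the `accelerate` launcher. On Linux, SIGKILL is often sent by the kernel's out-of-memory killer, but this log alone does not confirm that cause. The snippet below is a minimal sketch of how a launcher can tell a signal kill from a normal non-zero exit on POSIX. The helper name `describe_returncode` and the self-killing child are illustrative assumptions and are not part of CogVideo or accelerate.

```python
import signal
import subprocess
import sys


def describe_returncode(returncode: int) -> str:
    """Describe a subprocess return code (hypothetical helper, POSIX semantics)."""
    if returncode < 0:
        # On POSIX, a return code of -N means the child was terminated by signal N.
        try:
            name = signal.Signals(-returncode).name
        except ValueError:
            name = "unknown signal"
        return f"killed by {name} ({-returncode})"
    return f"exited with status {returncode}"


# Stand-in for train.py: a child that sends SIGKILL to itself (assumption for the demo).
child = [
    sys.executable,
    "-c",
    "import os, signal; os.kill(os.getpid(), signal.SIGKILL)",
]
result = subprocess.run(child)
print(describe_returncode(result.returncode))
```

If the real run reports a signal kill like this, checking the runtime's host RAM and GPU memory usage while the model loads is a reasonable next step.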

@OleehyO
Collaborator

OleehyO commented Jan 14, 2025

We have not tested our code in the Colab environment, so there may be compatibility issues. We recommend fine-tuning in a standard runtime environment instead.

@OleehyO OleehyO self-assigned this Jan 14, 2025