Describe the bug
Hello, training XTTSv2 from Coqui TTS leads to strange training lags when using DDP on 6x RTX A6000 GPUs with 512 GB of RAM.
Here is the GPU load monitoring graph: purple is gpu0, green is gpu1 (all the remaining GPUs behave like gpu1).
With 4 GPUs the situation remains the same.
I think there is some kind of error in Trainer.
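To narrow down where the stall happens, this is roughly the kind of per-rank timing probe I have in mind; the model, dataset, and optimizer below are toy placeholders rather than the actual XTTS recipe objects, so treat it as a sketch only:

# Hypothetical per-rank timing probe: measures how long each step waits on the
# DataLoader versus how long the forward/backward (and, under DDP, the gradient
# all-reduce) takes. The model and dataset here are toy stand-ins, not XTTS code.
import time

import torch
import torch.distributed as dist
from torch.utils.data import DataLoader, TensorDataset


def timed_steps(model, loader, optimizer, device, n_steps=10):
    rank = dist.get_rank() if dist.is_initialized() else 0
    it = iter(loader)
    for step in range(n_steps):
        t0 = time.perf_counter()
        x, y = next(it)                        # time spent waiting on the data pipeline
        t1 = time.perf_counter()

        x, y = x.to(device), y.to(device)
        loss = torch.nn.functional.mse_loss(model(x), y)
        optimizer.zero_grad()
        loss.backward()                        # under DDP this also triggers the all-reduce
        optimizer.step()
        if device.type == "cuda":
            torch.cuda.synchronize(device)     # so the timing reflects the real GPU work
        t2 = time.perf_counter()

        print(f"rank {rank} step {step}: data {t1 - t0:.3f}s, compute {t2 - t1:.3f}s")


if __name__ == "__main__":
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    data = TensorDataset(torch.randn(512, 64), torch.randn(512, 1))
    model = torch.nn.Linear(64, 1).to(device)
    timed_steps(model, DataLoader(data, batch_size=32, num_workers=2),
                torch.optim.SGD(model.parameters(), lr=1e-3), device)

If the "data" time dominates on the lagging ranks, the problem would be in the input pipeline rather than in DDP communication itself.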
To Reproduce
python -m trainer.distribute --script recipes/ljspeech/xtts_v2/train_gpt_xtts.py --gpus 0,1,2,3,4,5
Expected behavior
No response
Logs
No response
Environment
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03              Driver Version: 535.54.03    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA RTX A6000               On  | 00000000:01:00.0 Off |                  Off |
| 46%   70C    P2             229W / 300W |  32382MiB / 49140MiB |     91%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA RTX A6000               On  | 00000000:25:00.0 Off |                  Off |
| 42%   68C    P2             246W / 300W |  27696MiB / 49140MiB |     77%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA RTX A6000               On  | 00000000:41:00.0 Off |                  Off |
| 38%   67C    P2             256W / 300W |  27640MiB / 49140MiB |     63%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA RTX A6000               On  | 00000000:81:00.0 Off |                  Off |
| 39%   67C    P2             245W / 300W |  27640MiB / 49140MiB |     67%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   4  NVIDIA RTX A6000               On  | 00000000:A1:00.0 Off |                  Off |
| 46%   70C    P2             239W / 300W |  27620MiB / 49140MiB |     66%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   5  NVIDIA RTX A6000               On  | 00000000:C2:00.0 Off |                  Off |
| 30%   31C    P8              17W / 300W |      3MiB / 49140MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                             |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A   2517964      C   ...onov/anaconda3/envs/xtts/bin/python    32374MiB |
|    1   N/A  N/A   2516039      C   python3                                   27688MiB |
|    2   N/A  N/A   2516040      C   python3                                   27632MiB |
|    3   N/A  N/A   2516041      C   python3                                   27632MiB |
|    4   N/A  N/A   2516042      C   python3                                   27612MiB |
+---------------------------------------------------------------------------------------+
Additional context
No response
I tried num_workers=0, num_workers>0, MP_THREADS_NUM, and so on; nothing helps. There is plenty of RAM and shared memory.
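For reference, the knobs in question on a plain PyTorch DataLoader look like this; this is a generic sketch with a dummy dataset, not the recipe's actual data-loading code:

# Generic PyTorch DataLoader settings that affect inter-step stalls.
# The dataset below is a dummy placeholder; the XTTS recipe builds its own loaders.
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(1024, 80), torch.randn(1024, 1))

loader = DataLoader(
    dataset,
    batch_size=32,
    num_workers=4,            # > 0 moves sample loading off the training process
    persistent_workers=True,  # keep workers alive between epochs (avoids respawn stalls)
    pin_memory=True,          # enables fast, asynchronous host-to-GPU copies
    prefetch_factor=2,        # batches prefetched per worker
    drop_last=True,
)

# Consume one batch just to show the loop shape.
for x, y in loader:
    if torch.cuda.is_available():
        x = x.cuda(non_blocking=True)
    break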