Slow Training Speed #21
Hi @s13kman, thanks for your interest! Since you have access to multiple GPUs, I would suggest using multi-GPU training to speed things up. Multi-GPU training is actually supported through Horovod (https://github.com/horovod/horovod).
Given that the dataset is relatively big, I usually train the models for only a single epoch.
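In case it helps, here is a minimal sketch of how Horovod data-parallel training is usually wired into a PyTorch-style loop. This is not the training script from this repo; the model, dataset, batch size, and learning rate are toy placeholders, just to show the Horovod pieces (pinning each process to a GPU, sharding data with a DistributedSampler, wrapping the optimizer, and broadcasting the initial state).

```python
# Minimal Horovod data-parallel sketch (PyTorch). Not the repo's actual training
# code; the model and data below are toy placeholders to show the wiring only.
import torch
import torch.nn as nn
import horovod.torch as hvd

hvd.init()                                    # one process per GPU
torch.cuda.set_device(hvd.local_rank())       # pin this process to its own GPU

model = nn.Linear(512, 512).cuda()            # placeholder for the real model
data = torch.utils.data.TensorDataset(
    torch.randn(10_000, 512), torch.randn(10_000, 512))  # placeholder dataset

# Shard the data so every worker sees a different slice each epoch.
sampler = torch.utils.data.distributed.DistributedSampler(
    data, num_replicas=hvd.size(), rank=hvd.rank())
loader = torch.utils.data.DataLoader(data, batch_size=64, sampler=sampler)

# Wrap the optimizer so gradients are all-reduced across workers every step.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4 * hvd.size())
optimizer = hvd.DistributedOptimizer(
    optimizer, named_parameters=model.named_parameters())

# Start every worker from identical weights and optimizer state.
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)

for epoch in range(1):                        # single epoch, as mentioned above
    sampler.set_epoch(epoch)                  # reshuffle shards each epoch
    for x, y in loader:
        optimizer.zero_grad()
        loss = nn.functional.mse_loss(model(x.cuda()), y.cuda())  # placeholder loss
        loss.backward()
        optimizer.step()
```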
How long did it take you to train a single epoch?
Hi @CrossLee1, sorry for the late reply. It takes around 6 hours, but I train on 64 A100 GPUs (data parallel with Horovod) to speed up the process. I am quite sure there are a lot of things to optimize here in terms of hardware usage; I was mostly going for fast experiments (wall time) to figure out what works best (in terms of architecture, data augmentation, losses, etc.) rather than optimizing training speed.
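For reference on launching, a Horovod data-parallel job like this is typically started with `horovodrun`, one process per GPU, e.g. `horovodrun -np 8 python train.py` on a single 8-GPU machine, or `horovodrun -np 64 -H host1:8,host2:8,... python train.py` across several hosts (the hostnames and script name here are placeholders, not the actual setup described above).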
Hi,
First of all, great work! I really loved it. To replicate the results, I tried training on the Conceptual 12M dataset with the same depth and dims as the pretrained models, but training was too slow: even after 4 days it was still going through the first (0th) epoch. I'm training on an NVIDIA Quadro RTX A6000, which I don't think is that slow.
Any suggestions to improve training speed? I have multi-GPU access, but it seems that isn't supported right now.
Thanks!