Pinging @sayakpaul for the training scripts. I don't think `args.lr_warmup_steps * args.gradient_accumulation_steps` is correct: with gradient accumulation you already perform fewer gradient updates, so stretching out the time it takes to reach the true/peak LR does not make sense. I think `lr_warmup_steps * num_processes` is correct, so that each rank gets a roughly equal number of learning steps going from the low LR up to the true/peak LR.
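To make that concrete, here is a small self-contained sketch (plain Python, no diffusers/accelerate; the step counts and warmup length are made-up example numbers) counting how many scheduler steps a warmup actually gets under gradient accumulation:

```python
# Hypothetical numbers for illustration only.
micro_batches = 1000  # dataloader steps in one epoch
grad_accum = 4        # gradient accumulation steps
warmup = 100          # intended warmup, measured in optimizer steps

# With accumulation, the optimizer (and the LR scheduler, if it is
# stepped alongside the optimizer) only steps once every
# `grad_accum` micro-batches:
optimizer_steps = micro_batches // grad_accum  # 250

# Multiplying the warmup by grad_accum stretches it to 400, which is
# longer than the 250 optimizer steps available -- the peak LR is
# never reached in this epoch:
stretched_warmup = warmup * grad_accum  # 400
print(optimizer_steps, stretched_warmup)
```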
For why we multiply the learning rate, there are many papers and recipes. For a quick read, you could look at the accelerate docs and the references linked there: https://huggingface.co/docs/accelerate/concept_guides/performance#learning-rates. There is also some older wisdom that scaling the learning rate by sqrt(X), where X is the increase in batch size, performs better; but in other sets of experiments, scaling linearly worked well too.
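As a minimal sketch of the two scaling rules (the base LR and batch sizes below are hypothetical, not values from any particular script):

```python
import math

base_lr = 1e-4   # hypothetical LR tuned for the base batch size
base_batch = 32
new_batch = 256  # e.g. 8 processes x 32 per-device batch

scale = new_batch / base_batch  # X = 8

# Linear scaling rule: multiply the LR by X.
linear_lr = base_lr * scale
# sqrt scaling rule: multiply the LR by sqrt(X).
sqrt_lr = base_lr * math.sqrt(scale)

print(linear_lr, sqrt_lr)
```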
For why we multiply the lr_scheduler warmup steps, there have been past discussions, so I'll reference them here. Feel free to drop a comment if you don't find an explanation sufficiently reasonable: this, this, this, and this.
In the example code from train_text_to_image_sdxl.py:
But in train_text_to_image.py:
Why is there such a difference in these two cases?
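The two snippets did not survive in this excerpt, but based on the discussion above, the difference boils down to which factor multiplies the warmup. As a runnable sketch (a simplified linear warmup, not the actual diffusers `get_scheduler` code, and assuming accelerate's behavior of stepping a prepared scheduler `num_processes` times per optimizer step):

```python
# Simplified linear warmup: LR ramps from 0 to peak_lr over `warmup`
# scheduler steps, then stays constant.
def lr_at(step, warmup, peak_lr=1e-4):
    if step < warmup:
        return peak_lr * step / warmup
    return peak_lr

warmup = 100       # intended warmup in optimizer steps
num_processes = 4  # hypothetical number of ranks

# If the prepared scheduler is stepped `num_processes` times per
# optimizer step, multiplying the warmup by num_processes keeps the
# warmup at 100 *optimizer* steps, as intended:
scheduler_steps_per_opt_step = num_processes
opt_steps_to_peak = (warmup * num_processes) // scheduler_steps_per_opt_step
print(opt_steps_to_peak)  # 100

# Halfway through the warmup the LR is half of peak:
print(lr_at(50, 100))  # 5e-05
```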