Pinging @sayakpaul for the training scripts. I don't think `args.lr_warmup_steps * args.gradient_accumulation_steps` is correct: with gradient accumulation you already perform fewer gradient updates, so stretching out the time it takes to reach the true/peak LR does not make sense. I think `lr_warmup_steps * num_processes` is correct, so that each rank gets a roughly equal number of learning steps going from the low LR up to the true/peak LR.
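To make that concrete, here is a small self-contained sketch (plain Python, no diffusers/accelerate; the step counts and warmup length are made-up example numbers) counting how many scheduler steps a warmup actually gets under gradient accumulation:

```python
# Hypothetical numbers for illustration only.
micro_batches = 1000  # dataloader steps in one epoch
grad_accum = 4        # gradient accumulation steps
warmup = 100          # intended warmup, measured in optimizer steps

# With accumulation, the optimizer (and the LR scheduler, if it is
# stepped alongside the optimizer) only steps once every
# `grad_accum` micro-batches:
optimizer_steps = micro_batches // grad_accum  # 250

# Multiplying the warmup by grad_accum stretches it to 400, which is
# longer than the 250 optimizer steps available -- the peak LR is
# never reached in this epoch:
stretched_warmup = warmup * grad_accum  # 400
print(optimizer_steps, stretched_warmup)
```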
For why we multiply the learning rate, there are many papers and recipes. For a quick read, you could look at the accelerate docs and the references linked there: https://huggingface.co/docs/accelerate/concept_guides/performance#learning-rates. There is also some older wisdom that scaling the learning rate by sqrt(X), where X is the increase in batch size, performs better; but in other sets of experiments, scaling linearly worked well too.
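As a minimal sketch of the two scaling rules (the base LR and batch sizes below are hypothetical, not values from any particular script):

```python
import math

base_lr = 1e-4   # hypothetical LR tuned for the base batch size
base_batch = 32
new_batch = 256  # e.g. 8 processes x 32 per-device batch

scale = new_batch / base_batch  # X = 8

# Linear scaling rule: multiply the LR by X.
linear_lr = base_lr * scale
# sqrt scaling rule: multiply the LR by sqrt(X).
sqrt_lr = base_lr * math.sqrt(scale)

print(linear_lr, sqrt_lr)
```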
For why we multiply the lr_scheduler warmup steps, there have been past discussions, so I'll reference them here. Feel free to drop a comment if you don't find an explanation sufficiently reasonable: this, this, this, and this.
In the example code from train_text_to_image_sdxl.py:
But in train_text_to_image.py:
Why is there such a difference in these two cases?
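The two snippets did not survive in this excerpt, but based on the discussion above, the difference boils down to which factor multiplies the warmup. As a runnable sketch (a simplified linear warmup, not the actual diffusers `get_scheduler` code, and assuming accelerate's behavior of stepping a prepared scheduler `num_processes` times per optimizer step):

```python
# Simplified linear warmup: LR ramps from 0 to peak_lr over `warmup`
# scheduler steps, then stays constant.
def lr_at(step, warmup, peak_lr=1e-4):
    if step < warmup:
        return peak_lr * step / warmup
    return peak_lr

warmup = 100       # intended warmup in optimizer steps
num_processes = 4  # hypothetical number of ranks

# If the prepared scheduler is stepped `num_processes` times per
# optimizer step, multiplying the warmup by num_processes keeps the
# warmup at 100 *optimizer* steps, as intended:
scheduler_steps_per_opt_step = num_processes
opt_steps_to_peak = (warmup * num_processes) // scheduler_steps_per_opt_step
print(opt_steps_to_peak)  # 100

# Halfway through the warmup the LR is half of peak:
print(lr_at(50, 100))  # 5e-05
```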