[QUESTION] I used the zero-bubble commit 7ad9c81d for my experiments and found that the memory usage of the ZBV schedule exceeds that of the previous ZB-H1 schedule. What could be the issue? The specific configuration and results are shown in the image.
#36 · Open · lbk-sys opened this issue on Jul 18, 2024 · 4 comments
Hi, thanks for your interest in our work. Theoretically ZBV has the same activation memory as 1F1B and ZB-H1, but one difference is that ZBV also changes the placement of layers. One thing that might cause the difference is the lm-head and the input embedding: for 1F1B they sit on different stages, but for ZBV they are on the same stage.
A quick calculation:
For 1F1B with pp_stage=8, the peak memory is on rank 0: 8x the activation of 4 layers + the parameters of 4 layers + the parameters of the input embedding.
For ZBV, the peak memory is on rank 0: 8x the activation of 4 layers + the parameters of 4 layers + the parameters of the input embedding + the parameters of the lm-head.
I did a brief calculation: the parameter memory of the lm-head for LLaMA-7B is 16 (parameter + grad + optimizer-state factor) * h * vocabulary ≈ 2 GB, close to the difference between ZBV and the baseline.
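The arithmetic above can be checked with a quick sketch, assuming LLaMA-7B's hidden size h = 4096 and vocabulary size 32000 (both are assumptions here, not stated in the thread):

```python
# Back-of-envelope check of the lm-head parameter memory for LLaMA-7B.
# Assumed model dimensions (not stated in the thread):
h = 4096        # hidden size
vocab = 32000   # vocabulary size

# 16 bytes per parameter: parameter + gradient + optimizer states
# (typical mixed-precision Adam accounting)
bytes_per_param = 16

lm_head_bytes = bytes_per_param * h * vocab
print(f"{lm_head_bytes / 2**30:.2f} GiB")  # about 2 GiB
```

This lands at roughly 2 GiB, matching the difference observed between ZBV and the baseline.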
To verify this, you can enlarge the micro-batch size: the memory difference between ZBV and the baseline should stay constant, because in that case only the activation memory grows.
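As a sanity check of that verification idea, here is a toy memory model (the numbers are placeholders, not measurements): peak memory is activation per microbatch times in-flight microbatches, plus a fixed parameter term. Scaling the micro-batch size scales only the activation term, so the ZBV-vs-baseline gap stays constant:

```python
# Toy peak-memory model: act_per_mb * in-flight microbatches + fixed params.
def peak_mem(act_per_mb, inflight, param_mem):
    return act_per_mb * inflight + param_mem

act = 1.0         # GB of activation per microbatch (placeholder)
inflight = 8      # in-flight microbatches on rank 0 for pp_stage=8
param_1f1b = 4.0  # 4 layers + input embedding (placeholder GB)
param_zbv = 6.0   # 4 layers + input embedding + lm-head (placeholder GB)

for scale in (1, 2, 4):  # enlarging the micro-batch size scales activation
    diff = (peak_mem(act * scale, inflight, param_zbv)
            - peak_mem(act * scale, inflight, param_1f1b))
    print(scale, diff)  # the difference stays at 2.0 regardless of scale
```

If the observed gap instead grew with the micro-batch size, the extra memory would be activation-related rather than the lm-head parameters.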
I think for 1F1B the activation is 4x, not 8x. ZBV has memory requirements similar to the interleaved schedule, which needs more than 1F1B due to more warm-up microbatches.