
[QUESTION] I used zero-bubble commit 7ad9c81d for my experiments and found that the memory usage of the zb-v schedule exceeds that of the previous zb1 schedule. What could be the issue? The specific configuration and results are shown in the image. #36

lbk-sys opened this issue Jul 18, 2024 · 4 comments



lbk-sys commented Jul 18, 2024

[Image: the poster's training configuration and the measured memory usage of the zb-v and zb1 runs]


lbk-sys commented Jul 18, 2024

I turned on the following flags for zb-v and zb-1:

zb-v:
--enable-zero-bubble
--zero-bubble-v-schedule
--allow-padding-num-layers
--zero-bubble-max-pending-backward $((1 * $PP))
--enable-optimizer-post-validation

zb-1:
--enable-zero-bubble
--allow-padding-num-layers
--zero-bubble-max-pending-backward $((1 * $PP))
--enable-optimizer-post-validation

ufotalent commented

Hi, thanks for your interest in our work. Theoretically, ZBV has the same activation memory as 1F1B and ZBH1, but one difference is that ZBV also changes the placement of layers. One thing that might cause the gap is the lm-head and the input embedding: under 1F1B they sit on different stages, while under ZBV they end up on the same stage.
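Here is a minimal sketch of why that happens, assuming a 32-layer model split over pp = 8 with two model chunks per stage for ZBV (the helpers are illustrative, not the zero-bubble repo's actual code):

```python
# Illustrative layer placement: 1F1B vs. a V-shaped (ZBV) schedule.
# Assumes 32 transformer layers, pp = 8, and 2 model chunks per stage
# for ZBV; this is a sketch, not the repo's actual placement code.

def stage_1f1b(layer, num_layers=32, pp=8):
    # 1F1B: one contiguous chunk of layers per stage.
    return layer // (num_layers // pp)

def stage_zbv(layer, num_layers=32, pp=8):
    # ZBV: chunks are assigned in a V: stages 0..p-1, then p-1..0,
    # so stage 0 gets both the first and the last chunk.
    chunk = layer // (num_layers // (2 * pp))
    return chunk if chunk < pp else 2 * pp - 1 - chunk

print(stage_1f1b(31))  # 7 -> last layer (and lm-head) on the last stage
print(stage_zbv(31))   # 0 -> last layer (and lm-head) on stage 0,
                       #      alongside the input embedding
```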

A quick calculation:
For 1F1B with pp_stage = 8, the peak memory is on rank 0: 8x the activation of 4 layers + the parameters of 4 layers + the parameters of the input embedding.
For ZBV, the peak memory is on rank 0: 8x the activation of 4 layers + the parameters of 4 layers + the parameters of the input embedding + the parameters of the lm-head.
A brief calculation shows the parameter memory of the lm-head for LLaMA 7B is 16 (parameter + grad + optimizer state factor) * h * vocabulary ≈ 2 GB, close to the difference between ZBV and the baseline.
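Spelling that arithmetic out (assuming LLaMA 7B's h = 4096 and its 32000-token vocabulary; the 16 bytes/param is the usual mixed-precision bookkeeping):

```python
# Rough lm-head parameter-memory check for LLaMA 7B.
# 16 bytes/param assumes bf16 param (2) + bf16 grad (2)
# + fp32 master weight (4) + Adam m and v states (4 + 4).
hidden_size = 4096   # h for LLaMA 7B
vocab_size = 32000   # LLaMA tokenizer vocabulary
bytes_per_param = 16

lm_head_bytes = bytes_per_param * hidden_size * vocab_size
print(f"{lm_head_bytes / 2**30:.2f} GiB")  # ~1.95 GiB, i.e. roughly 2 GB
```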

To verify this, you can enlarge the mbs: the memory difference between ZBV and the baseline should stay constant, because only the activation memory scales with the micro-batch size.
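A toy model of that check (all numbers are placeholders, not measurements): if the gap comes from parameter placement alone, peak memory is linear in mbs with a schedule-independent slope, so the ZBV-baseline difference stays flat:

```python
# Toy model: peak memory = act_per_mb * mbs + param_overhead.
# If the ZBV/baseline gap is pure parameter placement, the gap
# should not move as mbs grows. All numbers are placeholders.
def peak_memory_gb(mbs, act_per_mb_gb, param_overhead_gb):
    return act_per_mb_gb * mbs + param_overhead_gb

for mbs in (1, 2, 4):
    baseline = peak_memory_gb(mbs, act_per_mb_gb=3.0, param_overhead_gb=10.0)
    zbv = peak_memory_gb(mbs, act_per_mb_gb=3.0, param_overhead_gb=12.0)
    print(mbs, zbv - baseline)  # prints a constant 2.0 for every mbs
```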

Thanks!

ufotalent commented

BTW, the acceleration ratio seems lower than we expected. Is it possible to share the logs so we can investigate a bit? Thanks!

Edenzzzz commented

I think for 1F1B the activation is not 8x but 4x. ZBV has memory requirements similar to the interleaved schedule's, which are higher than 1F1B's due to more warm-up microbatches.
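For reference, a sketch of the warm-up microbatch counts that drive this, following the formulas in Megatron-LM-style schedules (the interleaved formula is my reading of that code, so treat it as an assumption; both ignore the cap at the total number of microbatches):

```python
# Warm-up forward passes per rank (activations held before the
# first backward frees anything), Megatron-LM-style schedules.

def warmup_1f1b(pp, rank):
    # Plain 1F1B: rank i runs (pp - rank - 1) warm-up forwards.
    return pp - rank - 1

def warmup_interleaved(pp, rank, num_chunks):
    # Interleaved 1F1B: extra warm-up forwards per extra chunk,
    # hence more activation memory than plain 1F1B.
    return (pp - rank - 1) * 2 + (num_chunks - 1) * pp

pp = 8
print(warmup_1f1b(pp, rank=0))                       # 7
print(warmup_interleaved(pp, rank=0, num_chunks=2))  # 22
```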
