
[QUESTION] I used zero-bubble commit 7ad9c81d for my experiments and found that the memory usage of the zb-v schedule exceeds that of the previous zb1 schedule. What could be the issue? The specific configuration and results are shown in the image. #36

lbk-sys opened this issue Jul 18, 2024 · 4 comments



lbk-sys commented Jul 18, 2024

[Image: the poster's training configuration and the measured memory usage of the zb-v and zb1 runs]


lbk-sys commented Jul 18, 2024

I turned on the following flags for zb-v and zb-1:

zb-v:
--enable-zero-bubble
--zero-bubble-v-schedule
--allow-padding-num-layers
--zero-bubble-max-pending-backward $((1 * $PP))
--enable-optimizer-post-validation

zb-1:
--enable-zero-bubble
--allow-padding-num-layers
--zero-bubble-max-pending-backward $((1 * $PP))
--enable-optimizer-post-validation

ufotalent commented

Hi, thanks for your interest in our work. Theoretically, ZBV has the same activation memory as 1F1B and ZBH1, but one difference is that ZBV also changes the placement of layers. One thing that might cause the gap is the lm-head and the input embedding: under 1F1B they sit on different stages, while under ZBV they end up on the same stage.
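Here is a minimal sketch of why that happens, assuming a 32-layer model split over pp = 8 with two model chunks per stage for ZBV (the helpers are illustrative, not the zero-bubble repo's actual code):

```python
# Illustrative layer placement: 1F1B vs. a V-shaped (ZBV) schedule.
# Assumes 32 transformer layers, pp = 8, and 2 model chunks per stage
# for ZBV; this is a sketch, not the repo's actual placement code.

def stage_1f1b(layer, num_layers=32, pp=8):
    # 1F1B: one contiguous chunk of layers per stage.
    return layer // (num_layers // pp)

def stage_zbv(layer, num_layers=32, pp=8):
    # ZBV: chunks are assigned in a V: stages 0..p-1, then p-1..0,
    # so stage 0 gets both the first and the last chunk.
    chunk = layer // (num_layers // (2 * pp))
    return chunk if chunk < pp else 2 * pp - 1 - chunk

print(stage_1f1b(31))  # 7 -> last layer (and lm-head) on the last stage
print(stage_zbv(31))   # 0 -> last layer (and lm-head) on stage 0,
                       #      alongside the input embedding
```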

A quick calculation:
For 1F1B with pp_stage = 8, the peak memory is on rank 0: 8x the activation of 4 layers + the parameters of 4 layers + the parameters of the input embedding.
For ZBV, the peak memory is on rank 0: 8x the activation of 4 layers + the parameters of 4 layers + the parameters of the input embedding + the parameters of the lm-head.
A brief calculation shows the parameter memory of the lm-head for LLaMA 7B is 16 (parameter + grad + optimizer state factor) * h * vocabulary ≈ 2 GB, close to the difference between ZBV and the baseline.
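Spelling that arithmetic out (assuming LLaMA 7B's h = 4096 and its 32000-token vocabulary; the 16 bytes/param is the usual mixed-precision bookkeeping):

```python
# Rough lm-head parameter-memory check for LLaMA 7B.
# 16 bytes/param assumes bf16 param (2) + bf16 grad (2)
# + fp32 master weight (4) + Adam m and v states (4 + 4).
hidden_size = 4096   # h for LLaMA 7B
vocab_size = 32000   # LLaMA tokenizer vocabulary
bytes_per_param = 16

lm_head_bytes = bytes_per_param * hidden_size * vocab_size
print(f"{lm_head_bytes / 2**30:.2f} GiB")  # ~1.95 GiB, i.e. roughly 2 GB
```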

To verify this, you can enlarge the mbs: the memory difference between ZBV and the baseline should stay constant, because only the activation memory scales with the micro-batch size.
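A toy model of that check (all numbers are placeholders, not measurements): if the gap comes from parameter placement alone, peak memory is linear in mbs with a schedule-independent slope, so the ZBV-baseline difference stays flat:

```python
# Toy model: peak memory = act_per_mb * mbs + param_overhead.
# If the ZBV/baseline gap is pure parameter placement, the gap
# should not move as mbs grows. All numbers are placeholders.
def peak_memory_gb(mbs, act_per_mb_gb, param_overhead_gb):
    return act_per_mb_gb * mbs + param_overhead_gb

for mbs in (1, 2, 4):
    baseline = peak_memory_gb(mbs, act_per_mb_gb=3.0, param_overhead_gb=10.0)
    zbv = peak_memory_gb(mbs, act_per_mb_gb=3.0, param_overhead_gb=12.0)
    print(mbs, zbv - baseline)  # prints a constant 2.0 for every mbs
```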

Thanks!

ufotalent commented

BTW, the acceleration ratio seems lower than we expected. Is it possible to share the logs so we can investigate a bit? Thanks!

Edenzzzz commented

I think for 1F1B the activation is not 8x but 4x. ZBV has memory requirements similar to the interleaved schedule's, which are higher than 1F1B's due to more warm-up microbatches.
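For reference, a sketch of the warm-up microbatch counts that drive this, following the formulas in Megatron-LM-style schedules (the interleaved formula is my reading of that code, so treat it as an assumption; both ignore the cap at the total number of microbatches):

```python
# Warm-up forward passes per rank (activations held before the
# first backward frees anything), Megatron-LM-style schedules.

def warmup_1f1b(pp, rank):
    # Plain 1F1B: rank i runs (pp - rank - 1) warm-up forwards.
    return pp - rank - 1

def warmup_interleaved(pp, rank, num_chunks):
    # Interleaved 1F1B: extra warm-up forwards per extra chunk,
    # hence more activation memory than plain 1F1B.
    return (pp - rank - 1) * 2 + (num_chunks - 1) * pp

pp = 8
print(warmup_1f1b(pp, rank=0))                       # 7
print(warmup_interleaved(pp, rank=0, num_chunks=2))  # 22
```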
