[QUESTION] Measuring Pipeline Bubble Time During Megatron-LM Training #25

HodBadichi · 2024-06-14T13:39:52Z

I'm curious about how you measured the precise bubble time during a run in your experiments (T_Comm in the paper). Megatron-LM's scheduling combines communication and idle time within the same NCCL operation, making it difficult to distinguish them using timestamps or profilers.

I'm experimenting with Vanilla Megatron-LM to identify real-time bubbles. However, the ncclDevKernel_SendRecv function seems to include both communication and idle time, and even with GPU sampling, it's challenging to determine when the GPU is truly idle and when communication actually happens.

ufotalent · 2024-06-19T02:44:27Z

Hi @HodBadichi, thanks for your interest in our work.

To clarify, the bubble rate we present in Section 5.3 of the paper (EFFICIENCY OF AUTOMATIC SCHEDULING) is the theoretical bubble rate calculated by the scheduler from the profiled Tf, Tb, Tw and Tcomm. It is not measured from the live run. Tcomm is profiled in a warmup phase before the first training iterations, where the communication runs on its own without any other computation.

So in practice there might be a slight difference between the profiled timing and the actual timing during training, due to either randomness or the slowdown caused by communication-computation overlap. But this shouldn't affect the bubble rate analysis too much.
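
As a rough illustration of what "theoretical bubble rate" means here, below is a minimal sketch rather than the scheduler's actual code. It assumes the bubble rate is the fraction of a stage's iteration time that is not spent on F, B and W compute. The function names, the `cost` argument, and the idealized 1F1B cost model (which ignores Tcomm entirely) are hypothetical and only there to make the arithmetic concrete.

```python
def theoretical_bubble_rate(cost, num_microbatches, t_f, t_b, t_w):
    """Fraction of `cost` not covered by F/B/W compute.

    `cost` is assumed to be the execution time of the slowest stage for one
    iteration, as estimated by a scheduler from profiled timings.
    """
    useful = num_microbatches * (t_f + t_b + t_w)
    return (cost - useful) / cost


def idealized_1f1b_cost(num_stages, num_microbatches, t_f, t_b, t_w):
    """Textbook 1F1B iteration time with zero communication cost (assumption)."""
    return (num_microbatches + num_stages - 1) * (t_f + t_b + t_w)


if __name__ == "__main__":
    # Hypothetical profiled timings in milliseconds, not taken from the paper.
    t_f, t_b, t_w = 1.0, 1.0, 1.0
    p, m = 4, 8
    cost = idealized_1f1b_cost(p, m, t_f, t_b, t_w)
    print(f"bubble rate: {theoretical_bubble_rate(cost, m, t_f, t_b, t_w):.4f}")
```

With these inputs the result reduces to (p - 1) / (m + p - 1), the usual 1F1B bubble fraction. A schedule that accounts for Tcomm would produce a different `cost`, but the rate would still be derived from profiled numbers and not from timestamps inside the NCCL kernels.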
