I'm curious how you measured the precise bubble time during a run in your experiments (T_Comm in the paper). Megatron-LM's scheduling combines communication and idle time within the same NCCL operation, which makes them difficult to tell apart using timestamps or profilers.
I'm experimenting with vanilla Megatron-LM to identify bubbles in real time. However, the ncclDevKernel_SendRecv kernel seems to include both communication and idle time, and even with GPU sampling it's hard to determine when the GPU is truly idle and when communication is actually happening.
Hi @HodBadichi, thanks for your interest in our work.
To clarify, the bubble rate we present in Section 5.3 of the paper (EFFICIENCY OF AUTOMATIC SCHEDULING) is the theoretical bubble rate calculated by the scheduler from the profiled Tf, Tb, Tw and Tcomm. Tcomm is profiled in a warmup phase before the training iterations start, where the communication runs on its own without any other computation.
So in practice there may be a slight difference between the profiled timing and the actual timing during a run, due to either randomness or the slowdown caused by communication-computation overlap. But this shouldn't affect the bubble rate analysis much.
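For anyone who wants to reproduce that kind of number, here is a rough sketch of how a theoretical bubble rate can be derived from profiled timings. It assumes the bubble rate is the fraction of the slowest stage's execution time (`cost`) that is not spent on forward, backward and weight-gradient work for `m` microbatches. The function names are hypothetical and not taken from the repository, and `cost_1f1b` is the textbook 1F1B estimate that ignores Tcomm entirely, used here only to have a `cost` to plug in.

```python
def bubble_rate(cost, num_microbatches, t_f, t_b, t_w):
    """Fraction of `cost` not covered by useful compute on one stage.

    Assumption: `cost` is the largest execution time across pipeline stages
    for one iteration, as produced by whatever scheduler or simulator is used.
    """
    if cost <= 0:
        raise ValueError("cost must be positive")
    useful = num_microbatches * (t_f + t_b + t_w)
    return (cost - useful) / cost


def cost_1f1b(num_stages, num_microbatches, t_f, t_b, t_w):
    """Textbook 1F1B iteration time with communication ignored (assumption).

    Each stage processes every microbatch and additionally waits for
    (num_stages - 1) microbatch slots of pipeline fill and drain.
    """
    return (num_microbatches + num_stages - 1) * (t_f + t_b + t_w)


if __name__ == "__main__":
    # Illustrative timings in arbitrary units, not measurements.
    t_f, t_b, t_w = 1.0, 1.0, 1.0
    stages, microbatches = 4, 8
    cost = cost_1f1b(stages, microbatches, t_f, t_b, t_w)
    print(f"theoretical bubble rate: {bubble_rate(cost, microbatches, t_f, t_b, t_w):.4f}")
```

With these inputs the result reduces to (p - 1) / (m + p - 1) for p stages and m microbatches. A scheduler that accounts for Tcomm would supply a different `cost`, which is where the profiled communication time enters.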