[QUESTION] Measuring Pipeline Bubble Time During Megatron-LM Training #25

Open
HodBadichi opened this issue Jun 14, 2024 · 1 comment

@HodBadichi

I'm curious how you measured the precise bubble time during a run in your experiments (T_Comm in the paper). Megatron-LM's scheduling combines communication and idle time within the same NCCL operation, which makes them difficult to distinguish using timestamps or profilers.

I'm experimenting with vanilla Megatron-LM to identify real-time bubbles. However, the ncclDevKernel_SendRecv kernel seems to include both communication and idle time, and even with GPU sampling, it's challenging to determine when the GPU is truly idle and when communication actually happens.


@ufotalent

Hi @HodBadichi, thanks for your interest in our work.

To clarify, the bubble rate we present in Section 5.3 of the paper (EFFICIENCY OF AUTOMATIC SCHEDULING) is the theoretical bubble rate calculated by the scheduler from the profiled Tf, Tb, Tw and Tcomm. Tcomm is profiled in a warmup phase before the training iterations begin, where the communication runs on its own, without any other computation.
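As a rough illustration of timing an operation in isolation during a warmup phase, here is a minimal Python sketch. It is not the code used in the paper: `profile_isolated`, its parameters, and the `sync` hook are hypothetical names introduced for this example. On a GPU, `op` would be the send/recv call under test and `sync` would be a device synchronization such as `torch.cuda.synchronize`, so that asynchronous kernels are included in the measurement.

```python
import time
from statistics import median
from typing import Callable, Optional


def profile_isolated(
    op: Callable[[], None],
    sync: Optional[Callable[[], None]] = None,
    warmup: int = 3,
    iters: int = 10,
) -> float:
    """Return the median wall-clock time in seconds of `op` run on its own.

    `sync` is an optional barrier called before and after `op`; on a GPU it
    would be a device synchronization (assumption: e.g. torch.cuda.synchronize).
    """

    def run_once() -> float:
        if sync is not None:
            sync()  # make sure no earlier work is still in flight
        start = time.perf_counter()
        op()
        if sync is not None:
            sync()  # wait for the operation itself to finish
        return time.perf_counter() - start

    # Discard the first few runs (connection setup, caches, lazy init).
    for _ in range(warmup):
        run_once()

    # The median is less sensitive to occasional outliers than the mean.
    return median([run_once() for _ in range(iters)])
```

Because nothing else runs while `op` is timed, the result excludes any slowdown from overlapping computation, which is why it can differ slightly from what happens inside a real training iteration.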

So in practice there might be a slight difference between the profiled timing and the actual timing, due either to randomness or to the slowdown caused by communication-computation overlap. But this shouldn't affect the bubble rate analysis too much.
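To make the comparison concrete, here is a small Python sketch that computes a bubble rate from profiled times. It assumes the bubble rate is defined as the fraction of the longest stage's execution time (`cost`) that is not spent on the F, B and W passes of the `num_microbatches` microbatches; the function name and the numbers in the example are illustrative, not values from our experiments.

```python
def bubble_rate(cost: float, num_microbatches: int,
                t_f: float, t_b: float, t_w: float) -> float:
    """Fraction of `cost` not covered by F/B/W compute (assumed definition).

    cost: execution time of the slowest stage for one iteration, either the
          value predicted by a scheduler or a measured iteration time.
    t_f, t_b, t_w: profiled per-microbatch times of the three passes.
    """
    if cost <= 0:
        raise ValueError("cost must be positive")
    useful = num_microbatches * (t_f + t_b + t_w)
    return (cost - useful) / cost


# Illustrative numbers only (arbitrary time units).
theoretical = bubble_rate(cost=250.0, num_microbatches=8,
                          t_f=10.0, t_b=10.0, t_w=10.0)
measured = bubble_rate(cost=260.0, num_microbatches=8,
                       t_f=10.0, t_b=10.0, t_w=10.0)
print(f"theoretical={theoretical:.3f} measured={measured:.3f}")
```

Feeding the scheduler's predicted cost gives the theoretical rate, while feeding a measured iteration time gives an empirical estimate; the gap between the two reflects the randomness and overlap effects mentioned above.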
