Create a miniversion containing only ZB-H1 and essential changes so other megatron forks can easily integrate #10
Comments
I ran a llama-7B instance in pipeline parallel mode (pp size=8, tp size=1) using ZB-H1, but I don't see any clear performance improvement over the original schedule; the duration of each step is the same in both cases. Does that make sense? @QPHutu @ufotalent @P2333
Hi, thanks for trying out ZB-H1. The result seems problematic, because in this situation ZB-H1 should provide some acceleration. Could you share some details on the setup? For example, which code repo you used, briefly what changes you made to enable ZB-H1, and the number of mini batches in the pipeline.
I ran the latest version of Megatron-LM and patched in the quick implementation of ZB-H1 (commit id: 95212f7). The global batch size and the mini batch size of the test are 256 and 4 respectively. @ufotalent @QPHutu @P2333
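(For reference, the number of microbatches per pipeline that the earlier question refers to follows from these two numbers. The sketch below is only a back-of-the-envelope calculation assuming the usual Megatron-LM relationship; the data-parallel size is a hypothetical value, since it is not stated in this thread.)

```python
# Hypothetical arithmetic for the setup described above.
global_batch_size = 256
micro_batch_size = 4
data_parallel_size = 1  # assumption; adjust to the actual DP degree of the run

num_microbatches = global_batch_size // (micro_batch_size * data_parallel_size)
print(num_microbatches)  # 64 with dp=1; with pp=8, the relative bubble of the
                         # baseline schedule shrinks as this number grows
```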
@robotsp is it possible to share the training script?
@ufotalent
Here are some of my parameter settings:
Hi @GeLee-Q, thanks for the interest in our work. Could you share which version (or git commit) you are using? Thanks.
Thank you, I have integrated these two modifications into my own version of the Megatron-LM library that I use: NVIDIA@95212f7#diff-6078a722754eba8b855a8156b2dc22283858d10acd2d6bc8115086f35d4fbb7b
Oh, I see the problem. I think the reason is that you turned on 'no_gradient_accumulation_fusion', which skips our modification in layers.py. My suggestion is to remove this flag. If you need to turn off gradient accumulation fusion for some reason, then in the W pass you will need to do the weight matmul and gradient accumulation manually, something like the sketch below:
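(The original snippet was not captured in this thread; the following is only a minimal PyTorch sketch of that idea, assuming a linear layer whose parameter keeps its accumulation buffer in `weight.main_grad` as Megatron-LM does when fusion is enabled. The helper name and the fallback to `.grad` are assumptions, not the authors' code.)

```python
import torch

def w_pass_weight_grad(input_, grad_output, weight):
    """Hypothetical helper for the W pass of a linear layer when
    gradient_accumulation_fusion is disabled: compute dW = dY^T @ X
    and accumulate it into the parameter's gradient buffer manually."""
    # Collapse all leading (sequence, batch) dimensions into one.
    input_2d = input_.reshape(-1, input_.shape[-1])                   # [tokens, in_features]
    grad_output_2d = grad_output.reshape(-1, grad_output.shape[-1])   # [tokens, out_features]

    # The weight matmul of the backward pass.
    grad_weight = grad_output_2d.t().matmul(input_2d)                 # [out_features, in_features]

    # Manual gradient accumulation. main_grad is the buffer Megatron-LM
    # normally fuses into; falling back to .grad is an assumption.
    if getattr(weight, "main_grad", None) is not None:
        weight.main_grad.add_(grad_weight)
    elif weight.grad is not None:
        weight.grad.add_(grad_weight)
    else:
        weight.grad = grad_weight
```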
Sorry, I forgot to @GeLee-Q.
Thank you very much for your guidance. I will continue to study your work in depth going forward. |
- @ufotalent: to implement a version using our own running engine and async IO
- @QPHutu: to implement a version by modifying the 1F1B schedule using sync IO