-
Notifications
You must be signed in to change notification settings - Fork 429
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
enable aggregate mem monitoring #3042
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you please add a unit test for this?
In particular I would write a distributed test where rank 0 creates a large tensor and then show rank 1 gets a higher mem usage
e944f70
to
14227f3
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. One small bug and a request for a comment
Co-authored-by: Mihir Patel <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🚀 🐈 🐒
* enable aggregate mem monitoring * add test * lint * make more deterministic * pr comments * Update composer/callbacks/memory_monitor.py Co-authored-by: Mihir Patel <[email protected]> * updt doc str --------- Co-authored-by: Mihir Patel <[email protected]>
* enable aggregate mem monitoring * add test * lint * make more deterministic * pr comments * Update composer/callbacks/memory_monitor.py Co-authored-by: Mihir Patel <[email protected]> * updt doc str --------- Co-authored-by: Mihir Patel <[email protected]>
What does this PR do?
Enables memory monitor to aggregate stats across GPUs.
With dynamic graph execution, some GPUs may have more memory usage than others; this update allows the user to aggregate memory stats across GPUs.
What issue(s) does this change relate to?
Discussed offline
Local testing via: https://github.com/mosaicml/llm-foundry-private/pull/153
Before submitting
pre-commit
on your change? (see thepre-commit
section of prerequisites)