enable aggregate mem monitoring #3042

vchiley · 2024-02-21T03:28:03Z

What does this PR do?

Enables the memory monitor to aggregate stats across GPUs.
With dynamic graph execution, some GPUs may use more memory than others; this update lets the user aggregate memory stats across GPUs.
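To make the idea concrete, here is a minimal pure-Python sketch of cross-GPU aggregation. It is not the code in this PR: the function name `aggregate_memory_stats`, the reducer names, and the stat keys are all hypothetical, and the per-rank values are made up for illustration. A real implementation would presumably gather or all-reduce the stats with `torch.distributed` rather than receive them as a list.

```python
from statistics import mean

# Hypothetical reducer names for this sketch; not taken from Composer's API.
REDUCERS = {"max": max, "min": min, "avg": mean}


def aggregate_memory_stats(per_rank_stats, reduce="max"):
    """Combine per-rank memory stats (one dict per GPU) into a single dict."""
    if not per_rank_stats:
        raise ValueError("need stats from at least one rank")
    reducer = REDUCERS[reduce]
    # Assumes every rank reports the same set of keys.
    keys = per_rank_stats[0].keys()
    return {key: reducer([stats[key] for stats in per_rank_stats]) for key in keys}


# Illustrative numbers only: rank 0 uses far more memory than rank 1.
per_rank = [
    {"peak_allocated_gb": 31.5, "peak_reserved_gb": 34.0},
    {"peak_allocated_gb": 12.25, "peak_reserved_gb": 16.0},
]
worst_case = aggregate_memory_stats(per_rank, reduce="max")
average = aggregate_memory_stats(per_rank, reduce="avg")
```

Taking the maximum surfaces the GPU closest to running out of memory, which a single rank's local stats can hide when usage is uneven.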

What issue(s) does this change relate to?

Discussed offline
Local testing via: https://github.com/mosaicml/llm-foundry-private/pull/153

Before submitting

Have you read the contributor guidelines?
Is this change a documentation change or typo fix? If so, skip the rest of this checklist.
Was this change discussed/approved in a GitHub issue first? It is much more likely to be merged if so.
- discussed offline
Did you update any related docs and document your change?
Did you update any related tests and add any new tests related to your change? (see testing)
Did you run the tests locally to make sure they pass?
Did you run pre-commit on your change? (see the pre-commit section of prerequisites)

mvpatel2000

Can you please add a unit test for this?

In particular, I would write a distributed test where rank 0 creates a large tensor, and then show that rank 1 gets a higher mem usage.
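The logic of the suggested test can be sketched without GPUs by simulating two ranks. This is only an illustration under stated assumptions: `fake_all_reduce_max` is a hypothetical stand-in for a MAX all-reduce, and the byte counts are arbitrary. The actual test in the PR would run on real devices through the distributed test harness.

```python
def fake_all_reduce_max(local_values):
    """Stand-in for a MAX all-reduce: every rank receives the global maximum."""
    global_max = max(local_values)
    return [global_max for _ in local_values]


def simulate_two_rank_test(big_tensor_bytes=1 << 30, baseline_bytes=1 << 20):
    # Rank 0 holds the large tensor; rank 1 only has the baseline allocation.
    local_peak = [baseline_bytes + big_tensor_bytes, baseline_bytes]
    aggregated_peak = fake_all_reduce_max(local_peak)
    return local_peak, aggregated_peak


local_peak, aggregated_peak = simulate_two_rank_test()

# With aggregation, rank 1 reports more memory than it used locally.
assert aggregated_peak[1] > local_peak[1]
# Rank 0 already held the maximum, so its reported value is unchanged.
assert aggregated_peak[0] == local_peak[0]
```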

mvpatel2000

LGTM. One small bug and a request for a comment

composer/callbacks/memory_monitor.py

Co-authored-by: Mihir Patel <[email protected]>

mvpatel2000

🚀 🐈 🐒

* enable aggregate mem monitoring
* add test
* lint
* make more deterministic
* pr comments
* Update composer/callbacks/memory_monitor.py

Co-authored-by: Mihir Patel <[email protected]>

* updt doc str

---------

Co-authored-by: Mihir Patel <[email protected]>

enable aggregate mem monitoring

26197c1

vchiley requested review from mvpatel2000 and dakinggg February 21, 2024 03:28

mvpatel2000 reviewed Feb 21, 2024

vchiley and others added 5 commits February 21, 2024 14:08

Merge branch 'dev' into dist_mem_monitor

3f9314a

Merge branch 'dev' into dist_mem_monitor

3a263e9

add test

1192ec5

Merge branch 'dev' into dist_mem_monitor

de7e2bb

lint

0ebbc08

vchiley requested review from josejg and mvpatel2000 February 22, 2024 20:43

make more deterministic

14227f3

vchiley force-pushed the dist_mem_monitor branch from e944f70 to 14227f3 February 22, 2024 21:36

mvpatel2000 reviewed Feb 23, 2024

composer/callbacks/memory_monitor.py (outdated, resolved)

composer/callbacks/memory_monitor.py (resolved)

vchiley and others added 2 commits February 23, 2024 18:07

pr comments

278af05

Merge branch 'dev' into dist_mem_monitor

48b36f0

vchiley requested a review from mvpatel2000 February 23, 2024 18:08

mvpatel2000 reviewed Feb 23, 2024

composer/callbacks/memory_monitor.py (outdated, resolved)

vchiley and others added 2 commits February 23, 2024 10:55

Update composer/callbacks/memory_monitor.py

b4b2656

Co-authored-by: Mihir Patel <[email protected]>

updt doc str

a2413de

vchiley requested a review from mvpatel2000 February 23, 2024 19:01

mvpatel2000 approved these changes Feb 23, 2024

vchiley enabled auto-merge (squash) February 23, 2024 19:16

vchiley merged commit a042759 into mosaicml:dev Feb 23, 2024
14 checks passed
