[ckpt-rewr] Get Model State Dict Util Function #3250

eracah · 2024-05-03T04:44:18Z

What does this PR do?

Adds an API for extracting the model state dict from a model object.

State dict generation is a necessary operation before the save AND load of a checkpoint.
Currently in Composer, state dict generation is coupled with the State, which makes it hard to read, extend, and test, and hard for users to harness for custom workflows. This PR therefore presents a function that generates the state_dict for the model as a standalone function, decoupled from State. An explicit function for the model is easier to test because we don't have to construct a dummy State. It also makes it easier to save each state dict as a separate file, and an advanced user can call the function directly from a custom script or callback.

This state dict generation function enables:

generating sharded or full state dicts
generating state dicts of different precision
specifying keys to include
specifying keys to exclude

These are all options that will be useful for save and load. Because save and load require state dict generation, we need these options in state dict generation as well.
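
As a rough illustration of the include/exclude options listed above, here is a minimal sketch in plain Python. It is not the code added in this PR: the helper name `filter_state_dict_keys`, its parameters, and the use of glob-style patterns are all assumptions made for the example, and a plain dict of lists stands in for a real `model.state_dict()`. Sharding and precision are not covered.

```python
from fnmatch import fnmatchcase
from typing import Any, Dict, Optional, Sequence


def filter_state_dict_keys(
    state_dict: Dict[str, Any],
    include_keys: Optional[Sequence[str]] = None,
    exclude_keys: Optional[Sequence[str]] = None,
) -> Dict[str, Any]:
    """Hypothetical helper: keep keys matching include_keys, then drop keys matching exclude_keys."""
    filtered = dict(state_dict)
    if include_keys is not None:
        # Keep only keys that match at least one include pattern.
        filtered = {k: v for k, v in filtered.items() if any(fnmatchcase(k, p) for p in include_keys)}
    if exclude_keys is not None:
        # Drop keys that match any exclude pattern.
        filtered = {k: v for k, v in filtered.items() if not any(fnmatchcase(k, p) for p in exclude_keys)}
    return filtered


# Stand-in for model.state_dict(); the values would be tensors in practice.
example_state_dict = {
    'encoder.weight': [1.0, 2.0],
    'encoder.bias': [0.0],
    'head.weight': [3.0],
}

only_encoder = filter_state_dict_keys(example_state_dict, include_keys=['encoder.*'])
no_biases = filter_state_dict_keys(example_state_dict, exclude_keys=['*.bias'])
```

Because the filter is a standalone function over a plain mapping, it can be unit tested without building a State, which is the testability argument made above.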

GRT-2903

mvpatel2000

Is this just pulling out existing code into a helper fn?

@eracah it would be great to get slightly more description so I know what parts to carefully read over and what is less important

eracah · 2024-05-13T21:10:07Z

Is this just pulling out existing code into a helper fn?

@eracah it would be great to get slightly more description so I know what parts to carefully read over and what is less important

It's detailed in the design doc, but I can copy and paste it in if you want

mvpatel2000 · 2024-05-13T23:43:11Z

Is this just pulling out existing code into a helper fn?
@eracah it would be great to get slightly more description so I know what parts to carefully read over and what is less important

It's detailed in the design doc, but I can copy and paste it in if you want

It's easier when reviewing PRs to either link to the right part of design doc or copy paste description

composer/checkpoint/state_dict.py

eracah · 2024-05-15T20:23:55Z

Is this just pulling out existing code into a helper fn?
@eracah it would be great to get slightly more description so I know what parts to carefully read over and what is less important

It's detailed in the design doc, but I can copy and paste it in if you want

It's easier when reviewing PRs to either link to the right part of design doc or copy paste description

Ok added description

mvpatel2000

Mostly LGTM! just a few minor nits that should be quick to clean up. Also looks like tests are failing, I think because torch gating is a bit off

tests/checkpoint/test_state_dict.py

tests/common/models.py

composer/checkpoint/state_dict.py

Co-authored-by: Mihir Patel <[email protected]>

tests/checkpoint/test_state_dict.py

dakinggg

few nits, one larger comment/question.

I'm guessing there are a lot of tests that could be simplified by using the functionality that this PR adds. Is that true? If so, is it worth trying to do at least some of that as part of this PR? It would also implicitly test the functionality being added more, since those would be real use cases.

composer/checkpoint/state_dict.py

tests/common/compare.py

eracah · 2024-05-17T05:05:04Z

few nits, one larger comment/question.

I'm guessing there are a lot of tests that could be simplified by using the functionality that this PR adds. Is that true? If so, is it worth trying to do at least some of that as part of this PR? It would also implicitly test the functionality being added more, since those would be real use cases.

Yes, in theory we would want to do that. However, in this PR we are just adding the API to be used in a script; we aren't actually adding this code to be used in Trainer. So until we swap out the state dict generation code in state.py for this one, we don't need to change those other tests or add any E2E tests.

eracah added 2 commits May 2, 2024 21:41

first commit get model sd

17d80d5

Add test stub

3bb5785

eracah marked this pull request as draft May 3, 2024 04:44

eracah and others added 12 commits May 6, 2024 16:34

Add test stub

1ebbd87

Added unit tests for non-sharded use cases

cca228b

Add support for ComposerModel (+ Precision tweak)

37e9753

Add sharded state dict support

4558182

add precision test

23434c0

Add precision test for sharded state dict

eb45d05

pre-commit

f1acf26

pre-commit

299aae5

Merge branch 'dev' into get-model-sd

5028039

mark gpu

c08abd1

Merge branch 'get-model-sd' of https://github.com/eracah/evan-composer …

b4f380c

…into get-model-sd

pre-commit

a3664ca

eracah marked this pull request as ready for review May 11, 2024 00:17

eracah requested review from mvpatel2000, bigning and dakinggg May 11, 2024 00:18

mvpatel2000 reviewed May 13, 2024

bigning reviewed May 14, 2024

composer/checkpoint/state_dict.py (outdated, resolved)

eracah added 2 commits May 15, 2024 20:11

add error for sharded + non-fsdp

8ab34f1

Merge branch 'get-model-sd' of https://github.com/eracah/evan-composer …

ab00b11

…into get-model-sd

eracah requested review from bigning and mvpatel2000 May 15, 2024 20:28

mvpatel2000 reviewed May 15, 2024

change gating

2bbdc27

eracah and others added 11 commits May 15, 2024 23:24

fix tests

da14580

Merge branch 'dev' into get-model-sd

f20ecb3

Merge branch 'dev' into get-model-sd

24aecbf

docstring

fae7746

Update composer/checkpoint/state_dict.py

ff8d91c

Co-authored-by: Mihir Patel <[email protected]>

Update composer/checkpoint/state_dict.py

aba4900

Co-authored-by: Mihir Patel <[email protected]>

pc

1975a56

fix version

d72e2d4

Merge branch 'get-model-sd' of https://github.com/eracah/evan-composer …

f5e4f53

…into get-model-sd

pc

f3a7cce

pc

50c2308

mvpatel2000 reviewed May 16, 2024

tests/checkpoint/test_state_dict.py (outdated, resolved)

tests/checkpoint/test_state_dict.py (outdated, resolved)

dakinggg reviewed May 16, 2024

composer/checkpoint/state_dict.py (resolved)

composer/checkpoint/state_dict.py (outdated, resolved)

composer/checkpoint/state_dict.py (outdated, resolved)

tests/common/compare.py (outdated, resolved)

eracah and others added 11 commits May 17, 2024 03:56

pre-commit

7fec0c2

Merge branch 'get-model-sd' of https://github.com/eracah/evan-composer …

2e3bf04

…into get-model-sd

Addressed some comments

2941373

pre-commit

307c780

add comments for new simple models

11bdd89

remove todo's

677002b

remove docstring arg

01c1560

change scope name for get_model_state_dict

5ccbe12

Merge branch 'get-model-sd' of https://github.com/eracah/evan-composer …

44468db

…into get-model-sd

pre-commit

aee42ff

Merge branch 'dev' into get-model-sd

94ab3da

eracah requested review from bigning, dakinggg and mvpatel2000 May 17, 2024 05:05

eracah enabled auto-merge (squash) May 17, 2024 05:06

eracah merged commit bddf44b into mosaicml:dev May 17, 2024
15 checks passed
