Enable checkpointing with DCP #26

Merged: 5 commits merged into main on Feb 6, 2024
Conversation

@fegin (Contributor) commented on Jan 31, 2024

Summary:
This PR enables checkpointing. It only enables checkpointing to local storage; this checkpoint manager can support remote storage only once DCP provides automatic storage detection.

This PR does not checkpoint the dataloader.

Test Plan:
Changed CHECKPOINT_FOLDER to /tmp/checkpoint_chienchin and ran ./run_llama_train.sh twice. The first run went through all 100 steps and saved checkpoints. The second run loaded the checkpoint back and detected that the saved step count is 100, so no training was done in the second run.
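For readers new to DCP (torch.distributed.checkpoint), here is a minimal sketch of saving and loading to local storage, which is the scope of this PR. It assumes a recent PyTorch where dcp.save/dcp.load and FileSystemWriter/FileSystemReader are available and that torch.distributed is already initialized; the step-<N> folder layout and function names are illustrative, not the PR's exact code.

import torch
import torch.distributed.checkpoint as dcp


def save_checkpoint(model: torch.nn.Module, step: int, folder: str) -> None:
    # DCP saves whatever state_dict it is given; here we keep it to the model.
    state = {"model": model.state_dict()}
    dcp.save(state, storage_writer=dcp.FileSystemWriter(f"{folder}/step-{step}"))


def load_checkpoint(model: torch.nn.Module, step: int, folder: str) -> None:
    # DCP loads into the provided state_dict in place, so we pass the current
    # model state as the template and then apply it back to the module.
    state = {"model": model.state_dict()}
    dcp.load(state, storage_reader=dcp.FileSystemReader(f"{folder}/step-{step}"))
    model.load_state_dict(state["model"])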

@facebook-github-bot added the "CLA Signed" label on Jan 31, 2024 (this label is managed by the Meta Open Source bot).
@wanchaol (Contributor) left a comment:

Looks great! thanks for doing this super fast! I have a few suggestions inlined.

@@ -7,6 +7,11 @@ TRAINER_DIR=${1:-/home/$USER/local/torchtrain}
MODEL="debugmodel"
NGPU=8
MP=4
# Change this string to a meaningful one to enable checkpoint
CHECKPOINT_FOLDER=""
@wanchaol: Can we change this to something like /tmp/torchtrain so that it saves somewhere when we run it locally?

@fegin (author, Feb 1, 2024): We should make this an opt-in feature so that people are not surprised and do not save too many files to /tmp when sharing the same machine. Also, once a training run finishes there will be a checkpoint, so users may unknowingly skip all new training because an existing checkpoint already records the last step. That happens a lot, so it is better to keep this opt-in.

@wanchaol: Makes sense.
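A minimal sketch of the opt-in behavior agreed on here, assuming a hypothetical CheckpointManager whose folder argument comes from CHECKPOINT_FOLDER: an empty string means the user did not opt in, and save()/load() become no-ops. This is an illustration, not the PR's implementation.

class CheckpointManager:
    def __init__(self, folder: str, interval: int) -> None:
        # An empty folder string disables checkpointing entirely (opt-in behavior).
        self.folder = folder
        self.interval = interval

    def save(self, curr_step: int) -> None:
        if not self.folder:
            return
        ...  # build the state dict and write it with DCP

    def load(self, step: int = -1) -> bool:
        if not self.folder:
            return False
        ...  # read the checkpoint with DCP
        return True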

train.py (outdated):
)
parser.add_argument(
"--checkpoint-interval-type",
type=str, default="seconds",
@wanchaol: Do we need the "seconds" interval type here? I think a simple checkpoint-interval documented as a number of iterations should be enough. Since we ultimately don't know how long a model's fwd/bwd/optim step takes, a number of iterations is more sound than seconds.

@fegin (author, Feb 1, 2024): I think we can keep the time-based feature as described above but make steps the default.
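A sketch of what that compromise could look like for the flags in this diff: both interval types stay, but steps is the default. The choices, defaults, and help strings below are assumptions for illustration.

import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--checkpoint-interval-type",
    type=str,
    default="steps",  # step-based by default, per this thread
    choices=["steps", "seconds"],
    help="Whether --checkpoint-interval counts training steps or wall-clock seconds.",
)
parser.add_argument(
    "--checkpoint-interval",
    type=int,
    default=100,  # illustrative value only
    help="Checkpoint every N steps (or every N seconds when the interval type is 'seconds').",
)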

train.py (outdated):
rank0_log(f"current loss: {train_state.current_loss}")

checkpoint.save(train_state.step)

if train_state.step == args.steps:
@wanchaol: IIUC, this means we save a final checkpoint after all steps complete?

@fegin (author): Yes.

and (curr_step - self.begin) < self.interval
):
return
if self.interval_type == IntervalType.SECONDS:
@wanchaol: I would prefer we get rid of the seconds handling, as mentioned in the other comment, to keep our stack simple. We can add it back once we feel this mode is needed for actual training.

@fegin (author, Feb 1, 2024): It is sometimes better to use time, because many factors can change the per-iteration time, such as model type, batch size, and other settings; using steps may require some tuning to avoid hurting overall performance.

We can change the default to steps so that users don't need to worry about it for now.

@wanchaol: Sure, sounds good! My main motivation is to keep our library as simple as possible. Once we start real trainings we can evaluate whether we would use the time interval type and decide later whether to keep it.
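For illustration, a self-contained sketch of a trigger that supports both interval types with steps as the default, loosely following the IntervalType fragment above; the field names and reset logic are assumptions, not the PR's code.

import enum
import time


class IntervalType(enum.Enum):
    SECONDS = "seconds"
    STEPS = "steps"


class CheckpointTrigger:
    def __init__(self, interval: int, interval_type: IntervalType = IntervalType.STEPS) -> None:
        self.interval = interval
        self.interval_type = interval_type
        self.begin_step = 0                 # step at the last checkpoint
        self.begin_time = time.monotonic()  # time of the last checkpoint

    def should_save(self, curr_step: int) -> bool:
        if self.interval_type == IntervalType.STEPS:
            due = (curr_step - self.begin_step) >= self.interval
        else:  # IntervalType.SECONDS
            due = (time.monotonic() - self.begin_time) >= self.interval
        if due:
            self.begin_step = curr_step
            self.begin_time = time.monotonic()
        return due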

)

def load(self, step: int = -1) -> bool:
if not self.folder:
@wanchaol: Nit: we should check, either in train.py or in these save and load methods, that step % checkpoint_interval == 0, so that we skip the save/load logic when we don't need to checkpoint.
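The nit distilled into a small helper; the name should_checkpoint and its signature are hypothetical, not the PR's code.

def should_checkpoint(step: int, interval: int, folder: str) -> bool:
    # Checkpoint only when the feature is enabled (non-empty folder) and the
    # step lands exactly on the configured interval.
    return bool(folder) and interval > 0 and step % interval == 0


# e.g. should_checkpoint(500, 100, "/tmp/torchtrain") -> True
#      should_checkpoint(501, 100, "/tmp/torchtrain") -> False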

@fegin force-pushed the chienchin_enable_checkpoint branch from a0257bc to 017680f on February 1, 2024, 23:57
@wanchaol (Contributor) left a comment:

LGTM! I have a few more minor comments inlined.

@@ -30,6 +31,18 @@ class TrainState:
current_loss: float = -1
losses: List[float] = field(default_factory=list)

def state_dict(self) -> Dict[str, Any]:
@wanchaol: Nit: to avoid confusion with the model/optim state dict, we should rename this to something like train_state.

@fegin (author): This naming is required by DCP.
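Some context for this exchange: DCP treats objects that expose methods named exactly state_dict() and load_state_dict() as stateful components and calls those methods during save and load, so the names on TrainState cannot be changed. Below is a sketch of such a TrainState, with fields mirroring the diff above; the exact tensor encoding is an assumption.

from dataclasses import dataclass, field
from typing import Any, Dict, List

import torch


@dataclass
class TrainState:
    step: int = 0
    current_loss: float = -1
    losses: List[float] = field(default_factory=list)

    # DCP looks up these exact method names on stateful objects.
    def state_dict(self) -> Dict[str, Any]:
        return {
            "step": torch.tensor(self.step, dtype=torch.int32),
            "current_loss": torch.tensor(self.current_loss, dtype=torch.float32),
        }

    def load_state_dict(self, state_dict: Dict[str, Any]) -> None:
        self.step = int(state_dict["step"].item())
        self.current_loss = float(state_dict["current_loss"].item())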

"losses": torch.tensor(self.current_loss, dtype=torch.float32),
}

def load_state_dict(self, state_dict) -> None:
@wanchaol: Ditto: load_train_state, to avoid confusion with DCP.save/load_state_dict.

@fegin (author): This naming is required by DCP.

f"{time.monotonic() - begin} seconds"
)

def load(self, step: int = -1) -> bool:
@wanchaol: Why do we have a step arg here? It seems we don't use this arg either; we should remove it for now.

@fegin (author): We do use the step. When more than one checkpoint is saved, users can specify the step to load a specific checkpoint.
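A sketch of how a step argument can select among multiple saved checkpoints, assuming a hypothetical step-<N> subfolder layout and helper name; step == -1 picks the latest.

import os
import re
from typing import Optional


def find_checkpoint_path(folder: str, step: int = -1) -> Optional[str]:
    # Collect the step numbers of every saved checkpoint under `folder`.
    pattern = re.compile(r"step-(\d+)")
    saved_steps = sorted(
        int(m.group(1)) for name in os.listdir(folder) if (m := pattern.fullmatch(name))
    )
    if not saved_steps:
        return None
    chosen = saved_steps[-1] if step == -1 else step
    return os.path.join(folder, f"step-{chosen}") if chosen in saved_steps else None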


@wconstab (Contributor) commented on Feb 2, 2024:

Could you try rebasing before you merge? You'll pick up the linter CI that way and it'll force you to lint your new files. Hopefully not too many conflicts to resolve, since most of the hairy linter changes were in model.py.

@fegin added 3 commits on February 5, 2024, 11:31
@fegin force-pushed the chienchin_enable_checkpoint branch from 0541be2 to fe2e1c6 on February 5, 2024, 19:34
@fegin added 2 commits on February 5, 2024, 11:39
@fegin merged commit 6bd9082 into main on Feb 6, 2024 (3 checks passed).
lessw2020 pushed a commit that referenced this pull request on Apr 18, 2024.