-
Notifications
You must be signed in to change notification settings - Fork 256
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make CheckpointManager friendlier to custom StorageWriter/StorageReader #789
base: main
Are you sure you want to change the base?
Conversation
I see that CI fails on my new tests as they use CUDA and CI does not support that. Should I mark the test for async mode as skipped and move to CPU? |
@@ -402,24 +438,56 @@ def sync_func(): | |||
sync_func() | |||
self.staging = False | |||
|
|||
def _check_checkpoint_exitsts(self, step: int) -> bool: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for this PR!
I wanted to note that this should be "exists" rather than the current spelling, "exitsts"...thus _check_checkpoint_exists.
In a way, it's a nit but it's also important as misspelled apis can create confusion (have personally hit such things where type in correct spelling but api has mis-spelling and not obvious failures result.
There are 2 calls here to this function that would need to also be update.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor point - '_verify_checkpoint_exists' would be smoother name (vs check_checkpoint).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice catch, thanks! Will surely fix this.
Isn't fsspec enough for S3 usage? If fsspec is installed, DCP should also accept S3 path. |
Thank you for taking a look! Let me provide some more motivation here to answer your question:
I haven't checked this myself yet, but it should work with DCP itself, yes. But:
addressing the above points is the main focus of this PR. |
The goal of this PR is to make using custom
torch.distributed.checkpoint.storage.StorageReader/StorageWriter
s with torchtitan easier.For example, using
S3StorageWriter
from s3-connector-for-pytorch would allow storing checkpoints in AWS S3.Two changes needed to be made to allow this:
storage_reader
andstorage_writer
into theCheckpointManager
.os.path
module under overridable methods (as this logic is specific to the defaulttorch.distributed.checkpoint.filesystem.FileSystemReader/FileSystemWriter
).I also added basic tests for CheckpointManager to make sure I didn't break the original behavior