I have been running training on a different framework with FSDP1, where I saved the states with FULL_STATE_DICT, leading to optimizer states in a plain torch.save format. I'd love to resume from this checkpoint. Is this currently supported by FSDP2 / DCP? When I naively try dcp.load, it results in a shard index out of range error.
There should be a way to load it with DCP. cc: @fegin @mori360
Full state dicts are state dicts without FSDP sharding. To make them loadable into FSDP2, you just need to iterate over the tensors in the optimizer state (which should match the parameter sharding) and shard them on dim-0 with DTensor. This can be done with some relatively simple code natively, but I will let @fegin or others comment on the right way to do this with the DCP APIs.
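For illustration, a minimal sketch of the dim-0 sharding idea, assuming a 1-D device mesh and a standard optimizer state_dict layout (`full_osd` and `mesh` are placeholder names, not an official API):

```python
import torch
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor import distribute_tensor, Shard

# Assumes torch.distributed is already initialized; 1-D mesh over all ranks.
mesh = init_device_mesh("cuda", (torch.distributed.get_world_size(),))

def shard_optimizer_state(full_osd):
    """Convert plain tensors in full_osd["state"] to dim-0 sharded DTensors."""
    for param_id, state in full_osd["state"].items():
        for key, value in state.items():
            # Skip scalars such as "step"; only shard real (non-0-dim) tensors.
            if torch.is_tensor(value) and value.dim() > 0:
                state[key] = distribute_tensor(value, mesh, placements=[Shard(0)])
    return full_osd
```

The dim-0 sharding matches how FSDP2 shards each parameter, so the resulting DTensors line up with the wrapped model's layout.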
Yes, you can write a script to do the conversion offline -- simply load the torch.save optimizer state_dict and then call DCP.save. The saved checkpoint should then be loadable with FSDP2 + DCP. This is a more complicated version of the same idea: https://github.com/pytorch/torchtitan/blob/main/scripts/convert_llama_to_dcp.py. In your case, if everything stays the same (e.g., the parameter groups), simply loading (torch.load) -> DCP.save should work, as in the sketch below.
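A minimal sketch of that offline conversion, assuming the checkpoint file name and the "model" / "optimizer" keys (adjust these to your actual checkpoint layout):

```python
import torch
import torch.distributed.checkpoint as dcp

# Load the FULL_STATE_DICT checkpoint that was written with torch.save.
full_ckpt = torch.load("full_checkpoint.pt", map_location="cpu")

# Keep the same top-level structure you plan to pass to dcp.load at resume time.
state_dict = {
    "model": full_ckpt["model"],
    "optimizer": full_ckpt["optimizer"],
}

# Write a DCP-format checkpoint directory; DCP can reshard it into
# FSDP2's DTensor layout when it is loaded later.
dcp.save(state_dict, checkpoint_id="dcp_checkpoint")
```

At resume time, build the FSDP2-wrapped model and optimizer first, get their state_dicts, and call dcp.load against the same keys so DCP can reshard the full tensors onto each rank.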