make save dir for each model training job #102

Open
dkimpara opened this issue Sep 19, 2024 · 2 comments
@dkimpara
Collaborator

for fsdp: it's model_checkpoint.pt
for everything else: it's checkpoint.pt

@dkimpara dkimpara self-assigned this Sep 19, 2024
@dkimpara
Collaborator Author

conclusion: specify the checkpoint name under save_loc, add default behavior to the parser, and change train.py accordingly.
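
A rough sketch of what that default behavior could look like, assuming the config is a plain dict. The keys `checkpoint_name` and `mode` and the helper `resolve_checkpoint_path` are made up for illustration and are not names confirmed from the codebase. Only `save_loc` and the two file names come from this thread.

```python
import os

# Defaults taken from the file names mentioned earlier in this thread.
DEFAULT_CHECKPOINT_NAMES = {"fsdp": "model_checkpoint.pt"}
FALLBACK_CHECKPOINT_NAME = "checkpoint.pt"


def resolve_checkpoint_path(conf):
    """Return the checkpoint path under conf["save_loc"].

    `checkpoint_name` and `mode` are hypothetical config keys used
    only for this sketch.
    """
    mode = conf.get("mode", "none")
    default_name = DEFAULT_CHECKPOINT_NAMES.get(mode, FALLBACK_CHECKPOINT_NAME)
    # An explicit name in the config wins; otherwise fall back to the default.
    name = conf.get("checkpoint_name") or default_name
    return os.path.join(conf["save_loc"], name)
```

With something like this, train.py and the predict scripts could call one helper instead of each hardcoding the file name.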

@dkimpara
Collaborator Author

update on renaming checkpoints: I've edited the bits of the codebase I'm familiar with, but I'm not familiar with the other parts where changes still need to happen:

  • need to name the model checkpoints we're saving to
  • need to name the optimizer checkpoints we're loading from / saving to

I think it would be less complex to make a new dir for each training run, named by a timestamp or similar. This also safeguards against accidentally overwriting files.
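
A minimal sketch of the timestamped-dir idea, using only the standard library. The helper name `make_run_dir` and the timestamp format are assumptions for illustration, not something already in the repo.

```python
import os
from datetime import datetime


def make_run_dir(save_loc, now=None):
    """Create and return a fresh, timestamped subdirectory of save_loc."""
    now = now or datetime.now()
    run_dir = os.path.join(save_loc, now.strftime("%Y%m%d_%H%M%S"))
    # exist_ok=False raises FileExistsError instead of silently reusing
    # (and possibly overwriting) a previous run's directory.
    os.makedirs(run_dir, exist_ok=False)
    return run_dir
```

Checkpoints could then keep their current fixed names inside each run dir. Two jobs started within the same second would collide and fail loudly, so a job ID or similar suffix might be needed on a cluster.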

I've left the progress I've made so far in the branch rename_checkpoint.

@dkimpara dkimpara changed the title from "checkpoint names are hardcoded into train and predict etc scripts" to "make new dir for each model training job" Oct 11, 2024
@dkimpara dkimpara changed the title from "make new dir for each model training job" to "make save dir for each model training job" Oct 11, 2024