how to restart a training job that was killed by walltime #91

Open
GengSS opened this issue Jun 13, 2024 · 1 comment

GengSS commented Jun 13, 2024

Hi, how do I restart a training job (started with nequip-train) that was killed because it hit the walltime limit?

I tried resubmitting the job from the original folder, but it fails immediately.

Thank you very much

Best
Geng

@DavidW99

Hi Geng,

Could you share your error message?

Generally, if your previous run was training normally before it was stopped, the software stores the best model in the results directory and should load that model automatically when you continue the training. However, if the first run crashed because of an error rather than the walltime limit, no best model is stored, and you should remove the results directory and restart the training from scratch.
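
As a rough illustration of that decision, here is a minimal sketch that checks a results directory for a saved model before resubmitting. The function name `restart_action` and the checkpoint filename `best_model.pth` are assumptions made for this example, not something confirmed in this thread, so adjust them to whatever your run actually writes.

```python
import tempfile
from pathlib import Path


def restart_action(result_dir, checkpoint_name="best_model.pth"):
    """Suggest what to do before resubmitting a training job.

    checkpoint_name is an assumed filename for the stored best model;
    check your own results directory for the real name.
    """
    result_dir = Path(result_dir)
    if not result_dir.is_dir():
        # Nothing from a previous run: just start training.
        return "fresh"
    if (result_dir / checkpoint_name).is_file():
        # A best model was stored, so continuing the training should work.
        return "resume"
    # The directory exists but holds no stored model: remove it and restart.
    return "clean"


if __name__ == "__main__":
    # Demonstration on a throwaway directory rather than a real run.
    with tempfile.TemporaryDirectory() as tmp:
        run_dir = Path(tmp) / "my_run"
        print(restart_action(run_dir))
        run_dir.mkdir()
        print(restart_action(run_dir))
        (run_dir / "best_model.pth").write_bytes(b"")
        print(restart_action(run_dir))
```

The sketch only reports a suggestion and never deletes anything itself, so you can inspect the directory before removing it.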
