Dear all,
I observed that if I train on multiple GPUs and the training ends via the "patience" criterion, the evaluation done after training gets stuck after loading the checkpoint (I think at this point:
mace/mace/cli/run_train.py, line 702 in 4081abd).
I do not observe this problem if I use "distributed: false" or if the training ends after reaching max_num_epochs.
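For context, a common cause of this kind of hang (not necessarily the one here) is that the early-stop decision is taken asymmetrically across ranks: if only rank 0 decides that patience is exhausted and leaves the training loop, the other ranks stay blocked in their next collective call. A minimal, hypothetical sketch with plain torch.distributed (not MACE's actual run_train.py logic) of the symmetric pattern:

```python
# Hypothetical sketch, not MACE's code: make the "patience" decision on
# rank 0 and broadcast it, so every rank leaves the loop in the same epoch.
# Assumes the default process group has already been initialized
# (dist.init_process_group).
import torch
import torch.distributed as dist

def should_stop(no_improvement_epochs: int, patience: int, device: torch.device) -> bool:
    """If rank 0 simply breaks out of the loop on its own, the remaining ranks
    block forever in their next collective (all_reduce, barrier, checkpoint
    broadcast, ...), which looks exactly like a job that hangs until the
    scheduler kills it."""
    flag = torch.zeros(1, device=device)
    if dist.get_rank() == 0:
        flag.fill_(float(no_improvement_epochs >= patience))
    dist.broadcast(flag, src=0)  # all ranks now see the same decision
    return bool(flag.item())
```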
Thank you for reporting that. Could you tell me in a bit more detail what you mean by "stuck"?
It would be helpful if you could attach your log, or any unexpected output that you observe.
Thanks for your reply.
I attached the log file. It gets stuck after loading the checkpoint; the job keeps running until Slurm kills it for reaching the maximum requested walltime, apparently without doing anything or returning any error. test_run-3242_debug.log
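If it would help to narrow this down, one generic way (not something MACE provides out of the box) to see where the ranks are blocked before Slurm kills the job is to register a faulthandler signal handler in the training script and send that signal to the hung process:

```python
# Hypothetical debugging aid, added near the top of the training script:
# dump every thread's Python stack trace when the process receives SIGUSR1.
import faulthandler
import signal

faulthandler.register(signal.SIGUSR1, all_threads=True)

# While the job hangs, send the signal to the running step, e.g. under Slurm:
#   scancel --signal=USR1 <jobid>
# or directly on the node:
#   kill -USR1 <pid>
# The tracebacks printed to stderr show which call each rank is stuck in
# after the checkpoint is loaded.
```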