
Stuck at evaluation at the end of training for multi-GPU and patience termination #677

Open
dbidoggia opened this issue Nov 8, 2024 · 3 comments
Labels
bug Something isn't working

Comments

@dbidoggia

Dear all,
I observed that if I train on multiple GPUs and the training ends via the "patience" criterion, the evaluation run after training gets stuck after loading the checkpoint (I think at this point

distributed_model = DDP(model, device_ids=[local_rank])

).
I do not observe this problem if I use "distributed: false" or if the training ends after reaching max_num_epochs.
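For context (this sketch is not taken from the MACE code base): constructing DDP is a collective operation, since it broadcasts the parameters from rank 0. If the patience decision can differ between ranks, some ranks may never reach the constructor for the final evaluation while the others wait indefinitely, which would match the behaviour described above. A minimal, self-contained illustration of one possible workaround, broadcasting the stop decision from rank 0 so all ranks agree, using the gloo backend so it runs on CPU:

import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP


def worker(rank: int, world_size: int) -> None:
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    model = torch.nn.Linear(4, 4)

    # Hypothetical per-rank patience decision; the hang can occur when this
    # diverges between ranks, e.g. each rank tracking its own validation loss.
    local_stop = rank == 0

    # Broadcast rank 0's decision so every rank agrees on whether to stop.
    stop_flag = torch.tensor([int(local_stop)])
    dist.broadcast(stop_flag, src=0)

    if bool(stop_flag.item()):
        # The DDP constructor is itself a collective: if only some ranks
        # reach it, the remaining ranks wait forever.
        distributed_model = DDP(model)
        print(f"rank {rank}: running final evaluation with DDP model", flush=True)

    dist.destroy_process_group()


if __name__ == "__main__":
    mp.spawn(worker, args=(2,), nprocs=2)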

@ilyes319
Contributor

ilyes319 commented Nov 11, 2024

Hello,

Thank you for reporting that. Could you tell me in a bit more detail what you mean by "stuck"?
It would be helpful if you could attach your log, or any unexpected output that you observe.

@dbidoggia
Author

Thanks for your reply.
I attached the log file. It gets stuck after loading the checkpoint; the job keeps running until SLURM kills it for reaching the maximum requested time, apparently without doing anything or returning any error.
test_run-3242_debug.log
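Not from the thread or the MACE code base, but one possible way to surface the hang instead of idling until the SLURM walltime is to give the process group an explicit collective timeout. The sketch below assumes a torchrun launch (which provides the rendezvous environment variables), and the 30-minute value is purely illustrative:

# Sketch only: give collectives a finite timeout so a rank stuck in one aborts
# with an error instead of idling until SLURM's walltime limit is reached.
# Assumes a torchrun launch, which sets MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE.
from datetime import timedelta

import torch.distributed as dist

dist.init_process_group(
    backend="nccl",                  # "gloo" works for CPU-only debugging
    timeout=timedelta(minutes=30),   # illustrative value, not from MACE
)
# Note: with NCCL the timeout is only enforced when asynchronous error
# handling is enabled, which recent PyTorch versions do by default.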

@alinelena
Contributor

I have noticed something similar; it looks like a race condition, since I could not reproduce it consistently.

@ilyes319 ilyes319 added the bug Something isn't working label Dec 12, 2024