
Stuck at evaluation at the end of training for multi-GPU and patience termination #677

Open
dbidoggia opened this issue Nov 8, 2024 · 3 comments
Labels
bug Something isn't working

Comments

@dbidoggia

Dear all,
I observed that if I train on multiple GPUs and the training ends via the "patience" criterion, the evaluation run after training gets stuck after loading the checkpoint (I think at this point

distributed_model = DDP(model, device_ids=[local_rank])

).
I do not observe this problem if I use "distributed: false" or if the training ends after reaching max_num_epochs.
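For context (this sketch is not taken from the MACE code base): constructing DDP is a collective operation, since it broadcasts the parameters from rank 0. If the patience decision can differ between ranks, some ranks may never reach the constructor for the final evaluation while the others wait indefinitely, which would match the behaviour described above. A minimal, self-contained illustration of one possible workaround, broadcasting the stop decision from rank 0 so all ranks agree, using the gloo backend so it runs on CPU:

import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP


def worker(rank: int, world_size: int) -> None:
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    model = torch.nn.Linear(4, 4)

    # Hypothetical per-rank patience decision; the hang can occur when this
    # diverges between ranks, e.g. each rank tracking its own validation loss.
    local_stop = rank == 0

    # Broadcast rank 0's decision so every rank agrees on whether to stop.
    stop_flag = torch.tensor([int(local_stop)])
    dist.broadcast(stop_flag, src=0)

    if bool(stop_flag.item()):
        # The DDP constructor is itself a collective: if only some ranks
        # reach it, the remaining ranks wait forever.
        distributed_model = DDP(model)
        print(f"rank {rank}: running final evaluation with DDP model", flush=True)

    dist.destroy_process_group()


if __name__ == "__main__":
    mp.spawn(worker, args=(2,), nprocs=2)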

@ilyes319
Contributor

ilyes319 commented Nov 11, 2024

Hello,

Thank you for reporting that. Could you tell me in a bit more detail what you mean by "stuck"?
It would be helpful if you could attach your log, or any unexpected output that you observe.

@dbidoggia
Author

Thanks for your reply.
I attached the log file. It gets stuck after loading the checkpoint; the job keeps running until SLURM kills it for reaching the maximum requested time, apparently without doing anything or returning any error.
test_run-3242_debug.log
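Not from the thread or the MACE code base, but one possible way to surface the hang instead of idling until the SLURM walltime is to give the process group an explicit collective timeout. The sketch below assumes a torchrun launch (which provides the rendezvous environment variables), and the 30-minute value is purely illustrative:

# Sketch only: give collectives a finite timeout so a rank stuck in one aborts
# with an error instead of idling until SLURM's walltime limit is reached.
# Assumes a torchrun launch, which sets MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE.
from datetime import timedelta

import torch.distributed as dist

dist.init_process_group(
    backend="nccl",                  # "gloo" works for CPU-only debugging
    timeout=timedelta(minutes=30),   # illustrative value, not from MACE
)
# Note: with NCCL the timeout is only enforced when asynchronous error
# handling is enabled, which recent PyTorch versions do by default.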

@alinelena
Contributor

I have noticed something similar; it looks like a race condition, since I could not reproduce it consistently.

@ilyes319 ilyes319 added the bug Something isn't working label Dec 12, 2024