HyperQueue submission gets stuck if workers fail #764
Comments
Hi, sorry for the late reply, I'm quite busy at the moment. The automatic allocator has an internal rate limiter that can cause allocations to be stopped if too many failures occur. I'm not sure if that's what is causing this issue though. Are you still using |
Yes, the HQ server is started by
|
Could you please send me the full log? |
Sure, log from that time period is at https://www.fzu.cz/~svatosm/hq-debug-output.log |
Thank you. It indeed looks like at least one of the issues is rate limiting. Besides the crash count, there is an additional rate limiter that kicks in after too many allocation or submission failures occur in a row. This rate limiter then blocks further submissions until one hour has elapsed. This is clearly not ideal for pre-emptible allocations. One additional issue that I noticed in the log is that some of your tasks are hitting the crash limit. The crash limit prevents re-running tasks that might crash their workers too many times. In your case, if workers fail often, the crash counter of the tasks running on them will be increased repeatedly, which can then cause these tasks to fail. Could you try setting the crash limit of your tasks to |
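To make the rate-limiting behavior described above easier to picture, here is a minimal Python sketch of a limiter that blocks submissions after several failures in a row and only allows them again once a cooldown has passed. This is not HyperQueue's actual implementation: the class name `ConsecutiveFailureLimiter`, its methods, and the threshold of 3 failures are all hypothetical. Only the one-hour cooldown comes from the comment above.

```python
import time


class ConsecutiveFailureLimiter:
    """Illustrative only: blocks submissions after too many failures in a row."""

    def __init__(self, max_failures=3, cooldown_secs=3600.0, clock=time.monotonic):
        self.max_failures = max_failures    # hypothetical threshold
        self.cooldown_secs = cooldown_secs  # one hour, as described in the thread
        self.clock = clock                  # injectable clock, handy for testing
        self.failures_in_a_row = 0
        self.blocked_until = None

    def can_submit(self):
        """Return True if a new allocation may be submitted right now."""
        if self.blocked_until is None:
            return True
        if self.clock() >= self.blocked_until:
            # The cooldown has elapsed: lift the block and start counting afresh.
            self.blocked_until = None
            self.failures_in_a_row = 0
            return True
        return False

    def record_success(self):
        """A successful submission resets the consecutive-failure count."""
        self.failures_in_a_row = 0

    def record_failure(self):
        """Count a failure and start the cooldown once the threshold is reached."""
        self.failures_in_a_row += 1
        if self.failures_in_a_row >= self.max_failures:
            self.blocked_until = self.clock() + self.cooldown_secs
```

Under these assumptions, a burst of pre-empted or failed allocations would stop all new submissions for the full cooldown even if the cluster recovers within minutes, which would look like the queue being "stuck" with a non-empty backlog.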
Hi, |
Btw: v0.20.0 does not increase the crash counter when a worker is explicitly stopped via "hq server stop", so it may happen that a task is restarted even |
Well, I need to ensure that jobs run only once. The crash limit is the only way I know how to do that, but if there is another way, I could switch to that. |
The crash counter is currently the only way, and if you are not using |
Could you please specify the time period where the allocations weren't being created? I tried going through the log, and it seems that the allocations have been submitted at a steady state, e.g.:
It also seems that new workers were being added and tasks were being started periodically. |
Well, when I submitted this ticket, I was referring to the time period between around 3 and 9 (when I made a new allocation queue) on 1.10.2024. |
If it helps, I can provide another example. In this log:
HQ has not submitted batch jobs since the middle of October. |
I have been observing this situation for some time. HyperQueue works fine for a while and then something happens. Jobs keep arriving in HyperQueue and are buffered there, but no workers are submitted even though the allocation queue has its backlog set to 10. The situation from this morning: