HQ crashes #718
Comments
Hi! Thanks for the report. Here is what is happening:
Now, arguably HQ should skip reading these values instead of crashing here. On the other hand, if you specify that you want to use Slurm and the remaining time limit cannot be found, the worker could start without a time limit, which could be annoying in some cases. So crashing here tells you that something went wrong. As a hotfix, you can try

As a separate question, do you have a specific reason for running the HQ server within a Slurm allocation? It's a valid use case, but normally you can also run it on login nodes, which should be more ergonomic to use.
Thank you so much for your help, it works now without losing workers. Starting the HQ server on a login node would be a good choice, but when I do that and try to start workers using Slurm jobs, I get an error along the lines of "access token found but server is not reachable". This makes me think that in this setup the HQ server and the workers cannot communicate. Looking at the HQ documentation, it seems that a user without admin rights may not be able to fix the communication problem. Thank you again for designing and improving HQ.
I see. On most HPC clusters where we have tried it, login and compute nodes can communicate without problems. But if this is not possible on your cluster, then indeed you'll need to run the server inside allocations.
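For anyone hitting the same "server is not reachable" symptom, one way to narrow it down is a plain TCP connection test run from a compute node towards the login node. The following is a minimal sketch and not part of HQ: the helper name `can_connect` is hypothetical, and the host name and port in the usage comment are placeholders that you would replace with the address and port your own HQ server reports.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds.

    Hypothetical helper for diagnosing node-to-node reachability; it only
    checks that a TCP handshake completes, not that HQ itself is healthy.
    """
    try:
        # create_connection resolves the host name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


# Example usage from inside a Slurm job (placeholder values, replace with your own):
# print(can_connect("login-node.example", 12345))
```

If this returns `False` from a compute node but `True` on the login node itself, the cluster's network setup most likely blocks that path, and running the server inside the allocation, as suggested above, is the practical workaround.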
Hello,
Firstly, thank you for developing HQ.
I recently came across a crash while testing HQ v0.19.0. The submission script, which is submitted to the Slurm manager, is as follows:
The error message is pasted below. Interestingly, this error appears in only some of the otherwise identical runs. Please let me know if I should provide more information for debugging.