HyperQueue submission gets stuck if workers fail #764

Open
svatosFZU opened this issue Oct 1, 2024 · 12 comments

@svatosFZU

For some time now, I have been observing the following situation: HyperQueue works fine for a while and then something happens. Jobs keep arriving at HyperQueue and are buffered there, but no workers are submitted even though the allocation queue has a backlog set to 10. The situation from this morning:

  • there are no jobs in the batch system
[[email protected] ~]$ squeue --me
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
  • the worker list confirms that
[[email protected] ~]$ hq-v0.19.0-linux-x64/hq worker list
+----+-------+----------+-----------+---------+----------------+
| ID | State | Hostname | Resources | Manager | Manager Job ID |
+----+-------+----------+-----------+---------+----------------+
  • there are enough jobs buffered
[[email protected] ~]$ hq-v0.19.0-linux-x64/hq job list
+------+------+---------+-------+
| ID   | Name | State   | Tasks |
+------+------+---------+-------+
| 4565 | bash | WAITING | 1     |
| 4567 | bash | WAITING | 1     |
| 4573 | bash | WAITING | 1     |
| 4584 | bash | WAITING | 1     |
| 4596 | bash | WAITING | 1     |
| 4597 | bash | WAITING | 1     |
| 4600 | bash | WAITING | 1     |
| 4609 | bash | WAITING | 1     |
| 4618 | bash | WAITING | 1     |
| 4626 | bash | WAITING | 1     |
| 4634 | bash | WAITING | 1     |
| 4635 | bash | WAITING | 1     |
| 4636 | bash | WAITING | 1     |
| 4637 | bash | WAITING | 1     |
| 4638 | bash | WAITING | 1     |
| 4639 | bash | WAITING | 1     |
| 4640 | bash | WAITING | 1     |
| 4641 | bash | WAITING | 1     |
| 4642 | bash | WAITING | 1     |
| 4644 | bash | WAITING | 1     |
| 4645 | bash | WAITING | 1     |
| 4648 | bash | WAITING | 1     |
| 4651 | bash | WAITING | 1     |
| 4652 | bash | WAITING | 1     |
| 4653 | bash | WAITING | 1     |
| 4654 | bash | WAITING | 1     |
| 4655 | bash | WAITING | 1     |
| 4656 | bash | WAITING | 1     |
| 4657 | bash | WAITING | 1     |
| 4659 | bash | WAITING | 1     |
| 4660 | bash | WAITING | 1     |
| 4663 | bash | WAITING | 1     |
| 4664 | bash | WAITING | 1     |
| 4666 | bash | WAITING | 1     |
| 4668 | bash | WAITING | 1     |
| 4670 | bash | WAITING | 1     |
| 4674 | bash | WAITING | 1     |
| 4677 | bash | WAITING | 1     |
| 4679 | bash | WAITING | 1     |
| 4680 | bash | WAITING | 1     |
| 4682 | bash | WAITING | 1     |
| 4683 | bash | WAITING | 1     |
| 4684 | bash | WAITING | 1     |
| 4685 | bash | WAITING | 1     |
| 4688 | bash | WAITING | 1     |
| 4689 | bash | WAITING | 1     |
| 4690 | bash | WAITING | 1     |
| 4691 | bash | WAITING | 1     |
| 4692 | bash | WAITING | 1     |
| 4693 | bash | WAITING | 1     |
| 4694 | bash | WAITING | 1     |
| 4695 | bash | WAITING | 1     |
| 4696 | bash | WAITING | 1     |
| 4697 | bash | WAITING | 1     |
| 4698 | bash | WAITING | 1     |
| 4699 | bash | WAITING | 1     |
| 4700 | bash | WAITING | 1     |
| 4701 | bash | WAITING | 1     |
| 4702 | bash | WAITING | 1     |
| 4703 | bash | WAITING | 1     |
| 4704 | bash | WAITING | 1     |
| 4705 | bash | WAITING | 1     |
| 4706 | bash | WAITING | 1     |
| 4707 | bash | WAITING | 1     |
| 4708 | bash | WAITING | 1     |
| 4709 | bash | WAITING | 1     |
| 4710 | bash | WAITING | 1     |
+------+------+---------+-------+
There are 4710 jobs in total. Use `--all` to display all jobs.
  • the allocation queue is set with a backlog of 10 and it is running
[[email protected] ~]$ hq-v0.19.0-linux-x64/hq alloc list
+----+---------+--------------+-------------------+------------------+-----------+---------+------+----------------------------------+
| ID | State   | Backlog size | Workers per alloc | Max worker count | Timelimit | Manager | Name | Args                             |
+----+---------+--------------+-------------------+------------------+-----------+---------+------+----------------------------------+
| 1  | RUNNING | 10           | 1                 | 0                | 12h       | SLURM   |      | -AOPEN-29-6,-pqcpu_preempt,-c128 |
+----+---------+--------------+-------------------+------------------+-----------+---------+------+----------------------------------+
  • the debug log shows that HyperQueue cannot find the allocations it submitted:
[2024-10-01T06:07:58.724Z DEBUG hyperqueue::server::autoalloc::process] Attempt to refresh allocations of queue 1
[2024-10-01T06:07:58.724Z DEBUG hyperqueue::server::autoalloc::queue::slurm] Running Slurm command `scontrol show job 1739716`
[2024-10-01T06:07:58.734Z DEBUG hyperqueue::server::autoalloc::queue::slurm] Running Slurm command `scontrol show job 1739715`
[2024-10-01T06:07:58.743Z DEBUG hyperqueue::server::autoalloc::queue::slurm] Running Slurm command `scontrol show job 1739718`
[2024-10-01T06:07:58.753Z DEBUG hyperqueue::server::autoalloc::queue::slurm] Running Slurm command `scontrol show job 1739717`
[2024-10-01T06:07:58.762Z DEBUG hyperqueue::server::autoalloc::queue::slurm] Running Slurm command `scontrol show job 1739711`
[2024-10-01T06:07:58.771Z DEBUG hyperqueue::server::autoalloc::queue::slurm] Running Slurm command `scontrol show job 1739702`
[2024-10-01T06:07:58.779Z DEBUG hyperqueue::server::autoalloc::queue::slurm] Running Slurm command `scontrol show job 1739704`
[2024-10-01T06:07:58.787Z DEBUG hyperqueue::server::autoalloc::queue::slurm] Running Slurm command `scontrol show job 1739710`
[2024-10-01T06:07:58.796Z DEBUG hyperqueue::server::autoalloc::queue::slurm] Running Slurm command `scontrol show job 1739700`
[2024-10-01T06:07:58.804Z DEBUG hyperqueue::server::autoalloc::queue::slurm] Running Slurm command `scontrol show job 1739712`
[2024-10-01T06:07:58.812Z DEBUG hyperqueue::server::autoalloc::process] Allocations of 1 have been refreshed: Ok(Map({"1739710": Err(scontrol execution failed
    
    Caused by:
        Exit code: 1
        Stderr: slurm_load_jobs error: Invalid job id specified
        Stdout: ), "1739717": Err(scontrol execution failed
    
    Caused by:
        Exit code: 1
        Stderr: slurm_load_jobs error: Invalid job id specified
        Stdout: ), "1739711": Err(scontrol execution failed
    
    Caused by:
        Exit code: 1
        Stderr: slurm_load_jobs error: Invalid job id specified
        Stdout: ), "1739700": Err(scontrol execution failed
    
    Caused by:
        Exit code: 1
        Stderr: slurm_load_jobs error: Invalid job id specified
        Stdout: ), "1739715": Err(scontrol execution failed
    
    Caused by:
        Exit code: 1
        Stderr: slurm_load_jobs error: Invalid job id specified
        Stdout: ), "1739704": Err(scontrol execution failed
    
    Caused by:
        Exit code: 1
        Stderr: slurm_load_jobs error: Invalid job id specified
        Stdout: ), "1739718": Err(scontrol execution failed
    Caused by:
        Exit code: 1
        Stderr: slurm_load_jobs error: Invalid job id specified
        Stdout: ), "1739702": Err(scontrol execution failed
    
    Caused by:
        Exit code: 1
        Stderr: slurm_load_jobs error: Invalid job id specified
        Stdout: ), "1739712": Err(scontrol execution failed
    
    Caused by:
        Exit code: 1
        Stderr: slurm_load_jobs error: Invalid job id specified
        Stdout: ), "1739716": Err(scontrol execution failed
    
    Caused by:
        Exit code: 1
        Stderr: slurm_load_jobs error: Invalid job id specified
        Stdout: )}))
[2024-10-01T06:07:58.812Z WARN  hyperqueue::server::autoalloc::process] Could not get status of allocation 1739716: scontrol execution failed
    
    Caused by:
        Exit code: 1
        Stderr: slurm_load_jobs error: Invalid job id specified
        Stdout: 
[2024-10-01T06:07:58.812Z WARN  hyperqueue::server::autoalloc::process] Could not get status of allocation 1739715: scontrol execution failed
    
    Caused by:
        Exit code: 1
        Stderr: slurm_load_jobs error: Invalid job id specified
        Stdout: 
[2024-10-01T06:07:58.812Z WARN  hyperqueue::server::autoalloc::process] Could not get status of allocation 1739718: scontrol execution failed
    
    Caused by:
        Exit code: 1
        Stderr: slurm_load_jobs error: Invalid job id specified
        Stdout: 
[2024-10-01T06:07:58.812Z WARN  hyperqueue::server::autoalloc::process] Could not get status of allocation 1739717: scontrol execution failed
    
    Caused by:
        Exit code: 1
        Stderr: slurm_load_jobs error: Invalid job id specified
        Stdout: 
[2024-10-01T06:07:58.812Z WARN  hyperqueue::server::autoalloc::process] Could not get status of allocation 1739711: scontrol execution failed
 
    Caused by:
        Exit code: 1
        Stderr: slurm_load_jobs error: Invalid job id specified
        Stdout: 
[2024-10-01T06:07:58.812Z WARN  hyperqueue::server::autoalloc::process] Could not get status of allocation 1739702: scontrol execution failed
    
    Caused by:
        Exit code: 1
        Stderr: slurm_load_jobs error: Invalid job id specified
        Stdout: 
[2024-10-01T06:07:58.812Z WARN  hyperqueue::server::autoalloc::process] Could not get status of allocation 1739704: scontrol execution failed
    
    Caused by:
        Exit code: 1
        Stderr: slurm_load_jobs error: Invalid job id specified
        Stdout: 
[2024-10-01T06:07:58.812Z WARN  hyperqueue::server::autoalloc::process] Could not get status of allocation 1739710: scontrol execution failed
    
    Caused by:
        Exit code: 1
        Stderr: slurm_load_jobs error: Invalid job id specified
        Stdout: 
[2024-10-01T06:07:58.812Z WARN  hyperqueue::server::autoalloc::process] Could not get status of allocation 1739700: scontrol execution failed
    
    Caused by:
        Exit code: 1
        Stderr: slurm_load_jobs error: Invalid job id specified
        Stdout: 
[2024-10-01T06:07:58.812Z WARN  hyperqueue::server::autoalloc::process] Could not get status of allocation 1739712: scontrol execution failed
    
    Caused by:
        Exit code: 1
        Stderr: slurm_load_jobs error: Invalid job id specified
        Stdout: 
[2024-10-01T06:07:58.812Z DEBUG hyperqueue::server::autoalloc::process] Task state: ServerTaskState { jobs: Map({JobId(4663): 1, JobId(4596): 1, JobId(4657): 1, JobId(4651): 1, JobId(4584): 1, JobId(4645): 1, JobId(4706): 1, JobId(4639): 1, JobId(4700): 1, JobId(4694): 1, JobId(4688): 1, JobId(4682): 1, JobId(4609): 1, JobId(4670): 1, JobId(4664): 1, JobId(4597): 1, JobId(4652): 1, JobId(4707): 1, JobId(4640): 1, JobId(4573): 1, JobId(4634): 1, JobId(4567): 1, JobId(4701): 1, JobId(4695): 1, JobId(4689): 1, JobId(4683): 1, JobId(4677): 1, JobId(4659): 1, JobId(4653): 1, JobId(4708): 1, JobId(4641): 1, JobId(4702): 1, JobId(4635): 1, JobId(4696): 1, JobId(4690): 1, JobId(4684): 1, JobId(4666): 1, JobId(4660): 1, JobId(4654): 1, JobId(4648): 1, JobId(4709): 1, JobId(4642): 1, JobId(4703): 1, JobId(4636): 1, JobId(4697): 1, JobId(4691): 1, JobId(4685): 1, JobId(4618): 1, JobId(4679): 1, JobId(4600): 1, JobId(4655): 1, JobId(4710): 1, JobId(4704): 1, JobId(4637): 1, JobId(4698): 1, JobId(4692): 1, JobId(4680): 1, JobId(4674): 1, JobId(4668): 1, JobId(4656): 1, JobId(4644): 1, JobId(4705): 1, JobId(4638): 1, JobId(4699): 1, JobId(4565): 1, JobId(4626): 1, JobId(4693): 1}) }, max. workers to spawn: 18446744073709551615
  • checking the details of one job (1739716), I can see that it failed quickly:
[[email protected] ~]$ sacct -j 1739716 --format=User,JobID,Jobname,partition,state,time,start,end,elapsed,MaxRss,MaxVMSize,Reqmem,nnodes,ncpus,nodelist
     User JobID           JobName  Partition      State  Timelimit               Start                 End    Elapsed     MaxRSS  MaxVMSize     ReqMem   NNodes      NCPUS        NodeList 
--------- ------------ ---------- ---------- ---------- ---------- ------------------- ------------------- ---------- ---------- ---------- ---------- -------- ---------- --------------- 
  svatosm 1739716       hq-1-2221 qcpu_pree+     FAILED   12:00:00 2024-09-30T07:10:19 2024-09-30T07:10:20   00:00:01                             230G        1        128           cn123 
          1739716.bat+      batch                FAILED            2024-09-30T07:10:19 2024-09-30T07:10:20   00:00:01          0      2136K                   1        128           cn123 
          1739716.ext+     extern             COMPLETED            2024-09-30T07:10:19 2024-09-30T07:10:20   00:00:01          0        52K                   1        128           cn123
  • According to stderr, the job failed so quickly because of Permission denied errors on /tmp (i.e. something wrong with the node). But this should not block submission forever. Creating a new allocation queue (roughly as sketched below) makes submission start instantly.
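
For reference, queue 1 above could presumably be re-created with a command along these lines (reconstructed from the hq alloc list output; the exact original invocation is an assumption):

# Reconstructed from `hq alloc list`; flags follow the `hq alloc add slurm` interface, trailing args are passed to sbatch
hq alloc add slurm --backlog 10 --workers-per-alloc 1 --time-limit 12h -- -AOPEN-29-6 -pqcpu_preempt -c128
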
@Kobzol
Collaborator

Kobzol commented Oct 9, 2024

Hi, sorry for the late reply, I'm quite busy at the moment. The automatic allocator has an internal rate limiter that can cause allocations to be stopped if too many failures occur. I'm not sure if that's what is causing this issue though. Are you still using HQ_AUTOALLOC_MAX_ALLOCATION_FAILS?

@svatosFZU
Author

Yes, the HQ server is started with:

RUST_LOG=hyperqueue=debug HQ_AUTOALLOC_MAX_ALLOCATION_FAILS=100000 /home/svatosm/hq-v0.19.0-linux-x64/hq server start --journal /home/svatosm/hqJournal 2> /home/svatosm/hq-debug-output.log &

@Kobzol
Collaborator

Kobzol commented Oct 17, 2024

Could you please send me the full log?

@svatosFZU
Author

Sure, the log from that time period is at https://www.fzu.cz/~svatosm/hq-debug-output.log

@Kobzol
Collaborator

Kobzol commented Oct 19, 2024

Thank you. It indeed looks like at least one of the issues is rate limiting. In addition to the crash count, there is another rate limiter that kicks in after too many allocation or submission failures occur in a row. This rate limiter then does not allow further submissions until one hour has elapsed. This is clearly not ideal for the case of pre-emptible allocations.
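
(For context, the recent allocation failures that feed this limiter can presumably be inspected per queue, e.g.:

hq alloc info 1   # list the allocations of queue 1, including the failed ones

and, as observed above, creating a fresh queue with `hq alloc add` appears to work around the block immediately.)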

One additional issue that I noticed in the log is that some of your tasks are hitting the crash limit. This limit is used to prevent re-running tasks that crash their workers too many times. In your case, if workers fail often, the crash counters of tasks increase quite a lot, which can then cause these tasks to fail.

Could you try setting the crash limit of your tasks to 0, to see if it helps with the issue?
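
For example, a submission with the crash limit disabled might look like this (./job.sh is just a placeholder script; 0 means no limit):

hq submit --crash-limit=0 -- bash ./job.sh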

@svatosFZU
Author

Hi,
the second limiter is fine (at least for me) - one hour is no problem. The situation I have observed is that nothing is submitted at all for days.
For the crash limit, all my jobs are submitted with --crash-limit=1 because re-running jobs causes us problems (data corruption, etc.). So, I would like to avoid disabling this limit, since for us it is one of the main advantages of running jobs in HyperQueue.

@spirali
Collaborator

spirali commented Oct 20, 2024

Btw: v0.20.0 does not increase the crash counter when a worker is explicitly stopped via "hq worker stop", so a task may be restarted even with --crash-limit=1, and we have also been thinking about introducing other situations in which the crash counter is not increased. Would it be useful for you to have a way to ensure that no restart occurs for whatever reason?

@svatosFZU
Author

Well, I need to ensure that jobs run only once. The crash limit is the only way I know how to do that, but if there is another way, I could switch to it.

@spirali
Collaborator

spirali commented Oct 21, 2024

The crash counter is currently the only way, and if you are not using hq worker stop then it is also the correct one.
However, our idea is that the crash counter is designed for detecting tasks that crash workers, and it may get slightly different semantics in the future. So I am collecting use cases, and in the future we will probably introduce something like --no-restart to cover your use case. Thank you, and sorry for hijacking the discussion.
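
Purely as an illustration of the idea (this flag does not exist in any released version), such a submission might hypothetically look like:

hq submit --no-restart -- bash ./job.sh   # hypothetical flag and placeholder script, not yet implemented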

@Kobzol
Collaborator

Kobzol commented Oct 21, 2024

Could you please specify the time period during which the allocations weren't being created? I tried going through the log, and it seems that allocations were being submitted at a steady rate, e.g.:

[2024-10-03T11:42:27.541Z DEBUG hyperqueue::server::autoalloc::queue::common] Running command `sbatch /home/svatosm/.hq-server/001/autoalloc/2-praguelcg2/1363/hq-submit.sh`
[2024-10-03T11:42:27.553Z DEBUG hyperqueue::server::autoalloc::queue::common] Running command `sbatch /home/svatosm/.hq-server/001/autoalloc/2-praguelcg2/1364/hq-submit.sh`
[2024-10-03T11:42:27.817Z DEBUG hyperqueue::server::autoalloc::queue::common] Running command `sbatch /home/svatosm/.hq-server/001/autoalloc/2-praguelcg2/1365/hq-submit.sh`
[2024-10-03T11:42:27.830Z DEBUG hyperqueue::server::autoalloc::queue::common] Running command `sbatch /home/svatosm/.hq-server/001/autoalloc/2-praguelcg2/1366/hq-submit.sh`

It also seems that new workers were being added and tasks were being started periodically.

@svatosFZU
Author

Well, when I submitted this ticket, I was referring to the time period between roughly 3:00 and 9:00 (when I created a new allocation queue) on 1.10.2024.

@svatosFZU
Author

svatosFZU commented Nov 8, 2024

If it helps, I can provide another example. In this log:

https://www.fzu.cz/~svatosm/hq-debug-output.log.1

HQ has not submitted any batch jobs since the middle of October.
