HQ crashes #718

Open
com-data opened this issue Jun 30, 2024 · 3 comments


@com-data

Hello,

First, thank you for developing HQ.

I recently came across a crash while testing HQ v0.19.0. The script that I submit to Slurm is as follows:

hq server start &
until hq job list &>/dev/null ; do sleep 1 ; done

srun --overlap --cpu-bind=none --mpi=none hq worker start --heartbeat 10min \
    --manager slurm \
    --on-server-lost finish-running \
    --cpus="$SLURM_CPUS_PER_TASK" &

hq worker wait "$SLURM_NTASKS"

hq submit --wait --stderr=none --stdout=none sleep 259000

The error message is pasted below. Interestingly, it appears in only some of the otherwise identical runs I tested. Please let me know if I should provide more information for debugging.

thread 'main' panicked at crates/hyperqueue/src/worker/bootstrap.rs:180:10:
Could not get remaining time from scontrol: Unexpected token found while attempting to parse number, expected something else:
  INVALID
  |
  --- Unexpected token `I`

stack backtrace:
   0:     0x564bb4e37bf9 - std::backtrace_rs::backtrace::libunwind::trace::hbee8a7973eeb6c93
                               at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/std/src/../../backtrace/src/backtrace/libunwind.rs:104:5
   1:     0x564bb4e37bf9 - std::backtrace_rs::backtrace::trace_unsynchronized::hc8ac75eea3aa6899
                               at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/std/src/../../backtrace/src/backtrace/mod.rs:66:5
   2:     0x564bb4e37bf9 - std::sys_common::backtrace::_print_fmt::hc7f3e3b5298b1083
                               at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/std/src/sys_common/backtrace.rs:68:5
   3:     0x564bb4e37bf9 - <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt::hbb235daedd7c6190
                               at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/std/src/sys_common/backtrace.rs:44:22
   4:     0x564bb4b82b60 - core::fmt::rt::Argument::fmt::h76c38a80d925a410
                               at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/core/src/fmt/rt.rs:142:9
   5:     0x564bb4b82b60 - core::fmt::write::h3ed6aeaa977c8e45
                               at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/core/src/fmt/mod.rs:1120:17
   6:     0x564bb4e0087e - std::io::Write::write_fmt::h78b18af5775fedb5
                               at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/std/src/io/mod.rs:1810:15
   7:     0x564bb4e39c2e - std::sys_common::backtrace::_print::h5d645a07e0fcfdbb
                               at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/std/src/sys_common/backtrace.rs:47:5
   8:     0x564bb4e39c2e - std::sys_common::backtrace::print::h85035a511aafe7a8
                               at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/std/src/sys_common/backtrace.rs:34:9
   9:     0x564bb4e394d7 - std::panicking::default_hook::{{closure}}::hcce8cea212785a25
  10:     0x564bb4e390bf - std::panicking::default_hook::hf5fcb0f213fe709a
                               at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/std/src/panicking.rs:292:9
  11:     0x564bb4ae6eeb - call<(&core::panic::panic_info::PanicInfo), (dyn core::ops::function::Fn<(&core::panic::panic_info::PanicInfo), Output=()> + core::marker::Send + core::marker::Sync), alloc::alloc::Global>
                               at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/alloc/src/boxed.rs:2029:9
  12:     0x564bb4ae6eeb - {closure#0}
                               at /__w/hyperqueue/hyperqueue/crates/hyperqueue/src/bin/hq.rs:360:9
  13:     0x564bb4e3a21a - <alloc::boxed::Box<F,A> as core::ops::function::Fn<Args>>::call::hbc5ccf4eb663e1e5
                               at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/alloc/src/boxed.rs:2029:9
  14:     0x564bb4e3a21a - std::panicking::rust_panic_with_hook::h095fccf1dc9379ee
                               at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/std/src/panicking.rs:783:13
  15:     0x564bb4e39fa0 - std::panicking::begin_panic_handler::{{closure}}::h032ba12139b353db
                               at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/std/src/panicking.rs:657:13
  16:     0x564bb4e39ef6 - std::sys_common::backtrace::__rust_end_short_backtrace::h9259bc2ff8fd0f76
                               at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/std/src/sys_common/backtrace.rs:171:18
  17:     0x564bb4e39eef - rust_begin_unwind
                               at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/std/src/panicking.rs:645:5
  18:     0x564bb49c1074 - core::panicking::panic_fmt::h784f20a50eaab275
                               at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/core/src/panicking.rs:72:14
  19:     0x564bb49c15e2 - core::result::unwrap_failed::h03d8a5018196e1cd
                               at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/core/src/result.rs:1649:5
  20:     0x564bb4c21262 - expect<core::time::Duration, anyhow::Error>
                               at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/core/src/result.rs:1030:23
  21:     0x564bb4c21262 - try_get_slurm_info
                               at /__w/hyperqueue/hyperqueue/crates/hyperqueue/src/worker/bootstrap.rs:179:20
  22:     0x564bb4ae30d9 - gather_manager_info
                               at /__w/hyperqueue/hyperqueue/crates/hyperqueue/src/client/commands/worker.rs:315:39
  23:     0x564bb4ae30d9 - gather_configuration
                               at /__w/hyperqueue/hyperqueue/crates/hyperqueue/src/client/commands/worker.rs:255:24
  24:     0x564bb4ae30d9 - {async_fn#0}
                               at /__w/hyperqueue/hyperqueue/crates/hyperqueue/src/client/commands/worker.rs:145:29
  25:     0x564bb4ae30d9 - {async_fn#0}
                               at /__w/hyperqueue/hyperqueue/crates/hyperqueue/src/bin/hq.rs:189:38
  26:     0x564bb4ae30d9 - {async_block#0}
                               at /__w/hyperqueue/hyperqueue/crates/hyperqueue/src/bin/hq.rs:389:54
  27:     0x564bb4acc83d - poll<&mut hq::main::{async_block_env#0}>
                               at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/core/src/future/future.rs:124:9
  28:     0x564bb4acc83d - {closure#0}<core::pin::Pin<&mut hq::main::{async_block_env#0}>>
                               at /github/home/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.36.0/src/runtime/scheduler/current_thread/mod.rs:659:57
  29:     0x564bb4acc83d - with_budget<core::task::poll::Poll<core::result::Result<(), hyperqueue::common::error::HqError>>, tokio::runtime::scheduler::current_thread::{impl#8}::block_on::{closure#0}::{closure#0}::{closure_env#0}<core::pin::Pin<&mut hq::main::{async_block_env#0}>>>
                               at /github/home/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.36.0/src/runtime/coop.rs:107:5
  30:     0x564bb4acc83d - budget<core::task::poll::Poll<core::result::Result<(), hyperqueue::common::error::HqError>>, tokio::runtime::scheduler::current_thread::{impl#8}::block_on::{closure#0}::{closure#0}::{closure_env#0}<core::pin::Pin<&mut hq::main::{async_block_env#0}>>>
                               at /github/home/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.36.0/src/runtime/coop.rs:73:5
  31:     0x564bb4acc83d - {closure#0}<core::pin::Pin<&mut hq::main::{async_block_env#0}>>
                               at /github/home/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.36.0/src/runtime/scheduler/current_thread/mod.rs:659:25
  32:     0x564bb4acc83d - enter<core::task::poll::Poll<core::result::Result<(), hyperqueue::common::error::HqError>>, tokio::runtime::scheduler::current_thread::{impl#8}::block_on::{closure#0}::{closure_env#0}<core::pin::Pin<&mut hq::main::{async_block_env#0}>>>
                               at /github/home/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.36.0/src/runtime/scheduler/current_thread/mod.rs:404:19
  33:     0x564bb4acc83d - {closure#0}<core::pin::Pin<&mut hq::main::{async_block_env#0}>>
                               at /github/home/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.36.0/src/runtime/scheduler/current_thread/mod.rs:658:36
  34:     0x564bb4acc83d - {closure#0}<tokio::runtime::scheduler::current_thread::{impl#8}::block_on::{closure_env#0}<core::pin::Pin<&mut hq::main::{async_block_env#0}>>, core::option::Option<core::result::Result<(), hyperqueue::common::error::HqError>>>
                               at /github/home/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.36.0/src/runtime/scheduler/current_thread/mod.rs:737:68
  35:     0x564bb4acc83d - set<tokio::runtime::scheduler::Context, tokio::runtime::scheduler::current_thread::{impl#8}::enter::{closure_env#0}<tokio::runtime::scheduler::current_thread::{impl#8}::block_on::{closure_env#0}<core::pin::Pin<&mut hq::main::{async_block_env#0}>>, core::option::Option<core::result::Result<(), hyperqueue::common::error::HqError>>>, (alloc::boxed::Box<tokio::runtime::scheduler::current_thread::Core, alloc::alloc::Global>, core::option::Option<core::result::Result<(), hyperqueue::common::error::HqError>>)>
                               at /github/home/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.36.0/src/runtime/context/scoped.rs:40:9
  36:     0x564bb4acc83d - {closure#0}<(alloc::boxed::Box<tokio::runtime::scheduler::current_thread::Core, alloc::alloc::Global>, core::option::Option<core::result::Result<(), hyperqueue::common::error::HqError>>), tokio::runtime::scheduler::current_thread::{impl#8}::enter::{closure_env#0}<tokio::runtime::scheduler::current_thread::{impl#8}::block_on::{closure_env#0}<core::pin::Pin<&mut hq::main::{async_block_env#0}>>, core::option::Option<core::result::Result<(), hyperqueue::common::error::HqError>>>>
                               at /github/home/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.36.0/src/runtime/context.rs:176:26
  37:     0x564bb4acc83d - try_with<tokio::runtime::context::Context, tokio::runtime::context::set_scheduler::{closure_env#0}<(alloc::boxed::Box<tokio::runtime::scheduler::current_thread::Core, alloc::alloc::Global>, core::option::Option<core::result::Result<(), hyperqueue::common::error::HqError>>), tokio::runtime::scheduler::current_thread::{impl#8}::enter::{closure_env#0}<tokio::runtime::scheduler::current_thread::{impl#8}::block_on::{closure_env#0}<core::pin::Pin<&mut hq::main::{async_block_env#0}>>, core::option::Option<core::result::Result<(), hyperqueue::common::error::HqError>>>>, (alloc::boxed::Box<tokio::runtime::scheduler::current_thread::Core, alloc::alloc::Global>, core::option::Option<core::result::Result<(), hyperqueue::common::error::HqError>>)>
                               at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/std/src/thread/local.rs:270:16
  38:     0x564bb4acc83d - with<tokio::runtime::context::Context, tokio::runtime::context::set_scheduler::{closure_env#0}<(alloc::boxed::Box<tokio::runtime::scheduler::current_thread::Core, alloc::alloc::Global>, core::option::Option<core::result::Result<(), hyperqueue::common::error::HqError>>), tokio::runtime::scheduler::current_thread::{impl#8}::enter::{closure_env#0}<tokio::runtime::scheduler::current_thread::{impl#8}::block_on::{closure_env#0}<core::pin::Pin<&mut hq::main::{async_block_env#0}>>, core::option::Option<core::result::Result<(), hyperqueue::common::error::HqError>>>>, (alloc::boxed::Box<tokio::runtime::scheduler::current_thread::Core, alloc::alloc::Global>, core::option::Option<core::result::Result<(), hyperqueue::common::error::HqError>>)>
                               at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/std/src/thread/local.rs:246:9
  39:     0x564bb4acc83d - set_scheduler<(alloc::boxed::Box<tokio::runtime::scheduler::current_thread::Core, alloc::alloc::Global>, core::option::Option<core::result::Result<(), hyperqueue::common::error::HqError>>), tokio::runtime::scheduler::current_thread::{impl#8}::enter::{closure_env#0}<tokio::runtime::scheduler::current_thread::{impl#8}::block_on::{closure_env#0}<core::pin::Pin<&mut hq::main::{async_block_env#0}>>, core::option::Option<core::result::Result<(), hyperqueue::common::error::HqError>>>>
                               at /github/home/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.36.0/src/runtime/context.rs:176:17
  40:     0x564bb4acc83d - enter<tokio::runtime::scheduler::current_thread::{impl#8}::block_on::{closure_env#0}<core::pin::Pin<&mut hq::main::{async_block_env#0}>>, core::option::Option<core::result::Result<(), hyperqueue::common::error::HqError>>>
                               at /github/home/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.36.0/src/runtime/scheduler/current_thread/mod.rs:737:27
  41:     0x564bb4acc83d - block_on<core::pin::Pin<&mut hq::main::{async_block_env#0}>>
                               at /github/home/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.36.0/src/runtime/scheduler/current_thread/mod.rs:646:19
  42:     0x564bb4acc83d - {closure#0}<hq::main::{async_block_env#0}>
                               at /github/home/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.36.0/src/runtime/scheduler/current_thread/mod.rs:175:28
  43:     0x564bb4acc83d - enter_runtime<tokio::runtime::scheduler::current_thread::{impl#0}::block_on::{closure_env#0}<hq::main::{async_block_env#0}>, core::result::Result<(), hyperqueue::common::error::HqError>>
                               at /github/home/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.36.0/src/runtime/context/runtime.rs:65:16
  44:     0x564bb4acc83d - block_on<hq::main::{async_block_env#0}>
                               at /github/home/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.36.0/src/runtime/scheduler/current_thread/mod.rs:167:9
  45:     0x564bb4acc83d - block_on<hq::main::{async_block_env#0}>
                               at /github/home/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.36.0/src/runtime/runtime.rs:348:47
  46:     0x564bb4acc83d - main
                               at /__w/hyperqueue/hyperqueue/crates/hyperqueue/src/bin/hq.rs:456:5
  47:     0x564bb4a58203 - call_once<fn() -> core::result::Result<(), hyperqueue::common::error::HqError>, ()>
                               at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/core/src/ops/function.rs:250:5
  48:     0x564bb4a58203 - __rust_begin_short_backtrace<fn() -> core::result::Result<(), hyperqueue::common::error::HqError>, core::result::Result<(), hyperqueue::common::error::HqError>>
                               at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/std/src/sys_common/backtrace.rs:155:18
  49:     0x564bb4ae7500 - main
  50:     0x14ec8023feb0 - __libc_start_call_main
  51:     0x14ec8023ff60 - __libc_start_main@GLIBC_2.2.5
  52:     0x564bb4a01049 - <unknown>
  53:                0x0 - <unknown>
@Kobzol
Collaborator

Kobzol commented Jun 30, 2024

Hi! Thanks for the report. Here is what is happening:

  1. You start an HQ worker.
  2. You specify that the allocation manager should be Slurm, so the worker tries to detect some Slurm values from the environment.
  3. It asks Slurm for its remaining time limit via scontrol show job <SLURM_JOB_ID>. Unexpectedly, Slurm returns INVALID, which (obviously) cannot be parsed as a duration. Not sure why Slurm returns this 🤷‍♂️
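
To see what the worker ran into, the time fields can be pulled out of the scontrol output by hand. This is only a sketch: it assumes the fields of interest are named RunTime and TimeLimit, as they appear in typical scontrol show job output (which fields HQ actually reads is not stated in this thread), and extract_time_fields and has_invalid_time are hypothetical helper names.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the RunTime=... and TimeLimit=... tokens
# found in `scontrol show job` output read from stdin, one per line.
extract_time_fields() {
    grep -oE '(RunTime|TimeLimit)=[^[:space:]]+'
}

# Hypothetical helper: succeed if any of those fields has the value INVALID.
has_invalid_time() {
    extract_time_fields | grep -q '=INVALID$'
}

# Inside an allocation (not run here):
#   scontrol show job "$SLURM_JOB_ID" | extract_time_fields
#   scontrol show job "$SLURM_JOB_ID" | has_invalid_time && echo "Slurm reports INVALID"
```

Since the crash shows up only in some runs, logging this output right before the worker starts may help show when Slurm reports INVALID.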

Arguably, HQ should skip reading these values instead of crashing here. On the other hand, if you specify that you want to use Slurm and the remaining time limit cannot be found, the worker would start without a time limit, which could be annoying in some cases. Crashing at least tells you that something went wrong.

As a hotfix, you can try --manager none.
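
Applied to the submission script from the report, the hotfix changes only the --manager value. With --manager none the worker does not ask Slurm for anything, so it also will not pick up the allocation's remaining time. A minimal sketch follows; the worker_args array name and the fallback of 1 CPU outside Slurm are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Same worker options as in the report, but with manager detection disabled.
# The ":-1" fallback is an assumption so the snippet also expands outside Slurm.
worker_args=(
    worker start
    --heartbeat 10min
    --manager none
    --on-server-lost finish-running
    --cpus="${SLURM_CPUS_PER_TASK:-1}"
)

# Inside the allocation (not run here):
#   srun --overlap --cpu-bind=none --mpi=none hq "${worker_args[@]}" &
```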

As a separate question, do you have a specific reason for running the HQ server within a Slurm allocation? It's a valid use case, but normally you can also run it on login nodes, which should be more ergonomic.

@com-data
Author

Thank you so much for your help. It now works without losing workers. Starting the HQ server on a login node would be a good choice, but when I do that and try to start workers from Slurm jobs, I get an error along the lines of "access token found but server is not reachable". This makes me think that in this setup the HQ server and the workers cannot communicate. Judging from the HQ documentation, fixing such communication problems may not be possible for a user without admin rights.

Thank you again for designing and improving HQ.

@Kobzol
Collaborator

Kobzol commented Jun 30, 2024

I see. On most HPC clusters where we have tried it, login and compute nodes can communicate without problems. But if this is not possible on your cluster, then you'll indeed need to run the server inside allocations.
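
Whether a compute node can reach the login node at all can be checked with a plain TCP connection attempt, independently of HQ. The sketch below uses bash's /dev/tcp redirection and coreutils timeout; can_reach is a hypothetical helper, and the host name and port in the usage comment are placeholders rather than values from this thread.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if a TCP connection to host:port
# can be opened within 3 seconds, fail otherwise.
can_reach() {
    local host=$1 port=$2
    # In the child shell, $0 is the host and $1 is the port.
    timeout 3 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null
}

# On a compute node, with the server's real host name and port substituted
# (login1 and 12345 are placeholders):
#   can_reach login1 12345 && echo "reachable" || echo "not reachable"
```

If this fails from inside a Slurm job but succeeds on the login node itself, the problem is most likely in the cluster's network setup rather than in HQ.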
