[Bug] crash about `c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout` #1693

zeng-zc · 2024-10-17T03:16:49Z

Checklist

1. I have searched related issues but cannot get the expected help.
2. The bug has not been fixed in the latest version.
3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose Otherwise, it will be closed.
5. Please use English, otherwise it will be closed.

Describe the bug

When I use v0.3.3.post1-cu121-srt docker image to deploy qwen2-72B model, it crashes after a few requests served. I think the stability is still a serious problem. This crash also occurs on v0.3.0 version, also both on A100 and A800 hardware.

Reproduction

server side:

python3 -m sglang.launch_server --model-pathQwen/Qwen2-72B-Instruct \
        --port 8000 --host 0.0.0.0 --tp-size 8 \
        --show-time-cost --mem-fraction-static 0.8 \
        --disable-cuda-graph --disable-flashinfer-sampling

After I do some load testing with heavy requests, it crashes.
The screen logs:

INFO:     10.119.49.166:58178 - "POST /v1/completions HTTP/1.1" 200 OK
INFO:     10.119.49.166:58188 - "POST /v1/completions HTTP/1.1" 200 OK
INFO:     10.119.49.166:58198 - "POST /v1/completions HTTP/1.1" 200 OK
INFO:     10.119.49.166:58208 - "POST /v1/completions HTTP/1.1" 200 OK
INFO:     10.119.49.166:58224 - "POST /v1/completions HTTP/1.1" 200 OK
INFO:     10.119.49.166:58226 - "POST /v1/completions HTTP/1.1" 200 OK
INFO:     10.119.49.166:58234 - "POST /v1/completions HTTP/1.1" 200 OK
[rank2]:[E1016 03:31:38.859877994 ProcessGroupNCCL.cpp:607] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=9141, OpType=_ALLGATHER_BASE, NumelIn=19008, NumelOut=152064, Timeout(ms)=600000) ran for 600004 milliseconds before timing out.
[rank2]:[E1016 03:31:38.860643459 ProcessGroupNCCL.cpp:1664] [PG 3 Rank 2] Exception (either an error or timeout) detected by watchdog at work: 9141, last enqueued NCCL work: 9141, last completed NCCL work: 9140.
[rank2]:[E1016 03:31:38.860656122 ProcessGroupNCCL.cpp:1709] [PG 3 Rank 2] Timeout at NCCL work: 9141, last enqueued NCCL work: 9141, last completed NCCL work: 9140.
[rank2]:[E1016 03:31:38.860663442 ProcessGroupNCCL.cpp:621] [Rank 2] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank2]:[E1016 03:31:38.860669714 ProcessGroupNCCL.cpp:627] [Rank 2] To avoid data inconsistency, we are taking the entire process down.
[rank6]:[E1016 03:31:38.860716229 ProcessGroupNCCL.cpp:607] [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=9141, OpType=_ALLGATHER_BASE, NumelIn=19008, NumelOut=152064, Timeout(ms)=600000) ran for 600004 milliseconds before timing out.
[rank6]:[E1016 03:31:38.861305629 ProcessGroupNCCL.cpp:1664] [PG 3 Rank 6] Exception (either an error or timeout) detected by watchdog at work: 9141, last enqueued NCCL work: 9141, last completed NCCL work: 9140.
[rank6]:[E1016 03:31:38.861316932 ProcessGroupNCCL.cpp:1709] [PG 3 Rank 6] Timeout at NCCL work: 9141, last enqueued NCCL work: 9141, last completed NCCL work: 9140.
[rank6]:[E1016 03:31:38.861323525 ProcessGroupNCCL.cpp:621] [Rank 6] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank6]:[E1016 03:31:38.861328503 ProcessGroupNCCL.cpp:627] [Rank 6] To avoid data inconsistency, we are taking the entire process down.
[rank2]:[E1016 03:31:38.862129042 ProcessGroupNCCL.cpp:1515] [PG 3 Rank 2] Process group watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=9141, OpType=_ALLGATHER_BASE, NumelIn=19008, NumelOut=152064, Timeout(ms)=600000) ran for 600004 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:609 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f3b6d41df86 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f3b6e71a8d2 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7f3b6e721313 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f3b6e7236fc in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xd6df4 (0x7f3bca9d5df4 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: <unknown function> + 0x8609 (0x7f3c74f76609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7f3c750b0353 in /usr/lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
  what():  [PG 3 Rank 2] Process group watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=9141, OpType=_ALLGATHER_BASE, NumelIn=19008, NumelOut=152064, Timeout(ms)=600000) ran for 600004 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:609 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f3b6d41df86 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f3b6e71a8d2 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7f3b6e721313 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f3b6e7236fc in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xd6df4 (0x7f3bca9d5df4 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: <unknown function> + 0x8609 (0x7f3c74f76609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7f3c750b0353 in /usr/lib/x86_64-linux-gnu/libc.so.6)

Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1521 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f3b6d41df86 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xe5aa84 (0x7f3b6e3aca84 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0xd6df4 (0x7f3bca9d5df4 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #3: <unknown function> + 0x8609 (0x7f3c74f76609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
frame #4: clone + 0x43 (0x7f3c750b0353 in /usr/lib/x86_64-linux-gnu/libc.so.6)

[rank6]:[E1016 03:31:38.862681999 ProcessGroupNCCL.cpp:1515] [PG 3 Rank 6] Process group watchdog thread terminated with exception: [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=9141, OpType=_ALLGATHER_BASE, NumelIn=19008, NumelOut=152064, Timeout(ms)=600000) ran for 600004 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:609 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f3b6d41df86 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f3b6e71a8d2 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7f3b6e721313 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f3b6e7236fc in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xd6df4 (0x7f3bca9d5df4 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: <unknown function> + 0x8609 (0x7f3c74f76609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7f3c750b0353 in /usr/lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
  what():  [PG 3 Rank 6] Process group watchdog thread terminated with exception: [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=9141, OpType=_ALLGATHER_BASE, NumelIn=19008, NumelOut=152064, Timeout(ms)=600000) ran for 600004 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:609 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f3b6d41df86 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f3b6e71a8d2 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7f3b6e721313 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f3b6e7236fc in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xd6df4 (0x7f3bca9d5df4 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: <unknown function> + 0x8609 (0x7f3c74f76609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7f3c750b0353 in /usr/lib/x86_64-linux-gnu/libc.so.6)

Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1521 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f3b6d41df86 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xe5aa84 (0x7f3b6e3aca84 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0xd6df4 (0x7f3bca9d5df4 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #3: <unknown function> + 0x8609 (0x7f3c74f76609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
frame #4: clone + 0x43 (0x7f3c750b0353 in /usr/lib/x86_64-linux-gnu/libc.so.6)

[rank0]:[E1016 03:31:38.876550536 ProcessGroupNCCL.cpp:607] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=9141, OpType=_ALLGATHER_BASE, NumelIn=19008, NumelOut=152064, Timeout(ms)=600000) ran for 600022 milliseconds before timing out.
[rank0]:[E1016 03:31:38.877020212 ProcessGroupNCCL.cpp:1664] [PG 3 Rank 0] Exception (either an error or timeout) detected by watchdog at work: 9141, last enqueued NCCL work: 9141, last completed NCCL work: 9140.
[rank0]:[E1016 03:31:38.877030881 ProcessGroupNCCL.cpp:1709] [PG 3 Rank 0] Timeout at NCCL work: 9141, last enqueued NCCL work: 9141, last completed NCCL work: 9140.
[rank0]:[E1016 03:31:38.877036871 ProcessGroupNCCL.cpp:621] [Rank 0] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank0]:[E1016 03:31:38.877040580 ProcessGroupNCCL.cpp:627] [Rank 0] To avoid data inconsistency, we are taking the entire process down.
[rank5]:[E1016 03:31:38.877368433 ProcessGroupNCCL.cpp:607] [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=9141, OpType=_ALLGATHER_BASE, NumelIn=19008, NumelOut=152064, Timeout(ms)=600000) ran for 600020 milliseconds before timing out.
[rank0]:[E1016 03:31:38.878382438 ProcessGroupNCCL.cpp:1515] [PG 3 Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=9141, OpType=_ALLGATHER_BASE, NumelIn=19008, NumelOut=152064, Timeout(ms)=600000) ran for 600022 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:609 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f3b6d41df86 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f3b6e71a8d2 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7f3b6e721313 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f3b6e7236fc in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xd6df4 (0x7f3bca9d5df4 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: <unknown function> + 0x8609 (0x7f3c74f76609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7f3c750b0353 in /usr/lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
[rank5]:[E1016 03:31:38.878440675 ProcessGroupNCCL.cpp:1664] [PG 3 Rank 5] Exception (either an error or timeout) detected by watchdog at work: 9141, last enqueued NCCL work: 9141, last completed NCCL work: 9140.
[rank5]:[E1016 03:31:38.878455740 ProcessGroupNCCL.cpp:1709] [PG 3 Rank 5] Timeout at NCCL work: 9141, last enqueued NCCL work: 9141, last completed NCCL work: 9140.
[rank5]:[E1016 03:31:38.878464399 ProcessGroupNCCL.cpp:621] [Rank 5] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank5]:[E1016 03:31:38.878471762 ProcessGroupNCCL.cpp:627] [Rank 5] To avoid data inconsistency, we are taking the entire process down.
  what():  [PG 3 Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=9141, OpType=_ALLGATHER_BASE, NumelIn=19008, NumelOut=152064, Timeout(ms)=600000) ran for 600022 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:609 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f3b6d41df86 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f3b6e71a8d2 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7f3b6e721313 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f3b6e7236fc in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xd6df4 (0x7f3bca9d5df4 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: <unknown function> + 0x8609 (0x7f3c74f76609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7f3c750b0353 in /usr/lib/x86_64-linux-gnu/libc.so.6)

Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1521 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f3b6d41df86 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xe5aa84 (0x7f3b6e3aca84 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0xd6df4 (0x7f3bca9d5df4 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #3: <unknown function> + 0x8609 (0x7f3c74f76609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
frame #4: clone + 0x43 (0x7f3c750b0353 in /usr/lib/x86_64-linux-gnu/libc.so.6)

[rank5]:[E1016 03:31:38.879947014 ProcessGroupNCCL.cpp:1515] [PG 3 Rank 5] Process group watchdog thread terminated with exception: [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=9141, OpType=_ALLGATHER_BASE, NumelIn=19008, NumelOut=152064, Timeout(ms)=600000) ran for 600020 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:609 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f3b6d41df86 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f3b6e71a8d2 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7f3b6e721313 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f3b6e7236fc in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xd6df4 (0x7f3bca9d5df4 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: <unknown function> + 0x8609 (0x7f3c74f76609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7f3c750b0353 in /usr/lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
  what():  [PG 3 Rank 5] Process group watchdog thread terminated with exception: [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=9141, OpType=_ALLGATHER_BASE, NumelIn=19008, NumelOut=152064, Timeout(ms)=600000) ran for 600020 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:609 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f3b6d41df86 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f3b6e71a8d2 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7f3b6e721313 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f3b6e7236fc in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xd6df4 (0x7f3bca9d5df4 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: <unknown function> + 0x8609 (0x7f3c74f76609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7f3c750b0353 in /usr/lib/x86_64-linux-gnu/libc.so.6)

Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1521 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f3b6d41df86 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xe5aa84 (0x7f3b6e3aca84 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0xd6df4 (0x7f3bca9d5df4 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #3: <unknown function> + 0x8609 (0x7f3c74f76609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
frame #4: clone + 0x43 (0x7f3c750b0353 in /usr/lib/x86_64-linux-gnu/libc.so.6)

[rank7]:[E1016 03:31:38.910650666 ProcessGroupNCCL.cpp:607] [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=9141, OpType=_ALLGATHER_BASE, NumelIn=19008, NumelOut=152064, Timeout(ms)=600000) ran for 600053 milliseconds before timing out.
[rank7]:[E1016 03:31:38.911296315 ProcessGroupNCCL.cpp:1664] [PG 3 Rank 7] Exception (either an error or timeout) detected by watchdog at work: 9141, last enqueued NCCL work: 9141, last completed NCCL work: 9140.
[rank7]:[E1016 03:31:38.911304843 ProcessGroupNCCL.cpp:1709] [PG 3 Rank 7] Timeout at NCCL work: 9141, last enqueued NCCL work: 9141, last completed NCCL work: 9140.
[rank7]:[E1016 03:31:38.911310964 ProcessGroupNCCL.cpp:621] [Rank 7] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank7]:[E1016 03:31:38.911317965 ProcessGroupNCCL.cpp:627] [Rank 7] To avoid data inconsistency, we are taking the entire process down.
[rank7]:[E1016 03:31:38.912655147 ProcessGroupNCCL.cpp:1515] [PG 3 Rank 7] Process group watchdog thread terminated with exception: [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=9141, OpType=_ALLGATHER_BASE, NumelIn=19008, NumelOut=152064, Timeout(ms)=600000) ran for 600053 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:609 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f3b6d41df86 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f3b6e71a8d2 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7f3b6e721313 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f3b6e7236fc in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xd6df4 (0x7f3bca9d5df4 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: <unknown function> + 0x8609 (0x7f3c74f76609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7f3c750b0353 in /usr/lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
  what():  [PG 3 Rank 7] Process group watchdog thread terminated with exception: [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=9141, OpType=_ALLGATHER_BASE, NumelIn=19008, NumelOut=152064, Timeout(ms)=600000) ran for 600053 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:609 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f3b6d41df86 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f3b6e71a8d2 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7f3b6e721313 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f3b6e7236fc in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xd6df4 (0x7f3bca9d5df4 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: <unknown function> + 0x8609 (0x7f3c74f76609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7f3c750b0353 in /usr/lib/x86_64-linux-gnu/libc.so.6)

Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1521 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f3b6d41df86 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xe5aa84 (0x7f3b6e3aca84 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0xd6df4 (0x7f3bca9d5df4 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #3: <unknown function> + 0x8609 (0x7f3c74f76609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
frame #4: clone + 0x43 (0x7f3c750b0353 in /usr/lib/x86_64-linux-gnu/libc.so.6)

[rank3]:[E1016 03:31:38.919469923 ProcessGroupNCCL.cpp:607] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=9141, OpType=_ALLGATHER_BASE, NumelIn=19008, NumelOut=152064, Timeout(ms)=600000) ran for 600062 milliseconds before timing out.
[rank3]:[E1016 03:31:38.920009066 ProcessGroupNCCL.cpp:1664] [PG 3 Rank 3] Exception (either an error or timeout) detected by watchdog at work: 9141, last enqueued NCCL work: 9141, last completed NCCL work: 9140.
[rank3]:[E1016 03:31:38.920018298 ProcessGroupNCCL.cpp:1709] [PG 3 Rank 3] Timeout at NCCL work: 9141, last enqueued NCCL work: 9141, last completed NCCL work: 9140.
[rank3]:[E1016 03:31:38.920024804 ProcessGroupNCCL.cpp:621] [Rank 3] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank3]:[E1016 03:31:38.920031875 ProcessGroupNCCL.cpp:627] [Rank 3] To avoid data inconsistency, we are taking the entire process down.
[rank3]:[E1016 03:31:38.921362434 ProcessGroupNCCL.cpp:1515] [PG 3 Rank 3] Process group watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=9141, OpType=_ALLGATHER_BASE, NumelIn=19008, NumelOut=152064, Timeout(ms)=600000) ran for 600062 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:609 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f3b6d41df86 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f3b6e71a8d2 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7f3b6e721313 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f3b6e7236fc in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xd6df4 (0x7f3bca9d5df4 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: <unknown function> + 0x8609 (0x7f3c74f76609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7f3c750b0353 in /usr/lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
  what():  [PG 3 Rank 3] Process group watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=9141, OpType=_ALLGATHER_BASE, NumelIn=19008, NumelOut=152064, Timeout(ms)=600000) ran for 600062 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:609 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f3b6d41df86 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f3b6e71a8d2 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7f3b6e721313 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f3b6e7236fc in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xd6df4 (0x7f3bca9d5df4 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: <unknown function> + 0x8609 (0x7f3c74f76609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7f3c750b0353 in /usr/lib/x86_64-linux-gnu/libc.so.6)

Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1521 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f3b6d41df86 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xe5aa84 (0x7f3b6e3aca84 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0xd6df4 (0x7f3bca9d5df4 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #3: <unknown function> + 0x8609 (0x7f3c74f76609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
frame #4: clone + 0x43 (0x7f3c750b0353 in /usr/lib/x86_64-linux-gnu/libc.so.6)

[rank1]:[E1016 03:31:38.928019586 ProcessGroupNCCL.cpp:607] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=9141, OpType=_ALLGATHER_BASE, NumelIn=19008, NumelOut=152064, Timeout(ms)=600000) ran for 600071 milliseconds before timing out.
[rank4]:[E1016 03:31:38.928463388 ProcessGroupNCCL.cpp:607] [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=9141, OpType=_ALLGATHER_BASE, NumelIn=19008, NumelOut=152064, Timeout(ms)=600000) ran for 600071 milliseconds before timing out.
[rank4]:[E1016 03:31:38.929220797 ProcessGroupNCCL.cpp:1664] [PG 3 Rank 4] Exception (either an error or timeout) detected by watchdog at work: 9141, last enqueued NCCL work: 9141, last completed NCCL work: 9140.
[rank4]:[E1016 03:31:38.929234295 ProcessGroupNCCL.cpp:1709] [PG 3 Rank 4] Timeout at NCCL work: 9141, last enqueued NCCL work: 9141, last completed NCCL work: 9140.
[rank4]:[E1016 03:31:38.929241154 ProcessGroupNCCL.cpp:621] [Rank 4] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank4]:[E1016 03:31:38.929246900 ProcessGroupNCCL.cpp:627] [Rank 4] To avoid data inconsistency, we are taking the entire process down.
[rank4]:[E1016 03:31:38.930579993 ProcessGroupNCCL.cpp:1515] [PG 3 Rank 4] Process group watchdog thread terminated with exception: [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=9141, OpType=_ALLGATHER_BASE, NumelIn=19008, NumelOut=152064, Timeout(ms)=600000) ran for 600071 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:609 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f3b6d41df86 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f3b6e71a8d2 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7f3b6e721313 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f3b6e7236fc in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xd6df4 (0x7f3bca9d5df4 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: <unknown function> + 0x8609 (0x7f3c74f76609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7f3c750b0353 in /usr/lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
  what():  [PG 3 Rank 4] Process group watchdog thread terminated with exception: [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=9141, OpType=_ALLGATHER_BASE, NumelIn=19008, NumelOut=152064, Timeout(ms)=600000) ran for 600071 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:609 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f3b6d41df86 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f3b6e71a8d2 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7f3b6e721313 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f3b6e7236fc in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xd6df4 (0x7f3bca9d5df4 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: <unknown function> + 0x8609 (0x7f3c74f76609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7f3c750b0353 in /usr/lib/x86_64-linux-gnu/libc.so.6)

Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1521 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f3b6d41df86 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xe5aa84 (0x7f3b6e3aca84 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0xd6df4 (0x7f3bca9d5df4 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #3: <unknown function> + 0x8609 (0x7f3c74f76609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
frame #4: clone + 0x43 (0x7f3c750b0353 in /usr/lib/x86_64-linux-gnu/libc.so.6)

[rank1]:[E1016 03:31:38.936030480 ProcessGroupNCCL.cpp:1664] [PG 3 Rank 1] Exception (either an error or timeout) detected by watchdog at work: 9141, last enqueued NCCL work: 9141, last completed NCCL work: 9140.
[rank1]:[E1016 03:31:38.936047558 ProcessGroupNCCL.cpp:1709] [PG 3 Rank 1] Timeout at NCCL work: 9141, last enqueued NCCL work: 9141, last completed NCCL work: 9140.
[rank1]:[E1016 03:31:38.936055676 ProcessGroupNCCL.cpp:621] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank1]:[E1016 03:31:38.936060820 ProcessGroupNCCL.cpp:627] [Rank 1] To avoid data inconsistency, we are taking the entire process down.
[rank1]:[E1016 03:31:38.937407207 ProcessGroupNCCL.cpp:1515] [PG 3 Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=9141, OpType=_ALLGATHER_BASE, NumelIn=19008, NumelOut=152064, Timeout(ms)=600000) ran for 600071 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:609 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f3b6d41df86 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f3b6e71a8d2 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7f3b6e721313 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f3b6e7236fc in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xd6df4 (0x7f3bca9d5df4 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: <unknown function> + 0x8609 (0x7f3c74f76609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7f3c750b0353 in /usr/lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
  what():  [PG 3 Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=9141, OpType=_ALLGATHER_BASE, NumelIn=19008, NumelOut=152064, Timeout(ms)=600000) ran for 600071 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:609 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f3b6d41df86 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f3b6e71a8d2 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7f3b6e721313 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f3b6e7236fc in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xd6df4 (0x7f3bca9d5df4 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: <unknown function> + 0x8609 (0x7f3c74f76609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7f3c750b0353 in /usr/lib/x86_64-linux-gnu/libc.so.6)

Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1521 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f3b6d41df86 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xe5aa84 (0x7f3b6e3aca84 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0xd6df4 (0x7f3bca9d5df4 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #3: <unknown function> + 0x8609 (0x7f3c74f76609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
frame #4: clone + 0x43 (0x7f3c750b0353 in /usr/lib/x86_64-linux-gnu/libc.so.6)

/usr/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '

Environment

python3 -m sglang.check_env

Python: 3.10.14 (main, Apr 6 2024, 18:45:05) [GCC 9.4.0]
CUDA available: True
GPU 0,1,2,3,4,5,6,7: NVIDIA A800-SXM4-40GB
GPU 0,1,2,3,4,5,6,7 Compute Capability: 8.0
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.4, V12.4.131
CUDA Driver Version: 535.86.10
PyTorch: 2.4.0+cu121
sglang: 0.3.0
flashinfer: 0.1.6+cu124torch2.4
triton: 3.0.0
transformers: 4.44.2
requests: 2.32.3
tqdm: 4.66.5
numpy: 1.26.4
aiohttp: 3.10.5
fastapi: 0.112.2
hf_transfer: 0.1.8
huggingface_hub: 0.24.6
interegular: 0.3.3
packaging: 24.1
PIL: 10.4.0
psutil: 6.0.0
pydantic: 2.8.2
uvicorn: 0.30.6
uvloop: 0.20.0
zmq: 26.2.0
vllm: 0.5.5
multipart: 0.0.9
openai: 1.43.0
anthropic: 0.34.1
NVIDIA Topology:
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 NIC0 NIC1 NIC2 NIC3 NIC4 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X NV8 NV8 NV8 NV8 NV8 NV8 NV8 PXB NODE SYS SYS NODE 0-27,56-83 0 N/A
GPU1 NV8 X NV8 NV8 NV8 NV8 NV8 NV8 PXB NODE SYS SYS NODE 0-27,56-83 0 N/A
GPU2 NV8 NV8 X NV8 NV8 NV8 NV8 NV8 NODE PXB SYS SYS NODE 0-27,56-83 0 N/A
GPU3 NV8 NV8 NV8 X NV8 NV8 NV8 NV8 NODE PXB SYS SYS NODE 0-27,56-83 0 N/A
GPU4 NV8 NV8 NV8 NV8 X NV8 NV8 NV8 SYS SYS PXB NODE SYS 28-55,84-111 1 N/A
GPU5 NV8 NV8 NV8 NV8 NV8 X NV8 NV8 SYS SYS PXB NODE SYS 28-55,84-111 1 N/A
GPU6 NV8 NV8 NV8 NV8 NV8 NV8 X NV8 SYS SYS NODE PXB SYS 28-55,84-111 1 N/A
GPU7 NV8 NV8 NV8 NV8 NV8 NV8 NV8 X SYS SYS NODE PXB SYS 28-55,84-111 1 N/A
NIC0 PXB PXB NODE NODE SYS SYS SYS SYS X NODE SYS SYS NODE
NIC1 NODE NODE PXB PXB SYS SYS SYS SYS NODE X SYS SYS NODE
NIC2 SYS SYS SYS SYS PXB PXB NODE NODE SYS SYS X NODE SYS
NIC3 SYS SYS SYS SYS NODE NODE PXB PXB SYS SYS NODE X SYS
NIC4 NODE NODE NODE NODE SYS SYS SYS SYS NODE NODE SYS SYS X

Legend:

X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks

NIC Legend:

NIC0: mlx5_2
NIC1: mlx5_3
NIC2: mlx5_4
NIC3: mlx5_5
NIC4: mlx5_bond_0

ulimit soft: 1048576

The text was updated successfully, but these errors were encountered:

m0g1cian · 2024-10-17T03:41:14Z

Similar symptoms were mentioned in #1270 as well. There are several tips that help to alleivate the stability issue for now.

Try to disable custom all reduce by --disable-custom-all-reduce. Custom allreduce is mostly for small batch sizes so it may not affect the throughput a lot.

Try --disable-cuda-graph-padding. Again, this won't hurt peak throughput.

Try NCCL_ALGO=Tree

Try to match cuda driver versions.

zeng-zc · 2024-10-17T07:28:27Z

Similar symptoms were mentioned in #1270 as well. There are several tips that help to alleivate the stability issue for now.

Try to disable custom all reduce by --disable-custom-all-reduce. Custom allreduce is mostly for small batch sizes so it may not affect the throughput a lot.

Try --disable-cuda-graph-padding. Again, this won't hurt peak throughput.

Try NCCL_ALGO=Tree

Try to match cuda driver versions.

I've tested your method on the latest docker image v0.3.3.post1-cu121-srt, the following is the results:

server side adds --disable-custom-all-reduce, also hangs, very quickly
server side adds --disable-cuda-graph-padding, server errors: "sglang | [23:56:42 TP4] Detected errors during sampling! NaN in the probability."
server add NCCL_ALGO: Tree, server errors: "sglang | [23:56:42 TP4] Detected errors during sampling! NaN in the probability."

It seems the second method may disrupt the machine cuda environment, so the third test also failed. I have to hardly reboot the machine to recovery it.

zeng-zc mentioned this issue Oct 17, 2024

[Bug] deadlock or hang on Qwen2-7B models #1691

Open

5 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Bug] crash about `c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout` #1693

[Bug] crash about `c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout` #1693

zeng-zc commented Oct 17, 2024

m0g1cian commented Oct 17, 2024

zeng-zc commented Oct 17, 2024

[Bug] crash about c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout #1693

[Bug] crash about c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout #1693

Comments

zeng-zc commented Oct 17, 2024

Checklist

Describe the bug

Reproduction

Environment

python3 -m sglang.check_env

m0g1cian commented Oct 17, 2024

zeng-zc commented Oct 17, 2024

[Bug] crash about `c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout` #1693

[Bug] crash about `c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout` #1693