multi gpus run error after 1 epoch #314
Comments
See #224 (comment) and #203. Unfortunately, I don't have any more suggestions about how to fix this, as I can't reproduce the error in my setup. |
Maybe also check out open-mmlab/OpenPCDet#696. Please let me know if any of these work and I will update the code accordingly. |
The same problem is discussed at https://www.zhihu.com/question/512132168; I think it requires disabling IOMMU. Unfortunately, the GPUs are on my school's cluster and I don't have permission to disable IOMMU in the BIOS, so the problem still exists. I did try training on a single GPU, which works normally but is relatively slow. |
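For anyone else without BIOS access: a commonly reported software-side workaround for IOMMU-related NCCL hangs is to disable NCCL's GPU peer-to-peer transport. A minimal sketch, assuming the variable is set before torch.distributed creates its NCCL communicators (whether it helps depends on the hardware):

```python
import os

# Workaround sketch: disable NCCL peer-to-peer transfers so GPU-to-GPU traffic
# falls back to staging through host memory, which sidesteps IOMMU-related P2P
# failures on some systems. Must run before the first NCCL collective.
os.environ["NCCL_P2P_DISABLE"] = "1"

# ... then initialize distributed training as usual, e.g.:
# torch.distributed.init_process_group(backend="nccl")
```

The same variable can instead be exported in the shell that runs the training command.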
Do you still use apex or the native syncbn? Check #224 (comment) (they seem to fix the problem by switching). |
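For reference, switching from apex sync BN to the native implementation is usually a one-line conversion. A minimal sketch with a hypothetical toy model (not the repo's actual training code):

```python
import torch

# Sketch: replace every BatchNorm layer with torch.nn.SyncBatchNorm so batch
# statistics are synchronized across ranks, instead of relying on apex's sync BN.
model = torch.nn.Sequential(
    torch.nn.Conv2d(3, 16, 3, padding=1),
    torch.nn.BatchNorm2d(16),
    torch.nn.ReLU(),
).cuda()
model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)

# Wrap in DDP as usual; assumes torch.distributed is already initialized and
# `local_rank` was read from the launcher (e.g. os.environ["LOCAL_RANK"]).
# model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])
```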
I have switched to the native syncbn, and I have also decreased the number of workers and the batch size, but it doesn't work either. |
I see. Hmm... maybe you can try the cuda 10.0 + torch 1.1 + spconv 1.x combination? I remember that a year ago no one had issues with multi-GPU training; most problems appeared a few months ago with the newer torch and cuda versions. |
Worst case scenario, you can just use OpenPCDet's CenterPoint implementation (it also looks good and has comparable performance). |
Thank you, I will give it a try! |
see #203 (comment) |
I checked: the Tesla A100 does not seem to support CUDA 10.x, which makes this more troublesome. I will try Open3D next. Thank you very much for your reply! |
Did you solve the problem? I have faced the same problem as you before. Can you share any ideas about this NCCL time-out error? By the way, my environment is cuda 11.3, spconv 2.1.21, torch 1.12.1, numba 0.57.0, and I also got this error before one iteration even starts. |
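One thing that is sometimes worth trying, under the assumption that the hang is just a very slow step (e.g. first-time numba compilation) rather than a real deadlock, is to raise the NCCL collective timeout from its default 30 minutes when the process group is created. A sketch:

```python
import datetime
import torch.distributed as dist

# Sketch: raise the NCCL watchdog timeout from the default 30 minutes to 2 hours.
# Assumes the script is started by torch.distributed.launch / torchrun, which set
# the rendezvous environment variables (MASTER_ADDR, RANK, WORLD_SIZE, ...).
# This only helps if one rank is merely slow; a genuine deadlock will still time out.
dist.init_process_group(backend="nccl", timeout=datetime.timedelta(hours=2))
```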
Suppose there are two nodes, each with 8 GPUs. Can the batch size be set to different values on different nodes?
```
2022-04-24 09:26:36,204 - INFO - Epoch [1/20][3860/3862] lr: 0.00013, eta: 2 days, 11:49:01, time: 3.606, data_time: 0.654, transfer_time: 0.165, forward_time: 1.555, loss_parse_time: 0.002 memory: 27917,
2022-04-24 09:26:36,239 - INFO - task : ['car'], loss: 1.5316, hm_loss: 1.0401, loc_loss: 1.9661, loc_loss_elem: ['0.1882', '0.1916', '0.2169', '0.0773', '0.0694', '0.0938', '0.5669', '0.8691', '0.4311', '0.4106'], num_positive: 209.6000
2022-04-24 09:26:36,239 - INFO - task : ['truck', 'construction_vehicle'], loss: 2.2867, hm_loss: 1.6434, loc_loss: 2.5733, loc_loss_elem: ['0.2141', '0.2157', '0.3811', '0.1663', '0.1680', '0.1678', '0.3224', '0.5616', '0.5250', '0.5585'], num_positive: 126.8000
2022-04-24 09:26:36,240 - INFO - task : ['bus', 'trailer'], loss: 2.2827, hm_loss: 1.5935, loc_loss: 2.7572, loc_loss_elem: ['0.2271', '0.2145', '0.4656', '0.1020', '0.1414', '0.1205', '0.8358', '1.2225', '0.5222', '0.5521'], num_positive: 90.8000
2022-04-24 09:26:36,240 - INFO - task : ['barrier'], loss: 1.6930, hm_loss: 1.1514, loc_loss: 2.1661, loc_loss_elem: ['0.1637', '0.1844', '0.1940', '0.1647', '0.2657', '0.1325', '0.0365', '0.0479', '0.5859', '0.4582'], num_positive: 93.8000
2022-04-24 09:26:36,241 - INFO - task : ['motorcycle', 'bicycle'], loss: 1.5902, hm_loss: 1.0459, loc_loss: 2.1772, loc_loss_elem: ['0.1616', '0.1612', '0.1969', '0.1860', '0.1172', '0.1472', '0.4495', '0.6558', '0.4626', '0.5233'], num_positive: 160.8000
2022-04-24 09:26:36,241 - INFO - task : ['pedestrian', 'traffic_cone'], loss: 1.5811, hm_loss: 0.9833, loc_loss: 2.3910, loc_loss_elem: ['0.1535', '0.1580', '0.2172', '0.2156', '0.2590', '0.1522', '0.2787', '0.3207', '0.5800', '0.5356'], num_positive: 144.2000
2022-04-24 09:27:23,734 - INFO - finding looplift candidates
2022-04-24 09:27:23,734 - INFO - finding looplift candidates
2022-04-24 09:27:23,734 - INFO - finding looplift candidates
2022-04-24 09:27:23,734 - INFO - finding looplift candidates
2022-04-24 09:27:23,734 - INFO - finding looplift candidates
2022-04-24 09:27:23,734 - INFO - finding looplift candidates
2022-04-24 09:27:23,734 - INFO - finding looplift candidates
2022-04-24 09:27:23,734 - INFO - finding looplift candidates
[E ProcessGroupNCCL.cpp:719] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=583940, OpType=_ALLGATHER_BASE, Timeout(ms)=1800000) ran for 1800257 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:406] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
what(): [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=583940, OpType=_ALLGATHER_BASE, Timeout(ms)=1800000) ran for 1800257 milliseconds before timing out.
/public/home/u212040344/.conda/envs/centerpoint/lib/python3.7/site-packages/torch/distributed/launch.py:186: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects
--local_rank
argument to be set, please change it to read from
os.environ['LOCAL_RANK']
instead. See https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions
FutureWarning,
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 58033 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 0 (pid: 58028) of binary: /public/home/u212040344/.conda/envs/centerpoint/bin/python
Traceback (most recent call last):
File "/public/home/u212040344/.conda/envs/centerpoint/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"main", mod_spec)
File "/public/home/u212040344/.conda/envs/centerpoint/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/public/home/u212040344/.conda/envs/centerpoint/lib/python3.7/site-packages/torch/distributed/launch.py", line 193, in
main()
File "/public/home/u212040344/.conda/envs/centerpoint/lib/python3.7/site-packages/torch/distributed/launch.py", line 189, in main
launch(args)
File "/public/home/u212040344/.conda/envs/centerpoint/lib/python3.7/site-packages/torch/distributed/launch.py", line 174, in launch
run(args)
File "/public/home/u212040344/.conda/envs/centerpoint/lib/python3.7/site-packages/torch/distributed/run.py", line 718, in run
)(*cmd_args)
File "/public/home/u212040344/.conda/envs/centerpoint/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 131, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/public/home/u212040344/.conda/envs/centerpoint/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 247, in launch_agent
failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
./tools/train.py FAILED
Failures:
<NO_OTHER_FAILURES>
Root Cause (first observed failure):
[0]:
time : 2022-04-24_09:59:14
host : node191
rank : 0 (local_rank: 0)
exitcode : -6 (pid: 58028)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 58028
```
Run on 2 A100 GPUs with batch_size 16 and num_workers 8*2.
Environment: pytorch 1.11, cuda 11.3, spconv 2.x.
Could you help me? Thank you.
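A hedged debugging suggestion for the log above: enabling verbose NCCL and torch.distributed logging on the next run can show which rank stopped participating in the timed-out _ALLGATHER_BASE call. A sketch using standard PyTorch/NCCL environment variables (set them before the process group is initialized, or export them in the launching shell):

```python
import os

# Sketch: turn on verbose distributed logging so the timed-out collective can be
# traced back to the rank that stopped participating (e.g. a worker stuck in
# data loading or numba JIT compilation).
os.environ["NCCL_DEBUG"] = "INFO"                 # NCCL transport/collective logs
os.environ["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"  # extra DDP consistency checks (torch >= 1.9)
os.environ["NCCL_ASYNC_ERROR_HANDLING"] = "1"     # fail fast instead of hanging silently
```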