multi gpus run error after 1 epoch #314

Open
zzm-hl opened this issue Apr 24, 2022 · 12 comments

@zzm-hl

zzm-hl commented Apr 24, 2022

```
2022-04-24 09:26:36,204 - INFO - Epoch [1/20][3860/3862] lr: 0.00013, eta: 2 days, 11:49:01, time: 3.606, data_time: 0.654, transfer_time: 0.165, forward_time: 1.555, loss_parse_time: 0.002 memory: 27917,
2022-04-24 09:26:36,239 - INFO - task : ['car'], loss: 1.5316, hm_loss: 1.0401, loc_loss: 1.9661, loc_loss_elem: ['0.1882', '0.1916', '0.2169', '0.0773', '0.0694', '0.0938', '0.5669', '0.8691', '0.4311', '0.4106'], num_positive: 209.6000
2022-04-24 09:26:36,239 - INFO - task : ['truck', 'construction_vehicle'], loss: 2.2867, hm_loss: 1.6434, loc_loss: 2.5733, loc_loss_elem: ['0.2141', '0.2157', '0.3811', '0.1663', '0.1680', '0.1678', '0.3224', '0.5616', '0.5250', '0.5585'], num_positive: 126.8000
2022-04-24 09:26:36,240 - INFO - task : ['bus', 'trailer'], loss: 2.2827, hm_loss: 1.5935, loc_loss: 2.7572, loc_loss_elem: ['0.2271', '0.2145', '0.4656', '0.1020', '0.1414', '0.1205', '0.8358', '1.2225', '0.5222', '0.5521'], num_positive: 90.8000
2022-04-24 09:26:36,240 - INFO - task : ['barrier'], loss: 1.6930, hm_loss: 1.1514, loc_loss: 2.1661, loc_loss_elem: ['0.1637', '0.1844', '0.1940', '0.1647', '0.2657', '0.1325', '0.0365', '0.0479', '0.5859', '0.4582'], num_positive: 93.8000
2022-04-24 09:26:36,241 - INFO - task : ['motorcycle', 'bicycle'], loss: 1.5902, hm_loss: 1.0459, loc_loss: 2.1772, loc_loss_elem: ['0.1616', '0.1612', '0.1969', '0.1860', '0.1172', '0.1472', '0.4495', '0.6558', '0.4626', '0.5233'], num_positive: 160.8000
2022-04-24 09:26:36,241 - INFO - task : ['pedestrian', 'traffic_cone'], loss: 1.5811, hm_loss: 0.9833, loc_loss: 2.3910, loc_loss_elem: ['0.1535', '0.1580', '0.2172', '0.2156', '0.2590', '0.1522', '0.2787', '0.3207', '0.5800', '0.5356'], num_positive: 144.2000

2022-04-24 09:27:23,734 - INFO - finding looplift candidates
2022-04-24 09:27:23,734 - INFO - finding looplift candidates
2022-04-24 09:27:23,734 - INFO - finding looplift candidates
2022-04-24 09:27:23,734 - INFO - finding looplift candidates
2022-04-24 09:27:23,734 - INFO - finding looplift candidates
2022-04-24 09:27:23,734 - INFO - finding looplift candidates
2022-04-24 09:27:23,734 - INFO - finding looplift candidates
2022-04-24 09:27:23,734 - INFO - finding looplift candidates
[E ProcessGroupNCCL.cpp:719] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=583940, OpType=_ALLGATHER_BASE, Timeout(ms)=1800000) ran for 1800257 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:406] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
what(): [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=583940, OpType=_ALLGATHER_BASE, Timeout(ms)=1800000) ran for 1800257 milliseconds before timing out.
/public/home/u212040344/.conda/envs/centerpoint/lib/python3.7/site-packages/torch/distributed/launch.py:186: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects --local_rank argument to be set, please
change it to read from os.environ['LOCAL_RANK'] instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions

FutureWarning,
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 58033 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 0 (pid: 58028) of binary: /public/home/u212040344/.conda/envs/centerpoint/bin/python
Traceback (most recent call last):
File "/public/home/u212040344/.conda/envs/centerpoint/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"main", mod_spec)
File "/public/home/u212040344/.conda/envs/centerpoint/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/public/home/u212040344/.conda/envs/centerpoint/lib/python3.7/site-packages/torch/distributed/launch.py", line 193, in
main()
File "/public/home/u212040344/.conda/envs/centerpoint/lib/python3.7/site-packages/torch/distributed/launch.py", line 189, in main
launch(args)
File "/public/home/u212040344/.conda/envs/centerpoint/lib/python3.7/site-packages/torch/distributed/launch.py", line 174, in launch
run(args)
File "/public/home/u212040344/.conda/envs/centerpoint/lib/python3.7/site-packages/torch/distributed/run.py", line 718, in run
)(*cmd_args)
File "/public/home/u212040344/.conda/envs/centerpoint/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 131, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/public/home/u212040344/.conda/envs/centerpoint/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 247, in launch_agent
failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

./tools/train.py FAILED

Failures:
<NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
time : 2022-04-24_09:59:14
host : node191
rank : 0 (local_rank: 0)
exitcode : -6 (pid: 58028)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 58028

```
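As an aside, the FutureWarning in the log concerns the launcher rather than the crash itself: torch.distributed.launch is deprecated in favor of torchrun, which exports LOCAL_RANK in the environment instead of passing --local_rank. A minimal sketch of the torchrun-compatible pattern (an illustration, not this repo's actual train.py):

```python
# Sketch only: how a training script reads its rank under torchrun.
# torchrun sets LOCAL_RANK / RANK / WORLD_SIZE as environment variables,
# so the script no longer needs an explicit --local_rank argument.
import os

import torch
import torch.distributed as dist

local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)
dist.init_process_group(backend="nccl")
```

Launched with, for example, `torchrun --nproc_per_node=2 ./tools/train.py <config>`.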
Run on 2 A100 GPUs with batch_size 16 and num_workers 8*2.
Environment: PyTorch 1.11, CUDA 11.3, spconv 2.x.
Could you help me? Thank you.
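
The watchdog fired after the default 30-minute NCCL timeout (1800000 ms) on an `_ALLGATHER_BASE` call, which usually means one rank stopped reaching the collective rather than the collective being slow. If a rank is merely stalled on something long-running (for example, first-time numba compilation), one possible mitigation is to raise the timeout when the process group is created. The snippet below is a sketch of that idea, not a confirmed fix and not this repo's code:

```python
# Sketch (assumption, not a verified fix): extend the NCCL collective timeout so a
# slow-but-alive rank is not killed by the ProcessGroupNCCL watchdog.
from datetime import timedelta

import torch.distributed as dist

dist.init_process_group(
    backend="nccl",
    timeout=timedelta(hours=2),  # default is timedelta(minutes=30)
)
```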

@tianweiy
Owner

See #224 (comment) and #203.

Unfortunately, I don't have any further suggestions for fixing this, as I cannot reproduce the error in my setup.

@tianweiy
Owner

Maybe also check out open-mmlab/OpenPCDet#696.

Please let me know if any of these work for you, and I will update the code accordingly.

@zzm-hl
Author

zzm-hl commented Apr 25, 2022

The same problem is described at https://www.zhihu.com/question/512132168 and
https://discuss.pytorch.org/t/gpu-startup-is-way-too-slow/147956/12

I think the IOMMU needs to be disabled. Unfortunately, the GPUs are on my school's cluster and I don't have permission to turn off the IOMMU in the BIOS, so the problem remains. I did verify that training on a single GPU works normally, although it is relatively slow.
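
If the IOMMU cannot be disabled in the BIOS, one workaround that is sometimes reported for A100 nodes (not verified for this repo, so treat it as an assumption) is to disable NCCL peer-to-peer transfers so GPU-to-GPU traffic is staged through host memory. A minimal sketch:

```python
# Sketch of a possible workaround when the IOMMU cannot be turned off: disable NCCL P2P.
# NCCL reads these environment variables when it initializes, so set them before
# torch.distributed.init_process_group is called (or export them in the job script).
import os

os.environ.setdefault("NCCL_P2P_DISABLE", "1")  # stage GPU-to-GPU traffic through host memory
os.environ.setdefault("NCCL_DEBUG", "WARN")     # surface NCCL warnings in the training log
```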

@tianweiy
Owner

Do you still use apex, or the native SyncBN?

Check #224 (comment) (they seem to have fixed the problem by switching).
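
For reference, switching from apex's synced BN to PyTorch's native SyncBatchNorm usually amounts to converting the model before wrapping it in DDP. A minimal sketch, assuming `model` and `local_rank` already exist and the process group is initialized (an illustration, not this repo's exact training code):

```python
# Sketch: use PyTorch's native SyncBatchNorm instead of apex's synchronized BN.
import torch
from torch.nn.parallel import DistributedDataParallel as DDP

model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)  # swap BatchNorm layers
model = model.cuda(local_rank)
model = DDP(model, device_ids=[local_rank], output_device=local_rank)
```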

@zzm-hl
Author

zzm-hl commented Apr 26, 2022 via email

@tianweiy
Owner

I see. Hmm... maybe you can try the CUDA 10.0 + torch 1.1 + spconv 1.x setup? As I recall, a year ago no one had issues with multi-GPU training; most problems have appeared in the last few months with the newer torch and CUDA versions.

@tianweiy
Owner

tianweiy commented Apr 26, 2022

Worst case scenario, you can just use OpenPCDet's CenterPoint implementation (it also looks good and has comparable performance).

@zzm-hl
Author

zzm-hl commented Apr 30, 2022

> Worst case scenario, you can just use OpenPCDet's CenterPoint implementation (it also looks good and has comparable performance).

Thank you, I will give it a try!

@tianweiy
Owner

See #203 (comment).

@zzm-hl
Author

zzm-hl commented Oct 11, 2022 via email

@ZecCheng

ZecCheng commented Jul 4, 2023

Did you solve the problem? I faced the same problem before. Can you share any ideas about this NCCL timeout error? By the way, my environment is CUDA 11.3, spconv 2.1.21, torch 1.12.1, numba 0.57.0, and I also hit this error before the first iteration starts.
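
One way to narrow down where the hang happens (an assumption about what might help, not a known fix) is to enable verbose distributed logging so per-rank NCCL activity is visible before the watchdog fires. A minimal sketch:

```python
# Sketch: verbose logging to localize an NCCL timeout. Set these before
# torch.distributed.init_process_group runs, or export them in the shell.
import os

os.environ.setdefault("NCCL_DEBUG", "INFO")                 # per-rank NCCL activity
os.environ.setdefault("NCCL_DEBUG_SUBSYS", "INIT,COLL")     # focus on init + collectives
os.environ.setdefault("TORCH_DISTRIBUTED_DEBUG", "DETAIL")  # extra DDP consistency checks
```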

@JerryDaHeLian

Suppose there are two nodes, each with 8 GPUs. Can the batch size be set to a different value on each node?
