multi gpus run error after 1 epoch #314
Comments
See #224 (comment) and #203. Unfortunately, I don't have any more suggestions about how to fix this, as I can't reproduce the error in my setup. |
Maybe also check out open-mmlab/OpenPCDet#696. Please let me know if any of these work and I will update the code accordingly. |
The same problem is discussed at https://www.zhihu.com/question/512132168; I think it requires disabling IOMMU. Unfortunately, the GPUs are on my school's cluster and I don't have permission to disable IOMMU in the BIOS, so the problem still exists. I did try training on a single GPU, which works normally but is relatively slow. |
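For anyone else without BIOS access: a commonly reported software-side workaround for IOMMU-related NCCL hangs is to disable NCCL's GPU peer-to-peer transport. A minimal sketch, assuming the variable is set before torch.distributed creates its NCCL communicators (whether it helps depends on the hardware):

```python
import os

# Workaround sketch: disable NCCL peer-to-peer transfers so GPU-to-GPU traffic
# falls back to staging through host memory, which sidesteps IOMMU-related P2P
# failures on some systems. Must run before the first NCCL collective.
os.environ["NCCL_P2P_DISABLE"] = "1"

# ... then initialize distributed training as usual, e.g.:
# torch.distributed.init_process_group(backend="nccl")
```

The same variable can instead be exported in the shell that runs the training command.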
Do you still use apex or the native syncbn? Check #224 (comment) (they seem to fix the problem by switching). |
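For reference, switching from apex sync BN to the native implementation is usually a one-line conversion. A minimal sketch with a hypothetical toy model (not the repo's actual training code):

```python
import torch

# Sketch: replace every BatchNorm layer with torch.nn.SyncBatchNorm so batch
# statistics are synchronized across ranks, instead of relying on apex's sync BN.
model = torch.nn.Sequential(
    torch.nn.Conv2d(3, 16, 3, padding=1),
    torch.nn.BatchNorm2d(16),
    torch.nn.ReLU(),
).cuda()
model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)

# Wrap in DDP as usual; assumes torch.distributed is already initialized and
# `local_rank` was read from the launcher (e.g. os.environ["LOCAL_RANK"]).
# model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])
```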
I have switched to the native syncbn, and I have also decreased the number of workers and the batch size, but it doesn't work either. |
I see. Hmm... maybe you can try the cuda 10.0 + torch 1.1 + spconv 1.x combination? I remember that a year ago no one had issues with multi-GPU training; most problems appeared a few months ago with the newer torch and cuda versions. |
Worst case scenario, you can just use OpenPCDet's CenterPoint implementation (it also looks good and has comparable performance). |
Thank you, I will give it a try! |
see #203 (comment) |
I checked: the Tesla A100 does not seem to support CUDA 10.x, which makes this more troublesome. I will try Open3D next. Thank you very much for your reply! |
Did you solve the problem? I have faced the same problem as you before. Can you share any ideas about this NCCL time-out error? By the way, my environment is cuda 11.3, spconv 2.1.21, torch 1.12.1, numba 0.57.0, and I also got this error before one iteration even starts. |
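One thing that is sometimes worth trying, under the assumption that the hang is just a very slow step (e.g. first-time numba compilation) rather than a real deadlock, is to raise the NCCL collective timeout from its default 30 minutes when the process group is created. A sketch:

```python
import datetime
import torch.distributed as dist

# Sketch: raise the NCCL watchdog timeout from the default 30 minutes to 2 hours.
# Assumes the script is started by torch.distributed.launch / torchrun, which set
# the rendezvous environment variables (MASTER_ADDR, RANK, WORLD_SIZE, ...).
# This only helps if one rank is merely slow; a genuine deadlock will still time out.
dist.init_process_group(backend="nccl", timeout=datetime.timedelta(hours=2))
```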
Suppose there are two nodes, each with 8 GPUs. Can the batch size be set to different values on different nodes?
```
2022-04-24 09:26:36,204 - INFO - Epoch [1/20][3860/3862] lr: 0.00013, eta: 2 days, 11:49:01, time: 3.606, data_time: 0.654, transfer_time: 0.165, forward_time: 1.555, loss_parse_time: 0.002 memory: 27917,
2022-04-24 09:26:36,239 - INFO - task : ['car'], loss: 1.5316, hm_loss: 1.0401, loc_loss: 1.9661, loc_loss_elem: ['0.1882', '0.1916', '0.2169', '0.0773', '0.0694', '0.0938', '0.5669', '0.8691', '0.4311', '0.4106'], num_positive: 209.6000
2022-04-24 09:26:36,239 - INFO - task : ['truck', 'construction_vehicle'], loss: 2.2867, hm_loss: 1.6434, loc_loss: 2.5733, loc_loss_elem: ['0.2141', '0.2157', '0.3811', '0.1663', '0.1680', '0.1678', '0.3224', '0.5616', '0.5250', '0.5585'], num_positive: 126.8000
2022-04-24 09:26:36,240 - INFO - task : ['bus', 'trailer'], loss: 2.2827, hm_loss: 1.5935, loc_loss: 2.7572, loc_loss_elem: ['0.2271', '0.2145', '0.4656', '0.1020', '0.1414', '0.1205', '0.8358', '1.2225', '0.5222', '0.5521'], num_positive: 90.8000
2022-04-24 09:26:36,240 - INFO - task : ['barrier'], loss: 1.6930, hm_loss: 1.1514, loc_loss: 2.1661, loc_loss_elem: ['0.1637', '0.1844', '0.1940', '0.1647', '0.2657', '0.1325', '0.0365', '0.0479', '0.5859', '0.4582'], num_positive: 93.8000
2022-04-24 09:26:36,241 - INFO - task : ['motorcycle', 'bicycle'], loss: 1.5902, hm_loss: 1.0459, loc_loss: 2.1772, loc_loss_elem: ['0.1616', '0.1612', '0.1969', '0.1860', '0.1172', '0.1472', '0.4495', '0.6558', '0.4626', '0.5233'], num_positive: 160.8000
2022-04-24 09:26:36,241 - INFO - task : ['pedestrian', 'traffic_cone'], loss: 1.5811, hm_loss: 0.9833, loc_loss: 2.3910, loc_loss_elem: ['0.1535', '0.1580', '0.2172', '0.2156', '0.2590', '0.1522', '0.2787', '0.3207', '0.5800', '0.5356'], num_positive: 144.2000
2022-04-24 09:27:23,734 - INFO - finding looplift candidates
2022-04-24 09:27:23,734 - INFO - finding looplift candidates
2022-04-24 09:27:23,734 - INFO - finding looplift candidates
2022-04-24 09:27:23,734 - INFO - finding looplift candidates
2022-04-24 09:27:23,734 - INFO - finding looplift candidates
2022-04-24 09:27:23,734 - INFO - finding looplift candidates
2022-04-24 09:27:23,734 - INFO - finding looplift candidates
2022-04-24 09:27:23,734 - INFO - finding looplift candidates
[E ProcessGroupNCCL.cpp:719] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=583940, OpType=_ALLGATHER_BASE, Timeout(ms)=1800000) ran for 1800257 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:406] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
what(): [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=583940, OpType=_ALLGATHER_BASE, Timeout(ms)=1800000) ran for 1800257 milliseconds before timing out.
/public/home/u212040344/.conda/envs/centerpoint/lib/python3.7/site-packages/torch/distributed/launch.py:186: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects
--local_rank
argument to be set, please change it to read from
os.environ['LOCAL_RANK']
instead. See https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions
FutureWarning,
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 58033 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 0 (pid: 58028) of binary: /public/home/u212040344/.conda/envs/centerpoint/bin/python
Traceback (most recent call last):
File "/public/home/u212040344/.conda/envs/centerpoint/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"main", mod_spec)
File "/public/home/u212040344/.conda/envs/centerpoint/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/public/home/u212040344/.conda/envs/centerpoint/lib/python3.7/site-packages/torch/distributed/launch.py", line 193, in
main()
File "/public/home/u212040344/.conda/envs/centerpoint/lib/python3.7/site-packages/torch/distributed/launch.py", line 189, in main
launch(args)
File "/public/home/u212040344/.conda/envs/centerpoint/lib/python3.7/site-packages/torch/distributed/launch.py", line 174, in launch
run(args)
File "/public/home/u212040344/.conda/envs/centerpoint/lib/python3.7/site-packages/torch/distributed/run.py", line 718, in run
)(*cmd_args)
File "/public/home/u212040344/.conda/envs/centerpoint/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 131, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/public/home/u212040344/.conda/envs/centerpoint/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 247, in launch_agent
failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
./tools/train.py FAILED
Failures:
<NO_OTHER_FAILURES>
Root Cause (first observed failure):
[0]:
time : 2022-04-24_09:59:14
host : node191
rank : 0 (local_rank: 0)
exitcode : -6 (pid: 58028)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 58028
```
Run on 2 A100 GPUs with batch_size 16 and num_workers 8*2.
Environment: pytorch 1.11, cuda 11.3, spconv 2.x.
Could you help me? Thank you.
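A hedged debugging suggestion for the log above: enabling verbose NCCL and torch.distributed logging on the next run can show which rank stopped participating in the timed-out _ALLGATHER_BASE call. A sketch using standard PyTorch/NCCL environment variables (set them before the process group is initialized, or export them in the launching shell):

```python
import os

# Sketch: turn on verbose distributed logging so the timed-out collective can be
# traced back to the rank that stopped participating (e.g. a worker stuck in
# data loading or numba JIT compilation).
os.environ["NCCL_DEBUG"] = "INFO"                 # NCCL transport/collective logs
os.environ["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"  # extra DDP consistency checks (torch >= 1.9)
os.environ["NCCL_ASYNC_ERROR_HANDLING"] = "1"     # fail fast instead of hanging silently
```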