
Get stuck after one epoch of training (Multi GPU DDP) See this! #203

Open

wusize opened this issue Oct 7, 2021 · 33 comments
Comments

@wusize

wusize commented Oct 7, 2021

When training with multiple GPUs, the program stops at "INFO - finding looplift candidates" after one epoch of training. This log message seems to come from numba, but I am not able to locate exactly where it is emitted. Has anyone else run into the same problem?

See #203 (comment) for the solution.

@Charrrrrlie

I also ran into this problem, and I wonder whether you have fixed it yet?

@wusize
Author

wusize commented Dec 3, 2021

Nope. Maybe you can try mmdetection3d.

@tianweiy
Owner

tianweiy commented Dec 3, 2021

I also have no clue (my school server only has CUDA 10.0, so I am still using torch 1.1.0 for training, and there is no issue with that version). Based on a few other recent issues, I suspect the problem is related to Apex, so you may want to replace the Apex syncbn with torch's sync BN. I will check this further after I finish the semester (in two weeks).

@Charrrrrlie

I think I fixed it tonight with CUDA 10.1, torch==1.4.0, and numba==0.53.1.
It is also worth mentioning that the iou3d_nms module cannot be built directly from the .sh script, since that script changes the root path(?).

Thanks again for the prompt replies @wusize @tianweiy !!!

@kagecom

kagecom commented Dec 18, 2021

I also ran into this problem.

How can it be fixed?

@Charrrrrlie

I also ran into this problem.

How can it be fixed?

I think the numba version may be incompatible with the other dependencies. You can try the version combination I listed above, or carefully follow the author's instructions for each package.

@kagecom

kagecom commented Dec 18, 2021

But there is no torch==1.4.0 build for CUDA 10.1, i.e. no torch==1.4.0+cu101 package.

@Charrrrrlie

It's said that "PyTorch 1.4.0 shipped with CUDA 10.1 by default, so there is no separate package with the cu101 suffix; those suffixes are only used for alternative CUDA versions."
I suggest installing it with conda; you can find the command on pytorch.org.

@tianweiy
Owner

The problem is related to DDP in recent torch versions.

It should be fixed now: e30f768

You should also be able to use the most recent torch versions. Let me know if there are any further problems.

@kagecom

kagecom commented Dec 19, 2021

I am still stuck in the training process after merging the last two commits, https://github.com/tianweiy/CenterPoint/commit/e30f768a36427029b1fa055563583aafd9b58db2 and https://github.com/tianweiy/CenterPoint/commit/a32fb02723011c84e500e16991b7ede43c8b5097.

My environment is torch 1.7.0+cu101 on a V100-SXM2 16G.

@tianweiy
Owner

tianweiy commented Dec 19, 2021

Oh, interesting, do you get a timeout error? I also noticed a fairly large delay between epochs, but training does proceed after some time.

Could you try a simple example? I just pushed a new cfg to simulate the training process.

Could you run

python -m torch.distributed.launch --nproc_per_node 2 tools/train.py configs/mvp/nusc_centerpoint_voxelnet_0075voxel_fix_bn_z_scale_debug.py

It will only take a minute or so for one epoch. I want to know whether you still get stuck with this cfg.
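
If it does turn out to be a timeout, one thing to experiment with (just a sketch of an idea, not something the repo currently does) is passing a longer timeout when the process group is initialized, so a long pause between epochs does not trip the default 30-minute limit:

    import datetime
    import torch.distributed as dist

    # Sketch: raise the collective timeout above the 30-minute default.
    dist.init_process_group(backend="nccl", timeout=datetime.timedelta(hours=2))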

@kagecom

kagecom commented Dec 19, 2021

Oh, interesting, do you get a timeout error? I also noticed a fairly large delay between epochs, but training does proceed after some time.

Could you try a simple example? I just pushed a new cfg to simulate the training process.

Could you run

python -m torch.distributed.launch --nproc_per_node 2 tools/train.py configs/mvp/nusc_centerpoint_voxelnet_0075voxel_fix_bn_z_scale_debug.py

It will only take a minute or so for one epoch. I want to know whether you still get stuck with this cfg.

Hi there, I use the Waymo dataset, so I don't know how your debug setting differs.

But when I tested training on the Waymo dataset with load_interval=1000, the hang disappeared.

I don't know why.

@tianweiy
Owner

Got it. Yeah, I only changed the interval to subsample the dataset.

Hmm, weird then. Maybe just use torch 1.4 if that works for your case. I will look into this further.

@kagecom

kagecom commented Dec 19, 2021

It gets stuck when load_interval = 5, but works when load_interval = 1000.

It confuses me.

@tianweiy
Owner

tianweiy commented Dec 19, 2021

Thanks. Another thing you can try is:

  1. changing

    model = apex.parallel.convert_syncbn_model(model)

to

    model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)

  2. using spconv 2.x

I am now running a few experiments to see if there are any performance differences due to these two changes and will update soon. (Update: results with spconv 2.x + torch nn syncbn are similar to the original version.)

I am able to train the full nuScenes dataset with 8-GPU DDP (Titan V) and the latest torch (1.10.1 + CUDA 11.3).
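
For context, a minimal sketch of where this swap sits in a typical DDP setup (illustrative only; build_model and local_rank are placeholders, not CenterPoint's exact code):

    import torch.nn as nn
    from torch.nn.parallel import DistributedDataParallel

    model = build_model()  # placeholder for however the detector is constructed
    # Native converter instead of apex.parallel.convert_syncbn_model.
    model = nn.SyncBatchNorm.convert_sync_batchnorm(model)
    model = model.cuda()
    model = DistributedDataParallel(model, device_ids=[local_rank])  # local_rank comes from the launcher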

@kagecom

kagecom commented Dec 26, 2021

I have tried several combinations but still get stuck.

The only thing that works for me is to set the environment variable NCCL_BLOCKING_WAIT=1 when starting the training process.

However, it slows down training, and I don't know why.
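
For reference, a minimal sketch of how to set it (the exact placement in train.py is my own assumption): either prefix the launch command with NCCL_BLOCKING_WAIT=1, or set it in the script before the process group is created:

    import os

    # Must run before torch.distributed.init_process_group() creates the NCCL communicator.
    os.environ["NCCL_BLOCKING_WAIT"] = "1"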

@tianweiy
Owner

tianweiy commented Jan 3, 2022

Maybe it is related to multiprocessing; you could try adding a few lines like these to train.py before initializing DDP:

    import torch.multiprocessing as mp

    if mp.get_start_method(allow_none=True) is None:
        mp.set_start_method('spawn')

No clue whether this works or not, though, because I just could not reproduce your error.

@zzm-hl

zzm-hl commented Apr 23, 2022

I have the same problem. Have you solved it? I guess it might be related to num_workers?

@Liaoqing-up

I have the same problem!!!!

@Liaoqing-up

I have the same problem. Have you solved it? I guess it might be related to num_workers?

Do you have any ideas?

@Liaoqing-up

Hello, I am confused about the effect of load_interval... Can you explain what this parameter means?

@tianweiy
Owner

load_interval is probably not the root cause. It defines how we subsample the dataset (with 10, we use 1/10 of the dataset).

Unfortunately, I am not able to reproduce this issue... Also see #314
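
To illustrate the subsampling (a sketch of the idea only; the actual dataset code may differ), load_interval simply strides over the list of frame infos:

    # Hypothetical example: keep every `load_interval`-th frame info.
    load_interval = 10
    infos = list(range(100))        # stand-in for the per-frame info list
    infos = infos[::load_interval]  # 100 frames -> 10 frames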

@Liaoqing-up

Liaoqing-up commented May 16, 2022

Thanks. Another thing you can try is:

  1. changing

    model = apex.parallel.convert_syncbn_model(model)

to

    model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)

  2. using spconv 2.x

I am now running a few experiments to see if there are any performance differences due to these two changes and will update soon. (Update: results with spconv 2.x + torch nn syncbn are similar to the original version.)

I am able to train the full nuScenes dataset with 8-GPU DDP (Titan V) and the latest torch (1.10.1 + CUDA 11.3).

I am still stuck with way 1..... TAT

@Liaoqing-up

load_interval is probably not the root cause. It defines how we subsample the dataset (with 10, we use 1/10 of the dataset).

Unfortunately, I am not able to reproduce this issue... Also see #314

I guess the problem comes from 'finding looplift candidates'; I always get stuck at this step after training an epoch. Do you know what that means?

@tianweiy
Owner

It has nothing to do with finding looplift candidates.

That message is just a byproduct of starting a new epoch.

Unfortunately, I don't know what the root cause is (some people hit this issue and some don't...).

@Liaoqing-up

I tried changing load_interval from 1 to 100 just now, and it seems there is no hang anymore.

@kagecom

kagecom commented May 16, 2022

I tried changing load_interval from 1 to 100 just now, and it seems there is no hang anymore.

I have tried several ways, including changing load_interval as mentioned above in this issue.

I suggest you try:

"I have tried several combinations but still get stuck.

The only thing that works for me is to set the environment variable NCCL_BLOCKING_WAIT=1 when starting the training process.

However, it slows down training, and I don't know why."

@Liaoqing-up

I tried changing load_interval from 1 to 100 just now, and it seems there is no hang anymore.

I have tried several ways, including changing load_interval as mentioned above in this issue.

I suggest you try:

"I have tried several combinations but still get stuck.

The only thing that works for me is to set the environment variable NCCL_BLOCKING_WAIT=1 when starting the training process.

However, it slows down training, and I don't know why."

OK, I'll try. Thank you~


@Liaoqing-up

NCCL_BLOCKING_WAIT=1

Hello, I'm back~ I have tried this approach for training recently; it seems there is no hang anymore, and the speed seems normal.

@tianweiy
Owner

wow, that's amazing!

@tianweiy
Owner

This one may also be relevant for other people: pytorch/pytorch#50820

@tianweiy tianweiy changed the title Stuck at "INFO - finding looplift candidates" Get stuck after one epoch of training (Multi GPU DDP) See this! May 22, 2022
@tianweiy tianweiy pinned this issue May 22, 2022
@dk-liang

I tried using NCCL_BLOCKING_WAIT=1, but it does not work for me.
