
Get stuck after one epoch of training (Multi GPU DDP) See this! #203

Open

wusize opened this issue Oct 7, 2021 · 33 comments
Comments

@wusize

wusize commented Oct 7, 2021

When training with multiple GPUs, the program stops at "INFO - finding looplift candidates" after one epoch of training. This log message seems to come from numba, but I am not able to locate exactly where it is emitted. Has anyone else run into the same problem?

See #203 (comment) for the solution.

@Charrrrrlie

I also ran into this problem, and I wonder whether you have fixed it yet?

@wusize
Author

wusize commented Dec 3, 2021

Nope. Maybe you can try mmdetection3d.

@tianweiy
Owner

tianweiy commented Dec 3, 2021

I also have no clue (my school server only has CUDA 10.0, so I am still using torch 1.1.0 for training, and there is no issue with that version). Based on a few other recent issues, I suspect the problem is related to Apex, so you may want to replace the Apex syncbn with torch's sync BN. I will check this further after I finish the semester (in two weeks).

@Charrrrrlie

I think I fixed it tonight with CUDA 10.1, torch==1.4.0, and numba==0.53.1.
It is also worth mentioning that the iou3d_nms module cannot be built directly from the .sh script, since that script changes the root path(?).

Thanks again for the prompt replies @wusize @tianweiy !!!

@kagecom

kagecom commented Dec 18, 2021

I also ran into this problem.

How can it be fixed?

@Charrrrrlie

I also ran into this problem.

How can it be fixed?

I think the numba version may be incompatible with the other dependencies. You can try the version combination I listed above, or carefully follow the author's instructions for each package.

@kagecom

kagecom commented Dec 18, 2021

But there is no torch==1.4.0 build for CUDA 10.1, i.e. no torch==1.4.0+cu101 package.

@Charrrrrlie

It's said that "PyTorch 1.4.0 shipped with CUDA 10.1 by default, so there is no separate package with the cu101 suffix; those suffixes are only used for alternative CUDA versions."
I suggest installing it with conda; you can find the command on pytorch.org.

@tianweiy
Owner

The problem is related to DDP in recent torch versions.

It should be fixed now: e30f768

You should also be able to use the most recent torch versions. Let me know if there are any further problems.

@kagecom

kagecom commented Dec 19, 2021

I am still stuck in the training process after merging the last two commits, https://github.com/tianweiy/CenterPoint/commit/e30f768a36427029b1fa055563583aafd9b58db2 and https://github.com/tianweiy/CenterPoint/commit/a32fb02723011c84e500e16991b7ede43c8b5097.

My environment is torch 1.7.0+cu101 on a V100-SXM2 16G.

@tianweiy
Owner

tianweiy commented Dec 19, 2021

Oh, interesting, do you get a timeout error? I also noticed a fairly large delay between epochs, but training does proceed after some time.

Could you try a simple example? I just pushed a new cfg to simulate the training process.

Could you run

python -m torch.distributed.launch --nproc_per_node 2 tools/train.py configs/mvp/nusc_centerpoint_voxelnet_0075voxel_fix_bn_z_scale_debug.py

It will only take a minute or so for one epoch. I want to know whether you still get stuck with this cfg.
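
If it does turn out to be a timeout, one thing to experiment with (just a sketch of an idea, not something the repo currently does) is passing a longer timeout when the process group is initialized, so a long pause between epochs does not trip the default 30-minute limit:

    import datetime
    import torch.distributed as dist

    # Sketch: raise the collective timeout above the 30-minute default.
    dist.init_process_group(backend="nccl", timeout=datetime.timedelta(hours=2))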

@kagecom

kagecom commented Dec 19, 2021

Oh, interesting, do you get a timeout error? I also noticed a fairly large delay between epochs, but training does proceed after some time.

Could you try a simple example? I just pushed a new cfg to simulate the training process.

Could you run

python -m torch.distributed.launch --nproc_per_node 2 tools/train.py configs/mvp/nusc_centerpoint_voxelnet_0075voxel_fix_bn_z_scale_debug.py

It will only take a minute or so for one epoch. I want to know whether you still get stuck with this cfg.

Hi there, I use the Waymo dataset, so I don't know how your debug setting differs.

But when I tested training on the Waymo dataset with load_interval=1000, the hang disappeared.

I don't know why.

@tianweiy
Owner

Got it. Yeah, I only changed the interval to subsample the dataset.

Hmm, weird then. Maybe just use torch 1.4 if that works for your case. I will look into this further.

@kagecom

kagecom commented Dec 19, 2021

It gets stuck when load_interval = 5, but works when load_interval = 1000.

It confuses me.

@tianweiy
Owner

tianweiy commented Dec 19, 2021

Thanks. Another thing you can try is:

  1. changing

    model = apex.parallel.convert_syncbn_model(model)

to

    model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)

  2. using spconv 2.x

I am now running a few experiments to see if there are any performance differences due to these two changes and will update soon. (Update: results with spconv 2.x + torch nn syncbn are similar to the original version.)

I am able to train the full nuScenes dataset with 8-GPU DDP (Titan V) and the latest torch (1.10.1 + CUDA 11.3).
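
For context, a minimal sketch of where this swap sits in a typical DDP setup (illustrative only; build_model and local_rank are placeholders, not CenterPoint's exact code):

    import torch.nn as nn
    from torch.nn.parallel import DistributedDataParallel

    model = build_model()  # placeholder for however the detector is constructed
    # Native converter instead of apex.parallel.convert_syncbn_model.
    model = nn.SyncBatchNorm.convert_sync_batchnorm(model)
    model = model.cuda()
    model = DistributedDataParallel(model, device_ids=[local_rank])  # local_rank comes from the launcher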

@kagecom

kagecom commented Dec 26, 2021

I have tried several combinations but still get stuck.

The only thing that works for me is to set the environment variable NCCL_BLOCKING_WAIT=1 when starting the training process.

However, it slows down training, and I don't know why.
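
For reference, a minimal sketch of how to set it (the exact placement in train.py is my own assumption): either prefix the launch command with NCCL_BLOCKING_WAIT=1, or set it in the script before the process group is created:

    import os

    # Must run before torch.distributed.init_process_group() creates the NCCL communicator.
    os.environ["NCCL_BLOCKING_WAIT"] = "1"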

@tianweiy
Owner

tianweiy commented Jan 3, 2022

Maybe it is related to multiprocessing; you could try adding a few lines like these to train.py before initializing DDP:

    import torch.multiprocessing as mp

    if mp.get_start_method(allow_none=True) is None:
        mp.set_start_method('spawn')

No clue whether this works or not, though, because I just could not reproduce your error.

@zzm-hl

zzm-hl commented Apr 23, 2022

I have the same problem. Have you solved it? I guess it might be related to num_workers?

@Liaoqing-up

I have the same problem!!!!

@Liaoqing-up

I have the same problem. Have you solved it? I guess it might be related to num_workers?

Do you have any ideas?

@Liaoqing-up

Hello, I am confused about the effect of load_interval... Can you explain what this parameter means?

@tianweiy
Owner

load_interval is probably not the root cause. It defines how we subsample the dataset (with 10, we use 1/10 of the dataset).

Unfortunately, I am not able to reproduce this issue... Also see #314
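
To illustrate the subsampling (a sketch of the idea only; the actual dataset code may differ), load_interval simply strides over the list of frame infos:

    # Hypothetical example: keep every `load_interval`-th frame info.
    load_interval = 10
    infos = list(range(100))        # stand-in for the per-frame info list
    infos = infos[::load_interval]  # 100 frames -> 10 frames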

@Liaoqing-up

Liaoqing-up commented May 16, 2022

Thanks. Another thing you can try is:

  1. changing

    model = apex.parallel.convert_syncbn_model(model)

to

    model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)

  2. using spconv 2.x

I am now running a few experiments to see if there are any performance differences due to these two changes and will update soon. (Update: results with spconv 2.x + torch nn syncbn are similar to the original version.)

I am able to train the full nuScenes dataset with 8-GPU DDP (Titan V) and the latest torch (1.10.1 + CUDA 11.3).

I am still stuck with way 1..... TAT

@Liaoqing-up

load_interval is probably not the root cause. It defines how we subsample the dataset (with 10, we use 1/10 of the dataset).

Unfortunately, I am not able to reproduce this issue... Also see #314

I guess the problem comes from 'finding looplift candidates'; I always get stuck at this step after training an epoch. Do you know what that means?

@tianweiy
Owner

It has nothing to do with finding looplift candidates.

That message is just a byproduct of starting a new epoch.

Unfortunately, I don't know what the root cause is (some people hit this issue and some don't...).

@Liaoqing-up

I tried changing load_interval from 1 to 100 just now, and it seems there is no hang anymore.

@kagecom

kagecom commented May 16, 2022

I tried changing load_interval from 1 to 100 just now, and it seems there is no hang anymore.

I have tried several ways, including changing load_interval as mentioned above in this issue.

I suggest you try:

"I have tried several combinations but still get stuck.

The only thing that works for me is to set the environment variable NCCL_BLOCKING_WAIT=1 when starting the training process.

However, it slows down training, and I don't know why."

@Liaoqing-up

I tried changing load_interval from 1 to 100 just now, and it seems there is no hang anymore.

I have tried several ways, including changing load_interval as mentioned above in this issue.

I suggest you try:

"I have tried several combinations but still get stuck.

The only thing that works for me is to set the environment variable NCCL_BLOCKING_WAIT=1 when starting the training process.

However, it slows down training, and I don't know why."

OK, I'll try. Thank you~


@Liaoqing-up

NCCL_BLOCKING_WAIT=1

Hello, I'm back~ I have tried this approach for training recently; it seems there is no hang anymore, and the speed seems normal.

@tianweiy
Owner

wow, that's amazing!

@tianweiy
Owner

This one may also be relevant for other people: pytorch/pytorch#50820

@tianweiy tianweiy changed the title Stuck at "INFO - finding looplift candidates" Get stuck after one epoch of training (Multi GPU DDP) See this! May 22, 2022
@tianweiy tianweiy pinned this issue May 22, 2022
@dk-liang

I tried using NCCL_BLOCKING_WAIT=1, but it does not work for me.
