Got stuck when training with multiple GPUs using dist_train.sh #696
Comments
Thanks. I have the same problem, and I solved it using your method.
@sshaoshuai After fixing the bug this way, the tcp_port argument is not actually used.
Thank you for the bug report. It has been fixed in #784. Can you double-check whether it works now?
@sshaoshuai Thanks for your work. It's OK now.
For single-machine multi-GPU training, I also modified init_dist_pytorch:

import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def init_dist_pytorch(tcp_port, local_rank, backend='nccl'):
    if mp.get_start_method(allow_none=True) is None:
        mp.set_start_method('spawn')
    num_gpus = torch.cuda.device_count()
    # No explicit init_method, so init_process_group falls back to env:// and reads
    # MASTER_ADDR / MASTER_PORT / RANK / WORLD_SIZE from the launcher's environment.
    dist.init_process_group(
        backend=backend,
    )
    rank = dist.get_rank()
    torch.cuda.set_device(rank % num_gpus)
    return num_gpus, rank
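A note on why this works and why tcp_port ends up unused: without an init_method, init_process_group defaults to env://, which takes MASTER_ADDR, MASTER_PORT, RANK, and WORLD_SIZE from environment variables exported by the launcher (torchrun, or torch.distributed.launch with --use_env). A minimal standalone sketch of that flow, not OpenPCDet code, with an illustrative script name:

import os

import torch
import torch.distributed as dist


def main():
    # LOCAL_RANK is exported per worker by torchrun / torch.distributed.launch --use_env.
    local_rank = int(os.environ['LOCAL_RANK'])
    torch.cuda.set_device(local_rank)

    # Equivalent to init_method='env://': the rendezvous address and the rank /
    # world size all come from the environment, not from a tcp:// URL.
    dist.init_process_group(backend='nccl')

    print(f'rank {dist.get_rank()} of {dist.get_world_size()} initialized')
    dist.destroy_process_group()


if __name__ == '__main__':
    main()

Launched, for example, with: torchrun --nproc_per_node=4 check_ddp.py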
@sshaoshuai After I uncommented the lines mentioned in #784 (comment), it works.
I have submitted a new PR to solve this issue in #815. Please pull the latest master branch if you still get stuck when training with dist_train.sh.
So what is the cause of this hang? I also ran into it and will try your approach...
All child processes get stuck when training with multiple GPUs using dist_train.sh
With CUDA 11.3 and PyTorch 1.10
After some debugging, I found it was stuck at https://github.com/open-mmlab/OpenPCDet/blob/master/pcdet/utils/common_utils.py#L166-L171
I modified the code there, and it worked.
I'm curious why this is the case; if someone else is having the same problem, you can try the same change.
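For anyone landing here: the linked lines are inside init_dist_pytorch in pcdet/utils/common_utils.py. Judging from this thread (in particular the note above that tcp_port is no longer used after the fix), the change amounts to dropping the explicit tcp:// rendezvous and relying on the env:// default instead. A hedged sketch of that kind of edit, with hypothetical helper names and an assumed shape for the original call:

import torch
import torch.distributed as dist


def init_pg_with_tcp(tcp_port, local_rank, backend='nccl'):
    # Assumed shape of the original block: an explicit TCP rendezvous on tcp_port.
    num_gpus = torch.cuda.device_count()
    dist.init_process_group(
        backend=backend,
        init_method='tcp://127.0.0.1:%d' % tcp_port,
        rank=local_rank,
        world_size=num_gpus,
    )


def init_pg_with_env(backend='nccl'):
    # The workaround described here: use the env:// default, so MASTER_ADDR,
    # MASTER_PORT, RANK, and WORLD_SIZE come from the launcher's environment
    # and tcp_port is never consulted.
    dist.init_process_group(backend=backend)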