Got stuck when training with multiple GPUs using dist_train.sh #696
Comments
Thanks. I have the same problem, and I solved it using your method.
@sshaoshuai After fixing the bug this way, the tcp_port argument is not actually used.
Thank you for the bug report. It has been fixed in #784. Can you double-check whether it works now?
@sshaoshuai Thanks for your work. It's OK now.
For single-machine multi-GPU training, I also modified init_dist_pytorch:

import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def init_dist_pytorch(tcp_port, local_rank, backend='nccl'):
    if mp.get_start_method(allow_none=True) is None:
        mp.set_start_method('spawn')
    num_gpus = torch.cuda.device_count()
    # No explicit init_method, so init_process_group falls back to env:// and reads
    # MASTER_ADDR / MASTER_PORT / RANK / WORLD_SIZE from the launcher's environment.
    dist.init_process_group(
        backend=backend,
    )
    rank = dist.get_rank()
    torch.cuda.set_device(rank % num_gpus)
    return num_gpus, rank
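A note on why this works and why tcp_port ends up unused: without an init_method, init_process_group defaults to env://, which takes MASTER_ADDR, MASTER_PORT, RANK, and WORLD_SIZE from environment variables exported by the launcher (torchrun, or torch.distributed.launch with --use_env). A minimal standalone sketch of that flow, not OpenPCDet code, with an illustrative script name:

import os

import torch
import torch.distributed as dist


def main():
    # LOCAL_RANK is exported per worker by torchrun / torch.distributed.launch --use_env.
    local_rank = int(os.environ['LOCAL_RANK'])
    torch.cuda.set_device(local_rank)

    # Equivalent to init_method='env://': the rendezvous address and the rank /
    # world size all come from the environment, not from a tcp:// URL.
    dist.init_process_group(backend='nccl')

    print(f'rank {dist.get_rank()} of {dist.get_world_size()} initialized')
    dist.destroy_process_group()


if __name__ == '__main__':
    main()

Launched, for example, with: torchrun --nproc_per_node=4 check_ddp.py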
@sshaoshuai After I uncommented the lines mentioned in #784 (comment), it works.
I have submitted a new PR to solve this issue in #815. Please pull the latest master branch if you still get stuck when training with dist_train.sh.
So what is the cause of this hang? I also ran into it and will try your approach...
All child processes get stuck when training with multiple GPUs using dist_train.sh
With CUDA 11.3 and PyTorch 1.10
After some debugging, I found it was stuck at https://github.com/open-mmlab/OpenPCDet/blob/master/pcdet/utils/common_utils.py#L166-L171
I modified the code there, and it worked.
I'm curious why this is the case; if someone else is having the same problem, you can try the same change.
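For anyone landing here: the linked lines are inside init_dist_pytorch in pcdet/utils/common_utils.py. Judging from this thread (in particular the note above that tcp_port is no longer used after the fix), the change amounts to dropping the explicit tcp:// rendezvous and relying on the env:// default instead. A hedged sketch of that kind of edit, with hypothetical helper names and an assumed shape for the original call:

import torch
import torch.distributed as dist


def init_pg_with_tcp(tcp_port, local_rank, backend='nccl'):
    # Assumed shape of the original block: an explicit TCP rendezvous on tcp_port.
    num_gpus = torch.cuda.device_count()
    dist.init_process_group(
        backend=backend,
        init_method='tcp://127.0.0.1:%d' % tcp_port,
        rank=local_rank,
        world_size=num_gpus,
    )


def init_pg_with_env(backend='nccl'):
    # The workaround described here: use the env:// default, so MASTER_ADDR,
    # MASTER_PORT, RANK, and WORLD_SIZE come from the launcher's environment
    # and tcp_port is never consulted.
    dist.init_process_group(backend=backend)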