merge_sort: failed to synchronize during training #67

Open
gitouni opened this issue Sep 6, 2022 · 0 comments

gitouni commented Sep 6, 2022

Thank you for sharing your excellent work.
I ran into a CUDA merge_sort error while training on 3DMatch.

Environment:

Ubuntu 20.04
CUDA 11.1
Python 3.8
MinkowskiEngine v0.5.3

Command:

python train.py --voxel_size 0.05 --threed_match_dir threedmatch

Output:

INFO - 2022-09-06 10:57:42,406 - data_loaders - Resetting the data loader seed to 0
INFO - 2022-09-06 10:57:54,158 - trainer - Validation iter 101 / 400 : Data Loading Time: 0.054, Feature Extraction Time: 0.023, Matching Time: 0.035, Loss: 0.578, RTE: 1.279, RRE: 0.498, Hit Ratio: 0.071, Feat Match Ratio: 0.495
INFO - 2022-09-06 10:58:05,445 - trainer - Validation iter 201 / 400 : Data Loading Time: 0.052, Feature Extraction Time: 0.023, Matching Time: 0.034, Loss: 0.576, RTE: 1.186, RRE: 0.488, Hit Ratio: 0.067, Feat Match Ratio: 0.478
INFO - 2022-09-06 10:58:17,300 - trainer - Validation iter 301 / 400 : Data Loading Time: 0.056, Feature Extraction Time: 0.022, Matching Time: 0.035, Loss: 0.556, RTE: 1.133, RRE: 0.471, Hit Ratio: 0.073, Feat Match Ratio: 0.502
INFO - 2022-09-06 10:58:28,882 - trainer - Final Loss: 0.554, RTE: 1.140, RRE: 0.458, Hit Ratio: 0.072, Feat Match Ratio: 0.490
Traceback (most recent call last):
  File "train.py", line 81, in <module>
    main(config)
  File "train.py", line 57, in main
    trainer.train()
  File "/home/bit/CODE/Research/Point_Cloud_Reg/FCGF/lib/trainer.py", line 130, in train 
    self._train_epoch(epoch)
  File "/home/bit/CODE/Research/Point_Cloud_Reg/FCGF/lib/trainer.py", line 495, in _train_epoch
    loss.backward()
  File "/home/bit/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/tensor.py", line 245, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/home/bit/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/autograd/__init__.py", line 145, in backward
    Variable._execution_engine.run_backward( 
RuntimeError: merge_sort: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered

With torch.autograd.detect_anomaly() enabled, the forward call that triggers the error is reported as:

[W python_anomaly_mode.cpp:104] Warning: Error detected in IndexBackward. Traceback of forward call that caused the error:
  File "/home/bit/CODE/Research/Point_Cloud_Reg/FCGF/train.py", line 78, in <module>
    main(config)
  File "/home/bit/CODE/Research/Point_Cloud_Reg/FCGF/train.py", line 55, in main
    trainer.train()
  File "/home/bit/CODE/Research/Point_Cloud_Reg/FCGF/lib/trainer.py", line 130, in train
    self._train_epoch(epoch)
  File "/home/bit/CODE/Research/Point_Cloud_Reg/FCGF/lib/trainer.py", line 485, in _train_epoch
    pos_loss, neg_loss = self.contrastive_hardest_negative_loss(
  File "/home/bit/CODE/Research/Point_Cloud_Reg/FCGF/lib/trainer.py", line 447, in contrastive_hardest_negative_loss
    neg_loss1 = F.relu(self.neg_thresh - D10min[mask1]).pow(2)

I have already switched to the v0.5 branch, but the error still occurs and it really confuses me.
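
A shape/value check in front of that line would at least turn the deferred CUDA crash into a readable Python error. This is only a rough sketch: dist, mask, and neg_thresh stand in for D10min, mask1, and self.neg_thresh from the traceback, and it assumes D10min is a 1-D distance tensor with mask1 a boolean mask of the same length.

import torch
import torch.nn.functional as F

def checked_neg_loss(dist: torch.Tensor, mask: torch.Tensor, neg_thresh: float) -> torch.Tensor:
    # Fail early with a readable assertion instead of a deferred
    # cudaErrorIllegalAddress from an out-of-range / mismatched index.
    assert mask.dtype == torch.bool and mask.shape == dist.shape, \
        f"mask/distance mismatch: {tuple(mask.shape)} ({mask.dtype}) vs {tuple(dist.shape)}"
    assert torch.isfinite(dist).all(), "distance tensor contains NaN/Inf"
    return F.relu(neg_thresh - dist[mask]).pow(2)

If the mismatch assertion fires, the problem is in how mask1 is built; if it never fires, the illegal address is more likely coming from an earlier kernel, since CUDA reports errors asynchronously.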
