Stage-2 Training Error #50

Open
Ha-coding-user opened this issue Apr 19, 2024 · 2 comments

@Ha-coding-user

Hello @RoboticsYimingLi, I am reproducing your fantastic VoxFormer. Stage-1 training completes without problems, but stage-2 fails with the error below.

My compute setup is:
4× Tesla A100 (80 GB)
CUDA 11.3, torch 1.10.1

Could you please suggest a solution? I am stuck and cannot make any progress.

Error Message

error in ms_deformable_im2col_cuda: an illegal memory access was encountered
Traceback (most recent call last):
File "./tools/train.py", line 261, in
main()
File "./tools/train.py", line 250, in main
custom_train_model(
File "/home/hamyo/SSC/mmdetection3d/VoxFormer/projects/mmdet3d_plugin/voxformer/apis/train.py", line 27, in custom_train_model
custom_train_detector(
File "/home/hamyo/SSC/mmdetection3d/VoxFormer/projects/mmdet3d_plugin/voxformer/apis/mmdet_train.py", line 200, in custom_train_detector
runner.run(data_loaders, cfg.workflow)
File "/home/hamyo/anaconda3/envs/voxformer/lib/python3.8/site-packages/mmcv/runner/epoch_based_runner.py", line 127, in run
epoch_runner(data_loaders[i], **kwargs)
File "/home/hamyo/anaconda3/envs/voxformer/lib/python3.8/site-packages/mmcv/runner/epoch_based_runner.py", line 50, in train
self.run_iter(data_batch, train_mode=True, **kwargs)
File "/home/hamyo/anaconda3/envs/voxformer/lib/python3.8/site-packages/mmcv/runner/epoch_based_runner.py", line 29, in run_iter
outputs = self.model.train_step(data_batch, self.optimizer,
File "/home/hamyo/anaconda3/envs/voxformer/lib/python3.8/site-packages/mmcv/parallel/distributed.py", line 52, in train_step
output = self.module.train_step(*inputs[0], **kwargs[0])
File "/home/hamyo/anaconda3/envs/voxformer/lib/python3.8/site-packages/mmdet/models/detectors/base.py", line 237, in train_step
losses = self(**data)
File "/home/hamyo/anaconda3/envs/voxformer/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/home/hamyo/SSC/mmdetection3d/VoxFormer/projects/mmdet3d_plugin/voxformer/detectors/voxformer.py", line 108, in forward
return self.forward_train(**kwargs)
File "/home/hamyo/anaconda3/envs/voxformer/lib/python3.8/site-packages/mmcv/runner/fp16_utils.py", line 98, in new_func
return old_func(*args, **kwargs)
File "/home/hamyo/SSC/mmdetection3d/VoxFormer/projects/mmdet3d_plugin/voxformer/detectors/voxformer.py", line 138, in forward_train
losses_pts = self.forward_pts_train(img_feats, img_metas, target)
File "/home/hamyo/SSC/mmdetection3d/VoxFormer/projects/mmdet3d_plugin/voxformer/detectors/voxformer.py", line 93, in forward_pts_train
outs = self.pts_bbox_head(img_feats, img_metas, target)
File "/home/hamyo/anaconda3/envs/voxformer/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/home/hamyo/SSC/mmdetection3d/VoxFormer/projects/mmdet3d_plugin/voxformer/dense_heads/voxformer_head.py", line 95, in forward
seed_feats = self.cross_transformer.get_vox_features(
File "/home/hamyo/anaconda3/envs/voxformer/lib/python3.8/site-packages/mmcv/runner/fp16_utils.py", line 98, in new_func
return old_func(*args, **kwargs)
File "/home/hamyo/SSC/mmdetection3d/VoxFormer/projects/mmdet3d_plugin/voxformer/modules/transformer.py", line 136, in get_vox_features
bev_embed = self.encoder(
File "/home/hamyo/anaconda3/envs/voxformer/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/home/hamyo/anaconda3/envs/voxformer/lib/python3.8/site-packages/mmcv/runner/fp16_utils.py", line 98, in new_func
return old_func(*args, **kwargs)
File "/home/hamyo/SSC/mmdetection3d/VoxFormer/projects/mmdet3d_plugin/voxformer/modules/encoder.py", line 205, in forward
output = layer(
File "/home/hamyo/anaconda3/envs/voxformer/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/home/hamyo/SSC/mmdetection3d/VoxFormer/projects/mmdet3d_plugin/voxformer/modules/encoder.py", line 372, in forward
query = self.attentions[attn_index](
File "/home/hamyo/anaconda3/envs/voxformer/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/home/hamyo/anaconda3/envs/voxformer/lib/python3.8/site-packages/mmcv/runner/fp16_utils.py", line 186, in new_func
return old_func(*args, **kwargs)
File "/home/hamyo/SSC/mmdetection3d/VoxFormer/projects/mmdet3d_plugin/voxformer/modules/deformable_cross_attention.py", line 171, in forward
slots[j, index_query_per_img] += queries[j, i, :len(index_query_per_img)]
RuntimeError: CUDA error: an illegal memory access was encountered
terminate called after throwing an instance of 'c10::CUDAError'
what(): CUDA error: an illegal memory access was encountered
Exception raised from create_event_internal at /opt/conda/conda-bld/pytorch_1639180588308/work/c10/cuda/CUDACachingAllocator.cpp:1211 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7fdede183d62 in /home/hamyo/anaconda3/envs/voxformer/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x1c613 (0x7fdf236a1613 in /home/hamyo/anaconda3/envs/voxformer/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x1a2 (0x7fdf236a2022 in /home/hamyo/anaconda3/envs/voxformer/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10::TensorImpl::release_resources() + 0xa4 (0x7fdede16d314 in /home/hamyo/anaconda3/envs/voxformer/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #4: <unknown function> + 0x295359 (0x7fdf776a3359 in /home/hamyo/anaconda3/envs/voxformer/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #5: <unknown function> + 0xadb231 (0x7fdf77ee9231 in /home/hamyo/anaconda3/envs/voxformer/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #6: THPVariable_subclass_dealloc(_object*) + 0x292 (0x7fdf77ee9532 in /home/hamyo/anaconda3/envs/voxformer/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #7: /home/hamyo/anaconda3/envs/voxformer/bin/python() [0x4d386f]
frame #8: /home/hamyo/anaconda3/envs/voxformer/bin/python() [0x4e55cb]
frame #9: /home/hamyo/anaconda3/envs/voxformer/bin/python() [0x4e55cb]
frame #10: /home/hamyo/anaconda3/envs/voxformer/bin/python() [0x4e0800]
frame #11: /home/hamyo/anaconda3/envs/voxformer/bin/python() [0x4f16a8]
frame #12: /home/hamyo/anaconda3/envs/voxformer/bin/python() [0x4f1691]
frame #13: /home/hamyo/anaconda3/envs/voxformer/bin/python() [0x4f1691]
frame #14: /home/hamyo/anaconda3/envs/voxformer/bin/python() [0x4f1691]
frame #15: /home/hamyo/anaconda3/envs/voxformer/bin/python() [0x4f1691]
frame #16: /home/hamyo/anaconda3/envs/voxformer/bin/python() [0x4f1691]
frame #17: /home/hamyo/anaconda3/envs/voxformer/bin/python() [0x4f1691]
frame #18: /home/hamyo/anaconda3/envs/voxformer/bin/python() [0x4f1691]
frame #19: /home/hamyo/anaconda3/envs/voxformer/bin/python() [0x4c9280]
frame #20: PyDict_SetItemString + 0x52 (0x5823a2 in /home/hamyo/anaconda3/envs/voxformer/bin/python)
frame #21: PyImport_Cleanup + 0x93 (0x5a7623 in /home/hamyo/anaconda3/envs/voxformer/bin/python)
frame #22: Py_FinalizeEx + 0x71 (0x5a6751 in /home/hamyo/anaconda3/envs/voxformer/bin/python)
frame #23: Py_RunMain + 0x112 (0x5a21f2 in /home/hamyo/anaconda3/envs/voxformer/bin/python)
frame #24: Py_BytesMain + 0x39 (0x57a799 in /home/hamyo/anaconda3/envs/voxformer/bin/python)
frame #25: __libc_start_main + 0xe7 (0x7fdfb1900c87 in /lib/x86_64-linux-gnu/libc.so.6)
frame #26: /home/hamyo/anaconda3/envs/voxformer/bin/python() [0x57a64d]

WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 53769 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 53771 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 53772 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 1 (pid: 53770) of binary: /home/hamyo/anaconda3/envs/voxformer/bin/python
Traceback (most recent call last):
File "/home/hamyo/anaconda3/envs/voxformer/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/hamyo/anaconda3/envs/voxformer/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/hamyo/anaconda3/envs/voxformer/lib/python3.8/site-packages/torch/distributed/launch.py", line 193, in
main()
File "/home/hamyo/anaconda3/envs/voxformer/lib/python3.8/site-packages/torch/distributed/launch.py", line 189, in main
launch(args)
File "/home/hamyo/anaconda3/envs/voxformer/lib/python3.8/site-packages/torch/distributed/launch.py", line 174, in launch
run(args)
File "/home/hamyo/anaconda3/envs/voxformer/lib/python3.8/site-packages/torch/distributed/run.py", line 710, in run
elastic_launch(
File "/home/hamyo/anaconda3/envs/voxformer/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/hamyo/anaconda3/envs/voxformer/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

./tools/train.py FAILED

Failures:
<NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
time : 2024-04-20_00:00:00
host : server-45
rank : 1 (local_rank: 1)
exitcode : -6 (pid: 53770)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 53770
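
A debugging sketch, not part of the original report: illegal memory accesses from ms_deformable_im2col_cuda are reported asynchronously, so the Python line in the traceback may not be the real fault site. Running with CUDA_LAUNCH_BLOCKING=1 makes kernel launches synchronous, and an explicit bounds check just before the indexing line in deformable_cross_attention.py can confirm whether out-of-range query indices are the cause. The variable names below are taken from the traceback; the helper itself is hypothetical.

```python
# Assumption: the illegal access comes from out-of-range indices in the
# deformable cross-attention scatter. Force synchronous launches so the
# traceback points at the kernel that actually faults.
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # must be set before CUDA is initialized

import torch

def check_index_bounds(slots: torch.Tensor, index_query_per_img: torch.Tensor) -> None:
    # Hypothetical helper: call it right before the failing line
    #   slots[j, index_query_per_img] += queries[j, i, :len(index_query_per_img)]
    # to verify every index is a valid position along dim 1 of `slots`.
    if index_query_per_img.numel() == 0:
        return
    lo = int(index_query_per_img.min())
    hi = int(index_query_per_img.max())
    assert 0 <= lo and hi < slots.shape[1], (
        f"query indices [{lo}, {hi}] out of range for slots of size {slots.shape[1]}"
    )
```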

@codeimdy

Hello, after completing the first phase of training, how should I proceed with the second phase? Are there any additional steps required, or do I only need to enter the training command in the terminal? Directly entering the training command results in an error, as shown in the picture.
[screenshot: Snipaste_2024-05-17_16-20-11]

@VitaLemonTea1

I got the same problem, but it was fixed after I changed the CUDA version to 11.1.
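
For anyone hitting the same thing: this symptom is consistent with a mismatch between the CUDA toolkit used to compile the deformable-attention extension and the CUDA build of PyTorch, which is why changing the toolkit version can resolve it. A quick way to inspect the installed combination (a generic sketch, not from the thread):

```python
# Print the PyTorch build and the CUDA version it was compiled against;
# the compiled custom ops must match this build, not just the system toolkit.
import torch

print(torch.__version__)          # e.g. 1.10.1
print(torch.version.cuda)         # CUDA version PyTorch was built with
print(torch.cuda.is_available())  # True if a usable GPU + driver is found
print(torch.cuda.get_device_name(0))
```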
