I'm training on 3 A100 GPUs with 40 GB of memory each. Training is not starting; what could be the issue? I have included the error report below.
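For reference, the job runs on a single node with three workers. Based on the torchrun log below, the launch is roughly equivalent to the following (the exact wrapper script and the `--config` flag name are my assumptions):

```bash
# Rough reconstruction of the launch from the log below; wrapper script and flag names may differ.
torchrun --nproc_per_node=3 train.py --config /nvdiffrecmc/configs/manual/shoe.json
```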
DATA_PATH : SHOG2TA2CKZ7ERNU
COLMAP_PATH : /usr/local/bin/colmap
CONFIG_PATH : /nvdiffrecmc/configs/manual/shoe.json
NUMBER OF GPUS: 3
TRAINING STARTED..
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
INFO:torch.distributed.launcher.api:Starting elastic_operator with launch configs:
entrypoint : train.py
min_nodes : 1
max_nodes : 1
nproc_per_node : 3
run_id : none
rdzv_backend : static
rdzv_endpoint : 127.0.0.1:29500
rdzv_configs : {'rank': 0, 'timeout': 900}
max_restarts : 0
monitor_interval : 5
log_dir : None
metrics_cfg : {}
INFO:torch.distributed.elastic.agent.server.local_elastic_agent:log directory set to: /tmp/torchelastic_fjb7yrke/none_gz0e2q6p
INFO:torch.distributed.elastic.agent.server.api:[default] starting workers for entrypoint: python3.8
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
restart_count=0
master_addr=127.0.0.1
master_port=29500
group_rank=0
group_world_size=1
local_ranks=[0, 1, 2]
role_ranks=[0, 1, 2]
global_ranks=[0, 1, 2]
role_world_sizes=[3, 3, 3]
global_world_sizes=[3, 3, 3]
INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
/opt/conda/lib/python3.8/site-packages/tinycudann/modules.py:52: UserWarning: tinycudann was built for lower compute capability (70) than the system's (80). Performance may be suboptimal.
warnings.warn(f"tinycudann was built for lower compute capability ({cc}) than the system's ({system_compute_capability}). Performance may be suboptimal.")
Using /root/.cache/torch_extensions/py38_cu113 as PyTorch extensions root...
/opt/conda/lib/python3.8/site-packages/tinycudann/modules.py:52: UserWarning: tinycudann was built for lower compute capability (70) than the system's (80). Performance may be suboptimal.
warnings.warn(f"tinycudann was built for lower compute capability ({cc}) than the system's ({system_compute_capability}). Performance may be suboptimal.")
Using /root/.cache/torch_extensions/py38_cu113 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/py38_cu113/optixutils_plugin/build.ninja...
Building extension module optixutils_plugin...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module optixutils_plugin...
/opt/conda/lib/python3.8/site-packages/tinycudann/modules.py:52: UserWarning: tinycudann was built for lower compute capability (70) than the system's (80). Performance may be suboptimal.
warnings.warn(f"tinycudann was built for lower compute capability ({cc}) than the system's ({system_compute_capability}). Performance may be suboptimal.")
Using /root/.cache/torch_extensions/py38_cu113 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/py38_cu113/optixutils_plugin/build.ninja...
Building extension module optixutils_plugin...
Config / Flags:
---------
iter 500
batch 4
spp 1
layers 1
train_res [2048, 2048]
display_res [2048, 2048]
texture_res [2048, 2048]
display_interval 0
save_interval 100
learning_rate [0.03, 0.005]
custom_mip False
background white
loss logl1
out_dir out/SHOG2TA2CKZ7ERNU
config /nvdiffrecmc/configs/manual/shoe.json
ref_mesh SHOG2TA2CKZ7ERNU
base_mesh None
validate True
n_samples 12
bsdf pbr
denoiser bilateral
denoiser_demodulate True
save_custom 3D/vertical/footwear
vertical Footwear
mtl_override None
dmtet_grid 128
mesh_scale 2.5
envlight None
env_scale 1.0
probe_res 256
learn_lighting True
display [{'bsdf': 'kd'}, {'bsdf': 'ks'}, {'bsdf': 'normal'}]
transparency False
lock_light False
lock_pos False
sdf_regularizer 0.2
laplace relative
laplace_scale 3000.0
pre_load True
no_perturbed_nrm False
decorrelated False
kd_min [0.03, 0.03, 0.03]
kd_max [0.8, 0.8, 0.8]
ks_min [0, 0.08, 0]
ks_max [0, 1, 1]
nrm_min [-1.0, -1.0, 0.0]
nrm_max [1.0, 1.0, 1.0]
clip_max_norm 0.0
cam_near_far [0.1, 1000.0]
lambda_kd 0.1
lambda_ks 0.05
lambda_nrm 0.025
lambda_nrm2 0.25
lambda_chroma 0.025
lambda_diffuse 0.15
lambda_specular 0.0025
local_rank 0
multi_gpu True
random_textures True
---------
DatasetLLFF: 92 images with shape [1080, 1920]
DatasetLLFF: auto-centering at [ 0.24934715 0.38134477 -0.13031025]
/opt/conda/lib/python3.8/site-packages/torch/functional.py:478: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at ../aten/src/ATen/native/TensorShape.cpp:2894.)
return _VF.meshgrid(tensors, **kwargs) # type: ignore[attr-defined]
Cuda path /usr/local/cuda
/opt/conda/lib/python3.8/site-packages/torch/functional.py:478: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at ../aten/src/ATen/native/TensorShape.cpp:2894.)
return _VF.meshgrid(tensors, **kwargs) # type: ignore[attr-defined]
Cuda path /usr/local/cuda
/opt/conda/lib/python3.8/site-packages/torch/functional.py:478: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at ../aten/src/ATen/native/TensorShape.cpp:2894.)
return _VF.meshgrid(tensors, **kwargs) # type: ignore[attr-defined]
Cuda path /usr/local/cuda
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 5527 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 5529 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Unable to shutdown process 5527 via 15, forcefully exitting via 9
WARNING:torch.distributed.elastic.multiprocessing.api:Unable to shutdown process 5529 via 15, forcefully exitting via 9
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -11) local_rank: 1 (pid: 5528) of binary: /opt/conda/bin/python3.8
INFO:torch.distributed.elastic.agent.server.api:Local worker group finished (FAILED). Waiting 300 seconds for other agents to finish
INFO:torch.distributed.elastic.agent.server.api:Done waiting for other agents. Elapsed: 0.0007698535919189453 seconds
INFO:torch.distributed.elastic.multiprocessing.errors:local_rank 1 FAILED with no error file. Decorate your entrypoint fn with @record for traceback info. See: https://pytorch.org/docs/stable/elastic/errors.html
Traceback (most recent call last):
  File "/opt/conda/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py", line 761, in main
    run(args)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py", line 752, in run
    elastic_launch(
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
======================================================
train.py FAILED
------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2023-03-16_11:59:10
host : fea5089b80d7
rank : 1 (local_rank: 1)
exitcode : -11 (pid: 5528)
error_file: <N/A>
traceback : Signal 11 (SIGSEGV) received by PID 5528
======================================================
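Since rank 1 dies with SIGSEGV before any Python traceback is written, the log itself suggests decorating the entrypoint with @record so that an error file with the traceback gets recorded on the next run. A minimal sketch of what that would look like in train.py (the function body is a placeholder, not nvdiffrecmc's actual code):

```python
# Minimal sketch of the @record suggestion printed by torch.distributed.elastic above.
from torch.distributed.elastic.multiprocessing.errors import record

@record
def main():
    ...  # placeholder for the existing training logic in train.py

if __name__ == "__main__":
    main()
```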