I'm training on 3 A100 GPUs with 40 GB of memory each. Training is not starting; what could be the issue? I have included the error report below.
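For reference, the job runs on a single node with three workers. Based on the torchrun log below, the launch is roughly equivalent to the following (the exact wrapper script and the `--config` flag name are my assumptions):

```bash
# Rough reconstruction of the launch from the log below; wrapper script and flag names may differ.
torchrun --nproc_per_node=3 train.py --config /nvdiffrecmc/configs/manual/shoe.json
```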
DATA_PATH : SHOG2TA2CKZ7ERNU
COLMAP_PATH : /usr/local/bin/colmap
CONFIG_PATH : /nvdiffrecmc/configs/manual/shoe.json
NUMBER OF GPUS: 3
TRAINING STARTED..
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
INFO:torch.distributed.launcher.api:Starting elastic_operator with launch configs:
entrypoint : train.py
min_nodes : 1
max_nodes : 1
nproc_per_node : 3
run_id : none
rdzv_backend : static
rdzv_endpoint : 127.0.0.1:29500
rdzv_configs : {'rank': 0, 'timeout': 900}
max_restarts : 0
monitor_interval : 5
log_dir : None
metrics_cfg : {}
INFO:torch.distributed.elastic.agent.server.local_elastic_agent:log directory set to: /tmp/torchelastic_fjb7yrke/none_gz0e2q6p
INFO:torch.distributed.elastic.agent.server.api:[default] starting workers for entrypoint: python3.8
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
restart_count=0
master_addr=127.0.0.1
master_port=29500
group_rank=0
group_world_size=1
local_ranks=[0, 1, 2]
role_ranks=[0, 1, 2]
global_ranks=[0, 1, 2]
role_world_sizes=[3, 3, 3]
global_world_sizes=[3, 3, 3]
INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
/opt/conda/lib/python3.8/site-packages/tinycudann/modules.py:52: UserWarning: tinycudann was built for lower compute capability (70) than the system's (80). Performance may be suboptimal.
warnings.warn(f"tinycudann was built for lower compute capability ({cc}) than the system's ({system_compute_capability}). Performance may be suboptimal.")
Using /root/.cache/torch_extensions/py38_cu113 as PyTorch extensions root...
/opt/conda/lib/python3.8/site-packages/tinycudann/modules.py:52: UserWarning: tinycudann was built for lower compute capability (70) than the system's (80). Performance may be suboptimal.
warnings.warn(f"tinycudann was built for lower compute capability ({cc}) than the system's ({system_compute_capability}). Performance may be suboptimal.")
Using /root/.cache/torch_extensions/py38_cu113 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/py38_cu113/optixutils_plugin/build.ninja...
Building extension module optixutils_plugin...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module optixutils_plugin...
/opt/conda/lib/python3.8/site-packages/tinycudann/modules.py:52: UserWarning: tinycudann was built for lower compute capability (70) than the system's (80). Performance may be suboptimal.
warnings.warn(f"tinycudann was built for lower compute capability ({cc}) than the system's ({system_compute_capability}). Performance may be suboptimal.")
Using /root/.cache/torch_extensions/py38_cu113 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/py38_cu113/optixutils_plugin/build.ninja...
Building extension module optixutils_plugin...
Config / Flags:
---------
iter 500
batch 4
spp 1
layers 1
train_res [2048, 2048]
display_res [2048, 2048]
texture_res [2048, 2048]
display_interval 0
save_interval 100
learning_rate [0.03, 0.005]
custom_mip False
background white
loss logl1
out_dir out/SHOG2TA2CKZ7ERNU
config /nvdiffrecmc/configs/manual/shoe.json
ref_mesh SHOG2TA2CKZ7ERNU
base_mesh None
validate True
n_samples 12
bsdf pbr
denoiser bilateral
denoiser_demodulate True
save_custom 3D/vertical/footwear
vertical Footwear
mtl_override None
dmtet_grid 128
mesh_scale 2.5
envlight None
env_scale 1.0
probe_res 256
learn_lighting True
display [{'bsdf': 'kd'}, {'bsdf': 'ks'}, {'bsdf': 'normal'}]
transparency False
lock_light False
lock_pos False
sdf_regularizer 0.2
laplace relative
laplace_scale 3000.0
pre_load True
no_perturbed_nrm False
decorrelated False
kd_min [0.03, 0.03, 0.03]
kd_max [0.8, 0.8, 0.8]
ks_min [0, 0.08, 0]
ks_max [0, 1, 1]
nrm_min [-1.0, -1.0, 0.0]
nrm_max [1.0, 1.0, 1.0]
clip_max_norm 0.0
cam_near_far [0.1, 1000.0]
lambda_kd 0.1
lambda_ks 0.05
lambda_nrm 0.025
lambda_nrm2 0.25
lambda_chroma 0.025
lambda_diffuse 0.15
lambda_specular 0.0025
local_rank 0
multi_gpu True
random_textures True
---------
DatasetLLFF: 92 images with shape [1080, 1920]
DatasetLLFF: auto-centering at [ 0.24934715 0.38134477 -0.13031025]
/opt/conda/lib/python3.8/site-packages/torch/functional.py:478: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at ../aten/src/ATen/native/TensorShape.cpp:2894.)
return _VF.meshgrid(tensors, **kwargs) # type: ignore[attr-defined]
Cuda path /usr/local/cuda
/opt/conda/lib/python3.8/site-packages/torch/functional.py:478: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at ../aten/src/ATen/native/TensorShape.cpp:2894.)
return _VF.meshgrid(tensors, **kwargs) # type: ignore[attr-defined]
Cuda path /usr/local/cuda
/opt/conda/lib/python3.8/site-packages/torch/functional.py:478: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at ../aten/src/ATen/native/TensorShape.cpp:2894.)
return _VF.meshgrid(tensors, **kwargs) # type: ignore[attr-defined]
Cuda path /usr/local/cuda
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 5527 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 5529 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Unable to shutdown process 5527 via 15, forcefully exitting via 9
WARNING:torch.distributed.elastic.multiprocessing.api:Unable to shutdown process 5529 via 15, forcefully exitting via 9
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -11) local_rank: 1 (pid: 5528) of binary: /opt/conda/bin/python3.8
INFO:torch.distributed.elastic.agent.server.api:Local worker group finished (FAILED). Waiting 300 seconds for other agents to finish
INFO:torch.distributed.elastic.agent.server.api:Done waiting for other agents. Elapsed: 0.0007698535919189453 seconds
INFO:torch.distributed.elastic.multiprocessing.errors:local_rank 1 FAILED with no error file. Decorate your entrypoint fn with @record for traceback info. See: https://pytorch.org/docs/stable/elastic/errors.html
Traceback (most recent call last):
  File "/opt/conda/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py", line 761, in main
    run(args)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py", line 752, in run
    elastic_launch(
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
======================================================
train.py FAILED
------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2023-03-16_11:59:10
host : fea5089b80d7
rank : 1 (local_rank: 1)
exitcode : -11 (pid: 5528)
error_file: <N/A>
traceback : Signal 11 (SIGSEGV) received by PID 5528
======================================================
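Since rank 1 dies with SIGSEGV before any Python traceback is written, the log itself suggests decorating the entrypoint with @record so that an error file with the traceback gets recorded on the next run. A minimal sketch of what that would look like in train.py (the function body is a placeholder, not nvdiffrecmc's actual code):

```python
# Minimal sketch of the @record suggestion printed by torch.distributed.elastic above.
from torch.distributed.elastic.multiprocessing.errors import record

@record
def main():
    ...  # placeholder for the existing training logic in train.py

if __name__ == "__main__":
    main()
```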