Bare-metal multi-GPU training fails launching subprocesses due to unexpected args #13

Open · PatrickESA opened this issue Nov 20, 2024 · 5 comments
Labels: bug (Something isn't working), training

@PatrickESA

What happened?

Launching

HYDRA_FULL_ERROR=1 ANEMOI_BASE_SEED=1 anemoi-training train --config-name happy_little_config --config-dir=/pathToConfigs/config

for training models in a multi-GPU setup (on a single machine, outside of a SLURM environment) fails to launch sub-processes successfully. Specifically, the first sub-process is initiated properly, but the subsequent processes error out (see: Relevant log output).
Training runs successfully with a single-GPU setup but fails when using multiple devices. The desired behavior is for multi-GPU training to work outside of a SLURM environment. The issue may need further investigation, but it appears related to how the additional ranks are re-launched: the child command contains arguments the CLI does not recognize (see the log output below).

What are the steps to reproduce the bug?

On bare metal (i.e., without using any resource scheduler):

In the configuration, when using

hardware:
  num_gpus_per_node: 1
  num_nodes: 1
  num_gpus_per_model: 1

the process launches successfully and trains as expected. However, when changing to

hardware:
  num_gpus_per_node: 2
  num_nodes: 1
  num_gpus_per_model: 1

or

hardware:
  num_gpus_per_node: 2
  num_nodes: 1
  num_gpus_per_model: 2

the process crashes during creation of the child processes, which fail while parsing arguments they do not recognize.
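For context, below is a minimal, self-contained sketch of the suspected failure mode; it is not anemoi-training code, and the parser is purely illustrative. When launching DDP outside SLURM, PyTorch Lightning's subprocess launcher re-invokes the original command for each additional rank and, for a Hydra run, appends overrides such as hydra.run.dir=..., hydra.job.name=train_ddp_process_1 and hydra.output_subdir=null (these are visible in the log output below). A CLI that parses its arguments with a fixed argparse parser rejects those extra positionals and exits with code 2, matching the "terminated with code 2" message in the log.

# Illustrative reproduction of the failure mode only; this is NOT the
# anemoi-training CLI, just a fixed argparse parser standing in for it.
import argparse
import sys

parser = argparse.ArgumentParser(prog="train-sketch")
parser.add_argument("--config-name")
parser.add_argument("--config-dir")

# Simulated argv of a re-launched DDP child: the original options plus the
# Hydra overrides appended by Lightning (values copied from the log below).
child_argv = [
    "--config-name", "happy_little_config",
    'hydra.run.dir="outputs/2024-11-19/16-43-03"',
    "hydra.job.name=train_ddp_process_1",
    "hydra.output_subdir=null",
]

try:
    parser.parse_args(child_argv)
except SystemExit as exc:
    # argparse prints "error: unrecognized arguments: ..." and exits with
    # status 2, matching the child process exit code in the log.
    print(f"child exit code: {exc.code}", file=sys.stderr)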

Version

anemoi-training 0.3.0 (from pip)

Platform (OS and architecture)

Linux eohpc-phigpu27 5.4.0-125-generic #141-Ubuntu SMP Wed Aug 10 13:42:03 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

Relevant log output

...
[2024-11-19 16:44:28,847][anemoi.models.layers.attention][WARNING] - Flash attention not available, falling back to pytorch scaled_dot_product_attention
[2024-11-19 16:44:33,689][anemoi.training.train.forecaster][INFO] - Pressure level scaling: use scaler ReluPressureLevelScaler with slope 0.0010 and minimum 0.20 

[2024-11-19 16:44:35,481][lightning_fabric.utilities.distributed][INFO] - Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/2
usage: .__main__.py-train [--help] [--hydra-help] [--version] [--cfg {job,hydra,all}] [--resolve] [--package PACKAGE] [--run] [--multirun] [--shell-completion] [--config-path CONFIG_PATH] [--config-name CONFIG_NAME] [--config-dir CONFIG_DIR] [--experimental-rerun EXPERIMENTAL_RERUN] [--info [{all,config,defaults,defaults-tree,plugins,searchpath}]] [overrides ...]
.anemoi-training-train: error: unrecognized arguments: hydra.run.dir="outputs/2024-11-19/16-43-03" hydra.job.name=train_ddp_process_1 hyra.output_subdir=null

[2024-11-19 16:44:43,698][lightning_fabric.strategies.launchers.subprocess_script][INFO] - [rank: 1] Child process with PID 763811 terminated with code 2. Forcefully terminating all other processes to avoid zombies 🧟 
Killed

Accompanying data

No response

Organisation

No response

PatrickESA added the bug (Something isn't working) label on Nov 20, 2024

@PatrickESA (Author)

In case additional platform information is relevant:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.82.01    Driver Version: 470.82.01    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA RTX A6000    On   | 00000000:00:05.0 Off |                    0 |
| 30%   37C    P8    27W / 300W |      0MiB / 45634MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA RTX A6000    On   | 00000000:00:06.0 Off |                    0 |
| 30%   36C    P8    29W / 300W |      0MiB / 45634MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

@ssmmnn11 (Member)

Seems to be similar to what was fixed here: ecmwf/anemoi-training#82 (comment)

@gmertes (Member) commented Nov 21, 2024

Can you try running without --config-dir and report back?

Just cd into the directory where happy_little_config.yaml exists, and then run:

HYDRA_FULL_ERROR=1 ANEMOI_BASE_SEED=1 anemoi-training train --config-name=happy_little_config

We did fix multi-GPU training in ecmwf/anemoi-training#82 and that is merged into 0.3, but we may have a regression somewhere.
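(As a possible diagnostic, and only a sketch on my end: printing sys.argv at the very top of the training entry point should reveal exactly which overrides Lightning appends when it re-launches the command for the extra ranks. The snippet below is illustrative and not part of anemoi-training.)

# Illustrative diagnostic, not part of anemoi-training: dump the argv that
# each process receives before any argument parsing happens.
import os
import sys

local_rank = os.environ.get("LOCAL_RANK", "0")  # typically set for DDP child processes
print(f"[local rank {local_rank}] argv: {sys.argv}", file=sys.stderr)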

@PatrickESA (Author)

Thanks for the reference, this is insightful. However, after cd-ing to .../lib/python3.11/site-packages/anemoi/training/config and running HYDRA_FULL_ERROR=1 ANEMOI_BASE_SEED=1 anemoi-training train --config-name=happy_little_config, I still get the same error as in the original post. Let me know what additional info might be useful; some of the relevant packages in my environment are:

anemoi-training 0.3.0 pypi_0 pypi
anemoi-utils 0.4.8 pypi_0 pypi
hydra-core 1.3.2 pypi_0 pypi
pytorch-lightning 2.4.0 pypi_0 pypi

@gmertes (Member) commented Nov 22, 2024

Thanks, will investigate and get back to you asap.

JesperDramsch transferred this issue from ecmwf/anemoi-training on Dec 19, 2024