Bare-metal multi-GPU training fails launching subprocesses due to unexpected args #13

Open · PatrickESA opened this issue Nov 20, 2024 · 5 comments
Labels: bug (Something isn't working), training

@PatrickESA

What happened?

Launching

HYDRA_FULL_ERROR=1 ANEMOI_BASE_SEED=1 anemoi-training train --config-name happy_little_config --config-dir=/pathToConfigs/config

for training models in a multi-GPU setup (on a single machine, outside of a SLURM environment) fails to launch sub-processes successfully. Specifically, the first sub-process is initiated properly, but the subsequent processes error out (see: Relevant log output).
Training runs successfully with a single-GPU setup but fails when using multiple devices. The desired behavior is for multi-GPU training to work outside of a SLURM environment. The issue may need further investigation, but it appears related to how the additional ranks are re-launched: the child command contains arguments the CLI does not recognize (see the log output below).

What are the steps to reproduce the bug?

On bare metal (i.e., without using any resource scheduler):

In the configuration, when using

hardware:
  num_gpus_per_node: 1
  num_nodes: 1
  num_gpus_per_model: 1

the process launches successfully and trains as expected. However, when changing to

hardware:
  num_gpus_per_node: 2
  num_nodes: 1
  num_gpus_per_model: 1

or

hardware:
  num_gpus_per_node: 2
  num_nodes: 1
  num_gpus_per_model: 2

the process crashes during creation of the child processes, which fail while parsing arguments they do not recognize.
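For context, below is a minimal, self-contained sketch of the suspected failure mode; it is not anemoi-training code, and the parser is purely illustrative. When launching DDP outside SLURM, PyTorch Lightning's subprocess launcher re-invokes the original command for each additional rank and, for a Hydra run, appends overrides such as hydra.run.dir=..., hydra.job.name=train_ddp_process_1 and hydra.output_subdir=null (these are visible in the log output below). A CLI that parses its arguments with a fixed argparse parser rejects those extra positionals and exits with code 2, matching the "terminated with code 2" message in the log.

# Illustrative reproduction of the failure mode only; this is NOT the
# anemoi-training CLI, just a fixed argparse parser standing in for it.
import argparse
import sys

parser = argparse.ArgumentParser(prog="train-sketch")
parser.add_argument("--config-name")
parser.add_argument("--config-dir")

# Simulated argv of a re-launched DDP child: the original options plus the
# Hydra overrides appended by Lightning (values copied from the log below).
child_argv = [
    "--config-name", "happy_little_config",
    'hydra.run.dir="outputs/2024-11-19/16-43-03"',
    "hydra.job.name=train_ddp_process_1",
    "hydra.output_subdir=null",
]

try:
    parser.parse_args(child_argv)
except SystemExit as exc:
    # argparse prints "error: unrecognized arguments: ..." and exits with
    # status 2, matching the child process exit code in the log.
    print(f"child exit code: {exc.code}", file=sys.stderr)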

Version

anemoi-training 0.3.0 (from pip)

Platform (OS and architecture)

Linux eohpc-phigpu27 5.4.0-125-generic #141-Ubuntu SMP Wed Aug 10 13:42:03 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

Relevant log output

...
[2024-11-19 16:44:28,847][anemoi.models.layers.attention][WARNING] - Flash attention not available, falling back to pytorch scaled_dot_product_attention
[2024-11-19 16:44:33,689][anemoi.training.train.forecaster][INFO] - Pressure level scaling: use scaler ReluPressureLevelScaler with slope 0.0010 and minimum 0.20 

[2024-11-19 16:44:35,481][lightning_fabric.utilities.distributed][INFO] - Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/2
usage: .__main__.py-train [--help] [--hydra-help] [--version] [--cfg {job,hydra,all}] [--resolve] [--package PACKAGE] [--run] [--multirun] [--shell-completion] [--config-path CONFIG_PATH] [--config-name CONFIG_NAME] [--config-dir CONFIG_DIR] [--experimental-rerun EXPERIMENTAL_RERUN] [--info [{all,config,defaults,defaults-tree,plugins,searchpath}]] [overrides ...]
.anemoi-training-train: error: unrecognized arguments: hydra.run.dir="outputs/2024-11-19/16-43-03" hydra.job.name=train_ddp_process_1 hyra.output_subdir=null

[2024-11-19 16:44:43,698][lightning_fabric.strategies.launchers.subprocess_script][INFO] - [rank: 1] Child process with PID 763811 terminated with code 2. Forcefully terminating all other processes to avoid zombies 🧟 
Killed

Accompanying data

No response

Organisation

No response

PatrickESA added the bug (Something isn't working) label on Nov 20, 2024

@PatrickESA (Author)

In case additional platform information is relevant:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.82.01    Driver Version: 470.82.01    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA RTX A6000    On   | 00000000:00:05.0 Off |                    0 |
| 30%   37C    P8    27W / 300W |      0MiB / 45634MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA RTX A6000    On   | 00000000:00:06.0 Off |                    0 |
| 30%   36C    P8    29W / 300W |      0MiB / 45634MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

@ssmmnn11 (Member)

Seems to be similar to what was fixed here: ecmwf/anemoi-training#82 (comment)

@gmertes (Member) commented Nov 21, 2024

Can you try running without --config-dir and report back?

Just cd into the directory where happy_little_config.yaml exists, and then run:

HYDRA_FULL_ERROR=1 ANEMOI_BASE_SEED=1 anemoi-training train --config-name=happy_little_config

We did fix multi-GPU training in ecmwf/anemoi-training#82 and that is merged into 0.3, but we may have a regression somewhere.
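(As a possible diagnostic, and only a sketch on my end: printing sys.argv at the very top of the training entry point should reveal exactly which overrides Lightning appends when it re-launches the command for the extra ranks. The snippet below is illustrative and not part of anemoi-training.)

# Illustrative diagnostic, not part of anemoi-training: dump the argv that
# each process receives before any argument parsing happens.
import os
import sys

local_rank = os.environ.get("LOCAL_RANK", "0")  # typically set for DDP child processes
print(f"[local rank {local_rank}] argv: {sys.argv}", file=sys.stderr)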

@PatrickESA (Author)

Thanks for the reference, this is insightful. However, after cd-ing to .../lib/python3.11/site-packages/anemoi/training/config and running HYDRA_FULL_ERROR=1 ANEMOI_BASE_SEED=1 anemoi-training train --config-name=happy_little_config, I still get the same error as in the original post. Let me know what additional info might be useful; some of the relevant packages in my environment are:

anemoi-training 0.3.0 pypi_0 pypi
anemoi-utils 0.4.8 pypi_0 pypi
hydra-core 1.3.2 pypi_0 pypi
pytorch-lightning 2.4.0 pypi_0 pypi

@gmertes (Member) commented Nov 22, 2024

Thanks, will investigate and get back to you asap.

JesperDramsch transferred this issue from ecmwf/anemoi-training on Dec 19, 2024