Bare-metal multi-GPU training fails launching subprocesses due to unexpected args #13
Comments
In case additional platform information is relevant:
This seems similar to what was fixed in ecmwf/anemoi-training#82 (comment).
Can you try running without --config-dir and report back: Just
We did fix multi-GPU training in ecmwf/anemoi-training#82, and that fix is merged into 0.3, but we may have a regression somewhere.
Thanks for the reference, this is insightful. However, doing a cd to
Thanks, will investigate and get back to you asap.
What happened?
Launching
HYDRA_FULL_ERROR=1 ANEMOI_BASE_SEED=1 anemoi-training train --config-name happy_little_config --config-dir=/pathToConfigs/config
to train a model in a multi-GPU setup (on a single machine, outside of a SLURM environment) fails to launch the sub-processes successfully. Specifically, the first sub-process is initiated properly, but the subsequent processes fail with an error (see: Relevant log output).
Training runs successfully with a single-GPU setup but fails when using multiple devices. The desired behavior is for multi-GPU training to work outside of a SLURM environment. The issue might require further investigation, but it may be related to this or that.
What are the steps to reproduce the bug?
On bare-metal (or: without using any resource scheduler):
In the configurations, if using
the process launches successfully and trains as expected. However, when changing to
or
the process crashes during creation of child processes while parsing arguments that are not recognized.
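For readers unfamiliar with this class of failure, here is a minimal sketch of how a child process can reject arguments that the parent accepted. It is an illustration only, not anemoi-training's, Hydra's, or Lightning's actual code: the parser, the `unrecognized_args` helper, and the assumption that the child only knows `--config-name` are all hypothetical. Only the flag names and values are taken from the command above.

```python
import argparse


def unrecognized_args(argv: list[str]) -> list[str]:
    """Return the arguments a (hypothetical) child-process parser does not know."""
    # Hypothetical stand-in for whatever parses the child's command line.
    # Assumption: it declares --config-name but not --config-dir.
    parser = argparse.ArgumentParser(prog="child")
    parser.add_argument("--config-name")
    # parse_known_args collects unknown arguments instead of exiting,
    # which makes the mismatch visible without crashing.
    _, unknown = parser.parse_known_args(argv)
    return unknown


if __name__ == "__main__":
    parent_argv = [
        "--config-name", "happy_little_config",
        "--config-dir=/pathToConfigs/config",
    ]
    # If a child were re-launched with the parent's full argv, a strict
    # parser would fail on exactly the arguments listed here.
    print(unrecognized_args(parent_argv))
```

A strict `parse_args` call would exit with an error on the same input, which matches the symptom of the first process starting while later ones crash during argument parsing. Whether this is the actual mechanism here is an assumption.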
Version
anemoi-training 0.3.0 (from pip)
Platform (OS and architecture)
Linux eohpc-phigpu27 5.4.0-125-generic #141-Ubuntu SMP Wed Aug 10 13:42:03 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
Relevant log output
Accompanying data
No response
Organisation
No response