
Add GPU Support for rlaunch multi #495

Draft
wants to merge 7 commits into main

Conversation

sivonxay
Contributor

There is currently no way to distribute GPUs among fireworks when running small jobs in parallel on a single system.

An example: on NERSC, you get exclusive access to one Perlmutter node with 4 A100 GPUs. If you run 4 fireworks that each require 1 GPU using rlaunch multi 4, each firework is responsible for determining which GPU to run on. Most Python code defaults to checking CUDA_VISIBLE_DEVICES and taking either the first GPU or all of them, which oversubscribes devices and leads to poor performance or an error.
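A minimal sketch of the idea (not the code in this PR; `assign_gpus` and `launch_with_gpus` are hypothetical helpers): partition the visible GPU ids round-robin among the sub-jobs, then launch each sub-job with `CUDA_VISIBLE_DEVICES` restricted to its own slice, so no two sub-jobs see the same device.

```python
import os
import subprocess

def assign_gpus(num_jobs, gpu_ids):
    """Partition GPU ids round-robin among num_jobs sub-jobs.

    Hypothetical helper for illustration only.
    """
    groups = [[] for _ in range(num_jobs)]
    for i, gpu in enumerate(gpu_ids):
        groups[i % num_jobs].append(gpu)
    return groups

def launch_with_gpus(cmd, gpus):
    """Launch one sub-job with only its assigned GPUs visible.

    Hypothetical helper: copies the environment and narrows
    CUDA_VISIBLE_DEVICES before spawning the subprocess.
    """
    env = os.environ.copy()
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(g) for g in gpus)
    return subprocess.Popen(cmd, env=env)

# 4 sub-jobs on a node with 4 GPUs -> one GPU each
groups = assign_gpus(4, [0, 1, 2, 3])
print(groups)  # [[0], [1], [2], [3]]
```

With 2 sub-jobs and 4 GPUs, each sub-job would instead see two devices (e.g. `CUDA_VISIBLE_DEVICES=0,2` and `1,3`).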

I don't believe this implementation would work on systems with non-NVIDIA/CUDA GPUs. I believe AMD devices require setting the HIP_VISIBLE_DEVICES variable instead, but I don't have access to a system with multiple AMD GPUs to test that.

This might not be the best way to implement this, but it raises the question of whether there is a need for a more general mechanism to distribute non-CPU devices (GPUs and TPUs) among sub-jobs.
