fix: Multi-node GPU submissions for greatlakes and picotte. #702
Conversation
Fixes logic where `--ntasks-per-node` would not normalize based on the number of nodes for GPU submissions, where the number of tasks is often the number of GPUs.
Codecov Report
@@ Coverage Diff @@
## master #702 +/- ##
=======================================
Coverage 68.57% 68.57%
=======================================
Files 41 41
Lines 4162 4162
Branches 1025 1025
=======================================
Hits 2854 2854
Misses 1097 1097
Partials 211 211
Looks good, just one quick question.
```diff
@@ -12,7 +12,7 @@
 {% set nn = nn|default((nn_cpu, nn_gpu)|max, true) %}
 {% if partition == 'gpu' %}
 #SBATCH --nodes={{ nn|default(1, true) }}
-#SBATCH --ntasks-per-node={{ (gpu_tasks, cpu_tasks)|max }}
+#SBATCH --ntasks-per-node={{ ((gpu_tasks, cpu_tasks)|max / nn)|int }}
```
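For reference, a minimal sketch of the arithmetic the changed line performs, using the nranks=4, ngpu=4 case discussed later in the thread; the two-GPUs-per-node value is an assumption for illustration only:

```python
# Illustrative values only: 4 ranks, 4 GPUs, assumed 2 GPUs per node -> nn = 2.
gpu_tasks, cpu_tasks, nn = 4, 4, 2

old_per_node = max(gpu_tasks, cpu_tasks)            # 4 per node -> 8 ranks total on 2 nodes
new_per_node = int(max(gpu_tasks, cpu_tasks) / nn)  # 2 per node -> 4 ranks total, as requested

print(old_per_node, new_per_node)  # 4 2
```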
This will only be correct if the number of tasks is evenly divisible by the number of nodes, right? Would this be an issue? If yes, should we protect against that?
If the tasks cannot be evenly distributed across an integer number of nodes (e.g. `nranks=13`), then there is no way to request this with `--ntasks-per-node`; if you rounded up, you would be providing the user with more ranks than they requested. Use `--ntasks={nranks}` instead in these cases.
Preserve the `--nodes=`, `--ntasks-per-node=` request when tasks can be evenly distributed across nodes. It will provide a more efficient communication pattern.
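A minimal Python sketch of the selection rule described above; the helper name is hypothetical and not part of signac-flow:

```python
def slurm_task_directives(nranks, nodes):
    """Pick SLURM directives for nranks tasks on a given node count (sketch only)."""
    if nranks % nodes == 0:
        # Evenly divisible: keep the more efficient explicit layout.
        return [f"--nodes={nodes}", f"--ntasks-per-node={nranks // nodes}"]
    # Not evenly divisible: let SLURM place the tasks.
    return [f"--ntasks={nranks}"]


print(slurm_task_directives(12, 4))  # ['--nodes=4', '--ntasks-per-node=3']
print(slurm_task_directives(13, 4))  # ['--ntasks=13']
```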
Yes, so I think we should raise an error in case the requested configuration cannot be provisioned.
We currently round in other templates for GPU partitions, e.g. `expanse`. We could change that. Generally that won't matter, since we take the ceiling and charges for GPU nodes are usually just for the GPUs, if I understand correctly.
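For concreteness, a small worked example of how taking the ceiling can over-provision ranks; the 4-GPUs-per-node figure mirrors the Delta partition mentioned below, and the rank count is illustrative:

```python
import math

nranks = 5         # ranks requested by the user (illustrative)
gpus_per_node = 4  # e.g. a 4-GPU-per-node partition

nodes = math.ceil(nranks / gpus_per_node)  # 2 nodes
per_node = math.ceil(nranks / nodes)       # ceil(5 / 2) = 3
print(nodes * per_node)                    # 6 ranks provisioned, one more than requested
```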
Yes, they would likely be charged correctly. However, if the user launches their app with `srun` or `mpiexec` without arguments, then the number of tasks is autodetected from the job configuration. If this is rounded up from what the user requested, then the user's script may fail (e.g. when it is coded to work with a specific number of ranks).
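As an illustration of that failure mode, a hypothetical user script (not part of signac-flow) that assumes the exact rank count it asked for:

```python
# user_app.py -- hypothetical script launched with `srun python user_app.py`.
from mpi4py import MPI

EXPECTED_RANKS = 4  # the number of ranks the user requested via signac-flow

size = MPI.COMM_WORLD.Get_size()
# If srun inherits a rounded-up task count from the job configuration,
# this check (or any hard-coded domain decomposition) will fail.
if size != EXPECTED_RANKS:
    raise RuntimeError(f"expected {EXPECTED_RANKS} ranks, got {size}")
```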
Okay, in the case of GPU jobs, does it make sense to have the number of CPU tasks not be a multiple of the number of GPUs? I feel that is something we should error on then.
It is possible for systems to support a different number of CPU tasks and GPUs. For example, NCSA Delta does:
```
$ srun --account=bbgw-delta-gpu --partition=gpuA40x4 --tasks=7 --mem=48g --gpus=5 --pty zsh
$ echo $SLURM_NTASKS
7
$ echo $SLURM_TASKS_PER_NODE
4,3
```
In this test, it assigned all 4 GPUs on the first node and 1 GPU on the 2nd.
Just because it is possible doesn't mean that signac-flow needs to support it. I can't think of any reasonable workflows that would need this. Also, you would need to check each system separately to see whether it has configured SLURM to allow this uneven task distribution.
I think there are more problems with this template. I ask for 4 ranks and 4 GPUs, but get 8 ranks spread across 4 GPUs:

```python
import flow


class Project(flow.FlowProject):
    pass


@Project.operation(directives=dict(nranks=4, ngpu=4))
def job(job):
    pass


if __name__ == "__main__":
    Project.get_project().main()
```

```
python project.py submit --pretend --partition gpu
Using environment configuration: GreatLakesEnvironment
Querying scheduler...
Submitting cluster job 'Project/99914b932bd37a50b983c5e7c90ae93b/job/86cd9023f6b91d9e9107a4de7a492fbe':
- Group: job(99914b932bd37a50b983c5e7c90ae93b)
# Submit command: sbatch
#!/bin/bash
#SBATCH --job-name="Project/99914b932bd37a50b983c5e7c90ae93b/job/86cd9023f6b91d9e9107a4de7a492fbe"
#SBATCH --partition=gpu
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --gpus=4
```

I get the correct output (nodes=1, ntasks-per-node=N, gpus=N) for N=1, 2. I get incorrect output (nodes=2) for N=3, 4. I also tested N=8 and got:
Thanks @joaander, I will get working on this after the holidays.
Given the sheer complexity involved and the problems most templates have with specific (corner-case) resource requests, I am planning on moving much of this logic into Python in the environment classes. Much of this can be default behavior in the base class, so it doesn't add much complexity for users creating new environments.
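For what it's worth, a rough sketch of what such shared base-class logic could look like; the class and method names and the per-node GPU count are hypothetical, not the eventual implementation in #708/#722:

```python
import math


class BaseSlurmEnvironment:
    """Hypothetical base class holding shared submission logic (sketch only)."""

    gpus_per_node = 4  # each cluster subclass would override this

    @classmethod
    def _resource_directives(cls, nranks, ngpu=0):
        """Compute SLURM directives for one submission (illustrative, not signac-flow API)."""
        nodes = max(math.ceil(ngpu / cls.gpus_per_node), 1) if ngpu else 1
        directives = [f"--gpus={ngpu}"] if ngpu else []
        if nranks % nodes == 0:
            directives += [f"--nodes={nodes}", f"--ntasks-per-node={nranks // nodes}"]
        else:
            # Uneven split: fall back to a plain task count.
            directives += [f"--ntasks={nranks}"]
        return directives


# e.g. nranks=4, ngpu=4 -> ['--gpus=4', '--nodes=1', '--ntasks-per-node=4']
print(BaseSlurmEnvironment._resource_directives(4, ngpu=4))
```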
You used this approach on #708, right? That is very clear and appears easy to maintain.
@b-butler Can we push this through or close it?
I believe #722 fixes this. |
Description
Fixes logic where `--ntasks-per-node` would not normalize based on the number of nodes for GPU submissions, where the number of tasks is often the number of GPUs.
Motivation and Context
Resolved: #566
Checklist: