fix: Multi-node GPU submissions for greatlakes and picotte. #702

Closed · wants to merge 4 commits
2 changes: 1 addition & 1 deletion flow/templates/drexel-picotte.sh
@@ -12,7 +12,7 @@
{% set nn = nn|default((nn_cpu, nn_gpu)|max, true) %}
{% if partition == 'gpu' %}
#SBATCH --nodes={{ nn|default(1, true) }}
-#SBATCH --ntasks-per-node={{ (gpu_tasks, cpu_tasks)|max }}
+#SBATCH --ntasks-per-node={{ ((gpu_tasks, cpu_tasks)|max / nn)|int }}
Contributor

This will only be correct if the number of tasks is evenly divisible by the number of nodes, right? Would this be an issue? If so, should we protect against that?

Member
@joaander (Dec 23, 2022)

If the tasks cannot be evenly distributed across an integer number of nodes (e.g. nranks=13), then there is no way to request this with --ntasks-per-node: rounding up would give the user more ranks than they requested. Use --ntasks={nranks} instead in these cases.

Preserve the --nodes= / --ntasks-per-node= request when the tasks can be evenly distributed across nodes; it provides a more efficient communication pattern.
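
A minimal sketch of what that conditional could look like in these Jinja templates, reusing the nn, gpu_tasks, and cpu_tasks variables from the diff above (illustrative only, not code from this PR):

{% set total_tasks = (gpu_tasks, cpu_tasks)|max %}
{% if total_tasks % nn == 0 %}
{# Even split: keep the more efficient --nodes / --ntasks-per-node request. #}
#SBATCH --nodes={{ nn }}
#SBATCH --ntasks-per-node={{ (total_tasks / nn)|int }}
{% else %}
{# Uneven split: request the exact task count and let SLURM place the ranks. #}
#SBATCH --nodes={{ nn }}
#SBATCH --ntasks={{ total_tasks }}
{% endif %}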

Contributor

Yes, so I think we should raise an error when the requested configuration cannot be provisioned.
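
A hedged sketch of such a guard, assuming a {% raise %} tag (or an equivalent error helper) is available to these templates; if it is not, the check would belong in the Python code that renders them:

{% if (gpu_tasks, cpu_tasks)|max % nn != 0 %}
{# Assumed error helper; the exact mechanism may differ. #}
{% raise "The requested tasks cannot be distributed evenly across the requested nodes." %}
{% endif %}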

Member Author

We currently round in other templates for GPU partitions, e.g. expanse. We could change that. Generally it won't matter, since we take the ceiling and charges for GPU nodes are usually just for the GPUs, if I understand correctly.

Member

Yes, they would likely be charged correctly. However, if the user launches their app with srun or mpiexec without arguments, then the number of tasks is autodetected from the job configuration. If this is rounded up from what the user requested, the user's script may fail (e.g. when it is coded to work with a specific number of ranks).
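
To illustrate the failure mode (hypothetical app name and numbers): if 13 requested ranks were rounded up to 7 tasks per node on 2 nodes, a bare launcher call inherits the rounded-up count:

#SBATCH --nodes=2
#SBATCH --ntasks-per-node=7   # rounded up from the 13 ranks the user requested

srun ./my_app   # no -n flag: srun starts 14 tasks, which may break a script written for 13 ranks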

Member Author

Okay, in the case of GPU jobs, does it make sense for the number of CPU tasks not to be a multiple of the number of GPUs? I feel that is something we should raise an error on, then.

Member

It is possible for systems to support a different number of CPU tasks and GPUs. For example, NCSA Delta does:

$ srun --account=bbgw-delta-gpu --partition=gpuA40x4 --tasks=7 --mem=48g --gpus=5 --pty zsh
$ echo $SLURM_NTASKS                                                                                                       
7
$ echo $SLURM_TASKS_PER_NODE
4,3

In this test, it assigned all 4 GPUs on the first node and 1 GPU on the 2nd.

Just because it is possible doesn't mean that signac-flow needs to support it. I can't think of any reasonable workflows that would need this. Also, you would need to check each system separately to see whether its SLURM configuration allows this uneven task distribution.

#SBATCH --gres=gpu:{{ gpu_tasks }}
{% else %}{# def partition #}
#SBATCH --nodes={{ nn }}
2 changes: 1 addition & 1 deletion flow/templates/umich-greatlakes.sh
@@ -12,7 +12,7 @@
{% set nn = nn|default((nn_cpu, nn_gpu)|max, true) %}
{% if partition == 'gpu' %}
#SBATCH --nodes={{ nn|default(1, true) }}
-#SBATCH --ntasks-per-node={{ (gpu_tasks, cpu_tasks)|max }}
+#SBATCH --ntasks-per-node={{ ((gpu_tasks, cpu_tasks)|max / nn)|int }}
#SBATCH --gpus={{ gpu_tasks }}
{% else %}{# standard compute partition #}
#SBATCH --nodes={{ nn }}