Leveraging multiple GPUs in a CUDA program with MPI is supported by the CUDA aware MPI installations on Satori. CUDA aware MPI from Slurm batch scripts is supported through system modules based on OpenMPI. To use CUDA aware MPI, source codes and libraries that involve MPI may need to be recompiled with the correct OpenMPI modules.
The following modules are needed to work with CUDA aware MPI codes:
module purge all
module add spack
module add cuda/10.1
module load openmpi/3.1.4-pmi-cuda-ucx
Codes and libraries that make MPI calls against CUDA device memory pointers need to be compiled using the MPI compilation wrappers (e.g. mpicc, mpiCC, mpicxx, mpic++, mpif77, mpif90, mpifort) from the openmpi/3.1.4-pmi-cuda-ucx OpenMPI module. The CUDA runtime library needs to be added as a link library, e.g. -lcudart.
A typical compilation setup is:
module purge all
module add spack
module add cuda/10.1
module load openmpi/3.1.4-pmi-cuda-ucx
mpiCC MYFILE.cc -lcudart
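For reference, the short program below is a minimal sketch of what an MPI call against a CUDA device memory pointer looks like: with a CUDA aware build, a buffer allocated with cudaMalloc can be passed directly to MPI_Send and MPI_Recv without staging it through host memory. The file name and buffer size are illustrative only and are not part of the Satori documentation; assigning a specific GPU to each rank is covered later in this section, so the sketch simply uses the default device. It can be compiled with the mpiCC command shown above and needs at least two ranks to run:
// sketch.cc - illustrative only: pass a device pointer straight to MPI
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  const int n = 1024;
  double *d_buf = NULL;
  cudaMalloc(&d_buf, n * sizeof(double));        // buffer lives in GPU memory

  if (rank == 0) {
    cudaMemset(d_buf, 0, n * sizeof(double));
    // CUDA aware MPI accepts the device pointer directly
    MPI_Send(d_buf, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
  } else if (rank == 1) {
    MPI_Recv(d_buf, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
  }

  cudaFree(d_buf);
  MPI_Finalize();
  return 0;
}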
The following example SLURM batch script header illustrates requesting 8 GPUs on 2 nodes with exclusive access. In this example the #SBATCH control commands request one MPI rank for each GPU, so cpus-per-task, ntasks-per-core and threads-per-core are all set to 1. The start of the batch script selects the modules needed for CUDA aware OpenMPI with SLURM integration.
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --gres=gpu:4
#SBATCH --cpus-per-task=1
#SBATCH --ntasks-per-core=1
#SBATCH --threads-per-core=1
#SBATCH --mem=1T
#SBATCH --exclusive
#SBATCH --time 00:05:00
module purge all
module add spack
module add cuda/10.1
module load openmpi/3.1.4-pmi-cuda-ucx
The batch script will be allocated 4 GPUs on each node in the batch session. Individual MPI ranks then need to attach to specific GPUs to run in parallel. There are two ways to do this.
Attach a GPU to a rank using a bash script.
In this approach a bash script is written that is used as a launcher for the MPI program to be run. This bash script sets the environment variable CUDA_VISIBLE_DEVICES so that the MPI program will only see the GPU it has been allocated. An example script is shown below:
#!/bin/bash
#
# Choose a CUDA device based on ${SLURM_LOCALID}
#
ngpu=`nvidia-smi -L | grep UUID | wc -l`
mygpu=$((${SLURM_LOCALID} % ${ngpu} ))
export CUDA_VISIBLE_DEVICES=${mygpu}
# Run the MPI program passed as arguments
exec "$@"
Attach a GPU to a rank using CUDA library runtime code.
In this approach the MPI program source must be modified to include GPU device selection code before MPI_Init() is invoked. An example code fragment for GPU device selection (based on the environment variable SLURM_LOCALID) is shown below:
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

int main(int argc, char** argv, char *envp[]) {
  char *localRankStr = NULL;
  int localrank = 0, devCount = 0, mydev;
  // Extract the local rank from the SLURM_LOCALID environment variable
  if ((localRankStr = getenv("SLURM_LOCALID")) != NULL) {
    localrank = atoi(localRankStr);
  }
  // Select the GPU for this rank before MPI_Init() is called
  cudaGetDeviceCount(&devCount);
  mydev = localrank % devCount;
  cudaSetDevice(mydev);
  :
  :
  MPI_Init(NULL, NULL);
  :
  :
To run the MPI program the SLURM command srun is used (and not mpirun or mpiexec). The srun command works like mpirun or mpiexec but it creates the environment variables needed to select which rank works with which GPU prior to any calls to MPI_Init(). An example of using srun with a launch script is shown below.
srun ./launch.sh ./a.out
The equivalent without a launch script is:
srun ./a.out
The script below shows a complete working example of the steps for CUDA aware MPI using multiple GPUs on multiple nodes under SLURM. It demonstrates both the bash script launcher and the CUDA runtime call approaches for assigning GPUs to ranks; only one of these approaches is needed in practice, but both are shown for illustration.
#!/bin/bash
#
# Example SLURM batch script to run example CUDA aware MPI program with one rank on
# each GPU, using two nodes with 4 GPUs on each node.
#
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --gres=gpu:4
#SBATCH --cpus-per-task=1
#SBATCH --ntasks-per-core=1
#SBATCH --threads-per-core=1
#SBATCH --mem=1T
#SBATCH --exclusive
#SBATCH --time 00:05:00
module purge all
module add spack
module add cuda/10.1
module load openmpi/3.1.4-pmi-cuda-ucx
cat > launch.sh <<'EOFA'
#!/bin/bash
# Choose a CUDA device number ($mygpu) based on ${SLURM_LOCALID}, cycling through
# the available GPU devices ($ngpu) on the node.
ngpu=`nvidia-smi -L | grep UUID | wc -l`
mygpu=$((${SLURM_LOCALID} % ${ngpu} ))
export CUDA_VISIBLE_DEVICES=${mygpu}
# Run MPI program with any arguments
exec "$@"
EOFA
cat > x.cc <<'EOFA'
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

int main(int argc, char** argv, char *envp[]) {
  char *localRankStr = NULL;
  int localrank = 0, devCount = 0, mydev;
  // Extract the local rank from the SLURM_LOCALID environment variable
  if ((localRankStr = getenv("SLURM_LOCALID")) != NULL) {
    localrank = atoi(localRankStr);
  }
  // Select the GPU for this rank before MPI_Init() is called
  cudaGetDeviceCount(&devCount);
  mydev = localrank % devCount;
  cudaSetDevice(mydev);
  MPI_Init(NULL, NULL);
  int world_size;
  MPI_Comm_size(MPI_COMM_WORLD, &world_size);
  int world_rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
  char processor_name[MPI_MAX_PROCESSOR_NAME];
  int name_len;
  MPI_Get_processor_name(processor_name, &name_len);
  // Check which CUDA device this rank is using
  char pciBusId[13];
  cudaDeviceGetPCIBusId(pciBusId, 13, mydev);
  printf("MPI rank %d of %d on host %s is using GPU with PCI id %s.\n",
         world_rank, world_size, processor_name, pciBusId);
  MPI_Finalize();
}
EOFA
chmod +x launch.sh
mpic++ x.cc -lcudart
srun ./launch.sh ./a.out
It is also possible to build custom MPI modules in individual user accounts using the spack (https://spack.readthedocs.io/en/latest/) package management tool. These builds should use the UCX communication features and PMI job management features to integrate with SLURM and the Satori high-speed network.
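For users who do build their own OpenMPI with spack, the command below is only a rough sketch of what such a build might look like. The variant names (+cuda, fabrics=ucx, schedulers=slurm, +pmi) and versions are assumptions that vary between spack releases and package recipes, so confirm them with spack info openmpi before installing:
# Hypothetical spack spec -- confirm available variants with `spack info openmpi`
spack install openmpi@3.1.4 +cuda +pmi fabrics=ucx schedulers=slurm ^cuda@10.1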