An introductory supercomputing course developed by the SHAO SKA team.
China SKA Regional Centre Prototype (CSRC-P) systems use Slurm for resource and job management, which avoids mutual interference between users and improves operational efficiency. All jobs, whether for program debugging or production calculations, must be submitted through the interactive/parallel srun command, the batch submission sbatch command, or the resource allocation salloc command; after submission, related commands can be used to query the job status (a quick sketch of all three commands follows the contents list below). Please do not run jobs directly on the login node (compiling excepted), so as not to affect the normal use of other users.
- Querying SLURM Partitions
- Querying the Queue
- Job Request
- Jobscripts
- Serial Python Example
- Launching Parallel Programs
- GPU Example
- Interactive Jobs
- Cancelling Jobs
- Modules
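As a quick orientation, the three submission commands mentioned above are used roughly as follows. This is a minimal sketch; the partition name, script name and resource numbers are placeholders to adapt to your own work:
# run a single command on a compute node interactively
srun -p purley-cpu -n 1 hostname
# submit a jobscript to run in batch mode
sbatch myjob.sh
# request an allocation for interactive work, then launch steps inside it with srun
salloc -p purley-cpu -n 1 --time=00:10:00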
A SLURM partition is a queue: a group of compute nodes, with its own limits, managed by the SLURM daemon.
To list the partitions when logged into a machine:
sinfo
To get all partitions in all local clusters:
sinfo -M all
For example:
[blao@x86-logon01 ~]$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
arm up infinite 1 down* taishan-arm-cpu10
arm up infinite 1 drain taishan-arm-cpu01
arm up infinite 8 idle taishan-arm-cpu[02-09]
purley-cpu* up infinite 1 alloc purley-x86-cpu01
purley-cpu* up infinite 7 idle purley-x86-cpu[02-08]
sugon-gpu up infinite 1 idle sugon-gpu01
inspur-gpu-opa up infinite 1 mix inspur-gpu02
inspur-gpu-ib up infinite 1 down* inspur-gpu01
knm up infinite 4 down* knm-x86-cpu[01-04]
knl up infinite 1 down* knl-x86-cpu02
all-gpu up infinite 1 down* inspur-gpu01
all-gpu up infinite 1 mix inspur-gpu02
all-gpu up infinite 1 idle sugon-gpu01
Here PARTITION is the partition name. AVAIL is the availability of the partition: up means available, down means unavailable. TIMELIMIT is the maximum run time allowed for jobs, where infinite means unlimited; otherwise the format is days-hours:minutes:seconds. NODES is the number of nodes. NODELIST is the list of nodes. STATE is the running state of the nodes.
STATE: node state; possible states include:
- allocated, alloc: allocated to jobs
- completing, comp: jobs on the node are completing
- down: node is down and unavailable
- drained, drain: node has been drained and is not accepting new jobs (typically for maintenance)
- fail: node has failed
- idle: idle and available for jobs
- mixed, mix: the node is running jobs, but some idle CPU cores can still accept new jobs
- reserved, resv: node is in a resource reservation
- unknown, unk: state unknown
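To check the detailed state of the nodes in a particular partition, or of a single node, sinfo and scontrol can be used as in the following sketch; the partition and node names are just examples taken from the listing above:
# per-node state for one partition (node-oriented, long output)
sinfo -p purley-cpu -N -l
# full details for a single node, including the reason if it is drained or down
scontrol show node purley-x86-cpu01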
It is important to use the correct system and partition for each part of a workflow:
Partition | Purpose |
---|---|
arm | Many-core workflows: many serial or shared-memory (OpenMP) jobs, large distributed-memory (MPI) jobs |
sugon-gpu, inspur-gpu-opa, inspur-gpu-ib, all-gpu | GPU-accelerated jobs, artificial intelligence jobs |
purley-cpu, hw, all-x86-cpu | Many serial or shared-memory (OpenMP) jobs, large distributed-memory (MPI) jobs |
squeue is used to view job and job step information for jobs managed by Slurm.
squeue <options>
squeue -u username
squeue -j jobid
squeue -p purley-cpu
Example:
[blao@x86-logon01 ~]$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
17810 inspur-gp step.bat xzj R 3-03:15:24 1 inspur-gpu02
17921 inspur-gp test.bat xzj R 1-19:45:57 1 inspur-gpu02
17931 purley-cp test.sh feirui20 R 1:04:06 1 purley-x86-cpu01
Interpretation of the fields:
JOBID: ID of the job.
PARTITION: partition used by the job.
NAME: job name.
USER: user name.
ST: job state. R=running, PD=pending, CA=cancelled, CG=completing, CD=completed.
TIME: time used by the job.
NODES: the actual number of nodes allocated to the job.
NODELIST: list of nodes allocated to the job or job step.
To view more options of squeue:
squeue --help
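The squeue output can also be customised with a format string; for example, the following sketch lists only your own jobs with a wider job-name column (the exact format string is just an illustration):
# job ID, partition, name, state, elapsed time and node list for your jobs
squeue -u $USER --format="%.10i %.12P %.30j %.8T %.12M %R"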
Individual Job Information
scontrol show job jobid
[blao@x86-logon01 ~]$ scontrol show job 17810
JobId=17810 JobName=step.batch
UserId=xzj(10015) GroupId=xzj(10019) MCS_label=N/A
Priority=4294901731 Nice=0 Account=xzj QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=3-03:41:48 TimeLimit=UNLIMITED TimeMin=N/A
SubmitTime=2020-09-23T11:16:25 EligibleTime=2020-09-23T11:16:25
AccrueTime=2020-09-23T11:16:25
StartTime=2020-09-23T11:16:26 EndTime=Unknown Deadline=N/A
PreemptTime=None SuspendTime=None SecsPreSuspend=0
LastSchedEval=2020-09-23T11:16:26
Partition=inspur-gpu-opa AllocNode:Sid=x86-logon01:339306
ReqNodeList=(null) ExcNodeList=(null)
NodeList=inspur-gpu02
BatchHost=inspur-gpu02
NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=1,mem=3001M,node=1,billing=1,gres/gpu=1
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
MinCPUsNode=1 MinMemoryNode=3001M MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=/home/xzj/STEP/AI/step.batch
WorkDir=/home/xzj/STEP/AI
StdErr=/home/xzj/STEP/AI/step.out
StdIn=/dev/null
StdOut=/home/xzj/STEP/AI/step.out
Power=
TresPerNode=gpu:1
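scontrol only knows about jobs that are still pending or running. For jobs that have already finished, and assuming job accounting is enabled on the system, sacct can be used instead; a short sketch:
# summary of a finished job (replace 17810 with your own job ID)
sacct -j 17810 --format=JobID,JobName,Partition,State,Elapsed,MaxRSS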
SLURM needs to know two things from you: what resources you need, and what commands to run.
- You cannot submit an application directly to SLURM. Instead, SLURM executes a list of shell commands on your behalf.
- In batch mode, SLURM executes a jobscript which contains the commands.
- In interactive mode, you type in the commands just as when you log in.
- These commands can include launching programs onto the compute nodes assigned to the job.
- Directive lines start with #SBATCH.
- These are equivalent to sbatch command-line arguments.
- Directives are usually more convenient and reproducible than command-line arguments, so put your resource request into the jobscript.
The jobscript will execute on one of the allocated compute nodes.
#SBATCH directives are shell comments, so only the subsequent commands are executed; SLURM reads the directives when the job is submitted.
Example jobscript (hostname.sh)
#!/bin/bash
#SBATCH --job-name=myjob #makes it easier to find in squeue
#SBATCH --partition=purley-cpu # partition name
#SBATCH --nodes=2 # number of nodes
#SBATCH --tasks-per-node=1 #processes per node
#SBATCH --cpus-per-task=1 #cores per process
#SBATCH --time=00:05:00 # walltime requested
#SBATCH --export=NONE # start with a clean environment. This improves reproducibility and avoids contamination of the environment.
#The next line is executed on the compute node
srun /usr/bin/hostname
Submit the jobscript with:
sbatch hostname.sh
To see all sbatch options:
sbatch --help
Standard output and standard error from your jobscript are collected by SLURM and written to a file in the directory from which you submitted the job, once the job finishes or dies:
slurm-jobid.out
cat slurm-jobid.out
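If you prefer a different file name or location, the output and error files can be set explicitly with #SBATCH directives; a minimal sketch, where %j is the sbatch filename pattern that expands to the job ID:
#SBATCH --output=myjob-%j.out    # standard output file
#SBATCH --error=myjob-%j.err     # standard error in a separate file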
Serial Python example (hello-serial.sh). This jobscript will run on a single core of SHAO’s cluster for up to 5 minutes:
#!/bin/bash
#SBATCH --job-name=hello-serial
#SBATCH --partition=purley-cpu
#SBATCH --nodes=1
#SBATCH --tasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --time=00:05:00
#SBATCH --export=NONE
# load modules
module use /home/app/modulefiles
module load python/cpu-3.7.4
# launch serial python script
python3 hello-serial.py
The script can be submitted to the scheduler with:
sbatch hello-serial.sh
Parallel applications are launched using srun.
The arguments determine the parallelism:
-N number of nodes
-n number of tasks (for process parallelism e.g. MPI)
-c cores per task (for thread parallelism e.g. OpenMP)
Although these values are already given in the #SBATCH directives, they should be repeated in the srun arguments.
OpenMP example (hello-openmp.sh). This will run 1 process with 24 threads on 1 purley-cpu compute node, using 24 cores, for up to 5 minutes:
#!/bin/bash
#SBATCH --job-name=hello-openmp
#SBATCH --partition=purley-cpu
#SBATCH --nodes=1
#SBATCH --tasks-per-node=1
#SBATCH --cpus-per-task=24
#SBATCH --time=00:05:00
#SBATCH --export=NONE
# set OpenMP environment variables
export OMP_NUM_THREADS=24
export OMP_PLACES=cores
export OMP_PROC_BIND=close
# launch OpenMP program
srun --export=all -n 1 -c ${OMP_NUM_THREADS} ./hello-openmp-gcc
The program can be compiled and the script can be submitted to the scheduler with:
cd hello-openmp
make
sbatch hello-openmp.sh
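To avoid keeping the thread count and the --cpus-per-task request in sync by hand, OMP_NUM_THREADS can be derived from the SLURM_CPUS_PER_TASK variable that Slurm sets inside the job; a small sketch of the relevant jobscript lines, assuming --cpus-per-task is set in the directives as above:
# let the OpenMP thread count follow the Slurm request
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
srun --export=all -n 1 -c ${OMP_NUM_THREADS} ./hello-openmp-gcc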
MPI example (hello-mpi.sh). This will run 24 MPI processes on 1 node of purley-cpu:
#!/bin/bash
#SBATCH --job-name=hello-mpi
#SBATCH --partition=purley-cpu
#SBATCH --nodes=1
#SBATCH --tasks-per-node=24
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=4GB
#SBATCH --time=00:05:00
#SBATCH --export=NONE
# prepare MPI environment
module use /home/app/modulefiles
module load mpich/cpu-3.2.1-gcc-7.3.0
# launch MPI program
srun --export=all --mpi=pmi2 -N 1 -n 24 ./hello-mpi
The script can be submitted to the scheduler with:
cd hello-mpi
module load mpich/cpu-3.2.1-gcc-7.3.0
make
sbatch hello-mpi.sh
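To scale the same example across more than one node, only the node and task counts need to change. The sketch below assumes 24 usable cores per purley-cpu node; adjust the numbers to match the actual hardware:
# request two nodes with 24 MPI tasks each
#SBATCH --nodes=2
#SBATCH --tasks-per-node=24
# launch 48 MPI processes across the two nodes
srun --export=all --mpi=pmi2 -N 2 -n 48 ./hello-mpi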
GPU example (hello-gpu.sh). This will use one GPU device on a GPU node:
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --time=00:05:00
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:1 #number of gpu devices
#SBATCH --mem=8g
#SBATCH --partition=inspur-gpu-opa
#SBATCH --export=NONE
# prepare GPU environment
module use /home/app/modulefiles
module load cuda/9.0
# launch GPU program
srun --export=all -N 1 -n 1 ./hello-gpu
The script can be submitted to the scheduler with:
cd hello-gpu
module load cuda/9.0
nvcc -o hello-gpu hello-gpu.cu
sbatch hello-gpu.sh
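To confirm which GPU has actually been allocated to the job, nvidia-smi can be run as a job step inside the same allocation (this assumes the node has NVIDIA GPUs, as the cuda module suggests); a line that could be appended to the jobscript:
# report the GPU(s) visible to this job
srun --export=all -N 1 -n 1 nvidia-smi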
All scripts can be downloaded from this git repository:
https://github.com/lao19881213/Introductory-CSRC-P
Sometimes you need to work interactively, for example for:
- Debugging
- Compiling
- Pre/post-processing
Use salloc instead of sbatch.
You still need srun to place jobs onto compute nodes.
Options that would normally go in #SBATCH directives must instead be given as command-line arguments to salloc.
For example (compiling):
salloc --tasks=16 --partition purley-cpu --time=00:10:00
srun make -j 16
Run hello-serial.py interactively on a purley-cpu compute node.
Start an interactive session (you may need to wait while it is in the queue):
salloc --partition=purley-cpu --tasks=1 --time=00:10:00
Prepare the environment:
module use /home/app/modulefiles
module load python/cpu-3.7.4
Launch the program:
srun --export=all -n 1 python3 hello-serial.py
Exit the interactive session:
exit
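Inside an interactive allocation you can also inspect what Slurm has granted you through the environment variables it sets; a quick sketch of some commonly available ones:
# job ID and node list of the current allocation
echo $SLURM_JOB_ID
echo $SLURM_JOB_NODELIST
# confirm where job steps actually run
srun hostname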
You can also run hello-serial.py interactively using srun alone:
Prepare the environment:
module use /home/app/modulefiles
module load python/cpu-3.7.4
Launch the program:
srun --export=all -N 1 -n 1 -p purley-cpu python3 hello-serial.py
To cancel a job, use scancel with the job ID (shown by squeue):
scancel JOBID
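scancel can also select jobs by other criteria, which is handy when several jobs are queued; for example:
# cancel all of your own jobs
scancel -u $USER
# cancel jobs by name
scancel --name=myjob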
To prevent conflicts between software names and versions, applications and libraries are not installed in the standard directory locations. Modules modify the environment to make it easy to locate software, libraries, documentation, or particular versions of the software.
The module system manages these environment variables for each application:
Command | Description |
---|---|
module avail | Show available modules |
module list | List loaded modules |
module load modulename | Load a module into the current environment |
module unload modulename | Unload a module from the environment |
module swap module1 module2 | Swap a loaded module with another |
module show modulename | Show the details of a particular module |
module help modulename | Show module-specific help |
The modulefiles for users are in the /home/app/modulefiles directory. Example of loading the GCC compiler, version 9.3.0:
module use /home/app/modulefiles
module load gcc/cpu-9.3.0
Module names take the form software/cpu-version, software/arm-version, or software/gpu-version, indicating the type of machine the software runs on: x86 machines, ARM machines, or GPU machines respectively.
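To see which versions of a package are provided before loading one, module avail accepts a name prefix; a short sketch with gcc as the example package:
# make the CSRC-P modulefiles visible, then list the available gcc builds
module use /home/app/modulefiles
module avail gcc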