Add example of using topology with pytorch
samskillman committed Aug 23, 2024
1 parent 45a1dda commit 0e9a7ce
Showing 4 changed files with 330 additions and 0 deletions.
174 changes: 174 additions & 0 deletions examples/machine-learning/a3-megagpu-8g/topological_pytorch/README.md
# Topologically-aware PyTorch Distributed

This example demonstrates how to incorporate topology information into a
PyTorch distributed workload.

Note: This requires that your nodes were created using a compact placement
policy.

The main concept is that once you have a topologically sorted list of Slurm
nodes, PyTorch needs to incorporate that information into its
`dist.init_process_group` function.

Note: If you use torchrun, you may need to alter how this information is
incorporated. Using `torchrun ... --node-rank=${SLURM_NODEID}` does not appear
to initialize ranks based on the correct node sorting. For that reason, we
suggest using the
[env://](https://pytorch.org/docs/stable/distributed.html#environment-variable-initialization)
initialization method, which is slightly more manual but enables the
fine-grained control that we want.
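
For reference, the sketch below shows the `env://`-style initialization used in
this example, with the rank taken from Slurm; it assumes `MASTER_ADDR` and
`MASTER_PORT` have already been exported, as the job script later in this
document does. The full version appears in the Test Script section.

```python
import os

import torch.distributed as dist

# SLURM_PROCID follows the node order given in SLURM_HOSTFILE, so passing it
# as the rank preserves the topological sort.
dist.init_process_group(
    "nccl",
    init_method="env://",  # reads MASTER_ADDR and MASTER_PORT from the environment
    rank=int(os.environ["SLURM_PROCID"]),
    world_size=int(os.environ["SLURM_NPROCS"]),
)
```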

## Quickstart

Run the following commands to demonstrate topologically-aware PyTorch:

```sh
# Creates a local python3 env and installs pytorch
jobid=$(sbatch --parsable install.sh)

# Run an example of setting SLURM_HOSTFILE based on topology
sbatch --dependency=afterok:$jobid topological_pytorch.sh
```

Once submitted, you should be able to view the state of the jobs with `squeue`:

```text
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
124 a3mega topologi username PD 0:00 8 (Dependency)
123 a3mega install. username R 2:14 1 a3mega-a3meganodeset-0
```

Wait until job 124 is complete, then review the output in `slurm-124.out`. It
will look something like this (illustrative values are used; your
`physical_host` values will contain random characters):

```text
Standard
rank hostname physical_host
0 a3mega-a3meganodeset-0.c.<project>.internal /CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC/bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb/00000000000000000000000000000000
8 a3mega-a3meganodeset-1.c.<project>.internal /CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC/dddddddddddddddddddddddddddddddd/11111111111111111111111111111111
16 a3mega-a3meganodeset-2.c.<project>.internal /CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC/aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa/22222222222222222222222222222222
24 a3mega-a3meganodeset-3.c.<project>.internal /CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC/cccccccccccccccccccccccccccccccc/33333333333333333333333333333333
32 a3mega-a3meganodeset-4.c.<project>.internal /CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC/eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee/44444444444444444444444444444444
40 a3mega-a3meganodeset-5.c.<project>.internal /CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC/ffffffffffffffffffffffffffffffff/55555555555555555555555555555555
48 a3mega-a3meganodeset-6.c.<project>.internal /CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC/bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb/66666666666666666666666666666666
56 a3mega-a3meganodeset-7.c.<project>.internal /CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC/ffffffffffffffffffffffffffffffff/77777777777777777777777777777777
Topologically aware
rank hostname physical_host
0 a3mega-a3meganodeset-2.c.<project>.internal /CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC/aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa/22222222222222222222222222222222
8 a3mega-a3meganodeset-0.c.<project>.internal /CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC/bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb/00000000000000000000000000000000
16 a3mega-a3meganodeset-6.c.<project>.internal /CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC/bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb/66666666666666666666666666666666
24 a3mega-a3meganodeset-3.c.<project>.internal /CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC/cccccccccccccccccccccccccccccccc/33333333333333333333333333333333
32 a3mega-a3meganodeset-1.c.<project>.internal /CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC/dddddddddddddddddddddddddddddddd/11111111111111111111111111111111
40 a3mega-a3meganodeset-4.c.<project>.internal /CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC/eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee/44444444444444444444444444444444
48 a3mega-a3meganodeset-5.c.<project>.internal /CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC/ffffffffffffffffffffffffffffffff/55555555555555555555555555555555
56 a3mega-a3meganodeset-7.c.<project>.internal /CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC/ffffffffffffffffffffffffffffffff/77777777777777777777777777777777
```

This shows that the ranks are ordered by the "rack" component of `physical_host`.
See [here](https://cloud.google.com/compute/docs/instances/use-compact-placement-policies#verify-vm-location)
for more information on compact placement policies.
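
As an illustration, each `physical_host` value splits on `/` into cluster,
rack, and host identifiers, and sorting on those leading fields groups nodes
that share a rack. A minimal sketch of that sort, using shortened, made-up
identifiers:

```python
# Illustrative only: shortened physical_host values mirroring the sample output above.
hosts = {
    "a3mega-a3meganodeset-0": "/CCCC/bbbb/0000",
    "a3mega-a3meganodeset-1": "/CCCC/dddd/1111",
    "a3mega-a3meganodeset-2": "/CCCC/aaaa/2222",
}

# Splitting on "/" yields ["", cluster, rack, host]; sorting on that list
# places nodes in the same rack next to each other.
topo_sorted = sorted(hosts, key=lambda name: hosts[name].split("/"))
print(topo_sorted)
# ['a3mega-a3meganodeset-2', 'a3mega-a3meganodeset-0', 'a3mega-a3meganodeset-1']
```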

## Detailed Explanation

### Setup

First we need to install PyTorch. While the same concepts transfer to using
enroot/pyxis to launch containerized workloads, in this example we will just
use a local Python environment:

```sh
# Creates a local python3 env and installs pytorch
sbatch install.sh
```

### Job Submission Script
Now let's review the `topological_pytorch.sh` batch job submission script.

First we set the requisite GPUDirect-TCPXO environment variables:

```sh
NCCL_LIB_DIR="/var/lib/tcpxo/lib64" source /var/lib/tcpxo/lib64/nccl-env-profile.sh
export NCCL_FASTRAK_CTRL_DEV=enp0s12
export NCCL_FASTRAK_IFNAME=enp6s0,enp7s0,enp13s0,enp14s0,enp134s0,enp135s0,enp141s0,enp142s0
export NCCL_SOCKET_IFNAME=enp0s12
export NCCL_FASTRAK_LLCM_DEVICE_DIRECTORY=/dev/aperture_devices
```

and activate our python environment:

```sh
source env/bin/activate
```

Next we demonstrate the standard behavior that torchrun would use, which does
not incorporate topology into how it orders ranks among the nodes.

```sh
# Demonstrate standard behavior
echo "Standard"
# Set the MASTER_ADDR to the first node in the Slurm Job Nodelist
export MASTER_ADDR=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1)
# For torchrun, we only launch 1 task per node, and instruct torchrun to create
# 8 (SLURM_GPUS_PER_NODE) processes per node.
srun --ntasks-per-node=1 --nodes $SLURM_NNODES \
    python -m torch.distributed.run \
    --nproc_per_node ${SLURM_GPUS_PER_NODE} \
    --rdzv_endpoint $MASTER_ADDR:$MASTER_PORT \
    --rdzv_backend c10d \
    --nnodes $SLURM_NNODES topological_pytorch.py
```

torchrun will launch 8 tasks per node, and assign ranks lexicographically
across nodes according to the hostnames.

For topologically-aware behavior, we launch all the tasks with Slurm's `srun`
and use the Slurm environment variables to initialize the torch distributed
process group, as described in the next section. To incorporate topology
into Slurm, we first sort the nodes in the job:

```sh
# Demonstrate how to incorporate topology
# First sort hosts by topology
srun bash -c 'curl -s "http://metadata.google.internal/computeMetadata/v1/instance/attributes/physical_host" -H "Metadata-Flavor: Google"; echo /$SLURMD_NODENAME' |
    sort -t / -s -k 1,4 |
    awk -F "/" '{print $NF}' >/var/tmp/topo_sorted_hostfile
# The SLURM_HOSTFILE signals to Slurm how to order nodes
export SLURM_HOSTFILE=/var/tmp/topo_sorted_hostfile
```
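
With the topology from the Quickstart output above, the resulting
`/var/tmp/topo_sorted_hostfile` would list the node names grouped by rack,
for example:

```text
a3mega-a3meganodeset-2
a3mega-a3meganodeset-0
a3mega-a3meganodeset-6
a3mega-a3meganodeset-3
a3mega-a3meganodeset-1
a3mega-a3meganodeset-4
a3mega-a3meganodeset-5
a3mega-a3meganodeset-7
```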

Setting `SLURM_HOSTFILE` controls the order in which Slurm assigns
`SLURM_PROCID`, which we will later use to order NCCL ranks in PyTorch.
The last thing we need to do is reset the `MASTER_ADDR` environment variable
to match the first node in the hostfile and then launch the job, adding the
`--topology` flag to the script arguments to trigger the topology logic.

```sh
# Set the MASTER_ADDR to the first node in the hostfile
export MASTER_ADDR=$(head -n 1 ${SLURM_HOSTFILE})

srun python topological_pytorch.py --topology
```

### Test Script
Next, review the `topological_pytorch.py` script. A top-level `--topology`
flag controls whether PyTorch is initialized using torchrun (when `False`) or
Slurm (when `True`). In the latter case, the Slurm environment variables ensure
that the node ordering Slurm uses is translated into the PyTorch ranks.

```python
if args.topology:
    # These are populated by Slurm
    local_rank = int(os.environ["SLURM_LOCALID"])
    global_rank = int(os.environ["SLURM_PROCID"])
    world_size = int(os.environ["SLURM_NPROCS"])
    procs_per_node = int(os.environ["SLURM_NTASKS_PER_NODE"])

    # Must set rank and world_size based on SLURM_PROCID and SLURM_NPROCS
    dist.init_process_group("nccl", rank=global_rank, world_size=world_size)
else:
    # These are populated by torchrun
    local_rank = int(os.environ["LOCAL_RANK"])
    global_rank = int(os.environ["RANK"])
    world_size = int(os.environ["WORLD_SIZE"])
    procs_per_node = int(os.environ["LOCAL_WORLD_SIZE"])

    # Torchrun handles rank allocation
    dist.init_process_group("nccl")
```

The remainder of the script demonstrates the functionality. We use
`dist.all_gather_object` to collect the rank, hostname, and `physical_host` from
each PyTorch worker, and then print the results in rank order from global rank
0. The ordering of this output will vary depending on the node ordering that
Slurm uses to launch the tasks.
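
Beyond this demonstration, the same initialization carries over to a real
training job. Below is a minimal, hypothetical sketch, not part of this
example, of wrapping a placeholder model in `DistributedDataParallel` once the
process group has been initialized this way:

```python
import os

import torch
from torch.nn.parallel import DistributedDataParallel as DDP

# Assumes the process group has already been initialized as shown above.
local_rank = int(os.environ["SLURM_LOCALID"])
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(1024, 1024).cuda()  # placeholder model, not part of this example
ddp_model = DDP(model, device_ids=[local_rank])

# Gradient all-reduces issued by DDP now run over NCCL ranks that follow the
# topological ordering established via SLURM_HOSTFILE.
```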

### Running the Test

Run the following command to demonstrate topologically-aware PyTorch:

```sh
# Run an example of setting SLURM_HOSTFILE based on topology
sbatch topological_pytorch.sh
```

The output shows the standard ordering followed by the topologically sorted
ordering. See the Quickstart section above for an example.
26 changes: 26 additions & 0 deletions examples/machine-learning/a3-megagpu-8g/topological_pytorch/install.sh
#!/bin/bash
# Copyright 2024 "Google LLC"
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

#filename: install.sh
#submit with `sbatch install.sh`

#SBATCH --partition=a3mega
#SBATCH --gpus-per-node=8
#SBATCH --ntasks-per-node=1
#SBATCH --nodes 1

python3 -m venv env
source env/bin/activate
pip3 install --pre torch torchvision torchaudio
68 changes: 68 additions & 0 deletions examples/machine-learning/a3-megagpu-8g/topological_pytorch/topological_pytorch.py
#!/usr/bin/env python
# Copyright 2024 "Google LLC"
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

#filename: topological_pytorch.py
import os
import torch
import torch.distributed as dist
import socket
import subprocess
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--topology", action=argparse.BooleanOptionalAction)
args = parser.parse_args()

hostname = socket.getfqdn()
if args.topology:
    # These are populated by Slurm
    local_rank = int(os.environ["SLURM_LOCALID"])
    global_rank = int(os.environ["SLURM_PROCID"])
    world_size = int(os.environ["SLURM_NPROCS"])
    procs_per_node = int(os.environ["SLURM_NTASKS_PER_NODE"])

    # Must set rank and world_size based on SLURM_PROCID and SLURM_NPROCS
    dist.init_process_group("nccl", rank=global_rank, world_size=world_size)
else:
    # These are populated by torchrun
    local_rank = int(os.environ["LOCAL_RANK"])
    global_rank = int(os.environ["RANK"])
    world_size = int(os.environ["WORLD_SIZE"])
    procs_per_node = int(os.environ["LOCAL_WORLD_SIZE"])

    # Torchrun handles rank allocation
    dist.init_process_group("nccl")

# Must attach device based on the local rank.
torch.cuda.set_device(local_rank)

# Get the physical host for the current task to print later
physical_host = subprocess.check_output([
    "curl", "-s",
    "http://metadata.google.internal/computeMetadata/v1/instance/attributes/physical_host",
    "-H", "Metadata-Flavor: Google"
]).decode('utf-8')

# Create an output to collect from the all-gather
output = [None for _ in range(world_size)]
dist.all_gather_object(output, [global_rank, hostname, physical_host])
if global_rank == 0:
    # Print out the ordered set of hostnames gathered from all ranks
    print("rank\thostname\tphysical_host")
    # Print one entry per node (stride of procs_per_node) to keep the output compact
    for result in output[::procs_per_node]:
        print("\t".join(map(str, result)))

dist.destroy_process_group()
62 changes: 62 additions & 0 deletions examples/machine-learning/a3-megagpu-8g/topological_pytorch/topological_pytorch.sh
#!/bin/bash
# Copyright 2024 "Google LLC"
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# shellcheck disable=SC2016

#filename: topological_pytorch.sh
#submit with `sbatch topological_pytorch.sh`
#SBATCH --partition=a3mega
#SBATCH --gpus-per-node=8
#SBATCH --ntasks-per-node=8
#SBATCH --nodes 8

NCCL_LIB_DIR="/var/lib/tcpxo/lib64" source /var/lib/tcpxo/lib64/nccl-env-profile.sh
export NCCL_FASTRAK_CTRL_DEV=enp0s12
export NCCL_FASTRAK_IFNAME=enp6s0,enp7s0,enp13s0,enp14s0,enp134s0,enp135s0,enp141s0,enp142s0
export NCCL_SOCKET_IFNAME=enp0s12
export NCCL_FASTRAK_LLCM_DEVICE_DIRECTORY=/dev/aperture_devices

source env/bin/activate

export MASTER_PORT=12345
export OMP_NUM_THREADS=12

# Demonstrate standard behavior
echo "Standard"
# Set the MASTER_ADDR to the first node in the Slurm Job Nodelist
export MASTER_ADDR=$(scontrol show hostnames "${SLURM_JOB_NODELIST}" | head -n 1)
# For torchrun, we only launch 1 task per node, and instruct torchrun to create
# 8 (SLURM_GPUS_PER_NODE) processes per node.
srun --ntasks-per-node=1 --nodes "${SLURM_NNODES}" \
python -m torch.distributed.run \
--nproc_per_node "${SLURM_GPUS_PER_NODE}" \
--rdzv_endpoint "${MASTER_ADDR}":"${MASTER_PORT}" \
--rdzv_backend c10d \
--nnodes "${SLURM_NNODES}" topological_pytorch.py

# Demonstrate how to incorporate topology
# First sort hosts by topology
srun bash -c 'curl -s "http://metadata.google.internal/computeMetadata/v1/instance/attributes/physical_host" -H "Metadata-Flavor: Google"; echo /${SLURMD_NODENAME}' |
sort -t / -s -k 1,4 |
awk -F "/" '{print $NF}' >/var/tmp/topo_sorted_hostfile
# The SLURM_HOSTFILE signals to Slurm how to order nodes
export SLURM_HOSTFILE=/var/tmp/topo_sorted_hostfile
# Set the MASTER_ADDR to the first node in the hostfile
export MASTER_ADDR=$(head -n 1 "${SLURM_HOSTFILE}")

echo "Topologically aware"
# Run 8 tasks per node (inherited from the job script), since we aren't using
# torchrun in this case. Supply the --topology flag to trigger the topology
# logic in the script.
srun python topological_pytorch.py --topology
