Commit 0e9a7ce: Add example of using topology with pytorch
1 parent: 45a1dda
Showing 4 changed files with 330 additions and 0 deletions.
174 changes: 174 additions & 0 deletions
examples/machine-learning/a3-megagpu-8g/topological_pytorch/README.md
# Topologically-aware PyTorch Distributed

This example demonstrates how to incorporate topology information into a
PyTorch distributed workload.

Note: This requires that your nodes were created using a compact placement
policy.

The main concept is that once you have a topologically sorted list of Slurm
nodes, PyTorch needs to incorporate that ordering into its
`dist.init_process_group` function.

Note: If you use torchrun, you may need to alter how this information is
incorporated. Using `torchrun ... --node-rank=${SLURM_NODEID}` does not appear
to initialize ranks according to the correct node sorting. For that reason, we
suggest using the
[env://](https://pytorch.org/docs/stable/distributed.html#environment-variable-initialization)
initialization method, which is slightly more manual but enables the
fine-grained control that we want.
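As a minimal sketch of the env:// flow (using a hand-written stand-in for the Slurm environment rather than a live job), the snippet below shows how Slurm's task variables map onto the `rank` and `world_size` arguments that `dist.init_process_group` accepts:

```python
# Hypothetical Slurm-provided environment for one task; in a real job
# these values come from srun, not from a hand-written dict.
slurm_env = {
    "SLURM_PROCID": "8",    # global rank, ordered by SLURM_HOSTFILE
    "SLURM_NPROCS": "64",   # total tasks across all nodes
    "SLURM_LOCALID": "0",   # rank within the node
}

# With env:// initialization, MASTER_ADDR/MASTER_PORT are read from the
# environment, while rank and world_size are passed explicitly so that
# Slurm's node ordering carries over to the process group.
init_kwargs = {
    "rank": int(slurm_env["SLURM_PROCID"]),
    "world_size": int(slurm_env["SLURM_NPROCS"]),
}
# dist.init_process_group("nccl", **init_kwargs)  # as done in the example script
print(init_kwargs)
```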
## Quickstart

Run the following commands to demonstrate topologically-aware PyTorch:

    # Creates a local python3 env and installs pytorch
    jobid=$(sbatch --parsable install.sh)

    # Run an example of setting SLURM_HOSTFILE based on topology
    sbatch --dependency=afterok:$jobid topological_pytorch.sh

Once submitted, you should be able to view the state of the jobs with `squeue`:

    JOBID PARTITION     NAME     USER ST  TIME NODES NODELIST(REASON)
      124    a3mega topologi username PD  0:00     8 (Dependency)
      123    a3mega install. username  R  2:14     1 a3mega-a3meganodeset-0

Wait until job 124 is complete, then review the output in `slurm-124.out`. It
will look something like this (illustrative values are used; your
`physical_host` values will contain random characters):

    Standard
    rank	hostname	physical_host
    0	a3mega-a3meganodeset-0.c.<project>.internal	/CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC/bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb/00000000000000000000000000000000
    8	a3mega-a3meganodeset-1.c.<project>.internal	/CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC/dddddddddddddddddddddddddddddddd/11111111111111111111111111111111
    16	a3mega-a3meganodeset-2.c.<project>.internal	/CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC/aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa/22222222222222222222222222222222
    24	a3mega-a3meganodeset-3.c.<project>.internal	/CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC/cccccccccccccccccccccccccccccccc/33333333333333333333333333333333
    32	a3mega-a3meganodeset-4.c.<project>.internal	/CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC/eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee/44444444444444444444444444444444
    40	a3mega-a3meganodeset-5.c.<project>.internal	/CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC/ffffffffffffffffffffffffffffffff/55555555555555555555555555555555
    48	a3mega-a3meganodeset-6.c.<project>.internal	/CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC/bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb/66666666666666666666666666666666
    56	a3mega-a3meganodeset-7.c.<project>.internal	/CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC/ffffffffffffffffffffffffffffffff/77777777777777777777777777777777
    Sorted by topology
    rank	hostname	physical_host
    0	a3mega-a3meganodeset-2.c.<project>.internal	/CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC/aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa/22222222222222222222222222222222
    8	a3mega-a3meganodeset-0.c.<project>.internal	/CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC/bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb/00000000000000000000000000000000
    16	a3mega-a3meganodeset-6.c.<project>.internal	/CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC/bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb/66666666666666666666666666666666
    24	a3mega-a3meganodeset-3.c.<project>.internal	/CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC/cccccccccccccccccccccccccccccccc/33333333333333333333333333333333
    32	a3mega-a3meganodeset-1.c.<project>.internal	/CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC/dddddddddddddddddddddddddddddddd/11111111111111111111111111111111
    40	a3mega-a3meganodeset-4.c.<project>.internal	/CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC/eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee/44444444444444444444444444444444
    48	a3mega-a3meganodeset-5.c.<project>.internal	/CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC/ffffffffffffffffffffffffffffffff/55555555555555555555555555555555
    56	a3mega-a3meganodeset-7.c.<project>.internal	/CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC/ffffffffffffffffffffffffffffffff/77777777777777777777777777777777

This shows that the ranks are ordered by the "rack" component of the
`physical_host`. See
[here](https://cloud.google.com/compute/docs/instances/use-compact-placement-policies#verify-vm-location)
for more information on compact placement policies.
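To make the sorting concrete, here is a small sketch (with made-up `physical_host` strings mirroring the placeholder values above; real values are opaque random strings) of pulling the rack component out of the attribute. Nodes that share a rack sort next to each other:

```python
# Illustrative physical_host values in the /cluster/rack/host form
# shown above; shortened for readability.
hosts = {
    "a3mega-a3meganodeset-0": "/CCCC/bbbb/0000",
    "a3mega-a3meganodeset-2": "/CCCC/aaaa/2222",
    "a3mega-a3meganodeset-6": "/CCCC/bbbb/6666",
}

def rack(physical_host: str) -> str:
    # Splitting "/cluster/rack/host" yields ["", cluster, rack, host]
    return physical_host.split("/")[2]

# Nodes 0 and 6 share a rack, so a topological sort places them adjacently
same_rack = rack(hosts["a3mega-a3meganodeset-0"]) == rack(hosts["a3mega-a3meganodeset-6"])
print(same_rack)
```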
## Detailed Explanation

### Setup

First we need to install PyTorch. While these same concepts transfer to using
enroot/pyxis to launch containerized workloads, in this example we will just
use a local python environment:

    # Creates a local python3 env and installs pytorch
    sbatch install.sh

### Job Submission Script

Now let's review the `topological_pytorch.sh` batch job submission script.

First we set the requisite GPUDirect-TCPXO environment variables:

    NCCL_LIB_DIR="/var/lib/tcpxo/lib64" source /var/lib/tcpxo/lib64/nccl-env-profile.sh
    export NCCL_FASTRAK_CTRL_DEV=enp0s12
    export NCCL_FASTRAK_IFNAME=enp6s0,enp7s0,enp13s0,enp14s0,enp134s0,enp135s0,enp141s0,enp142s0
    export NCCL_SOCKET_IFNAME=enp0s12
    export NCCL_FASTRAK_LLCM_DEVICE_DIRECTORY=/dev/aperture_devices

and activate our python environment:

    source env/bin/activate

Next we demonstrate the standard behavior that torchrun would use, which does
not incorporate topology into how it orders ranks among the nodes:

    # Demonstrate standard behavior
    echo "Standard"
    # Set the MASTER_ADDR to the first node in the Slurm Job Nodelist
    export MASTER_ADDR=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1)
    # For torchrun, we only launch 1 task per node, and instruct torchrun to create
    # 8 (SLURM_GPUS_PER_NODE) processes per node.
    srun --ntasks-per-node=1 --nodes $SLURM_NNODES \
        python -m torch.distributed.run \
        --nproc_per_node ${SLURM_GPUS_PER_NODE} \
        --rdzv_endpoint $MASTER_ADDR:$MASTER_PORT \
        --rdzv_backend c10d \
        --nnodes $SLURM_NNODES topological_pytorch.py

torchrun will launch 8 tasks per node, and assign ranks lexicographically
across nodes according to their hostnames.

For topologically-aware behavior, we launch all the tasks using Slurm's `srun`,
and use the Slurm environment variables to initialize the torch distributed
process group, as described in the next section. To incorporate topology into
Slurm, we first sort the nodes in the job:

    # Demonstrate how to incorporate topology
    # First sort hosts by topology
    srun bash -c 'curl -s "http://metadata.google.internal/computeMetadata/v1/instance/attributes/physical_host" -H "Metadata-Flavor: Google"; echo /$SLURMD_NODENAME' |
        sort -t / -s -k 1,4 |
        awk -F "/" '{print $NF}' >/var/tmp/topo_sorted_hostfile
    # The SLURM_HOSTFILE signals to Slurm how to order nodes
    export SLURM_HOSTFILE=/var/tmp/topo_sorted_hostfile
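The shell pipeline above can be mirrored in a few lines of Python, which may help when verifying the sort logic; the node names and `physical_host` prefixes below are illustrative stand-ins:

```python
# Lines as emitted by the srun/curl step: "<physical_host>/<nodename>"
lines = [
    "/CCCC/bbbb/0000/a3mega-a3meganodeset-0",
    "/CCCC/dddd/1111/a3mega-a3meganodeset-1",
    "/CCCC/aaaa/2222/a3mega-a3meganodeset-2",
    "/CCCC/bbbb/6666/a3mega-a3meganodeset-6",
]

# Equivalent of `sort -t / -s -k 1,4`: a stable sort on the first four
# "/"-separated fields (leading empty field, cluster, rack, host).
sorted_lines = sorted(lines, key=lambda line: line.split("/")[:4])

# Equivalent of `awk -F "/" '{print $NF}'`: keep only the node name.
hostfile = [line.split("/")[-1] for line in sorted_lines]
print(hostfile)
```

Nodes sharing a rack (`bbbb`) end up adjacent in the hostfile, matching the "Sorted by topology" output in the Quickstart.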
Setting the `SLURM_HOSTFILE` controls the order in which Slurm assigns
`SLURM_PROCID`, which we will later use to order NCCL ranks in PyTorch.
The last thing we need to do is reset the `MASTER_ADDR` environment variable
to match the first node in the hostfile, and launch the job, adding the
`--topology` flag to the script arguments to trigger the topology logic.
    # Set the MASTER_ADDR to the first node in the hostfile
    export MASTER_ADDR=`head -n 1 ${SLURM_HOSTFILE}`

    srun python topological_pytorch.py --topology

### Test Script

Next review the `topological_pytorch.py` script. There is a top-level flag,
`--topology`, which controls whether PyTorch is initialized using torchrun
(when `False`) or using Slurm (when `True`). The Slurm environment variables
ensure that the node ordering that Slurm uses is translated to the PyTorch
ranks.

    if args.topology:
        # These are populated by Slurm
        local_rank = int(os.environ["SLURM_LOCALID"])
        global_rank = int(os.environ["SLURM_PROCID"])
        world_size = int(os.environ["SLURM_NPROCS"])
        procs_per_node = int(os.environ["SLURM_NTASKS_PER_NODE"])

        # Must set rank and world_size based on SLURM_PROCID and SLURM_NPROCS
        dist.init_process_group("nccl", rank=global_rank, world_size=world_size)
    else:
        # These are populated by torchrun
        local_rank = int(os.environ["LOCAL_RANK"])
        global_rank = int(os.environ["RANK"])
        world_size = int(os.environ["WORLD_SIZE"])
        procs_per_node = int(os.environ["LOCAL_WORLD_SIZE"])

        # Torchrun handles rank allocation
        dist.init_process_group("nccl")

The remainder of the script demonstrates the functionality. We use
`dist.all_gather_object` to collect the rank, hostname, and `physical_host`
from each PyTorch worker, and then print them in order from global rank 0.
Depending on the topology that Slurm uses to launch the job, the ordering of
this output will vary.
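The compaction in the script's final loop, printing every `procs_per_node`-th entry so each node appears once, can be seen in isolation (with simulated gather results standing in for `dist.all_gather_object`):

```python
# Simulated all-gather result for 2 nodes x 8 tasks; each entry is a
# [rank, hostname, physical_host] list like the script gathers.
procs_per_node = 8
world_size = 16
output = [
    [rank, f"node-{rank // procs_per_node}", "/CCCC/rack/host"]
    for rank in range(world_size)
]

# Slicing by procs_per_node keeps one representative rank per node,
# which is what the script prints from global rank 0.
per_node = output[::procs_per_node]
print([entry[0] for entry in per_node])
```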
### Running the Test

Run the following commands to demonstrate topologically-aware PyTorch:

    # Run an example of setting SLURM_HOSTFILE based on topology
    sbatch topological_pytorch.sh

The output shows the standard ordering followed by the topologically sorted
ordering. See the Quickstart section above for an example.
26 changes: 26 additions & 0 deletions
examples/machine-learning/a3-megagpu-8g/topological_pytorch/install.sh
#!/bin/bash
# Copyright 2024 "Google LLC"
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

#filename: install.sh
#submit with `sbatch install.sh`

#SBATCH --partition=a3mega
#SBATCH --gpus-per-node=8
#SBATCH --ntasks-per-node=1
#SBATCH --nodes 1

python3 -m venv env
source env/bin/activate
pip3 install --pre torch torchvision torchaudio
68 changes: 68 additions & 0 deletions
examples/machine-learning/a3-megagpu-8g/topological_pytorch/topological_pytorch.py
#!/usr/bin/env python
# Copyright 2024 "Google LLC"
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

#filename: topological_pytorch.py
import os
import torch
import torch.distributed as dist
import socket
import subprocess
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--topology", action=argparse.BooleanOptionalAction)
args = parser.parse_args()

hostname = socket.getfqdn()
if args.topology:
    # These are populated by Slurm
    local_rank = int(os.environ["SLURM_LOCALID"])
    global_rank = int(os.environ["SLURM_PROCID"])
    world_size = int(os.environ["SLURM_NPROCS"])
    procs_per_node = int(os.environ["SLURM_NTASKS_PER_NODE"])

    # Must set rank and world_size based on SLURM_PROCID and SLURM_NPROCS
    dist.init_process_group("nccl", rank=global_rank, world_size=world_size)
else:
    # These are populated by torchrun
    local_rank = int(os.environ["LOCAL_RANK"])
    global_rank = int(os.environ["RANK"])
    world_size = int(os.environ["WORLD_SIZE"])
    procs_per_node = int(os.environ["LOCAL_WORLD_SIZE"])

    # Torchrun handles rank allocation
    dist.init_process_group("nccl")

# Must attach device based on the local rank.
torch.cuda.set_device(local_rank)

# Get the physical host for the current task to print later
physical_host = subprocess.check_output([
    "curl", "-s",
    "http://metadata.google.internal/computeMetadata/v1/instance/attributes/physical_host",
    "-H", "Metadata-Flavor: Google"
]).decode('utf-8')

# Create an output list to collect from the all-gather
output = [None for _ in range(world_size)]
dist.all_gather_object(output, [global_rank, hostname, physical_host])
if global_rank == 0:
    # Print the ordered set of hostnames from the all-gather
    print("rank\thostname\tphysical_host")
    # Step by procs_per_node to keep the output compact
    for result in output[::procs_per_node]:
        print("\t".join(map(str, result)))

dist.destroy_process_group()
62 changes: 62 additions & 0 deletions
examples/machine-learning/a3-megagpu-8g/topological_pytorch/topological_pytorch.sh
#!/bin/bash
# Copyright 2024 "Google LLC"
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# shellcheck disable=SC2016

#filename: topological_pytorch.sh
#submit with `sbatch topological_pytorch.sh`
#SBATCH --partition=a3mega
#SBATCH --gpus-per-node=8
#SBATCH --ntasks-per-node=8
#SBATCH --nodes 8

NCCL_LIB_DIR="/var/lib/tcpxo/lib64" source /var/lib/tcpxo/lib64/nccl-env-profile.sh
export NCCL_FASTRAK_CTRL_DEV=enp0s12
export NCCL_FASTRAK_IFNAME=enp6s0,enp7s0,enp13s0,enp14s0,enp134s0,enp135s0,enp141s0,enp142s0
export NCCL_SOCKET_IFNAME=enp0s12
export NCCL_FASTRAK_LLCM_DEVICE_DIRECTORY=/dev/aperture_devices

source env/bin/activate

export MASTER_PORT=12345
export OMP_NUM_THREADS=12

# Demonstrate standard behavior
echo "Standard"
# Set the MASTER_ADDR to the first node in the Slurm Job Nodelist
export MASTER_ADDR=$(scontrol show hostnames "${SLURM_JOB_NODELIST}" | head -n 1)
# For torchrun, we only launch 1 task per node, and instruct torchrun to create
# 8 (SLURM_GPUS_PER_NODE) processes per node.
srun --ntasks-per-node=1 --nodes "${SLURM_NNODES}" \
	python -m torch.distributed.run \
	--nproc_per_node "${SLURM_GPUS_PER_NODE}" \
	--rdzv_endpoint "${MASTER_ADDR}":"${MASTER_PORT}" \
	--rdzv_backend c10d \
	--nnodes "${SLURM_NNODES}" topological_pytorch.py

# Demonstrate how to incorporate topology
# First sort hosts by topology
srun bash -c 'curl -s "http://metadata.google.internal/computeMetadata/v1/instance/attributes/physical_host" -H "Metadata-Flavor: Google"; echo /${SLURMD_NODENAME}' |
	sort -t / -s -k 1,4 |
	awk -F "/" '{print $NF}' >/var/tmp/topo_sorted_hostfile
# The SLURM_HOSTFILE signals to Slurm how to order nodes
export SLURM_HOSTFILE=/var/tmp/topo_sorted_hostfile
# Set the MASTER_ADDR to the first node in the hostfile
export MASTER_ADDR=$(head -n 1 "${SLURM_HOSTFILE}")

echo "Topologically aware"
# Run 8 tasks per node (inherited from the job script), since we aren't using
# torchrun in this case. Supply the --topology flag to the script to enable
# the topology-aware initialization.
srun python topological_pytorch.py --topology