Add example of using topology with pytorch
samskillman committed Aug 23, 2024
1 parent 45a1dda commit 0e9a7ce
Showing 4 changed files with 330 additions and 0 deletions.
174 changes: 174 additions & 0 deletions examples/machine-learning/a3-megagpu-8g/topological_pytorch/README.md
# Topologically-aware PyTorch Distributed

This example demonstrates how to incorporate topology information into a
PyTorch distributed workload.

Note: This requires that your nodes were created using a compact placement
policy.

The main concept is that once you have a topologically sorted list of Slurm
nodes, PyTorch needs to incorporate that information into its
`dist.init_process_group` function.

Note: If you use torchrun, you may need to alter how this information is
incorporated. Using `torchrun ... --node-rank=${SLURM_NODEID}` does not appear
to initialize ranks based on the correct node sorting. For that reason, we
suggest using the
[env://](https://pytorch.org/docs/stable/distributed.html#environment-variable-initialization)
initialization method, which is slightly more manual but enables the
fine-grained control that we want.
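
For reference, the sketch below shows the `env://`-style initialization used in
this example, with the rank taken from Slurm; it assumes `MASTER_ADDR` and
`MASTER_PORT` have already been exported, as the job script later in this
document does. The full version appears in the Test Script section.

```python
import os

import torch.distributed as dist

# SLURM_PROCID follows the node order given in SLURM_HOSTFILE, so passing it
# as the rank preserves the topological sort.
dist.init_process_group(
    "nccl",
    init_method="env://",  # reads MASTER_ADDR and MASTER_PORT from the environment
    rank=int(os.environ["SLURM_PROCID"]),
    world_size=int(os.environ["SLURM_NPROCS"]),
)
```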

## Quickstart

Run the following commands to demonstrate topologically-aware PyTorch:

```sh
# Creates a local python3 env and installs pytorch
jobid=$(sbatch --parsable install.sh)

# Run an example of setting SLURM_HOSTFILE based on topology
sbatch --dependency=afterok:$jobid topological_pytorch.sh
```

Once submitted, you should be able to view the state of the jobs with `squeue`:

```text
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
124 a3mega topologi username PD 0:00 8 (Dependency)
123 a3mega install. username R 2:14 1 a3mega-a3meganodeset-0
```

Wait until job 124 is complete, then review the output in `slurm-124.out`. It
will look something like this (illustrative values are used; your
`physical_host` values will contain random characters):

```text
Standard
rank hostname physical_host
0 a3mega-a3meganodeset-0.c.<project>.internal /CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC/bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb/00000000000000000000000000000000
8 a3mega-a3meganodeset-1.c.<project>.internal /CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC/dddddddddddddddddddddddddddddddd/11111111111111111111111111111111
16 a3mega-a3meganodeset-2.c.<project>.internal /CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC/aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa/22222222222222222222222222222222
24 a3mega-a3meganodeset-3.c.<project>.internal /CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC/cccccccccccccccccccccccccccccccc/33333333333333333333333333333333
32 a3mega-a3meganodeset-4.c.<project>.internal /CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC/eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee/44444444444444444444444444444444
40 a3mega-a3meganodeset-5.c.<project>.internal /CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC/ffffffffffffffffffffffffffffffff/55555555555555555555555555555555
48 a3mega-a3meganodeset-6.c.<project>.internal /CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC/bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb/66666666666666666666666666666666
56 a3mega-a3meganodeset-7.c.<project>.internal /CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC/ffffffffffffffffffffffffffffffff/77777777777777777777777777777777
Topologically aware
rank hostname physical_host
0 a3mega-a3meganodeset-2.c.<project>.internal /CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC/aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa/22222222222222222222222222222222
8 a3mega-a3meganodeset-0.c.<project>.internal /CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC/bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb/00000000000000000000000000000000
16 a3mega-a3meganodeset-6.c.<project>.internal /CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC/bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb/66666666666666666666666666666666
24 a3mega-a3meganodeset-3.c.<project>.internal /CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC/cccccccccccccccccccccccccccccccc/33333333333333333333333333333333
32 a3mega-a3meganodeset-1.c.<project>.internal /CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC/dddddddddddddddddddddddddddddddd/11111111111111111111111111111111
40 a3mega-a3meganodeset-4.c.<project>.internal /CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC/eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee/44444444444444444444444444444444
48 a3mega-a3meganodeset-5.c.<project>.internal /CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC/ffffffffffffffffffffffffffffffff/55555555555555555555555555555555
56 a3mega-a3meganodeset-7.c.<project>.internal /CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC/ffffffffffffffffffffffffffffffff/77777777777777777777777777777777
```

This shows that the ranks are ordered by the "rack" component of `physical_host`.
See [here](https://cloud.google.com/compute/docs/instances/use-compact-placement-policies#verify-vm-location)
for more information on compact placement policies.
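
As an illustration, each `physical_host` value splits on `/` into cluster,
rack, and host identifiers, and sorting on those leading fields groups nodes
that share a rack. A minimal sketch of that sort, using shortened, made-up
identifiers:

```python
# Illustrative only: shortened physical_host values mirroring the sample output above.
hosts = {
    "a3mega-a3meganodeset-0": "/CCCC/bbbb/0000",
    "a3mega-a3meganodeset-1": "/CCCC/dddd/1111",
    "a3mega-a3meganodeset-2": "/CCCC/aaaa/2222",
}

# Splitting on "/" yields ["", cluster, rack, host]; sorting on that list
# places nodes in the same rack next to each other.
topo_sorted = sorted(hosts, key=lambda name: hosts[name].split("/"))
print(topo_sorted)
# ['a3mega-a3meganodeset-2', 'a3mega-a3meganodeset-0', 'a3mega-a3meganodeset-1']
```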

## Detailed Explanation

### Setup

First we need to install PyTorch. While the same concepts transfer to using
enroot/pyxis to launch containerized workloads, in this example we will just
use a local Python environment:

```sh
# Creates a local python3 env and installs pytorch
sbatch install.sh
```

### Job Submission Script
Now let's review the `topological_pytorch.sh` batch job submission script.

First we set the requisite GPUDirect-TCPXO environment variables:

```sh
NCCL_LIB_DIR="/var/lib/tcpxo/lib64" source /var/lib/tcpxo/lib64/nccl-env-profile.sh
export NCCL_FASTRAK_CTRL_DEV=enp0s12
export NCCL_FASTRAK_IFNAME=enp6s0,enp7s0,enp13s0,enp14s0,enp134s0,enp135s0,enp141s0,enp142s0
export NCCL_SOCKET_IFNAME=enp0s12
export NCCL_FASTRAK_LLCM_DEVICE_DIRECTORY=/dev/aperture_devices
```

and activate our python environment:

```sh
source env/bin/activate
```

Next we demonstrate the standard behavior that torchrun would use, which does
not incorporate topology into how it orders ranks among the nodes.

```sh
# Demonstrate standard behavior
echo "Standard"
# Set the MASTER_ADDR to the first node in the Slurm Job Nodelist
export MASTER_ADDR=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1)
# For torchrun, we only launch 1 task per node, and instruct torchrun to create
# 8 (SLURM_GPUS_PER_NODE) processes per node.
srun --ntasks-per-node=1 --nodes $SLURM_NNODES \
    python -m torch.distributed.run \
    --nproc_per_node ${SLURM_GPUS_PER_NODE} \
    --rdzv_endpoint $MASTER_ADDR:$MASTER_PORT \
    --rdzv_backend c10d \
    --nnodes $SLURM_NNODES topological_pytorch.py
```

torchrun will launch 8 tasks per node, and assign ranks lexicographically
across nodes according to the hostnames.

For topologically-aware behavior, we launch all the tasks with Slurm's `srun`
and use the Slurm environment variables to initialize the torch distributed
process group, as described in the next section. To incorporate topology
into Slurm, we first sort the nodes in the job:

```sh
# Demonstrate how to incorporate topology
# First sort hosts by topology
srun bash -c 'curl -s "http://metadata.google.internal/computeMetadata/v1/instance/attributes/physical_host" -H "Metadata-Flavor: Google"; echo /$SLURMD_NODENAME' |
    sort -t / -s -k 1,4 |
    awk -F "/" '{print $NF}' >/var/tmp/topo_sorted_hostfile
# The SLURM_HOSTFILE signals to Slurm how to order nodes
export SLURM_HOSTFILE=/var/tmp/topo_sorted_hostfile
```
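
With the topology from the Quickstart output above, the resulting
`/var/tmp/topo_sorted_hostfile` would list the node names grouped by rack,
for example:

```text
a3mega-a3meganodeset-2
a3mega-a3meganodeset-0
a3mega-a3meganodeset-6
a3mega-a3meganodeset-3
a3mega-a3meganodeset-1
a3mega-a3meganodeset-4
a3mega-a3meganodeset-5
a3mega-a3meganodeset-7
```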

Setting `SLURM_HOSTFILE` controls the order in which Slurm assigns
`SLURM_PROCID`, which we will later use to order NCCL ranks in PyTorch.
The last thing we need to do is reset the `MASTER_ADDR` environment variable
to match the first node in the hostfile and then launch the job, adding the
`--topology` flag to the script arguments to trigger the topology logic.

```sh
# Set the MASTER_ADDR to the first node in the hostfile
export MASTER_ADDR=$(head -n 1 ${SLURM_HOSTFILE})

srun python topological_pytorch.py --topology
```

### Test Script
Next, review the `topological_pytorch.py` script. A top-level `--topology`
flag controls whether PyTorch is initialized using torchrun (when `False`) or
Slurm (when `True`). In the latter case, the Slurm environment variables ensure
that the node ordering Slurm uses is translated into the PyTorch ranks.

```python
if args.topology:
    # These are populated by Slurm
    local_rank = int(os.environ["SLURM_LOCALID"])
    global_rank = int(os.environ["SLURM_PROCID"])
    world_size = int(os.environ["SLURM_NPROCS"])
    procs_per_node = int(os.environ["SLURM_NTASKS_PER_NODE"])

    # Must set rank and world_size based on SLURM_PROCID and SLURM_NPROCS
    dist.init_process_group("nccl", rank=global_rank, world_size=world_size)
else:
    # These are populated by torchrun
    local_rank = int(os.environ["LOCAL_RANK"])
    global_rank = int(os.environ["RANK"])
    world_size = int(os.environ["WORLD_SIZE"])
    procs_per_node = int(os.environ["LOCAL_WORLD_SIZE"])

    # Torchrun handles rank allocation
    dist.init_process_group("nccl")
```

The remainder of the script demonstrates the functionality. We use
`dist.all_gather_object` to collect the rank, hostname, and `physical_host` from
each PyTorch worker, and then print the results in rank order from global rank
0. The ordering of this output will vary depending on the node ordering that
Slurm uses to launch the tasks.
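
Beyond this demonstration, the same initialization carries over to a real
training job. Below is a minimal, hypothetical sketch, not part of this
example, of wrapping a placeholder model in `DistributedDataParallel` once the
process group has been initialized this way:

```python
import os

import torch
from torch.nn.parallel import DistributedDataParallel as DDP

# Assumes the process group has already been initialized as shown above.
local_rank = int(os.environ["SLURM_LOCALID"])
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(1024, 1024).cuda()  # placeholder model, not part of this example
ddp_model = DDP(model, device_ids=[local_rank])

# Gradient all-reduces issued by DDP now run over NCCL ranks that follow the
# topological ordering established via SLURM_HOSTFILE.
```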

### Running the Test

Run the following command to demonstrate topologically-aware PyTorch:

```sh
# Run an example of setting SLURM_HOSTFILE based on topology
sbatch topological_pytorch.sh
```

The output shows the standard ordering followed by the topologically sorted
ordering. See the Quickstart section above for an example.
26 changes: 26 additions & 0 deletions examples/machine-learning/a3-megagpu-8g/topological_pytorch/install.sh
#!/bin/bash
# Copyright 2024 "Google LLC"
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

#filename: install.sh
#submit with `sbatch install.sh`

#SBATCH --partition=a3mega
#SBATCH --gpus-per-node=8
#SBATCH --ntasks-per-node=1
#SBATCH --nodes 1

python3 -m venv env
source env/bin/activate
pip3 install --pre torch torchvision torchaudio
68 changes: 68 additions & 0 deletions examples/machine-learning/a3-megagpu-8g/topological_pytorch/topological_pytorch.py
#!/usr/bin/env python
# Copyright 2024 "Google LLC"
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

#filename: topological_pytorch.py
import os
import torch
import torch.distributed as dist
import socket
import subprocess
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--topology", action=argparse.BooleanOptionalAction)
args = parser.parse_args()

hostname = socket.getfqdn()
if args.topology:
    # These are populated by Slurm
    local_rank = int(os.environ["SLURM_LOCALID"])
    global_rank = int(os.environ["SLURM_PROCID"])
    world_size = int(os.environ["SLURM_NPROCS"])
    procs_per_node = int(os.environ["SLURM_NTASKS_PER_NODE"])

    # Must set rank and world_size based on SLURM_PROCID and SLURM_NPROCS
    dist.init_process_group("nccl", rank=global_rank, world_size=world_size)
else:
    # These are populated by torchrun
    local_rank = int(os.environ["LOCAL_RANK"])
    global_rank = int(os.environ["RANK"])
    world_size = int(os.environ["WORLD_SIZE"])
    procs_per_node = int(os.environ["LOCAL_WORLD_SIZE"])

    # Torchrun handles rank allocation
    dist.init_process_group("nccl")

# Must attach device based on the local rank.
torch.cuda.set_device(local_rank)

# Get the physical host for the current task to print later
physical_host = subprocess.check_output([
    "curl", "-s",
    "http://metadata.google.internal/computeMetadata/v1/instance/attributes/physical_host",
    "-H", "Metadata-Flavor: Google"
]).decode('utf-8')

# Create an output to collect from the all-gather
output = [None for _ in range(world_size)]
dist.all_gather_object(output, [global_rank, hostname, physical_host])
if global_rank == 0:
    # Print out the ordered set of hostnames gathered from all ranks
    print("rank\thostname\tphysical_host")
    # Print one entry per node (stride of procs_per_node) to keep the output compact
    for result in output[::procs_per_node]:
        print("\t".join(map(str, result)))

dist.destroy_process_group()
62 changes: 62 additions & 0 deletions examples/machine-learning/a3-megagpu-8g/topological_pytorch/topological_pytorch.sh
#!/bin/bash
# Copyright 2024 "Google LLC"
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# shellcheck disable=SC2016

#filename: topological_pytorch.sh
#submit with `sbatch topological_pytorch.sh`
#SBATCH --partition=a3mega
#SBATCH --gpus-per-node=8
#SBATCH --ntasks-per-node=8
#SBATCH --nodes 8

NCCL_LIB_DIR="/var/lib/tcpxo/lib64" source /var/lib/tcpxo/lib64/nccl-env-profile.sh
export NCCL_FASTRAK_CTRL_DEV=enp0s12
export NCCL_FASTRAK_IFNAME=enp6s0,enp7s0,enp13s0,enp14s0,enp134s0,enp135s0,enp141s0,enp142s0
export NCCL_SOCKET_IFNAME=enp0s12
export NCCL_FASTRAK_LLCM_DEVICE_DIRECTORY=/dev/aperture_devices

source env/bin/activate

export MASTER_PORT=12345
export OMP_NUM_THREADS=12

# Demonstrate standard behavior
echo "Standard"
# Set the MASTER_ADDR to the first node in the Slurm Job Nodelist
export MASTER_ADDR=$(scontrol show hostnames "${SLURM_JOB_NODELIST}" | head -n 1)
# For torchrun, we only launch 1 task per node, and instruct torchrun to create
# 8 (SLURM_GPUS_PER_NODE) processes per node.
srun --ntasks-per-node=1 --nodes "${SLURM_NNODES}" \
python -m torch.distributed.run \
--nproc_per_node "${SLURM_GPUS_PER_NODE}" \
--rdzv_endpoint "${MASTER_ADDR}":"${MASTER_PORT}" \
--rdzv_backend c10d \
--nnodes "${SLURM_NNODES}" topological_pytorch.py

# Demonstrate how to incorporate topology
# First sort hosts by topology
srun bash -c 'curl -s "http://metadata.google.internal/computeMetadata/v1/instance/attributes/physical_host" -H "Metadata-Flavor: Google"; echo /${SLURMD_NODENAME}' |
sort -t / -s -k 1,4 |
awk -F "/" '{print $NF}' >/var/tmp/topo_sorted_hostfile
# The SLURM_HOSTFILE signals to Slurm how to order nodes
export SLURM_HOSTFILE=/var/tmp/topo_sorted_hostfile
# Set the MASTER_ADDR to the first node in the hostfile
export MASTER_ADDR=$(head -n 1 "${SLURM_HOSTFILE}")

echo "Topologically aware"
# Run 8 tasks per node (inherited from the job script), since we aren't using
# torchrun in this case. Supply the --topology flag to trigger the topology
# logic in the script.
srun python topological_pytorch.py --topology
