The following are complete SLURM scripts that demonstrate how to integrate various launchers with software that uses `torch.distributed` (but they should be easily adaptable to other distributed environments).
- `torchrun` - to be used with PyTorch distributed (sketched below).
- `accelerate` - to be used with HF Accelerate (sketched below).
- `lightning` - to be used with Lightning ("PyTorch Lightning" and "Lightning Fabric") (sketched below).
- `srun` - to be used with the native SLURM launcher - here we have to manually preset the environment variables that `torch.distributed` expects (sketched below).
All of these scripts use `torch-distributed-gpu-test.py` as the demo script, which you can copy here with just:

```bash
cp ../../../debug/torch-distributed-gpu-test.py .
```

assuming you cloned this repo. But you can replace it with any other script you need.
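
For orientation, here is a minimal sketch of what the `torchrun` variant of such a script can look like, assuming 2 nodes with 8 GPUs each (the job name, node/GPU counts, and port 6000 are placeholders to adapt to your cluster):

```bash
#!/bin/bash
#SBATCH --job-name=torchrun-demo
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1    # one torchrun per node; it spawns the per-GPU workers
#SBATCH --gres=gpu:8
#SBATCH --time=0:10:00

# use the first node of the allocation as the rendezvous host
MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
MASTER_PORT=6000

srun torchrun \
    --nnodes "$SLURM_NNODES" \
    --nproc_per_node 8 \
    --rdzv_id "$SLURM_JOB_ID" \
    --rdzv_backend c10d \
    --rdzv_endpoint "$MASTER_ADDR:$MASTER_PORT" \
    torch-distributed-gpu-test.py
```

Note that `srun` here only starts one `torchrun` per node; `torchrun` itself then forks the 8 per-GPU worker processes and handles the rendezvous.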
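
A similar sketch for `accelerate`, under the same placeholder assumptions. The launcher must be told its per-node rank explicitly, so we derive it from `SLURM_NODEID` inside each task:

```bash
#!/bin/bash
#SBATCH --job-name=accelerate-demo
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1    # one accelerate launcher per node
#SBATCH --gres=gpu:8
#SBATCH --time=0:10:00

export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_PORT=6000

# single quotes: $SLURM_NODEID must be expanded per task, not at submission time
srun bash -c 'accelerate launch \
    --multi_gpu \
    --num_machines $SLURM_NNODES \
    --machine_rank $SLURM_NODEID \
    --num_processes $((SLURM_NNODES * 8)) \
    --main_process_ip $MASTER_ADDR \
    --main_process_port $MASTER_PORT \
    torch-distributed-gpu-test.py'
```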
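
Lightning needs no external launcher at all: when `srun` starts one task per GPU, the `Trainer` picks up its rank and world size from the SLURM environment by itself. A sketch, where `my_lightning_script.py` is a hypothetical stand-in for your own `Trainer`-based script (the plain `torch-distributed-gpu-test.py` doesn't exercise this path):

```bash
#!/bin/bash
#SBATCH --job-name=lightning-demo
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8    # Lightning expects one SLURM task per GPU
#SBATCH --gres=gpu:8
#SBATCH --time=0:10:00

# no rendezvous setup needed: a Trainer(num_nodes=2, devices=8) reads the
# SLURM_* variables that srun sets for each task
srun python my_lightning_script.py
```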
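
Finally, a sketch of the `srun` variant. Here no launcher spawns workers for us: `srun` itself starts one task per GPU, and we map the `SLURM_*` variables onto the `RANK`, `LOCAL_RANK`, `WORLD_SIZE`, `MASTER_ADDR` and `MASTER_PORT` variables that `torch.distributed`'s `env://` initialization expects:

```bash
#!/bin/bash
#SBATCH --job-name=srun-demo
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8    # one task per GPU: srun launches every worker itself
#SBATCH --gres=gpu:8
#SBATCH --time=0:10:00

export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_PORT=6000

# single quotes: the SLURM_* vars differ per task and must be expanded there
srun bash -c 'RANK=$SLURM_PROCID \
    LOCAL_RANK=$SLURM_LOCALID \
    WORLD_SIZE=$SLURM_NTASKS \
    python torch-distributed-gpu-test.py'
```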