Install NCCL for Derecho #22
Comments
Would add this similarly for Casper, but one step at a time.
Hi Daniel - some thoughts and questions:
Thanks Brian! The plugin offers a significant performance improvement because it lets NCCL use Slingshot's hsn interfaces; without it, NCCL tries to use InfiniBand or falls back to Ethernet. So essentially, we need the plugin either added into the NCCL module or added as a required dependency of the current NCCL installs. There's an ongoing email chain with NVIDIA addressing some of these questions. I can forward it to you if you'd like to review it.
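To make the interface selection described above concrete, here is a minimal sketch of an environment a job could launch with. `NCCL_SOCKET_IFNAME`, `NCCL_DEBUG`, and libfabric's `FI_PROVIDER` are existing variables, but the values chosen (`hsn`, `cxi`, `INFO`) and the helper name `nccl_slingshot_env` are assumptions for illustration, not settings confirmed in this thread.

```python
import os


def nccl_slingshot_env(base=None):
    """Return a copy of `base` (default: os.environ) with NCCL/libfabric hints.

    Hypothetical helper: the values are assumptions for a Slingshot system
    and should be checked against site documentation.
    """
    env = dict(os.environ if base is None else base)
    # NCCL treats this value as an interface-name prefix, so "hsn" is
    # intended to match hsn0, hsn1, ... (assumed interface names).
    env.setdefault("NCCL_SOCKET_IFNAME", "hsn")
    # Ask libfabric for the Slingshot (cxi) provider (assumed to be installed).
    env.setdefault("FI_PROVIDER", "cxi")
    # Have NCCL log which network transport it selected at startup.
    env.setdefault("NCCL_DEBUG", "INFO")
    return env
```

Values already set by the user are kept, since `setdefault` only fills in missing keys. The result can be passed as `env=` to `subprocess.run` when launching a job, and the `NCCL_DEBUG=INFO` output then shows which network NCCL actually picked.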
Sure. Happy to take a look.
Would like to install NCCL as a dedicated module that can be linked into PyTorch / TensorFlow / other programs that want to use optimized internode collective / peer communication.
Repos to reference
NVIDIA NCCL: https://github.com/NVIDIA/nccl
NERSC's NCCL Install with plugin: https://github.com/NERSC/nccl-ofi-plugin
AWS NCCL OFI Plugin (adapted for Slingshot): https://github.com/aws/aws-ofi-nccl
Optional but likely ideal: GDRCopy for GPUDirect RDMA (not presently possible for the Lustre storage network only): https://github.com/NVIDIA/gdrcopy
An example build script for "older" but working versions of NCCL is here: /glade/u/home/dhoward/work/nccl-ofi-plugin/build.sh
I would like to check whether later versions also function well. Most notably, this build process must link against libfabric and CUDA, so tracking how we want to update or maintain this through later versions of these libraries will be important to consider. Installing this via Spack may be possible using the NCCL package plus the AWS OFI Plugin package. However, the Spack version of NCCL also depends on rdma-core, which is explicitly for InfiniBand, so there might be issues here.
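Because the build has to link against libfabric and CUDA, a quick check that the expected shared libraries are visible in a loaded module environment may help when testing later versions. The following is a minimal sketch: the helper name `find_shared_libs` is hypothetical, and the library names used in the usage note are the conventional ones, assumed rather than taken from the build script.

```python
import os
from pathlib import Path


def find_shared_libs(names, search_path=None):
    """Map each library name to the first matching file found, or None.

    `search_path` is an os.pathsep-separated directory list and defaults
    to LD_LIBRARY_PATH. Versioned files such as libfabric.so.1 also match.
    """
    if search_path is None:
        search_path = os.environ.get("LD_LIBRARY_PATH", "")
    dirs = [Path(d) for d in search_path.split(os.pathsep) if d]
    found = {}
    for name in names:
        found[name] = None
        for d in dirs:
            if not d.is_dir():
                continue
            # Accept "name" itself or any versioned suffix ("name.1", ...).
            matches = sorted(p for p in d.glob(name + "*") if p.is_file())
            if matches:
                found[name] = str(matches[0])
                break
    return found
```

For example, `find_shared_libs(["libnccl.so", "libnccl-net.so", "libfabric.so", "libcudart.so"])` reports which of NCCL, the OFI plugin, libfabric, and the CUDA runtime are on the current library path. This only covers `LD_LIBRARY_PATH`; libraries found through an rpath or the system linker cache would not show up.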