Install NCCL for Derecho #22

dphow · 2024-07-24T14:15:48Z

Would like to install NCCL as a dedicated module that can be linked into PyTorch, TensorFlow, and other programs that want optimized internode collective and peer-to-peer communication.

Repos to reference
NVIDIA NCCL: https://github.com/NVIDIA/nccl
NERSC's NCCL Install with plugin: https://github.com/NERSC/nccl-ofi-plugin
AWS NCCL OFI Plugin (adapted for Slingshot): https://github.com/aws/aws-ofi-nccl

Optional but likely ideal - GDRCopy for GPU Direct RDMA (not presently possible for Lustre Storage network only): https://github.com/NVIDIA/gdrcopy

An example build script for "older" but working versions of NCCL is at /glade/u/home/dhoward/work/nccl-ofi-plugin/build.sh. I would like to check whether later versions also function well.

Most notably, this build process must link with libfabric and CUDA. It will be important to track how we want to update or maintain the install as later versions of these libraries come out. Installing this via Spack may be possible using the NCCL package plus the AWS OFI Plugin package. The Spack version of NCCL also depends on rdma-core, which is explicitly for InfiniBand, so there might be issues here.
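A quick way to see whether the current environment can even resolve those link dependencies is sketched below. This is only a sanity check under assumptions: the short names `fabric`, `cudart`, and `nccl` are the usual linker names for these libraries, the helper `find_link_dependencies` is hypothetical, and what it finds depends entirely on which modules are loaded.

```python
import ctypes.util

# Linker-style short names (as in -lfabric, -lcudart, -lnccl).
# Assumption: these are the names the libraries are installed under here.
REQUIRED_LIBS = ("fabric", "cudart", "nccl")


def find_link_dependencies(names=REQUIRED_LIBS):
    """Map each short name to what the system resolves it to, or None if not found."""
    return {name: ctypes.util.find_library(name) for name in names}


if __name__ == "__main__":
    for name, found in find_link_dependencies().items():
        print(f"lib{name}: {found or 'NOT FOUND'}")
```

Running it before and after loading the relevant modules shows which dependencies the build would pick up. It does not check versions or ABI compatibility.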

dphow · 2024-07-24T14:16:34Z

Would add this similarly for Casper, but one step at a time.

vanderwb · 2024-07-24T14:51:28Z

Hi Daniel - some thoughts and questions:

NCCL is already installed as a module on both systems. Are you looking for a specific version of NCCL, a more recent version, or is something wrong with the NCCL installs currently there?
Does the AWS NCCL OFI plugin not use HPE's CXI GPU RDMA? Or to put it another way, is GDRCopy the only way to get GPU Direct RDMA with NCCL?
I had modified the AWS NCCL OFI plugin Spack package to utilize your script settings and put that on Gust. Did you ever have any success with that? It would be relatively easy to make this a load dependency for the NCCL module on Derecho.
IIRC, NCCL is a binary install, so while the Spack package wants rdma-core as a dependency, maybe it doesn't matter much in practice.
Do you want my help with any of this? Are you planning to make an install - if so, with Spack, or otherwise?

dphow · 2024-07-24T15:09:40Z

Thanks Brian!

The plugin offers a significant performance improvement because it lets NCCL use Slingshot's hsn interfaces; without it, NCCL tries to use InfiniBand or falls back to Ethernet. So essentially, we need the plugin either built into the NCCL module or added as a required dependency of the current NCCL installs.
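As a rough illustration of steering a job toward the hsn interfaces, here is a minimal sketch that prepares an environment for a launched process. The variables themselves are real (`FI_PROVIDER` is read by libfabric; `NCCL_SOCKET_IFNAME` and `NCCL_DEBUG` are read by NCCL), but the values chosen and the helper name `nccl_slingshot_env` are assumptions for illustration, not tested settings for Derecho.

```python
import os


def nccl_slingshot_env(base=None, debug=True):
    """Return a copy of the environment with Slingshot-oriented settings added."""
    env = dict(os.environ if base is None else base)
    # Assumption: the CXI libfabric provider is the one to use on Slingshot.
    env["FI_PROVIDER"] = "cxi"
    # NCCL treats this as a prefix, so "hsn" matches hsn0, hsn1, and so on.
    env["NCCL_SOCKET_IFNAME"] = "hsn"
    if debug:
        # Makes NCCL log which network transport it selected at startup.
        env["NCCL_DEBUG"] = "INFO"
    return env


if __name__ == "__main__":
    env = nccl_slingshot_env()
    for key in ("FI_PROVIDER", "NCCL_SOCKET_IFNAME", "NCCL_DEBUG"):
        print(f"{key}={env[key]}")
```

The returned dictionary can be passed as the `env` argument of `subprocess.run` when launching a job. With `NCCL_DEBUG=INFO`, the startup log should show whether the plugin or a fallback transport was selected.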

There's an ongoing email chain addressing some of these questions with NVIDIA. I can forward you this if you'd like to review it.

vanderwb · 2024-07-24T15:13:10Z

Sure. Happy to take a look.
