Install NCCL for Derecho #22

Open
dphow opened this issue Jul 24, 2024 · 4 comments

Comments

@dphow
Member

dphow commented Jul 24, 2024

I would like to install NCCL as a dedicated module that can be linked into PyTorch, TensorFlow, and other programs that want optimized internode collective and peer-to-peer communication.

Repos to reference
NVIDIA NCCL: https://github.com/NVIDIA/nccl
NERSC's NCCL Install with plugin: https://github.com/NERSC/nccl-ofi-plugin
AWS NCCL OFI Plugin (adapted for Slingshot): https://github.com/aws/aws-ofi-nccl

Optional but likely ideal: GDRCopy for GPU Direct RDMA (not presently possible for Lustre Storage network only): https://github.com/NVIDIA/gdrcopy

An example build script for "older" but working versions of NCCL is at /glade/u/home/dhoward/work/nccl-ofi-plugin/build.sh. I would like to check whether later versions also work well.

Most notably, this build process must link against libfabric and CUDA. It will be important to decide how we update and maintain the build as later versions of these libraries arrive. Installing via Spack may be possible using the NCCL package plus the AWS OFI plugin package. However, the Spack NCCL package also depends on rdma-core, which is specific to InfiniBand, so there may be issues here.
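One way to keep track of the link dependencies mentioned above is to inspect the `ldd` output of whichever shared object the build produces. The following is a minimal sketch, not part of the referenced build script: the function name `linked_libraries`, the watched library prefixes, and the choice of `libcudart` as the CUDA library to look for are all assumptions for illustration. `libibverbs` is included because it ships with rdma-core.

```python
# Hypothetical helper: summarize which libraries of interest appear in `ldd` output.
# The watched prefixes are assumptions; adjust them to the actual build.
WATCHED = ("libfabric", "libcudart", "libibverbs")


def linked_libraries(ldd_output):
    """Map each watched prefix to its resolved path, "not found", or None if absent."""
    found = {name: None for name in WATCHED}
    for line in ldd_output.splitlines():
        if "=>" not in line:
            continue  # skip entries such as the dynamic loader line
        soname, _, target = line.partition("=>")
        soname = soname.strip()
        target = target.strip()
        if not target:
            continue
        if target.startswith("not found"):
            resolved = "not found"
        else:
            resolved = target.split()[0]  # drop the trailing load address
        for name in WATCHED:
            if soname.startswith(name + ".so"):
                found[name] = resolved
    return found
```

Feeding it the text captured from `ldd <path to the built library>` gives a quick view of whether libfabric and CUDA resolved, and whether an rdma-core library was pulled in at all.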

@dphow
Member Author

dphow commented Jul 24, 2024

I would add this for Casper in the same way, but one step at a time.

@vanderwb
Collaborator

Hi Daniel - some thoughts and questions:

  1. NCCL is already installed as a module on both systems. Are you looking for a specific version of NCCL, a more recent version, or is something wrong with the NCCL installs currently there?
  2. Does the AWS NCCL OFI plugin not use HPE's CXI GPU RDMA? Or to put it another way, is GDRCopy the only way to get GPU Direct RDMA with NCCL?
  3. I had modified the AWS NCCL OFI plugin Spack package to utilize your script settings and put that on Gust. Did you ever have any success with that? It would be relatively easy to make this a load dependency for the NCCL module on Derecho.
  4. IIRC, NCCL is a binary install, so while the Spack package wants rdma-core as a dependency, maybe it doesn't matter much in practice.
  5. Do you want my help with any of this? Are you planning to make an install - if so, with Spack, or otherwise?

@dphow
Member Author

dphow commented Jul 24, 2024

Thanks Brian!

The plugin offers a significant performance improvement because it lets NCCL use Slingshot's hsn interfaces; without it, NCCL tries to use InfiniBand or falls back to Ethernet. So essentially, we need the plugin either built into the NCCL module or added as a required dependency of the current NCCL installs.
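Whichever packaging route is chosen, a loaded module can be sanity-checked by confirming that the plugin library is actually reachable on the library search path. The sketch below assumes the plugin is installed under the file name `libnccl-net.so` (the default name NCCL looks for when loading a network plugin) and that the module exposes it through `LD_LIBRARY_PATH`; the helper name `find_plugin` is hypothetical.

```python
import os

# Assumption: the plugin build installs a library with NCCL's default plugin name.
PLUGIN_NAME = "libnccl-net.so"


def find_plugin(search_path, plugin_name=PLUGIN_NAME):
    """Return the first directory on a path-separated search path that holds the plugin, else None."""
    for directory in search_path.split(os.pathsep):
        if directory and os.path.isfile(os.path.join(directory, plugin_name)):
            return directory
    return None


if __name__ == "__main__":
    hit = find_plugin(os.environ.get("LD_LIBRARY_PATH", ""))
    print("plugin directory:", hit if hit else "not found on LD_LIBRARY_PATH")
```

This only shows that the file is discoverable, not that NCCL selected it at run time. Running a job with `NCCL_DEBUG=INFO` set and reading the network selection in the log is the more direct check.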

There's an ongoing email chain with NVIDIA addressing some of these questions. I can forward it to you if you'd like to review it.

@vanderwb
Collaborator

Sure. Happy to take a look.
