GPU Benchmark_ITT segfaults with MPI and ranks > 1 #393
Comments
Added comment about segfaults from other benchmarks run on our local cluster |
On Perlmutter, I built git branch c0d56a1 according to the recipe in ./systems/Perlmutter. Benchmark_ITT generates
|
Unfortunately, this problem persists on the develop branch at commit 042ab1a, dated Mon Jun 27, 2022. |
Any updates on this? |
Unfortunately, no updates. I see similar segfaults on systems other than Perlmutter. I suspect it is mostly a problem with the MPICH family of MPI implementations and later versions of Grid, though OpenMPI has also shown segfaults. |
Are you using GPU-aware MPI? We have seen several unexplained segfaults with it that vanish when using a normal (non-GPU-aware) build of MPI. So far, the implementors have not been motivated to fix these. |
I see the same segfaults using CUDA-aware OpenMPI, but I cannot confirm whether they also occur with a normal MPI build. Do you suggest using a normal OpenMPI build instead? |
Yes |
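For anyone wanting to confirm which kind of build they are running before switching: Open MPI reports whether it was compiled with CUDA support through `ompi_info`. Below is a minimal sketch, assuming an Open MPI installation with `ompi_info` on the PATH. The helper name `cuda_support_from_ompi_info` is made up for illustration, and MPICH-family builds need a different check.

```shell
#!/bin/sh
# Hypothetical helper: read `ompi_info --parsable` output on stdin and
# print the value of the CUDA-support flag ("true" or "false").
cuda_support_from_ompi_info() {
  grep 'mpi_built_with_cuda_support:value' | awk -F: '{print $NF}'
}

if command -v ompi_info >/dev/null 2>&1; then
  # Assumes Open MPI; a CUDA-aware build is expected to print "true".
  ompi_info --parsable --all | cuda_support_from_ompi_info
else
  echo "ompi_info not found; this check only applies to Open MPI" >&2
fi
```

An empty result only means the parameter was not listed by `ompi_info`; it should not be read as proof that the build lacks CUDA support.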
there must be something else going on:
|
Hi,
Benchmark_ITT segfaults in MPI runs when nranks > 1, just after printing "Initialised RNGs". I have observed this same segfault on Perlmutter as well as on a local cluster.
$ mpirun -np 2 ./Benchmark_ITT --mpi 1.1.1.2 --shm 2048
Git tag: 605cf40, although I found the same issue when trying earlier working sets as far back as mid-February.
Environment: 3) gnu10/10.2.0 4) cuda11/11.6.0 5) openmpi3/3.1.4
A similar environment was used on Perlmutter: gnu10 and CUDA 11.5, as I recall.
$ ../configure --enable-simd=GPU --enable-accelerator=cuda --enable-comms=mpi3-auto --enable-gen-simd-width=32 --enable-openmp CXX=nvcc CXXFLAGS="-ccbin mpicxx -gencode arch=compute_70,code=sm_70 -std=c++14"