Cray MPICH Issue - OFI poll failed #7061
saisandeepdammati
started this conversation in
General
Is this with the HPE Slingshot network?
Hi, yes, the machine uses the HPE Slingshot network.
Howdy!
I am running a job with a parallel, adaptive-mesh-refinement-based CFD code on a Cray machine, Carpenter at ERDC. The code is written in C++ and compiled with cray-mpich (version 8.1.26).
The job uses 4224 MPI processes on 22 nodes with 192 cores per node. It runs for some time (15-20 minutes) and then crashes with a signal 9 error and the following MPICH error (the complete error file is attached as a text file):
I have looked at all the MPI_Send calls in the code and they look sensible, yet the runs still crash with this error. Is this a familiar issue? Could you please suggest a workaround or a fix?
Thanks in advance.
cray_mpich_error_carpenter.txt