-
Notifications
You must be signed in to change notification settings - Fork 280
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Wrong data with MPI send/recv and pipelining on Intel GPUs #7139
Comments
I can reproduce this on Aurora with commit |
Thanks for the reproducer. it appears in GPU pipelining, there is potentially scenarios that chunks are written into receive buffers out-of-order. I created a PR #7182 to fix it. |
I also confirmed this fixes the reproducer, however I now get hangs for some specific cases when running a full application with pipelining when not setting a larger buffer size. The cases seem to involve messages of different sizes, where some messages are much larger than the rest. I don't know the exact requirements yet and don't have a simple reproducer, but will keep trying to see if I can get one. |
I'm now getting hangs when running this test case with the newly compiled MPICH currently available in the alcf_kmd_val Aurora queue. |
I was able to reproduce the hang at 2 nodes with 1 process per node: The backtrace of the two ranks is:
rank 1:
I guess they are both in the Waitall waiting for something. |
We're getting incorrect results in application code when using MPIR_CVAR_CH4_OFI_ENABLE_GPU_PIPELINE=1 if the buffer size MPIR_CVAR_CH4_OFI_GPU_PIPELINE_BUFFER_SZ isn't set large enough. Setting it larger seems to work, but MPI should still give correct results (with possible performance hit, or give an error) if it is not set large enough. The full code is fairly complicated, but I have a simple reproducer which can somewhat reproduce the issue. The reproducer can easily fail if the buffer size is set lower than the default, but it doesn't seem to fail for the default size on up to 8 nodes. With a buffer size of 512k it fails easily on 4 nodes, and with 256k will fail regularly on 2 nodes.
Reproducer
sendrecvgpu.cc
mpicxx -fsycl -qopenmp sendrecvgpu.cc -o sendrecvgpu
export MPIR_CVAR_CH4_OFI_ENABLE_GPU_PIPELINE=1
export MPIR_CVAR_CH4_OFI_GPU_PIPELINE_BUFFER_SZ=$((256*1024))
export ZE_FLAT_DEVICE_HIERARCHY=FLAT
mpiexec -np 24 --ppn 12 ./sendrecvgpu
The text was updated successfully, but these errors were encountered: