
Wrong data with MPI send/recv and pipelining on Intel GPUs #7139

Open
jcosborn opened this issue Sep 12, 2024 · 6 comments
Labels
aurora, need confirm (need to verify if the issue still exists with latest code)

Comments

@jcosborn

We're getting incorrect results in application code when using MPIR_CVAR_CH4_OFI_ENABLE_GPU_PIPELINE=1 if the buffer size MPIR_CVAR_CH4_OFI_GPU_PIPELINE_BUFFER_SZ isn't set large enough. Setting it larger seems to work around the problem, but MPI should still give correct results (possibly with a performance hit) or raise an error when it is not set large enough. The full code is fairly complicated, but I have a simple reproducer that partially reproduces the issue. The reproducer fails easily if the buffer size is set lower than the default, but it doesn't seem to fail at the default size on up to 8 nodes. With a buffer size of 512k it fails easily on 4 nodes, and with 256k it fails regularly on 2 nodes.
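
For scale, the messages in the reproducer below are up to nmax = 256*1024 doubles, i.e. 2 MiB each; assuming the pipeline stages each message through chunks of MPIR_CVAR_CH4_OFI_GPU_PIPELINE_BUFFER_SZ bytes, a 256 KiB buffer corresponds to roughly 8 chunks per message and a 512 KiB buffer to roughly 4.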

Reproducer

sendrecvgpu.cc

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <sycl/sycl.hpp>

//const int nmesg = 2;
const int nmesg = 16;
//const int nmesg = 24;
//const int nmesg = 32;
//const int nrep = 1;
const int nrep = 1000;
//const int nrep = 10000;
//const int nrep = 20000;
const int nmin = 128*1024;
//const int nmax = 128*1024;
//const int nmin = 256*1024;
const int nmax = 256*1024;
//const int nmin = 2*1024*1024;
//const int nmax = 2*1024*1024;

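// For message i (tag i), each rank receives from (rank + 2^i) mod size
// and sends to (rank - 2^i) mod size; all transfers are posted
// nonblocking and completed with the two MPI_Waitall calls below.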
void sendrecv(double *dest[], double *src[], int n) {
  int rank, size;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);
  MPI_Request sreq[nmesg], rreq[nmesg];
  for(int i=0; i<nmesg; i++) {
    int k = 1 << i;
    int recv = (rank+k) % size;
    MPI_Irecv(dest[i], n, MPI_DOUBLE, recv, i, MPI_COMM_WORLD, &rreq[i]);
  }
  for(int i=0; i<nmesg; i++) {
    int k = 1 << i;
    int send = (rank+k*size-k) % size;
    MPI_Isend(src[i], n, MPI_DOUBLE, send, i, MPI_COMM_WORLD, &sreq[i]);
  }
  MPI_Waitall(nmesg, sreq, MPI_STATUSES_IGNORE);
  MPI_Waitall(nmesg, rreq, MPI_STATUSES_IGNORE);
}

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int rank, size;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);
  char name[MPI_MAX_PROCESSOR_NAME];
  int namelen;
  MPI_Get_processor_name(name, &namelen);
  //sycl::queue q{sycl::gpu_selector_v};
  sycl::platform plat{sycl::gpu_selector_v};
  auto devs = plat.get_devices();
  int ndev = devs.size();
  int devid = rank % ndev;
  printf("%s  rank %3i  device %2i\n", name, rank, devid);
  fflush(stdout);
  MPI_Barrier(MPI_COMM_WORLD);
  sycl::queue q{devs[devid]};
  double *src[nmesg], *srcg[nmesg], *dest[nmesg], *destg[nmesg];
  for(int i=0; i<nmesg; i++) {
    src[i] = (double*)malloc(nmax*sizeof(double));
    srcg[i] = (double*)sycl::malloc_device<double>(nmax, q);
    dest[i] = (double*)malloc(nmax*sizeof(double));
    destg[i] = (double*)sycl::malloc_device<double>(nmax, q);
#pragma omp parallel for
    for(int j=0; j<nmax; j++) {
      src[i][j] = i + j;
    }
  }

  int error = 0;
  int errort = 0;
  for(int n=nmin; n<=nmax; n*=2) {
    if(rank==0) printf("Testing n = %i ...", n);
    for(int rep=0; rep<nrep; rep++) {
      //sendrecv(dest, src, n);
      for(int i=0; i<nmesg; i++) {
	q.memcpy(srcg[i], src[i], n*sizeof(double));
	q.memset(destg[i], 0, n*sizeof(double));
      }
      q.wait();
      sendrecv(destg, srcg, n);
      for(int i=0; i<nmesg; i++) {
	q.memcpy(dest[i], destg[i], n*sizeof(double));
      }
      q.wait();
      for(int i=0; i<nmesg; i++) {
	for(int j=0; j<n; j++) {
	  if (dest[i][j] != src[i][j]) {
	    printf("\n  error %i dest[%i][%i] = %f expected %f\n", rep, i, j, dest[i][j], src[i][j]);
	    error++;
	    break;
	  }
	}
	if(error>0) break;
      }
      MPI_Allreduce(&error, &errort, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
      if (errort>0) break;
    }
    if(errort>0) {
      if (rank==0) printf(" %i errors.\n", errort);
      break;
    } else {
      if (rank==0) printf(" done.\n");
    }
  }
  MPI_Finalize();
}

mpicxx -fsycl -qopenmp sendrecvgpu.cc -o sendrecvgpu

export MPIR_CVAR_CH4_OFI_ENABLE_GPU_PIPELINE=1
export MPIR_CVAR_CH4_OFI_GPU_PIPELINE_BUFFER_SZ=$((256*1024))
export ZE_FLAT_DEVICE_HIERARCHY=FLAT
mpiexec -np 24 --ppn 12 ./sendrecvgpu
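
For comparison, running the same binary with the GPU pipeline path disabled (the default) is not expected to show the problem:

export MPIR_CVAR_CH4_OFI_ENABLE_GPU_PIPELINE=0
mpiexec -np 24 --ppn 12 ./sendrecvgpu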

@colleeneb
Collaborator

colleeneb commented Sep 13, 2024

I can reproduce this on Aurora with commit b3480ddfec1d9e98b06783aec97c082eadeca1a7 from main as well (originally tested with commit d79cd238209c787bbcbbe730f9b958afe4e852ac; updated with the test from a newer commit).

@zhenggb72
Collaborator

Thanks for the reproducer. It appears that with GPU pipelining there are scenarios where chunks can be written into the receive buffers out of order. I created PR #7182 to fix it.
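
As a toy illustration of that failure mode (this is not MPICH code; it only assumes the pipeline splits a message into fixed-size chunks that may complete out of order), writing each chunk at the next free position instead of at the offset implied by its chunk index corrupts the received data:

#include <algorithm>
#include <cstdio>
#include <random>
#include <vector>

int main() {
  const int chunk = 4, nchunks = 8, n = chunk * nchunks;
  std::vector<double> src(n), by_arrival(n), by_offset(n);
  for (int i = 0; i < n; i++) src[i] = i;

  // Chunks complete in a shuffled (out-of-order) sequence.
  std::vector<int> arrival(nchunks);
  for (int c = 0; c < nchunks; c++) arrival[c] = c;
  std::shuffle(arrival.begin(), arrival.end(), std::mt19937{12345});

  for (int slot = 0; slot < nchunks; slot++) {
    int c = arrival[slot];              // chunk index that just completed
    const double *p = &src[c * chunk];  // its payload
    // Wrong: append at the next free position, assuming in-order completion.
    std::copy(p, p + chunk, &by_arrival[slot * chunk]);
    // Right: write at the offset derived from the chunk index.
    std::copy(p, p + chunk, &by_offset[c * chunk]);
  }

  printf("placed by arrival order: %s\n", by_arrival == src ? "matches" : "corrupted");
  printf("placed by chunk offset:  %s\n", by_offset == src ? "matches" : "corrupted");
}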

@colleeneb
Collaborator

I confirmed that the reproducer passes with module load mpich/opt/develop-git.204f8cd on Aurora (which includes PR #7182). @jcosborn, if you have a chance to test this module, it would be appreciated!

@raffenet added the need confirm label on Nov 14, 2024
@jcosborn
Author

I also confirmed that this fixes the reproducer. However, I now get hangs in some specific cases when running a full application with pipelining and without setting a larger buffer size. The cases seem to involve messages of different sizes, where some messages are much larger than the rest. I don't know the exact requirements yet and don't have a simple reproducer, but I will keep trying to produce one.

@jcosborn
Author

I'm now getting hangs when running this test case with the newly compiled MPICH currently available in the alcf_kmd_val Aurora queue.

@colleeneb
Collaborator

I was able to reproduce the hang at 2 nodes with 1 process per node:

MPIR_CVAR_CH4_OFI_ENABLE_GPU_PIPELINE=1 \
MPIR_CVAR_CH4_OFI_GPU_PIPELINE_BUFFER_SZ=$((256*1024)) \
ZE_FLAT_DEVICE_HIERARCHY=FLAT \
mpiexec -np 2 --ppn 1 ./sendrecvgpu

The backtraces of the two ranks are:
rank 0:

(gdb) bt
#0  0x0000154b07fe0210 in ofi_mutex_unlock_noop () at src/common.c:988
#1  0x0000154b07ff5349 in ofi_genlock_unlock (lock=0x51984f0) at ./include/ofi_lock.h:394
#2  ofi_cq_read_entries (src_addr=0x0, count=<optimized out>, buf=<optimized out>, cq=0x5198460) at ./include/ofi_util.h:615
#3  ofi_cq_readfrom (cq_fid=0x5198460, buf=<optimized out>, count=<optimized out>, src_addr=0x0) at prov/util/src/util_cq.c:272
#4  0x0000154b2a5f620b in MPIDI_NM_progress ()
   from /opt/aurora/24.180.3/spack/unified/0.8.0/install/linux-sles15-x86_64/oneapi-2024.07.30.002/mpich-4.3.0rc2-yuj3fkn/lib/libmpi.so.0
#5  0x0000154b2a5f4f76 in MPIDI_progress_test ()
   from /opt/aurora/24.180.3/spack/unified/0.8.0/install/linux-sles15-x86_64/oneapi-2024.07.30.002/mpich-4.3.0rc2-yuj3fkn/lib/libmpi.so.0
#6  0x0000154b2a5f23e5 in MPIR_Waitall_state ()
   from /opt/aurora/24.180.3/spack/unified/0.8.0/install/linux-sles15-x86_64/oneapi-2024.07.30.002/mpich-4.3.0rc2-yuj3fkn/lib/libmpi.so.0
#7  0x0000154b2a5f2ba9 in MPIR_Waitall ()
   from /opt/aurora/24.180.3/spack/unified/0.8.0/install/linux-sles15-x86_64/oneapi-2024.07.30.002/mpich-4.3.0rc2-yuj3fkn/lib/libmpi.so.0
#8  0x0000154b2a44f9b5 in PMPI_Waitall ()
   from /opt/aurora/24.180.3/spack/unified/0.8.0/install/linux-sles15-x86_64/oneapi-2024.07.30.002/mpich-4.3.0rc2-yuj3fkn/lib/libmpi.so.0
#9  0x0000000000403030 in sendrecv (dest=0x7ffecd6fee70, src=0x7ffecd6fef70, n=131072) at t.cpp:36
#10 0x0000000000403743 in main (argc=1, argv=0x7ffecd6ff288) at t.cpp:80

rank 1:

#0  0x000014c273136fa0 in MPIDI_POSIX_eager_recv_begin ()
   from /opt/aurora/24.180.3/spack/unified/0.8.0/install/linux-sles15-x86_64/oneapi-2024.07.30.002/mpich-4.3.0rc2-yuj3fkn/lib/libmpi.so.0
#1  0x000014c2730783af in MPIDI_SHM_progress ()
   from /opt/aurora/24.180.3/spack/unified/0.8.0/install/linux-sles15-x86_64/oneapi-2024.07.30.002/mpich-4.3.0rc2-yuj3fkn/lib/libmpi.so.0
#2  0x000014c273077e8a in MPIDI_progress_test ()
   from /opt/aurora/24.180.3/spack/unified/0.8.0/install/linux-sles15-x86_64/oneapi-2024.07.30.002/mpich-4.3.0rc2-yuj3fkn/lib/libmpi.so.0
#3  0x000014c2730753e5 in MPIR_Waitall_state ()
   from /opt/aurora/24.180.3/spack/unified/0.8.0/install/linux-sles15-x86_64/oneapi-2024.07.30.002/mpich-4.3.0rc2-yuj3fkn/lib/libmpi.so.0
#4  0x000014c273075ba9 in MPIR_Waitall ()
   from /opt/aurora/24.180.3/spack/unified/0.8.0/install/linux-sles15-x86_64/oneapi-2024.07.30.002/mpich-4.3.0rc2-yuj3fkn/lib/libmpi.so.0
#5  0x000014c272ed29b5 in PMPI_Waitall ()
   from /opt/aurora/24.180.3/spack/unified/0.8.0/install/linux-sles15-x86_64/oneapi-2024.07.30.002/mpich-4.3.0rc2-yuj3fkn/lib/libmpi.so.0
#6  0x0000000000403030 in sendrecv (dest=0x7ffda14eea60, src=0x7ffda14eeb60, n=131072) at t.cpp:36
#7  0x0000000000403743 in main (argc=1, argv=0x7ffda14eee78) at t.cpp:80

It looks like both ranks are stuck in MPI_Waitall, waiting for requests that never complete.
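
Backtraces like these can be collected by attaching gdb to the hung ranks, e.g. (with <pid> replaced by a rank's process ID on the node):

gdb -p <pid> -batch -ex "thread apply all bt"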
