
One-sided communications in MPICH are considerably slower than those in Aurora MPICH #7263

Open
victor-anisimov opened this issue Jan 14, 2025 · 3 comments


@victor-anisimov

Tests conducted on 144-node runs in the alcf_kmd_val queue show that one-sided communications in MPICH 4.3.0rc2 are about 18% slower than in the default Aurora MPICH. A single-file reproducer, /tmp/reproducer7-alcf_kmd_val.tgz, is available for download from aurora-uan-0010.

@hzhou
Contributor

hzhou commented Jan 14, 2025

For reference, the relevant code in Fortran:

  call MPI_Win_Create(boxRegExpansion, buffer_size, dcmplx_size, MPI_INFO_NULL, MPI_COMM_RMGROUP, window, mpiError)
  call MPI_Win_fence(0, window, mpiError)
  nMessagesRank = 0
  nFence = 100
  if(iam == 0) write(*,'(a,i0,a/)') "Invoke fence after every ", nFence, " messages"
  do i = 1, nMessagesTotal
    if(srcRank(i) == rmRank) then
      ! catch errors in the input data
      if(srcAddress(i) >= 0 .and. dstAddress(i) >= 0 .and. &
         srcAddress(i)+dataSize(i)-1 < expansionSize .and. &
         dstAddress(i)+dataSize(i)-1 < bufferSize(dstRank(i)) ) then
        ! rmRank in the sub-communicator requests the data from the destination rank, dstRank(i)
        nElements  = dataSize(i)
        targetRank = dstRank(i)
        call MPI_Get(boxRegExpansion(srcAddress(i)), nElements, MPI_DOUBLE_COMPLEX, targetRank, &
                     dstAddress(i), nElements, MPI_DOUBLE_COMPLEX, window, mpiError)
        nMessagesRank = nMessagesRank + 1
      else
        write(*,'(a,i4,a,i8)') "Rank ", iam, " message ", i   ! corrupted input data
      endif
    endif
    if(mod(i,nFence) == 0) call MPI_Win_fence(0, window, mpiError)
    if(mod(i,nFence) == 0 .and. iam == 0) write(*,'(a,i12,a)') "Rank 0 conducted ", i, " one-sided messages"
  enddo
  !write(*,'(a,i10)') "rank after get before fence: ", iam
  !call flush(6)
  call MPI_Win_fence(0, window, mpiError)
  !write(*,'(a,i10)') "rank after fence before free: ", iam
  !call flush(6)
  call MPI_Win_free(window, mpiError)
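
For reference, the same fence/Get pattern can be approximated with a self-contained toy benchmark. This is only a sketch, not the reproducer itself: the message count, message size, and ring-style target pattern below are placeholders, it uses host buffers and MPI_COMM_WORLD instead of GPU buffers and the MPI_COMM_RMGROUP sub-communicator, and it keeps origin and target regions in disjoint halves of the window, whereas the real test drives its traffic from data.txt.

program rma_fence_sketch
  use mpi
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: nMessagesTotal = 100000   ! placeholder message count
  integer, parameter :: nElements      = 512      ! placeholder message size (double-complex elements)
  integer, parameter :: nFence         = 100      ! fence after every 100 Gets, as in the reproducer
  integer :: mpiError, iam, nRanks, window, i, targetRank, slot
  integer(kind=MPI_ADDRESS_KIND) :: winSizeBytes, targetDisp
  complex(dp), allocatable :: buf(:)
  double precision :: t0, t1

  call MPI_Init(mpiError)
  call MPI_Comm_rank(MPI_COMM_WORLD, iam, mpiError)
  call MPI_Comm_size(MPI_COMM_WORLD, nRanks, mpiError)

  ! First half of buf receives this rank's Gets; second half is what other ranks Get,
  ! so origin and target regions never overlap within a fence epoch.
  allocate(buf(2*nElements*nFence))
  buf = (1.0_dp, 0.0_dp)
  winSizeBytes = 16_MPI_ADDRESS_KIND * size(buf)   ! 16 bytes per double complex

  call MPI_Win_create(buf, winSizeBytes, 16, MPI_INFO_NULL, MPI_COMM_WORLD, window, mpiError)
  call MPI_Win_fence(0, window, mpiError)

  t0 = MPI_Wtime()
  do i = 1, nMessagesTotal
    targetRank = mod(iam + 1, nRanks)                               ! placeholder ring pattern
    slot = mod(i - 1, nFence)*nElements + 1                         ! distinct origin slice per Get in the epoch
    targetDisp = int(nElements*nFence + slot - 1, MPI_ADDRESS_KIND) ! element offset into target's second half
    call MPI_Get(buf(slot), nElements, MPI_DOUBLE_COMPLEX, targetRank, &
                 targetDisp, nElements, MPI_DOUBLE_COMPLEX, window, mpiError)
    if(mod(i, nFence) == 0) call MPI_Win_fence(0, window, mpiError)
  end do
  call MPI_Win_fence(0, window, mpiError)
  t1 = MPI_Wtime()

  if(iam == 0) write(*,'(a,f8.3,a)') "Get/fence loop time: ", t1 - t0, " s"

  call MPI_Win_free(window, mpiError)
  call MPI_Finalize(mpiError)
end program rma_fence_sketch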

@victor-anisimov
Author

I have run a single-node performance test of one-sided communication between two GPU pointers using the single-file reproducer (12 MPI ranks per node). With the default Aurora MPICH in the lustre_scaling queue, the test completes in 2.5 seconds; with MPICH 4.3.0rc2 in the alcf_kmd_val queue, it takes 2.9 seconds, which corresponds to a 16% slowdown. The data file (data.txt) used by the reproducer on a single node is attached to this message.
data.txt.gz

@victor-anisimov
Author

The same performance test run with host-to-host one-sided communication on a single node shows a 3x slowdown. A reproducer that can run either on the host or on the device is attached.
reproducer9-host-device.tgz
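
For readers without access to the tarball: the attached reproducer's host/device switch is not shown here, but one common way to place the window and Get buffers in GPU memory from Fortran is OpenMP offload with use_device_addr (OpenMP 5.0), assuming a GPU-aware MPI build and an offload-enabled compiler. The sketch below is only an illustration of that approach with placeholder sizes, not the attached code; dropping the OpenMP directives gives the host-only variant.

program rma_device_window_sketch
  use mpi
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: nElements = 512            ! placeholder message size (double-complex elements)
  integer :: mpiError, iam, nRanks, window, targetRank
  integer(kind=MPI_ADDRESS_KIND) :: winSizeBytes
  complex(dp), allocatable :: buf(:)

  call MPI_Init(mpiError)
  call MPI_Comm_rank(MPI_COMM_WORLD, iam, mpiError)
  call MPI_Comm_size(MPI_COMM_WORLD, nRanks, mpiError)

  allocate(buf(2*nElements))                       ! first half: local Get destination; second half: remotely read
  buf = (1.0_dp, 0.0_dp)
  winSizeBytes = 16_MPI_ADDRESS_KIND * size(buf)

  ! Copy the buffer to the device; remove the OpenMP directives for the host run.
  !$omp target enter data map(to: buf)

  ! Inside this region, buf refers to its device address, so the window and the
  ! Get origin live in GPU memory (requires GPU-aware MPI).
  !$omp target data use_device_addr(buf)
  call MPI_Win_create(buf, winSizeBytes, 16, MPI_INFO_NULL, MPI_COMM_WORLD, window, mpiError)
  call MPI_Win_fence(0, window, mpiError)

  targetRank = mod(iam + 1, nRanks)
  ! Read nElements from the target's second half into this rank's first half.
  call MPI_Get(buf(1), nElements, MPI_DOUBLE_COMPLEX, targetRank, &
               int(nElements, MPI_ADDRESS_KIND), nElements, MPI_DOUBLE_COMPLEX, window, mpiError)
  call MPI_Win_fence(0, window, mpiError)

  call MPI_Win_free(window, mpiError)
  !$omp end target data

  !$omp target exit data map(delete: buf)
  call MPI_Finalize(mpiError)
end program rma_device_window_sketch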
