Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make allgather signature consistent for server_connect_fn() #1381

Closed

Conversation

wzamazon
Copy link
Contributor

pmix_server_connect_fn() call grpcomm.allgather(), which
uses a signature to identify the action.

pmix_server_connect_fn() uses "procs" (which is an array
of pmix_proc_t) as signature.

procs was provided by client when it called PMIx_connect().

The issue is that different client may have procs in different
order (when inter MPI communicator trascation is used).
As a result, pmix servers may end up with different signature
for the same allgather operation.

This patch addressed the issue by sorting "procs" by
pmix_proc_t before passing it to grpcomm.allgather.

Signed-off-by: Wei Zhang [email protected]

pmix_server_connect_fn() call grpcomm.allgather(), which
uses a signature to identify the action.

pmix_server_connect_fn() uses "procs" (which is an array
of pmix_proc_t) as signature.

procs was provided by client when it called PMIx_connect().

The issue is that different client may have procs in different
order (when inter MPI communicator trascation is used).
As a result, pmix servers may end up with different signature
for the same allgather operation.

This patch addressed the issue by sorting "procs" by
pmix_proc_t before passing it to grpcomm.allgather.

Signed-off-by: Wei Zhang <[email protected]>
@wzamazon
Copy link
Contributor Author

wzamazon commented Jul 11, 2022

@rhc54 Can you take a look?

This will fix inter communicator test hang in Open MPI main branch.

open-mpi/ompi#8958

@rhc54
Copy link
Contributor

rhc54 commented Jul 11, 2022

Hmmm... this is a bit of a problem. PMIx states that the proc array is order sensitive, which is why we treat it that way here. This breaks that correlation. Might solve OMPI's issue but create problems for apps and libraries that follow the standard.

Perhaps a better solution would be to do the sort in OMPI prior to calling connect?

@wzamazon
Copy link
Contributor Author

wzamazon commented Jul 11, 2022

PMIx states that the proc array is order sensitive,

Thank you!

Can you point me to the document? I tried to find document regarding procs ordering, but could not find it.

Perhaps a better solution would be to do the sort in OMPI prior to calling connect?

Yes. I can definitely modify OMPI for this.

@jjhursey
Copy link
Member

We discussed this briefly in the PMIx Standard IAWG today (July 11) and created an issue to track the clarification:

We plan to bring this sup for discussion on the PMIx Standard Monthly meeting this Thursday, July 14.

@rhc54
Copy link
Contributor

rhc54 commented Jul 11, 2022

I'm on the road today but will post the link later. For now please modify the OMPI side

@wzamazon wzamazon closed this Jul 11, 2022
@wzamazon
Copy link
Contributor Author

For now please modify the OMPI side

will do. Thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants