enable servers to use pmi to get rank and count #626
Maybe we could check whether the hostfile is set at runtime, and use it if so, even if PMI2/PMIX is selected? Similarly for the key/value store: we could check the SHAREDFS path, use the file system if it is defined, and otherwise use PMI if PMI is enabled?
094652e to 463eb7a
One problem with exchanging server addresses through PMI2: at least one version of SLURM PMI2 seems to use
leads to an error like:
As a fix, I put in a hack to convert any
f113726 to 50a3089
I'm finding some strange behavior in testing, so I'm putting this back on WIP while I try to figure it out.
Setting the following does fix the problem I was chasing:
Hooray for @CamStan! Under Totalview, I could see that some data structure had a NULL pointer. Since I'm messing with the order of startup calls, I thought I had messed up some
Oh nice! I thought I'd tried this in attempt to resolve the issue I was seeing, but I'll try it again and see if it helps there as well.
Seems like margo uses They use a default of
Confirmed that this works on both lassen and quartz with the new setting, but it fails with a segfault in both cases without it. The segfaults come from different parts of the code.
I don't know what I did when I first tried fixing my problem by setting
Margo sets the ABT stack size to be 2MB by default. PR #659 changes things to let margo call ABT_init for us, in which case, its setting of 2MB is used. With that, things work for me without having to set
For the servers to acquire their rank and the number of servers in the job, one must currently create a hostfile. This is automatic when using the unifyfs utility to launch the servers, but it is a bit cumbersome and error-prone when launching the servers directly, which is helpful at times. And if one forgets to define a hostfile, all servers start up assuming they are rank 0 in a one-process job. That situation can be confusing to debug.
This PR enables the servers to use PMI2/PMIX to acquire the number of servers and their rank within the set. If PMI is enabled and if UNIFYFS_SERVER_HOSTFILE is not set, then the servers use PMI to get the rank and server count. If PMI is not enabled or if UNIFYFS_SERVER_HOSTFILE is defined, then the servers use the host file method.
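The selection rule described above can be sketched as a small C function. This is only an illustration of the decision, not the code in this PR: the names `rank_source_t`, `RANK_SOURCE_HOSTFILE`, `RANK_SOURCE_PMI`, and `choose_rank_source` are hypothetical, and the sketch assumes that "not set" means the environment lookup for UNIFYFS_SERVER_HOSTFILE returned NULL.

```c
#include <stddef.h>

/* Hypothetical names for illustration; not taken from the UnifyFS source. */
typedef enum {
    RANK_SOURCE_HOSTFILE, /* read rank and server count from the hostfile */
    RANK_SOURCE_PMI       /* query PMI2/PMIX for rank and server count */
} rank_source_t;

/*
 * Decide where a server should get its rank and the server count.
 *
 * pmi_enabled: nonzero if the server was built with PMI2/PMIX support.
 * hostfile:    value of UNIFYFS_SERVER_HOSTFILE, or NULL if it is not set
 *              (assumption: only NULL counts as "not set").
 */
rank_source_t choose_rank_source(int pmi_enabled, const char* hostfile)
{
    /* PMI is used only when it is enabled AND no hostfile was given. */
    if (pmi_enabled && hostfile == NULL) {
        return RANK_SOURCE_PMI;
    }

    /* PMI disabled, or a hostfile is defined: use the hostfile method. */
    return RANK_SOURCE_HOSTFILE;
}
```

A caller could pass the result of `getenv("UNIFYFS_SERVER_HOSTFILE")` as the second argument, so that an explicitly defined hostfile always takes precedence over PMI.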
When PMI is available, this simplifies the task of launching the servers manually through the job launcher as the hostfile can be avoided. In particular, the above simplifies to just: