broker: add support for PMIx bootstrap #3537

Merged
merged 5 commits into from
Mar 17, 2021

ggouaillardet
Contributor

PMIx support for bootstrapping Flux is enabled at build time by running
configure --with-pmix[=PATH]
When specified, PATH must contain the pmix.h header file.

PMIx is used at bootstrap by running
flux start --bootstrap=pmix

Limitations:

  • configure does not check that PATH contains the pmix.h header file
  • libpmix.so currently has to be in the LD_LIBRARY_PATH

Member

@garlick garlick left a comment

Hi @ggouaillardet - thanks for contributing to Flux and welcome!

A quick bit of early feedback since this seems to be a work in progress -

  • would it be possible to localize the PMIx specific changes to pmiutil.c rather than duplicating that code in pmixutil.c?
  • It seems like some of the code from PMIx's pmi1.c (e.g. I noticed an artpol comment) is duplicated in pmixutil.c. If we are pulling in code from there, should we just grab the whole thing and retain its copyright+license header (perhaps in src/common/libpmix as a vendored library)?
  • Note that when flux start starts flux, it provides the PMI environment. So it should not tell the broker to bootstrap with PMIx unless it can provide a PMIx environment.
  • We will want to be careful about passing PMIX_ environment variables through flux to a job that is launched by flux, so that it doesn't try to bootstrap in flux's job environment (although now that I think of it, we must be handling that now on our Sierra system - perhaps @SteVwonder can provide some feedback on that)
  • The autoconf integration probably needs some work, and my first inclination would be to point you to the s3 example right above the m4 that you added to configure.ac, but it might require a bit more thought.
  • We'll need to do something in CI. I know some of pmix is packaged in Ubuntu, but I'm not sure how old it is, or whether there is something like a standalone pmix launcher available that we could use to test-launch flux from a sharness test in our t directory.

A more general question to ask is why we can't use PMIx's pmiv1 library externally. I know we had a number of problems with it on our Sierra systems and were building it on the side for a while. I had naively thought this would be the preferred method to bootstrap flux with PMIx, but reality seems to disagree with me :-) Perhaps we could understand why it's not working for @ggouaillardet in his environment, and @SteVwonder could comment on the current state of affairs on Sierra?

@garlick
Member

garlick commented Feb 23, 2021

Note that when flux start starts flux, it provides the PMI environment. So it should not tell the broker to bootstrap with PMIx unless it can provide a PMIx environment.

I just realized that what you are probably doing there is just trying to get flux start to pass through your option to the broker. There is another way to do that: flux start -o,--pmix. However, we should probably try to figure out appropriate logic in pmiutil.c to select PMIx automatically if available so that we don't need a broker option.

know some of pmix is packaged in Ubuntu, but I'm not sure how old it is

On ubuntu 20.04 LTS, I ran sudo apt install libpmix-dev and got version 3.1.5-1.

@SteVwonder
Member

although now that I think of it, we must be handling that now on our Sierra system - perhaps @SteVwonder can provide some feedback on that)

That's a good point. We do strip these variables out at the job shell plugin level when you pass -o mpi=spectrum. Maybe we need to make that the behavior by default? https://github.com/flux-framework/flux-core/blob/master/src/shell/lua.d/spectrum.lua#L53
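
As a rough illustration of what stripping these variables involves, here is a minimal C sketch that removes every environment variable whose name starts with a given prefix; a launcher could call it with "PMIX_" before starting a job. The helper name unsetenv_prefix() is hypothetical and is not taken from flux-core or from the spectrum.lua plugin linked above. It only assumes POSIX unsetenv() and environ.

```c
#define _POSIX_C_SOURCE 200809L   /* for unsetenv() under strict -std modes */
#include <assert.h>
#include <stdlib.h>
#include <string.h>

extern char **environ;

/* Hypothetical helper: remove every environment variable whose name
 * begins with prefix.  Returns the number of variables removed,
 * or -1 on error (allocation failure or unsetenv() failure).
 */
int unsetenv_prefix (const char *prefix)
{
    size_t plen;
    int count = 0;
    int i = 0;

    assert (prefix != NULL && *prefix != '\0');
    plen = strlen (prefix);

    while (environ != NULL && environ[i] != NULL) {
        const char *entry = environ[i];
        const char *eq = strchr (entry, '=');
        size_t nlen = eq ? (size_t)(eq - entry) : 0;   /* length of NAME */

        if (nlen >= plen && strncmp (entry, prefix, plen) == 0) {
            /* Copy NAME out of "NAME=value" before touching environ */
            char *name = malloc (nlen + 1);
            int rc;

            if (name == NULL)
                return -1;
            memcpy (name, entry, nlen);
            name[nlen] = '\0';
            rc = unsetenv (name);
            free (name);
            if (rc < 0)
                return -1;
            count++;
            i = 0;          /* environ was modified; rescan from the start */
            continue;
        }
        i++;
    }
    return count;
}
```

Calling unsetenv_prefix ("PMIX_") in the child just before exec would leave the rest of the environment untouched. The rescan after each removal is deliberately conservative, since POSIX does not specify how environ is rearranged by unsetenv().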

@SteVwonder could comment on the current state of affairs on Sierra

Bootstrapping Flux on Summit, Lassen, and Sierra works currently as you describe - with an external shim. So this isn't a show stopper for LC users right now. The major downside of the shim is that the semantics of PMIx_Get do not perfectly align with PMI_KVS_Get such that we have to hardcode in flux-specific keys in the shim to get bootstrap to work: https://github.com/SteVwonder/pmi-shim/blob/pmi4flux/src/pmi1.c#L257. Using PMIx in Flux directly would eliminate the need for the shim (including the module load pmi-shim) and thus the hardcoded kludge.

There is also a limitation in the default OpenPMIx datastore that causes a hang when using the PMI1 shim (openpmix/pmi-shim#3). So we have to provide PMIX_MCA_gds="^ds12,ds21" at the command line when launching Flux with jsrun. Using PMIx directly would also eliminate this.

@garlick
Member

garlick commented Feb 23, 2021

Bootstrapping Flux on Summit, Lassen, and Sierra works currently as you describe - with an external shim. So this isn't a show stopper for LC users right now. The major downside of the shim is that the semantics of PMIx_Get do not perfectly align with PMI_KVS_Get such that we have to hardcode in flux-specific keys in the shim to get bootstrap to work: https://github.com/SteVwonder/pmi-shim/blob/pmi4flux/src/pmi1.c#L257. Using PMIx in Flux directly would eliminate the need for the shim (including the module load pmi-shim) and thus the hardcoded kludge.

FYI the key name hardwired in the shim changed as of f1b4cd5. It's just the rank now.

There is also a limitation in the default OpenPMIx datastore that causes a hang when using the PMI1 shim (openpmix/pmi-shim#3). So we have to provide PMIX_MCA_gds="^ds12,ds21" at the command line when launching Flux with jsrun. Using PMIx directly would also eliminate this.

I'm still slightly confused about why the pmi-shim (which now looks like the official repo for what used to be the compat libraries?) doesn't provide PMI-1 KVS key scope out of the box, but I guess it doesn't matter. This has been going on for years and is still hurting us, so we should probably go ahead with this PR to make everyone's life better.

@ggouaillardet
Contributor Author

Thanks for the feedback!

I pushed some more commits to address some of the comments

  • PMIx is now tried by default at configure time
  • PMIx is automatically selected at runtime based on PMIX related environment variables
  • PMIx environment variables are now blocked from the environment
  • the pmix bootstrap method has been removed (since PMIx is now automatically detected and used)
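
A minimal sketch of how such runtime detection could look in C. It assumes that a PMIx launcher exports PMIX_NAMESPACE and PMIX_RANK to its clients and that a PMI-1 wire-protocol launcher exports PMI_FD, PMI_RANK and PMI_SIZE. The enum and function names are hypothetical, and this is not the logic actually committed in pmiutil.c.

```c
#define _POSIX_C_SOURCE 200809L   /* lets callers use setenv()/unsetenv() */
#include <assert.h>
#include <stdlib.h>

/* Hypothetical bootstrap modes, for illustration only */
enum boot_mode {
    BOOT_MODE_SINGLETON,    /* no launcher detected: size=1 instance */
    BOOT_MODE_PMI1,         /* PMI-1 wire protocol environment found */
    BOOT_MODE_PMIX,         /* PMIx client environment found */
};

/* Return nonzero if the variable exists and is non-empty */
static int env_is_set (const char *name)
{
    const char *val;

    assert (name != NULL);
    val = getenv (name);
    return val != NULL && *val != '\0';
}

/* Prefer PMIx when its variables are present (assumed names),
 * then PMI-1, and fall back to a singleton otherwise.
 */
enum boot_mode select_boot_mode (void)
{
    if (env_is_set ("PMIX_NAMESPACE") && env_is_set ("PMIX_RANK"))
        return BOOT_MODE_PMIX;
    if (env_is_set ("PMI_FD") && env_is_set ("PMI_RANK")
        && env_is_set ("PMI_SIZE"))
        return BOOT_MODE_PMI1;
    return BOOT_MODE_SINGLETON;
}
```

With detection of this kind, no dedicated broker option is needed: the same flux start command line bootstraps with PMIx under a PMIx launcher and as a singleton when run by hand.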

The OpenPMIx master no longer provides libpmi1.so, so I am afraid that relying on such a library is not an option in the medium/long term (regardless of the old/pending issues).

The PMIx2PMI logic is indeed from OpenPMIx (v3 fwiw), and this has been discussed off-GitHub with @SteVwonder and @rhc54. Of course, I can add the copyright if needed.

@garlick
Member

garlick commented Feb 24, 2021

Thanks for those changes.

We're hitting the following in CI which probably is caused somehow by the cppflags set in configure when pmix is not found.

In file included from brokercfg.c:17:0:
  /usr/include/flux/core.h:1:2: error: #error Non-build-tree flux/core.h!
   #error Non-build-tree flux/core.h!
    ^~~~~
  brokercfg.c:21:10: fatal error: src/common/libutil/log.h: No such file or directory
   #include "src/common/libutil/log.h"
            ^~~~~~~~~~~~~~~~~~~~~~~~~~
  compilation terminated.

I'll play around with this a bit today.

@garlick garlick closed this Feb 24, 2021
@garlick garlick reopened this Feb 24, 2021
@garlick
Member

garlick commented Feb 24, 2021

Yikes! Pressed the close button by accident. Sorry about that!

I'll see if I can get the automake stuff straightened out today and post a suggested change for you here, since it will be nice to have CI running as this PR evolves.

If it's possible to make changes only to pmiutil.c and not boot_pmi.c, I think that would be preferable to the function pointer redirection to pmixutil.c.

@ggouaillardet
Contributor Author

@garlick thanks for the pointer!

I just pushed another commit to better handle pmix_cppflags.
Maybe automake did not like libbroker_la_CPPFLAGS being set inside an if HAVE_PMIX section.

@garlick
Member

garlick commented Feb 24, 2021

I think you are on the right track, but @pmix_cppflags@ seems to be defined as just "-I" when pmix is not available.

@ggouaillardet
Contributor Author

@garlick indeed, I overlooked the lack of --with-pmix[=PATH] option when revamping that part ...
now all tests but one (flux-sched related) are passing, will continue tomorrow.

@garlick
Member

garlick commented Feb 24, 2021

Good job! Yes - unfortunately the scheduler test failures don't seem to have left behind any clues in the raw log, unless I'm missing something. I suspect it may be only the tests that use test_under_flux that are failing. Will see if I can run it down on my test system.

There are a couple of things I might propose to change in the autotools support:

  • Substitute LIBPMIX_CFLAGS in the Makefile.am files to look more consistent with how we handle other libraries
  • Make pmix bootstrap "opt-in" with --enable-pmix-bootstrap, leaving --with-pmix= for setting the path to the header.
  • Possibly we may also want a way to set the path to the library? Not sure on that one.

@SteVwonder
Member

SteVwonder commented Feb 24, 2021

EDIT: I added the path to libpmix.so to my LD_LIBRARY_PATH as noted above and it worked under prrte. It is still failing on Lassen under JSM.

I compiled this on our Lassen system, and it just appears to hang.

FLUX_PMI_DEBUG=1 PMIX_MCA_gds="^ds12,ds21" jsrun -a 1 -c ALL_CPUS -g ALL_GPUS -n 1 --bind=none --smpiargs="-disable_gpu_hooks" ./src/cmd/flux start --trace-pmi-server -o-Slog-stderr-level=7 flux resource list

It doesn't hang under PRRTE (on an x86 machine) but it doesn't bootstrap properly either:

❯ FLUX_PMI_DEBUG=1 ~/opt/packages/toss3/prrte/master-2021-01-28/bin/prterun -n 2 ./src/cmd/flux start flux resource list
     STATE NNODES   NCORES    NGPUS NODELIST
     STATE NNODES   NCORES    NGPUS NODELIST
      free      1        1        0 quartz1538
      free      1        1        0 quartz1538
 allocated      0        0        0
 allocated      0        0        0
      down      0        0        0
      down      0        0        0

@ggouaillardet any suggestions on what I might be doing wrong?

Side-note: it might also be helpful if we enable the same debug tracing for PMIx that we have for PMI

@garlick
Member

garlick commented Feb 24, 2021

I compiled this on our Lassen system, and it just appears to hang.

Good data point. The code here always calls PMIx_Get() with proc.rank set to PMIX_RANK_WILDCARD, but the code you referenced in the shim sets it to the target rank. Are we hitting that bug you mentioned earlier?

Side-note: it might also be helpful if we enable the same debug tracing for PMIx that we have for PMI

I think if we push the pmix support into pmiutil.c then we will get the tracing for free. My idea there was that we would just add a new mode PMI_MODE_PMIX that is conditionally compiled, and probably just have it call PMIx_*() directly rather than go through the dlopen indirection, and see how that goes.

If we can do that, then we could change broker_pmi_kvs_get() to include the target rank, and then just ignore it for the other modes.
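
The idea of passing the target rank and ignoring it in the other modes can be sketched as follows. This is a self-contained mock, not flux-core code: the two stub_*_get() functions stand in for the real PMI-1 and PMIx lookups, and every name here (sketch_kvs_get(), the SKETCH_MODE_* values) is hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical modes, standing in for the ones in pmiutil.c */
enum sketch_mode { SKETCH_MODE_PMI1, SKETCH_MODE_PMIX };

/* Stand-in for a PMI-1 lookup: keys are global to the KVS,
 * so no rank is needed to find a value.
 */
static int stub_pmi1_get (const char *key, char *value, size_t len)
{
    int n = snprintf (value, len, "pmi1:%s", key);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Stand-in for a PMIx lookup: the value is requested from the
 * specific rank that published it, rather than a wildcard rank.
 */
static int stub_pmix_get (int from_rank, const char *key,
                          char *value, size_t len)
{
    int n = snprintf (value, len, "pmix:%d:%s", from_rank, key);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* One entry point for all modes: from_rank is consumed only by the
 * PMIx branch and deliberately ignored by the PMI-1 branch.
 */
int sketch_kvs_get (enum sketch_mode mode, int from_rank,
                    const char *key, char *value, size_t len)
{
    assert (key != NULL && value != NULL);

    switch (mode) {
        case SKETCH_MODE_PMI1:
            (void)from_rank;
            return stub_pmi1_get (key, value, len);
        case SKETCH_MODE_PMIX:
            return stub_pmix_get (from_rank, key, value, len);
    }
    return -1;
}
```

Keeping a single signature means the caller in the bootstrap code does not need to know which mode is active; it always supplies the rank whose key it wants.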

@ggouaillardet
Contributor Author

@SteVwonder quick note: you currently need to set FLUX_PMIX_DEBUG=1 (note the X) to enable debugging.

@ggouaillardet
Contributor Author

@SteVwonder second quick note: prte does not seem to propagate the environment ($PATH, $LD_LIBRARY_PATH nor FLUX_PMIX_DEBUG) by default.
Also, if you run two tasks, the default is to bind to core (bind to socket otherwise).

Here is my output

$ prun -n 2 --bind-to none env FLUX_PMIX_DEBUG=1 LD_LIBRARY_PATH=$LD_LIBRARY_PATH `which flux` start flux resource list
flux-broker: pmix-debug-dlopen: library name /home/usersup/gilles/local/openpmix/lib/libpmix.so
flux-broker: pmix-debug-dlopen: library name /home/usersup/gilles/local/openpmix/lib/libpmix.so
pmix-debug-dlopen[-1]: init = operation completed successfully
pmix-debug-dlopen[0]: get_params (rank=0 size=2 kvsname=prte-head-40446@36) = operation completed successfully
pmix-debug-dlopen[0]: kvs_get (kvsname=prte-head-40446@36 key=flux.instance-level value=<none>) = operation failed
pmix-debug-dlopen[0]: kvs_put (kvsname=prte-head-40446@36 key=0 value=N6#A]Xxc$alFHxP1%*J0Vjh$Zu=$yS5hb:e=Ih=D,tcp://[::ffff:172.19.60.201]:49152) = operation completed successfully
pmix-debug-dlopen[0]: kvs_commit (kvsname=prte-head-40446@36) = operation completed successfully
pmix-debug-dlopen[-1]: init = operation completed successfully
pmix-debug-dlopen[1]: get_params (rank=1 size=2 kvsname=prte-head-40446@36) = operation completed successfully
pmix-debug-dlopen[1]: kvs_get (kvsname=prte-head-40446@36 key=flux.instance-level value=<none>) = operation failed
pmix-debug-dlopen[1]: kvs_put (kvsname=prte-head-40446@36 key=1 value=jRzn>k#qC%!/e>8[-8A+)n[u#>w%p0>km=[u%B@.) = operation completed successfully
pmix-debug-dlopen[1]: kvs_commit (kvsname=prte-head-40446@36) = operation completed successfully
pmix-debug-dlopen[1]: barrier = operation completed successfully
pmix-debug-dlopen[1]: kvs_get (kvsname=prte-head-40446@36 key=0 value=N6#A]Xxc$alFHxP1%*J0Vjh$Zu=$yS5hb:e=Ih=D,tcp://[::ffff:172.19.60.201]:49152) = operation completed successfully
pmix-debug-dlopen[0]: barrier = operation completed successfully
pmix-debug-dlopen[0]: kvs_get (kvsname=prte-head-40446@36 key=1 value=jRzn>k#qC%!/e>8[-8A+)n[u#>w%p0>km=[u%B@.) = operation completed successfully
pmix-debug-dlopen[0]: barrier = operation completed successfully
pmix-debug-dlopen[1]: barrier = operation completed successfully
pmix-debug-dlopen[0]: finalize = operation completed successfully
pmix-debug-dlopen[1]: finalize = operation completed successfully
     STATE NNODES   NCORES    NGPUS NODELIST
      free      2       48        0 n01,n02
 allocated      0        0        0 
      down      0        0        0 

@rhc54

rhc54 commented Feb 25, 2021

prte does not seem to propagate the environment ($PATH, $LD_LIBRARY_PATH nor FLUX_PMIX_DEBUG) by default.

Correct - we chose to not do so many years ago because of the difficulty of knowing which envars can be safely propagated. Some environments allow it by configuration - I have no problem doing something similar if it is desirable. Just don't want to do it everywhere as that can get us into trouble.

@ggouaillardet
Contributor Author

@garlick I pushed some more commits to address the latest comments.

libpmix.so is still dlopen-ed but in pmiutil.c

@SteVwonder a consequence is the FLUX_PMIX_DEBUG environment variable is not used anymore (debug is set via the one and only FLUX_PMI_DEBUG environment variable)

@rhc54 at this stage, I do not think changes in prte are necessary.

@rhc54

rhc54 commented Feb 25, 2021

at this stage, I do not think changes in prte are necessary.

Got it. It might be nice to add support for flux in the opposite scenario, where prte executes inside of a flux environment. We'd need to know how to pick up the flux allocation and, if possible, use flux (something like the equivalent of srun as opposed to ssh) to start the daemons.

However, that's a different subject - if someone can point us to where we might find info on those matters (here or off-list), it would be appreciated.

@garlick
Member

garlick commented Feb 25, 2021

Nice progress! All the CI tests are passing now.

One issue I noticed when trying this out is that flux start fails with a PMI init error if openpmix is installed (to a /usr/local prefix in my test), but we are not in a pmix environment. The expected result is a size=1 flux instance (singleton).

If I turn on FLUX_PMI_DEBUG=1, it shows that we are finding a libpmi.so and then PMI_Init() fails.

It seems that openpmix installs a libpmi.so that fails PMI_Init() when it doesn't find a PMIX server. The slurm PMI_Init() assumes singleton when it doesn't find slurm. The code at that point doesn't know if it wants to be a singleton or if someone is running flux start as a parallel job, so it is a bit tricky. Possibly (with an opt-in PMIx configuration) we should just avoid dlopening libpmi.so if HAVE_PMIX is defined?

Other random stuff:

  • I think we might make our lives simpler for this PR if we don't dlopen() libpmix.so but instead just link the broker to it and invoke the PMIx functions directly. Maybe there is a hidden downside there, but @grondo and I couldn't think of one offhand.
  • I think we'll want to pass the target rank in with broker_pmi_kvs_get() as discussed above.
  • argonne process mapping key is not used by flux so the special case for that key can go away
  • configure should be opt-in for pmix not opt-out.

@rhc54

rhc54 commented Feb 25, 2021

It seems that openpmix installs a libpmi.so that fails PMI_Init() when it doesn't find a PMIX server.

Yeah, we removed that in v4.0 to avoid that very problem. Meantime, you can tell older versions not to build/install that lib by configuring with --disable-pmi-backward-compatibility

@garlick
Member

garlick commented Feb 25, 2021 via email

@garlick
Member

garlick commented Feb 25, 2021

Yeah, we removed that in v4.0 to avoid that very problem. Meantime, you can tell older versions not to build/install that lib by configuring with --disable-pmi-backward-compatibility

Maybe we can just assume that we won't see PMIx's libpmi.so in the wild then. (We can leave things as is for now, and work around later if we need to)

The current automake macro doesn't find pmix.h even when it is in the default include path. Doing some googling around (and recalling barely-remembered automake patterns), I think one normally would set CPPFLAGS on the configure command line if a header file is in a non-standard path, and use the normal header discovery macro.

Oh! I just noticed that pmix-3.2.3 comes with a pmix.pc file. That makes our life easier if we can just use that. The configure.ac snippet would boil down to something like this (tested):

AC_DEFUN([X_AC_PMIX], [
    AC_ARG_ENABLE([pmix-bootstrap],
        AS_HELP_STRING([--enable-pmix-bootstrap], [Enable PMIx bootstrap]))
    AS_IF([test "x$enable_pmix_bootstrap" = "xyes"], [
        PKG_CHECK_MODULES([PMIX], [pmix])
        AC_DEFINE([HAVE_LIBPMIX], [1], [Enable PMIx bootstrap])
    ])
])

Then if pmix is installed to a non-standard location, one just sets PKG_CONFIG_PATH to point to the location of pmix.pc. Will that work on the target platforms? I.e. do the vendors ship pmix.pc?

@rhc54

rhc54 commented Feb 25, 2021

do the vendors ship pmix.pc?

AFAIK, they all do - it was a downstream packager that contributed it

@ggouaillardet
Contributor Author

@garlick what is your configure command line and where is the "standard location" of your pmix.h?

The intent was not to use pmix (and to skip detection) unless --enable-pmix-bootstrap is passed on the configure command line (i.e. opt-in, at least according to the terminology I know). If --with-pmix=PATH is not specified, AC_SEARCH_HEADER will look for pmix.h in the standard location (plus the current value of CPPFLAGS, if any).

Meanwhile, I will see if I can spot an issue with the current logic.

@ggouaillardet
Contributor Author

@garlick there was indeed a bug (the lack of a --with-pmix option was not correctly handled), and I pushed a fix.

I have no issue with using PKG_CHECK_MODULES(). Note that it will define PMIX_CFLAGS and friends (so we would move away from the LIBPMIX_CFLAGS you previously requested).

The two relatively minor advantages of dlopen() vs linking with libpmix.so I can think of are :

  • reduced memory footprint (since libpmix.so is dlclose()d at the end of the bootstrap)
  • no hard dependency on libpmix.so. That can be helpful if flux is on a filesystem shared between different clusters,
    and libpmix.so is not present on the local filesystem of every cluster.

That's by no means a strong opinion, and I will be happy to replace dlopen() if/when a consensus is reached.

@garlick
Member

garlick commented Feb 26, 2021

Sorry @ggouaillardet for the delay responding! I had installed pmix to a prefix of /usr/local, so pmix.h was in /usr/local/include. I made sure that the .la files are not installed, and likewise libpmi.so and libpmi2.so.

If the pkgconfig files are on your system and @SteVwonder can confirm that this works for us on ours, then I prefer that approach since less M4 in the world makes the world better IMHO :-)

Your points about dlopen are good. I'm not sure what is the right approach. I think the main argument for not-dlopen is that it's simpler, and starting simple is good. I did play around with converting it just to see how it would look and the result is on this branch if you would like to have a look and/or try it and see if there are problems.

There is also a commit on there to illustrate what I think @SteVwonder needs for our sierra system to avoid the hangs.

(Feel free to ignore everything on that branch if it is not right - I was just playing around to understand better)

@garlick
Member

garlick commented Mar 9, 2021

I seem not to be allowed to push to your branch (not sure why - maybe because github considers force pushing to another's branch to be not nice?)

Anyway, my pmix_bootstrap branch is ready if you'd like to pull it down and force push it here.

Edit: @grondo pointed out that I might be pushing to the https://github.com remote and need to push to the [email protected] one. Yep, that was probably it. Anyway, I'll hold off retrying in case you're already working on it.

@garlick
Member

garlick commented Mar 10, 2021

Thanks. Just pushed a couple of fixups (to be squashed before we merge):

  • Drop convert_int() since we only have one value to convert, with one possible type. This may improve coverage a bit.
  • In broker_pmi_kvs_get(): tighten up some error handling:
    • drop the test for val==NULL on PMIX_SUCCESS (impossible?)
    • throw PMIX_ERROR if val->data.string == NULL
    • throw PMIX_ERR_INVALID_VAL_LENGTH if value string is too long to fit (with terminating \0) in value buffer.
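
The last two checks in that list amount to a bounded string copy. A minimal sketch, using local stand-in status codes rather than the real PMIx constants so that it compiles without pmix.h; copy_string_value() and the SKETCH_* names are hypothetical and do not reproduce the code in this PR.

```c
#include <assert.h>
#include <string.h>

/* Local stand-ins for PMIX_SUCCESS, PMIX_ERROR and
 * PMIX_ERR_INVALID_VAL_LENGTH (numeric values are arbitrary here).
 */
enum sketch_status {
    SKETCH_SUCCESS = 0,
    SKETCH_ERROR,
    SKETCH_ERR_INVALID_VAL_LENGTH,
};

/* Copy a returned string value into the caller's fixed-size buffer.
 * Fails if the source string is missing, or if it does not fit
 * together with its terminating '\0'.
 */
enum sketch_status copy_string_value (const char *src,
                                      char *dst, size_t dstlen)
{
    size_t srclen;

    assert (dst != NULL);

    if (src == NULL)
        return SKETCH_ERROR;
    srclen = strlen (src);
    if (srclen >= dstlen)               /* no room for the terminator */
        return SKETCH_ERR_INVALID_VAL_LENGTH;
    memcpy (dst, src, srclen + 1);      /* includes the '\0' */
    return SKETCH_SUCCESS;
}
```

Rejecting an oversized value outright, instead of truncating it, matters for bootstrap because a silently shortened endpoint URI or key would only fail later, and less legibly, when the broker tries to connect.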

Finally, I'm noticing a memory leak when I run the sharness test under valgrind. Not sure if this is an expected leak in libpmix or if we are doing something wrong (any thoughts @rhc54?)

==3765928== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==3765928== Using Valgrind-3.15.0 and LibVEX; rerun with -h for copyright info
==3765928== Command: /home/garlick/proj/flux-core/src/broker/.libs/flux-broker -Sbroker.rc1_path= -Sbroker.rc3_path= /bin/true
==3765928==
==3765928==
==3765928== HEAP SUMMARY:
==3765928==     in use at exit: 54,644 bytes in 142 blocks
==3765928==   total heap usage: 7,799 allocs, 7,657 frees, 4,439,274 bytes allocated
==3765928==
==3765928== 32 bytes in 1 blocks are definitely lost in loss record 2 of 27
==3765928==    at 0x483B7F3: malloc (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==3765928==    by 0x5B4A3EB: ??? (in /usr/lib/x86_64-linux-gnu/pmix/lib/libmca_common_dstore.so.1.0.2)
==3765928==    by 0x5B4C1C1: pmix_common_dstor_fetch (in /usr/lib/x86_64-linux-gnu/pmix/lib/libmca_common_dstore.so.1.0.2)
==3765928==    by 0x5B2278C: ??? (in /usr/lib/x86_64-linux-gnu/pmix/lib/pmix/mca_gds_ds21.so)
==3765928==    by 0x4ACE05A: ??? (in /usr/lib/x86_64-linux-gnu/pmix/lib/libpmix.so.2.2.25)
==3765928==    by 0x4ACFFE8: PMIx_Get_nb (in /usr/lib/x86_64-linux-gnu/pmix/lib/libpmix.so.2.2.25)
==3765928==    by 0x4AD0E4D: PMIx_Get (in /usr/lib/x86_64-linux-gnu/pmix/lib/libpmix.so.2.2.25)
==3765928==    by 0x125325: broker_pmi_get_params (pmiutil.c:421)
==3765928==    by 0x12344F: boot_pmi (boot_pmi.c:170)
==3765928==    by 0x113B5E: main (broker.c:370)
==3765928==

@rhc54

rhc54 commented Mar 10, 2021

Just glancing at the code, I see where you do the PMIx_Get to return the value and then you do indeed execute a PMIX_VALUE_RELEASE to free the memory. So it looks to me like this is a leak in the library's dstore.

I believe you are using the head of the OpenPMIx master branch? If so, we do need to leak-check it, as we have had significant changes in recent months - something likely slipped through the cracks.

@rhc54

rhc54 commented Mar 10, 2021

Actually, with dstore involved, this must be the v3.x or v4.0 branch - which are you using?

@garlick
Member

garlick commented Mar 10, 2021 via email

@garlick
Member

garlick commented Mar 10, 2021

The memory leak was observed with pmix-3.1.5.

@garlick
Member

garlick commented Mar 10, 2021

Pushed a couple more minor fixups after a final review pass. I think this is ready for testing on the big systems.

@SteVwonder
Member

@garlick: just a heads up that I'm pulling this down to test on Lassen now. Will report back with results.

@SteVwonder
Member

Looks to have worked on Lassen for both 2 nodes and 8 nodes. Thanks @ggouaillardet and @garlick!

2 nodes:

❯ FLUX_PMI_DEBUG=1 jsrun -a 1 -c ALL_CPUS -g ALL_GPUS -n 2 --bind=none --smpiargs="-disable_gpu_hooks" ./src/cmd/flux start flux getattr size
pmi-debug-pmix[-1]: init = operation completed successfully
pmi-debug-pmix[1]: get_params (rank=1 size=2 kvsname=7) = operation completed successfully
pmi-debug-pmix[1]: kvs_get (kvsname=7 key=flux.instance-level value=<none>) = operation failed
pmi-debug-pmix[1]: kvs_put (kvsname=7 key=1 value=vpHQ9RPOc.2$r!(A/Do:su}vTg!QJh*DU8^xfsK.) = operation completed successfully
pmi-debug-pmix[1]: kvs_commit (kvsname=7) = operation completed successfully
pmi-debug-pmix[-1]: init = operation completed successfully
pmi-debug-pmix[0]: get_params (rank=0 size=2 kvsname=7) = operation completed successfully
pmi-debug-pmix[0]: kvs_get (kvsname=7 key=flux.instance-level value=<none>) = operation failed
pmi-debug-pmix[0]: kvs_put (kvsname=7 key=0 value=yldT$mlx@Po!h@f/]JISl(<[oHZ8kVS#sQm}6Xjm,tcp://[::ffff:192.168.128.14]:49152) = operation completed successfully
pmi-debug-pmix[0]: kvs_commit (kvsname=7) = operation completed successfully
pmi-debug-pmix[0]: barrier = operation completed successfully
pmi-debug-pmix[0]: kvs_get (kvsname=7 key=1 value=vpHQ9RPOc.2$r!(A/Do:su}vTg!QJh*DU8^xfsK.) = operation completed successfully
pmi-debug-pmix[1]: barrier = operation completed successfully
pmi-debug-pmix[1]: kvs_get (kvsname=7 key=0 value=yldT$mlx@Po!h@f/]JISl(<[oHZ8kVS#sQm}6Xjm,tcp://[::ffff:192.168.128.14]:49152) = operation completed successfully
pmi-debug-pmix[0]: barrier = operation completed successfully
pmi-debug-pmix[1]: barrier = operation completed successfully
pmi-debug-pmix[0]: finalize = operation completed successfully
pmi-debug-pmix[1]: finalize = operation completed successfully
2

8 nodes:

❯ FLUX_PMI_DEBUG=1 jsrun -a 1 -c ALL_CPUS -g ALL_GPUS -n 8 --bind=none --smpiargs="-disable_gpu_hooks" ./src/cmd/flux start flux getattr size
pmi-debug-pmix[-1]: init = operation completed successfully
pmi-debug-pmix[0]: get_params (rank=0 size=8 kvsname=3) = operation completed successfully
pmi-debug-pmix[0]: kvs_get (kvsname=3 key=flux.instance-level value=<none>) = operation failed
pmi-debug-pmix[0]: kvs_put (kvsname=3 key=0 value=gOyrDBeDeXKA)d=BKS3MI^t[D3%MosX{dh[G@6=H,tcp://[::ffff:192.168.128.4]:49152) = operation completed successfully
pmi-debug-pmix[0]: kvs_commit (kvsname=3) = operation completed successfully
pmi-debug-pmix[-1]: init = operation completed successfully
pmi-debug-pmix[2]: get_params (rank=2 size=8 kvsname=3) = operation completed successfully
pmi-debug-pmix[2]: kvs_get (kvsname=3 key=flux.instance-level value=<none>) = operation failed
pmi-debug-pmix[2]: kvs_put (kvsname=3 key=2 value=?kGM}}GOmj.PWmv/<N:i4Y:ueM-bbAe%dwMy4-iX,tcp://[::ffff:192.168.128.17]:49152) = operation completed successfully
pmi-debug-pmix[2]: kvs_commit (kvsname=3) = operation completed successfully
pmi-debug-pmix[-1]: init = operation completed successfully
pmi-debug-pmix[5]: get_params (rank=5 size=8 kvsname=3) = operation completed successfully
pmi-debug-pmix[5]: kvs_get (kvsname=3 key=flux.instance-level value=<none>) = operation failed
pmi-debug-pmix[5]: kvs_put (kvsname=3 key=5 value=^/E/Mu^]DQff#*PLE)pMPKtVR8h%vs8A?c64@kCR) = operation completed successfully
pmi-debug-pmix[5]: kvs_commit (kvsname=3) = operation completed successfully
pmi-debug-pmix[-1]: init = operation completed successfully
pmi-debug-pmix[4]: get_params (rank=4 size=8 kvsname=3) = operation completed successfully
pmi-debug-pmix[4]: kvs_get (kvsname=3 key=flux.instance-level value=<none>) = operation failed
pmi-debug-pmix[4]: kvs_put (kvsname=3 key=4 value=k6&IIH}8GOd%Al8LF:HWp@TY/a5B<#)Q5H9l]{Iy) = operation completed successfully
pmi-debug-pmix[-1]: init = operation completed successfully
pmi-debug-pmix[6]: get_params (rank=6 size=8 kvsname=3) = operation completed successfully
pmi-debug-pmix[4]: kvs_commit (kvsname=3) = operation completed successfully
pmi-debug-pmix[-1]: init = operation completed successfully
pmi-debug-pmix[3]: get_params (rank=3 size=8 kvsname=3) = operation completed successfully
pmi-debug-pmix[3]: kvs_get (kvsname=3 key=flux.instance-level value=<none>) = operation failed
pmi-debug-pmix[6]: kvs_get (kvsname=3 key=flux.instance-level value=<none>) = operation failed
pmi-debug-pmix[6]: kvs_put (kvsname=3 key=6 value=].RkChk*%xSmC}xEl)NKzV=JS<]b#VsbV@-!pq4x) = operation completed successfully
pmi-debug-pmix[6]: kvs_commit (kvsname=3) = operation completed successfully
pmi-debug-pmix[-1]: init = operation completed successfully
<snip>
pmi-debug-pmix[7]: kvs_get (kvsname=3 key=3 value=bxrWxa8p)U9<j{[SHU5c[2{43[8>@kGn5(xswexN,tcp://[::ffff:192.168.128.18]:49152) = operation completed successfully
pmi-debug-pmix[5]: barrier = operation completed successfully
pmi-debug-pmix[5]: kvs_get (kvsname=3 key=2 value=?kGM}}GOmj.PWmv/<N:i4Y:ueM-bbAe%dwMy4-iX,tcp://[::ffff:192.168.128.17]:49152) = operation completed successfully
pmi-debug-pmix[0]: barrier = operation completed successfully
pmi-debug-pmix[2]: barrier = operation completed successfully
pmi-debug-pmix[4]: barrier = operation completed successfully
pmi-debug-pmix[6]: barrier = operation completed successfully
pmi-debug-pmix[1]: barrier = operation completed successfully
pmi-debug-pmix[3]: barrier = operation completed successfully
pmi-debug-pmix[7]: barrier = operation completed successfully
pmi-debug-pmix[5]: barrier = operation completed successfully
pmi-debug-pmix[0]: finalize = operation completed successfully
pmi-debug-pmix[2]: finalize = operation completed successfully
pmi-debug-pmix[4]: finalize = operation completed successfully
pmi-debug-pmix[6]: finalize = operation completed successfully
pmi-debug-pmix[1]: finalize = operation completed successfully
pmi-debug-pmix[3]: finalize = operation completed successfully
pmi-debug-pmix[7]: finalize = operation completed successfully
pmi-debug-pmix[5]: finalize = operation completed successfully
8

For future me / @dongahn, when compiling this on Lassen you need a pmix.pc in your PKG_CONFIG_PATH. IBM doesn't seem to be producing a .pc file for their pmix in JSM, so DEG (specifically John G) will be patching the RPM that LLNL creates to generate a "synthetic" .pc file. In the meantime, the following contents worked for me:

prefix=/usr/tce/packages/spectrum-mpi/ibm/spectrum-mpi-rolling-release
exec_prefix=${prefix}
libdir=${exec_prefix}/lib
includedir=${prefix}/include

Name: pmix
Description: Process Management Interface for Exascale (PMIx)
Version: 4.1.0a1
URL: https://pmix.org/
Libs: -L${libdir} -lpmix
Cflags: -I${includedir}

I put that in a file called pmix.pc, put that file in an otherwise empty directory, then added that directory to my PKG_CONFIG_PATH when configuring flux-core:

PKG_CONFIG_PATH=$PKG_CONFIG_PATH:$HOME/opt/packages/toss3/spectrum-mpi/rolling-release/lib/pkgconfig/ ./configure --enable-pmix-bootstrap
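The manual steps above can be scripted. This is only a sketch: `PMIX_PREFIX` defaults to the prefix quoted above, `PC_DIR` is a hypothetical scratch location, and both variable names are made up for illustration. The pkg-config query runs only if pkg-config is installed.

```shell
#!/bin/sh
# Sketch: generate a stand-in pmix.pc and expose it via PKG_CONFIG_PATH.
# PMIX_PREFIX and PC_DIR are illustrative names; override them as needed.
PMIX_PREFIX="${PMIX_PREFIX:-/usr/tce/packages/spectrum-mpi/ibm/spectrum-mpi-rolling-release}"
PC_DIR="${PC_DIR:-$(mktemp -d)}"

mkdir -p "$PC_DIR"
# The prefix line is expanded by the shell; the rest is written literally
# (quoted heredoc delimiter) because pkg-config expands ${prefix} itself.
printf 'prefix=%s\n' "$PMIX_PREFIX" > "$PC_DIR/pmix.pc"
cat >> "$PC_DIR/pmix.pc" <<'EOF'
exec_prefix=${prefix}
libdir=${exec_prefix}/lib
includedir=${prefix}/include

Name: pmix
Description: Process Management Interface for Exascale (PMIx)
Version: 4.1.0a1
URL: https://pmix.org/
Libs: -L${libdir} -lpmix
Cflags: -I${includedir}
EOF

# Prepend the directory so this pmix.pc is found first.
PKG_CONFIG_PATH="$PC_DIR${PKG_CONFIG_PATH:+:$PKG_CONFIG_PATH}"
export PKG_CONFIG_PATH

# Sanity check, only if pkg-config is available.
if command -v pkg-config >/dev/null 2>&1; then
    pkg-config --cflags --libs pmix || echo "pkg-config could not resolve pmix" >&2
fi
```

With `PKG_CONFIG_PATH` exported this way, `./configure --enable-pmix-bootstrap` can be run from the same shell.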

@garlick
Member

garlick commented Mar 15, 2021

Looks to have worked on Lassen for both 2 nodes and 8 nodes.

Excellent! Thank you for doing that. Would you mind adding an approving review? Then we can just set merge-when-passing once @ggouaillardet says we're good to go.

Member

@SteVwonder SteVwonder left a comment

LGTM! Thanks for the test coverage!

Just one optional comment below:

@@ -57,6 +62,12 @@ static int cmd_version (optparse_t *p, int ac, char *av[])
#endif
#if HAVE_CALIPER
printf ("+caliper");
#endif
#if HAVE_LIBPMIX
printf ("+pmix==%ld.%ld.%ld",
Member

Do we want to be explicit that this is the version of the client-side that we are using? I'm afraid users might think this is the version of pmix that Flux provides as a server.

Maybe +pmix -> +pmix-bootstrap?

Member

That works for me if it helps you!

ggouaillardet and others added 4 commits March 16, 2021 06:43
Problem: on some platforms PMIx is the preferred mechanism
to use for bootstrapping Flux.

Prepare to add PMIx support to the Flux broker by adding
an "opt in" configure option: --enable-pmix-bootstrap.
If specified, pkg-config is used to locate a suitable pmix
package.  Configure fails if pmix is requested but not found.

Add PMIX_LIBS and PMIX_CFLAGS to the broker Makefile.am.

Co-authored-by: Jim Garlick <[email protected]>
Problem: on some platforms PMIx is the preferred mechanism
to use for bootstrapping Flux.

Add support to the broker's pmiutil.c to use PMIx if the PMIx
server environment variables are set.

For now, keep the PMIx integration as simple as possible, and use
the PMIx_*() functions directly.  We can consider other options
such as indirection through dlopen() later, if we run into problems.

This implementation was guided by the PMI-1 compatibility code here:
  https://github.com/openpmix/pmi-shim

Since Flux does not require all of PMI-1, our code is much simpler.
In addition, some PMIx differences from PMI-1 with respect to key scope
could be dealt with directly, compared to the shim:

- add a 'from_rank' to broker_pmi_kvs_get() so that PMIx_Get() can set
  proc.rank to this instead of PMIX_RANK_UNDEF.  This avoids a hang
  with the dstore gds component, as described in openpmix/pmi-shim#3

- if 'from_rank' is set to -1, then set proc.rank to PMIX_RANK_UNDEF,
  and set the PMIX_OPTIONAL attribute to 1 so PMIx_Get() fails immediately
  if the key is not set.  This is used when the broker tries to fetch the
  'flux.instance-level' key, which the flux shell places in the KVS,
  and is not expected to exist when Flux is launched by a foreign resource
  manager.  Note to future implementor of flux shell PMIx plugin (flux-framework#3536):
  this assumes that 'flux.instance-level' would be set using
  PMIx_server_register_nspace() or equivalent, which would push the key
  to the client at initialization.

Add some PMIX well known environment variables to the blocklist in runat.c,
so they do not propagate to the initial program when Flux is launched by
a PMIx process manager.

Co-authored-by: Jim Garlick <[email protected]>
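To illustrate the last point of the commit message above: the blocklist means PMIx client variables set by the launcher are absent from the initial program's environment. The shell sketch below shows the same idea but is not the runat.c implementation. It clears every variable whose name starts with `PMIX_`, which is broader than a fixed blocklist, and the helper name `run_without_pmix` is made up for illustration.

```shell
#!/bin/sh
# Run a command with all PMIX_* variables removed from its environment.
run_without_pmix() (
    # The function body is a subshell, so the unsets do not affect the caller.
    for name in $(env | sed -n 's/^\(PMIX_[A-Za-z0-9_]*\)=.*/\1/p'); do
        unset "$name"
    done
    exec "$@"
)

# Example: pretend a PMIx launcher set these (the values are made up).
PMIX_RANK=0
PMIX_NAMESPACE=demo-ns
export PMIX_RANK PMIX_NAMESPACE

# Count PMIX_* variables visible to the child; grep -c exits nonzero
# when the count is zero, hence the "|| true".
run_without_pmix env | grep -c '^PMIX_' || true
```

The caller's own `PMIX_RANK` and `PMIX_NAMESPACE` stay set; only the child started through the helper sees the scrubbed environment.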
Problem: tests have no way to determine whether or not Flux was
built with --enable-pmix-bootstrap.

Add the pmix version to the output of 'flux version'.
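A test prerequisite built on this could look roughly like the sketch below. It assumes the build-option token begins with `+pmix`, which covers both the `+pmix==X.Y.Z` form in the diff and the `+pmix-bootstrap` rename suggested in review. The helper name `has_pmix_bootstrap` and the sample line are hypothetical.

```shell
#!/bin/sh
# Succeeds if the version text on stdin advertises PMIx bootstrap support.
# A fixed-string match on "+pmix" covers "+pmix==..." and "+pmix-bootstrap==...".
has_pmix_bootstrap() {
    grep -qF -e '+pmix'
}

# Made-up sample line; in a real test, pipe `flux version` into the helper.
sample='build-options: +hwloc==2.1.0 +pmix-bootstrap==3.2.3'
if printf '%s\n' "$sample" | has_pmix_bootstrap; then
    echo "pmix bootstrap: yes"
else
    echo "pmix bootstrap: no"
fi
```

In a sharness script, the same helper could gate the whole file, for example by setting `skip_all` when `flux version | has_pmix_bootstrap` fails.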
@garlick
Member

garlick commented Mar 16, 2021

I'll shortly force push a copy of this branch, rebased on current master, with a few minor fixups including the one requested by @SteVwonder.

@ggouaillardet
Contributor Author

I put that in a file called pmix.pc, put that file in an otherwise empty directory, then added that directory to my PKG_CONFIG_PATH when configuring flux-core: PKG_CONFIG_PATH=$PKG_CONFIG_PATH:$HOME/opt/packages/toss3/spectrum-mpi/rolling-release/lib/pkgconfig/ ./configure --enable-pmix-bootstrap

FWIW, an alternative to using an ad hoc pmix.pc is to do something like:

configure --enable-pmix-bootstrap PMIX_CFLAGS=-I/opt/openpmix/include PMIX_LIBS="-L/opt/openpmix/lib -lpmix"

@garlick
Member

garlick commented Mar 16, 2021

FWIW, an alternative to using an ad hoc pmix.pc is to do something like:

I had assumed the package check would fail without pmix.pc, but I just verified that you are right! For example, this works on Ubuntu 20.04, where pmix.pc is missing:

./configure --enable-pmix-bootstrap PMIX_CFLAGS=-I/usr/lib/x86_64-linux-gnu/pmix/include PMIX_LIBS="-L/usr/lib/x86_64-linux-gnu/pmix/lib -lpmix"
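Because setting `PMIX_CFLAGS` and `PMIX_LIBS` bypasses the pkg-config lookup, a wrong prefix only shows up later at compile time. A small pre-check can catch that earlier. The sketch below assumes the conventional `<prefix>/include` and `<prefix>/lib` layout; the helper name `pmix_configure_args` is made up, and the demo runs against a scratch tree rather than a real PMIx install.

```shell
#!/bin/sh
# Print configure arguments (one per line) for a PMIx install rooted at $1.
# Fails early if pmix.h is not where the include path will point.
pmix_configure_args() {
    prefix=$1
    if [ ! -f "$prefix/include/pmix.h" ]; then
        echo "pmix.h not found under $prefix/include" >&2
        return 1
    fi
    printf '%s\n' \
        "--enable-pmix-bootstrap" \
        "PMIX_CFLAGS=-I$prefix/include" \
        "PMIX_LIBS=-L$prefix/lib -lpmix"
}

# Demo against a scratch tree standing in for a real install.
demo=$(mktemp -d)
mkdir -p "$demo/include" "$demo/lib"
: > "$demo/include/pmix.h"
pmix_configure_args "$demo"
```

The printed values correspond to the arguments used in the command above; quote `PMIX_LIBS` as a single argument when passing it to `./configure`, since it contains a space.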

@garlick
Member

garlick commented Mar 16, 2021

@ggouaillardet - I think this can be merged if you can confirm it works for you.

Problem: need to test --enable-pmix-bootstrap in CI and obtain
code coverage report for related code.

Add --enable-pmix-bootstrap to bionic/coverage in the CI build
matrix generator.

Add openpmix and prrte to bionic docker image:
 - prrte: export a recent commit from the prrte git repo, since the most
   recent release (1.0.0) doesn't include prterun, which is needed by
   our pmix sharness test.
 - openpmix: export a recent git commit from the openpmix git repo,
   needed to compile prrte above as it won't work with openpmix-3.2.3.
 - add build requirements of flex and libevent-dev
 - add prrte runtime requirement of ssh
@garlick
Member

garlick commented Mar 16, 2021

I force pushed one more time to correct a problem in the build matrix that was preventing bionic results from being posted.

@codecov

codecov bot commented Mar 16, 2021

Codecov Report

Merging #3537 (17cf1ef) into master (428762d) will decrease coverage by 0.14%.
The diff coverage is 98.30%.

@@            Coverage Diff             @@
##           master    #3537      +/-   ##
==========================================
- Coverage   82.55%   82.40%   -0.15%     
==========================================
  Files         323      322       -1     
  Lines       48981    48829     -152     
==========================================
- Hits        40434    40239     -195     
- Misses       8547     8590      +43     
Impacted Files Coverage Δ
src/broker/boot_pmi.c 64.49% <ø> (ø)
src/broker/runat.c 84.58% <ø> (ø)
src/broker/pmiutil.c 77.43% <98.27%> (+10.56%) ⬆️
src/cmd/builtin/version.c 90.47% <100.00%> (-0.44%) ⬇️
src/common/libutil/setenvf.c 0.00% <0.00%> (-100.00%) ⬇️
src/modules/job-exec/bulk-exec.c 60.26% <0.00%> (-16.94%) ⬇️
src/modules/job-exec/exec.c 71.34% <0.00%> (-6.71%) ⬇️
src/common/libflux/handle.c 84.03% <0.00%> (-2.05%) ⬇️
src/modules/job-ingest/job-ingest.c 73.25% <0.00%> (-1.07%) ⬇️
src/modules/job-exec/job-exec.c 74.79% <0.00%> (-1.04%) ⬇️
... and 15 more

@ggouaillardet
Contributor Author

@garlick I am happy to confirm this PR works for me!

@garlick
Member

garlick commented Mar 17, 2021

Excellent! Thanks!

@mergify mergify bot merged commit ce879ba into flux-framework:master Mar 17, 2021