PLM:SLURM doesn't work right for HPE Slingshot when VNI enabled #2004

Open
hppritcha opened this issue Aug 30, 2024 · 7 comments
@hppritcha
Contributor

The Slurm PLM component adds --mpi=none to the srun command used to launch the prted daemons.

On HPE Slingshot 11 networks where VNI credentials are enforced, this effectively results in a launch failure for multi-node jobs.

Running with FI_LOG_LEVEL=debug shows a characteristic signature for this:

Request dest_addr: 32 caddr.nic: 0X19D1 caddr.pid: 1 rxc_id: 0  error: 0x26c450f0 (err: 5, VNI_NOT_FOUND)

This addition to the srun command-line options for the prted launch needs to be suppressed on systems using HPE Slingshot.
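
For reference, one way to see what the PLM is putting on the srun line, assuming a PRRTE-based mpirun that accepts --prtemca, is to raise the PLM verbosity; that should report the constructed srun command. The task count and launched binary below are just placeholders:

```sh
# Sketch: raise PLM verbosity so the constructed srun command (including
# --mpi=none) is reported; "hostname" and "-np 2" are placeholders.
mpirun --prtemca plm_base_verbose 10 -np 2 hostname
```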

@hppritcha hppritcha self-assigned this Aug 30, 2024
@rhc54
Contributor

rhc54 commented Aug 30, 2024

Problematic: the issue here is that specifying an MPI for srun will automatically make Slurm think that the daemons are MPI procs, which has implications for how they are run. What "mpi" option are you thinking of trying?

The bottom line is that the VNI allocation system is broken for indirect launch; I've been hearing that from other libraries. The only thing I can come up with is to find a non-srun solution, though I'm open to hearing how to get around it.

@hppritcha
Contributor Author

> Problematic: the issue here is that specifying an MPI for srun will automatically make Slurm think that the daemons are MPI procs, which has implications for how they are run. What "mpi" option are you thinking of trying?

Just not specifying anything about mpi.

I plan to open a PR to not insert this option into the srun cmd line.

An easy workaround for a user who finds this problematic will be to set

SLURM_MPI_TYPE=none

in their shell before using mpirun.
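
For example (the application and task count here are just placeholders):

```sh
# Workaround sketch: select Slurm's "none" MPI plugin for the srun that
# launches the prted daemons, then run mpirun as usual.
export SLURM_MPI_TYPE=none
mpirun -np 4 ./my_app
```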

@rhc54
Contributor

rhc54 commented Aug 30, 2024

Ah, but it is necessary to have that option in non-HPE systems, especially when they set a default MPI type. You could wind up breaking all the non-HPE installations, and the HPE installations that have disabled VNI. Requiring everyone in those situations (which greatly outnumber those with Slingshot) to set a fix seems backwards to me. Perhaps finding a more generalized solution might be best?
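
(For reference, whether a cluster sets a default MPI plugin, which is the case where dropping the option would change behavior, can usually be checked with something like the following.)

```sh
# Show the cluster's configured default MPI plugin, e.g. "pmix" or "none".
scontrol show config | grep -i MpiDefault
```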

Also, remember that Slurm now injects its own command-line options, so we need to figure out a solution that accounts for that as well.

@rhc54
Contributor

rhc54 commented Aug 31, 2024

Looking back, it appears we may have had to add this option to avoid having the daemon automatically bound, which then forced the procs it started to share that binding. Other options could probably be used for that purpose. However, there may be additional reasons why we added it, so some further investigation may be needed to be sure we don't cause problems.

The real issue isn't caused by the VNI itself - that's just an integer that is easily generated. The problem is the requirement that the VNI be "loaded" into CXI with elevated privilege; the PRRTE daemon isn't running with that privilege and thus is blocked from doing so.

One solution is to create a setuid script that takes only one argument (the VNI) and executes the required operation at the CXI user's level. You might check and see if anyone has an issue with that, and what can be done to minimize any concerns. Ultimately, that's probably the correct solution - if one can make it acceptable.
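
A rough sketch of the shape such a wrapper could take is below. The actual privileged CXI operation is site- and library-specific, so it is left as a commented placeholder, and the helper name used there is purely hypothetical.

```sh
#!/bin/sh
# Hypothetical single-argument wrapper: validate the VNI and hand it to the
# privileged operation that loads it into CXI. Installing it with elevated
# privilege (a setuid-style helper or a sudo rule) is a deployment decision.
set -eu

if [ "$#" -ne 1 ]; then
    echo "usage: $0 <vni>" >&2
    exit 1
fi

vni="$1"

# Accept only a plain non-negative integer to keep the privileged surface small.
case "$vni" in
    ''|*[!0-9]*) echo "error: VNI must be a non-negative integer" >&2; exit 1 ;;
esac

# Placeholder for the privileged CXI load of the VNI (hypothetical helper name).
# cxi_load_vni "$vni"
echo "would load VNI $vni into CXI here"
```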

@naughtont3
Contributor

@hppritcha what version of SLURM are you using on this machine that experiences the issue?

@wickberg
Contributor

wickberg commented Sep 5, 2024

This came up on the PMIx call today, and I'm a bit lost as to how --mpi=none in the Slurm PLM might be improving anything. The switch plugin in Slurm will kick in regardless, and should be setting up the VNIs.

@rhc54
Contributor

rhc54 commented Oct 18, 2024

@hppritcha I'm guessing that this resolved itself - perhaps some odd situation that generated a foobar result? In the absence of any further input, I'll just close this issue as "not reproducible", so please let us know if this is real.
