
check_socket_compatibility fails with srun on more than one processor #46

Closed · Labels: bug
ltimmerman3 opened this issue Aug 22, 2024 · 11 comments

@ltimmerman3 (Collaborator)

Describe the bug
The socket-compatibility check currently runs srun path/to/sparc/executable without the -name input, which prints usage output to stdout with or without the -socket flag. This fails when the run command launches more than one processor: only the 0th task exits, while all the others hang.

To Reproduce

  • SSH into a Slurm-managed environment
  • set ASE_SPARC_COMMAND to srun path/to/sparc
  • run with socket mode = True (Fig 5 b run configuration) on a node with more than one processor requested

Expected behavior
All processes should exit

Actual output or error trace
Only task 0 exits

This can be handled by enforcing srun -n 1 path/to/sparc as the run command for the compatibility check. We need to decide how to implement it. Simplest option: if "srun" appears in the command, rewrite it to srun -n 1, as sketched below.
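A minimal sketch of that rewrite (the helper name and regex are illustrative, not existing API):

```python
# Hypothetical helper: pin the compatibility check to a single task when
# the user-supplied command is srun-based. Not the actual SPARC-X-API
# implementation, just the rewrite proposed above.
import re

def single_task_command(command: str) -> str:
    """Rewrite 'srun ... path/to/sparc' to use exactly one task."""
    if "srun" not in command:
        return command
    # Drop any existing -n/--ntasks flag, then pin -n 1 right after srun.
    command = re.sub(r"\s(-n|--ntasks)(=|\s+)\S+", " ", command)
    return command.replace("srun", "srun -n 1", 1)
```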

ltimmerman3 added the bug label on Aug 22, 2024
@alchem0x2A (Collaborator) commented Aug 23, 2024

That's indeed one issue with srun (as opposed to mpirun): terminating srun processes requires using Slurm to terminate the job step.

The termination procedure is already implemented in def _send_mpi_signal(self, sig), but there could be more happening in the actual srun hierarchy; I'll take a look.

We could also implement closing the socket upon receiving the EXIT message on the C-SPARC side. This may actually be the safer approach, since enumerating all possible combinations of MPI/Slurm is tedious.
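For reference, a minimal sketch of what emitting such an EXIT message from the Python side could look like, assuming the 12-byte padded ASCII headers of the i-PI protocol (the helper name is illustrative):

```python
# Hypothetical sketch: ask the C-SPARC server to shut down by sending
# the i-PI-style "EXIT" header over an already-connected socket.
import socket

def send_exit(sock: socket.socket) -> None:
    # i-PI protocol headers are fixed-width 12-byte ASCII strings.
    sock.sendall("EXIT".ljust(12).encode("ascii"))
```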

To test

  • Check whether the EXIT message is implemented in the C-SPARC socket code
  • Test mpirun & srun on multiple processors
  • Test srun on multiple nodes

@alchem0x2A (Collaborator)

@ltimmerman3 It seems my previous response was too complicated.

I may have been mistaken, but it seems the following setup works for me with normal srun settings; could you check?

  • On Phoenix
    CASE 1:
    • SPARC compiled with the following modules: module load gcc mvapich2 openblas fftw netlib-scalapack
    • compilation command: make USE_MKL=0 USE_FFTW=1 USE_SCALAPACK=1 USE_SOCKET=1
    • ASE_SPARC_COMMAND="srun -n 24 --export=ALL path/to/sparc"
    • python -c "from sparc.calculator import SPARC; print(SPARC(use_socket=True).detect_socket_compatibility())"

Since the detect-compatibility function only executes whatever sparc command is available on the system, without an actual -name suffix (https://github.com/alchem0x2A/SPARC-X-API/blob/9136ce832cbfc2fb6519751409721036a9bcacc2/sparc/calculator.py#L967), the subprocess should return regardless of whether any socket communication is started. I'm curious how to reproduce the scenario you've observed.
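Roughly, that check boils down to something like the following simplified sketch (not the exact code in calculator.py; the function body and timeout here are illustrative):

```python
# Simplified sketch of the compatibility probe discussed above: run the
# bare sparc command (no -name suffix) and look for the -socket flag in
# its usage output.
import shlex
import subprocess

def detect_socket_compatibility(command: str) -> bool:
    proc = subprocess.run(
        shlex.split(command),
        capture_output=True,
        text=True,
        timeout=60,  # illustrative guard against the hang reported here
    )
    # A socket-enabled binary advertises -socket in its help/usage text.
    return "-socket" in (proc.stdout + proc.stderr)
```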

@ltimmerman3 (Collaborator, Author)

@alchem0x2A I ran with the exact settings you provided and it returned just fine. However, recompiling with make USE_MKL=1 USE_FFTW=0 USE_SCALAPACK=0 USE_SOCKET=1 and ml intel-oneapi-compilers intel-oneapi-mpi intel-oneapi-mkl allowed me to reproduce the error: srun hangs whenever the run command includes srun -n with more than one task.

@alchem0x2A (Collaborator)

You're right! This issue stems from the check_inputs function in SPARC's initialization.c (https://github.com/SPARC-X/SPARC/blob/ef868ee6143bad3da9fd84aacb56981ce9ea3801/src/initialization.c#L484), where the check is handled on rank 0 and the exit signal is emitted on a single rank. That explains the different behavior on different MPI platforms.

I can confirm this behavior also exists for the non-socket code: simply running srun -n N sparc without any command suffixes causes the job to hang on Intel MPI (only rank 0 exits).
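Here is a minimal mpi4py illustration of that failure mode (hypothetical, not SPARC code; run with e.g. srun -n 4 python repro.py):

```python
# Only rank 0 exits; the remaining ranks block in the next collective.
import sys
from mpi4py import MPI

comm = MPI.COMM_WORLD

if comm.Get_rank() == 0:
    # Mimics check_inputs: rank 0 detects the bad input and calls exit().
    sys.exit(1)

# Ranks != 0 wait in a collective that rank 0 never reaches; on MPI
# launchers that don't propagate the exit, they hang here forever.
# Calling comm.Abort(1) instead of sys.exit(1) would tear down all ranks.
comm.Barrier()
```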

I believe the ultimate solution is to always use MPI_Abort instead of exit in the original SPARC code, but that could be a lot of work. A safer guardrail on the user side is the --kill-on-bad-exit option of srun (https://slurm.schedmd.com/srun.html#OPT_kill-on-bad-exit); one sample ASE_SPARC_COMMAND would thus be

export ASE_SPARC_COMMAND="srun -n 24 --export=ALL -K sparc"

I've tested that this works on both MVAPICH2 and Intel MPI; could you take a look? If the -K option covers most cases, I'd advise leaving the setting to the users, since parsing ASE_SPARC_COMMAND may not be very straightforward.

@ltimmerman3 (Collaborator, Author)

My recommendation would be a hybrid approach. Rather than relying on the user to figure this out or to set ASE_SPARC_COMMAND, we could generate a sparc_env.sh file at runtime or at sparc-x-api install time that detects the environment (Slurm or not) and writes the appropriate commands, or at least the appropriate ASE_SPARC_COMMAND; see the sketch below.
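A rough sketch of what such a generator could look like (the helper name, file name, and mpirun fallback are illustrative assumptions, not existing API):

```python
# Hypothetical sketch of the suggested environment-detection helper;
# not part of the current SPARC-X-API.
import os
import shutil

def write_sparc_env(sparc_exe="sparc", path="sparc_env.sh"):
    """Write a shell snippet exporting a scheduler-appropriate command."""
    if shutil.which("srun") or "SLURM_JOB_ID" in os.environ:
        # Slurm: add -K (--kill-on-bad-exit) so a failing rank kills the step.
        cmd = f"srun --export=ALL -K {sparc_exe}"
    else:
        cmd = f"mpirun {sparc_exe}"
    with open(path, "w") as f:
        f.write(f'export ASE_SPARC_COMMAND="{cmd}"\n')
```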

@alchem0x2A (Collaborator)

@ltimmerman3 Thanks for the suggestions. ASE 3.23 introduces a profile system, which is super cool: https://wiki.fysik.dtu.dk/ase/ase/calculators/calculators.html#calculator-configuration. That's close to the approach you're describing, and we could show a default template clearly in our docs. In that case, ASE_SPARC_COMMAND, SPARC_DOC_PATH, etc. could be saved on a per-file basis. I'll bring it up in another issue.

For now, let's add a warning in the docs telling users to add the -K option; eventually, in the v1.1 release (which will fully support the ASE 3.23 style), we should move to the profile system.
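For reference, such a profile would live in an ASE config file like ~/.config/ase/config.ini; the [sparc] section and key names below are assumptions for illustration, not a finalized schema:

```ini
; Hypothetical sketch of an ASE 3.23-style config entry for SPARC.
[sparc]
command = srun -n 24 --export=ALL -K /path/to/sparc
psp_path = /path/to/sparc/psps
doc_path = /path/to/SPARC/doc/.LaTeX
```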

@ltimmerman3 (Collaborator, Author) commented Sep 19, 2024 via email

@ltimmerman3 (Collaborator, Author)

@alchem0x2A Quick follow-up. I just ran a multi-node job (2 nodes, 48 processors) for initial testing with PLUMED. The actual DFT run time was 4 seconds, but the job took over an hour. According to the job accounting info, over an hour was spent in a "failed" sparc call (which I assume corresponds to the compatibility check). So although the job eventually ran with the -K option, I don't think this is a viable solution. I "hacked" the calculator file to enforce the -n 1 flag in the detect-compatibility function, and the code executed as expected. If the update to ASE 3.23 is imminent, then maybe this isn't a priority, but it's something to be aware of.

@alchem0x2A (Collaborator)

Thanks for the updates. It could make sense that the -K switch relies solely on the Slurm scheduler to kill the extra processes. It would make more sense to remove the compatibility check from the actual socket-calculation path and reformat the sparc.quicktest module for more robust checks. Let's keep this issue open until #48 is done.

@ltimmerman3 (Collaborator, Author) commented Oct 21, 2024

Updated SPARC's check_inputs function, which rectifies this issue; the fix still needs to be included in SPARC itself.
