
check_socket_compatibility fails with srun on more than one processor #46

Closed · Labels: bug
ltimmerman3 opened this issue Aug 22, 2024 · 11 comments

@ltimmerman3 (Collaborator)

Describe the bug
The socket-compatibility check currently runs srun path/to/sparc/executable without the -name input, which prints usage output to stdout with or without the -socket flag. This fails when the run command launches more than one processor: only the 0th task exits, while all the others hang.

To Reproduce

  • SSH into a Slurm-managed environment
  • set ASE_SPARC_COMMAND to srun path/to/sparc
  • run with socket mode = True (Fig 5 b run configuration) on a node with more than one processor requested

Expected behavior
All processes should exit

Actual output or error trace
Only task 0 exits

This can be handled by enforcing srun -n 1 path/to/sparc as the run command for the compatibility check. We need to decide how to implement it. Simplest option: if "srun" appears in the command, rewrite it to srun -n 1, as sketched below.
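A minimal sketch of that rewrite (the helper name and regex are illustrative, not existing API):

```python
# Hypothetical helper: pin the compatibility check to a single task when
# the user-supplied command is srun-based. Not the actual SPARC-X-API
# implementation, just the rewrite proposed above.
import re

def single_task_command(command: str) -> str:
    """Rewrite 'srun ... path/to/sparc' to use exactly one task."""
    if "srun" not in command:
        return command
    # Drop any existing -n/--ntasks flag, then pin -n 1 right after srun.
    command = re.sub(r"\s(-n|--ntasks)(=|\s+)\S+", " ", command)
    return command.replace("srun", "srun -n 1", 1)
```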

ltimmerman3 added the bug label on Aug 22, 2024
@alchem0x2A (Collaborator) commented Aug 23, 2024

That's indeed one issue with srun (as opposed to mpirun): terminating srun processes requires using Slurm to terminate the job step.

The termination procedure is already implemented in def _send_mpi_signal(self, sig), but there could be more happening in the actual srun hierarchy; I'll take a look.

We could also implement closing the socket upon receiving the EXIT message on the C-SPARC side. This may actually be the safer approach, since enumerating all possible combinations of MPI/Slurm is tedious.
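For reference, a minimal sketch of what emitting such an EXIT message from the Python side could look like, assuming the 12-byte padded ASCII headers of the i-PI protocol (the helper name is illustrative):

```python
# Hypothetical sketch: ask the C-SPARC server to shut down by sending
# the i-PI-style "EXIT" header over an already-connected socket.
import socket

def send_exit(sock: socket.socket) -> None:
    # i-PI protocol headers are fixed-width 12-byte ASCII strings.
    sock.sendall("EXIT".ljust(12).encode("ascii"))
```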

To test

  • Check whether the EXIT message is implemented in the C-SPARC socket code
  • Test mpirun & srun on multiple processors
  • Test srun on multiple nodes

@alchem0x2A (Collaborator)

@ltimmerman3 It seems my previous response was too complicated.

I may have been mistaken, but it seems the following setup works for me with normal srun settings; could you check?

  • On Phoenix
    CASE 1:
    • SPARC compiled with the following modules: module load gcc mvapich2 openblas fftw netlib-scalapack
    • compilation command: make USE_MKL=0 USE_FFTW=1 USE_SCALAPACK=1 USE_SOCKET=1
    • ASE_SPARC_COMMAND="srun -n 24 --export=ALL path/to/sparc"
    • python -c "from sparc.calculator import SPARC; print(SPARC(use_socket=True).detect_socket_compatibility())"

Since the detect-compatibility function only executes whatever sparc command is available on the system, without an actual -name suffix (https://github.com/alchem0x2A/SPARC-X-API/blob/9136ce832cbfc2fb6519751409721036a9bcacc2/sparc/calculator.py#L967), the subprocess should return regardless of whether any socket communication is started. I'm curious how to reproduce the scenario you've observed.
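Roughly, that check boils down to something like the following simplified sketch (not the exact code in calculator.py; the function body and timeout here are illustrative):

```python
# Simplified sketch of the compatibility probe discussed above: run the
# bare sparc command (no -name suffix) and look for the -socket flag in
# its usage output.
import shlex
import subprocess

def detect_socket_compatibility(command: str) -> bool:
    proc = subprocess.run(
        shlex.split(command),
        capture_output=True,
        text=True,
        timeout=60,  # illustrative guard against the hang reported here
    )
    # A socket-enabled binary advertises -socket in its help/usage text.
    return "-socket" in (proc.stdout + proc.stderr)
```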

@ltimmerman3 (Collaborator, Author)

@alchem0x2A I ran with the exact settings you provided and it returned just fine. However, recompiling with make USE_MKL=1 USE_FFTW=0 USE_SCALAPACK=0 USE_SOCKET=1 and ml intel-oneapi-compilers intel-oneapi-mpi intel-oneapi-mkl allowed me to reproduce the error: srun hangs whenever the run command includes srun -n with more than one task.

@alchem0x2A (Collaborator)

You're right! This issue stems from the check_inputs function in SPARC's initialization.c (https://github.com/SPARC-X/SPARC/blob/ef868ee6143bad3da9fd84aacb56981ce9ea3801/src/initialization.c#L484), where the check is handled on rank 0 and the exit signal is emitted on a single rank. That explains the different behavior on different MPI platforms.

I can confirm this behavior also exists for the non-socket code: simply running srun -n N sparc without any command suffixes causes the job to hang on Intel MPI (only rank 0 exits).
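Here is a minimal mpi4py illustration of that failure mode (hypothetical, not SPARC code; run with e.g. srun -n 4 python repro.py):

```python
# Only rank 0 exits; the remaining ranks block in the next collective.
import sys
from mpi4py import MPI

comm = MPI.COMM_WORLD

if comm.Get_rank() == 0:
    # Mimics check_inputs: rank 0 detects the bad input and calls exit().
    sys.exit(1)

# Ranks != 0 wait in a collective that rank 0 never reaches; on MPI
# launchers that don't propagate the exit, they hang here forever.
# Calling comm.Abort(1) instead of sys.exit(1) would tear down all ranks.
comm.Barrier()
```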

I believe the ultimate solution is to always use MPI_Abort instead of exit in the original SPARC code, but that could be a lot of work. A safer guardrail on the user side is the --kill-on-bad-exit option of srun (https://slurm.schedmd.com/srun.html#OPT_kill-on-bad-exit); one sample ASE_SPARC_COMMAND would thus be

export ASE_SPARC_COMMAND="srun -n 24 --export=ALL -K sparc"

I've tested that this works on both MVAPICH2 and Intel MPI; could you take a look? If the -K option covers most cases, I'd advise leaving the setting to the users, since parsing ASE_SPARC_COMMAND may not be very straightforward.

@ltimmerman3 (Collaborator, Author)

My recommendation would be a hybrid approach. Rather than relying on the user to figure this out or to set ASE_SPARC_COMMAND, we could generate a sparc_env.sh file at runtime or at sparc-x-api install time that detects the environment (Slurm or not) and writes the appropriate commands, or at least the appropriate ASE_SPARC_COMMAND; see the sketch below.
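A rough sketch of what such a generator could look like (the helper name, file name, and mpirun fallback are illustrative assumptions, not existing API):

```python
# Hypothetical sketch of the suggested environment-detection helper;
# not part of the current SPARC-X-API.
import os
import shutil

def write_sparc_env(sparc_exe="sparc", path="sparc_env.sh"):
    """Write a shell snippet exporting a scheduler-appropriate command."""
    if shutil.which("srun") or "SLURM_JOB_ID" in os.environ:
        # Slurm: add -K (--kill-on-bad-exit) so a failing rank kills the step.
        cmd = f"srun --export=ALL -K {sparc_exe}"
    else:
        cmd = f"mpirun {sparc_exe}"
    with open(path, "w") as f:
        f.write(f'export ASE_SPARC_COMMAND="{cmd}"\n')
```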

@alchem0x2A (Collaborator)

@ltimmerman3 Thanks for the suggestions. ASE 3.23 introduces a profile system, which is super cool: https://wiki.fysik.dtu.dk/ase/ase/calculators/calculators.html#calculator-configuration. That's close to the approach you're describing, and we could show a default template clearly in our docs. In that case, ASE_SPARC_COMMAND, SPARC_DOC_PATH, etc. could be saved on a per-file basis. I'll bring it up in another issue.

For now, let's add a warning in the docs telling users to add the -K option; eventually, in the v1.1 release (which will fully support the ASE 3.23 style), we should move to the profile system.
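For reference, such a profile would live in an ASE config file like ~/.config/ase/config.ini; the [sparc] section and key names below are assumptions for illustration, not a finalized schema:

```ini
; Hypothetical sketch of an ASE 3.23-style config entry for SPARC.
[sparc]
command = srun -n 24 --export=ALL -K /path/to/sparc
psp_path = /path/to/sparc/psps
doc_path = /path/to/SPARC/doc/.LaTeX
```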

@ltimmerman3 (Collaborator, Author) commented Sep 19, 2024 via email

@ltimmerman3 (Collaborator, Author)

@alchem0x2A Quick follow-up. I just ran a multi-node job (2 nodes, 48 processors) for initial testing with PLUMED. The actual DFT run time was 4 seconds, but the job took over an hour. According to the job accounting info, over an hour was spent in a "failed" sparc call (which I assume corresponds to the compatibility check). So although the job eventually ran with the -K option, I don't think this is a viable solution. I "hacked" the calculator file to enforce the -n 1 flag in the detect-compatibility function, and the code executed as expected. If the update to ASE 3.23 is imminent, then maybe this isn't a priority, but it's something to be aware of.

@alchem0x2A (Collaborator)

Thanks for the updates. It could make sense that the -K switch relies solely on the Slurm scheduler to kill the extra processes. It would make more sense to remove the compatibility check from the actual socket-calculation path and reformat the sparc.quicktest module for more robust checks. Let's keep this issue open until #48 is done.

@ltimmerman3 (Collaborator, Author) commented Oct 21, 2024

Updated SPARC's check_inputs function, which rectifies this issue; the fix still needs to be included in SPARC itself.
