check_socket_compatibility fails with srun on more than one processor #46
Comments
SPARC-X-API/sparc/calculator.py Line 797 in c419ce2
To test
|
@ltimmerman3 It seems my previous response was too complicated. I may be mistaken, but the following setup works for me with normal srun settings. Could you check?
since the detect compatibility function only executes whatever sparc command is available to the system without an actual |
@alchem0x2A I ran with the exact settings you provided and it returned just fine. However, recompiling with |
You're right! This issue stems from the [...]. I confirm this behavior also exists for non-socket code; simply run [...]. I believe the ultimate solution is to always implement [...]
I've tested that it works on both MVAPICH2 and Intel MPI. Could you take a look? If the |
My recommendation would be to take a hybrid approach. Perhaps rather than relying on the user to figure this out or set the ASE_SPARC_COMMAND, I think we could generate a sparc_env.sh file at runtime or sparc-x-api install time that detects the environment (slurm or not slurm) and writes the appropriate commands, or at least the appropriate ASE_SPARC_COMMAND. |
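A rough sketch of the environment-detection idea suggested above, for discussion only. The helper names (`suggest_sparc_command`, `write_env_script`), the bare `sparc` executable name, and the `mpirun` fallback are assumptions made for illustration and are not part of SPARC-X-API. The only facts relied on are that Slurm sets `SLURM_JOB_ID` inside a job and that the calculator reads `ASE_SPARC_COMMAND`.

```python
import os


def suggest_sparc_command(env=None, executable="sparc"):
    """Pick a launcher prefix depending on whether we are inside a Slurm job.

    `executable` and the `mpirun` fallback are illustrative assumptions.
    """
    env = os.environ if env is None else env
    # Slurm exports SLURM_JOB_ID to every process started within a job.
    if "SLURM_JOB_ID" in env:
        return f"srun {executable}"
    return f"mpirun {executable}"


def write_env_script(path="sparc_env.sh", env=None):
    """Write a one-line shell script exporting ASE_SPARC_COMMAND (hypothetical helper)."""
    line = f'export ASE_SPARC_COMMAND="{suggest_sparc_command(env)}"\n'
    with open(path, "w") as fh:
        fh.write(line)
    return line
```

The user would then `source sparc_env.sh` before running, and could still override the variable by hand when the detected default is wrong for their cluster.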
@ltimmerman3 Thx for the suggestions. ASE 3.23 introduces a profile system which is super cool https://wiki.fysik.dtu.dk/ase/ase/calculators/calculators.html#calculator-configuration. That's closer to the approach you're mentioning, and we could clearly show a default template in our doc. In this case ASE_SPARC_COMMAND, SPARC_DOC_PATH etc. could be saved on a per-file basis. I'll bring it up in another issue.
For now let's add a warning in the doc for the user to add the -K option, but eventually in the v1.1 release (which will fully support the ASE 3.23 style) we should move to the profile system. |
@alchem0x2A Quick follow up. I just ran a multi-node job (2 nodes, 48 processors) for initial testing with PLUMED. The actual run time for DFT was 4 seconds, but the job took over an hour. According to the job accounting info, over an hour of time was spent in a "failed" sparc call (which I assume corresponds to the check compatibility call). As such, despite the fact the job eventually ran with the -K option, I don't think this is a viable solution. I "hacked" the calculator file to enforce the -n 1 flag in the detect compatibility function and the code executed as expected. If the update to ase 3.23 is imminent then maybe it's not a priority but something to be aware of. |
Thx for the updates. It could make sense that the -K switch relies solely on the Slurm scheduler to kill extra processes. It would make more sense to remove the check compatibility code from the actual socket calculation part and reformat the |
Updated SPARC check_inputs which rectifies this issue. Needs to be included in SPARC. |
Describe the bug
Checking for socket compatibility currently requires running
srun path/to/sparc/executable
without the --name input which invokes stdout with/without -socket. This fails when the run command invokes more than one processor, as only the 0th task exits while all others hang.
To Reproduce
Expected behavior
All processes should exit
Actual output or error trace
Only task 0 exits
This can be handled by enforcing
srun -n 1 path/to/sparc
as the run command for the compatibility check. Need to decide how to implement. Simplest: check if "srun" is in the command, then edit the command to be srun -n 1.
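The "simplest" option could look roughly like the sketch below. `force_single_task` is a hypothetical helper, not an existing function in `sparc/calculator.py`. It assumes the launcher is the first token of the command and that task counts are given with `-n` or `--ntasks`; any other srun options are passed through unchanged.

```python
import os
import shlex


def force_single_task(command: str) -> str:
    """Return `command` with srun limited to one task.

    Commands whose launcher is not srun are returned unchanged.
    """
    tokens = shlex.split(command)
    if not tokens or os.path.basename(tokens[0]) != "srun":
        return command
    cleaned = []
    skip_next = False
    for tok in tokens[1:]:
        if skip_next:
            # This token is the value that followed a bare -n / --ntasks.
            skip_next = False
            continue
        if tok in ("-n", "--ntasks"):
            skip_next = True
            continue
        # Attached forms such as --ntasks=48 or -n48.
        if tok.startswith("--ntasks=") or (tok.startswith("-n") and tok[2:].isdigit()):
            continue
        cleaned.append(tok)
    return shlex.join([tokens[0], "-n", "1", *cleaned])


print(force_single_task("srun -n 48 path/to/sparc"))
```

The sketch does not distinguish srun options from arguments to the executable, so a program argument spelled exactly `-n` would also be stripped. That seems acceptable for the compatibility check, which runs the executable without arguments.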