Change the file check in the VASP terminate function (also possibly the fallback method) #264
Conversation
…ssing the OUTCAR file to determine the process to kill. Fixed an indentation in the fallback method.
Thanks for checking and bug fixing, @fyalcin! I just checked the VASP source code, and it seems that the last file that is accessed is the `OUTCAR`. As for getting rid of the `killall` fallback, I would love the opinion of @arosen93, @janosh and @matthewkuner on this.
Thanks for fixing this important bug. The VASP termination is a bit beyond my paygrade, so I don't have much input to provide, but I look forward to the fix being merged. I'm tagging @munrojm here in case he has any objections/comments, as he often runs large "jobpacking" runs that I want to make sure still work appropriately (due to changes made over these several PRs). Edit: Linking to #217.
@fyalcin Wow, I'm thoroughly impressed both by your in-depth troubleshooting and the excellent write-up of results. I've run into the `VasprunXMLValidator` error myself. On the question of removing the `killall` fallback, there are a few things I'd want to know first.
I realize none of these are easy to measure. Maybe we can come up with decent proxies/estimates.
@fyalcin I had not considered the potential issue with calling `killall`. I agree with the questions from @janosh that I think we'd need to consider before removing the backup method. Finally, really cool work on determining which output file should be monitored!
…s terminated, we can use a flag to make sure to not terminate the job twice with the job.terminate() call after `if has_error`.
…ocess has terminated, we can use a flag to make sure to not terminate the job twice with the job.terminate() call after `if has_error`." This reverts commit 46e3650.
Thank you for your kind words, everyone :) After changing the file checked from `CHGCAR` to `OUTCAR`, I noticed that these lines

custodian/custodian/custodian.py, lines 470 to 471 (commit f7dc11a)

custodian/custodian/custodian.py, lines 473 to 474 (commit f7dc11a)

still end up calling `job.terminate()`, even though VASP is already terminated naturally.

I just committed a new check (46e3650) to avoid calling `job.terminate()` a second time once the process has already terminated:

custodian/custodian/custodian.py, lines 494 to 496 (commit f7dc11a)

custodian/custodian/custodian.py, lines 473 to 474 (commit f7dc11a)

I think @matthewkuner's idea of checking whether a VASP job is actually running before any terminate call is worth pursuing. In the meantime, I will test the newest implementation on some of my old jobs that failed with `VasprunXMLValidator`.
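For readers following along, here is a minimal, self-contained sketch of the "don't terminate twice" guard described in the commit message above. It is only an illustration of the idea; the names (`FakeJob`, `run_with_guard`, the `errors` list standing in for the monitor polling loop) are made up and are not custodian's actual API.

```python
# Minimal sketch of a guard flag that prevents a second job.terminate() call
# after the monitor loop has already terminated the job. Illustrative only.

class FakeJob:
    def __init__(self):
        self.terminate_calls = 0

    def terminate(self):
        self.terminate_calls += 1
        print(f"terminate() called ({self.terminate_calls}x)")


def run_with_guard(job, errors):
    """Terminate at most once, even when an error is flagged again later."""
    has_error = False
    terminated = False  # the guard flag

    for step_has_error in errors:  # stand-in for the monitor polling loop
        if step_has_error:
            job.terminate()
            terminated = True
            has_error = True
            break

    if has_error and not terminated:
        # Skipped here because the monitor loop already terminated the job.
        job.terminate()
    return job.terminate_calls


if __name__ == "__main__":
    assert run_with_guard(FakeJob(), [False, True, False]) == 1
```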
That sounds like a tough problem. Most people will be running on Linux clusters, so maybe another thing to look into is system-specific CLIs for monitoring process status.
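A portable alternative to scheduler-specific CLIs could be a psutil-based check along these lines. This is a sketch, not custodian code; the helper name and the `"vasp"` match string are assumptions.

```python
# Sketch: check whether any VASP-like process is currently running,
# using psutil instead of a system-specific CLI. Illustrative only.
import psutil


def vasp_is_running(name_fragment: str = "vasp") -> bool:
    """Return True if any running process name contains `name_fragment`."""
    for proc in psutil.process_iter(attrs=["name"]):
        try:
            if name_fragment in (proc.info["name"] or ""):
                return True
        except (psutil.NoSuchProcess, psutil.AccessDenied):
            continue  # process vanished or is not ours; ignore it
    return False


if __name__ == "__main__":
    print("VASP running?", vasp_is_running())
```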
I think that while calling `terminate()` on a job that has already killed itself or finished OK (custodian might still trigger, e.g., insufficient NBANDS or large SIGMA handlers on jobs that finished OK from the VASP side) is dubious, it is not worth messing with this at the moment. According to the comment mentioned by @fyalcin, it might actually be necessary for some systems.

I think `killall` is a good backup to have for now, since we have already established that multi-node jobs are much more common than several jobs on a single node. I am doubtful there is a good way to gather the data that @janosh requested, but if all of us take care to keep our log files, one could at least get some statistics. Maybe @fyalcin should then include some more warnings in the logs with specific keywords to make analysis easier?

I also wanted to mention that while this is a great fix and I am very thankful to @fyalcin for figuring out that my solution from #259 was buggy (he also significantly contributed to that idea, btw), I am not sure it will solve all `VasprunXMLValidator` errors.
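A tiny sketch of what such keyword-tagged warnings could look like; the logger name and the `TERMINATE_FALLBACK` keyword are purely illustrative, not anything custodian currently emits.

```python
# Illustrative only: tag termination-related log messages with a fixed
# keyword so failed runs can later be grepped and aggregated for statistics.
import logging

logger = logging.getLogger("custodian")


def log_fallback_termination(job_name: str) -> None:
    # "TERMINATE_FALLBACK" is an invented keyword, chosen to be easy to grep.
    logger.warning("TERMINATE_FALLBACK: killall fallback used for job %s", job_name)
```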
…n.terminate instead of the custom job.terminate, which was not reliably killing the vasp processes. This is now changed so that the termination order defaults to terminate_func -> job.terminate -> p.terminate.
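A condensed sketch of that preference order. The labels mirror the commit message, but the surrounding scaffolding (`terminate_with_fallbacks` and its signature) is an assumption, not custodian's actual implementation.

```python
# Illustrative fallback chain: prefer an explicitly supplied terminate_func,
# then the job's own terminate(), and only then the raw Popen.terminate().
import logging
import subprocess
from typing import Callable, Optional

logger = logging.getLogger("custodian")


def terminate_with_fallbacks(
    p: subprocess.Popen,
    job_terminate: Optional[Callable[[], None]] = None,
    terminate_func: Optional[Callable[[], None]] = None,
) -> None:
    for label, func in (
        ("terminate_func", terminate_func),
        ("job.terminate", job_terminate),
        ("p.terminate", p.terminate),
    ):
        if func is None:
            continue
        try:
            func()
            logger.info("terminated via %s", label)
            return
        except Exception:
            logger.warning("%s failed, trying next fallback", label)
    logger.error("all termination methods failed")
```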
I might have figured out something about why `job.terminate()` was being called even after VASP had exited:

custodian/custodian/custodian.py, lines 494 to 496 (commit f7dc11a)

We only break out of the while loop in a couple of places. Please have a look at these three lines inside the loop:

custodian/custodian/custodian.py, lines 472 to 474 (commit f7dc11a)

I believe that changing the default choice of terminate function so that `job.terminate` is preferred, with the process-level `p.terminate` only as a fallback, is the way to go. For the other codes whose jobs might need the second, process-level terminate, it is still available as a fallback. This change has a positive side effect in that the fallback should only be triggered if something actually goes wrong with the new `OUTCAR`-based check.

Finally, I don't understand the reasoning behind

custodian/custodian/custodian.py, lines 477 to 481 (commit f7dc11a)

which calls terminate_func when there are no monitors, but I'm not touching that for now.
Holy smokes, I second @arosen93, this is getting beyond my paygrade as well. I'm glad you figured this all out (except the terminate_func behavior you mentioned). I kindly ask that you leave copious comments to explain your reasoning behind all the changes you make, so that the next person who dives into this has an easier time understanding why things are done the way they are and what they need to be aware of when making further changes.
…loop of Custodian._run_job(), and the choice of terminate function.
So, I went through the tracebacks of my calculations that failed with `VasprunXMLValidator`.
Then I randomly selected 20 failed calculations on the skylake nodes with "Device or resource busy" in the traceback and restarted them - zero failures this time. In short, I'd say that the changes look promising but perhaps need more testing. If anyone is able to do similar testing, that would also be helpful :)
@fyalcin why do you believe this points to issues with termination + resubmission via slurm? It sounds plausible, but I'm also uninformed about how much of this stuff works.
Linking #265 here too, since it seems the change in termination functions also introduced another bug.
I believe the "Device or resource busy" in vasp.out happens when one attempts to start an MPI process that would put the total number of tasks per node above the slurm spec |
@fyalcin is this ready for merging?
@matthewkuner I think it's almost there. I will add a couple more commits addressing #265 and it should be good to go afterward.
…__() to be immutable.
…__() and VaspNEBJob.__init__() to be immutable.
So, I changed the inits of both `VaspJob` and `VaspNEBJob` to be immutable.
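For context, here is a minimal sketch of the mutable-default-argument pitfall that making `__init__` defaults immutable avoids. The class and argument names below are illustrative, not custodian's actual signatures.

```python
# Illustrative sketch of why mutable defaults in __init__ are risky:
# the shared default list is mutated by one instance and leaks into the next.

class BadJob:
    def __init__(self, settings=[]):  # one shared list for ALL instances
        self.settings = settings


class GoodJob:
    def __init__(self, settings=None):  # immutable default
        # Each instance gets its own fresh list.
        self.settings = list(settings) if settings is not None else []


if __name__ == "__main__":
    a, b = BadJob(), BadJob()
    a.settings.append("override")
    print(b.settings)   # ['override']  -- surprising shared state

    c, d = GoodJob(), GoodJob()
    c.settings.append("override")
    print(d.settings)   # []            -- isolated, as expected
```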
Looks great to me now. Thanks @fyalcin for the good work. This should be pretty safe and hopefully avoid a lot of the `VasprunXMLValidator` errors.
@shyuep --- would you mind taking a look at this one when you get a chance? Pretty important bug fix and usability improvement! 🙏
Summary
A [recent PR](#259) and a [related post](https://matsci.org/t/vasprunxmlvalidator-error-in-the-middle-of-a-workflow/49136) on the matsci atomate forums prompted me to look into this with a system with which I could reliably reproduce the exact `VasprunXMLValidator` error mentioned in the matsci post.

After some log crunching and monitoring of the VASP processes (and the files being accessed by them), I noticed that VASP wasn't being killed properly by `VaspJob.terminate()`, and as soon as `custodian` tried running a new VASP job, it failed because we were requesting more tasks than available per the slurm config. This results in a broken `vasprun.xml`, and the validator check fails.

The reason why VASP wasn't being killed properly eluded us (@MichaelWolloch and I) until we carefully observed the files being accessed by VASP during a run, where we noticed that VASP stops accessing `CHGCAR` (the file the current implementation checks) shortly before a job finishes. In my test system, the `VaspJob.terminate()` call coincided with the window between this point and the job gracefully exiting by itself, which leads to the `open_files()` check failing. This in turn triggers the fallback method with the `os` call `killall`, but an erroneous indentation introduced in a recent PR prevented the fallback method from running properly.

After careful observation, it seems that several files are accessed by VASP throughout the whole run, such as `OUTCAR` and `vasprun.xml`, among others. My proposal in this PR is to use the `OUTCAR`, but this is up for debate.
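To make the proposal concrete, here is a rough sketch of how a process that still has `OUTCAR` open could be located with psutil. This is only an illustration of the idea, not custodian's actual `terminate()` implementation; the helper name is made up.

```python
# Sketch: find processes that still hold OUTCAR open in the job directory,
# so only those get terminated. Illustrative only; custodian's code differs.
import os
import psutil


def processes_holding(filename="OUTCAR", cwd=None):
    """Yield processes that currently have `filename` open in `cwd`."""
    target_dir = os.path.abspath(cwd or os.getcwd())
    for proc in psutil.process_iter(attrs=["pid", "name"]):
        try:
            for f in proc.open_files():
                if (os.path.basename(f.path) == filename
                        and os.path.dirname(f.path) == target_dir):
                    yield proc
                    break
        except (psutil.NoSuchProcess, psutil.AccessDenied):
            continue  # process exited or belongs to another user


if __name__ == "__main__":
    for p in processes_holding("OUTCAR"):
        print(f"PID {p.pid} ({p.info['name']}) still has OUTCAR open")
        # p.terminate() would target only this process, not every VASP on the node
```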
This change fixes the issue with my test system that used to fail with `VasprunXMLValidator`, and hopefully the other similar issues as well, such as the one in the aforementioned post.

As for the fallback method with `killall`, I propose getting rid of it entirely and would like your opinion on this. It has been shown by @MichaelWolloch that `rlaunch multi` gives a considerable speed-up in certain scenarios, so we should ideally retain the full functionality of that mode. There are also cases where a problematic VASP job might end before custodian can interfere (depending on the `polling_time_step` and `monitor_freq`), after which custodian still calls `VaspJob.terminate()`, which would then use the fallback method (since that particular VASP process isn't running anymore) and kill other VASP processes spawned by `rlaunch multi` on the same node; an undesirable outcome.

This PR slightly modifies the current implementation and asks for the maintainers' opinion on getting rid of the fallback method with `killall` once and for all.
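As a final illustration of why the blanket fallback is risky in `rlaunch multi` mode, compare a name-based `killall` with a directory-targeted kill. This is a sketch under the assumption that the VASP binary is named `vasp_std`; neither snippet is custodian code.

```python
# Sketch: the difference between the killall fallback and a targeted kill.
# Assumes the VASP executable is named "vasp_std" (an assumption).
import subprocess
import psutil


def killall_fallback():
    # Kills EVERY process named vasp_std on the node, including healthy
    # VASP runs launched by other rlaunch multi workers.
    subprocess.run(["killall", "vasp_std"], check=False)


def targeted_kill(cwd):
    # Kills only the vasp_std processes running in this job's directory.
    for proc in psutil.process_iter(attrs=["name", "cwd"]):
        try:
            if proc.info["name"] == "vasp_std" and proc.info["cwd"] == cwd:
                proc.terminate()
        except (psutil.NoSuchProcess, psutil.AccessDenied):
            continue
```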