Change the file check in the VASP terminate function (also possibly the fallback method) #264

Merged · 10 commits into materialsproject:master · Jul 6, 2023

Conversation

@fyalcin (Contributor) commented Jun 15, 2023

Summary

A [recent PR](#259) and a [related post](https://matsci.org/t/vasprunxmlvalidator-error-in-the-middle-of-a-workflow/49136) on the matsci atomate forum prompted me to look into this, using a system with which I could reliably reproduce the exact VasprunXMLValidator error mentioned in the matsci post.

After some log crunching and monitoring of the VASP processes (and the files being accessed by them), I noticed that VASP wasn't being killed properly by VaspJob.terminate(), so as soon as custodian tried to run a new VASP job, that job failed because more tasks were being requested than the slurm config allows. This results in a broken vasprun.xml, and the validator check fails.

The reason why VASP wasn't being killed properly eluded us (@MichaelWolloch and me) until we carefully observed the files being accessed by VASP during a run: VASP stops accessing CHGCAR (the file the current implementation checks) shortly before a job finishes. In my test system, the VaspJob.terminate() call fell in the window between this point and the job exiting gracefully on its own, which made the open_files() check fail. This in turn triggered the fallback method with the OS call killall, but an erroneous indentation introduced in a recent PR prevented the fallback method from running properly.

After careful observation, it seems that several files are accessed by VASP throughout the entire run, such as OUTCAR and vasprun.xml, among others. My proposal in this PR is to use OUTCAR, but this is up for debate.
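
To make the idea concrete, here is a minimal sketch of this kind of check, not the exact custodian implementation; it assumes psutil is available, that the job runs in the current working directory, and that the process names listed are only examples:

```python
import os
import psutil

def kill_vasp_holding_file(filename="OUTCAR", names=("vasp_std", "vasp_gam", "vasp_ncl")):
    """Kill VASP processes that currently have `filename` (in cwd) open."""
    target = os.path.join(os.getcwd(), filename)
    for proc in psutil.process_iter(["name"]):
        try:
            if proc.info["name"] in names and any(
                f.path == target for f in proc.open_files()
            ):
                proc.kill()
        except (psutil.NoSuchProcess, psutil.AccessDenied):
            # Process vanished or is not ours; skip it.
            continue
```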

This change fixes the issue with my test system, which used to fail with VasprunXMLValidator, and hopefully the other similar issues, such as the one in the aforementioned post.

As for the fallback method with killall, I propose getting rid of it entirely and would like your opinion on this. @MichaelWolloch has shown that rlaunch multi gives a considerable speed-up in certain scenarios, so we should ideally retain the full functionality of that mode. There are also cases where a problematic VASP job ends before custodian can interfere (depending on polling_time_step and monitor_freq); custodian then still calls VaspJob.terminate(), which falls back to killall (since that particular VASP process isn't running anymore) and thereby kills other VASP processes spawned by rlaunch multi on the same node, an undesirable outcome.
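
For illustration only (this is not custodian's exact fallback call, and the binary names are just examples): a name-based killall cannot distinguish between the misbehaving run and a healthy run launched by another rlaunch worker on the same node.

```python
import subprocess

# Signals *every* process with a matching name on this node, including
# healthy VASP runs started by other rlaunch workers.
for binary in ("vasp_std", "vasp_gam", "vasp_ncl"):
    subprocess.run(["killall", binary], check=False)
```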

This PR slightly modifies the current implementation and asks for the maintainers' opinion on getting rid of the fallback method with killall once and for all.

fyalcin and others added 2 commits June 15, 2023 15:24
…ssing the OUTCAR file to determine the process to kill. Fixed an indentation in the fallback method.
@MichaelWolloch (Contributor)

Thanks for checking and bug fixing @fyalcin !

I just checked the VASP source code, and it seems that the last file accessed is vasprun.xml, so maybe that should be used instead of OUTCAR.
I initially selected CHGCAR because it is accessed by all processes, not just a single one, but this is of no great importance.

As for getting rid of the killall fallback:
I think that rlaunch multi is indeed a rather niche use case. The more proper solution would probably be to not call terminate() at all if the VASP run has already finished, but that is a whole other can of worms and goes far beyond this PR, I think.

I would love the opinion of @arosen93 , @janosh and @matthewkuner on this.

@Andrew-S-Rosen (Member) commented Jun 15, 2023

Thanks for fixing this important bug. The VASP termination is a bit beyond my pay grade, so I don't have much input to provide, but I look forward to the fix being merged.

I'm tagging @munrojm here in case he has any objections/comments, as he often runs large "jobpacking" runs that I want to make sure still work appropriately (due to changes made over these several PRs).

Edit: Linking to #217.

@janosh (Member) commented Jun 15, 2023

@fyalcin Wow, I'm thoroughly impressed both by your in-depth troubleshooting and excellent write-up of results. I've run into the VasprunXMLValidator error so many times yet was never able to find time and muster the perseverance to fix it. If you managed, I vote for you as employee of the month! 😄

On the killall removal question, I think we need more data to come to an informed decision. The primary factors of interest are

  1. When killall is triggered, how often does it kill the wrong VASP run?
  2. What percentage of users run with a setup where the wrong VASP might be killed?
  3. For users not in such a setup, how often is the fallback triggered? (I.e. how often does it serve a useful purpose?)

I realize none of these are easy to measure. Maybe we can come up with decent proxies/estimates.

@matthewkuner (Contributor) commented Jun 15, 2023

@fyalcin I had not considered the potential issue with calling killall after a problematic VASP job has already ended--I'm curious how often this would happen. You may also be able to check whether the relevant VASP job is running before attempting to terminate it at all (in which case falling back to killall wouldn't be a risk).

I agree with the questions from @janosh, which I think we'd need to consider before removing the backup killall entirely.

Finally, really cool work on determining which output file should be monitored!

fyalcin added 3 commits June 15, 2023 22:50
…s terminated, we can use a flag to make sure to not terminate the job twice with the job.terminate() call after `if has_error`.
…ocess has terminated, we can use a flag to make sure to not terminate the job twice with the job.terminate() call after `if has_error`."

This reverts commit 46e3650.
@fyalcin (Contributor, Author) commented Jun 16, 2023

Thank you for your kind words, everyone :)

After changing the file checked from CHGCAR to vasprun.xml (thanks @MichaelWolloch), the only case where I see the fallback being triggered is when VASP throws an error and terminates itself within a `monitor_freq * polling_time_step` second window. This is because

```python
if p.poll() is not None:
    break
```

triggers before

```python
if n % self.monitor_freq == 0:
    has_error = self._do_check(self.monitors, terminate)
```

has a chance to do its thing, and we break out of the loop. The first error check with all the handlers then catches the error and triggers job.terminate(), even though VASP has already terminated naturally.

I just committed a new check 46e3650 to avoid calling terminate() on a job whose child process has already terminated, but reverted it because the terminate() call at

```python
if has_error:
    # This makes sure the job is killed cleanly for certain systems.
    job.terminate()
```

was added specifically (I assume) for cases where

```python
if n % self.monitor_freq == 0:
    has_error = self._do_check(self.monitors, terminate)
```

fails to terminate the VASP process by terminating mpirun (in my case).

I think @matthewkuner's idea of checking whether a VASP job is running before any terminate() call might be the way to go, but history has shown that p.poll() is not a reliable way to do it. I'll think about this a bit.
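
For illustration, a rough sketch of such a check (assuming psutil; the binary names are just examples, and this is not something in custodian yet) that looks for live VASP processes directly instead of relying on p.poll() on the mpirun parent:

```python
import psutil

def vasp_is_running(names=("vasp_std", "vasp_gam", "vasp_ncl")):
    """Return True if any VASP process is alive on this node."""
    for proc in psutil.process_iter(["name"]):
        try:
            if proc.info["name"] in names:
                return True
        except psutil.NoSuchProcess:
            continue
    return False
```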

In the meantime, I will test the newest implementation on some of my old jobs that failed with VasprunXMLValidator errors and report back with findings.

@janosh (Member) commented Jun 16, 2023

I think @matthewkuner's idea of checking whether a VASP job is running before any terminate() call might be the way to go, but history has shown that p.poll() is not a reliable way to do it. I'll think about this a bit.

That sounds like a tough problem. Most people will be running on Linux clusters, so maybe another thing to look into is system-specific CLIs for monitoring process status, like pgrep or ps -ef. Calling them with subprocess would have to guard against "Command not found" in case you're running on Windows, where there's probably a differently named utility doing a similar job. This might turn into a can of worms though...
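
For instance, a rough sketch of that idea (Linux-only; assumes pgrep is on the PATH and simply gives up where it isn't):

```python
import subprocess

def vasp_pids(pattern="vasp"):
    """Return PIDs whose command line matches `pattern`, or None if pgrep is unavailable."""
    try:
        result = subprocess.run(["pgrep", "-f", pattern], capture_output=True, text=True)
    except FileNotFoundError:
        # e.g. on Windows, where pgrep does not exist.
        return None
    return [int(pid) for pid in result.stdout.split()]
```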

@MichaelWolloch (Contributor)

I think that while calling terminate() on a job that has already killed itself or finished OK is dubious (custodian might still trigger handlers, e.g. for insufficient NBANDS or large SIGMA, on jobs that finished OK from the VASP side), it is not worth messing with this at the moment. According to the comment mentioned by @fyalcin, it might actually be necessary for some systems.

I think killall is a good backup to have for now, since we have already established that multi-node jobs are much more common than several jobs on a single node. I am doubtful there is a good way to gather the data that @janosh requested, but if all of us take care to keep our logfiles, one could at least get some statistics. Maybe @fyalcin should then include some more warnings in the logs with specific keywords to make the analysis easier?

I also wanted to mention that while this is a great fix and I am very thankful to @fyalcin for figuring out that my solution from #259 was buggy (he also contributed significantly to that idea, btw), I am not sure this will solve all VasprunXMLValidator errors. They are thrown when the vasprun.xml cannot be read or is incomplete. That of course happens when VASP cannot run because of insufficient resources, which this should fix, but there might be other causes that @janosh and others have experienced as well. I still support the idea of making @fyalcin employee of the month :)

…n.terminate instead of the custom job.terminate, which was not reliably killing the vasp processes. This is now changed so the order terminate defaults to is terminate_func -> job.terminate -> p.terminate.
@fyalcin (Contributor, Author) commented Jun 16, 2023

I might have figured out something about why

```python
if has_error:
    # This makes sure the job is killed cleanly for certain systems.
    job.terminate()
```

needed to be put in.

We only break out of the while loop once p.poll() returns something other than None, which should mean that the VASP processes are no longer running. However, this turned out to be a not-so-reliable way of ensuring the VASP processes are killed, and the second terminate needed to be put in. In one of the recent PRs on this matter, @MichaelWolloch noticed that sending a terminate or kill signal to mpirun does not always propagate to the child VASP processes in time (or at all?), and I believe this is exactly the crux of the issue here.

Please have a look at these 3 lines inside the while loop in Custodian._run_job():

```python
terminate = self.terminate_func or p.terminate
if n % self.monitor_freq == 0:
    has_error = self._do_check(self.monitors, terminate)
```

Here, the variable terminate is set to a custom terminate_func or to p.terminate. This terminate function is then passed as the second (optional) argument to _do_check(), which calls terminate() if it encounters a monitor error. The problem is that p.terminate is subprocess.Popen.terminate, not our implementation in VaspJob.terminate. Since p.terminate only terminates mpirun in our case, it doesn't reliably kill VASP, and it was necessary to add the second terminate call outside the loop, this time the correct job.terminate().

I believe that changing the assignment terminate = self.terminate_func or p.terminate to instead default to job.terminate if terminate_func is None, and finally to p.terminate if both terminate_func and job.terminate are None, would allow us to get rid of the second terminate call outside the loop.
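
A minimal sketch of that precedence (simplified; the actual change in this PR may look different in the code):

```python
# Prefer a user-supplied terminate_func, then the job's own terminate
# (e.g. VaspJob.terminate), and only fall back to Popen.terminate last.
terminate = self.terminate_func or getattr(job, "terminate", None) or p.terminate
```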

For other codes whose jobs might need the second terminate() outside the loop and have their own implementation (such as cp2k), the proper terminate call inside the loop should now kill everything correctly; for jobs that don't have their own implementation of terminate, the terminate outside the loop was redundant anyway, since it wasn't doing anything extra.

This change has a positive side effect: the fallback should only be triggered if something actually goes wrong with the new open_files() method in VaspJob.terminate :)

Finally, I don't understand the reasoning behind the else block

```python
else:
    p.wait()
    if self.terminate_func is not None and self.terminate_func != p.terminate:
        self.terminate_func()
        time.sleep(self.polling_time_step)
```

which calls terminate_func when there are no monitors, but I'm not touching that for now.

@janosh (Member) commented Jun 16, 2023

Holy smokes, I second @arosen93; this is getting beyond my pay grade as well. I'm glad you figured this all out (except for the else block, which looks like it's in dire need of a refactor as well).

I kindly ask that you leave copious comments explaining your reasoning behind all the changes you make, so that the next person who dives into this has an easier time understanding why things are done the way they are and what to be aware of when making further changes.

…loop of Custodian._run_job(), and the choice of terminate function.
@fyalcin (Contributor, Author) commented Jun 19, 2023

So, I went through the tracebacks of my calculations that failed with VasprunXMLValidator errors, and here are some observations and statistics:

  • 860 failed calculations on a mix of zen3 and skylake nodes
  • 521 with vasp.out containing "Device or resource busy", pointing to issues with termination + resubmission via slurm
  • the 521 above are limited to skylake nodes only (does this mean the old terminate works fine on zen nodes?)
  • the total number of failed calculations on skylake nodes is 557, meaning ~94% of the failures there can be attributed to termination issues

Then I randomly selected 20 failed calculations on the skylake nodes with "Device or resource busy" in the traceback and restarted them - zero failures this time. In short, I'd say the changes look promising but perhaps need more testing. If anyone is able to do similar testing, that would also be helpful :)

@matthewkuner (Contributor)

521 with vasp.out containing "Device or resource busy" pointing to issues with termination + resubmission via slurm

@fyalcin why do you believe this points to issues with termination + resubmission via slurm? It sounds plausible, but I'm also not very well informed about how this stuff works.

@mkhorton (Member) commented Jun 19, 2023

Linking #265 here too, since it seems the change in termination functions also introduced another bug.

@fyalcin (Contributor, Author) commented Jun 20, 2023

521 with vasp.out containing "Device or resource busy" pointing to issues with termination + resubmission via slurm

@fyalcin why do you believe this points to issues with termination + resubmission via slurm? It sounds plausible, but I'm also uninformed on how much of this stuff works

I believe the "Device or resource busy" lines in vasp.out appear when one attempts to start an MPI process that would put the total number of tasks on the node above the slurm spec --ntasks-per-node (backed up by the forum post mentioned in the original post). I can reliably reproduce this by trying to start a new VASP calculation on a node that is already running --ntasks-per-node VASP processes; it immediately fails with "Device or resource busy" lines. In a regular workflow, this can happen when custodian tries to run a VaspJob after a failed termination.

@matthewkuner (Contributor)

@fyalcin is this ready for merging?

@fyalcin (Contributor, Author) commented Jun 27, 2023

@fyalcin is this ready for merging?

@matthewkuner I think it's almost there. I will add a couple more commits addressing #265 and it should be good to go afterward.

@fyalcin (Contributor, Author) commented Jun 30, 2023

So, I changed the inits of both VaspJob and VaspNEBJob to convert vasp_cmd and gamma_vasp_cmd into tuples when they are assigned as attributes. Everything seems to be working fine. I'm still not sure whether this was the best way to make these variables immutable, so if there are suggestions, they would be more than welcome. If not, I think this is ready for merging.
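
For reference, a minimal sketch of what such a change could look like (abbreviated signature; not the full VaspJob.__init__):

```python
class VaspJob:
    def __init__(self, vasp_cmd, gamma_vasp_cmd=None, **kwargs):
        # Cast the command sequences to tuples on assignment so they cannot
        # be mutated later (e.g. by handlers modifying a list in place).
        self.vasp_cmd = tuple(vasp_cmd)
        self.gamma_vasp_cmd = tuple(gamma_vasp_cmd) if gamma_vasp_cmd else gamma_vasp_cmd
```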

@MichaelWolloch (Contributor)

Looks great to me now. Thanks @fyalcin for the good work. This should be pretty safe and hopefully avoid a lot of the VasprunXMLValidator errors.

@Andrew-S-Rosen (Member)

@shyuep --- would you mind taking a look at this one when you get a chance? Pretty important bug fix and usability improvement! 🙏

@shyuep shyuep merged commit 50d85e7 into materialsproject:master Jul 6, 2023