Change the file check in the VASP terminate function (also possibly the fallback method) #264

Merged · 10 commits into materialsproject:master · Jul 6, 2023

Conversation

@fyalcin (Contributor) commented Jun 15, 2023

Summary

A [recent PR](#259) and a [related post](https://matsci.org/t/vasprunxmlvalidator-error-in-the-middle-of-a-workflow/49136) on the matsci atomate forum prompted me to look into this, using a system with which I could reliably reproduce the exact VasprunXMLValidator error mentioned in the matsci post.

After some log crunching and monitoring of the VASP processes (and the files being accessed by them), I noticed that VASP wasn't being killed properly by VaspJob.terminate(), so as soon as custodian tried to run a new VASP job, that job failed because more tasks were being requested than the slurm config allows. This results in a broken vasprun.xml, and the validator check fails.

The reason why VASP wasn't being killed properly eluded us (@MichaelWolloch and me) until we carefully observed the files being accessed by VASP during a run: VASP stops accessing CHGCAR (the file the current implementation checks) shortly before a job finishes. In my test system, the VaspJob.terminate() call fell in the window between this point and the job exiting gracefully on its own, which made the open_files() check fail. This in turn triggered the fallback method with the OS call killall, but an erroneous indentation introduced in a recent PR prevented the fallback method from running properly.

After careful observation, it seems that several files are accessed by VASP throughout the entire run, such as OUTCAR and vasprun.xml, among others. My proposal in this PR is to use OUTCAR, but this is up for debate.
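
To make the idea concrete, here is a minimal sketch of this kind of check, not the exact custodian implementation; it assumes psutil is available, that the job runs in the current working directory, and that the process names listed are only examples:

```python
import os
import psutil

def kill_vasp_holding_file(filename="OUTCAR", names=("vasp_std", "vasp_gam", "vasp_ncl")):
    """Kill VASP processes that currently have `filename` (in cwd) open."""
    target = os.path.join(os.getcwd(), filename)
    for proc in psutil.process_iter(["name"]):
        try:
            if proc.info["name"] in names and any(
                f.path == target for f in proc.open_files()
            ):
                proc.kill()
        except (psutil.NoSuchProcess, psutil.AccessDenied):
            # Process vanished or is not ours; skip it.
            continue
```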

This change fixes the issue with my test system, which used to fail with VasprunXMLValidator, and hopefully the other similar issues, such as the one in the aforementioned post.

As for the fallback method with killall, I propose getting rid of it entirely and would like your opinion on this. @MichaelWolloch has shown that rlaunch multi gives a considerable speed-up in certain scenarios, so we should ideally retain the full functionality of that mode. There are also cases where a problematic VASP job ends before custodian can interfere (depending on polling_time_step and monitor_freq); custodian then still calls VaspJob.terminate(), which falls back to killall (since that particular VASP process isn't running anymore) and thereby kills other VASP processes spawned by rlaunch multi on the same node, an undesirable outcome.
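
For illustration only (this is not custodian's exact fallback call, and the binary names are just examples): a name-based killall cannot distinguish between the misbehaving run and a healthy run launched by another rlaunch worker on the same node.

```python
import subprocess

# Signals *every* process with a matching name on this node, including
# healthy VASP runs started by other rlaunch workers.
for binary in ("vasp_std", "vasp_gam", "vasp_ncl"):
    subprocess.run(["killall", binary], check=False)
```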

This PR slightly modifies the current implementation and asks for the maintainers' opinion on getting rid of the fallback method with killall once and for all.

fyalcin and others added 2 commits June 15, 2023 15:24
…ssing the OUTCAR file to determine the process to kill. Fixed an indentation in the fallback method.
@MichaelWolloch (Contributor)

Thanks for checking and bug fixing @fyalcin !

I just checked the VASP source code, and it seems that the last file accessed is vasprun.xml, so maybe that should be used instead of OUTCAR.
I initially selected CHGCAR because it is accessed by all processes, not just a single one, but this is of no great importance.

As for getting rid of the killall fallback:
I think that rlaunch multi is indeed a rather niche use case. The more proper solution would probably be to not call terminate() at all if the VASP run has already finished, but that is a whole other can of worms and goes far beyond this PR, I think.

I would love the opinion of @arosen93 , @janosh and @matthewkuner on this.

@Andrew-S-Rosen (Member) commented Jun 15, 2023

Thanks for fixing this important bug. The VASP termination is a bit beyond my pay grade, so I don't have much input to provide, but I look forward to the fix being merged.

I'm tagging @munrojm here in case he has any objections/comments, as he often runs large "jobpacking" runs that I want to make sure still work appropriately (due to changes made over these several PRs).

Edit: Linking to #217.

@janosh (Member) commented Jun 15, 2023

@fyalcin Wow, I'm thoroughly impressed both by your in-depth troubleshooting and excellent write-up of results. I've run into the VasprunXMLValidator error so many times yet was never able to find time and muster the perseverance to fix it. If you managed, I vote for you as employee of the month! 😄

On the killall removal question, I think we need more data to come to an informed decision. The primary factors of interest are

  1. When killall is triggered, how often does it kill the wrong VASP run?
  2. What percentage of users run with a setup where the wrong VASP might be killed?
  3. For users not in such a setup, how often is the fallback triggered? (I.e. how often does it serve a useful purpose?)

I realize none of these are easy to measure. Maybe we can come up with decent proxies/estimates.

@matthewkuner (Contributor) commented Jun 15, 2023

@fyalcin I had not considered the potential issue with calling killall after a problematic VASP job has already ended--I'm curious how often this would happen. You may also be able to check whether the relevant VASP job is running before attempting to terminate it at all (in which case falling back to killall wouldn't be a risk).

I agree with the questions from @janosh, which I think we'd need to consider before removing the backup killall entirely.

Finally, really cool work on determining which output file should be monitored!

fyalcin added 3 commits June 15, 2023 22:50
…s terminated, we can use a flag to make sure to not terminate the job twice with the job.terminate() call after `if has_error`.
…ocess has terminated, we can use a flag to make sure to not terminate the job twice with the job.terminate() call after `if has_error`."

This reverts commit 46e3650.
@fyalcin (Contributor, Author) commented Jun 16, 2023

Thank you for your kind words, everyone :)

After changing the file checked from CHGCAR to vasprun.xml (thanks @MichaelWolloch), the only case where I see the fallback being triggered is when VASP throws an error and terminates itself within a `monitor_freq * polling_time_step` second window. This is because

```python
if p.poll() is not None:
    break
```

triggers before

```python
if n % self.monitor_freq == 0:
    has_error = self._do_check(self.monitors, terminate)
```

has a chance to do its thing, and we break out of the loop. The first error check with all the handlers then catches the error and triggers job.terminate(), even though VASP has already terminated naturally.

I just committed a new check 46e3650 to avoid calling terminate() on a job whose child process has already terminated, but reverted it because the terminate() call at

```python
if has_error:
    # This makes sure the job is killed cleanly for certain systems.
    job.terminate()
```

was added specifically (I assume) for cases where

```python
if n % self.monitor_freq == 0:
    has_error = self._do_check(self.monitors, terminate)
```

fails to terminate the VASP process by terminating mpirun (in my case).

I think @matthewkuner's idea of checking whether a VASP job is running before any terminate() call might be the way to go, but history has shown that p.poll() is not a reliable way to do it. I'll think about this a bit.
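
For illustration, a rough sketch of such a check (assuming psutil; the binary names are just examples, and this is not something in custodian yet) that looks for live VASP processes directly instead of relying on p.poll() on the mpirun parent:

```python
import psutil

def vasp_is_running(names=("vasp_std", "vasp_gam", "vasp_ncl")):
    """Return True if any VASP process is alive on this node."""
    for proc in psutil.process_iter(["name"]):
        try:
            if proc.info["name"] in names:
                return True
        except psutil.NoSuchProcess:
            continue
    return False
```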

In the meantime, I will test the newest implementation on some of my old jobs that failed with VasprunXMLValidator errors and report back with findings.

@janosh (Member) commented Jun 16, 2023

I think @matthewkuner's idea of checking whether a VASP job is running before any terminate() call might be the way to go, but history has shown that p.poll() is not a reliable way to do it. I'll think about this a bit.

That sounds like a tough problem. Most people will be running on Linux clusters, so maybe another thing to look into is system-specific CLIs for monitoring process status, like pgrep or ps -ef. Calling them with subprocess would have to guard against "Command not found" in case you're running on Windows, where there's probably a differently named utility doing a similar job. This might turn into a can of worms though...
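
For instance, a rough sketch of that idea (Linux-only; assumes pgrep is on the PATH and simply gives up where it isn't):

```python
import subprocess

def vasp_pids(pattern="vasp"):
    """Return PIDs whose command line matches `pattern`, or None if pgrep is unavailable."""
    try:
        result = subprocess.run(["pgrep", "-f", pattern], capture_output=True, text=True)
    except FileNotFoundError:
        # e.g. on Windows, where pgrep does not exist.
        return None
    return [int(pid) for pid in result.stdout.split()]
```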

@MichaelWolloch (Contributor)

I think that while calling terminate() on a job that has already killed itself or finished OK is dubious (custodian might still trigger handlers, e.g. for insufficient NBANDS or large SIGMA, on jobs that finished OK from the VASP side), it is not worth messing with this at the moment. According to the comment mentioned by @fyalcin, it might actually be necessary for some systems.

I think killall is a good backup to have for now, since we have already established that multi-node jobs are much more common than several jobs on a single node. I am doubtful there is a good way to gather the data that @janosh requested, but if all of us take care to keep our logfiles, one could at least get some statistics. Maybe @fyalcin should then include some more warnings in the logs with specific keywords to make the analysis easier?

I also wanted to mention that while this is a great fix and I am very thankful to @fyalcin for figuring out that my solution from #259 was buggy (he also contributed significantly to that idea, btw), I am not sure this will solve all VasprunXMLValidator errors. They are thrown when the vasprun.xml cannot be read or is incomplete. That of course happens when VASP cannot run because of insufficient resources, which this should fix, but there might be other causes that @janosh and others have experienced as well. I still support the idea of making @fyalcin employee of the month :)

…n.terminate instead of the custom job.terminate, which was not reliably killing the vasp processes. This is now changed so the order terminate defaults to is terminate_func -> job.terminate -> p.terminate.
@fyalcin (Contributor, Author) commented Jun 16, 2023

I might have figured out something about why

```python
if has_error:
    # This makes sure the job is killed cleanly for certain systems.
    job.terminate()
```

needed to be put in.

We only break out of the while loop once p.poll() returns something other than None, which should mean that the VASP processes are no longer running. However, this turned out to be a not-so-reliable way of ensuring the VASP processes are killed, and the second terminate needed to be put in. In one of the recent PRs on this matter, @MichaelWolloch noticed that sending a terminate or kill signal to mpirun does not always propagate to the child VASP processes in time (or at all?), and I believe this is exactly the crux of the issue here.

Please have a look at these 3 lines inside the while loop in Custodian._run_job():

```python
terminate = self.terminate_func or p.terminate
if n % self.monitor_freq == 0:
    has_error = self._do_check(self.monitors, terminate)
```

Here, the variable terminate is set to a custom terminate_func or to p.terminate. This terminate function is then passed as the second (optional) argument to _do_check(), which calls terminate() if it encounters a monitor error. The problem is that p.terminate is subprocess.Popen.terminate, not our implementation in VaspJob.terminate. Since p.terminate only terminates mpirun in our case, it doesn't reliably kill VASP, and it was necessary to add the second terminate call outside the loop, this time the correct job.terminate().

I believe that changing the assignment terminate = self.terminate_func or p.terminate to instead default to job.terminate if terminate_func is None, and finally to p.terminate if both terminate_func and job.terminate are None, would allow us to get rid of the second terminate call outside the loop.
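
A minimal sketch of that precedence (simplified; the actual change in this PR may look different in the code):

```python
# Prefer a user-supplied terminate_func, then the job's own terminate
# (e.g. VaspJob.terminate), and only fall back to Popen.terminate last.
terminate = self.terminate_func or getattr(job, "terminate", None) or p.terminate
```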

For other codes whose jobs might need the second terminate() outside the loop and have their own implementation (such as cp2k), the proper terminate call inside the loop should now kill everything correctly; for jobs that don't have their own implementation of terminate, the terminate outside the loop was redundant anyway, since it wasn't doing anything extra.

This change has a positive side effect: the fallback should only be triggered if something actually goes wrong with the new open_files() method in VaspJob.terminate :)

Finally, I don't understand the reasoning behind the else block

```python
else:
    p.wait()
    if self.terminate_func is not None and self.terminate_func != p.terminate:
        self.terminate_func()
        time.sleep(self.polling_time_step)
```

which calls terminate_func when there are no monitors, but I'm not touching that for now.

@janosh (Member) commented Jun 16, 2023

Holy smokes, I second @arosen93; this is getting beyond my pay grade as well. I'm glad you figured this all out (except for the else block, which looks like it's in dire need of a refactor as well).

I kindly ask that you leave copious comments explaining your reasoning behind all the changes you make, so that the next person who dives into this has an easier time understanding why things are done the way they are and what to be aware of when making further changes.

…loop of Custodian._run_job(), and the choice of terminate function.
@fyalcin (Contributor, Author) commented Jun 19, 2023

So, I went through the tracebacks of my calculations that failed with VasprunXMLValidator errors, and here are some observations and statistics:

  • 860 failed calculations on a mix of zen3 and skylake nodes
  • 521 with vasp.out containing "Device or resource busy", pointing to issues with termination + resubmission via slurm
  • the 521 above are limited to skylake nodes only (does this mean the old terminate works fine on zen nodes?)
  • the total number of failed calculations on skylake nodes is 557, meaning ~94% of the failures there can be attributed to termination issues

Then I randomly selected 20 failed calculations on the skylake nodes with "Device or resource busy" in the traceback and restarted them - zero failures this time. In short, I'd say the changes look promising but perhaps need more testing. If anyone is able to do similar testing, that would also be helpful :)

@matthewkuner (Contributor)

521 with vasp.out containing "Device or resource busy" pointing to issues with termination + resubmission via slurm

@fyalcin why do you believe this points to issues with termination + resubmission via slurm? It sounds plausible, but I'm also not very well informed about how this stuff works.

@mkhorton (Member) commented Jun 19, 2023

Linking #265 here too, since it seems the change in termination functions also introduced another bug.

@fyalcin (Contributor, Author) commented Jun 20, 2023

521 with vasp.out containing "Device or resource busy" pointing to issues with termination + resubmission via slurm

@fyalcin why do you believe this points to issues with termination + resubmission via slurm? It sounds plausible, but I'm also uninformed on how much of this stuff works

I believe the "Device or resource busy" lines in vasp.out appear when one attempts to start an MPI process that would put the total number of tasks on the node above the slurm spec --ntasks-per-node (backed up by the forum post mentioned in the original post). I can reliably reproduce this by trying to start a new VASP calculation on a node that is already running --ntasks-per-node VASP processes; it immediately fails with "Device or resource busy" lines. In a regular workflow, this can happen when custodian tries to run a VaspJob after a failed termination.

@matthewkuner (Contributor)

@fyalcin is this ready for merging?

@fyalcin (Contributor, Author) commented Jun 27, 2023

@fyalcin is this ready for merging?

@matthewkuner I think it's almost there. I will add a couple more commits addressing #265 and it should be good to go afterward.

@fyalcin (Contributor, Author) commented Jun 30, 2023

So, I changed the inits of both VaspJob and VaspNEBJob to convert vasp_cmd and gamma_vasp_cmd into tuples when they are assigned as attributes. Everything seems to be working fine. I'm still not sure whether this was the best way to make these variables immutable, so if there are suggestions, they would be more than welcome. If not, I think this is ready for merging.
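
For reference, a minimal sketch of what such a change could look like (abbreviated signature; not the full VaspJob.__init__):

```python
class VaspJob:
    def __init__(self, vasp_cmd, gamma_vasp_cmd=None, **kwargs):
        # Cast the command sequences to tuples on assignment so they cannot
        # be mutated later (e.g. by handlers modifying a list in place).
        self.vasp_cmd = tuple(vasp_cmd)
        self.gamma_vasp_cmd = tuple(gamma_vasp_cmd) if gamma_vasp_cmd else gamma_vasp_cmd
```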

@MichaelWolloch (Contributor)

Looks great to me now. Thanks @fyalcin for the good work. This should be pretty safe and hopefully avoid a lot of the VasprunXMLValidator errors.

@Andrew-S-Rosen (Member)

@shyuep --- would you mind taking a look at this one when you get a chance? Pretty important bug fix and usability improvement! 🙏

@shyuep shyuep merged commit 50d85e7 into materialsproject:master Jul 6, 2023