non-ideal error message for max_errors hit #43

computron · 2016-12-21T16:13:17Z

System

master branch (including latest fix to returncode), Py27, Linux

Summary

In my specific run, the same error (EDDRMM) comes up twice in a row.
The second time it happens, custodian says that it has hit max_errors (I think this part is OK) and raises a custodian error to exit.
This in turn results in a non-zero return code; the return code check intercepts that message and uses it to raise a "nonzero return code" error.
As a result, the final "nonzero return code" message is less helpful for debugging runs than the original "max errors hit" text.

Error message

The stack trace I get back is:

Traceback (most recent call last):
  File "/projects/matqm/matmethods_env/codes/fireworks/fireworks/core/rocket.py", line 224, in run
    m_action = t.run_task(my_spec)
  File "/projects/matqm/matmethods_env/codes/atomate/atomate/vasp/firetasks/run_calc.py", line 167, in run_task
    c.run()
  File "/projects/matqm/matmethods_env/codes/custodian/custodian/custodian.py", line 323, in run
    .format(self.total_errors, ex))
RuntimeError: 1 errors reached: (CustodianError(...), u'Job return code is 1. Terminating...'). Exited...

From the trace above, it is difficult to tell that max_errors was reached and that this is why we are exiting. You can figure it out, though, if you look at custodian.py line 323.

Files

The run is located in:
/projects/ps-matqm/prod_runs/block_2016-12-20-23-00-16-536064/launcher_2016-12-20-23-00-35-095234/launcher_2016-12-21-09-28-04-031609

Suggested solution (if known)

Actually, at first glance, I am not even sure why this is happening. As far as I can tell, when line 323 throws an exception, the line of code that validates the return code should never even run.
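
To make the suspected masking concrete, here is a minimal sketch. It is not custodian's actual code: `CustodianError` is redefined as a stand-in, and `check_job`, `run`, and the order of the two checks are assumptions made purely for illustration. It only shows how an outer wrapper that formats the first exception it sees can hide a second exit condition such as max_errors.

```python
class CustodianError(Exception):
    """Hypothetical stand-in for custodian's exception type."""


def check_job(returncode, total_errors, max_errors):
    """Toy version of the two exit conditions discussed in this issue."""
    # Assumed ordering: the return code is inspected first.
    if returncode != 0:
        raise CustodianError(
            "Job return code is {}. Terminating...".format(returncode))
    if total_errors >= max_errors:
        raise CustodianError("max_errors ({}) reached".format(max_errors))


def run(returncode, total_errors, max_errors):
    """Wrap whichever CustodianError fires first in a RuntimeError."""
    try:
        check_job(returncode, total_errors, max_errors)
    except CustodianError as ex:
        # Only the first exception is formatted, so a later condition
        # (here max_errors) never appears in the final message.
        raise RuntimeError(
            "{} errors reached: {}. Exited...".format(total_errors, ex))


if __name__ == "__main__":
    try:
        run(returncode=1, total_errors=5, max_errors=5)
    except RuntimeError as err:
        print(err)
```

In this toy version, a message that named both conditions (return code and error count) would be enough to tell the two cases apart.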


shyuep · 2016-12-21T16:24:21Z

What happens if you set max errors to be a larger number and terminate_on_nonzero to False? I need to know why this happens. Is your max error == 2?

computron · 2016-12-21T19:53:22Z

The max_errors should be 5 and there are other jobs that completed after 3 errors. e.g., see:

/projects/ps-matqm/prod_runs/block_2016-10-21-19-00-21-067631/launcher_2016-12-05-17-18-15-093034/launcher_2016-12-15-12-48-26-702546

for an example of a run with the same infrastructure that completed successfully after 3 errors.

In this case, I think it is stopping at 2 errors because the same error is repeated, and custodian is smart enough to stop trying the same fix again and again 5 times.

shyuep · 2016-12-21T20:54:41Z

I tried looking at the code, but for EDDRMM errors there is no "repeated" check, unlike for other errors. In fact, EDDRMM errors always result in a corrective action being returned. The vasp.out seems to be untouched, even though the INCAR Algo has changed. If I have to speculate, the second time round VASP didn't run at all and immediately exited, which results in the

computron · 2016-12-23T00:25:02Z

Note: to answer @xhqu1981's question (which I somehow don't see here):

There is both a std_error.txt and std_error.txt.gz. The former is empty. The latter looks like below:

forrtl: error (78): process killed (SIGTERM)
Image PC Routine Line Source
vasp.std 0000000001791329 Unknown Unknown Unknown
vasp.std 000000000178FBFE Unknown Unknown Unknown
vasp.std 0000000001715FA2 Unknown Unknown Unknown
vasp.std 00000000016C30B3 Unknown Unknown Unknown
vasp.std 00000000016C8D79 Unknown Unknown Unknown
libpthread.so.0 000000353120F7E0 Unknown Unknown Unknown
libmpi.so.1 00002B0D71548A84 Unknown Unknown Unknown
libopen-pal.so.6 00002B0D71D6E45B Unknown Unknown Unknown
libmpi.so.1 00002B0D714CACB1 Unknown Unknown Unknown
libmpi.so.1 00002B0D715C90AE Unknown Unknown Unknown
libmpi.so.1 00002B0D715CF6D2 Unknown Unknown Unknown
libmpi.so.1 00002B0D714DFD6F Unknown Unknown Unknown
libmpi_mpifh.so.2 00002B0D7122E4EA Unknown Unknown Unknown
vasp.std 0000000000416628 Unknown Unknown Unknown
vasp.std 000000000056F71A Unknown Unknown Unknown
vasp.std 000000000057ABC9 Unknown Unknown Unknown
vasp.std 0000000000DD1B80 Unknown Unknown Unknown
vasp.std 0000000000E54EBB Unknown Unknown Unknown
vasp.std 000000000152BC27 Unknown Unknown Unknown
vasp.std 0000000000411FF6 Unknown Unknown Unknown
libc.so.6 000000353061ED5D Unknown Unknown Unknown
vasp.std 0000000000411EE9 Unknown Unknown Unknown
forrtl: error (78): process killed (SIGTERM)
Image PC Routine Line Source
vasp.std 0000000001791329 Unknown Unknown Unknown
vasp.std 000000000178FBFE Unknown Unknown Unknown
vasp.std 0000000001715FA2 Unknown Unknown Unknown
vasp.std 00000000016C30B3 Unknown Unknown Unknown
vasp.std 00000000016C8D79 Unknown Unknown Unknown
libpthread.so.0 000000353120F7E0 Unknown Unknown Unknown
libmpi.so.1 00002B8CE4DFAA94 Unknown Unknown Unknown
libopen-pal.so.6 00002B8CE562045B Unknown Unknown Unknown
libmpi.so.1 00002B8CE4D7CCB1 Unknown Unknown Unknown
libmpi.so.1 00002B8CE4E7B0AE Unknown Unknown Unknown
libmpi.so.1 00002B8CE4E816D2 Unknown Unknown Unknown
libmpi.so.1 00002B8CE4D91D6F Unknown Unknown Unknown
libmpi_mpifh.so.2 00002B8CE4AE04EA Unknown Unknown Unknown
vasp.std 0000000000416628 Unknown Unknown Unknown
vasp.std 000000000056F71A Unknown Unknown Unknown
vasp.std 000000000057ABC9 Unknown Unknown Unknown
vasp.std 0000000000DD1B80 Unknown Unknown Unknown
vasp.std 0000000000E54EBB Unknown Unknown Unknown
vasp.std 000000000152BC27 Unknown Unknown Unknown
vasp.std 0000000000411FF6 Unknown Unknown Unknown
libc.so.6 000000353061ED5D Unknown Unknown Unknown
vasp.std 0000000000411EE9 Unknown Unknown Unknown
forrtl: error (78): process killed (SIGTERM)
Image PC Routine Line Source
vasp.std 0000000001791329 Unknown Unknown Unknown
vasp.std 000000000178FBFE Unknown Unknown Unknown
vasp.std 0000000001715FA2 Unknown Unknown Unknown
vasp.std 00000000016C30B3 Unknown Unknown Unknown
vasp.std 00000000016C8D79 Unknown Unknown Unknown
libpthread.so.0 000000353120F7E0 Unknown Unknown Unknown
libmkl_avx.so 00002ACD81829C54 Unknown Unknown Unknown
libmkl_avx.so 00002ACD8184AC6A Unknown Unknown Unknown
libmkl_avx.so 00002ACD818230D4 Unknown Unknown Unknown

xhqu1981 · 2016-12-23T00:48:27Z

Thanks a lot, @computron. I was wondering whether this is similar to an issue in my test. After reading your report carefully again, I noticed that your platform is Linux, which is not an OS where that issue is expected. As a result, I withdrew the comment yesterday.

xhqu1981 · 2016-12-23T01:06:51Z

To avoid confusing other people, I am duplicating my comment here. I was asking whether std_err printed this line:

"srun: error: Unable to create job step: Job/step already completing or completed"

That would be some evidence that VASP failed to launch.

@computron's current std_err.txt is empty, so I don't think std_err provides any evidence about the status of VASP in this situation. I am sorry this is not a helpful clue.
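
For anyone who wants to repeat this check on other launch directories, a rough sketch follows. The marker strings are copied from this thread; the helper names `read_text` and `find_markers` are made up for illustration and are not part of custodian or atomate. It reads both the plain and the gzipped stderr file and reports which markers each one contains.

```python
import gzip
import os

# Marker strings quoted in this thread.
MARKERS = (
    "srun: error: Unable to create job step",
    "forrtl: error (78): process killed (SIGTERM)",
)


def read_text(path):
    """Return the contents of a plain or gzipped text file."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rb") as handle:
        return handle.read().decode("utf-8", errors="replace")


def find_markers(paths, markers=MARKERS):
    """Map each existing path to the list of markers found in it."""
    hits = {}
    for path in paths:
        if not os.path.exists(path):
            continue  # skip files that are not present in this run
        text = read_text(path)
        hits[path] = [m for m in markers if m in text]
    return hits


if __name__ == "__main__":
    # File names as reported earlier in this thread.
    print(find_markers(["std_error.txt", "std_error.txt.gz"]))
```

An empty list for a file only means none of the listed markers were found in it; it says nothing about whether VASP actually started.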
