non-ideal error message for max_errors hit #43
Comments
What happens if you set max_errors to a larger number and terminate_on_nonzero to False? I need to know why this happens. Is your max_errors set to 2?
The max_errors should be 5, and there are other jobs that completed after 3 errors. For an example of a run with the same infrastructure that completed successfully after 3 errors, see: /projects/ps-matqm/prod_runs/block_2016-10-21-19-00-21-067631/launcher_2016-12-05-17-18-15-093034/launcher_2016-12-15-12-48-26-702546. In this case, I think it is stopping at 2 errors because the same error is repeated, and custodian is smart enough not to retry the same fix 5 times.
I tried looking at the code, but for EDDRMM errors there is no "repeated" check, unlike for other errors. In fact, EDDRMM errors always result in a corrective action being returned. The vasp.out seems to be untouched, even though the INCAR ALGO has changed. If I had to speculate, the second time round VASP didn't run at all and exited immediately, which resulted in the
Note: to answer @xhqu1981's question (which I somehow don't see here), there is both a std_error.txt and a std_error.txt.gz. The former is empty. The latter contains the following: forrtl: error (78): process killed (SIGTERM)
Thanks a lot, @computron. I was wondering whether this was similar to an issue in my test. After reading your report carefully again, I noticed that your platform is Linux, which is not the OS on which I would expect that issue. As a result, I withdrew the comment yesterday.
To avoid confusing other people, I am duplicating my comment here. I was asking whether std_err printed the line: "srun: error: Unable to create job step: Job/step already completing or completed". That would be some evidence that VASP failed to launch. Since @computron's current std_err.txt is empty, I don't think std_err provides any evidence about the status of VASP in this situation. I am sorry this is not a helpful clue.
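For anyone wanting to repeat this check on their own run directory, here is a rough sketch that looks for the two messages quoted in this thread in both the plain and the gzipped stderr file. The function names, the default file names, and the marker list are assumptions made for illustration; none of this is part of custodian.

```python
import gzip
from pathlib import Path

# Messages quoted in this thread; extend as needed (assumed list, not exhaustive).
FAILURE_MARKERS = (
    "srun: error: Unable to create job step",
    "process killed (SIGTERM)",
)


def read_text_maybe_gzip(path):
    """Read a text file, transparently handling a .gz suffix."""
    path = Path(path)
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rt", errors="replace") as fh:
        return fh.read()


def find_markers(run_dir, names=("std_err.txt", "std_err.txt.gz")):
    """Return {file name: [markers found]} for each stderr file that exists.

    The default file names are an assumption; pass the names your run
    actually produced.
    """
    hits = {}
    for name in names:
        path = Path(run_dir) / name
        if not path.exists():
            continue
        text = read_text_maybe_gzip(path)
        hits[name] = [m for m in FAILURE_MARKERS if m in text]
    return hits
```

An empty list for a file means it exists but contains none of the markers, which matches the situation described above for the uncompressed file.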
System
Summary
Error message
The stack trace I get back is:
From the stack trace above, it is hard to tell that max_errors was reached and that this is why the run exited. You can work it out, though, by looking at custodian.py line 323.
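To illustrate the kind of message that would make this obvious, here is a minimal sketch of a retry loop that raises a dedicated exception naming the limit and the errors seen. This is not custodian's actual code: `MaxErrorsReached`, `run_with_corrections`, and the `attempt` callable are hypothetical names, and counting one error per failed attempt is an assumption.

```python
class MaxErrorsReached(RuntimeError):
    """Raised when the retry loop gives up because the error limit was hit."""

    def __init__(self, max_errors, history):
        self.max_errors = max_errors
        self.history = list(history)
        super().__init__(
            f"Stopped after reaching max_errors={max_errors}. "
            f"Errors seen, in order: {', '.join(self.history)}"
        )


def run_with_corrections(attempt, max_errors=5):
    """Call attempt() until it reports no errors or the limit is reached.

    attempt() is a hypothetical callable that runs the job once and returns
    a list of detected error names (an empty list means success).
    """
    history = []
    while True:
        errors = attempt()
        if not errors:
            return history
        # Assumption: each failed attempt counts as one error.
        history.append("+".join(errors))
        if len(history) >= max_errors:
            raise MaxErrorsReached(max_errors, history)
```

With an exception like this, the last line of the stack trace would state both the limit and the sequence of errors, so there would be no need to read the source to learn why the run stopped.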
Files
The run is located in:
/projects/ps-matqm/prod_runs/block_2016-12-20-23-00-16-536064/launcher_2016-12-20-23-00-35-095234/launcher_2016-12-21-09-28-04-031609
Suggested solution (if known)