non-ideal error message for max_errors hit #43

Open
computron opened this issue Dec 21, 2016 · 6 comments

@computron
Member

System

  • master branch (including latest fix to returncode), Py27, Linux

Summary

  • In my specific run, the same error (EDDRMM) comes up twice in a row.
  • The second time it happens, custodian says that it has hit max_errors (I think this part is OK) and raises a custodian error to exit.
  • This results in a non-zero return code; the return code check then intercepts that message and uses it to raise a "nonzero returncode error".
  • Thus the final "nonzero return code" message is less helpful for debugging runs than the original "max errors hit" text.

Error message

The stack trace I get back is:

Traceback (most recent call last):
  File "/projects/matqm/matmethods_env/codes/fireworks/fireworks/core/rocket.py", line 224, in run
    m_action = t.run_task(my_spec)
  File "/projects/matqm/matmethods_env/codes/atomate/atomate/vasp/firetasks/run_calc.py", line 167, in run_task
    c.run()
  File "/projects/matqm/matmethods_env/codes/custodian/custodian/custodian.py", line 323, in run
    .format(self.total_errors, ex))
RuntimeError: 1 errors reached: (CustodianError(...), u'Job return code is 1. Terminating...'). Exited...

From the trace above, it is difficult to tell that max_errors was reached and that this is why we are exiting. You can figure it out, though, if you look at custodian.py line 323.
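
The masking described here can be reproduced in a toy setting. The sketch below is not custodian's code: `MaxErrorsReached`, `NonZeroReturnCode`, `run_with_budget`, and `describe` are hypothetical names invented for illustration. It shows one way a driver loop could raise a distinct exception when the error budget is exhausted, so that the final message names max_errors instead of only the return code.

```python
class MaxErrorsReached(RuntimeError):
    """Hypothetical: raised when the error budget is exhausted."""


class NonZeroReturnCode(RuntimeError):
    """Hypothetical: raised when attempts run out on a failing return code."""


def run_with_budget(return_codes, max_errors):
    """Toy driver loop; return_codes holds one job return code per attempt."""
    total_errors = 0
    last_code = None
    for attempt, code in enumerate(return_codes, start=1):
        last_code = code
        if code == 0:
            return "completed after {} attempt(s)".format(attempt)
        total_errors += 1
        if total_errors >= max_errors:
            # Name the real cause (the budget); keep the return code as detail.
            raise MaxErrorsReached(
                "max_errors={} reached on attempt {}; last return code was {}"
                .format(max_errors, attempt, code))
    raise NonZeroReturnCode(
        "no attempts left; last return code was {}".format(last_code))


def describe(return_codes, max_errors):
    """Run the toy loop and report which exception type, if any, ended it."""
    try:
        return run_with_budget(return_codes, max_errors)
    except (MaxErrorsReached, NonZeroReturnCode) as exc:
        return "{}: {}".format(type(exc).__name__, exc)


print(describe([1, 1], max_errors=2))
print(describe([1, 0], max_errors=5))
```

With separate exception types, a caller (or a log reader) can tell "budget exhausted" apart from "job returned non-zero" without opening the source at the raising line.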

Files

The run is located in:
/projects/ps-matqm/prod_runs/block_2016-12-20-23-00-16-536064/launcher_2016-12-20-23-00-35-095234/launcher_2016-12-21-09-28-04-031609

Suggested solution (if known)

  • Actually, at first glance I am not even sure why this is happening. As far as I can tell, when line 323 throws an exception, the line of code about return code validation should never even run.

@shyuep
Member

shyuep commented Dec 21, 2016

What happens if you set max_errors to a larger number and terminate_on_nonzero_returncode to False? I need to know why this happens. Is your max_errors == 2?

@computron
Member Author

The max_errors should be 5, and there are other jobs that completed after 3 errors; e.g., see:

/projects/ps-matqm/prod_runs/block_2016-10-21-19-00-21-067631/launcher_2016-12-05-17-18-15-093034/launcher_2016-12-15-12-48-26-702546

for an example of a run with the same infrastructure, but completed successfully after 3 errors.

In this case, I think it is stopping at 2 errors because the same error is repeated, and custodian is smart enough to stop trying the same fix again and again 5 times.

@shyuep
Member

shyuep commented Dec 21, 2016

I tried looking at the code, but for EDDRMM errors there is no "repeated" check, unlike for other errors. In fact, EDDRMM errors always result in a corrective action being returned. The vasp.out seems to be untouched, even though the INCAR ALGO has changed. If I have to speculate, the second time round VASP didn't run at all and immediately exited, which resulted in the

@computron
Member Author

Note: to answer @xhqu1981's question (which I somehow don't see here):

There are both a std_error.txt and a std_error.txt.gz. The former is empty. The latter looks like this:

forrtl: error (78): process killed (SIGTERM)
Image PC Routine Line Source
vasp.std 0000000001791329 Unknown Unknown Unknown
vasp.std 000000000178FBFE Unknown Unknown Unknown
vasp.std 0000000001715FA2 Unknown Unknown Unknown
vasp.std 00000000016C30B3 Unknown Unknown Unknown
vasp.std 00000000016C8D79 Unknown Unknown Unknown
libpthread.so.0 000000353120F7E0 Unknown Unknown Unknown
libmpi.so.1 00002B0D71548A84 Unknown Unknown Unknown
libopen-pal.so.6 00002B0D71D6E45B Unknown Unknown Unknown
libmpi.so.1 00002B0D714CACB1 Unknown Unknown Unknown
libmpi.so.1 00002B0D715C90AE Unknown Unknown Unknown
libmpi.so.1 00002B0D715CF6D2 Unknown Unknown Unknown
libmpi.so.1 00002B0D714DFD6F Unknown Unknown Unknown
libmpi_mpifh.so.2 00002B0D7122E4EA Unknown Unknown Unknown
vasp.std 0000000000416628 Unknown Unknown Unknown
vasp.std 000000000056F71A Unknown Unknown Unknown
vasp.std 000000000057ABC9 Unknown Unknown Unknown
vasp.std 0000000000DD1B80 Unknown Unknown Unknown
vasp.std 0000000000E54EBB Unknown Unknown Unknown
vasp.std 000000000152BC27 Unknown Unknown Unknown
vasp.std 0000000000411FF6 Unknown Unknown Unknown
libc.so.6 000000353061ED5D Unknown Unknown Unknown
vasp.std 0000000000411EE9 Unknown Unknown Unknown
forrtl: error (78): process killed (SIGTERM)
Image PC Routine Line Source
vasp.std 0000000001791329 Unknown Unknown Unknown
vasp.std 000000000178FBFE Unknown Unknown Unknown
vasp.std 0000000001715FA2 Unknown Unknown Unknown
vasp.std 00000000016C30B3 Unknown Unknown Unknown
vasp.std 00000000016C8D79 Unknown Unknown Unknown
libpthread.so.0 000000353120F7E0 Unknown Unknown Unknown
libmpi.so.1 00002B8CE4DFAA94 Unknown Unknown Unknown
libopen-pal.so.6 00002B8CE562045B Unknown Unknown Unknown
libmpi.so.1 00002B8CE4D7CCB1 Unknown Unknown Unknown
libmpi.so.1 00002B8CE4E7B0AE Unknown Unknown Unknown
libmpi.so.1 00002B8CE4E816D2 Unknown Unknown Unknown
libmpi.so.1 00002B8CE4D91D6F Unknown Unknown Unknown
libmpi_mpifh.so.2 00002B8CE4AE04EA Unknown Unknown Unknown
vasp.std 0000000000416628 Unknown Unknown Unknown
vasp.std 000000000056F71A Unknown Unknown Unknown
vasp.std 000000000057ABC9 Unknown Unknown Unknown
vasp.std 0000000000DD1B80 Unknown Unknown Unknown
vasp.std 0000000000E54EBB Unknown Unknown Unknown
vasp.std 000000000152BC27 Unknown Unknown Unknown
vasp.std 0000000000411FF6 Unknown Unknown Unknown
libc.so.6 000000353061ED5D Unknown Unknown Unknown
vasp.std 0000000000411EE9 Unknown Unknown Unknown
forrtl: error (78): process killed (SIGTERM)
Image PC Routine Line Source
vasp.std 0000000001791329 Unknown Unknown Unknown
vasp.std 000000000178FBFE Unknown Unknown Unknown
vasp.std 0000000001715FA2 Unknown Unknown Unknown
vasp.std 00000000016C30B3 Unknown Unknown Unknown
vasp.std 00000000016C8D79 Unknown Unknown Unknown
libpthread.so.0 000000353120F7E0 Unknown Unknown Unknown
libmkl_avx.so 00002ACD81829C54 Unknown Unknown Unknown
libmkl_avx.so 00002ACD8184AC6A Unknown Unknown Unknown
libmkl_avx.so 00002ACD818230D4 Unknown Unknown Unknown
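
A log like the one above is easier to read once it is summarized per crash block. The following is a minimal sketch, assuming the plain-text layout shown here (a `forrtl:` header, an `Image PC Routine Line Source` column line, then one frame per line); `summarize_forrtl` is a hypothetical helper, not part of custodian or VASP.

```python
from collections import Counter


def summarize_forrtl(text):
    """Group frames under each 'forrtl:' header and count frames per image."""
    blocks = []
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("forrtl:"):
            # A new crash report starts; one is printed per killed process.
            blocks.append({"error": line, "frames": Counter()})
        elif blocks and line and not line.startswith("Image"):
            # The first column is the binary or shared library name.
            blocks[-1]["frames"][line.split()[0]] += 1
    return blocks


sample = "\n".join([
    "forrtl: error (78): process killed (SIGTERM)",
    "Image PC Routine Line Source",
    "vasp.std 0000000001791329 Unknown Unknown Unknown",
    "libmpi.so.1 00002B0D71548A84 Unknown Unknown Unknown",
    "forrtl: error (78): process killed (SIGTERM)",
    "Image PC Routine Line Source",
    "vasp.std 0000000001791329 Unknown Unknown Unknown",
])

for block in summarize_forrtl(sample):
    print(block["error"], dict(block["frames"]))
```

The number of blocks gives the number of crash reports in the file, and the per-image counts show which library each process was in when it was terminated.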

@xhqu1981
Contributor

Thanks a lot, @computron. I was wondering whether it was a similar issue to the one in my test. After reading your report carefully again, I noticed that your platform is Linux, which is not the OS expected to have that issue. As a result, I withdrew the comment yesterday.

@xhqu1981
Contributor

To avoid confusing other people, I am duplicating my comment here. I was asking whether std_err printed a line:

"srun: error: Unable to create job step: Job/step already completing or completed"

That line would be some evidence that VASP failed to launch.

@computron's current std_err.txt is empty, so I don't think std_err provides any evidence about the status of VASP in this situation. I am sorry this is not a helpful clue.
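
The check discussed in this comment (is the stderr file empty, and does it contain the srun line?) can be scripted for both the plain and the gzipped copy. This is a minimal sketch under stated assumptions: `read_text` and `scan_stderr` are hypothetical helpers, and the file paths are whatever the caller passes in.

```python
import gzip
import os


def read_text(path):
    """Read a plain or gzip-compressed text file, chosen by file extension."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        return handle.read()


def scan_stderr(paths, marker):
    """Classify each path as missing, empty, 'marker found', or 'no marker'."""
    report = {}
    for path in paths:
        if not os.path.exists(path):
            report[path] = "missing"
            continue
        text = read_text(path)
        if not text.strip():
            report[path] = "empty"
        elif marker in text:
            report[path] = "marker found"
        else:
            report[path] = "no marker"
    return report
```

For example, `scan_stderr(["std_err.txt", "std_err.txt.gz"], "Unable to create job step")` returns a dict keyed by path; an "empty" or "no marker" result means the file gives no evidence either way about whether VASP launched.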
