LSF-bjobs bug: temporary omission of running jobs in JSON output #133

Open
vinjana opened this issue Dec 19, 2018 · 0 comments

vinjana commented Dec 19, 2018

Apparently under heavy load, LSF's bjobs JSON output sometimes omits job IDs. When a job in the running state is no longer reported, it is classified as more or less finished/exited. However, if the job later reappears in the JSON output, its state is probably not changed back. We observed a job missing from 3 queries within 10 minutes, although it was reported both before and after this interval.

Suggested Solution:

  • Only report jobs as exited (maybe COMPLETED_UNKNOWN) if they are explicitly marked as EXITED or DONE. A lost job always indicates that something is wrong (unless the system is configured to prune exited jobs older than 2 minutes from the list).
  • Make configurable both how long to wait for lost jobs and how long LSF maintains the list (i.e. the time after which jobs are certainly no longer expected to be found in the list).
  • Add a job state like "missing" or "not-reported", or maybe a counter in the "running" state telling when the job was last seen in the list.
  • Warn if a job is lost from the list while it is expected to be visible. This is always an abnormal situation.
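
The suggestions above could be sketched roughly as follows. This is a minimal illustration in Python, not the project's actual code: `JobTracker`, `State`, `TrackedJob`, `TERMINAL_STATUSES` and `max_missing_seconds` are hypothetical names, and the status strings `"RUN"`, `"DONE"` and `"EXIT"` are assumed to be what the parsed bjobs output delivers per job ID. A job that drops out of the list is marked as missing with a warning, keeps its last-seen time, returns to the active state if it reappears, and only becomes `COMPLETED_UNKNOWN` after the configured grace period.

```python
import logging
from dataclasses import dataclass
from enum import Enum

log = logging.getLogger("lsf_job_tracker")

# Assumption: status strings that explicitly mark a job as finished.
TERMINAL_STATUSES = {"DONE", "EXIT"}


class State(Enum):
    ACTIVE = "active"                        # reported and not finished
    MISSING = "missing"                      # expected in the list, but absent
    COMPLETED = "completed"                  # explicitly reported as finished
    COMPLETED_UNKNOWN = "completed_unknown"  # absent longer than the grace period


@dataclass
class TrackedJob:
    state: State
    last_seen: float  # time of the last query that reported the job


class JobTracker:
    def __init__(self, max_missing_seconds=600.0):
        # Configurable grace period before a lost job is given up on.
        self.max_missing_seconds = max_missing_seconds
        self.jobs = {}

    def update(self, reported, now):
        """Merge one query result (job ID -> status string) taken at time `now`."""
        # Reported jobs always take the reported state, so a job that
        # reappears after being missing is switched back.
        for job_id, status in reported.items():
            state = State.COMPLETED if status in TERMINAL_STATUSES else State.ACTIVE
            self.jobs[job_id] = TrackedJob(state, now)

        for job_id, job in self.jobs.items():
            if job_id in reported:
                continue
            if job.state in (State.COMPLETED, State.COMPLETED_UNKNOWN):
                continue
            if job.state is State.ACTIVE:
                log.warning("Job %s disappeared from the job list", job_id)
                job.state = State.MISSING
            if now - job.last_seen > self.max_missing_seconds:
                job.state = State.COMPLETED_UNKNOWN


tracker = JobTracker(max_missing_seconds=600)
tracker.update({"101": "RUN"}, now=0)
tracker.update({}, now=120)               # job 101 absent: marked MISSING
tracker.update({"101": "RUN"}, now=700)   # job 101 reappears: ACTIVE again
```

Passing `now` explicitly instead of reading the clock inside `update` keeps the sketch easy to test; a real implementation would also have to relate the grace period to how long LSF itself keeps finished jobs in the list.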
vinjana added this to the Release 1.0 milestone Feb 4, 2019