This repository has been archived by the owner on Jul 22, 2024. It is now read-only.

Node failure results in non-intuitive error for user, please improve error message #748

Open
adambertsch opened this issue Jul 1, 2019 · 1 comment
Labels: PhaseFound: Customer; Sev: 3; Status: Open (open for someone to grab and start working on); Type: Defect

adambertsch commented Jul 1, 2019

Summary:
When a node crashes, the error that CSM returns to LSF shows up in bhist -l output as a message like this:

[DATE]: External Message "csm_allocation_delete returned CSMERR_TIMEOUT" was posted from "root" to message box 0;

Describe the solution you'd like
We need something better, like: External Message "Compute node sierra4105 failed"
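
As a rough illustration of the requested behavior, here is a minimal Python sketch. It assumes the layer that posts the external message has access to the list of nodes that failed to respond; the function name and its arguments are hypothetical and are not part of the CSM or LSF APIs.

```python
def format_external_message(error_code, failed_nodes):
    """Build a user-facing message (hypothetical helper, not a CSM API).

    error_code:   the CSM error name as a string, e.g. "CSMERR_TIMEOUT"
    failed_nodes: hostnames that did not respond (assumed to be available)
    """
    if error_code == "CSMERR_TIMEOUT" and failed_nodes:
        # Name the failed node(s) instead of exposing the raw error code.
        noun = "node" if len(failed_nodes) == 1 else "nodes"
        return f"Compute {noun} {', '.join(sorted(failed_nodes))} failed"
    # Without node information, fall back to the message users see today.
    return f"csm_allocation_delete returned {error_code}"


print(format_external_message("CSMERR_TIMEOUT", ["sierra4105"]))
```

When no node list is available, the sketch falls back to the current raw message, so it would never hide information that users already get.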

Issue Source:
We have been getting a lot of complaints from users about this message, and it requires a lot of manual work by our staff to provide them a meaningful answer. If the big data solution were really working, it would at least remove the manual work related to this problem... but really, the user-visible message should be better, calling out a node failure, and ideally which node. Our other scheduler has done this for 10 years.

mew2057 commented Jul 8, 2019

I believe that CSM should be generating an error message with the relevant data for bad nodes. CSMERR_TIMEOUT should just be a general error path for when nodes fail to respond. The message definitely should have more illustrative data.

@fpizzano We should talk to someone on the LSF team to bubble this up.

mew2057 added the PhaseFound: Customer, Sev: 3, Status: Open, and Type: Defect labels on Jul 8, 2019