-
Notifications
You must be signed in to change notification settings - Fork 304
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
HPCC-31009 Ensure Thor startup errors are properly reported. #18143
HPCC-31009 Ensure Thor startup errors are properly reported. #18143
Conversation
https://track.hpccsystems.com/browse/HPCC-31009 |
8ec1c7c
to
b3ec998
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are other similar failures possible in connect() which also returns false rather than throwing an exception (e.g. lines 390, 397, 432 etc). It is not clear why an exception is appropriate in line 290 and not in the case of the other failures in that function.
This also closes a few paths in k8s where an early error, e.g. an error during worker registration, would not be correctly reported back to the workunit, and cause the agent to be unaware of the failure and deadlock. Signed-off-by: Jake Smith <[email protected]>
b3ec998
to
a4f279c
Compare
Agreed. NB: quite a few of the changes are due to indentation changes only. |
@@ -14269,16 +14269,16 @@ void executeThorGraph(const char * graphName, IConstWorkUnit &workunit, const IP | |||
unsigned runningTimeLimit = workunit.getDebugValueInt("maxRunTime", 0); | |||
runningTimeLimit = runningTimeLimit ? runningTimeLimit : INFINITE; | |||
|
|||
std::list<WUState> expectedStates = { WUStateRunning, WUStateWait }; | |||
std::list<WUState> expectedStates = { WUStateRunning, WUStateWait, WUStateFailed }; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NB: not absolutely required in this commit, but related. This code should have handled the workunit having put in the failed state before. The only downside before was the production of an extra spurious "Query failed, state: ..." exception
@@ -429,11 +423,7 @@ class CRegistryServer : public CSimpleInterface | |||
{ | |||
Owned<IException> e = deserializeException(msg); | |||
EXCLOG(e, "Registration error"); | |||
if (TE_FailedToRegisterSlave == e->errorCode()) | |||
{ | |||
setExitCode(0); // to avoid thor auto-recycling |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NB: this was not needed before, the deliberate suppression of Thor restarts in bare-metal at this point, happens [deliberatly] because it hasn't got as far as creating the sentinel file (see init_thor script for non-restart when sentinel missing)
https://track.hpccsystems.com/browse/HPCC-31009 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jakesmith a couple of questions. I will merge as is.
{ | ||
// bare-metal - check health of dafilesrv's on the Thor cluster. | ||
if (globals->getPropBool("@replicateOutputs")&&globals->getPropBool("@validateDAFS",true)&&!checkClusterRelicateDAFS(queryNodeGroup())) | ||
{ | ||
FLLOG(MCoperatorError, thorJob, "ERROR: Validate failure(s) detected, exiting Thor"); | ||
return globals->getPropBool("@validateDAFSretCode"); // default is no recycle! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would be clearer if this explicitly returned an integer error code. E.g.
return globals->getPropBool("@validateDAFSretCode") ? TEC_DAFSdown : 0);
registry.clear(); | ||
// NB: workunit/graphName only set in one-shot mode (if isCloud()) | ||
thorMain(logHandler, workunit, graphName); | ||
LOG(MCauditInfo, ",Progress,Thor,Terminate,%s,%s,%s",thorname,nodeGroup.str(),queueName.str()); | ||
LOG(MCdebugProgress, thorJob, "ThorMaster terminated OK"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does it make any difference that this is now before the destruction of registry?
c9afd94
into
hpcc-systems:candidate-9.4.x
This also closes a few paths in k8s where an early error,
e.g. an error during worker registration, would not be
correctly reported back to the workunit, and cause the
agent to be unaware of the failure and deadlock.
Signed-off-by: Jake Smith [email protected]
Type of change:
Checklist:
Smoketest:
Testing: