-
Notifications
You must be signed in to change notification settings - Fork 501
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
HDDS-3498. Shutdown datanode if address is already in use #7256
Conversation
@Daniilchik thank you for working on this. |
@sarvekshayr
After
|
"Unable to communicate to {} server at {} for past {} seconds.", | ||
serverName, | ||
getAddress().getHostString() + ":" + getAddress().getPort(), | ||
address.getAddress() + ":" + address.getPort(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please use log4j templates instead of concatenation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
@@ -108,6 +109,10 @@ public EndpointStateMachine.EndPointStates call() throws Exception { | |||
rpcEndPoint.setState(EndpointStateMachine.EndPointStates.SHUTDOWN); | |||
} catch (IOException ex) { | |||
rpcEndPoint.logIfNeeded(ex); | |||
if (ex.getCause() instanceof NativeIoException) { | |||
LOG.error(ex.getMessage()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't we log stacktrace here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @Daniilchik for working on the patch.
A few thoughts:
NativeIoException.class
should not be exposed toVersionEndpointTask
, as it is an internal detail of OzoneContainer. I believeOzoneContainer.start()
should throw a dedicated exception for this scenario, perhaps something likeContainerStartException
.rpcEndPoint.logIfNeeded(ex)
should not be called in the event of a port binding error. This call is responsible for the log message 'Unable to communicate with SCM server at scm/172.20.0.3:9861', but that’s not relevant in this situation.ozoneContainer.stop()
should not be invoked directly. Simply callingrpcEndPoint.setState(EndpointStateMachine.EndPointStates.SHUTDOWN)
is sufficient to initiate the shutdown process.
@myskov thanks for help, all done. |
8c47c15
to
c151608
Compare
@Daniilchik please do not use |
Sorry about that, I won’t do it again. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please update the log output to reflect these changes?
@@ -107,7 +107,8 @@ public EndpointStateMachine.EndPointStates call() throws Exception { | |||
} catch (DiskOutOfSpaceException ex) { | |||
rpcEndPoint.setState(EndpointStateMachine.EndPointStates.SHUTDOWN); | |||
} catch (IOException ex) { | |||
rpcEndPoint.logIfNeeded(ex); | |||
LOG.error(ex.getCause().getMessage(), ex); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LOG.error(ex) should already print message from exception. I would rather put here thoughtful and distinct message so it will be much easier to pinpoint place of the failure and understand when and why it failed.
server.start(); | ||
} catch (IOException e) { | ||
LOG.error("Failed to bind to address", e); | ||
if (e.getMessage().contains("Failed to bind to address")) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we have any actual reason to start a service if we can't start a server? Shouldn't we just handle all IOException like that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we will handle all IOException like that we will end up with shutting down datanodes due to any network failures.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we identify how actual retryable exceptions looks like? Is netty throwing specifically only IOExceptions here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually it is better just handle correctly server.start part and shutdown everything here cause we can't start server. There is no point to live afterwards.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Discussed it with Daniil - we need to return to this code at some point in time to refactor it.
try { | ||
server.start(); | ||
} catch (IOException e) { | ||
LOG.error("Failed to bind to address", e); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Failed to bind to address - you are writing this text for any IOException. What if it was other issue preventing us from starting server?
Message something like "Can't start whatever server due to error occurred during startup process:" will be more appropriate here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
serverName, | ||
address.getAddress(), | ||
address.getPort(), | ||
TimeUnit.MILLISECONDS.toSeconds(this.getMissedCount() * getScmHeartbeatInterval(this.conf)), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would rather calculate it beforehand in a dedicated variable just to improve code readability.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
ERROR server.XceiverServerGrpc: Error while starting the server |
...ntainer-service/src/main/java/org/apache/hadoop/ozone/container/ozoneimpl/BindException.java
Outdated
Show resolved
Hide resolved
hadoop-hdds/server-scm/src/test/java/org/apache/hadoop/ozone/container/common/TestEndPoint.java
Outdated
Show resolved
Hide resolved
...c/main/java/org/apache/hadoop/ozone/container/common/transport/server/XceiverServerGrpc.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The most recent version looks good to me. @ivanzlenko please take a look.
long missedDurationSeconds = TimeUnit.MILLISECONDS.toSeconds( | ||
this.getMissedCount() * getScmHeartbeatInterval(this.conf) | ||
); | ||
LOG.error( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why you changed it to error?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the remark. I’ll change it back. I initially changed it at the beginning of working on the task.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
server.start(); | ||
} catch (IOException e) { | ||
LOG.error("Failed to bind to address", e); | ||
if (e.getMessage().contains("Failed to bind to address")) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Discussed it with Daniil - we need to return to this code at some point in time to refactor it.
@Daniilchik thank you for working on the patch. Thanks @sarvekshayr and @ivanzlenko for reviewing. |
What changes were proposed in this pull request?
This PR fixes a bug related to an incomplete datanode process.
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-3498
How was this patch tested?
Manual tests using docker-compose, build-branch workflow.