HDDS-3498. Shutdown datanode if address is already in use #7256

Merged
merged 17 commits into from
Oct 4, 2024
Commits
7d3ccfe
HDDS-3498. Shutdown datanode if address is already in use
Daniilchik Oct 2, 2024
c9e88a2
HDDS-3498. Shutdown datanode if address is already in use
Daniilchik Oct 2, 2024
9fd9d99
HDDS-3498. Shutdown datanode if address is already in use
Daniilchik Oct 2, 2024
20f4470
HDDS-3498. Shutdown datanode if address is already in use
Daniilchik Oct 2, 2024
4d16aa7
HDDS-3498. Shutdown datanode if address is already in use
Daniilchik Oct 2, 2024
71588eb
HDDS-3498. Shutdown datanode if address is already in use
Daniilchik Oct 2, 2024
58741b5
HDDS-3498. Shutdown datanode if address is already in use
Daniilchik Oct 2, 2024
6017207
HDDS-3498. Shutdown datanode if address is already in use
Daniilchik Oct 2, 2024
29400e5
HDDS-3498. Shutdown datanode if address is already in use
Daniilchik Oct 2, 2024
164f1c0
Merge pull request #2 from Daniilchik/HDDS-3498-1
Daniilchik Oct 2, 2024
83fbbbe
HDDS-3498. Shutdown datanode if address is already in use
Daniilchik Oct 2, 2024
4a46f4d
HDDS-3498. Shutdown datanode if address is already in use
Daniilchik Oct 3, 2024
4b4a582
HDDS-3498. Shutdown datanode if address is already in use
Daniilchik Oct 3, 2024
6253cb1
HDDS-3498. Shutdown datanode if address is already in use
Daniilchik Oct 3, 2024
d9536ab
HDDS-3498. Shutdown datanode if address is already in use
Daniilchik Oct 3, 2024
3bb033d
HDDS-3498. Shutdown datanode if address is already in use
Daniilchik Oct 4, 2024
0ac758e
HDDS-3498. Shutdown datanode if address is already in use
Daniilchik Oct 4, 2024
@@ -234,12 +234,17 @@ public void logIfNeeded(Exception ex) {
     }

     if (missCounter == 0) {
+      long missedDurationSeconds = TimeUnit.MILLISECONDS.toSeconds(
+          this.getMissedCount() * getScmHeartbeatInterval(this.conf)
+      );
       LOG.warn(
-          "Unable to communicate to {} server at {} for past {} seconds.",
-          serverName,
-          getAddress().getHostString() + ":" + getAddress().getPort(),
-          TimeUnit.MILLISECONDS.toSeconds(this.getMissedCount() *
-                  getScmHeartbeatInterval(this.conf)), ex);
+          "Unable to communicate to {} server at {}:{} for past {} seconds.",
+          serverName,
+          address.getAddress(),
+          address.getPort(),
+          missedDurationSeconds,
+          ex
+      );
     }

     if (LOG.isTraceEnabled()) {
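
The new log call builds the endpoint as `{}:{}` from `address.getAddress()` and `address.getPort()`, where the old call used `getHostString()`. Assuming `address` is a `java.net.InetSocketAddress` (its declaration is not shown in the diff), the two accessors can print differently. The sketch below uses a hypothetical class name and made-up host and port values to show the difference:

```java
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.UnknownHostException;

// Hypothetical demo class, not part of the PR.
final class EndpointAddressDemo {

  // Mirrors the old format: host string plus port.
  static String oldStyle(InetSocketAddress address) {
    return address.getHostString() + ":" + address.getPort();
  }

  // Mirrors the new format: InetAddress.toString() plus port.
  static String newStyle(InetSocketAddress address) {
    return address.getAddress() + ":" + address.getPort();
  }

  public static void main(String[] args) throws UnknownHostException {
    // Built from raw bytes, so no DNS lookup is involved.
    InetAddress loopback = InetAddress.getByAddress(new byte[] {127, 0, 0, 1});
    InetSocketAddress resolved = new InetSocketAddress(loopback, 9861);
    // An unresolved address keeps the host name but has no InetAddress.
    InetSocketAddress unresolved =
        InetSocketAddress.createUnresolved("scm.example.invalid", 9861);

    System.out.println(oldStyle(resolved));    // 127.0.0.1:9861
    System.out.println(newStyle(resolved));    // /127.0.0.1:9861
    System.out.println(oldStyle(unresolved));  // scm.example.invalid:9861
    System.out.println(newStyle(unresolved));  // null:9861
  }
}
```

`InetAddress.toString()` prints `hostname/literal-IP`, so the new message may carry a leading slash or a `null` host where the old one printed only the host string.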
@@ -17,6 +17,7 @@
 package org.apache.hadoop.ozone.container.common.states.endpoint;

 import java.io.IOException;
+import java.net.BindException;
 import java.util.concurrent.Callable;

 import org.apache.hadoop.hdds.conf.ConfigurationSource;
@@ -104,7 +105,7 @@ public EndpointStateMachine.EndPointStates call() throws Exception {
           LOG.debug("Cannot execute GetVersion task as endpoint state machine " +
               "is in {} state", rpcEndPoint.getState());
         }
-      } catch (DiskOutOfSpaceException ex) {
+      } catch (DiskOutOfSpaceException | BindException ex) {
         rpcEndPoint.setState(EndpointStateMachine.EndPointStates.SHUTDOWN);
       } catch (IOException ex) {
         rpcEndPoint.logIfNeeded(ex);
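
`java.net.BindException` extends `SocketException`, which extends `IOException`, so the widened multi-catch only takes effect because it sits before the generic `IOException` handler. Here is a minimal sketch of that dispatch. The enum, the task interface, and `DiskFullException` are hypothetical stand-ins for the endpoint state machine and `DiskOutOfSpaceException`, not code from the PR:

```java
import java.io.IOException;
import java.net.BindException;

// Hypothetical stand-ins; none of these names come from the PR.
final class CatchOrderDemo {

  enum State { RUNNING, SHUTDOWN, RETRY }

  // Stand-in for a disk-space failure that is also an IOException.
  static final class DiskFullException extends IOException {
    DiskFullException(String message) {
      super(message);
    }
  }

  interface Task {
    void run() throws IOException;
  }

  // Fatal failures are matched first; any other IOException is retryable.
  static State execute(Task task) {
    try {
      task.run();
      return State.RUNNING;
    } catch (DiskFullException | BindException ex) {
      return State.SHUTDOWN;
    } catch (IOException ex) {
      return State.RETRY;
    }
  }

  public static void main(String[] args) {
    System.out.println(execute(() -> { }));
    System.out.println(execute(() -> {
      throw new BindException("Address already in use");
    }));
    System.out.println(execute(() -> {
      throw new IOException("connection reset");
    }));
  }
}
```

Run as written, this prints `RUNNING`, `SHUTDOWN`, and `RETRY`: only the bind and disk failures end in shutdown, while other I/O errors stay on the retry path.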
@@ -19,6 +19,7 @@
 package org.apache.hadoop.ozone.container.common.transport.server;

 import java.io.IOException;
+import java.net.BindException;
 import java.util.Collections;
 import java.util.List;
 import java.util.UUID;
@@ -185,7 +186,16 @@ public HddsProtos.ReplicationType getServerType() {
   @Override
   public void start() throws IOException {
     if (!isStarted) {
-      server.start();
+      try {
+        server.start();
+      } catch (IOException e) {
+        LOG.error("Error while starting the server", e);
+        if (e.getMessage().contains("Failed to bind to address")) {
@ivanzlenko (Contributor), Oct 2, 2024:
Do we have any actual reason to start a service if we can't start a server? Shouldn't we just handle every IOException that way?

Contributor Author:
If we handle every IOException like that, we will end up shutting down datanodes on any network failure.

Contributor:
Can we identify what the actual retryable exceptions look like? Does Netty throw only IOExceptions here?

Contributor:
Actually, it is better to just handle the server.start() part correctly and shut everything down here, because we can't start the server. There is no point in staying alive afterwards.

Contributor:
Discussed it with Daniil: we need to return to this code at some point to refactor it.

+          throw new BindException(e.getMessage());
+        } else {
+          throw e;
+        }
+      }
       int realPort = server.getPort();

       if (port == 0) {
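
The merged change detects a bind failure by matching the text "Failed to bind to address" in the exception message. As one possible direction for the refactoring mentioned in the review thread, the check could also walk the cause chain for a `BindException` and tolerate a null message. This is only a sketch: the class and method names are hypothetical, the sample message and port are made up, and it assumes (not confirmed by the diff) that the underlying server may attach the original `BindException` as a cause.

```java
import java.io.IOException;
import java.net.BindException;

// Hypothetical helper, not part of the PR.
final class BindFailures {

  private static final String BIND_MESSAGE = "Failed to bind to address";

  // True if any exception in the cause chain is a BindException or
  // carries the bind-failure text. Null messages are skipped safely.
  static boolean isBindFailure(Throwable error) {
    Throwable current = error;
    // Bounded walk so a malformed cause cycle cannot loop forever.
    for (int depth = 0; current != null && depth < 16; depth++) {
      if (current instanceof BindException) {
        return true;
      }
      String message = current.getMessage();
      if (message != null && message.contains(BIND_MESSAGE)) {
        return true;
      }
      current = current.getCause();
    }
    return false;
  }

  // Rethrows bind failures as BindException and everything else unchanged.
  static void rethrow(IOException e) throws IOException {
    if (isBindFailure(e)) {
      BindException bind = new BindException(String.valueOf(e.getMessage()));
      bind.initCause(e); // keep the original stack trace reachable
      throw bind;
    }
    throw e;
  }

  public static void main(String[] args) {
    IOException byMessage =
        new IOException("Failed to bind to address 0.0.0.0/0.0.0.0:9859");
    IOException byCause = new IOException("start failed", new BindException());
    IOException noMessage = new IOException((String) null);

    System.out.println(isBindFailure(byMessage));  // true
    System.out.println(isBindFailure(byCause));    // true
    System.out.println(isBindFailure(noMessage));  // false
  }
}
```

In `start()`, the `catch (IOException e)` branch could then log the error and call `BindFailures.rethrow(e)`, which keeps the shutdown behavior for bind failures while avoiding a `NullPointerException` when the message is absent.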