[BUG] Single-node OpenSearch crashes when connecting to an extension on a remote node #761

Closed
dbwiddis opened this issue May 17, 2023 · 5 comments
Labels
bug Something isn't working untriaged

Comments

@dbwiddis
Member

dbwiddis commented May 17, 2023

What is the bug?

The OpenSearch ExtensionsManager / TransportService does not know its own IP or hostname when initializing extensions.

We are passing TransportService.getLocalNode() to the extension. This works fine in local testing on "localhost" and effectively passes the listening Transport port (9300), but the "local node" is not intended for this purpose; it just creates a shortcut for Transport calls to the local node. So an extension thinks OpenSearch lives here:

publish_address {127.0.0.1:9300}, bound_addresses {[::1]:9300}, {127.0.0.1:9300}

When an extension is initialized, it receives the incoming transport request just fine and can respond on the same transport channel, but attempting to send additional transport requests back (like registering REST requests, settings, etc.) fails. After some recent changes to support SSL, the resulting exception actually crashes OpenSearch. The crash can be fixed. The underlying failure to communicate is a much gnarlier problem.

[2023-05-16T17:49:39,636][ERROR][o.o.e.ExtensionsManager  ] [runTask-0] Extension initialization failed
org.opensearch.transport.RemoteTransportException: [hello-world][172.31.26.75:4532][internal:discovery/extensions]
Caused by: org.opensearch.transport.ConnectTransportException: [runTask-0][127.0.0.1:9300] connect_exception
...
Caused by: java.io.IOException: Connection refused: 127.0.0.1/127.0.0.1:9300

Getting the IP is not trivial:

  • We see an IP from the transport connection but it has gone through NAT and is useless.
  • Similarly, the OpenSearch node itself doesn't know its public IP address either, just a private IP address.
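
To make the reachability problem concrete, here is a minimal sketch that classifies the addresses appearing in this issue using only standard java.net calls. The class name AddressScope and its Scope enum are hypothetical names for illustration and are not part of OpenSearch or the SDK.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

// Hypothetical helper (not an OpenSearch or SDK class): classifies an IP
// literal by how far away a peer can be and still connect to it.
final class AddressScope {
    enum Scope { WILDCARD, LOOPBACK, LINK_LOCAL, PRIVATE, PUBLIC }

    static Scope classify(String ipLiteral) {
        final InetAddress addr;
        try {
            // IP literals are parsed directly, without a DNS lookup.
            addr = InetAddress.getByName(ipLiteral);
        } catch (UnknownHostException e) {
            throw new IllegalArgumentException("Not a usable address: " + ipLiteral, e);
        }
        if (addr.isAnyLocalAddress()) return Scope.WILDCARD;   // 0.0.0.0 or ::
        if (addr.isLoopbackAddress()) return Scope.LOOPBACK;   // 127.0.0.0/8 or ::1
        if (addr.isLinkLocalAddress()) return Scope.LINK_LOCAL;
        if (addr.isSiteLocalAddress()) return Scope.PRIVATE;   // 10/8, 172.16/12, 192.168/16
        return Scope.PUBLIC;
    }

    // A loopback or wildcard address is meaningless to a peer on another host.
    static boolean usableFromAnotherHost(String ipLiteral) {
        Scope scope = classify(ipLiteral);
        return scope == Scope.PRIVATE || scope == Scope.PUBLIC;
    }

    public static void main(String[] args) {
        // Addresses taken from the logs in this issue.
        for (String ip : new String[] { "127.0.0.1", "172.31.26.75", "10.25.70.25" }) {
            System.out.println(ip + " -> " + classify(ip));
        }
    }
}
```

A PRIVATE result only means the address can work inside the same subnet or VPC; it says nothing about reachability through NAT or a load balancer, which is the harder part of this issue.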

Further complicating matters, using the opensearch-cluster-cdk to set up the OpenSearch cluster in a "standard" environment produces a single hostname which resolves to the load balancer, and doesn't actually connect to the same extension node that is starting up the extension.

We could hard-code the IP address, but it is dynamic: it is assigned during cluster startup, so the extension doesn't know it in advance, and with multiple cluster manager nodes we aren't exactly sure of it either. Meanwhile, the extension must be started before OpenSearch so that OpenSearch can connect to it (see #729). I'd also need to make sure the extension is running in the same private subnet to even reach the node (there may be a way to hack around this for performance testing, but it is clearly not a good solution).

How can one reproduce the bug?

  1. Start up an extension on an EC2 node. (Cloning the SDK and running ./gradlew HelloWorld works.)
  2. Start up a single-node OpenSearch cluster anywhere else. (Cloning OpenSearch and running ./gradlew run works.)

What is the expected behavior?

OpenSearch initializes the extension and the extension knows where to send transport requests back to the OpenSearch node.

What is your host/environment?

Mixed/not relevant, other than the fact that the cluster is brought up dynamically in EC2, so its IP is not known in advance and is private, behind a load balancer.

Do you have any additional context?

This is somewhat of a hard blocker for performance testing (#725). I've been struggling with other issues that have finally gotten resolved and this is literally the "last mile" to get everything working.

I'm going to continue to pursue workarounds to get performance testing going, but they will just be hacks; this is a real (and hard) problem that needs to be solved and I don't have clarity on how to solve it properly at this point.

Need some ideas from some experts here.

@dblock @reta @nknize @saratvemulapalli @andrross @kartg

@cwperks
Member

cwperks commented May 22, 2023

Hey @dbwiddis I just noticed in local testing that when I remove network.host: 0.0.0.0 it publishes the address I expect for the extension to connect back with the OpenSearch node.

It used to be that when network.host: 0.0.0.0 was present in my opensearch.yml file that I would see a line in the output like this after running ./bin/opensearch:

[2023-05-22T16:04:25,588][INFO ][o.o.t.TransportService   ] [smoketestnode] publish_address {127.0.0.1:9300}, bound_addresses {[::1]:9300}, {127.0.0.1:9300}

Recently when I've been testing I've been seeing different output though. Now I see:

[2023-05-22T14:54:36,903][INFO ][o.o.t.TransportService   ] [smoketestnode] publish_address {10.25.70.25:9300}, bound_addresses {[::]:9300}

I'm not sure why.

@saratvemulapalli
Member

Thanks @cwperks. I was trying to read up on the difference between these two addresses [1].
It looks like 0.0.0.0 allows listening on all network interfaces present, whereas 127.0.0.1 listens only on the loopback interface.

I'm curious how OpenSearch/netty translates this.

[1] https://www.baeldung.com/linux/difference-ip-address#meaning-of-0000-in-different-contexts
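
The difference between the two bind addresses can be observed with a plain java.net.ServerSocket. This is only a sketch of the general socket behavior, not of how OpenSearch or netty bind internally; the class name BindDemo is hypothetical, and port 0 asks the OS for any free port.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;

// Hypothetical demo class: binds a listening socket to the given host on an
// ephemeral port and reports the address the socket is actually bound to.
final class BindDemo {
    static InetAddress boundAddress(String bindHost) {
        // Port 0 lets the OS choose a free port; 50 is the accept backlog.
        try (ServerSocket server = new ServerSocket(0, 50, InetAddress.getByName(bindHost))) {
            return server.getInetAddress();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        InetAddress wildcard = boundAddress("0.0.0.0");
        InetAddress loopback = boundAddress("127.0.0.1");
        // The wildcard socket accepts connections arriving on any interface.
        System.out.println("0.0.0.0 is wildcard: " + wildcard.isAnyLocalAddress());
        // The loopback socket accepts connections from the same host only.
        System.out.println("127.0.0.1 is loopback: " + loopback.isLoopbackAddress());
    }
}
```

A wildcard address is valid for binding but cannot be handed to a peer as a destination, so a node bound to it still has to pick one concrete address to publish.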

@saratvemulapalli
Member

saratvemulapalli commented May 22, 2023

@dbwiddis I'm trying to play with this. I see that every TransportMessage has a remoteAddress, which should let us get the OpenSearch IP and port, though the port is a socket pair internally routed to 9300.

I didn't try this on virtual instances; I tried it locally, but I believe it should work.
Did you already try this one?

My diff:

diff --git a/src/main/java/org/opensearch/sdk/handlers/ExtensionsInitRequestHandler.java b/src/main/java/org/opensearch/sdk/handlers/ExtensionsInitRequestHandler.java
index a49b1e8..c2f2737 100644
--- a/src/main/java/org/opensearch/sdk/handlers/ExtensionsInitRequestHandler.java
+++ b/src/main/java/org/opensearch/sdk/handlers/ExtensionsInitRequestHandler.java
@@ -54,6 +54,9 @@ public class ExtensionsInitRequestHandler {
         SDKTransportService sdkTransportService = extensionsRunner.getSdkTransportService();
         sdkTransportService.setOpensearchNode(extensionInitRequest.getSourceNode());
         sdkTransportService.setUniqueId(extensionInitRequest.getExtension().getId());
+        logger.info("SARAT", extensionInitRequest.remoteAddress());
+        logger.info(extensionInitRequest.remoteAddress().address().toString());
+        logger.info(extensionInitRequest.remoteAddress().getPort());
         // Successfully initialized. Send the response.
         try {
             return new InitializeExtensionResponse(

Logs:

13:59:15.520 [opensearch[hello-world][generic][T#1]] INFO  org.opensearch.sdk.handlers.ExtensionsInitRequestHandler - SARAT
13:59:15.520 [opensearch[hello-world][generic][T#1]] INFO  org.opensearch.sdk.handlers.ExtensionsInitRequestHandler - /127.0.0.1:55023
13:59:15.521 [opensearch[hello-world][generic][T#1]] INFO  org.opensearch.sdk.handlers.ExtensionsInitRequestHandler - 55023
13:59:15.558 [opensearch[hello-world][transport_worker][T#3]] DEBUG org.opensearch.transport.TcpTransport - opened transport connection [1] to [{3c22fb88f11c.ant.amazon.com}{Pi9Pd-HxRfCRt_j_z4DYTQ}{m-gPRNOiRKm-WszG-vioIw}{127.0.0.1}{127.0.0.1:9300}{dimr}{shard_indexing_pressure_enabled=true}] using channels [[Netty4TcpChannel{localAddress=/127.0.0.1:55036, remoteAddress=127.0.0.1/127.0.0.1:9300}, Netty4TcpChannel{localAddress=/127.0.0.1:55033, remoteAddress=127.0.0.1/127.0.0.1:9300}, Netty4TcpChannel{localAddress=/127.0.0.1:55027, remoteAddress=127.0.0.1/127.0.0.1:9300}, Netty4TcpChannel{localAddress=/127.0.0.1:55028, remoteAddress=127.0.0.1/127.0.0.1:9300}, Netty4TcpChannel{localAddress=/127.0.0.1:55031, remoteAddress=127.0.0.1/127.0.0.1:9300}, Netty4TcpChannel{localAddress=/127.0.0.1:55030, remoteAddress=127.0.0.1/127.0.0.1:9300}, Netty4TcpChannel{localAddress=/127.0.0.1:55029, remoteAddress=127.0.0.1/127.0.0.1:9300}, Netty4TcpChannel{localAddress=/127.0.0.1:55037, remoteAddress=127.0.0.1/127.0.0.1:9300}, Netty4TcpChannel{localAddress=/127.0.0.1:55032, remoteAddress=127.0.0.1/127.0.0.1:9300}, Netty4TcpChannel{localAddress=/127.0.0.1:55035, remoteAddress=127.0.0.1/127.0.0.1:9300}, Netty4TcpChannel{localAddress=/127.0.0.1:55026, remoteAddress=127.0.0.1/127.0.0.1:9300}, Netty4TcpChannel{localAddress=/127.0.0.1:55034, remoteAddress=127.0.0.1/127.0.0.1:9300}, Netty4TcpChannel{localAddress=/127.0.0.1:55038, remoteAddress=127.0.0.1/127.0.0.1:9300}]]
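
In the logs above, the remote address carries an ephemeral source port (55023) rather than the listening transport port. A minimal sketch of the idea, assuming the observed IP is actually reachable (which the issue description says is not the case behind NAT), would keep the IP and substitute a known transport port. The class CallbackAddress and the constant ASSUMED_TRANSPORT_PORT are hypothetical and not part of the SDK.

```java
import java.net.InetAddress;
import java.net.InetSocketAddress;

// Hypothetical helper (not an SDK class): derives a call-back address from
// the remote address observed on an incoming connection.
final class CallbackAddress {
    // Assumption: the node listens for transport traffic on the default port.
    static final int ASSUMED_TRANSPORT_PORT = 9300;

    static InetSocketAddress withTransportPort(InetSocketAddress observedRemote, int transportPort) {
        InetAddress ip = observedRemote.getAddress();
        if (ip == null) {
            throw new IllegalArgumentException("Unresolved remote address: " + observedRemote);
        }
        // Keep the observed IP but discard the ephemeral source port.
        return new InetSocketAddress(ip, transportPort);
    }

    public static void main(String[] args) {
        // Mirrors the loopback address and ephemeral port from the log lines.
        InetSocketAddress observed = new InetSocketAddress(InetAddress.getLoopbackAddress(), 55023);
        System.out.println(withTransportPort(observed, ASSUMED_TRANSPORT_PORT));
    }
}
```

This only helps when the source IP seen by the extension is the node's real, routable address; if the connection passed through NAT or a load balancer, the derived address points at the wrong host.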

@dbwiddis
Member Author

> in local testing that when I remove network.host: 0.0.0.0 it publishes the address I expect for the extension to connect back with the OpenSearch node.

This post is informative.

network.host automatically sets network.bind_host and network.publish_host

If you remove it, those can be set individually, so you're seeing the publish_host before it was overridden. And the bind_host can/should be 0.0.0.0.
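
The relationship between the three settings can be modeled in a few lines. This is a sketch, assuming (as the OpenSearch networking documentation describes) that network.bind_host and network.publish_host fall back to network.host when unset and that network.host defaults to _local_. The class NetworkHostSettings is hypothetical and is not how OpenSearch reads its settings internally.

```java
import java.util.Map;

// Hypothetical model of the setting fallback, not OpenSearch code.
final class NetworkHostSettings {
    // Assumption: the documented default for network.host.
    static final String DEFAULT_HOST = "_local_";

    // The specific setting wins; otherwise fall back to network.host, then the default.
    private static String resolve(Map<String, String> settings, String specificKey) {
        String fallback = settings.getOrDefault("network.host", DEFAULT_HOST);
        return settings.getOrDefault(specificKey, fallback);
    }

    static String bindHost(Map<String, String> settings) {
        return resolve(settings, "network.bind_host");
    }

    static String publishHost(Map<String, String> settings) {
        return resolve(settings, "network.publish_host");
    }

    public static void main(String[] args) {
        // Listen on every interface, but advertise one concrete address.
        Map<String, String> split = Map.of(
            "network.bind_host", "0.0.0.0",
            "network.publish_host", "10.25.70.25");
        System.out.println("bind=" + bindHost(split) + " publish=" + publishHost(split));
    }
}
```

Setting the two hosts separately, as in the main method, is the configuration suggested in the comment above: bind to 0.0.0.0 while publishing an address the extension can connect back to.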

@dbwiddis dbwiddis changed the title [BUG] OpenSearch crashes when connecting to an extension on a remote node [BUG] Single-node OpenSearch crashes when connecting to an extension on a remote node Jun 1, 2023
@dbwiddis
Member Author

Closing, as the issues associated with this have been either resolved or included in #782.
