[BUG] Single-node OpenSearch crashes when connecting to an extension on a remote node #761
Comments
Hey @dbwiddis I just noticed in local testing that when I remove It used to be that when
Recently when I've been testing I've been seeing different output though. Now I see:
I'm not sure why.
Thanks @cwperks. I was trying to read up on the difference between these two addresses [1]. I'm curious how OpenSearch/netty translates this.
[1] https://www.baeldung.com/linux/difference-ip-address#meaning-of-0000-in-different-contexts
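For anyone following along, here is a minimal sketch of the distinction the linked article draws. It assumes the two addresses in question are the wildcard address 0.0.0.0 and the loopback address 127.0.0.1 (the article's anchor suggests this, but the original snippet is not visible here). The class name `AddressKinds` is hypothetical; only standard `java.net.InetAddress` calls are used, and this says nothing about how OpenSearch or netty handle the addresses internally.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

// Hypothetical helper, for illustration only.
final class AddressKinds {
    /** Classifies a literal IP address. Literals are parsed without a DNS lookup. */
    static String describe(String literal) {
        try {
            InetAddress addr = InetAddress.getByName(literal);
            if (addr.isAnyLocalAddress()) {
                return "wildcard"; // meaningful when binding ("all interfaces"), not as a destination
            }
            if (addr.isLoopbackAddress()) {
                return "loopback"; // only reachable from the same host
            }
            return "other";
        } catch (UnknownHostException e) {
            throw new IllegalArgumentException("not a valid address: " + literal, e);
        }
    }

    public static void main(String[] args) {
        for (String literal : new String[] {"0.0.0.0", "127.0.0.1", "10.0.0.5"}) {
            System.out.println(literal + " -> " + describe(literal));
        }
    }
}
```

Neither a wildcard nor a loopback address is something a process on another machine can connect back to, which is why it matters which one ends up being advertised.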
@dbwiddis I'm trying to play with this, I see every I didn't try this on virtual instances; I tried it locally, but I believe this should work. My diff:
Logs:
This post is informative.
If you remove it, those can be set individually, so you're seeing the
Closing as the issues associated with this have been either resolved or included in #782.
What is the bug?
The OpenSearch ExtensionsManager / TransportService does not know its own IP or hostname when initializing extensions.
We are passing TransportService.getLocalNode() to the extension. This works fine in local testing on "localhost" and effectively passes the listening Transport port (9300), but the "local node" is not intended for this purpose; it just creates a shortcut for Transport calls to the local node. So an extension thinks OpenSearch lives here:
When initializing an extension, it receives the incoming transport request just fine and can respond in the same transport channel, but attempting to send additional transport requests back (like registering REST requests, settings, etc.) fails. And after some recent changes to support SSL, the resulting exception actually crashes OpenSearch. The crashing can be fixed. The underlying failure to communicate is a much gnarlier problem.
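To illustrate why replying on the existing channel can work while opening a new connection does not, here is a rough sketch using plain Java sockets on loopback. It is not OpenSearch transport code, and `ChannelPeerDemo` and its method are hypothetical names. The assumption being illustrated: an accepted connection tells the acceptor the initiator's ephemeral source port, not the port the initiator listens on, so the listening address has to be advertised separately and has to be reachable.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;

// Hypothetical demo class; plain sockets stand in for transport channels.
final class ChannelPeerDemo {
    /** Returns {source port seen on the accepted channel, initiator's listening port}. */
    static int[] observePorts() {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        try (ServerSocket initiatorListener = new ServerSocket(0, 1, loopback); // stands in for port 9300
             ServerSocket extension = new ServerSocket(0, 1, loopback);         // stands in for the extension
             Socket outbound = new Socket(loopback, extension.getLocalPort());  // initiator connects out
             Socket accepted = extension.accept()) {                            // extension's view of the channel
            return new int[] {accepted.getPort(), initiatorListener.getLocalPort()};
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        int[] ports = observePorts();
        System.out.println("port seen on accepted channel: " + ports[0]);
        System.out.println("initiator's listening port:    " + ports[1]);
    }
}
```

The two ports differ, so the accepted channel alone does not reveal where to send a fresh request; the extension depends on whatever address it is handed during initialization.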
Getting the IP is not trivial:
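One reason, sketched here as an illustration rather than a reconstruction of the original list: a host usually has several network interfaces, so there is no single "my IP" to ask for. The snippet below enumerates non-loopback IPv4 addresses with the standard `java.net.NetworkInterface` API. The class name `CandidateAddresses` is hypothetical, and how OpenSearch itself chooses a publish address is not shown.

```java
import java.net.Inet4Address;
import java.net.InetAddress;
import java.net.NetworkInterface;
import java.net.SocketException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Enumeration;
import java.util.List;

// Hypothetical helper, for illustration only.
final class CandidateAddresses {
    /** Lists IPv4 addresses on interfaces that are up and not loopback. */
    static List<InetAddress> nonLoopbackIpv4() {
        List<InetAddress> result = new ArrayList<>();
        try {
            Enumeration<NetworkInterface> nics = NetworkInterface.getNetworkInterfaces();
            if (nics == null) {
                return result; // defensive: some runtimes report "no interfaces" as null
            }
            for (NetworkInterface nic : Collections.list(nics)) {
                if (!nic.isUp() || nic.isLoopback()) {
                    continue;
                }
                for (InetAddress addr : Collections.list(nic.getInetAddresses())) {
                    if (addr instanceof Inet4Address && !addr.isLoopbackAddress()) {
                        result.add(addr);
                    }
                }
            }
        } catch (SocketException e) {
            throw new IllegalStateException("could not enumerate network interfaces", e);
        }
        return result;
    }

    public static void main(String[] args) {
        // May print zero, one, or many addresses depending on the host.
        nonLoopbackIpv4().forEach(addr -> System.out.println(addr.getHostAddress()));
    }
}
```

On a machine with Docker bridges, VPNs, or multiple NICs this typically returns more than one candidate, and nothing in the list says which one a remote extension can actually reach.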
Further complicating matters, using the opensearch-cluster-cdk to set up the OpenSearch cluster in a "standard" environment produces a single hostname which resolves to the load balancer, and doesn't actually connect to the same node that is starting up the extension.
We could hard-code the IP address, but it is dynamic, created during cluster startup, so the extension doesn't know it, and if we have multiple cluster manager nodes we aren't exactly sure of it either. Yet the extension must be started up before OpenSearch so it can be connected to (see #729). I'd also need to make sure the extension is running in the same private subnet to even access it (there is possibly a way I can hack around this for performance testing, but that is clearly not a good solution).
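A quick way to see the load balancer problem from the node's side is to check whether a hostname resolves to an address that belongs to one of this host's own interfaces. The sketch below does that with standard `java.net` calls; `HostnameCheck` and `resolvesToThisHost` are hypothetical names, and the addresses in `main` are placeholders rather than values from the actual cluster.

```java
import java.net.InetAddress;
import java.net.NetworkInterface;
import java.net.SocketException;
import java.net.UnknownHostException;

// Hypothetical helper, for illustration only.
final class HostnameCheck {
    /** True if any address the name resolves to is assigned to a local interface. */
    static boolean resolvesToThisHost(String host) {
        try {
            for (InetAddress addr : InetAddress.getAllByName(host)) {
                if (NetworkInterface.getByInetAddress(addr) != null) {
                    return true;
                }
            }
            return false;
        } catch (UnknownHostException | SocketException e) {
            return false; // unresolvable or uninspectable: treat as "not this host"
        }
    }

    public static void main(String[] args) {
        System.out.println("127.0.0.1 -> " + resolvesToThisHost("127.0.0.1"));
        // 192.0.2.1 is a documentation-only address (TEST-NET-1), used as a stand-in
        // for a hostname that resolves somewhere else, such as a load balancer.
        System.out.println("192.0.2.1 -> " + resolvesToThisHost("192.0.2.1"));
    }
}
```

A hostname that fronts a load balancer would be expected to fail this check on every node behind it, which matches the symptom described above: the name is valid, but it does not identify the node that initialized the extension.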
How can one reproduce the bug?
./gradlew HelloWorld works.)
./gradlew run works.)
What is the expected behavior?
OpenSearch initializes the extension and the extension knows where to send transport requests back to the OpenSearch node.
What is your host/environment?
Mixed/not relevant, other than the fact that the cluster is brought up dynamically in EC2, and its IP is not known in advance and is private behind a load balancer.
Do you have any additional context?
This is somewhat of a hard blocker for performance testing (#725). I've been struggling with other issues that have finally gotten resolved and this is literally the "last mile" to get everything working.
I'm going to continue to pursue workarounds to get performance testing going, but they will just be hacks; this is a real (and hard) problem that needs to be solved and I don't have clarity on how to solve it properly at this point.
Need some ideas from some experts here.
@dblock @reta @nknize @saratvemulapalli @andrross @kartg