[BUG] Continuous SSL exceptions post upgrade from 2.11 to 2.15 #4718
Comments
@blueish-eyez I tried this locally but could not reproduce it. Here is what I did:
|
Looking at the bigger picture, there are plenty of additional factors that could be at play here. It's a production-grade cluster (a sandbox, but with the same principles of access and usage), and there are Java applications writing directly to it. I tried setting up docker-compose and upgrading, but I could not reproduce the problem in that scenario either. I also cannot reproduce it on another regular production-grade cluster that uses a traditional filebeat-logstash-opensearch write pipeline. Is it possible to decode [2024-08-12T16:48:40,662][ERROR][o.o.h.n.s.SecureNetty4HttpServerTransport], specifically the "o.o.h.n.s" bit, as mentioned in https://opensearch.org/docs/latest/install-and-configure/configuring-opensearch/logs/? |
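For reference, the bracketed logger name looks like the output of a Log4j pattern that shortens every package segment to its first letter and keeps the class name whole. Below is a minimal sketch of that shortening. It assumes the full class name is `org.opensearch.http.netty4.ssl.SecureNetty4HttpServerTransport`, which is consistent with the abbreviation but is not confirmed anywhere in this thread, and `abbreviate_logger` is a hypothetical helper, not part of OpenSearch.

```python
def abbreviate_logger(fqcn: str) -> str:
    """Shorten every package segment to its first character, keep the class name.

    Mimics how a logger name such as "o.o.h.n.s.SomeClass" could be produced
    from a fully qualified class name. Illustration only.
    """
    *packages, class_name = fqcn.split(".")
    # Skip empty segments so a malformed name does not raise IndexError.
    short = [segment[0] for segment in packages if segment]
    return ".".join(short + [class_name])


# Assumed full name (not confirmed in the thread).
CANDIDATE = "org.opensearch.http.netty4.ssl.SecureNetty4HttpServerTransport"

if __name__ == "__main__":
    print(abbreviate_logger(CANDIDATE))
```

Going the other way (from `o.o.h.n.s` back to packages) is ambiguous in general, so the practical approach is to shorten candidate class names and compare them with what the log shows.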
The |
Hi. I'm experiencing the same issue. The strange thing is that it happens exactly once every 10 hours, and almost exclusively on the coordinating nodes. It happens even when no indexing or query requests are being sent. I've created and tested clusters with different versions, and the issue does not occur with 2.13.0 and earlier.
I'm not sure, but I suspect it's related to this commit. |
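A periodic pattern like this can be checked directly from the server log. The sketch below assumes log lines start with `[timestamp][LEVEL][logger]` in the format quoted earlier in the thread (for example `[2024-08-12T16:48:40,662][ERROR][o.o.h.n.s.SecureNetty4HttpServerTransport]`). The function name `error_intervals` and the default logger suffix are illustrative choices, not anything OpenSearch provides.

```python
import re
from datetime import datetime, timedelta
from typing import Iterable, List

# Matches "[2024-08-12T16:48:40,662][ERROR][logger.name]" at the start of a line.
LINE = re.compile(
    r"^\[(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2},\d{3})\]\[ERROR\s*\]\[([^\]]+)\]"
)


def error_intervals(
    lines: Iterable[str],
    logger_suffix: str = "SecureNetty4HttpServerTransport",
) -> List[timedelta]:
    """Return the gaps between consecutive ERROR lines from the given logger."""
    stamps = []
    for line in lines:
        match = LINE.match(line)
        if match and match.group(2).strip().endswith(logger_suffix):
            stamps.append(datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%S,%f"))
    # Pair each timestamp with its successor and subtract.
    return [later - earlier for earlier, later in zip(stamps, stamps[1:])]
```

Feeding it the lines of a node's log file (for example `error_intervals(open(path))`, with `path` pointing at your own log) gives a list of gaps; near-identical values would support the "once every 10 hours" observation, while scattered values would suggest random client behavior.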
Hi |
We still need to narrow this down and get an easy repro, maybe @stephen-crawford can help us here? |
@dblock I will also take a look |
@machenity do you have any proxy / gateway in front of the cluster?
@blueish-eyez do you see a periodic pattern as well, or is it random? |
OK folks, I think the mystery is resolved. TL;DR: no functional regressions have been introduced. Before 2.14.0, the secure HTTP transport didn't log any exceptions (see https://github.com/opensearch-project/security/blob/2.13.0.0/src/main/java/org/opensearch/security/ssl/OpenSearchSecuritySSLPlugin.java#L270 where the error handler was set to NOOP). In 2.14.0 and later, the handler was switched from NOOP to one that logs (see https://github.com/opensearch-project/security/blob/main/src/main/java/org/opensearch/security/OpenSearchSecurityPlugin.java#L2125), which is why the exceptions are appearing now. The takeaway is that they were present before but swallowed. |
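To illustrate the difference between the two handlers, here is a toy sketch. It is not OpenSearch code: `serve`, `noop_handler`, and `logging_handler` are made-up names, and the "events" stand in for client connections. The point is only that the same failures occur in both cases, and the handler decides whether anyone sees them.

```python
import logging
from typing import Callable, Iterable

logger = logging.getLogger("transport.sketch")


def noop_handler(exc: Exception) -> None:
    """Swallow the error silently (the pre-2.14.0 behavior described above)."""


def logging_handler(exc: Exception) -> None:
    """Report the error (the behavior described for 2.14.0 and later)."""
    logger.error("connection error: %s", exc)


def serve(events: Iterable[str], on_error: Callable[[Exception], None]) -> int:
    """Process fake connection events; return how many completed normally."""
    completed = 0
    for event in events:
        try:
            if event == "reset":
                # Simulates a client dropping the connection mid-handshake.
                raise ConnectionResetError("Connection reset")
            completed += 1
        except OSError as exc:
            on_error(exc)
    return completed
```

With either handler, `serve(["ok", "reset", "ok"], handler)` returns 2; only `logging_handler` leaves a trace in the log.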
@reta Thanks! There's still something causing these errors that's not supposed to, no? |
Thanks @dblock, correct, these errors are caused by clients closing the connection. It is not possible to pinpoint the exact reasons, but here are just a few:
|
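One way a client can close a connection abruptly is a TCP reset instead of a normal shutdown. The sketch below shows how such a reset could be produced from a test client, assuming a POSIX-like system. Whether it triggers the exact exception discussed here is an assumption that has not been verified in this thread, and `reset_close` and `abrupt_close` are hypothetical helper names.

```python
import socket
import struct


def reset_close(sock: socket.socket) -> None:
    """Close a connected TCP socket with a reset (RST) rather than a FIN.

    SO_LINGER with l_onoff=1 and l_linger=0 makes close() discard unsent data
    and send RST, so the peer sees "Connection reset" on its next read.
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, struct.pack("ii", 1, 0))
    sock.close()


def abrupt_close(host: str, port: int, timeout: float = 5.0) -> None:
    """Connect to host:port and immediately reset the connection."""
    sock = socket.create_connection((host, port), timeout=timeout)
    reset_close(sock)
```

For example, `abrupt_close("localhost", 9200)` against a test node (host and port are placeholders) opens a connection and drops it before any TLS handshake takes place. Health checks, load balancers, and monitoring probes can behave in a similar way.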
@reta thanks for the debug!
My clusters were provisioned on top of our private cloud, in-house Kubernetes, so all coordinating nodes were behind the LoadBalancer service. I asked our infrastructure team if there is any monitoring job involved. Thanks again 🙇 |
I checked it again, and the 10-hour pattern only shows up on lightly used clusters. |
Hi @reta |
Thanks @blueish-eyez, yes, I will be working on restoring the previous behavior, when such exceptions were swallowed. @dblock could you please transfer this ticket to |
@blueish-eyez the problem with
By any chance, did you (re)configure logging on your clusters? One notable difference between 2.11.x and 2.15.x is that the transports are now in a different package. |
@reta thanks for the update. Logging is configured at info level, and I believe this was the baseline. No specific packages have elevated or suppressed logging (except for the testing that @zane-neo suggested in the first response, which did not show any relevant messages around SSL exceptions). For the time being I can set |
FYI: we see the same exceptions (currently running 2.17; we didn't check when the exceptions first appeared in the logs). A very simple way to reproduce, and how we noticed the exceptions in the first place: try the Nagios
Running the check will immediately trigger an exception. We didn't change anything in our monitoring and the |
@rlueckl thanks, could you reproduce it with a Docker setup? I cannot, the log is clean:
|
Hi @reta,
check_http 2.3.1 (workstation) -> docker 2.17.0 (workstation) -> no exception
The last test would be interesting if you could test with check_http 2.3.3 and the docker image to see if there's an exception. Unfortunately I'm somewhat limited because of our company firewalls and can't test this case.
EDIT: also tested with check_http 2.4.9 -> no exception. So it seems that the exception is (only) triggered by check_http 2.3.3 (maybe other versions too; I want to test 2.3.5 as soon as I have some time). |
Thanks @rlueckl
I will try this configuration now, will update you shortly:
UPD: Debian Bookworm + OpenSearch 2.17.1 (on Docker):
Log is clean :( |
Interesting. I can't reproduce the exception with 2.3.3 from my workstation, but running the check locally on the server I get the exception.
check_http 2.3.3 (server) -> deb install 2.17.0 (server) -> connection reset exception |
I encountered the same error, and after checking, I found the following:
https://forum.opensearch.org/t/ssl-exception-connection-reset-error-on-master-nodes/22281/5 |
Describe the bug
I had a working cluster free of errors, however post upgrade to 2.15 (also tested 2.16) I'm getting a ton of the following error on worker nodes:
This message is always identical and I cannot pinpoint it to a specific action performed by OpenSearch. The only thing that's also repeatedly logged is:
But I feel it's a long shot.
I did not see this problem on 2.11; it only started appearing after upgrading. Has anyone else experienced this? Is there any direction you could recommend other than verifying certificates (already done: none expired, the CAs are there, as well as the keys, etc.)? Is there perhaps a way to connect the SSL exception with the specific task that caused it? Was it communication from a specific node? From the master node, maybe? Was it related to snapshots or something else?
Any help is highly appreciated!
Related component
Other
To Reproduce
Expected behavior
No SSL exceptions post upgrade
Additional Details
Plugins
Security plugin
Host/Environment (please complete the following information):