Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Continuous SSL exceptions post upgrade from 2.11 to 2.15 #4718

Open
blueish-eyez opened this issue Aug 16, 2024 · 24 comments
Open

[BUG] Continuous SSL exceptions post upgrade from 2.11 to 2.15 #4718

blueish-eyez opened this issue Aug 16, 2024 · 24 comments
Assignees
Labels
bug Something isn't working v2.18.0 Issues targeting release v2.18.0 v3.0.0

Comments

@blueish-eyez
Copy link

blueish-eyez commented Aug 16, 2024

Describe the bug

I had a working cluster free of errors, however post upgrade to 2.15 (also tested 2.16) I'm getting a ton of the following error on worker nodes:

[2024-08-12T16:48:40,662][ERROR][o.o.h.n.s.SecureNetty4HttpServerTransport] [opensearch-0] Exception during establishing a SSL connection: java.net.SocketException: Connection reset
java.net.SocketException: Connection reset
	at java.base/sun.nio.ch.SocketChannelImpl.throwConnectionReset(SocketChannelImpl.java:401) ~[?:?]
	at java.base/sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:434) ~[?:?]
	at org.opensearch.transport.CopyBytesSocketChannel.readFromSocketChannel(CopyBytesSocketChannel.java:156) ~[transport-netty4-client-2.16.0.jar:2.16.0]
	at org.opensearch.transport.CopyBytesSocketChannel.doReadBytes(CopyBytesSocketChannel.java:141) ~[transport-netty4-client-2.16.0.jar:2.16.0]
	at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:151) [netty-transport-4.1.111.Final.jar:4.1.111.Final]
	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:788) [netty-transport-4.1.111.Final.jar:4.1.111.Final]
	at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:689) [netty-transport-4.1.111.Final.jar:4.1.111.Final]
	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:652) [netty-transport-4.1.111.Final.jar:4.1.111.Final]
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562) [netty-transport-4.1.111.Final.jar:4.1.111.Final]
	at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:994) [netty-common-4.1.111.Final.jar:4.1.111.Final]
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) [netty-common-4.1.111.Final.jar:4.1.111.Final]
	at java.base/java.lang.Thread.run(Thread.java:1583) [?:?]

This message is always identical and I cannot pinpoint it to a specific action performed by OpenSearch. The only thing that's also repeatedly logged is:

[INFO ][o.o.j.s.JobSweeper       ] [opensearch-0] Running full sweep

But I feel it's a long shot.

I did not see this problem on 2.11 and this only started appearing after upgrading. Has anyone else experienced this? Is there any direction you could recommend me to go other than verifying certificates (already done, none expired, CAs are there, as well as the keys etc). Is there perhaps a way to connect the SSL exception with a specific task that caused it? Was it communication from a specific node? From master node maybe? Was it in regards to snapshots or whatever?

Any help is highly appreciated!

Related component

Other

To Reproduce

  1. Setup a working cluster on 2.11 with Security plugin
  2. Upgrade the cluster to 2.15
  3. Check worker node logs for SSL exceptions

Expected behavior

No SSL exceptions post upgrade

Additional Details

Plugins
Security plugin

Screenshots
If applicable, add screenshots to help explain your problem.

Host/Environment (please complete the following information):

  • Docker image based OpenSearch 2.15

Additional context
Add any other context about the problem here.

@blueish-eyez blueish-eyez added bug Something isn't working untriaged Require the attention of the repository maintainers and may need to be prioritized labels Aug 16, 2024
@zane-neo
Copy link
Contributor

@blueish-eyez I tried this on my local but I can't reproduce it successfully. Here is what I have done:

  1. Create a 2.11 opensearch cluster with docker-compose file and create several documents in a index.
  2. Change the docker-compose file opensearch version to 2.16 and restart.
    Can you share steps that can consistently reproduce the issue?

@blueish-eyez
Copy link
Author

Thinking about it in larger picture there will be plenty of additional points of influence here.. it's a production grade cluster (although sandbox, but with the same principles of access and usage). There are java applications directly writing to the cluster. I tried setting up docker-compose and upgrade but I also could not reproduce in such scenario. I also cannot reproduce on another regular production grade cluster that is using traditional filebeat-logstash-opensearch write pipeline.

Is it possible to decode [2024-08-12T16:48:40,662][ERROR][o.o.h.n.s.SecureNetty4HttpServerTransport], exactly the "o.o.h.n.s" bit, as mentioned in https://opensearch.org/docs/latest/install-and-configure/configuring-opensearch/logs/?
Like o.o.i.r stands for logger.org.opensearch.index.reindex. Perhaps knowing the call it was made from I could pin-point the faulty module?

@zane-neo
Copy link
Contributor

The o.o.h.n.s stands for org.opensearch.http.netty4.ssl, you can try change the log level of this package or the class org.opensearch.http.netty4.ssl.SecureNetty4HttpServerTransport to see if more details can be found.

@machenity
Copy link

machenity commented Sep 2, 2024

Hi. I'm experiencing the same issue.
In my OpenSearch cluster on version 2.16.0, the above issue is repeating on coordinating nodes.

The strange thing is that it's happening exactly once every 10 hours, and almost exclusively on the coordinating nodes.

image

This happens even when no indexing or query requests are being sent.

I've created and tested clusters with different versions, and the same issue doesn't happen with 2.13.0 and earlier,
but only with 2.14.0 and later.

  • 2.12.0: not occurred
  • 2.13.0: not occurred
  • 2.14.0: occurred
  • 2.16.0: occurred

So, I'm not sure, but I suspect it's related to this commit.

@10000-ki
Copy link
Contributor

10000-ki commented Sep 2, 2024

Hi
The same phenomenon occurs to me, and the issue seems to exist starting from v2.14.0 or higher.

@dblock
Copy link
Member

dblock commented Sep 2, 2024

We still need to narrow this down and get an easy repro, maybe @stephen-crawford can help us here?

@reta
Copy link
Collaborator

reta commented Sep 3, 2024

@dblock I will also take a look

@reta
Copy link
Collaborator

reta commented Sep 3, 2024

In my OpenSearch cluster on version 2.16.0, the above issue is repeating on coordinating nodes.

@machenity do you have any proxy / gateway in front of the cluster?

I had a working cluster free of errors, however post upgrade to 2.15 (also tested 2.16) I'm getting a ton of the following error on worker nodes:

@blueish-eyez do you have periodic pattern as well or it is very random?

@reta
Copy link
Collaborator

reta commented Sep 3, 2024

OK folks, I think the mystery is resolved: TLDR; no functional regressions have been introduced.

So pre-2.14.0, the secure HTTP transport didn't log any exceptions (see please https://github.com/opensearch-project/security/blob/2.13.0.0/src/main/java/org/opensearch/security/ssl/OpenSearchSecuritySSLPlugin.java#L270 where the error handler was set to NOOP).

In 2.14.0 and later, the handler was switched from NOOP to the one which logs (see please https://github.com/opensearch-project/security/blob/main/src/main/java/org/opensearch/security/OpenSearchSecurityPlugin.java#L2125) and here is why the exceptions are appearing now.

The takeaway is that those were present before but swallowed.

@dblock
Copy link
Member

dblock commented Sep 4, 2024

@reta Thanks! There's still something causing these errors that's not supposed to, no?

@reta
Copy link
Collaborator

reta commented Sep 4, 2024

@reta Thanks! There's still something causing these errors that's not supposed to, no?

Thanks @dblock , correct, these errors are caused by clients closing the connection. It is not possible to pinpoint the exact reasons but just a few:

  • proxy / gateways in the middle may close the connection
  • application / service crash that abruptly closes the connection
  • ...

@machenity
Copy link

@reta thanks for the debug!

In my OpenSearch cluster on version 2.16.0, the above issue is repeating on coordinating nodes.

@machenity do you have any proxy / gateway in front of the cluster?

My clusters were provisioned on top of our private cloud, in-house Kubernetes, so all coordinating nodes were behind the LoadBalancer service. I asked our infrastructure team if there is any monitoring job involved.

Thanks again 🙇

@machenity
Copy link

machenity commented Sep 5, 2024

I checked it again, and the 10-hour duration patterns are only for low-used clusters.
For the highly used clusters, these logs have occurred randomly.

@blueish-eyez
Copy link
Author

Hi @reta
I restarted one of the worker nodes and collected sample stdout for roughly 30 minutes. The SSL exception appears roughly every 5-30 seconds. Unfortunately to the point, where logs become unusable after days of uptime. Is there a way to suppress this?
Please also note that while others mentioned they are seeing this on master nodes, I'm seeing this only on worker nodes.

@reta
Copy link
Collaborator

reta commented Sep 5, 2024

The SSL exception appears roughly every 5-30 seconds. Unfortunately to the point, where logs become unusable after days of uptime. Is there a way to suppress this?

Thanks @blueish-eyez , yes, I will be working on restoring the previous behavior when such exceptions where swallowed. @dblock could you please transfer this ticket to security plugin?

@reta reta self-assigned this Sep 5, 2024
@dblock dblock removed the untriaged Require the attention of the repository maintainers and may need to be prioritized label Sep 9, 2024
@dblock
Copy link
Member

dblock commented Sep 9, 2024

[Catch All Triage - 1, 2, 3, 4, 5]

@dblock dblock transferred this issue from opensearch-project/OpenSearch Sep 9, 2024
@github-actions github-actions bot added the untriaged Require the attention of the repository maintainers and may need to be prioritized label Sep 9, 2024
@dblock dblock removed the untriaged Require the attention of the repository maintainers and may need to be prioritized label Sep 9, 2024
@reta reta added v3.0.0 v2.18.0 Issues targeting release v2.18.0 labels Sep 9, 2024
@reta
Copy link
Collaborator

reta commented Sep 10, 2024

@blueish-eyez the problem with sslExceptionHandler shoudl be fixed in 2.18.0 but it may not get rid of the this exceptions in the logs, fe I could reproduce (by simulation) those on 2.11.0:

[2024-09-10T19:45:25,650][ERROR][o.o.s.s.h.n.SecuritySSLNettyHttpServerTransport] [21140679ffc1] Exception during establishing a SSL connection: java.net.SocketException: Connection reset                                                                                                                                                           
java.net.SocketException: Connection reset                                                                                                                                                                                                                                                                                                            
        at sun.nio.ch.SocketChannelImpl.throwConnectionReset(SocketChannelImpl.java:394) ~[?:?]                                                                                                                                                                                                                                                       
        at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:426) ~[?:?]                                                                                                                                                                                                                                                                       
        at org.opensearch.transport.CopyBytesSocketChannel.readFromSocketChannel(CopyBytesSocketChannel.java:156) ~[transport-netty4-client-2.11.0.jar:2.11.0]
        at org.opensearch.transport.CopyBytesSocketChannel.doReadBytes(CopyBytesSocketChannel.java:141) ~[transport-netty4-client-2.11.0.jar:2.11.0]
        at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:151) [netty-transport-4.1.100.Final.jar:4.1.100.Final]
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:788) [netty-transport-4.1.100.Final.jar:4.1.100.Final]
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:689) [netty-transport-4.1.100.Final.jar:4.1.100.Final]
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:652) [netty-transport-4.1.100.Final.jar:4.1.100.Final]
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562) [netty-transport-4.1.100.Final.jar:4.1.100.Final]
        at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997) [netty-common-4.1.100.Final.jar:4.1.100.Final]
        at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) [netty-common-4.1.100.Final.jar:4.1.100.Final]
        at java.lang.Thread.run(Thread.java:833) [?:?]

By any chance, did you (re)configure logging on your clusters? One notable difference between 2.11.x and 2.15.x is that transports are now in different package.

@blueish-eyez
Copy link
Author

@reta thanks for the update. Logging is configured at info level and I believe this was the baseline. No specific packages have elevated / suppressed messaging (except for testing that suggested @zane-neo in the first response, that however did not show any relevant messages around SSL exceptions). For the time being I can set org.opensearch.http.netty4.ssl to FATAL until I can get the 2.18 image to test with

@rlueckl
Copy link

rlueckl commented Oct 16, 2024

FYI: we see the same exceptions (currently running 2.17, didn't check when the exceptions were introduced in the logs).

A very simple way to reproduce - and how we noticed the exceptions in the first place: try the Nagios check_http plugin.

# /usr/lib/nagios/plugins/check_http -H opensearch-host01.int.lan --port=9200 --ssl --certificate=15
OK - Certificate 'opensearch-host01.int.lan' will expire on Sun 17 Nov 2024 06:05:05 AM GMT +0000.

Running the check will immediately trigger an exception. We didn't change anything in our monitoring and the check_http script wasn't updated for a long time either.

@reta
Copy link
Collaborator

reta commented Oct 16, 2024

@rlueckl thanks, could you reproduce it with docker setup? I cannot, the log is clean:

$ docker run -it -p 9200:9200 -p 9600:9600 -e OPENSEARCH_INITIAL_ADMIN_PASSWORD=_ad0m#Ns_ -e "discovery.type=single-node"  opensearchproject/opensearch:2.17.1
$ /usr/local/nagios/libexec/check_http  -H localhost --port=9200 --ssl --certificate=15
SSL OK - Certificate 'node-0.example.com' will expire in 3410 days on 2034-02-17 12:03 -0500/EST.

@rlueckl
Copy link

rlueckl commented Oct 17, 2024

Hi @reta,
which version of check_http are you using? I've tested the docker image on my workstation but could not reproduce the issue, but my Linux Mint has a slightly older version of check_http than the servers.

check_http 2.3.1 (workstation) -> docker 2.17.0 (workstation) -> no exception
check_http 2.3.1 (workstation) -> docker 2.17.1 (workstation) -> no exception
check_http 2.3.1 (workstation) -> deb install 2.17.0 (server) -> no exception
check_http 2.3.3 (server) -> deb install 2.17.0 (server) -> connection reset exception
check_http 2.3.3 (server) -> docker 2.17.x (workstation) -> can't test, firewall between the server and my workstation.

The last test would be interesting if you could test with check_http 2.3.3 and the docker image to see if there's an exception. Unfortunately I'm somewhat limited because of our company firewalls and can't test this case.

EDIT: also tested with check_http 2.4.9 -> no exception. So it seems that the exception is (only) triggered by check_http 2.3.3 (maybe other versions, I want to test 2.3.5 as soon as I have some time).

@reta
Copy link
Collaborator

reta commented Oct 17, 2024

Hi @reta,
which version of check_http are you using?

Thanks @rlueckl

$ /usr/local/nagios/libexec/check_http  --version
check_http v2.4.11 (nagios-plugins 2.4.11)

I will try this configuration now, will update you shortly:

  • check_http 2.3.3 (server) -> deb install 2.17.0 (server) -> connection reset exception

UPD: Debian Bookworm + OpenSearch 2.17.1 (on Docker):

root@c2d1f85d6cd6:/tmp/nagios-plugins-2.3.3# /usr/local/nagios/libexec/check_http -H c2d1f85d6cd6 --port=9200 --ssl --certificate=15
SSL OK - Certificate 'node-0.example.com' will expire in 3410 days on 2034-02-17 17:03 +0000/UTC.

Log is clean :(

@rlueckl
Copy link

rlueckl commented Oct 18, 2024

Interesting. I can't reproduce the exception with 2.3.3 from my workstation, but running the check locally on the server I get the exception.

check_http 2.3.3 (server) -> deb install 2.17.0 (server) -> connection reset exception
check_http 2.3.3 (workstation) -> deb install 2.17.0 (server) -> no exception

@kkoki-ctrl
Copy link

I encountered the same error, and after checking, I found the following:

  • The error occurs sporadically 0 to 3 times per hour, but only on the master node.
  • The error happens regardless of the version (OpenSearch 2.13.0, 2.14.0, 2.16.0).
  • The error does not occur when the elasticsearch-exporter:v1.8.0 is stopped.

https://forum.opensearch.org/t/ssl-exception-connection-reset-error-on-master-nodes/22281/5

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working v2.18.0 Issues targeting release v2.18.0 v3.0.0
Projects
None yet
Development

No branches or pull requests

8 participants