From 9a2f8a80eb729c33b33a819a6246d03459e168c9 Mon Sep 17 00:00:00 2001
From: David Turner
Date: Thu, 4 Apr 2024 07:37:13 +0100
Subject: [PATCH] Add remote cluster network troubleshooting docs (#107072)

Spells out in a little more detail our expectations for remote cluster
connections, including an example log message when the network is
unreliable and some suggestions for how to troubleshoot further.
---
 .../remote-clusters-troubleshooting.asciidoc | 40 +++++++++++++++++++
 1 file changed, 40 insertions(+)

diff --git a/docs/reference/modules/cluster/remote-clusters-troubleshooting.asciidoc b/docs/reference/modules/cluster/remote-clusters-troubleshooting.asciidoc
index f7b08b40bb7ef..df3c54794dc06 100644
--- a/docs/reference/modules/cluster/remote-clusters-troubleshooting.asciidoc
+++ b/docs/reference/modules/cluster/remote-clusters-troubleshooting.asciidoc
@@ -77,6 +77,46 @@ org.elasticsearch.transport.ConnectTransportException: [][192.168.0.42:9443] *co
   server is enabled>> on the remote cluster.
 * Ensure no firewall is blocking the communication.
 
+[[remote-clusters-unreliable-network]]
+===== Remote cluster connection is unreliable
+
+====== Symptom
+
+The local cluster can connect to the remote cluster, but the connection does
+not work reliably. For example, some cross-cluster requests may succeed while
+others report connection errors, time out, or appear to be stuck waiting for
+the remote cluster to respond.
+
+When {es} detects that the remote cluster connection is not working, it will
+report the following message in its logs:
+[source,txt,subs=+quotes]
+----
+[2023-06-28T16:36:47,264][INFO ][o.e.t.ClusterConnectionManager] [local-node] transport connection to [{my-remote#192.168.0.42:9443}{...}] closed by remote
+----
+This message will also be logged if the node of the remote cluster to which
+{es} is connected is shut down or restarted.
+
+Note that with some network configurations it could take minutes or hours for
+the operating system to detect that a connection has stopped working. Until the
+failure is detected and reported to {es}, requests involving the remote cluster
+may time out or may appear to be stuck.
+
+====== Resolution
+
+* Ensure that the network between the clusters is as reliable as possible.
+
+* Ensure that the network is configured to permit <>.
+
+* Ensure that the network is configured to detect faulty connections quickly.
+  In particular, you must enable and fully support TCP keepalives, and set a
+  short <>.
+
+* On Linux systems, execute `ss -tonie` to verify the details of the
+  configuration of each network connection between the clusters.
+
+* If the problems persist, capture network packets at both ends of the
+  connection and analyse the traffic to look for delays and lost messages.
+
 [[remote-clusters-troubleshooting-tls-trust]]
 ===== TLS trust not established
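
As a companion to the Symptom section added by this patch (and not part of the patch itself), the following Python sketch counts how often the "closed by remote" message appears for each remote connection, which may help judge how unreliable the link is. It assumes log lines follow exactly the layout of the example message quoted in the patch; a different logging pattern would need a different expression. The names `CLOSED_BY_REMOTE` and `count_remote_closures` are hypothetical, and reading the text before `#` as a connection name and the text after it as an address is an assumption based on that single example.

```python
import re
from collections import Counter

# Assumption: lines look like the example in the patch, e.g.
# [timestamp][INFO ][o.e.t.ClusterConnectionManager] [node] transport
# connection to [{name#host:port}{...}] closed by remote
CLOSED_BY_REMOTE = re.compile(
    r"^\[(?P<timestamp>[^\]]+)\]"
    r"\[INFO\s*\]"
    r"\[o\.e\.t\.ClusterConnectionManager\]\s*"
    r"\[(?P<node>[^\]]+)\]\s+"
    r"transport connection to "
    r"\[\{(?P<name>[^#}]+)#(?P<address>[^}]+)\}.*\] closed by remote\s*$"
)


def count_remote_closures(lines):
    """Count 'closed by remote' messages per (name, address) pair."""
    counts = Counter()
    for line in lines:
        match = CLOSED_BY_REMOTE.match(line)
        if match:
            counts[(match["name"], match["address"])] += 1
    return counts
```

To use it on a real log, pass an open file object, for example `count_remote_closures(open(path))` where `path` is a placeholder for wherever the node's log file lives. Because the same message is also logged when the remote node is shut down or restarted, a non-zero count is only a hint, not proof of a faulty network.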