Describe the bug
We ran into an issue with intermittent network packet drops and disconnections on a single faulty node hosting the primary copy. While replicating a request to the replica copy, the primary lost its connection to the replica and reached out to the cluster manager to mark the replica as failed. The cluster manager failed the shard, citing a stale copy, after which the primary acked the request back to the user with only the primary copy remaining healthy.
Meanwhile, the follower check requests also appeared to fail intermittently. However, since the leader's health checks are not coordinated with the critical request path, a faulty primary can mark the replica as unhealthy before the leader is able to evict the faulty node from the cluster (a minimal sketch of this race follows the log below).
Other nodes were also witnessing node disconnections at this time; even the leader had seen disconnections earlier:
```
[2024-03-10T12:11:51,233][WARN ][o.o.g.G.InternalReplicaShardAllocator] [88a0ddc6bd07e28209f48cbbba4e6d35] [some-logs-2024.03.10][5]: failed to list shard for shard_store on node [rd2GjJh7QEu6fa_G71z5pw]
FailedNodeException[Failed node [rd2GjJh7QEu6fa_G71z5pw]]; nested: NodeDisconnectedException[[6c43b365208c71cce1b97f96b36faf8e][172.16.xx.xx:9300][internal:cluster/nodes/indices/shard/store[n]] disconnected];
at org.opensearch.action.support.nodes.TransportNodesAction$AsyncAction.onFailure(TransportNodesAction.java:308)
at org.opensearch.action.support.nodes.TransportNodesAction$AsyncAction$1.handleException(TransportNodesAction.java:282)
at org.opensearch.transport.TransportService$6.handleException(TransportService.java:794)
at org.opensearch.security.transport.SecurityInterceptor$RestoringTransportResponseHandler.handleException(SecurityInterceptor.java:316)
at org.opensearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1414)
at org.opensearch.transport.TransportService$9.run(TransportService.java:1266)
at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:756)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
at java.base/java.lang.Thread.run(Thread.java:833)
Caused by: NodeDisconnectedException[[6c43b365208c71cce1b97f96b36faf8e][172.16.xx.xx:9300][internal:cluster/nodes/indices/shard/store[n]] disconnected]
```
After some time, the follower checks on the faulty node failed and the node was evicted (lagging) from the cluster, which caused the primary copy to fail as well.
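For illustration only, here is a minimal standalone sketch of the race (hypothetical names, not OpenSearch classes): the replication path can request a shard failure after a single connection error, while node eviction needs several consecutive failed follower checks, and nothing coordinates the two.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class ShardFailRaceSketch {
    static final int FOLLOWER_CHECK_RETRIES = 3;   // eviction needs several consecutive failures
    static final long CHECK_INTERVAL_MS = 500;     // follower checks run on their own schedule

    public static void main(String[] args) throws InterruptedException {
        ScheduledExecutorService leaderChecks = Executors.newSingleThreadScheduledExecutor();
        AtomicInteger consecutiveFailures = new AtomicInteger();

        // Leader-side follower checks: a node is only evicted after repeated failed checks.
        leaderChecks.scheduleAtFixedRate(() -> {
            boolean checkSucceeded = ThreadLocalRandom.current().nextBoolean(); // intermittent drops
            if (checkSucceeded) {
                consecutiveFailures.set(0);
            } else if (consecutiveFailures.incrementAndGet() >= FOLLOWER_CHECK_RETRIES) {
                System.out.println("leader: evicting faulty node after repeated follower-check failures");
            }
        }, 0, CHECK_INTERVAL_MS, TimeUnit.MILLISECONDS);

        // Data path on the faulty primary: a single failed send to the replica is enough to
        // ask the cluster manager to fail the replica copy, with no look at follower-check state.
        boolean replicaSendFailed = true; // the same intermittent blip hits one replication request
        if (replicaSendFailed) {
            System.out.println("primary: requesting cluster manager to fail the replica copy");
        }

        Thread.sleep(3L * CHECK_INTERVAL_MS * FOLLOWER_CHECK_RETRIES); // let the checks run a bit
        leaderChecks.shutdownNow();
    }
}
```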
Related component
Cluster Manager
Expected behavior
Since replication requests and health checks happen fairly independently, it is possible for the leader to inaccurately deem a faulty node healthy and apply that faulty node's decision to fail a peer shard, based only on the primary's inability to connect to the replica. While failing a replica, the leader can perform an independent verification of the replica shard's health before marking it as faulty, to rule out asymmetric failures, i.e. leader checks going through while replication requests fail.
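As a rough sketch of what such a verification could look like on the cluster manager (hypothetical interfaces, not actual OpenSearch code), the leader probes the replica's node itself before trusting the primary's failure report:

```java
// Hedged sketch: all types here are illustrative assumptions, not the real cluster-manager code path.
public final class ReplicaFailureVerifier {

    /** Minimal view of what the cluster manager would need; assumed for illustration. */
    interface NodePinger {
        boolean ping(String nodeId, long timeoutMillis); // true if the node answers a health probe
    }

    enum Decision { APPLY_SHARD_FAILED, REJECT_REPORT }

    private final NodePinger pinger;
    private final long timeoutMillis;

    ReplicaFailureVerifier(NodePinger pinger, long timeoutMillis) {
        this.pinger = pinger;
        this.timeoutMillis = timeoutMillis;
    }

    /**
     * Called when a primary reports "I could not reach the replica on replicaNodeId".
     * The leader checks the replica itself before trusting the (possibly faulty) primary.
     */
    Decision verify(String reportingPrimaryNodeId, String replicaNodeId) {
        boolean replicaReachableFromLeader = pinger.ping(replicaNodeId, timeoutMillis);
        if (replicaReachableFromLeader) {
            // The leader can reach the replica even though the primary cannot: likely an
            // asymmetric failure on the primary's side, so do not fail the replica.
            return Decision.REJECT_REPORT;
        }
        return Decision.APPLY_SHARD_FAILED;
    }
}
```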
While failing a replica, the leader can perform an independent verification of the replica shard's health before marking it as faulty, to rule out asymmetric failures, i.e. leader checks going through while replication requests fail.
Let's say the replica shard node responds to the cluster manager successfully, and the cluster manager doesn't fail the shard. What happens to the primary at this point? Should it step down as the primary copy, since indexing requests will be blocked?
Also, this healthy-replica check has to consider other valid causes (false positives) where the cluster manager can connect to the replica node but the node hosting the primary cannot, for a legitimate reason.
Let's say the replica shard node responds to the cluster manager successfully, and the cluster manager doesn't fail the shard. What happens to the primary at this point? Should it step down as the primary copy, since indexing requests will be blocked?
Today the problem is that a single network blip between the primary and replica nodes can cause the replica shard to fail, which in my opinion is too aggressive and counterproductive to overall cluster resiliency. Ideally, if the leader finds the replica to be healthy, it should simply reject the shard-failure request, and the active primary should consider failing the request itself. We can then also look at strengthening the health checks and wait until the follower checks on the faulty node evict the bad node.
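As a sketch of that suggestion on the primary side (all names hypothetical, not actual OpenSearch code), the outcome of the shard-failure report would decide whether the write is acked or failed back to the client:

```java
// Hedged illustration of the proposed behaviour; shapes and names are assumptions.
public class PrimaryRejectionHandlingSketch {

    enum ShardFailOutcome { APPLIED, REJECTED_REPLICA_HEALTHY }

    /** Shape of the response returned for one indexing request; assumed for illustration. */
    record WriteResult(boolean acknowledged, String detail) {}

    static WriteResult onReplicaSendFailure(ShardFailOutcome outcome) {
        if (outcome == ShardFailOutcome.APPLIED) {
            // The replica was removed from the replication group; the write is acked with the
            // remaining primary-only copy, which is what happens today.
            return new WriteResult(true, "acked after replica was failed");
        }
        // The leader verified the replica is healthy, so the primary is the suspect: fail this
        // request back to the client instead of failing the replica, and let the follower
        // checks decide whether the primary's node should be evicted.
        return new WriteResult(false, "request failed; replica kept in the replication group");
    }

    public static void main(String[] args) {
        System.out.println(onReplicaSendFailure(ShardFailOutcome.REJECTED_REPLICA_HEALTHY).detail());
    }
}
```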
Also, this healthy-replica check has to consider other valid causes (false positives) where the cluster manager can connect to the replica node but the node hosting the primary cannot, for a legitimate reason.
The leader actually does asymmetric network checks today, but as you note this can still happen. In such situations we could consider a form of gossip, like Scuttlebutt, to make sure we cover more network paths beyond just follower-leader.
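A toy sketch of that gossip idea (an assumption of how it could look, not an existing OpenSearch feature): each node keeps versioned reachability observations about its peers, and a Scuttlebutt-style merge keeps only the freshest observation per (observer, target) pair, so the leader also sees what other nodes observe about the replica, not just its own follower checks.

```java
import java.util.HashMap;
import java.util.Map;

public class ReachabilityGossipSketch {

    /** One node's observation about whether it can reach another node, with a version stamp. */
    record Observation(String observer, String target, boolean reachable, long version) {}

    /** Per-node state: the newest known observation for each (observer, target) pair. */
    static final class GossipState {
        final Map<String, Observation> byPair = new HashMap<>();

        void observe(Observation o) {
            String key = o.observer() + "->" + o.target();
            Observation existing = byPair.get(key);
            if (existing == null || o.version() > existing.version()) {
                byPair.put(key, o); // keep only the freshest observation, Scuttlebutt-style
            }
        }

        /** Anti-entropy merge: take everything the peer knows that is newer than what we have. */
        void mergeFrom(GossipState peer) {
            peer.byPair.values().forEach(this::observe);
        }
    }

    public static void main(String[] args) {
        GossipState leader = new GossipState();
        GossipState dataNode = new GossipState();

        // The faulty primary reports it cannot reach the replica...
        dataNode.observe(new Observation("primary-node", "replica-node", false, 10));
        // ...while another data node reports, more recently, that it still can.
        dataNode.observe(new Observation("data-node-3", "replica-node", true, 12));

        leader.mergeFrom(dataNode); // one gossip round gives the leader both network paths
        leader.byPair.values().forEach(o ->
                System.out.println(o.observer() + " -> " + o.target() + " reachable=" + o.reachable()));
    }
}
```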