
[BUG] Intermittent network issue on a single primary node can fail all shard copies #13095

Open
Bukhtawar opened this issue Apr 5, 2024 · 2 comments
Labels
bug (Something isn't working), Cluster Manager

Comments

@Bukhtawar
Collaborator

Bukhtawar commented Apr 5, 2024

Describe the bug

We ran into an issue where there were intermittent network packet drops and disconnections on a single faulty node hosting the primary copy. While replicating the request to the replica copy, the primary got disconnected from the replica and reached out to the cluster manager to mark that copy as failed. The cluster manager failed the shard, citing a stale copy, after which the primary ack'ed the request back to the user with only the single primary copy staying healthy. A minimal sketch of this sequence follows the log line below.

[2024-03-10T12:06:09,815][WARN ][o.o.c.r.a.AllocationService] [88abbbhhh6bd07e28209f48cbbba4e6d35] failing shard [failed shard, shard [some-logs-2024.03.10][5], node[VxbTIaNsRDGrHgKmI2NNtA], [R], s[STARTED], a[id=n0nLvCjnR_-s_qGDqfD-0g], message [failed to perform indices:data/write/bulk[s] on replica [some-logs-2024.03.10][5], node[VxbTIaNsRDGrHgKmI2NNtA], [R], s[STARTED], a[id=n0nLvCjnR_-s_qGDqfD-0g]], failure [NodeDisconnectedException[[98754ae611d299affa8b740637656009][xx.xx.xx.xx:9300][indices:data/write/bulk[s][r]] disconnected]], markAsStale [true]]
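Here is a minimal, self-contained sketch (plain Java, not OpenSearch code; both interfaces are hypothetical stand-ins) of the sequence as it works today: a single failed replication attempt is enough for the primary to ask the cluster manager to fail the replica copy with markAsStale, after which the write is acked with only the primary in sync.

```java
// Illustrative sketch only: models the current behavior described above.
public class ReplicationFailureSketch {

    /** Hypothetical stand-in for the cluster manager's shard-failure handling. */
    interface ClusterManager {
        // markAsStale mirrors the "markAsStale [true]" seen in the log above
        void failShard(String shardId, String allocationId, String reason, boolean markAsStale);
    }

    /** Hypothetical transport call to the replica that may throw on a network blip. */
    interface ReplicaTransport {
        void performOnReplica(String shardId, byte[] request) throws Exception;
    }

    static void replicate(String shardId, String allocationId, byte[] request,
                          ReplicaTransport replica, ClusterManager clusterManager) {
        try {
            replica.performOnReplica(shardId, request);
        } catch (Exception e) {
            // Today: one intermittent disconnect is enough to request that the
            // replica copy be failed and marked stale; the primary then acks the
            // write with only itself remaining in sync.
            clusterManager.failShard(shardId, allocationId,
                "failed to perform indices:data/write/bulk[s] on replica: " + e, true);
        }
    }
}
```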

Meanwhile, the follower check requests also seemed to fail intermittently, but since the leader's health checks are not coordinated with the critical request path, it is quite possible for a faulty primary to mark the replica as unhealthy before the leader can evict the faulty node from the cluster.

Other nodes were also witnessing node disconnections at this time; even earlier, the leader had seen some disconnections.

[2024-03-10T12:11:51,233][WARN ][o.o.g.G.InternalReplicaShardAllocator] [88a0ddc6bd07e28209f48cbbba4e6d35] [some-logs-2024.03.10][5]: failed to list shard for shard_store on node [rd2GjJh7QEu6fa_G71z5pw]
FailedNodeException[Failed node [rd2GjJh7QEu6fa_G71z5pw]]; nested: NodeDisconnectedException[[6c43b365208c71cce1b97f96b36faf8e][172.16.xx.xx:9300][internal:cluster/nodes/indices/shard/store[n]] disconnected];
    at org.opensearch.action.support.nodes.TransportNodesAction$AsyncAction.onFailure(TransportNodesAction.java:308)
    at org.opensearch.action.support.nodes.TransportNodesAction$AsyncAction$1.handleException(TransportNodesAction.java:282)
    at org.opensearch.transport.TransportService$6.handleException(TransportService.java:794)
    at org.opensearch.security.transport.SecurityInterceptor$RestoringTransportResponseHandler.handleException(SecurityInterceptor.java:316)
    at org.opensearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1414)
    at org.opensearch.transport.TransportService$9.run(TransportService.java:1266)
    at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:756)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
    at java.base/java.lang.Thread.run(Thread.java:833)
Caused by: NodeDisconnectedException[[6c43b365208c71cce1b97f96b36faf8e][172.16.xx.xx:9300][internal:cluster/nodes/indices/shard/store[n]] disconnected]

After some time the follower checks on the faulty node failed and it was evicted from the cluster (as lagging), causing failure of the primary copy as well.

[2024-03-10T12:11:49,345][INFO ][o.o.c.s.MasterService    ] [88a0ddc6bd07e28209f48cbbba4e6d35] node-left[{6c43b365208c71cce1b97f96b36faf8e}{rd2GjJh7QEu6fa_G71z5pw}{MAu_ryl5Temxr_FgaD6fBw}{172.16.xx.xx}{172.16.xx.xx:9300}{dir} reason: lagging], term: 927, version: 11778114, delta: removed {{6c43b365208c71cce1b97f96b36faf8e}{rd2GjJh7QEu6fa_G71z5pw}{MAu_ryl5Temxr_FgaD6fBw}{172.16.xx.xx}{172.16.xx.xx:9300}{dir}}

Related component

Cluster Manager

To Reproduce

  1. Go to '...'
  2. Click on '....'
  3. Scroll down to '....'
  4. See error

Expected behavior

Since replication requests and health checks happen fairly independently, it is possible for the leader to inaccurately deem a faulty node healthy and apply that faulty node's decision to fail a peer shard, just because the primary could not connect to the replica. Before marking a replica shard as faulty, the leader can perform an independent verification of the replica shard's health to rule out asymmetric failures, i.e. leader checks going through while replication requests fail. A sketch of such a check is shown below.
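As an illustration of that verification, here is a rough sketch with a hypothetical NodeHealthProbe standing in for whatever transport-level probe the cluster manager would use; none of these types are existing OpenSearch APIs.

```java
// Illustrative sketch only: gate a shard-failure request behind the leader's own probe.
import java.util.concurrent.CompletableFuture;

public class VerifiedShardFailureSketch {

    /** Hypothetical async probe from the cluster manager to the replica's node. */
    interface NodeHealthProbe {
        CompletableFuture<Boolean> isReachable(String nodeId);
    }

    /** Outcome of a shard-failure request sent by the primary. */
    enum Decision { FAIL_REPLICA, REJECT_REQUEST }

    static CompletableFuture<Decision> verifyThenFail(String replicaNodeId, NodeHealthProbe probe) {
        return probe.isReachable(replicaNodeId).thenApply(reachable -> {
            if (reachable) {
                // Asymmetric failure: the leader can reach the replica even though the
                // primary cannot. Reject the request instead of failing a healthy copy.
                return Decision.REJECT_REQUEST;
            }
            // The leader cannot reach the replica either: proceed with the shard
            // failure and let follower checks evict the node.
            return Decision.FAIL_REPLICA;
        });
    }
}
```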

Additional Details

Plugins
Please list all plugins currently enabled.

Screenshots
If applicable, add screenshots to help explain your problem.

Host/Environment (please complete the following information):

  • OS: [e.g. iOS]
  • Version [e.g. 22]

Additional context
Add any other context about the problem here.

@shwetathareja
Member

Thanks @Bukhtawar for filing this issue.

Before marking a replica shard as faulty, the leader can perform an independent verification of the replica shard's health to rule out asymmetric failures, i.e. leader checks going through while replication requests fail.

Let's say the replica shard node responds to the cluster manager successfully, so the cluster manager doesn't fail the shard. What happens to the primary at this point? Should it step down as the primary copy, since indexing requests will be blocked?
Also, this healthy-replica check has to consider other valid causes (false positives) where the cluster manager is able to connect to the replica node but the node hosting the primary was not able to, for a legitimate reason.

@Bukhtawar
Collaborator Author

Let's say the replica shard node responds to the cluster manager successfully, so the cluster manager doesn't fail the shard. What happens to the primary at this point? Should it step down as the primary copy, since indexing requests will be blocked?

Today the problem is that a single network blip between the primary and the replica node can cause the replica shard to fail, which in my opinion is too aggressive and counter-productive to overall cluster resiliency. Ideally, if the leader finds the replica to be healthy, it should just reject the shard-failure request, and the active primary should consider failing the request itself. We can then also look at strengthening the health checks and waiting until the follower checks on the faulty node evict the bad node. A rough sketch of that primary-side handling is below.
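For illustration, a rough sketch of how the primary could react if such a rejection existed; the response enum and listener interface are hypothetical, not current OpenSearch types.

```java
// Illustrative sketch only: primary-side handling of a rejected shard-failure request.
public class RejectedShardFailureSketch {

    /** Hypothetical reply from the cluster manager to a shard-failure request. */
    enum ShardFailureResponse { ACCEPTED, REJECTED_REPLICA_HEALTHY }

    /** Hypothetical listener for the in-flight write held on the primary. */
    interface WriteListener {
        void onAcked();
        void onFailure(Exception e);
    }

    static void onShardFailureResponse(ShardFailureResponse response, WriteListener listener) {
        switch (response) {
            case ACCEPTED:
                // Replica was failed; the write is acked with the remaining in-sync copies.
                listener.onAcked();
                break;
            case REJECTED_REPLICA_HEALTHY:
                // The leader sees the replica as healthy, so the connectivity problem is
                // likely on the primary's side: fail this request back to the client and
                // let follower checks decide whether the primary's node should be evicted.
                listener.onFailure(new IllegalStateException(
                    "replica reported healthy by cluster manager; failing the request instead"));
                break;
        }
    }
}
```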

Also, this healthy-replica check has to consider other valid causes (false positives) where the cluster manager is able to connect to the replica node but the node hosting the primary was not able to, for a legitimate reason.

The leader actually does asymmetric network checks today, but as you note this can still happen. In such situations we can consider a form of gossip, like Scuttlebutt, to ensure we cover more network paths beyond just follower-leader; a rough sketch of the idea follows.
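For illustration only, a minimal sketch of the Scuttlebutt-style idea: nodes keep version-stamped reachability observations and merge the freshest ones when they gossip, so the leader can learn about paths (e.g. primary to replica) it never probes directly. All names here are invented for the example.

```java
// Illustrative sketch only: version-stamped, gossiped reachability observations.
import java.util.HashMap;
import java.util.Map;

public class HealthGossipSketch {

    /** One node's latest observation of a peer, with a monotonically increasing version. */
    record Observation(boolean reachable, long version) {}

    /** observations.get(observer).get(target) = freshest known observation of target by observer. */
    final Map<String, Map<String, Observation>> observations = new HashMap<>();

    /** Record a local probe result, bumping the version for this (observer, target) pair. */
    void recordLocal(String observer, String target, boolean reachable) {
        Map<String, Observation> byTarget = observations.computeIfAbsent(observer, k -> new HashMap<>());
        long next = byTarget.getOrDefault(target, new Observation(true, 0L)).version() + 1;
        byTarget.put(target, new Observation(reachable, next));
    }

    /** Merge a digest received from a peer, keeping only observations with a newer version. */
    void merge(Map<String, Map<String, Observation>> remote) {
        remote.forEach((observer, byTarget) -> byTarget.forEach((target, obs) -> {
            Map<String, Observation> local = observations.computeIfAbsent(observer, k -> new HashMap<>());
            Observation current = local.get(target);
            if (current == null || obs.version() > current.version()) {
                local.put(target, obs);
            }
        }));
    }
}
```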
