
[BUG] Intermittent network issue on a single primary node can fail all shard copies #13095

Open
Bukhtawar opened this issue Apr 5, 2024 · 2 comments
Labels
bug (Something isn't working), Cluster Manager

Comments

@Bukhtawar
Collaborator

Bukhtawar commented Apr 5, 2024

Describe the bug

We ran into an issue where there were intermittent network packet drops and disconnections on a single faulty node hosting the primary copy. While replicating the request to the replica copy, the primary got disconnected from the replica and reached out to the cluster manager to mark that copy as failed. The cluster manager failed the shard, citing a stale copy, after which the primary ack'ed the request back to the user with only the single primary copy staying healthy. A minimal sketch of this sequence follows the log line below.

[2024-03-10T12:06:09,815][WARN ][o.o.c.r.a.AllocationService] [88abbbhhh6bd07e28209f48cbbba4e6d35] failing shard [failed shard, shard [some-logs-2024.03.10][5], node[VxbTIaNsRDGrHgKmI2NNtA], [R], s[STARTED], a[id=n0nLvCjnR_-s_qGDqfD-0g], message [failed to perform indices:data/write/bulk[s] on replica [some-logs-2024.03.10][5], node[VxbTIaNsRDGrHgKmI2NNtA], [R], s[STARTED], a[id=n0nLvCjnR_-s_qGDqfD-0g]], failure [NodeDisconnectedException[[98754ae611d299affa8b740637656009][xx.xx.xx.xx:9300][indices:data/write/bulk[s][r]] disconnected]], markAsStale [true]]
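Here is a minimal, self-contained sketch (plain Java, not OpenSearch code; both interfaces are hypothetical stand-ins) of the sequence as it works today: a single failed replication attempt is enough for the primary to ask the cluster manager to fail the replica copy with markAsStale, after which the write is acked with only the primary in sync.

```java
// Illustrative sketch only: models the current behavior described above.
public class ReplicationFailureSketch {

    /** Hypothetical stand-in for the cluster manager's shard-failure handling. */
    interface ClusterManager {
        // markAsStale mirrors the "markAsStale [true]" seen in the log above
        void failShard(String shardId, String allocationId, String reason, boolean markAsStale);
    }

    /** Hypothetical transport call to the replica that may throw on a network blip. */
    interface ReplicaTransport {
        void performOnReplica(String shardId, byte[] request) throws Exception;
    }

    static void replicate(String shardId, String allocationId, byte[] request,
                          ReplicaTransport replica, ClusterManager clusterManager) {
        try {
            replica.performOnReplica(shardId, request);
        } catch (Exception e) {
            // Today: one intermittent disconnect is enough to request that the
            // replica copy be failed and marked stale; the primary then acks the
            // write with only itself remaining in sync.
            clusterManager.failShard(shardId, allocationId,
                "failed to perform indices:data/write/bulk[s] on replica: " + e, true);
        }
    }
}
```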

Meanwhile, the follower check requests also seemed to fail intermittently, but since the leader's health checks are not coordinated with the critical request path, it is quite possible for a faulty primary to mark the replica as unhealthy before the leader can evict the faulty node from the cluster.

Other nodes were also witnessing node disconnections at this time; even earlier, the leader had seen some disconnections.

[2024-03-10T12:11:51,233][WARN ][o.o.g.G.InternalReplicaShardAllocator] [88a0ddc6bd07e28209f48cbbba4e6d35] [some-logs-2024.03.10][5]: failed to list shard for shard_store on node [rd2GjJh7QEu6fa_G71z5pw]
FailedNodeException[Failed node [rd2GjJh7QEu6fa_G71z5pw]]; nested: NodeDisconnectedException[[6c43b365208c71cce1b97f96b36faf8e][172.16.xx.xx:9300][internal:cluster/nodes/indices/shard/store[n]] disconnected];
    at org.opensearch.action.support.nodes.TransportNodesAction$AsyncAction.onFailure(TransportNodesAction.java:308)
    at org.opensearch.action.support.nodes.TransportNodesAction$AsyncAction$1.handleException(TransportNodesAction.java:282)
    at org.opensearch.transport.TransportService$6.handleException(TransportService.java:794)
    at org.opensearch.security.transport.SecurityInterceptor$RestoringTransportResponseHandler.handleException(SecurityInterceptor.java:316)
    at org.opensearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1414)
    at org.opensearch.transport.TransportService$9.run(TransportService.java:1266)
    at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:756)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
    at java.base/java.lang.Thread.run(Thread.java:833)
Caused by: NodeDisconnectedException[[6c43b365208c71cce1b97f96b36faf8e][172.16.xx.xx:9300][internal:cluster/nodes/indices/shard/store[n]] disconnected]

After some time the follower checks on the faulty node failed and it was evicted from the cluster (as lagging), causing failure of the primary copy as well.

[2024-03-10T12:11:49,345][INFO ][o.o.c.s.MasterService    ] [88a0ddc6bd07e28209f48cbbba4e6d35] node-left[{6c43b365208c71cce1b97f96b36faf8e}{rd2GjJh7QEu6fa_G71z5pw}{MAu_ryl5Temxr_FgaD6fBw}{172.16.xx.xx}{172.16.xx.xx:9300}{dir} reason: lagging], term: 927, version: 11778114, delta: removed {{6c43b365208c71cce1b97f96b36faf8e}{rd2GjJh7QEu6fa_G71z5pw}{MAu_ryl5Temxr_FgaD6fBw}{172.16.xx.xx}{172.16.xx.xx:9300}{dir}}

Related component

Cluster Manager

To Reproduce

  1. Go to '...'
  2. Click on '....'
  3. Scroll down to '....'
  4. See error

Expected behavior

Since replication requests and health checks happen fairly independently, it is possible for the leader to inaccurately deem a faulty node healthy and apply that faulty node's decision to fail a peer shard, just because the primary could not connect to the replica. Before marking a replica shard as faulty, the leader can perform an independent verification of the replica shard's health to rule out asymmetric failures, i.e. leader checks going through while replication requests fail. A sketch of such a check is shown below.
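As an illustration of that verification, here is a rough sketch with a hypothetical NodeHealthProbe standing in for whatever transport-level probe the cluster manager would use; none of these types are existing OpenSearch APIs.

```java
// Illustrative sketch only: gate a shard-failure request behind the leader's own probe.
import java.util.concurrent.CompletableFuture;

public class VerifiedShardFailureSketch {

    /** Hypothetical async probe from the cluster manager to the replica's node. */
    interface NodeHealthProbe {
        CompletableFuture<Boolean> isReachable(String nodeId);
    }

    /** Outcome of a shard-failure request sent by the primary. */
    enum Decision { FAIL_REPLICA, REJECT_REQUEST }

    static CompletableFuture<Decision> verifyThenFail(String replicaNodeId, NodeHealthProbe probe) {
        return probe.isReachable(replicaNodeId).thenApply(reachable -> {
            if (reachable) {
                // Asymmetric failure: the leader can reach the replica even though the
                // primary cannot. Reject the request instead of failing a healthy copy.
                return Decision.REJECT_REQUEST;
            }
            // The leader cannot reach the replica either: proceed with the shard
            // failure and let follower checks evict the node.
            return Decision.FAIL_REPLICA;
        });
    }
}
```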

Additional Details

Plugins
Please list all plugins currently enabled.

Screenshots
If applicable, add screenshots to help explain your problem.

Host/Environment (please complete the following information):

  • OS: [e.g. iOS]
  • Version [e.g. 22]

Additional context
Add any other context about the problem here.

@shwetathareja
Member

Thanks @Bukhtawar for filing this issue.

Before marking a replica shard as faulty, the leader can perform an independent verification of the replica shard's health to rule out asymmetric failures, i.e. leader checks going through while replication requests fail.

Let's say the replica shard node responds to the cluster manager successfully, so the cluster manager doesn't fail the shard. What happens to the primary at this point? Should it step down as the primary copy, since indexing requests will be blocked?
Also, this healthy-replica check has to consider other valid causes (false positives) where the cluster manager is able to connect to the replica node but the node hosting the primary was not able to, for a legitimate reason.

@Bukhtawar
Collaborator Author

Let's say the replica shard node responds to the cluster manager successfully, so the cluster manager doesn't fail the shard. What happens to the primary at this point? Should it step down as the primary copy, since indexing requests will be blocked?

Today the problem is that a single network blip between the primary and the replica node can cause the replica shard to fail, which in my opinion is too aggressive and counter-productive to overall cluster resiliency. Ideally, if the leader finds the replica to be healthy, it should just reject the shard-failure request, and the active primary should consider failing the request itself. We can then also look at strengthening the health checks and waiting until the follower checks on the faulty node evict the bad node. A rough sketch of that primary-side handling is below.
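For illustration, a rough sketch of how the primary could react if such a rejection existed; the response enum and listener interface are hypothetical, not current OpenSearch types.

```java
// Illustrative sketch only: primary-side handling of a rejected shard-failure request.
public class RejectedShardFailureSketch {

    /** Hypothetical reply from the cluster manager to a shard-failure request. */
    enum ShardFailureResponse { ACCEPTED, REJECTED_REPLICA_HEALTHY }

    /** Hypothetical listener for the in-flight write held on the primary. */
    interface WriteListener {
        void onAcked();
        void onFailure(Exception e);
    }

    static void onShardFailureResponse(ShardFailureResponse response, WriteListener listener) {
        switch (response) {
            case ACCEPTED:
                // Replica was failed; the write is acked with the remaining in-sync copies.
                listener.onAcked();
                break;
            case REJECTED_REPLICA_HEALTHY:
                // The leader sees the replica as healthy, so the connectivity problem is
                // likely on the primary's side: fail this request back to the client and
                // let follower checks decide whether the primary's node should be evicted.
                listener.onFailure(new IllegalStateException(
                    "replica reported healthy by cluster manager; failing the request instead"));
                break;
        }
    }
}
```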

Also, this healthy-replica check has to consider other valid causes (false positives) where the cluster manager is able to connect to the replica node but the node hosting the primary was not able to, for a legitimate reason.

The leader actually does asymmetric network checks today, but as you note this can still happen. In such situations we can consider a form of gossip, like Scuttlebutt, to ensure we cover more network paths beyond just follower-leader; a rough sketch of the idea follows.
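For illustration only, a minimal sketch of the Scuttlebutt-style idea: nodes keep version-stamped reachability observations and merge the freshest ones when they gossip, so the leader can learn about paths (e.g. primary to replica) it never probes directly. All names here are invented for the example.

```java
// Illustrative sketch only: version-stamped, gossiped reachability observations.
import java.util.HashMap;
import java.util.Map;

public class HealthGossipSketch {

    /** One node's latest observation of a peer, with a monotonically increasing version. */
    record Observation(boolean reachable, long version) {}

    /** observations.get(observer).get(target) = freshest known observation of target by observer. */
    final Map<String, Map<String, Observation>> observations = new HashMap<>();

    /** Record a local probe result, bumping the version for this (observer, target) pair. */
    void recordLocal(String observer, String target, boolean reachable) {
        Map<String, Observation> byTarget = observations.computeIfAbsent(observer, k -> new HashMap<>());
        long next = byTarget.getOrDefault(target, new Observation(true, 0L)).version() + 1;
        byTarget.put(target, new Observation(reachable, next));
    }

    /** Merge a digest received from a peer, keeping only observations with a newer version. */
    void merge(Map<String, Map<String, Observation>> remote) {
        remote.forEach((observer, byTarget) -> byTarget.forEach((target, obs) -> {
            Map<String, Observation> local = observations.computeIfAbsent(observer, k -> new HashMap<>());
            Observation current = local.get(target);
            if (current == null || obs.version() > current.version()) {
                local.put(target, obs);
            }
        }));
    }
}
```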
