[BUG] Generic threads exhausted when the number of ongoing concurrent node recoveries is higher than the threadpool size #14768
Labels
bug, Cluster Manager, Indexing:Replication
Describe the bug
The number of concurrent recoveries (both incoming and outgoing) that can happen on a node is controlled by the setting `cluster.routing.allocation.node_concurrent_recoveries`. As of today, the peer recovery process uses the generic threadpool on both the RecoverySourceHandler and the RecoveryTarget. While testing with a high value of `cluster.routing.allocation.node_concurrent_recoveries`, we ran into an issue where all 128 generic threads were in WAITING state; the thread dump is referenced below. This happened because of a cyclic dependency: the recovery process submits a task asynchronously to the same generic threadpool it is running on, and the submitting thread then blocks on `future.get()` until that task returns. Once every generic thread is blocked this way, the queued tasks can never be scheduled: every thread that could run them is itself parked waiting for one of them to finish. This is effectively a deadlock, and a high enough number of concurrent node recoveries can leave the cluster in a limbo state. In our case it manifested as the affected node still considering itself part of the cluster while the active cluster manager no longer did.

Thread dump showing generic threads in WAITING state:

Recovery process submitting a task from a generic thread to the generic threadpool:
OpenSearch/server/src/main/java/org/opensearch/indices/recovery/RecoverySourceHandler.java, lines 292 to 300 in ba9bdac
Generic threadpool:
OpenSearch/server/src/main/java/org/opensearch/threadpool/ThreadPool.java, lines 236 to 237 in ba9bdac
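To make the failure mode concrete, here is a minimal, self-contained sketch of the cycle (plain `java.util.concurrent`, not OpenSearch code; the pool size of 4 and the task bodies are hypothetical stand-ins): every pool thread submits follow-up work to the same bounded pool and then parks on `future.get()`, leaving no thread free to run the queued work.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class GenericPoolSelfDeadlock {

    public static void main(String[] args) {
        // Stand-in for the generic threadpool; the real pool topped out at
        // 128 threads in our test, 4 here to keep the demo small.
        ExecutorService generic = Executors.newFixedThreadPool(4);

        // Start as many "recoveries" as the pool has threads.
        for (int i = 0; i < 4; i++) {
            generic.submit(() -> {
                // The recovery submits follow-up work to the SAME pool...
                Future<?> inner = generic.submit(() -> {
                    // e.g. send a file chunk (hypothetical stand-in)
                });
                try {
                    // ...and then parks on future.get() until it finishes.
                    // With all pool threads parked here, the inner tasks can
                    // never be scheduled: the pool has deadlocked itself.
                    inner.get();
                } catch (InterruptedException | ExecutionException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        System.out.println("All pool threads will now park in future.get()");
    }
}
```

Run as written, `main` returns but the JVM never exits: all four pool threads stay parked in `future.get()`, mirroring the WAITING generic threads in the dump above.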
Related component
Cluster Manager
To Reproduce
Increase `cluster.routing.allocation.node_concurrent_recoveries` to a very high number, such as 1000, and start more concurrent node recoveries than the generic threadpool has threads (128 in our test). A sketch of the settings change follows.
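As a minimal sketch, assuming the transport-layer `ClusterUpdateSettingsRequest` from the OpenSearch codebase (an equivalent `PUT _cluster/settings` REST call works just as well; the class name `RaiseConcurrentRecoveries` is ours):

```java
import org.opensearch.action.admin.cluster.settings.ClusterUpdateSettingsRequest;
import org.opensearch.common.settings.Settings;

public class RaiseConcurrentRecoveries {

    public static void main(String[] args) {
        // Build a transient cluster-settings update raising the cap well
        // above the generic threadpool size, so recoveries can exhaust it.
        ClusterUpdateSettingsRequest request = new ClusterUpdateSettingsRequest()
            .transientSettings(
                Settings.builder()
                    .put("cluster.routing.allocation.node_concurrent_recoveries", 1000)
                    .build()
            );
        // In an actual reproduction, send the request through a client, e.g.
        // client.admin().cluster().updateSettings(request, listener), then
        // relocate enough shards to drive more than 128 concurrent recoveries.
        System.out.println(request.transientSettings());
    }
}
```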
Expected behavior
The deadlock should not happen: a recovery task should not block a generic thread while waiting on work that can only run on the same threadpool. One non-blocking alternative is sketched below.
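For contrast, one common way to break this class of cycle, shown here as a hedged sketch rather than the project's actual fix, is to hand off the follow-up work with a callback instead of blocking the submitting thread:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class NonBlockingHandoff {

    public static void main(String[] args) {
        ExecutorService generic = Executors.newFixedThreadPool(4);

        for (int i = 0; i < 4; i++) {
            // Chain a callback instead of calling future.get(): the
            // submitting thread returns to the pool immediately, so the pool
            // keeps making progress no matter how many tasks are in flight.
            CompletableFuture
                .runAsync(() -> { /* e.g. send a file chunk (hypothetical) */ }, generic)
                .whenComplete((ignored, error) -> System.out.println("chunk done"));
        }

        generic.shutdown(); // all tasks complete; the JVM exits normally
    }
}
```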
Additional Details
No response