[BUG] Node concurrent recoveries settings not being honoured. #13702
SwethaGuptha pushed a commit to SwethaGuptha/OpenSearch that referenced this issue on May 29, 2024 (…arch-project#13702). Signed-off-by: Swetha Guptha <[email protected]>
github-project-automation bot moved this from 👀 In review to ✅ Done in Cluster Manager Project Board on Jun 13, 2024.
Describe the bug
The default/updated concurrent recovery settings (node_concurrent_recoveries, node_initial_primaries_recoveries) are not being honored and have no effect on recovery speed for clusters with batch mode enabled.
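For reference, these are dynamic cluster settings, normally applied through the cluster settings API with a request body like the following (the values shown are illustrative, not recommendations):

```json
{
  "persistent": {
    "cluster.routing.allocation.node_concurrent_recoveries": 2,
    "cluster.routing.allocation.node_initial_primaries_recoveries": 4
  }
}
```

With batch mode enabled, updating these values has no observable effect on the number of simultaneous recoveries per node, which is the bug described below.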
This is happening because of the way we allocate unassigned shards in a batch. For a batch:
OpenSearch/server/src/main/java/org/opensearch/gateway/BaseGatewayShardAllocator.java
Lines 89 to 113 in da3ab92
Because the decider execution and the shard-status update do not happen together for each shard, the cluster state does not change while the deciders run over the unassigned shards. ThrottlingAllocationDecider reads the cluster state to decide whether a shard recovery can be started on a node, comparing the ongoing recoveries on that node against the configured recovery settings (node_concurrent_recoveries, node_initial_primaries_recoveries). So when the allocation decision runs for all shards in a batch at once, the decider does not account for the decisions already made for other shards in the same batch, and all shards end up being initialized at once.
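The effect of evaluating every shard against the same stale cluster state can be sketched with a small, self-contained simulation. This is not OpenSearch code; the class, method names, and the two-counter model are illustrative assumptions that mimic the throttling check described above:

```java
// Hypothetical simulation of the throttling behavior, not actual OpenSearch code.
public class ThrottleSim {
    // Assumed stand-in for cluster.routing.allocation.node_concurrent_recoveries.
    static final int NODE_CONCURRENT_RECOVERIES = 2;

    // Modeled after the decider's check: throttle once the node has reached
    // the configured number of ongoing recoveries.
    static boolean canRecover(int ongoingRecoveriesOnNode) {
        return ongoingRecoveriesOnNode < NODE_CONCURRENT_RECOVERIES;
    }

    // Batch path (the bug): every shard is evaluated against the same
    // snapshot of cluster state, so no decision sees the earlier ones.
    static int allocateBatch(int shards) {
        int snapshotOngoing = 0; // state is not updated between decisions
        int initialized = 0;
        for (int i = 0; i < shards; i++) {
            if (canRecover(snapshotOngoing)) {
                initialized++;
            }
        }
        return initialized;
    }

    // Per-shard path (expected): each decision observes the shards
    // already initialized before it.
    static int allocateSequential(int shards) {
        int ongoing = 0;
        int initialized = 0;
        for (int i = 0; i < shards; i++) {
            if (canRecover(ongoing)) {
                ongoing++;
                initialized++;
            }
        }
        return initialized;
    }

    public static void main(String[] args) {
        // Batch mode initializes all 5 shards despite the limit of 2;
        // the per-shard path respects the limit.
        System.out.println("batch=" + allocateBatch(5));
        System.out.println("sequential=" + allocateSequential(5));
    }
}
```

With a limit of 2 and 5 unassigned shards, the batch path initializes all 5 while the per-shard path stops at 2, matching the observed behavior.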
Logs indicating the same:
Related component
Cluster Manager
To Reproduce
```java
logger.debug(
    "ThrottlingAllocationDecider decision, throttle: [{}] primary recovery limit [{}],"
        + " primaries in recovery [{}] invoked for [{}] on node [{}]",
    primariesInRecovery >= primariesInitialRecoveries,
    primariesInitialRecoveries,
    primariesInRecovery,
    shardRouting,
    node.node()
);
```
Expected behavior
The number of ongoing shard recoveries on a node should adhere to the node concurrent recovery settings.
Additional Details
OpenSearch Version: 2.14