
[BUG] improper shard distribution in search nodes after scaling. #14747

Closed
Dileep-Dora opened this issue Jul 15, 2024 · 2 comments

Comments

@Dileep-Dora

Describe the bug

When adding search nodes to an existing cluster, the expectation is that all search nodes end up with an equal number of shards, but this does not happen. We had to delete all the indices and restore them all at once to get an equal distribution, which affects the availability of the overall service.

Related component

Search:Searchable Snapshots

To Reproduce

  1. Create a cluster with search nodes.
  2. Ingest some data into 2-3 indices.
  3. Take snapshots.
  4. Delete the indices and restore them on the search nodes (see the sketch after this list).
  5. Add extra search nodes and verify the shard distribution.
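
As a concrete illustration of step 4, here is a minimal sketch of restoring a snapshot as a searchable snapshot through the REST API, using Java's built-in HTTP client. The host, repository, and snapshot names are placeholders; the "storage_type": "remote_snapshot" restore option is the documented flag that allocates the restored shards to search nodes.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RestoreAsSearchableSnapshot {
    public static void main(String[] args) throws Exception {
        // Placeholders: adjust the host, repository, and snapshot names for your cluster.
        String url = "http://localhost:9200/_snapshot/my-repo/my-snapshot/_restore";
        // "remote_snapshot" asks OpenSearch to restore the indices as searchable
        // snapshots, so their shards are allocated to search nodes.
        String body = "{ \"storage_type\": \"remote_snapshot\" }";

        HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create(url))
            .header("Content-Type", "application/json")
            .POST(HttpRequest.BodyPublishers.ofString(body))
            .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
            .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}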

Expected behavior

After adding extra search nodes, all search nodes should hold an equal number of shards.

Additional Details

OpenSearch version: 2.13

@kkhatua (Member) commented Jul 24, 2024

Adding new nodes will not necessarily move shards from the older nodes to the new nodes to rebalance. This is because moving shards is essentially a peer recovery activity and consumes resources that could be critical to traffic.

Lucene shards tend to be sticky in nature and will stay on a node as long as there are no violations.

When users add nodes, it might be because the cluster already has a high shard count or high disk usage. In the latter scenario, adding new nodes will cause any old nodes exceeding the disk threshold to relocate shards to nodes with more free disk space (in this case, the new nodes).

A simple way to achieve a balanced cluster after adding new nodes is to explicitly set cluster.routing.allocation.total_shards_per_node to a value that approximates the average shard count per node given the larger cluster's node count.
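
As a rough sketch of that suggestion (the host, total shard count, and node count below are hypothetical placeholders), the target value can be derived from the cluster's shard total and its post-scaling node count, then applied through the standard _cluster/settings endpoint:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class SetShardsPerNode {
    public static void main(String[] args) throws Exception {
        int totalShards = 120; // hypothetical: total shards in the cluster
        int nodeCount = 6;     // hypothetical: node count after scaling out
        // Ceiling of the average, so the cap never blocks a legitimate allocation.
        int target = (totalShards + nodeCount - 1) / nodeCount;

        String body = "{ \"persistent\": { "
            + "\"cluster.routing.allocation.total_shards_per_node\": " + target + " } }";

        HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create("http://localhost:9200/_cluster/settings"))
            .header("Content-Type", "application/json")
            .PUT(HttpRequest.BodyPublishers.ofString(body))
            .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
            .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());
    }
}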

There has been discussion of making this automatic, but the resource cost of shard migration is why it has not been automated. A monitoring system, however, could poll the _cat/allocation API and, based on the observed skew, periodically update this threshold to achieve the same effect.
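
A minimal sketch of such a monitor, assuming the plain-text _cat/allocation output with hand-picked columns; the skew tolerance of 2 shards and the localhost endpoint are assumptions, not recommendations:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class AllocationSkewCheck {
    public static void main(String[] args) throws Exception {
        // Ask the cat API for just the shard count and node name columns.
        HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create("http://localhost:9200/_cat/allocation?h=shards,node"))
            .GET()
            .build();
        String body = HttpClient.newHttpClient()
            .send(request, HttpResponse.BodyHandlers.ofString())
            .body();

        int min = Integer.MAX_VALUE, max = 0, total = 0, nodes = 0;
        for (String line : body.split("\n")) {
            String[] cols = line.trim().split("\\s+");
            if (cols.length < 2 || "UNASSIGNED".equals(cols[1])) {
                continue; // skip blank lines and the unassigned-shards row
            }
            int shards = Integer.parseInt(cols[0]);
            min = Math.min(min, shards);
            max = Math.max(max, shards);
            total += shards;
            nodes++;
        }
        if (nodes == 0) {
            return;
        }
        int avg = (total + nodes - 1) / nodes; // ceiling of the average
        if (max - min > 2) {                   // assumed skew tolerance
            System.out.println("Skewed: consider setting total_shards_per_node to " + avg);
        } else {
            System.out.println("Balanced enough (min=" + min + ", max=" + max + ")");
        }
    }
}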

(Closing this as it is working as designed)

kkhatua closed this as completed Jul 24, 2024
@jed326 (Collaborator) commented Jul 24, 2024

Also, for Searchable Snapshots specifically, we balance only by the average primary shard count:

/**
 * Performs heuristic, naive weight-based balancing for remote shards within the cluster by using average nodes per
 * cluster as the metric for shard distribution.
 * It does so without accounting for the local shards located on any nodes within the cluster.
 */
@Override
void balance() {
    List<RoutingNode> remoteRoutingNodes = getRemoteRoutingNodes();
    logger.trace("Performing balancing for remote shards.");
    if (remoteRoutingNodes.isEmpty()) {
        logger.debug("No eligible remote nodes found to perform balancing");
        return;
    }

    final Map<String, Integer> nodePrimaryShardCount = calculateNodePrimaryShardCount(remoteRoutingNodes);
    int totalPrimaryShardCount = nodePrimaryShardCount.values().stream().reduce(0, Integer::sum);
    totalPrimaryShardCount += routingNodes.unassigned().getNumPrimaries();
    int avgPrimaryPerNode = (totalPrimaryShardCount + routingNodes.size() - 1) / routingNodes.size();

    ArrayDeque<RoutingNode> sourceNodes = new ArrayDeque<>();
    ArrayDeque<RoutingNode> targetNodes = new ArrayDeque<>();
    for (RoutingNode node : remoteRoutingNodes) {
        if (nodePrimaryShardCount.get(node.nodeId()) > avgPrimaryPerNode) {
            sourceNodes.add(node);
        } else if (nodePrimaryShardCount.get(node.nodeId()) < avgPrimaryPerNode) {
            targetNodes.add(node);
        }
    }

    while (sourceNodes.isEmpty() == false && targetNodes.isEmpty() == false) {
        RoutingNode sourceNode = sourceNodes.poll();
        tryRebalanceNode(sourceNode, targetNodes, avgPrimaryPerNode, nodePrimaryShardCount);
    }
}
