[BUG] Scaling down nodePool doesn't reassign all shards #870

Closed
pbagona opened this issue Sep 11, 2024 · 4 comments

Labels: bug (Something isn't working)

pbagona commented Sep 11, 2024

What is the bug?

When scaling down a nodePool, the operator logs messages about draining the removed node, but after the drain finishes the cluster health status is red and some shards remain unassigned.

How can one reproduce the bug?

My current setup has 4 nodePools: master with 3 replicas (role master), nodes with 2 replicas and 300Gi storage each (roles data+ingest), ingests with 3 replicas and 100Gi storage each (role ingest), and data with 5 replicas and 1Ti storage each (role data). Scaling down the nodes nodePool introduces issues with shard allocation and the cluster health status.
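
For reference, the nodePools section of the manifest looks roughly like this (a rough sketch; the master pool's disk size and the resource requests are left out here):

  nodePools:
    - component: master
      replicas: 3
      roles:
        - "master"
    - component: nodes        # the pool being scaled down
      replicas: 2
      diskSize: "300Gi"
      roles:
        - "data"
        - "ingest"
    - component: ingests
      replicas: 3
      diskSize: "100Gi"
      roles:
        - "ingest"
    - component: data
      replicas: 5
      diskSize: "1Ti"
      roles:
        - "data"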

What is the expected behavior?

The expected behavior is that after the operator drains a node and decommissions it, the cluster health status is green.

What is your host/environment?

k8s v1.27.13, OpenSearch k8s operator 2.5.1, OpenSearch cluster 1.3.16

Do you have any screenshots?

Yes, screenshots are posted below.

Do you have any additional context?

The nodes and data nodePools existed first, then ingests was added, and the goal now is to remove the old nodes nodePool.

I used the same setup with an OpenSearch 2.x cluster on a different k8s cluster and it worked as expected: when a nodePool was removed, the operator drained the nodes of that nodePool one by one and removed them, there was no interruption to the service, and after it finished the cluster health status remained green.

When performing the same steps on an OpenSearch 1.3.16 cluster, the result is cluster health status red and some shards unable to allocate. Sometimes a single shard remains unallocated, sometimes more.

I tried removing the nodePool from the manifest all at once, and I also tried scaling it down by just one replica, but got the same outcome.

In the operator logs I see that it correctly waits for the node to drain and then decommissions it, but at that very moment the cluster goes into the red state and I see allocation errors.

When I add the removed nodePool/replica back to the manifest, the cluster status goes back to green once the pod is up and running, and everything behaves normally.

I tried this several times and got one of a few allocation errors every time.

Also, as seen in the screenshots below, before scaling down the nodes show 12.3gb of used storage under disk.indices. When one of the nodes in the nodePool is removed, the shards appear to be redistributed, but the disk.indices value stays the same for all remaining nodes (or changes only minimally) and does not account for the 12.3gb that should have been relocated to them. And when the nodePool is scaled back up to its original size and the recreated pod mounts its old PV, everything goes back to the normal green state.

{"level":"debug","ts":"2024-09-11T17:39:26.493Z","logger":"events","msg":"Start to Exclude int2-opensearch/int2-opensearch","type":"Normal","object":{"kind":"OpenSearchCluster","namespace":"int2-opensearch","name":"int2-opensearch","uid":"4b093d1c-5644-411a-a010-af0c78faf969","apiVersion":"opensearch.opster.io/v1","resourceVersion":"173437926"},"reason":"Scaler"}
{"level":"info","ts":"2024-09-11T17:39:26.531Z","msg":"Group: nodes, Node int2-opensearch-nodes-1 is drained","controller":"opensearchcluster","controllerGroup":"opensearch.opster.io","controllerKind":"OpenSearchCluster","OpenSearchCluster":{"name":"int2-opensearch","namespace":"int2-opensearch"},"namespace":"int2-opensearch","name":"int2-opensearch","reconcileID":"b99653cd-8ca2-46c7-ba4a-b558f966345a"}
{"level":"info","ts":"2024-09-11T17:39:26.546Z","msg":"Reconciling OpenSearchCluster","controller":"opensearchcluster","controllerGroup":"opensearch.opster.io","controllerKind":"OpenSearchCluster","OpenSearchCluster":{"name":"int2-opensearch","namespace":"int2-opensearch"},"namespace":"int2-opensearch","name":"int2-opensearch","reconcileID":"93813285-fb36-4c9e-b2d1-1b2d08e28df5","cluster":{"name":"int2-opensearch","namespace":"int2-opensearch"}}
...
...
{"level":"debug","ts":"2024-09-11T17:39:26.637Z","logger":"events","msg":"Start to Drain int2-opensearch/int2-opensearch","type":"Normal","object":{"kind":"OpenSearchCluster","namespace":"int2-opensearch","name":"int2-opensearch","uid":"4b093d1c-5644-411a-a010-af0c78faf969","apiVersion":"opensearch.opster.io/v1","resourceVersion":"173438036"},"reason":"Scaler"}
{"level":"debug","ts":"2024-09-11T17:39:26.637Z","logger":"events","msg":"Start to decreaseing node int2-opensearch-nodes-1 on nodes ","type":"Normal","object":{"kind":"OpenSearchCluster","namespace":"int2-opensearch","name":"int2-opensearch","uid":"4b093d1c-5644-411a-a010-af0c78faf969","apiVersion":"opensearch.opster.io/v1","resourceVersion":"173438036"},"reason":"Scaler"}

Cluster health status

{
  "cluster_name" : "int2-opensearch",
  "status" : "red",
  "timed_out" : false,
  "number_of_nodes" : 12,
  "number_of_data_nodes" : 6,
  "discovered_master" : true,
  "active_primary_shards" : 92,
  "active_shards" : 183,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 1,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 99.45652173913044
}
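
(The output above is from the _cluster/health API; a minimal sketch of the query, assuming the cluster is reachable on localhost:9200 and using placeholder credentials:)

curl -sk -u admin:<password> "https://localhost:9200/_cluster/health?pretty"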

Allocation before change

[screenshot]

Example of allocation after change

[screenshots]
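
(The per-node shard counts and disk.indices values in the screenshots come from the _cat/allocation API; a sketch of the query, with endpoint and credentials as placeholders:)

curl -sk -u admin:<password> "https://localhost:9200/_cat/allocation?v&h=node,shards,disk.indices,disk.used,disk.avail"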

Example of unallocated shard explanation

{
  "index" : "***********",
  "shard" : 1,
  "primary" : true,
  "current_state" : "unassigned",
  "unassigned_info" : {
    "reason" : "NODE_LEFT",
    "at" : "2024-09-11T18:04:19.246Z",
    "details" : "node_left [pDXfgCn9TQuRA5bGR1DKPw]",
    "last_allocation_status" : "no_valid_shard_copy"
  },
  "can_allocate" : "no_valid_shard_copy",
  "allocate_explanation" : "cannot allocate because all found copies of the shard are either stale or corrupt",
  "node_allocation_decisions" : [

EDIT:
I tried it again to collect more information and noticed the following: when I scale down the nodes nodePool from 2 to 1 replicas, the operator goes from int2-opensearch-nodes-0, int2-opensearch-nodes-1 to just int2-opensearch-nodes-0 and drains node int2-opensearch-nodes-1. During this process, some shards are reallocated to the node that is being drained; the pod is then terminated and removed from the cluster, and the operator logs are as posted above.

[screenshot]
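
(To double-check whether the drained node is actually excluded from allocation during this phase, one can inspect the cluster settings while the drain is in progress; a sketch, assuming the operator uses the standard allocation exclude setting:)

curl -sk -u admin:<password> "https://localhost:9200/_cluster/settings?flat_settings=true&pretty"
# expected to show something like "cluster.routing.allocation.exclude._name" : "int2-opensearch-nodes-1" while the drain is running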

pbagona added the bug (Something isn't working) and untriaged (Issues that have not yet been triaged) labels on Sep 11, 2024
prudhvigodithi removed the untriaged (Issues that have not yet been triaged) label on Sep 12, 2024
prudhvigodithi (Member) commented:

[Triage]
Thanks @pbagona for the detailed description. I assume this has something to do with the 1.3.16 version of OpenSearch (since, as you mentioned, it works with 2.x). Also, since 1.3.x is only in maintenance mode, I would recommend using the latest 2.x version of OpenSearch.

Also, when the state is red, have you tried scaling down to zero and then scaling back up (or a fresh restart)?

Thank you
@swoehrl-mw @getsaurabh02

swoehrl-mw (Collaborator) commented:

I concur with @prudhvigodithi here. This looks like a problem with OpenSearch itself. From your description, OpenSearch is not able to correctly recover some shards if one of the replicas is removed. Since the 1.x version is no longer being developed, it does not make sense to implement special logic for this in the operator.

pbagona (Author) commented Oct 2, 2024

@prudhvigodithi I tried a fresh restart and a scale down & up, and the status stayed the same.

Thanks for the help and for confirming where the issue lies. We are running 2.x where we can, but on some k8s clusters we need to keep 1.x due to application compatibility. A new application version that works with OpenSearch 2.x should be available next year, so at least we now have another reason to push for it ASAP.

For now I have asked the team to be extra careful when scaling down and removing replicas.

Should I close the issue with "Close as not planned"?

prudhvigodithi (Member) commented:

Thanks for the confirmation @pbagona. I will close this issue for now; please feel free to comment or re-open if required.
@swoehrl-mw @getsaurabh02.
