[BUG] illegal state: trying to move shard from primary mode to replica mode (Index-type: remote_snapshot) #12339

etgraylog · 2024-02-15T17:39:14Z

Describe the bug

During a cluster restart, OpenSearch appears to attempt to relocate the primary shard of a remote_snapshot index and fails.

This might be an instance of the problem mentioned in #11563 (comment).

[2024-02-15T15:43:08,808][INFO ][o.o.c.r.a.a.BalancedShardsAllocator] [10.0.1.146] Swap relocation performed for shard [[index_5][0], node[ECzLzBEhTYmA58qyuEWNaQ], [R], s[STARTED], a[id=WGd5CZOZTf2-qD411BjkoQ]]
[2024-02-15T15:43:09,012][WARN ][o.o.i.c.IndicesClusterStateService] [10.0.1.146] [index_5][0] marking and sending shard failed due to [failed updating shard routing entry]
java.lang.IllegalArgumentException: illegal state: trying to move shard from primary mode to replica mode. Current [index_5][0], node[ECzLzBEhTYmA58qyuEWNaQ], [P], s[STARTED], a[id=WGd5CZOZTf2-qD411BjkoQ], new [index_5][0], node[ECzLzBEhTYmA58qyuEWNaQ], [R], s[STARTED], a[id=WGd5CZOZTf2-qD411BjkoQ]
	at org.opensearch.index.shard.IndexShard.updateShardState(IndexShard.java:597) ~[opensearch-2.11.1.jar:2.11.1]
	at org.opensearch.indices.cluster.IndicesClusterStateService.updateShard(IndicesClusterStateService.java:710) [opensearch-2.11.1.jar:2.11.1]
	at org.opensearch.indices.cluster.IndicesClusterStateService.createOrUpdateShards(IndicesClusterStateService.java:650) [opensearch-2.11.1.jar:2.11.1]
	at org.opensearch.indices.cluster.IndicesClusterStateService.applyClusterState(IndicesClusterStateService.java:293) [opensearch-2.11.1.jar:2.11.1]
	at org.opensearch.cluster.service.ClusterApplierService.callClusterStateAppliers(ClusterApplierService.java:606) [opensearch-2.11.1.jar:2.11.1]
	at org.opensearch.cluster.service.ClusterApplierService.callClusterStateAppliers(ClusterApplierService.java:593) [opensearch-2.11.1.jar:2.11.1]
	at org.opensearch.cluster.service.ClusterApplierService.applyChanges(ClusterApplierService.java:561) [opensearch-2.11.1.jar:2.11.1]
	at org.opensearch.cluster.service.ClusterApplierService.runTask(ClusterApplierService.java:484) [opensearch-2.11.1.jar:2.11.1]
	at org.opensearch.cluster.service.ClusterApplierService$UpdateTask.run(ClusterApplierService.java:186) [opensearch-2.11.1.jar:2.11.1]
	at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:849) [opensearch-2.11.1.jar:2.11.1]
	at org.opensearch.common.util.concurrent.PrioritizedOpenSearchThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedOpenSearchThreadPoolExecutor.java:282) [opensearch-2.11.1.jar:2.11.1]
	at org.opensearch.common.util.concurrent.PrioritizedOpenSearchThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedOpenSearchThreadPoolExecutor.java:245) [opensearch-2.11.1.jar:2.11.1]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
	at java.lang.Thread.run(Thread.java:833) [?:?]

This appears to leave a Replica shard perpetually in the state of INITIALIZING:

ubuntu@ip-10-0-252-254:~$ curl -s -XGET "http://******:*******@10.0.1.146:9200/_cat/recovery?active_only=true&v=true"
index          shard time  type stage source_host source_node target_host target_node repository snapshot files files_recovered files_percent files_total bytes bytes_recovered bytes_percent bytes_total translog_ops translog_ops_recovered translog_ops_percent
index_5 0     57.4m peer init  10.0.1.204  10.0.1.204  10.0.1.146  10.0.1.146  n/a        n/a      0     0               0.0%          0           0     0               0.0%          0           -1           0                      -1.0%
ubuntu@ip-10-0-252-254:~$
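A stuck recovery like the one above can also be detected programmatically. The following is a minimal sketch, assuming the rows come from `_cat/recovery?format=json` and that the `time` column uses suffixes such as `ms`, `s`, `m`, `h`, or `d`. The helper names and the 10-minute threshold are hypothetical, not part of OpenSearch.

```python
import re

# Multipliers for the duration suffixes assumed to appear in _cat output.
_UNIT_SECONDS = {"ms": 0.001, "s": 1.0, "m": 60.0, "h": 3600.0, "d": 86400.0}


def parse_duration(text):
    """Convert a _cat-style duration such as '57.4m' to seconds."""
    match = re.fullmatch(r"(\d+(?:\.\d+)?)(ms|s|m|h|d)", text.strip())
    if match is None:
        raise ValueError(f"unrecognised duration: {text!r}")
    value, unit = match.groups()
    return float(value) * _UNIT_SECONDS[unit]


def stuck_recoveries(rows, min_seconds=600.0):
    """Return recoveries still in the 'init' stage after min_seconds.

    The threshold is an arbitrary illustration value (hypothetical).
    """
    return [
        row
        for row in rows
        if row.get("stage") == "init"
        and parse_duration(row["time"]) >= min_seconds
    ]


# Sample row mirroring the output in this report.
rows = [
    {"index": "index_5", "shard": "0", "time": "57.4m",
     "type": "peer", "stage": "init"},
    {"index": "index_1", "shard": "0", "time": "1.2s",
     "type": "peer", "stage": "done"},
]

for row in stuck_recoveries(rows):
    print(row["index"], row["shard"], row["time"])
```

Only the first sample row is reported, since it has been in the `init` stage for longer than the threshold.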

Neither the allocation summary nor the allocation-explain API points to an obvious cause:

ubuntu@ip-10-0-252-254:~$ curl -s -XGET "http://******:*******@10.0.1.146:9200/_cat/allocation?v"
shards disk.indices disk.used disk.avail disk.total disk.percent host       ip         node
    16       29.4gb      16gb     32.2gb     48.2gb           33 10.0.1.6   10.0.1.6   10.0.1.6
    15       22.5gb       9gb     39.2gb     48.2gb           18 10.0.1.204 10.0.1.204 10.0.1.204
    17       36.7gb    16.1gb     32.1gb     48.2gb           33 10.0.1.146 10.0.1.146 10.0.1.146
ubuntu@ip-10-0-252-254:~$

ubuntu@ip-10-0-252-254:~$ curl -s -XGET "http://******:*******@10.0.1.146:9200/_cluster/allocation/explain?pretty"
{
  "error" : {
    "root_cause" : [
      {
        "type" : "illegal_argument_exception",
        "reason" : "unable to find any unassigned shards to explain [ClusterAllocationExplainRequest[useAnyUnassignedShard=true,includeYesDecisions?=false]"
      }
    ],
    "type" : "illegal_argument_exception",
    "reason" : "unable to find any unassigned shards to explain [ClusterAllocationExplainRequest[useAnyUnassignedShard=true,includeYesDecisions?=false]"
  },
  "status" : 400
}
ubuntu@ip-10-0-252-254:~$
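The error above is expected when no shard is unassigned: called without a body, the explain API looks for any unassigned shard, and the stuck replica is INITIALIZING rather than unassigned. The API also accepts a request body naming a specific shard copy through the `index`, `shard`, and `primary` fields. A minimal sketch of building such a request follows. The function name and base URL are hypothetical, and POST is used on the assumption that the endpoint accepts it alongside GET.

```python
import json
import urllib.request


def explain_request(base_url, index, shard, primary):
    """Build, without sending, an allocation-explain request for one shard copy."""
    body = json.dumps(
        {"index": index, "shard": shard, "primary": primary}
    ).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/_cluster/allocation/explain?pretty",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Ask about the replica copy of [index_5][0], the shard stuck in this report.
req = explain_request("http://localhost:9200", "index_5", 0, primary=False)
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))

# To send it against a real cluster (credentials omitted here):
# with urllib.request.urlopen(req) as resp:
#     print(resp.read().decode("utf-8"))
```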

Related component

Storage:Snapshots

To Reproduce

This isn't trivial to reproduce; something else seems to be involved in triggering the problem. These are the steps that led to the current state:

1. Create a multi-node OpenSearch cluster.
2. Index some data into an index.
3. Set up Searchable Snapshots and create one for the index from the previous step: https://opensearch.org/docs/latest/tuning-your-cluster/availability-and-recovery/snapshots/searchable_snapshot/#create-a-searchable-snapshot-index
4. Restart the OpenSearch cluster.
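The snapshot and restore steps can be sketched as the following REST calls. This assumes a snapshot repository is already registered and that at least one node has the `search` role, as the linked documentation describes. The repository name, snapshot name, and rename suffix are hypothetical placeholders. The `rename_pattern` and `rename_replacement` fields are included only so the restored index does not clash with the original one.

```python
import json


def searchable_snapshot_calls(repo, snapshot, index):
    """Return (method, path, body) tuples for the snapshot and restore steps."""
    return [
        # Take a snapshot of the source index and wait for it to finish.
        (
            "PUT",
            f"/_snapshot/{repo}/{snapshot}?wait_for_completion=true",
            {"indices": index, "include_global_state": False},
        ),
        # Restore it as a searchable snapshot (remote_snapshot) index.
        (
            "POST",
            f"/_snapshot/{repo}/{snapshot}/_restore",
            {
                "indices": index,
                "storage_type": "remote_snapshot",
                "rename_pattern": "(.+)",
                "rename_replacement": "$1_searchable",
            },
        ),
    ]


# "my-repo" and "snap-1" are placeholder names, not taken from this report.
for method, path, body in searchable_snapshot_calls("my-repo", "snap-1", "index_5"):
    print(method, path)
    print(json.dumps(body, indent=2))
```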

Expected behavior

The expected behavior is for all shards to be successfully recovered upon restart without operations that result in a Yellow-state (e.g. orphaned replica-shards).

Additional Details

Plugins

opensearch-alerting
opensearch-anomaly-detection
opensearch-asynchronous-search
opensearch-cross-cluster-replication
opensearch-custom-codecs
opensearch-geospatial
opensearch-index-management
opensearch-job-scheduler
opensearch-knn
opensearch-ml
opensearch-neural-search
opensearch-notifications
opensearch-notifications-core
opensearch-observability
opensearch-performance-analyzer
opensearch-reports-scheduler
opensearch-security
opensearch-security-analytics
opensearch-sql
prometheus-exporter
repository-s3

Host/Environment:

OS: Debian
Version: Bullseye
OpenSearch: 2.11.1

Additional context
This might be an instance of the problem mentioned in #11563 (comment).

andrross · 2024-02-15T17:57:48Z

Thanks @etgraylog! This does indeed look like the issue fixed by #11563. That fix is included in 2.12, which will be released in the coming week. Will you be able to pick up that release and test this?

etgraylog · 2024-02-15T19:05:21Z

> Thanks @etgraylog! This does indeed look like the issue fixed by #11563. That fix is included in 2.12, which will be released in the coming week. Will you be able to pick up that release and test this?

Thanks @andrross ! Certainly, I'll stay tuned 👍

etgraylog · 2024-02-22T03:34:33Z

> > Thanks @etgraylog! This does indeed look like the issue fixed by #11563. That fix is included in 2.12, which will be released in the coming week. Will you be able to pick up that release and test this?
>
> Thanks @andrross ! Certainly, I'll stay tuned 👍

With 2.12.0 I haven't been able to reproduce the issue so far. It seems the fix is working, @andrross. Thanks again!

etgraylog added the bug and untriaged labels Feb 15, 2024

github-actions bot added the Storage:Snapshots label Feb 15, 2024

andrross added the Search:Searchable Snapshots label Feb 15, 2024

github-project-automation bot added this to Search Project Board Feb 15, 2024

github-project-automation bot moved this to 🆕 New in Search Project Board Feb 15, 2024

andrross removed the untriaged and Storage:Snapshots labels Feb 15, 2024

etgraylog closed this as completed Feb 22, 2024

github-project-automation bot moved this from 🆕 New to ✅ Done in Search Project Board Feb 22, 2024
