[BUG] illegal state: trying to move shard from primary mode to replica mode (Index-type: remote_snapshot) #12339

Closed
etgraylog opened this issue Feb 15, 2024 · 3 comments
Labels: bug (Something isn't working), Search:Searchable Snapshots

Comments

@etgraylog

Describe the bug

During a restart, OpenSearch appears to attempt to relocate the primary shard of a remote_snapshot index, and the attempt fails.

This might be an instance of the problem mentioned in #11563 (comment).

[2024-02-15T15:43:08,808][INFO ][o.o.c.r.a.a.BalancedShardsAllocator] [10.0.1.146] Swap relocation performed for shard [[index_5][0], node[ECzLzBEhTYmA58qyuEWNaQ], [R], s[STARTED], a[id=WGd5CZOZTf2-qD411BjkoQ]]
[2024-02-15T15:43:09,012][WARN ][o.o.i.c.IndicesClusterStateService] [10.0.1.146] [index_5][0] marking and sending shard failed due to [failed updating shard routing entry]
java.lang.IllegalArgumentException: illegal state: trying to move shard from primary mode to replica mode. Current [index_5][0], node[ECzLzBEhTYmA58qyuEWNaQ], [P], s[STARTED], a[id=WGd5CZOZTf2-qD411BjkoQ], new [index_5][0], node[ECzLzBEhTYmA58qyuEWNaQ], [R], s[STARTED], a[id=WGd5CZOZTf2-qD411BjkoQ]
	at org.opensearch.index.shard.IndexShard.updateShardState(IndexShard.java:597) ~[opensearch-2.11.1.jar:2.11.1]
	at org.opensearch.indices.cluster.IndicesClusterStateService.updateShard(IndicesClusterStateService.java:710) [opensearch-2.11.1.jar:2.11.1]
	at org.opensearch.indices.cluster.IndicesClusterStateService.createOrUpdateShards(IndicesClusterStateService.java:650) [opensearch-2.11.1.jar:2.11.1]
	at org.opensearch.indices.cluster.IndicesClusterStateService.applyClusterState(IndicesClusterStateService.java:293) [opensearch-2.11.1.jar:2.11.1]
	at org.opensearch.cluster.service.ClusterApplierService.callClusterStateAppliers(ClusterApplierService.java:606) [opensearch-2.11.1.jar:2.11.1]
	at org.opensearch.cluster.service.ClusterApplierService.callClusterStateAppliers(ClusterApplierService.java:593) [opensearch-2.11.1.jar:2.11.1]
	at org.opensearch.cluster.service.ClusterApplierService.applyChanges(ClusterApplierService.java:561) [opensearch-2.11.1.jar:2.11.1]
	at org.opensearch.cluster.service.ClusterApplierService.runTask(ClusterApplierService.java:484) [opensearch-2.11.1.jar:2.11.1]
	at org.opensearch.cluster.service.ClusterApplierService$UpdateTask.run(ClusterApplierService.java:186) [opensearch-2.11.1.jar:2.11.1]
	at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:849) [opensearch-2.11.1.jar:2.11.1]
	at org.opensearch.common.util.concurrent.PrioritizedOpenSearchThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedOpenSearchThreadPoolExecutor.java:282) [opensearch-2.11.1.jar:2.11.1]
	at org.opensearch.common.util.concurrent.PrioritizedOpenSearchThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedOpenSearchThreadPoolExecutor.java:245) [opensearch-2.11.1.jar:2.11.1]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
	at java.lang.Thread.run(Thread.java:833) [?:?]
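
To make the exception message easier to read, here is a minimal sketch that parses the two routing entries it contains and reports which fields differ. The regular expression is only an assumption based on the format of the log line above, and `diff_routing` is a hypothetical helper, not part of OpenSearch.

```python
import re

# Pattern for one shard routing entry as printed in the exception message, e.g.
# "[index_5][0], node[<node id>], [P], s[STARTED], a[id=<allocation id>]".
# Assumption: inferred from the log line in this report, not from OpenSearch source.
ROUTING_RE = re.compile(
    r"\[(?P<index>[^\]]+)\]\[(?P<shard>\d+)\], "
    r"node\[(?P<node>[^\]]+)\], "
    r"\[(?P<role>[PR])\], "
    r"s\[(?P<state>[A-Z_]+)\], "
    r"a\[id=(?P<allocation_id>[^\]]+)\]"
)


def diff_routing(message):
    """Return {field: (current, new)} for fields that differ between the two entries."""
    entries = [m.groupdict() for m in ROUTING_RE.finditer(message)]
    if len(entries) != 2:
        raise ValueError("expected exactly two routing entries (current and new)")
    current, new = entries
    return {key: (current[key], new[key]) for key in current if current[key] != new[key]}


# Exception message copied from the log above.
MESSAGE = (
    "illegal state: trying to move shard from primary mode to replica mode. "
    "Current [index_5][0], node[ECzLzBEhTYmA58qyuEWNaQ], [P], s[STARTED], "
    "a[id=WGd5CZOZTf2-qD411BjkoQ], "
    "new [index_5][0], node[ECzLzBEhTYmA58qyuEWNaQ], [R], s[STARTED], "
    "a[id=WGd5CZOZTf2-qD411BjkoQ]"
)

print(diff_routing(MESSAGE))  # {'role': ('P', 'R')}
```

The node and the allocation id are identical in both entries. Only the primary/replica flag changes, which matches the "Swap relocation performed" message logged just before the failure.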

This appears to leave a replica shard stuck in the INITIALIZING state indefinitely:

ubuntu@ip-10-0-252-254:~$ curl -s -XGET "http://******:*******@10.0.1.146:9200/_cat/recovery?active_only=true&v=true"
index   shard time  type stage source_host source_node target_host target_node repository snapshot files files_recovered files_percent files_total bytes bytes_recovered bytes_percent bytes_total translog_ops translog_ops_recovered translog_ops_percent
index_5 0     57.4m peer init  10.0.1.204  10.0.1.204  10.0.1.146  10.0.1.146  n/a        n/a      0     0               0.0%          0           0     0               0.0%          0           -1           0                      -1.0%
ubuntu@ip-10-0-252-254:~$
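
A stuck recovery like the one above could be flagged automatically. The following is a rough sketch, assuming the rows come from `_cat/recovery?active_only=true&format=json`, so that each row is a dict keyed by the column names shown above. `parse_duration` and `stuck_recoveries` are hypothetical helpers, the 10-minute threshold is arbitrary, and only the common time suffixes are handled.

```python
import re

# Seconds per unit for the time suffixes handled here (assumption: covers the
# values seen in practice, such as "57.4m"; other suffixes raise ValueError).
_UNIT_SECONDS = {"ms": 0.001, "s": 1.0, "m": 60.0, "h": 3600.0, "d": 86400.0}


def parse_duration(text):
    """Convert a duration string such as '57.4m' into seconds."""
    match = re.fullmatch(r"(\d+(?:\.\d+)?)(ms|s|m|h|d)", text.strip())
    if match is None:
        raise ValueError(f"unrecognized duration: {text!r}")
    return float(match.group(1)) * _UNIT_SECONDS[match.group(2)]


def stuck_recoveries(rows, min_seconds=600.0):
    """Return recoveries still in the 'init' stage after at least min_seconds."""
    return [
        row
        for row in rows
        if row.get("stage") == "init" and parse_duration(row["time"]) >= min_seconds
    ]


# Row taken from the output above, reduced to the fields the check uses.
rows = [
    {"index": "index_5", "shard": "0", "time": "57.4m", "type": "peer", "stage": "init"},
]

for row in stuck_recoveries(rows):
    print(f"{row['index']}[{row['shard']}] has been in stage init for {row['time']}")
```

With the row from this report, the check flags `index_5[0]`, which has been in the init stage for 57.4 minutes with no files or bytes recovered.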

Neither the allocation overview nor the allocation explain API shows an obvious cause:

ubuntu@ip-10-0-252-254:~$ curl -s -XGET "http://******:*******@10.0.1.146:9200/_cat/allocation?v"
shards disk.indices disk.used disk.avail disk.total disk.percent host       ip         node
    16       29.4gb      16gb     32.2gb     48.2gb           33 10.0.1.6   10.0.1.6   10.0.1.6
    15       22.5gb       9gb     39.2gb     48.2gb           18 10.0.1.204 10.0.1.204 10.0.1.204
    17       36.7gb    16.1gb     32.1gb     48.2gb           33 10.0.1.146 10.0.1.146 10.0.1.146
ubuntu@ip-10-0-252-254:~$

ubuntu@ip-10-0-252-254:~$ curl -s -XGET "http://******:*******@10.0.1.146:9200/_cluster/allocation/explain?pretty"
{
  "error" : {
    "root_cause" : [
      {
        "type" : "illegal_argument_exception",
        "reason" : "unable to find any unassigned shards to explain [ClusterAllocationExplainRequest[useAnyUnassignedShard=true,includeYesDecisions?=false]"
      }
    ],
    "type" : "illegal_argument_exception",
    "reason" : "unable to find any unassigned shards to explain [ClusterAllocationExplainRequest[useAnyUnassignedShard=true,includeYesDecisions?=false]"
  },
  "status" : 400
}
ubuntu@ip-10-0-252-254:~$

Related component

Storage:Snapshots

To Reproduce

This isn't trivial to reproduce; something else seems to be involved in triggering the problem. These are the steps that led to the current state:

  1. Create a multi-node OpenSearch cluster.
  2. Index some data into an index.
  3. Set up Searchable Snapshots and create one for the index from the previous step:
    https://opensearch.org/docs/latest/tuning-your-cluster/availability-and-recovery/snapshots/searchable_snapshot/#create-a-searchable-snapshot-index
  4. Restart the OpenSearch cluster.
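
Steps 2 and 3 can be sketched as a sequence of REST requests. This is only an illustration under several assumptions: a snapshot repository is already registered, the cluster has a node configured for searchable snapshots as the linked documentation describes, and the index, repository, and snapshot names are placeholders. `build_repro_requests` is a hypothetical helper that builds the requests without sending them. The rename parameters are there only so the restored index does not collide with the original.

```python
def build_repro_requests(index, repository, snapshot):
    """Build (method, path, body) tuples for steps 2 and 3; nothing is sent."""
    return [
        # Step 2: index a sample document.
        ("POST", f"/{index}/_doc?refresh=true", {"message": "sample document"}),
        # Step 3a: take a snapshot of the index.
        (
            "PUT",
            f"/_snapshot/{repository}/{snapshot}?wait_for_completion=true",
            {"indices": index},
        ),
        # Step 3b: restore it as a searchable snapshot (remote_snapshot) index.
        (
            "POST",
            f"/_snapshot/{repository}/{snapshot}/_restore",
            {
                "indices": index,
                "storage_type": "remote_snapshot",
                "rename_pattern": "(.+)",
                "rename_replacement": "$1_searchable",
            },
        ),
    ]


# Placeholder names, not taken from the report.
for method, path, body in build_repro_requests("my-index", "my-repository", "my-snapshot"):
    print(method, path, body)
```

Step 4, the cluster restart, happens outside the REST API and is not covered by the sketch.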

Expected behavior

All shards should recover successfully after a restart, without operations that leave the cluster in a Yellow state (e.g. orphaned replica shards).

Additional Details

Plugins
Please list all plugins currently enabled.

opensearch-alerting
opensearch-anomaly-detection
opensearch-asynchronous-search
opensearch-cross-cluster-replication
opensearch-custom-codecs
opensearch-geospatial
opensearch-index-management
opensearch-job-scheduler
opensearch-knn
opensearch-ml
opensearch-neural-search
opensearch-notifications
opensearch-notifications-core
opensearch-observability
opensearch-performance-analyzer
opensearch-reports-scheduler
opensearch-security
opensearch-security-analytics
opensearch-sql
prometheus-exporter
repository-s3

Host/Environment:

  • OS: Debian
  • Version: Bullseye
  • OpenSearch: 2.11.1

Additional context
This might be an instance of the problem mentioned in #11563 (comment).

@etgraylog added the bug and untriaged labels on Feb 15, 2024
@andrross
Member

Thanks @etgraylog! This does indeed look like the issue fixed by #11563. That fix is included in 2.12, which will be released in the coming week. Will you be able to pick up that release and test this?

@etgraylog
Author

Thanks @andrross ! Certainly, I'll stay tuned 👍

@etgraylog
Author

With 2.12.0 I haven't been able to reproduce the issue so far. It seems the fix is working, @andrross. Thanks again!

@github-project-automation github-project-automation bot moved this from 🆕 New to ✅ Done in Search Project Board Feb 22, 2024