[BUG] Restore of Snapshot taken from a 3 node cluster fails. #14481

manoj0598 · 2024-06-12T09:26:55Z

Describe the bug

We have a 3-node OpenSearch cluster. When I take a snapshot of the opensearch and opendistro indices and then check it, the snapshot reports no failed shards:

{"snapshots":[{"snapshot":"2","uuid":"vCkI0dAMTxqmOPE3yF_Rfw","version_id":136357827,"version":"2.14.0","remote_store_index_shallow_copy":false,"indices":[".opendistro-alerting-alert-history-2024.06.11-1",".opensearch-notifications-config",".opendistro-alerting-alerts",".kibana_1",".opendistro-reports-instances",".opendistro-alerting-config",".opendistro-reports-definitions"],"data_streams":[],"include_global_state":false,"state":"SUCCESS","start_time":"2024-06-11T07:33:00.126Z","start_time_in_millis":1718091180126,"end_time":"2024-06-11T07:33:00.326Z","end_time_in_millis":1718091180326,"duration_in_millis":200,"failures":[],"shards":{"total":7,"failed":0,"successful":7}}]}

However, when I try to restore the same snapshot on the same 3 node cluster, I see that some shards have failed.
{"snapshot":{"snapshot":"2","indices":[".opendistro-alerting-alert-history-2024.06.11-1",".opendistro-alerting-alerts",".opendistro-reports-definitions",".kibana_1",".opendistro-reports-instances",".opensearch-notifications-config",".opendistro-alerting-config"],"shards":{"total":7,"failed":3,"successful":4}}}
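For anyone triaging a similar report, the shard counts in the two responses above can be compared programmatically. This is only a minimal Python sketch and not part of the original report: the helper names `shard_summary` and `has_failed_shards` are made up for illustration, and the sample body is an abbreviated copy of the restore response above (the index list is shortened).

```python
import json


def shard_summary(response_text):
    """Map snapshot name -> shard counts for a snapshot or restore response body."""
    body = json.loads(response_text)
    # A restore response nests a single object under "snapshot";
    # a get-snapshot response returns a list under "snapshots".
    if "snapshot" in body:
        entries = [body["snapshot"]]
    else:
        entries = body.get("snapshots", [])
    return {entry["snapshot"]: entry["shards"] for entry in entries}


def has_failed_shards(response_text):
    """True if any snapshot in the response reports at least one failed shard."""
    return any(s["failed"] > 0 for s in shard_summary(response_text).values())


# Abbreviated copy of the restore response quoted in this issue.
restore_response = (
    '{"snapshot":{"snapshot":"2","indices":[".kibana_1"],'
    '"shards":{"total":7,"failed":3,"successful":4}}}'
)

if __name__ == "__main__":
    print(shard_summary(restore_response))
    print(has_failed_shards(restore_response))
```

The same helper accepts the get-snapshot response, so the "before" and "after" counts can be checked with one function.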

The opensearch logs say that the snapshot is missing:
[2024-06-12T09:20:00,612][WARN ][o.o.s.InternalSnapshotsInfoService] [opensearch-cluster-master-0] failed to retrieve shard size for [snapshot=nsp-opensearch-repository:2/vCkI0dAMTxqmOPE3yF_Rfw, index=[.opendistro-alerting-alerts/FLxs5PqsSTOF7NqmUF-0NA], shard=[.opendistro-alerting-alerts][0]]
org.opensearch.snapshots.SnapshotMissingException: [nsp-opensearch-repository:2] is missing
at org.opensearch.repositories.blobstore.BlobStoreRepository.loadShardSnapshot(BlobStoreRepository.java:3556) ~[opensearch-2.14.0.jar:2.14.0]
at org.opensearch.repositories.blobstore.BlobStoreRepository.getShardSnapshotStatus(BlobStoreRepository.java:3356) ~[opensearch-2.14.0.jar:2.14.0]
at org.opensearch.snapshots.InternalSnapshotsInfoService$FetchingSnapshotShardSizeRunnable.doRun(InternalSnapshotsInfoService.java:241) [opensearch-2.14.0.jar:2.14.0]
at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:913) [opensearch-2.14.0.jar:2.14.0]
at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) [opensearch-2.14.0.jar:2.14.0]
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) [?:?]
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) [?:?]
at java.base/java.lang.Thread.run(Thread.java:1583) [?:?]

However, I have manually copied the backup files to the pod at the path registered in the repository.

We don't have a shared file system but we make sure to manually copy the snapshot files to all the nodes.

I wanted to clarify two things.

Is it advisable to back up and restore these indices?
How does the OpenSearch restore process handle indices that have multiple primary and replica shards?

We have the same issue with another set of application specific indices that have multiple primary and replica shards.

Related component

Storage:Snapshots

To Reproduce

Take a snapshot of an OpenSearch cluster consisting of multiple nodes.
Restore the snapshot on the same cluster.
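The two steps above roughly correspond to the create-snapshot and restore-snapshot REST calls. The Python sketch below only builds the request descriptions and does not send anything. The helper name `snapshot_requests` is hypothetical, the repository name and snapshot name are taken from the log and responses in this issue, and the exact options used by the reporter are not known, so the request bodies are assumptions.

```python
from urllib.parse import quote


def snapshot_requests(repo, snapshot, indices):
    """Return (method, path, body) tuples for creating and then restoring a snapshot."""
    base = "/_snapshot/{}/{}".format(quote(repo, safe=""), quote(snapshot, safe=""))
    index_list = ",".join(indices)
    create = (
        "PUT",
        base + "?wait_for_completion=true",
        # include_global_state=false matches the snapshot metadata shown in the report.
        {"indices": index_list, "include_global_state": False},
    )
    restore = (
        "POST",
        base + "/_restore?wait_for_completion=true",
        {"indices": index_list},
    )
    return [create, restore]


if __name__ == "__main__":
    for method, path, body in snapshot_requests(
        "nsp-opensearch-repository", "2", [".kibana_1", ".opendistro-alerting-alerts"]
    ):
        print(method, path, body)
```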

Expected behavior

The snapshot should get restored successfully without any shard failures.

Additional Details

No response

gaiksaya · 2024-06-20T21:22:45Z

Moving the issue to the core repo, as it is a snapshot-related issue.

gbbafna · 2024-06-27T15:09:59Z

[Storage Triage - attendees 1 2 3 4 5 6 7 8 9 10 ]

@manoj0598

Is it advisable to back up and restore these indices?

The repository has to be shared across all the nodes in the cluster. If you copy the data manually, the behavior is undefined.
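One way to check whether every node can actually see the repository might be the repository verification call, `POST /_snapshot/<repository>/_verify`. The Python sketch below only interprets a response that has already been fetched. It assumes that a successful verification lists the participating nodes under `nodes` and that a failure comes back with an `error` object; the helper name, the node IDs, the `-1` and `-2` node names, and the sample error body are all hypothetical.

```python
def repository_verified(response, expected_nodes):
    """True if the verify response has no error and lists every expected node name."""
    if "error" in response:
        return False
    verified = {info.get("name") for info in response.get("nodes", {}).values()}
    return set(expected_nodes) <= verified


# Hypothetical node IDs; only the "-0" node name appears in the issue's log.
ok_response = {
    "nodes": {
        "id-a": {"name": "opensearch-cluster-master-0"},
        "id-b": {"name": "opensearch-cluster-master-1"},
        "id-c": {"name": "opensearch-cluster-master-2"},
    }
}
# Assumed shape of a failed verification.
failed_response = {"error": {"type": "repository_verification_exception"}, "status": 500}

expected = {
    "opensearch-cluster-master-0",
    "opensearch-cluster-master-1",
    "opensearch-cluster-master-2",
}

if __name__ == "__main__":
    print(repository_verified(ok_response, expected))
    print(repository_verified(failed_response, expected))
```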

How does the OpenSearch restore process handle indices that have multiple primary and replica shards?

Snapshots store and restore data on a per-shard basis. Replica shards start hydrating after the primary shards, via the peer recovery mechanism.
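To see this split during a restore, the recovery listing from `GET /_cat/recovery?format=json` can be grouped by recovery type. The following is a rough Python sketch under assumptions: it expects each row to carry `index`, `shard` and `type` fields, with `snapshot` for shards restored from the repository and `peer` for copies recovered from another node. The helper name and the sample rows are invented for illustration and are not output from the reporter's cluster.

```python
from collections import defaultdict


def group_by_recovery_type(rows):
    """Group (index, shard) pairs by the recovery type reported for each row."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["type"]].append((row["index"], int(row["shard"])))
    return dict(grouped)


# Invented sample rows: one primary restored from the snapshot,
# one replica of the same shard recovered from its primary.
sample_rows = [
    {"index": ".kibana_1", "shard": "0", "type": "snapshot", "stage": "done"},
    {"index": ".kibana_1", "shard": "0", "type": "peer", "stage": "done"},
]

if __name__ == "__main__":
    print(group_by_recovery_type(sample_rows))
```

If a shard never shows up under `snapshot`, its replicas have nothing to recover from, which would be consistent with the failed shard counts in the restore response.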

manoj0598 added bug Something isn't working untriaged labels Jun 12, 2024

dblock transferred this issue from opensearch-project/OpenSearch Jun 12, 2024

gaiksaya transferred this issue from opensearch-project/opensearch-devops Jun 20, 2024

github-actions bot added the Storage:Snapshots label Jun 20, 2024

prudhvigodithi mentioned this issue Jun 23, 2024

[AUTOCUT] Gradle Check Flaky Test Report for IndicesRequestCacheIT prudhvigodithi/OpenSearch#27

Closed

gbbafna removed bug Something isn't working untriaged labels Jun 27, 2024

gbbafna moved this from 🆕 New to 🏗 In progress in Storage Project Board Jun 27, 2024
