[BUG] Restore of Snapshot taken from a 3 node cluster fails. #14481

Open
manoj0598 opened this issue Jun 12, 2024 · 2 comments

Describe the bug

We have a 3-node OpenSearch cluster. When I take a snapshot of the opensearch and opendistro indices and then inspect it, the snapshot reports no failed shards:

{"snapshots":[{"snapshot":"2","uuid":"vCkI0dAMTxqmOPE3yF_Rfw","version_id":136357827,"version":"2.14.0","remote_store_index_shallow_copy":false,"indices":[".opendistro-alerting-alert-history-2024.06.11-1",".opensearch-notifications-config",".opendistro-alerting-alerts",".kibana_1",".opendistro-reports-instances",".opendistro-alerting-config",".opendistro-reports-definitions"],"data_streams":[],"include_global_state":false,"state":"SUCCESS","start_time":"2024-06-11T07:33:00.126Z","start_time_in_millis":1718091180126,"end_time":"2024-06-11T07:33:00.326Z","end_time_in_millis":1718091180326,"duration_in_millis":200,"failures":[],"shards":{"total":7,"failed":0,"successful":7}}]}

However, when I restore the same snapshot on the same 3-node cluster, some shards fail:
{"snapshot":{"snapshot":"2","indices":[".opendistro-alerting-alert-history-2024.06.11-1",".opendistro-alerting-alerts",".opendistro-reports-definitions",".kibana_1",".opendistro-reports-instances",".opensearch-notifications-config",".opendistro-alerting-config"],"shards":{"total":7,"failed":3,"successful":4}}}

The OpenSearch logs say that the snapshot is missing:
[2024-06-12T09:20:00,612][WARN ][o.o.s.InternalSnapshotsInfoService] [opensearch-cluster-master-0] failed to retrieve shard size for [snapshot=nsp-opensearch-repository:2/vCkI0dAMTxqmOPE3yF_Rfw, index=[.opendistro-alerting-alerts/FLxs5PqsSTOF7NqmUF-0NA], shard=[.opendistro-alerting-alerts][0]]
org.opensearch.snapshots.SnapshotMissingException: [nsp-opensearch-repository:2] is missing
at org.opensearch.repositories.blobstore.BlobStoreRepository.loadShardSnapshot(BlobStoreRepository.java:3556) ~[opensearch-2.14.0.jar:2.14.0]
at org.opensearch.repositories.blobstore.BlobStoreRepository.getShardSnapshotStatus(BlobStoreRepository.java:3356) ~[opensearch-2.14.0.jar:2.14.0]
at org.opensearch.snapshots.InternalSnapshotsInfoService$FetchingSnapshotShardSizeRunnable.doRun(InternalSnapshotsInfoService.java:241) [opensearch-2.14.0.jar:2.14.0]
at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:913) [opensearch-2.14.0.jar:2.14.0]
at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) [opensearch-2.14.0.jar:2.14.0]
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) [?:?]
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) [?:?]
at java.base/java.lang.Thread.run(Thread.java:1583) [?:?]

However, I have manually copied the backup files to the pod, at the path registered for the repository.

We don't have a shared file system, but we manually copy the snapshot files to all the nodes.

I wanted to clarify two things.

  1. Is it advisable to back up and restore these indices?
  2. How does the OpenSearch restore process handle indices that have multiple primary and replica shards?

We have the same issue with another set of application-specific indices that have multiple primary and replica shards.

Related component

Storage:Snapshots

To Reproduce

  1. Take a snapshot of an OpenSearch cluster consisting of multiple nodes.
  2. Restore the snapshot on the same cluster.

Expected behavior

The snapshot should get restored successfully without any shard failures.

Additional Details

No response

manoj0598 added the bug and untriaged labels on Jun 12, 2024
dblock transferred this issue from opensearch-project/OpenSearch on Jun 12, 2024
@gaiksaya (Member)

Moving this issue to the core repo, as it is a snapshot-related issue.

@gbbafna (Collaborator) commented Jun 27, 2024

[Storage Triage - attendees 1 2 3 4 5 6 7 8 9 10]

@manoj0598

Is it advisable to back up and restore these indices?

The repository has to be shared across all the nodes in the cluster. If you copy the data manually, the behavior is undefined.
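
One way to check whether every node can reach the repository is the repository verification API (`POST _snapshot/<repository>/_verify`), which responds with the nodes that passed the check. The Python sketch below is illustrative only. The base URL, the absence of TLS and authentication, and the node names ending in `-1` and `-2` are assumptions; the repository name and `opensearch-cluster-master-0` are taken from the log in this issue.

```python
import json
import urllib.request


def verify_repository(base_url, repository):
    """Call POST _snapshot/<repository>/_verify and return the parsed JSON body.

    A failed verification is normally reported as an HTTP error, which
    urlopen raises as urllib.error.HTTPError.
    """
    request = urllib.request.Request(
        f"{base_url}/_snapshot/{repository}/_verify", method="POST"
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def missing_nodes(verify_body, expected_names):
    """Return the expected node names that are absent from a verify response."""
    verified = {node["name"] for node in verify_body.get("nodes", {}).values()}
    return sorted(set(expected_names) - verified)


# Example call (assumed URL; node names -1 and -2 are hypothetical):
# body = verify_repository("http://localhost:9200", "nsp-opensearch-repository")
# print(missing_nodes(body, [
#     "opensearch-cluster-master-0",
#     "opensearch-cluster-master-1",
#     "opensearch-cluster-master-2",
# ]))
```

A passing verification only shows that the nodes can read and write the repository location. It does not prove that manually copied snapshot files are complete or consistent on each node.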

How does the OpenSearch restore process handle indices that have multiple primary and replica shards?

Snapshots store and restore data on a per-shard basis. Replica shards start hydrating after the primary shards, via the peer recovery mechanism.
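
To observe this on a live cluster, `GET _cat/recovery?format=json` lists one row per shard recovery with a `type` column: shards restored from a repository report `snapshot`, and replicas copied from a primary report `peer`. The following is a minimal Python sketch that groups such rows. The sample rows and the index name `my-index` are hypothetical, not output from the cluster in this issue.

```python
from collections import defaultdict


def group_recoveries(rows):
    """Group _cat/recovery rows as {index: {recovery type: [shard numbers]}}."""
    grouped = defaultdict(lambda: defaultdict(list))
    for row in rows:
        grouped[row["index"]][row["type"]].append(int(row["shard"]))
    return {index: dict(types) for index, types in grouped.items()}


def unfinished(rows):
    """Return (index, shard, type, stage) for recoveries whose stage is not done."""
    return [
        (row["index"], int(row["shard"]), row["type"], row["stage"])
        for row in rows
        if row["stage"] != "done"
    ]


# Hypothetical rows in the shape returned by GET _cat/recovery?format=json
# (only the columns used here are shown).
sample_rows = [
    {"index": "my-index", "shard": "0", "type": "snapshot", "stage": "done"},
    {"index": "my-index", "shard": "0", "type": "peer", "stage": "index"},
    {"index": "my-index", "shard": "1", "type": "snapshot", "stage": "done"},
]

print(group_recoveries(sample_rows))
print(unfinished(sample_rows))
```

In this sample, both primaries have finished their `snapshot` recovery while one replica is still copying from its primary through `peer` recovery.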

gbbafna removed the bug and untriaged labels on Jun 27, 2024
gbbafna moved this from 🆕 New to 🏗 In progress in Storage Project Board on Jun 27, 2024