Easy and fluent disaster recovery (3.0?) #11894

sandervandegeijn · 2024-01-15T21:01:52Z

Is your feature request related to a problem? Please describe

So your cluster has blown up, or the hardware has failed, or... It's unsalvageable and you have to rebuild the whole thing. Luckily you have made snapshots, so you should be able to restore it. Well, it's not really easy; you have to account for:

You need to run the security admin script first
Restoring cluster state yes/no?
You can't restore the security index (so what about all the manually created users/roles/etc)
You can't do it through the UI
You need to use the admin certificates or things will fail
It's unclear which indices you need to include in your snapshots to restore all the stuff in Dashboards (detectors, channels, etc)
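
To make the caveats above concrete, here is a minimal sketch of the restore request such a script might build. It assumes the snapshot restore endpoint (`POST /_snapshot/<repository>/<snapshot>/_restore`) with its `indices` and `include_global_state` fields, and it assumes the security index is named `.opendistro_security`. The repository and snapshot names are made up, and the function itself is a hypothetical helper, not part of any OpenSearch client. It only builds the request; sending it still requires the admin certificate, as noted above.

```python
import json

# Assumption: the security plugin keeps its configuration in this index.
SECURITY_INDEX = ".opendistro_security"


def build_restore_request(repository, snapshot, indices=("*",),
                          include_global_state=False):
    """Return (method, path, body) for a snapshot restore call.

    Hypothetical helper: it always excludes the security index, since
    that index cannot be restored through a regular restore call.
    """
    patterns = list(indices) + ["-" + SECURITY_INDEX]
    body = {
        # Comma-separated index patterns; a leading "-" excludes a pattern.
        "indices": ",".join(patterns),
        # The "restore cluster state yes/no?" decision from the list above.
        "include_global_state": include_global_state,
    }
    path = f"/_snapshot/{repository}/{snapshot}/_restore"
    return "POST", path, body


if __name__ == "__main__":
    # "my-repository" and "nightly-1" are placeholder names.
    method, path, body = build_restore_request("my-repository", "nightly-1")
    print(method, path)
    print(json.dumps(body, indent=2))
```

Users, roles, and other security settings are not covered by this request; they would still have to be re-applied separately, for example with the security admin script.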

Even if you're quite familiar with OpenSearch it can be complex, especially when your production environment has gone up in smoke and everybody is looking at you to restore everything ASAP.

The point in time recovery of certain indices is quite easy through the UI, but a complete restore is not that easy (talking from experience here ;) )

Describe the solution you'd like

A clear, consistent, and user-friendly flow that restores the whole cluster (including security settings, jobs, anomaly detectors, etc.) to the state of the snapshot you want to restore, without a ton of caveats to take into account and without having to read the docs with all their exceptions.

Restore is preferably done from the UI and can be executed by junior to mid-level sysadmins/devs.

This would also enable us to phase out our custom component that provisions all the settings from a git repository, which exists because we don't trust being able to restore everything after a loss of the cluster.

Trust is key here, I need to be able to trust the environment to be recoverable with my eyes closed.

Related component

Storage:Snapshots

Describe alternatives you've considered

Scripting everything (which I did), but this requires quite some knowledge of OpenSearch and its internals. This should be as easy as possible with a polished experience.

Additional context

No response

peternied · 2024-01-17T16:14:52Z

[Triage - attendees 1 2 3 4]
Thanks for calling out this area for improvement and for the details on how it could work better.

reta · 2024-01-17T17:00:30Z

I believe it will come naturally when Remote Store (the writeable paths) is released: in that case the storage and compute layers are separated, so recovering a node (or even the whole cluster) should be as easy as pointing to the Remote Store location.

@andrross please correct me here.

Bukhtawar · 2024-01-18T11:30:38Z

The major caveat in auto-recovering is cases around network partitioning and ensuring we don't have an isolated writer acknowledging write requests while we auto-recover the shard data on some other node. This will lead to divergent writes if safety checks aren't in place.
You might be interested in #11921; there are plans to support this feature.
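
As an illustration of the kind of safety check meant here, the following is a generic fencing-token sketch. It is not how OpenSearch or #11921 implements it; the class and method names are hypothetical. The idea is that every promotion of a new writer bumps a term, and the store rejects writes that carry an older term, so an isolated old writer cannot get writes acknowledged after recovery has moved the shard elsewhere.

```python
class FencedShardStore:
    """Toy shard store that rejects writes from stale writers.

    Hypothetical sketch of fencing by term; not OpenSearch code.
    """

    def __init__(self):
        self.current_term = 0  # term of the most recently promoted writer
        self.docs = {}         # doc_id -> document

    def promote_new_writer(self):
        """Bump the term and hand it to the newly promoted writer."""
        self.current_term += 1
        return self.current_term

    def write(self, term, doc_id, doc):
        """Apply the write only if it comes from the current writer.

        Returns True if the write was acknowledged, False if it was fenced.
        """
        if term < self.current_term:
            return False  # stale writer, e.g. isolated by a partition
        self.docs[doc_id] = doc
        return True


if __name__ == "__main__":
    store = FencedShardStore()
    old_term = store.promote_new_writer()   # original writer
    store.write(old_term, "a", {"v": 1})
    new_term = store.promote_new_writer()   # recovery promotes another node
    print(store.write(old_term, "a", {"v": 2}))  # fenced
    print(store.write(new_term, "a", {"v": 3}))  # accepted
```

Without the term check, both writers would be acknowledged and the two copies of the shard would diverge.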

andrross · 2024-01-18T16:26:57Z

@reta Yes, the plan with remote store is to enable automatic recovery in the case of any hardware failure, though as @Bukhtawar called out we're not quite there for all cases. However, as this issue documents, there are some significant pain points with snapshot-based disaster recovery where we could definitely make improvements.

sandervandegeijn · 2024-01-18T16:32:34Z

Yes, please. If I can help to review the plans, no problem. In all cases the recovery flow should be almost thoughtless and simple; if you're the one recovering the cluster while everyone stresses out around you, it should be as KISS as can be :)

linuxpi · 2024-05-02T15:26:01Z

[Storage Triage - attendees 1 2 3 4 5 6 7 8 9 10 11 12]

@sandervandegeijn Thanks for opening the issue. With the recent Remote Store release we have simplified some parts of the process, and adding some documentation around the new improvements will help.

We need to plan for and list out other improvements that can be taken up. One of those was already pointed out by @Bukhtawar: #11921

We can debate more on whether such controls should be exposed via the UI.

sandervandegeijn · 2024-05-03T08:21:48Z

Great that there is progress. If I can help to review anything, no problem.

sandervandegeijn added the enhancement and untriaged labels Jan 15, 2024

github-actions bot added the Storage:Snapshots label Jan 15, 2024

peternied removed the untriaged label Jan 17, 2024

sandervandegeijn mentioned this issue Jan 17, 2024

[BUG] 2.11.1 restoring kibana indices from snapshots through UI fails opensearch-project/index-management-dashboards-plugin#964

Open

Bukhtawar added this to Storage Project Board Feb 15, 2024

github-project-automation bot moved this to 🆕 New in Storage Project Board Feb 15, 2024

linuxpi moved this from 🆕 New to Later (6 months plus) in Storage Project Board May 2, 2024
