Easy and fluent disaster recovery (3.0?) #11894

Open
sandervandegeijn opened this issue Jan 15, 2024 · 7 comments
Labels
enhancement, Storage:Snapshots

Comments

@sandervandegeijn

sandervandegeijn commented Jan 15, 2024

Is your feature request related to a problem? Please describe

So your cluster has blown up, or the hardware has failed, or... It's unsalvageable and you have to rebuild the whole thing. Luckily you have made snapshots, so you should be able to restore it. Well, it's not really easy; you have to account for:

  • You need to run the security admin script first
  • Restoring cluster state yes/no?
  • You can't restore the security index (so what about all the manually created users/roles/etc)
  • You can't do it through the UI
  • You need to use the admin certificates or things will fail
  • It's unclear which indices you need to include in your snapshots to restore all the stuff in Dashboards (detectors, channels, etc)
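
As a rough illustration of the points above, here is a minimal sketch that only builds the request for a full restore through the snapshot `_restore` endpoint. It assumes the security index has the default name `.opendistro_security`; the repository name `my-repository` and snapshot name `nightly-1` are hypothetical. Nothing is sent: the request would still have to be submitted with the admin certificate, after the security admin script has run.

```python
import json

# Assumption: default name of the security plugin's index.
SECURITY_INDEX = ".opendistro_security"


def build_restore_request(repository, snapshot, include_global_state=False):
    """Return (method, path, body) for a snapshot restore call.

    Nothing is sent here; the caller is expected to submit the request
    using the admin certificate and key.
    """
    path = f"/_snapshot/{repository}/{snapshot}/_restore"
    body = {
        # Restore everything except the security index, which cannot be
        # restored on top of the one created by the security admin script.
        "indices": f"-{SECURITY_INDEX}",
        # The "restore cluster state yes/no?" decision from the list above.
        "include_global_state": include_global_state,
    }
    return "POST", path, body


# Hypothetical repository and snapshot names, for illustration only.
method, path, body = build_restore_request("my-repository", "nightly-1")
print(method, path)
print(json.dumps(body, indent=2))
```

Users and roles that were created by hand are not covered by this request, which is exactly the gap described above.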

Even if you're quite familiar with OpenSearch it can be complex, especially when your production environment has gone up in smoke and everybody is looking at you to restore everything ASAP.

Point-in-time recovery of certain indices is quite easy through the UI, but a complete restore is not (talking from experience here ;) ).

Describe the solution you'd like

A clear, consistent, and user-friendly flow that restores the whole cluster (including security settings, jobs, anomaly detectors, etc.) to the state of the snapshot you want to restore, without a ton of caveats to take into account or having to read the docs with all their exceptions.

Restore is preferably done from the UI and can be executed by junior to mid-level sysadmins/devs.

This would also enable us to phase out our custom component that provisions all the settings from a git repository, which exists because we don't trust being able to restore everything after a loss of the cluster.

Trust is key here, I need to be able to trust the environment to be recoverable with my eyes closed.

Related component

Storage:Snapshots

Describe alternatives you've considered

Scripting everything (which I did), but this requires quite some knowledge of OpenSearch and its internals. This should be as easy as possible, with a polished experience.

Additional context

No response

sandervandegeijn added the enhancement and untriaged labels Jan 15, 2024
@peternied
Member

[Triage - attendees 1 2 3 4]
Thanks for calling out this area for improvement and for the details on how it could work better.

@reta
Collaborator

reta commented Jan 17, 2024

I believe it will come naturally when Remote Store (the writeable paths) is released: in that case the storage and compute layers are separated, and recovering a node (or even the whole cluster) should be as easy as pointing to the Remote Store location.

@andrross please correct me here.

@Bukhtawar
Collaborator

The major caveat with auto-recovery is network partitioning: we have to ensure that an isolated writer is not still acknowledging write requests while we auto-recover the shard data on some other node. Without safety checks in place, this leads to divergent writes.
You might be interested in #11921; there are plans to support this feature.

@andrross
Member

@reta Yes, the plan with remote store is to enable automatic recovery in the case of any hardware failure, though as @Bukhtawar called out we're not quite there for all cases. However, as this issue documents, there are some significant pain points with snapshot-based disaster recovery where we could definitely make improvements.

@sandervandegeijn
Author

Yes, please. If I can help to review the plans, no problem. In all cases the recovery flow should be almost thoughtless and simple: if you're the one recovering the cluster while everyone stresses out around you, it should be as KISS as can be :)

@linuxpi
Collaborator

linuxpi commented May 2, 2024

[Storage Triage - attendees 1 2 3 4 5 6 7 8 9 10 11 12]

@sandervandegeijn Thanks for opening the issue. With the recent Remote Store release we have simplified some parts of the process. Adding some documentation around the new improvements will also help.

We need to plan for and list out other improvements that can be taken up. One of those was already pointed out by @Bukhtawar: #11921

We can debate more on whether such controls should be exposed via the UI.

linuxpi moved this from 🆕 New to Later (6 months plus) in Storage Project Board May 2, 2024
@sandervandegeijn
Author

Great that there is progress. If I can help to review anything, no problem.

Projects
Status: Later (6 months plus)
Development

No branches or pull requests

6 participants