Easy and fluent disaster recovery (3.0?) #11894
Comments
I believe it will come naturally when Remote Store (the writeable paths) is released: in this case, the storage and compute layers are separated, and recovering a node (or even the whole cluster) should be as easy as pointing to the Remote Store location. @andrross please correct me here.
The major caveat with auto-recovery is cases around network partitioning: we need to ensure we don't have an isolated writer acknowledging write requests while we auto-recover the shard data on some other node. That would lead to divergent writes if safety checks aren't in place.
@reta Yes, the plan with remote store is to enable automatic recovery in the case of any hardware failure, though as @Bukhtawar called out we're not quite there for all cases. However, as this issue documents, there are some significant pain points with snapshot-based disaster recovery where we could definitely make improvements.
Yes, please. If I can help review the plans, no problem. In all cases the recovery flow should be almost thoughtless and simple: if you're the one recovering the cluster while everyone stresses out around you, it should be as KISS as can be :)
[Storage Triage - attendees] @sandervandegeijn Thanks for opening the issue. With the recent Remote Store release we have simplified some parts of the process, and adding documentation around the new improvements will help. We need to plan for and list out other improvements that can be taken up. One of those has already been pointed out by @Bukhtawar: #11921. We can debate more on whether such controls should be exposed via the UI.
Great that there is progress. If I can help review anything, no problem.
Is your feature request related to a problem? Please describe
So your cluster has blown up, or the hardware has failed, or... It's unsalvageable and you need to rebuild the whole thing. Luckily you have made snapshots, so you should be able to restore it. Well, it's not really easy; there is a lot you have to account for.
Even if you're quite familiar with OpenSearch it can be complex, especially when your production environment has gone up in smoke and everybody is looking at you to restore everything ASAP.
Point-in-time recovery of certain indices is quite easy through the UI, but a complete restore is not that easy (talking from experience here ;) ).
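To make the pain concrete, here is a minimal, hedged sketch of the kind of manual workflow a full restore tends to involve today, driving the snapshot REST API from Python. The endpoint, credentials, repository name, snapshot name, and index name are placeholders, and security-plugin specifics are only hinted at in comments; treat this as an illustration of the number of moving parts, not a recommended procedure.

```python
"""Rough sketch of a manual full-cluster restore from a snapshot.
All names below (host, credentials, repo, snapshot, index) are placeholders."""
import requests

BASE = "https://localhost:9200"            # placeholder cluster endpoint
REPO, SNAPSHOT = "my-repo", "my-snapshot"  # placeholder repo/snapshot names

s = requests.Session()
s.auth = ("admin", "admin")  # placeholder credentials
s.verify = False             # lab setup only; use proper certs in production

# 1. List snapshots in the repository to pick the one to restore.
print(s.get(f"{BASE}/_cat/snapshots/{REPO}?v").text)

# 2. Indices that already exist on the rebuilt cluster must be closed
#    (or deleted) before they can be restored over.
s.post(f"{BASE}/my-index/_close")

# 3. Restore. include_global_state also brings back cluster settings and
#    templates; system indices such as the security index usually need
#    separate handling (e.g. re-applying the security config afterwards).
body = {
    "indices": "*",
    "ignore_unavailable": True,
    "include_global_state": True,
}
r = s.post(f"{BASE}/_snapshot/{REPO}/{SNAPSHOT}/_restore", json=body)
r.raise_for_status()
print(r.json())
```

Every one of these steps has exceptions and ordering constraints, which is exactly the kind of detail that should not sit on the operator's shoulders during an outage.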
Describe the solution you'd like
A clear, consistent, user-friendly flow that restores the whole cluster (including security settings, jobs, anomaly detectors, etc.) to the state of the snapshot you want to restore, without a ton of caveats to take into account or having to read all the docs with their exceptions.
Restore should preferably be doable from the UI and executable by junior to mid-level sysadmins/devs.
This would also enable us to phase out our custom component that provisions all the settings from a git repository, which exists because we don't trust being able to restore everything after a loss of the cluster.
Trust is key here, I need to be able to trust the environment to be recoverable with my eyes closed.
Related component
Storage:Snapshots
Describe alternatives you've considered
Scripting everything (which I did), but this requires quite some knowledge of OpenSearch and its internals. This should be as easy as possible, with a polished experience.
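As an illustration of why scripting this is non-trivial, below is a hedged sketch of just one detail such a script ends up owning: polling until the restored shards are actually allocated before declaring success. The endpoint, credentials, and timeout are assumptions.

```python
"""Waiting for a restore to finish; host, credentials and timeout are placeholders."""
import time
import requests

BASE = "https://localhost:9200"   # placeholder cluster endpoint
s = requests.Session()
s.auth = ("admin", "admin")       # placeholder credentials
s.verify = False                  # lab setup only

def wait_for_green(timeout_s: int = 600) -> None:
    """Poll cluster health until all restored shards are allocated."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        health = s.get(f"{BASE}/_cluster/health").json()
        if health.get("status") == "green":
            return
        time.sleep(10)
    raise TimeoutError("restore did not reach green status in time")

# Shard-level recovery progress, useful for reporting while waiting.
print(s.get(f"{BASE}/_cat/recovery?v&active_only=true").text)
wait_for_green()
```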
Additional context
No response