RFD - Backup and restore #49
Comments
I just finished reading @pt247; this looks great! Some considerations below:
I would suggest doing this differently, as interacting with the
If we do end up having this structure, I prefer that those commands (
We already save the kubeconfig as a secret on Kubernetes; we could reuse that as part of this and enable versioning for that secret.
I also have a question: would we expect the S3 bucket or other storage to be managed by Nebari's Terraform during the first deployment, or would the user be responsible for that? (I do prefer the latter, though we would need to make sure the cluster roles have access to that 😄 )
That's a good point. The backup location should not be managed by Nebari, but Nebari should have access and rights to write to the location.
You are right; it's simpler to implement a catch-all "backup everything" command. But an admin might, for good reasons, be interested in backing up only specific components, for example, backing up user data only.
Some of the main points from our most recent discussion on the matter:
My 50 cents
We'll first need to discuss the data needed for state restoration and ensure each component is clearly defined in its role within the backup and restore operations. For instance:
Furthermore, addressing the dependencies and interactions between services during the backup and restore processes is essential. For example, restoring Keycloak user data and groups should ideally precede the restoration of corresponding directories to maintain coherence. Finally, our discussions have highlighted the importance of individually mapping out each service's backup and restore processes before we consider how to orchestrate these processes.
flowchart TD
B(Orchestrator)
C(NFS) --> B
D(Keycloak) --> B
E(Grafana?) --> B
S(Conda Store) --> B
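The ordering constraint mentioned above (Keycloak users and groups before the corresponding directories) could be handed to an orchestrator as a dependency map. This is only a sketch: the `RESTORE_DEPENDENCIES` map and the `restore_order` helper are hypothetical, and the conda-store dependency on Keycloak is an assumption made for illustration, not something settled in this thread.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each service lists the services that must be
# restored before it. Only "keycloak before nfs" comes from the discussion;
# the conda-store entry is an assumption for illustration.
RESTORE_DEPENDENCIES = {
    "keycloak": set(),
    "nfs": {"keycloak"},
    "conda-store": {"keycloak"},
    "grafana": set(),
}


def restore_order(dependencies):
    """Return an order in which every service follows its prerequisites."""
    return list(TopologicalSorter(dependencies).static_order())


order = restore_order(RESTORE_DEPENDENCIES)
print(order)
```

Keeping the dependencies as data rather than hard-coding a sequence means a new service only needs a new entry in the map.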
While managing other services solely through APIs is feasible, the same cannot be said for the EFS structure, which needs to be considered as its own category. As part of this RFD, we need to include the data that will be targeted as part of these stored components. Ideally, this would be facilitated through endpoints if we expose them somehow. Let's leverage the existing CLI command descriptions already presented in this RFD to ensure that any system we implement in the future can communicate in a way that our CLI, or other necessary tools, can effectively manage.
Regarding data export versus backup
Exporting data in a serializable format does not necessarily ensure a complete service restoration to its previous state. To better define these distinctions, it's essential to evaluate the behavior of each service. Exporting state data from one version of a service to another could restore the previous structural identity of the service but not suffice to reproduce the state it was in. If classified as backup/restore, importing and exporting should ideally match the service's original structure and state. If the provided files fail to restore the original state, the process should not be considered a backup/restore but a mere export/import, often due to the service's limitations or the incompleteness of the files or sources used to "restore" it.
In discussing the RFD, we aim to identify and standardize these necessary components and files, ensuring that our state data are sufficient to equate importing/exporting with backup/restore as much as possible. In scenarios where the service offers robust API support and effectively handles new data, the distinction between backup and export becomes less significant and often negligible.
For example, although listing and restoring YAML files of namespaced environments from conda-store might enable us to use these environments again (by rebuilding), this action does not replicate the original "build" of those same environments. As discussed, it also does not leverage the previous builds unless we manage to store all the available databases within it. In my opinion, it would be preferable for conda-store to handle that by itself, and we could work together to develop such usability, but we also need to consider what we can do now. However, this may only be the case for some services; for instance, Keycloak could adequately support backup and restore through simple import/export functions.
The comments by @viniciusdc are well organized and point the effort in a good direction. I propose the following principles and tactical plan for implementation.
Core Considerations
Nebari is a modular and configurable collection of disparate OSS components. This implies certain principles related to the backup/restore effort:
Tactical Plan
All APIs should be implemented as REST endpoints using administrator access tokens for authentication and accessible only within the VPC.
Order of implementation:
1.) User accounts (highest priority because these cannot be recreated)
2.) Conda Environments (high priority as these would be very difficult to recreate)
3.) User code, notebooks, apps
4.) Nebari deployment-wide asynchronous (e.g. cron) jobs
Both @viniciusdc and @tylergraff have some excellent thoughts here. I agree with @tylergraff that having a standardized interface for backups that we can implement for each service is a good plan. That will improve the devex and make things far easier as far as maintainability. @tylergraff's proposed API endpoints would certainly provide coverage, but I would suggest that we go even simpler to start. Just have a
I also agree with @viniciusdc that we should utilize built-in backup mechanisms whenever possible. Keycloak already provides options for backup and restore which can be accessed through its REST API. Rather than reinvent the wheel here, we should wrap the functionality so that it implements our backup interface.
For prioritization, I also agree with @tylergraff. We should first ensure that each service has backup and restore functionality before worrying about any kind of orchestration between backups. Users and groups are the obvious first backup target, and would be really straightforward to implement since it would just be wrapping Keycloak's REST API. After that, I would agree with conda-store next. I think the conda-store backup should just be a backup of the blob storage in some form and a dump of the Postgres DB to start with. Finally, the NFS file system, which I think we can just do a tarball of. Restores could be the reverse.
This is not an end state, but it would represent an MVP implementation that users could try out and that we could learn a lot from. Being an MVP, it will also be cheaper and quicker to implement while (hopefully) avoiding going too far down any incorrect paths.
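As a rough illustration of the "just do a tarball" idea for the NFS file system, here is a minimal sketch using Python's standard `tarfile` module. The function names `backup_directory` and `restore_directory` are hypothetical, and the sketch assumes the share is mounted as a local directory and fits in a single archive; a real implementation would also need to think about ownership, quotas, and files changing during the backup.

```python
import tarfile
from pathlib import Path


def backup_directory(source_dir, archive_path):
    """Write a gzip-compressed tarball of source_dir (hypothetical helper)."""
    with tarfile.open(archive_path, "w:gz") as archive:
        # arcname="." stores paths relative to source_dir, not absolute paths.
        archive.add(Path(source_dir), arcname=".")
    return Path(archive_path)


def restore_directory(archive_path, target_dir):
    """Extract a tarball created by backup_directory into target_dir."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as archive:
        if hasattr(tarfile, "data_filter"):
            # Newer Python versions support extraction filters that reject
            # unsafe members such as paths escaping the target directory.
            archive.extractall(target, filter="data")
        else:
            archive.extractall(target)
```

Restore is the reverse of backup here, which matches the MVP framing in the comment above.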
I agree with re-using / wrapping existing capabilities, provided that the wrapper adopts a standardized authentication token pattern which would be used across future endpoints. I'm not convinced of adding an optional S3 bucket for user backups. This adds S3 authentication implementation and administration. It also implies that a single operation would serialize all users to that S3 bucket. Bulk actions can introduce ornery complexities, such as: how to handle a fatal error which occurs after some of the users were backed up to S3? How would we get debug insight into (potentially) which individual user account caused the error? My opinion is that we should implement list-all, serialize, and deserialize operations only; the latter two operate on single elements (e.g. users). Client-side tooling can perform S3 uploads separately and in a more modular fashion. |
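To make the "list-all, serialize, deserialize" idea concrete, here is one possible shape for such an interface. Everything in it is hypothetical: the `ElementStore` protocol, the in-memory `InMemoryUserStore`, and the JSON payload with an `id` field are stand-ins chosen for illustration, not an agreed design or any real Keycloak API.

```python
import json
from typing import Protocol


class ElementStore(Protocol):
    """Hypothetical per-service interface operating on single elements."""

    def list_all(self) -> list[str]: ...
    def serialize(self, element_id: str) -> str: ...
    def deserialize(self, payload: str) -> str: ...


class InMemoryUserStore:
    """Toy implementation standing in for a real service such as Keycloak."""

    def __init__(self):
        self._users = {}

    def list_all(self):
        return sorted(self._users)

    def serialize(self, element_id):
        # One element in, one self-contained document out.
        return json.dumps(self._users[element_id], sort_keys=True)

    def deserialize(self, payload):
        user = json.loads(payload)
        self._users[user["id"]] = user
        return user["id"]
```

Client-side tooling could then loop over `list_all()`, call `serialize()` per element, and upload each result wherever it likes, which keeps storage concerns such as S3 out of the service itself.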
From the comments, I can conclude the following:
Let's start with the requirements for Keycloak. I have a few questions:
PS: @tylergraff, I am going through your last comment just now. Can you explain, in the case of Keycloak, what you would like to see in "list-all, serialize, and deserialize"?
I agree; whatever solution we pick, it needs to back up all or nothing. Luckily, pg_dump behaves like that. So, in case of failure, we can have the API report the backup status as failed, along with the reason.
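One way to get all-or-nothing behaviour with a reported status, sketched under assumptions: the backup is produced by some callable (`produce_bytes`, a placeholder for whatever actually generates the dump), written to a temporary file, and only renamed into place on success. `BackupStatus` and `write_backup_atomically` are hypothetical names, not part of any existing Nebari code.

```python
import os
import tempfile
from dataclasses import dataclass


@dataclass
class BackupStatus:
    ok: bool
    reason: str = ""


def write_backup_atomically(produce_bytes, destination):
    """Write a backup so that the destination is either complete or absent."""
    directory = os.path.dirname(os.path.abspath(destination))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(produce_bytes())
        # os.replace is atomic when source and destination share a filesystem.
        os.replace(tmp_path, destination)
        return BackupStatus(ok=True)
    except Exception as exc:
        # Clean up the partial file and report why the backup failed.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        return BackupStatus(ok=False, reason=str(exc))
```

An API route could return the `BackupStatus` fields directly, so a failed backup surfaces its reason instead of leaving a partial file behind.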
We can expose the Keycloak REST API to authenticated admins. This will allow admins to write client-side tooling to manage users as needed, e.g., adding or removing users.
Let me explain my reasoning and address those together:
My team's DR approach is to incrementally re-build a new Nebari deployment which can be used productively by our customers throughout that rebuild process. We are comfortable with this and are looking to minimize the risk and time (in that order) involved. We are not looking to precisely duplicate a Nebari deployment or its contents. We see substantial risk in the precise replication of internal Nebari state: internal state is opaque to us, may itself be the root cause of a disaster, or may cause a new disaster due to opaque consistency issues with other components. We know that deploying Nebari in an incremental fashion is low risk, because it is something we do frequently.
Our current DR approach is almost entirely manual, and we would like to improve by using automation to decrease the time involved. To reduce risk, it is critical that we retain visibility into (and thus confidence in) the changes effected by automation. We desire an approach of incremental modification, which allows us to understand changes and tailor risk. We want to maximize the observability of system state, allowing the effects of modification to be understood by administrators (who are likely learning as they go). And we’d like to decouple changes, to reduce the risk of unintended consequences.
To answer your questions:
After reviewing the latest RFD contents and reflecting on our internal discussions and community feedback, Approach 3 seems most suited to our needs. As @tylergraff noted:
Fully replicating Nebari's state can reintroduce the problems that necessitate a restoration, making it a challenging option. However, I also see significant merits in Approach 2, especially when we consider 'user' as the basic unit for the backup/restore process. This approach offers the flexibility to restart the process after encountering any errors or exceptions, which is a limitation of the bulk process. Nevertheless, this should not be viewed as a separate approach IMO. If we proceed with the REST API approach (Approach 3), we can incorporate both bulk and per-user import/export endpoints. This integration allows us to optimize the workflow for backup/restore processes, which the user should consider. In conclusion, I think everyone seems to be on the same page regarding the serialization and endpoints approach, and this should now be voted as is, and follow-up tasks can be created to start implementation details discussions. |
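The point about combining bulk and per-user operations can be illustrated with a bulk export that is just a loop over a per-user export and records failures instead of aborting. This is a sketch only; `export_all` and the `export_user` callable are hypothetical and do not correspond to any existing endpoint.

```python
def export_all(user_ids, export_user):
    """Export every user, collecting per-user failures instead of aborting.

    export_user is a hypothetical callable that returns one user's exported
    state or raises on error.
    """
    exported, failed = {}, {}
    for uid in user_ids:
        try:
            exported[uid] = export_user(uid)
        except Exception as exc:
            # Keep the reason so an admin can see which account failed and why.
            failed[uid] = str(exc)
    return exported, failed
```

Because failures are keyed by user, the process can be restarted for just the failed accounts by calling `export_all(list(failed), export_user)` again.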
Thanks to everyone for their feedback here. Based on this discussion, we will be moving forward with Approach 3. Currently, state lives in three main places:
We will create a backup controller within Nebari which will expose backup and restore routes for each of these services. Specifics of each service's backup and restore will be decided on a per-service basis and will be handled in individual tickets. There seems to be broad consensus that it makes sense to start with Keycloak as the first service to implement this on. @pt247 will open tickets for backup and restore of each service, and we can have specific discussions on the implementation details on those tickets.
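A backup controller that exposes backup and restore routes per service could be organised as a small registry, sketched below. The `BackupController` class, the `/backup/<service>` and `/restore/<service>` path shapes, and the handler signatures are all assumptions for illustration; the real routes are to be decided in the per-service tickets.

```python
class BackupController:
    """Hypothetical registry mapping services to backup/restore handlers."""

    def __init__(self):
        self._services = {}

    def register(self, name, backup, restore):
        self._services[name] = {"backup": backup, "restore": restore}

    def routes(self):
        # Assumed path shape: /<action>/<service>.
        return sorted(
            f"/{action}/{name}"
            for name, handlers in self._services.items()
            for action in handlers
        )

    def handle(self, path, payload=None):
        action, name = path.strip("/").split("/")
        if name not in self._services or action not in ("backup", "restore"):
            raise ValueError(f"unknown route: {path}")
        handler = self._services[name][action]
        return handler() if action == "backup" else handler(payload)
```

Each service only supplies two callables, so the per-service specifics stay out of the controller itself.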
Backup and restore - RFD
A design proposal for Backup and Restore service in Nebari.
Summary
As Nebari becomes more popular, it's essential to have dependable backup and restore capabilities. Automated scheduled backups are also necessary. Planning is vital since Nebari has several components, including conda-store environments, user profiles in Keycloak, and user data in the NFS store.
User benefit
Design considerations:
We need to look at the development, maintenance, administration, and support requirements to decide on an appropriate strategy for this service. Following is a list of key criteria for the service:
Data protection considerations:
In the scope of this RFD:
This Request for Discussion (RFD) aims to establish a high-level strategy for backup and restoration. The goal is to reach a consensus on design choices, API, and a development plan for the backup and restoration of individual components. The implementation details of the identified design will be part of another RFD. The focus of this RFD is to develop a backup and restoration strategy for the following components:
Out of scope for this RFD:
The following Nebari components are not covered in this document.
Existing backup process
You can find the existing docs for backup on this page.
Backup and Restore strategies
There are several approaches to Nebari backup and restore. Some are closer to the current backup and restore, and some are entirely novel approaches. Each of these methods has its own set of advantages and disadvantages. In this section, we will summarise the various approaches suggested in the comments, outline the pros and cons, and briefly describe the implementation.
Backup and restore by component Approach #1
This approach aims to automate the current manual backup and restore process.
A typical Nebari deployment consists of several components like Keycloak, conda-store, user data and more.
Example Backup flow:
Example Restore flow:
Note: Both of these workflows are examples and must be refined/refactored.
Let's look at the pros and cons of this approach:
Pros
Cons
Finer details
Vertical slices per user migration Approach #2
We could look at Nebari from the perspective of the user. Each user has some shared and dedicated state in each Nebari component.
The solution recommends backing up or restoring shared resources first. We can then back up or restore users in parallel, or in any order.
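The "shared resources first, then users in parallel" ordering could look roughly like the sketch below. The `migrate` function and the `backup_shared` and `backup_user` callables are hypothetical placeholders for whatever each Nebari component would actually do.

```python
from concurrent.futures import ThreadPoolExecutor


def migrate(backup_shared, backup_user, user_ids, max_workers=4):
    """Back up shared state first, then each user's slice in parallel."""
    # Shared resources are handled once, before any per-user work starts.
    shared = backup_shared()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with user_ids.
        results = list(pool.map(backup_user, user_ids))
    return shared, dict(zip(user_ids, results))
```

Since users are independent of each other once the shared state exists, the per-user step could equally run sequentially or be retried for a single user.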
User migration workflow
Nebari migration overall
Backup flowchart
Restore flowchart
Let's look at the pros and cons of this approach:
Pros
Cons
Restful Interface Approach #3
The last two designs include the backup and restore functionality in Nebari. The central assumption was that Nebari should be able to back up and restore itself. However, thanks to helpful comments in this RFD, this design challenges this premise and proposes an alternative solution.
This design breaks the implementation into two: the interface and the strategy. It argues that Nebari should only provide the interface for importing/exporting data. The backup and restore strategy should be part of the client code. We can extend the interface by providing a Python library.
The idea is simple: instead of building a backup and restore service, we could build a backup and restore interface. The only job of this interface will be to provide users' state and data to authenticated users outside Nebari. The entire backup and restore logic can be built and maintained outside Nebari. This backup and restore client can then be run from anywhere, providing admins with flexibility that other designs do not offer.
Serializable vs Non-Serializable data
An essential requirement for this design is to expose data and state. APIs like the Keycloak and conda-store APIs already provide the bulk of the serializable state. However, not all state is serializable, e.g., user data and conda packages. For these, the design recommends APIs that return download location URLs. APIs in Nebari could be completely stateless.
Let's look at a few transactions with this proposed API.
Serializable data
Non-Serializable data
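As a purely illustrative sketch of the distinction, a serializable export could return the state inline, while a non-serializable export returns only a location to download from. The response fields (`kind`, `data`, `url`), the function names, and the example base URL are all hypothetical and are not taken from the RFD.

```python
import json


def export_serializable(user):
    """Return serializable state inline in the response body (assumed shape)."""
    return json.dumps({"kind": "inline", "data": user}, sort_keys=True)


def export_non_serializable(resource_id, base_url):
    """Return only a download location for bulk data (assumed shape)."""
    return json.dumps(
        {"kind": "location", "url": f"{base_url}/{resource_id}"},
        sort_keys=True,
    )


print(export_serializable({"id": "alice"}))
print(export_non_serializable("home/alice.tar.gz", "https://backups.example.invalid"))
```

In both cases the API itself holds no backup state; the client decides what to fetch and where to store it.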
Let's see the pros and cons of this design.
Pros
Cons
Design Discussion
Possible options
Each of the above-discussed designs has its pros and cons. We could also extend the
designs. For example, we could extend Approach #2 and #1 via an API to provide simple
interfaces like /users/{uid}/backup/keycloak.
Let's look at a few possible options we can vote on. More suggestions welcome.
in the first iteration. Then extend this to
Sliced Approach #2 for normal users.
iteration evolve it to Sliced Approach #2
by exposing an API in second iteration.
Special note about conda-store
conda-store is one of the more complicated pieces to replicate among the Nebari
components. We will need to work with the conda-store team to come up with a detailed plan
for backup and restore. But here is an initial analysis based on the conda-store docs.
The simplest approach (Compatible with Approach #1)
Back up the object storage and dump the database. Restore would be the reverse. We might
have to ensure that database location entries for artifacts and Conda-pack are pointing
to the right location. This might involve simple find-and-replace operations on the
SQL dump.
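The find-and-replace step on the SQL dump might be as small as the sketch below. It assumes the dump is plain text and that storage locations appear as a literal prefix; `rewrite_storage_prefix` is a hypothetical helper, and nothing here reflects the actual conda-store schema.

```python
def rewrite_storage_prefix(sql_dump, old_prefix, new_prefix):
    """Replace a storage location prefix in a plain-text SQL dump.

    Returns the rewritten dump and the number of lines that changed, so the
    caller can sanity-check that the expected entries were touched.
    """
    rewritten = []
    changed = 0
    for line in sql_dump.splitlines(keepends=True):
        if old_prefix in line:
            line = line.replace(old_prefix, new_prefix)
            changed += 1
        rewritten.append(line)
    return "".join(rewritten), changed
```

Returning the change count gives a cheap check: zero changed lines on a dump that is known to contain artifact locations suggests the prefix was wrong.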
Approach per user (Can be used in Approach #2 and Restful Approach #3)
environment as deleted by setting deleted_on field.
environments and reset deleted_on to make them available.
environment -> build -> build_conda_package_build -> conda_package_build
Please note:
Relevant links:
Unresolved questions: