RFD - Backup and restore #49

Closed · pt247 opened this issue Apr 21, 2024 · 13 comments

pt247 commented Apr 21, 2024

Status Open for comments 💬
Author(s) @pt247
Date Created 20-04-2024
Date Last updated 05-05-2024
Decision deadline 22-05-2024

Backup and restore - RFD

A design proposal for a Backup and Restore service in Nebari.

Summary

As Nebari becomes more popular, it's essential to have dependable backup and restore capabilities. Automated scheduled backups are also necessary. Planning is vital since Nebari has several components, including conda-store environments, user profiles in Keycloak, and user data in the NFS store.

User benefit

  1. Nebari admins will get a straightforward backup process using the Nebari CLI.
  2. Admins will also be able to define a schedule for automated backups in the Nebari config.
  3. Nebari upgrades can automatically save the state before applying the upgrade.
  4. User data and other Nebari components will be better protected against accidental deletion.

Design considerations:

We need to look at the development, maintenance, administration, and support requirements to decide on an appropriate strategy for this service. Following is a list of key criteria for the service:

  1. Availability: How much service disruption is needed to perform a backup or restore.
  2. Observability: Visibility of progress, errors, and status.
  3. Maintainability: Ease of building, maintaining, and supporting the service.
  4. Composability: The ability to back up and restore small chunks independently.
  5. Security: Access control to the backup and restore service and to the backup itself.
  6. Compatibility: Forward and, if possible, backward compatibility.
  7. Flexibility: Multiple entry points to backup and restore, e.g. a schedule or an API.
  8. Scalability: The service should scale to large deployments.
  9. Feasibility: The cost of developing, maintaining, and running the service.
  10. Compliance: Compliance with various data protection regulations.
  11. On-prem: Support for on-prem and air-gapped deployments.

Data protection considerations:

  1. Encryption at rest and in transit: Data must be encrypted in transit and at rest to protect against unauthorized access (a minimal illustration follows this list).
  2. Backup location: Several data protection directives in the US and EU limit where we can store certain data assets. We should design the backup and restore solution with this in mind.
  3. Day zero feature: Encryption at rest and in transit needs to be available in the first version of the backup and restore service.
  4. PoLP (principle of least privilege): Only authorized users should be able to access the backup and restore service.
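For illustration, here is a minimal sketch of client-side encryption of a backup archive before upload, assuming the Python cryptography package; the file names and key handling are placeholders, not part of any agreed design:

# Illustrative only: encrypt a backup archive before it leaves the cluster,
# so the artifact is protected at rest regardless of the storage backend.
from cryptography.fernet import Fernet

def encrypt_backup(archive_path: str, encrypted_path: str, key: bytes) -> None:
    fernet = Fernet(key)
    with open(archive_path, "rb") as src:
        token = fernet.encrypt(src.read())
    with open(encrypted_path, "wb") as dst:
        dst.write(token)

if __name__ == "__main__":
    key = Fernet.generate_key()  # in practice the key would come from a secret store
    encrypt_backup("keycloak-backup.tar.gz", "keycloak-backup.tar.gz.enc", key)

Transport encryption (TLS) would be handled by the storage client or endpoint and is not shown here.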

In the scope of this RFD:

This Request for Discussion (RFD) aims to establish a high-level strategy for backup and restoration. The goal is to reach a consensus on design choices, API, and a development plan for the backup and restoration of individual components. The implementation details of the identified design will be part of another RFD. The focus of this RFD is to develop a backup and restoration strategy for the following components:

  • Nebari config
  • Keycloak
  • Conda-store
  • User data in NFS

Out of scope for this RFD:

The following Nebari components are not covered in this document.

  • Nebari plugins
  • Loki logs + Prometheus
  • Nebari migration (e.g. from AWS to GCP)
  • Custom backup schedules (e.g. component-specific backup schedules)

Existing backup process

You can find the existing manual backup docs in the Nebari documentation (see the manual backup link under Relevant links below).

Backup and Restore strategies

There are several approaches to Nebari backup and restore. Some are closer to the current backup and restore, and some are entirely novel approaches. Each of these methods has its own set of advantages and disadvantages. In this section, we will summarise the various approaches suggested in the comments, outline the pros and cons, and briefly describe the implementation.

Backup and restore by component Approach #1

flowchart TD
    Backup --> Storage
    Nebari --> |1. config| Backup
    Nebari --> |2. Keycloak | Backup
    Nebari --> |3. Conda Store | Backup
    Nebari --> |3. User Data | Backup
    
    Storage --> Restore1
    Restore1 --> |1. config| Nebari1
    Restore1 --> |2. Keycloak | Nebari1
    Restore1 --> |3. Conda Store | Nebari1
    Restore1 --> |3. User Data | Nebari1

This approach aims to automate the current manual backup and restore process.

A typical Nebari deployment consists of several components like Keycloak, conda-store, user data and more.

Example Backup flow:

flowchart TD
    A1[CLI] --> B(Backup workflow)
    A2[Nebari config.backup.schedule] --> B
    A3[Argo workflows UI]  --> B
    B --> F(Backup Nebari config)
    F --> D(Backup Keycloak)
    D --> C(Backup NFS)
    D --> E(Backup Conda Store)
    C --> X(Backup Location)
    D --> X
    E --> X
    F --> X

Example Restore flow:

flowchart TD
    A[Nebari Restore CLI - Specified backup] --> B(Backup workflow - latest backup)
    A1[Argo Workflows UI] --> B
    B --> B1(Restore workflow - Specified backup)
    B1 --> F(Restore Nebari config)
    F --> D(Restore Keycloak)
    D --> C(Restore NFS)
    D --> E(Restore Conda Store)
    C --> Z(Validate restore completion)
    D --> Z
    E --> Z
    Z --> |failure| X(Restore workflow - latest backup)
    Z --> |success| Y(Stop)
    X --> |success| Y
    X --> |failure| Y

Note: Both these workflows are examples only and must be refined/refactored.
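To make the example backup flow concrete, here is a minimal orchestration sketch in Python; every function is a hypothetical stand-in that mirrors the ordering in the flowchart, not an existing Nebari API:

# Hypothetical sketch of the Approach #1 backup ordering shown above.
# Each step is a placeholder for a real component backup.

def backup_nebari_config(dest: str) -> None: ...
def backup_keycloak(dest: str) -> None: ...
def backup_nfs(dest: str) -> None: ...
def backup_conda_store(dest: str) -> None: ...

def run_full_backup(dest: str) -> None:
    # Order follows the example flowchart: config, then Keycloak, then NFS and conda-store.
    steps = [backup_nebari_config, backup_keycloak, backup_nfs, backup_conda_store]
    for step in steps:
        try:
            step(dest)
        except Exception as exc:
            # A single failed sub-task fails the whole backup (see Cons below).
            raise RuntimeError(f"backup failed at step {step.__name__}") from exc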

Let's look at the pros and cons of this approach:

Pros

  1. Feasibility: We can use tried and tested tools for database dumps (e.g. pg_dump) or Restic to sync files between source and destination (a minimal sketch follows this list).
  2. Maintainability: Development of each task (say, backing up conda-store) can happen separately and iteratively.
  3. Compatibility: Excellent support is available for tried and tested production-ready tools like pg_dump, Restic, rsync and more. This design can use Nebari-component-agnostic tools, which means the same solution could work for multiple versions of Nebari, providing backwards and forwards compatibility.
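As a rough illustration of item 1, a database dump plus a Restic sync could be wired together as below; hosts, paths, and credential handling are placeholders, and the sketch assumes pg_dump and restic are installed and configured:

# Illustrative only: wrap pg_dump and restic via subprocess.
# Credentials are expected from the environment (PGPASSWORD, RESTIC_PASSWORD, ...).
import subprocess

def dump_database(host: str, user: str, dbname: str, out_file: str) -> None:
    subprocess.run(
        ["pg_dump", "-h", host, "-U", user, "-d", dbname, "-f", out_file],
        check=True,
    )

def sync_to_backup_location(path: str, repo: str) -> None:
    # restic deduplicates and encrypts the snapshot it writes to the repository.
    subprocess.run(["restic", "-r", repo, "backup", path], check=True)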

Cons

  1. Observability: If the backup fails because of a single failed sub-task, the whole backup or restore fails. This solution offers little observability.
  2. Composability: The success of individual tasks does not guarantee overall success. For example, the backup might pick up a new user's data even though that user did not yet exist when Keycloak was backed up.
  3. Scalability: As user data increases, this design might need to evolve to take incremental snapshots. If the time it takes to back up increases, so do the chances of Nebari state changing.
  4. Availability: The solution must implement a maintenance window for the entire Nebari during backup and restore processes.

Finer details

  1. Backup location: This design assumes Nebari has read-write access to the backup location. Nebari will manage the backup location.
  2. Local backup: If the backup location is a local directory, the client should have access to read-write to that directory.

Vertical slices per user migration Approach #2

We could look at Nebari from the perspective of the user. Each user has some shared and dedicated state in each Nebari component.

Nebari        Shared                       Dedicated
Keycloak      Groups, roles, permissions   User profiles
Conda store   Shared environments          User environments
JupyterHub    Shared user data             Dedicated user data

The solution recommends backing up or restoring shared resources first. We can then back up or restore users in parallel or in any order.
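A minimal sketch of that ordering, with hypothetical backup_shared_state and backup_user functions standing in for the real per-component logic:

# Hypothetical sketch of Approach #2: back up shared state first, then users
# in parallel; one user's failure does not abort the others.
from concurrent.futures import ThreadPoolExecutor, as_completed

def backup_shared_state(dest: str) -> None: ...
def backup_user(username: str, dest: str) -> None: ...

def run_backup(users: list[str], dest: str) -> dict[str, str]:
    backup_shared_state(dest)  # shared resources always go first
    results = {}
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = {pool.submit(backup_user, user, dest): user for user in users}
        for future in as_completed(futures):
            user = futures[future]
            try:
                future.result()
                results[user] = "ok"
            except Exception as exc:
                results[user] = f"failed: {exc}"
    return results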

User migration workflow

flowchart LR
    rc[Restore user] --> rs[Restore shared state] --> ru[Restore user]
    s[Storage] -.-> rs
    s -.-> ru
    
    bc[Backup user] --> bs[Migrate shared state] --> bu[Migrate user]
    bu -.-> Storage
    bs -.-> Storage


Nebari migration overall
Backup flowchart

flowchart LR
    nb[Nebari Backup] ==> rs[Backup shared state] 
    rs ==> bu1[Backup user A] & bu2[Backup user B] & bu3[Backup user C] -.-> Storage
    rs --> | ... | Storage
    rs --> |Backup user n| Storage

Restore flowchart

flowchart LR
    nr[Nebari Restore] ==> rsr[Restore shared state]
    Storage -.-> ru1[Restore user A] & ru2[Restore user B] & ru3[Restore user C] & ru4[...] & ru5[Restore user N]
    rsr ==> ru1 & ru2 & ru3 & ru4 & ru5
    Storage -.-> rsr

Let's look at the pros and cons of this approach:

Pros

  1. Fail fast approach: If all goes well, we will have backed up all users. If not, then there will be two possibilities:
    1. Shared state backup/restore failure - which will be immediate or
    2. Single User state backup/restore failure - which can be local to a particular user, e.g. another user's backup/restore might still succeed. The likelihood of failure significantly decreases after the initial few users have been successfully migrated.
  2. Composable: Admins can migrate a single user at a time or a batch of users simultaneously. We could easily create workflows to back up the shared state at a higher or lower cadence than user data.
  3. Monitoring: The design allows for more granular status and progress monitoring.
  4. Availability: Backup of a user should not impact service for other users.

Cons

  1. Maintainability and Compatibility: This design depends on the APIs present in the individual components and our understanding of the state and data within them. However, our understanding and, therefore, the implementation may need to be updated with version upgrades. Hence, this backup/restore solution is only compatible with a limited number of component versions.
  2. Feasibility: This design, although more evolved, will also require more building, maintenance, and support.

Restful Interface Approach #3

The last two designs include the backup and restore functionality in Nebari. The central assumption was that Nebari should be able to back up and restore itself. However, thanks to helpful comments in this RFD, this design challenges this premise and proposes an alternative solution.

This design breaks the implementation into two: the interface and the strategy. It argues that Nebari should only provide the interface for importing/exporting data. The backup and restore strategy should be part of the client code. We can extend the interface by providing a Python library.

block-beta
columns 1
    j["Client Script (Backup strategy maintained by Nebari Admin)"]
    blockArrowId6<["&nbsp;&nbsp;&nbsp;"]>(updown)
    L["Nebari backup and restore library (Python package)"]
    blockArrowId7<["&nbsp;&nbsp;&nbsp;"]>(updown)
    D["Nebari Backup and restore REST API"]
    blockArrowId6<["&nbsp;&nbsp;&nbsp;"]>(updown)
  block:ID
    A["Conda Store REST API"]
    B["User DATA REST API"]
    B2["Keycloak REST API"]
  end


The idea is simple: instead of building a backup and restore service, we could build a backup and restore interface. The only job of this interface will be to provide users' state and data to authenticated users outside Nebari. The entire backup and restore logic can be built and maintained outside Nebari. This backup and restore client can then be run from anywhere, providing admins with flexibility that other designs do not offer.

flowchart LR
    subgraph Backup and restore library
        Client
    end
    Client-->I
    subgraph Nebari
        I
    I[Backup and restore interface API]-->K[Keycloak API]
    I-->C[Conda store API]
    I-->J[JupyterHub API]
    I-->N[User data API]
    end
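To illustrate the division of labour, a client script against the proposed (not yet existing) REST interface might look like the sketch below; the base URL, endpoint paths, and token handling are placeholders:

# Hypothetical Approach #3 client: the backup *strategy* lives in this script,
# outside Nebari; Nebari only exposes state and data over HTTP.
import requests

BASE = "https://nebari.example.com/api/backup"     # placeholder URL
HEADERS = {"Authorization": "Bearer ADMIN_TOKEN"}  # placeholder token

def list_users() -> list[str]:
    resp = requests.get(f"{BASE}/users", headers=HEADERS, timeout=30)
    resp.raise_for_status()
    return resp.json()

def export_user(username: str) -> dict:
    resp = requests.get(f"{BASE}/users/{username}", headers=HEADERS, timeout=30)
    resp.raise_for_status()
    return resp.json()

if __name__ == "__main__":
    for user in list_users():
        state = export_user(user)
        # ...write `state` to whatever backup location the admin chooses...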

Serializable vs Non-Serializable data

An essential requirement for this design is to expose data and state. APIs like the Keycloak and conda-store APIs already provide the bulk of the serializable state. However, not all state is serializable, e.g. user data and conda packages. In this case, the design recommends APIs that return download location URLs. APIs in Nebari could be completely stateless.

Let's look at a few transactions with this proposed API.

Serializable data

sequenceDiagram
    Client->>API: GET /users
    API-->>Keycloak: GET  /admin/realms/{realm}/users/
    Keycloak-->>API: [A, B, C]
    API-->>Client: [A, B, C]

Non-Serializable data

sequenceDiagram
    Client->>API: GET /users/A/environments
    API-->>conda-store: GET /api/v1/environment/?namespace={..}
    conda-store-->>API: [E1, E2, E3 ...]
    API->>Client: [{envs:[E1, E2]}]
    Client-->NFS: FTP/Rsync/Restic FETCH Artifact from E1, E2, E3 ...
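For the non-serializable case, the client follows the returned download locations itself; a rough sketch, with all endpoint paths and field names illustrative only:

# Sketch of the non-serializable flow: the API returns metadata plus download
# URLs, and the client fetches the artifacts directly.
import requests

API = "https://nebari.example.com/api/backup"      # placeholder
HEADERS = {"Authorization": "Bearer ADMIN_TOKEN"}  # placeholder

def backup_user_environments(username: str, dest_dir: str) -> None:
    envs = requests.get(f"{API}/users/{username}/environments",
                        headers=HEADERS, timeout=30).json()
    for env in envs:
        # `artifact_url` is an illustrative field name for the download location.
        artifact = requests.get(env["artifact_url"], headers=HEADERS, timeout=300)
        artifact.raise_for_status()
        with open(f"{dest_dir}/{env['name']}.tar.gz", "wb") as fh:
            fh.write(artifact.content)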

Let's see the pros and cons of this design.

Pros

  1. Flexibility: Nebari admins can write their custom backup strategy based on the organization's needs.
  2. Observability: This solution gives Nebari admins a unique insight into the inner workings of Nebari, making it less opaque and thus more observable.
  3. Compliance: The authenticated admins are responsible for enforcing compliance with company policies.
  4. Availability: If the client interface is well-developed, this solution can achieve the highest level of availability.
  5. Clear division of responsibility: Responsibilities are split clearly between the component APIs, the Nebari backup and restore API, the client library, and the client code.

Cons

  1. Maintainability & Support: This approach moves the complexity of the backup/restore strategy outside Nebari. It now requires Nebari admins to know and understand the inner workings of Nebari.
  2. Flexibility: A misconfigured client script can wreak havoc with the Nebari ecosystem.

Design Discussion

Possible options

Each of the above-discussed designs has its pros and cons. We could also extend the designs. For example, we could extend Approaches #1 and #2 via an API to provide simple interfaces like /users/{uid}/backup/keycloak.

Let's look at a few possible options we can vote on. More suggestions welcome.

  1. Option #1: Start with Restful Approach #3 to enable power users in the first iteration, then extend it to Sliced Approach #2 for normal users.
  2. Option #2: Implement Bulk Backup Approach #1 in the first iteration, then evolve it to Sliced Approach #2 by exposing an API in the second iteration.
  3. Option #3: Implement Bulk Backup Approach #1.
  4. Option #4: Implement Sliced Approach #2.
  5. Option #5: Implement Restful Approach #3.

Special note about conda-store

Conda-store is one of the more complicated pieces to replicate among the Nebari components. We will need to work with the conda-store team to come up with a detailed plan for backup and restore. But here is an initial analysis based on the conda-store docs.

The S3 server is used to store all build artifacts for example logs, docker layers,
and the Conda-Pack tarball. The PostgreSQL database is used for storing all states
on environments and builds along with powering the conda-store web server UI, REST
API, and Docker registry. Redis is used for keeping track of task state and results
along with enabling locks and realtime streaming of logs.

The simplest approach (Compatible with Approach #1)

Back up the object storage and dump the database. Restore would be the reverse. We might have to ensure that database location entries for artifacts and the Conda-Pack tarball point to the right location. This might involve simple find-and-replace operations on the SQL dump (see the sketch below).
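If artifact locations do change between source and destination, the find-and-replace could be as simple as the sketch below; the prefixes are examples, and whether this is sufficient needs to be confirmed with the conda-store team:

# Illustrative only: rewrite object-storage locations in a conda-store SQL dump
# so restored database rows point at the new bucket/prefix.
def rewrite_artifact_locations(dump_in: str, dump_out: str,
                               old_prefix: str, new_prefix: str) -> None:
    with open(dump_in, "r", encoding="utf-8") as src, \
         open(dump_out, "w", encoding="utf-8") as dst:
        for line in src:
            dst.write(line.replace(old_prefix, new_prefix))

# e.g. rewrite_artifact_locations("conda-store.sql", "conda-store.restored.sql",
#                                 "s3://old-bucket/", "s3://new-bucket/")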

Approach per user (Can be used in Approach #2 and Restful Approach #3)


  • Getting the shared state:
    • Get an SQL dump of the entire conda-store database.
    • Mark all entries in environment as deleted by setting the deleted_on field.
    • Get the global namespaces, and for each:
      • get the related environments and reset deleted_on to make them available;
      • for each environment:
        • get the build artifacts to back up (environment -> build -> build_conda_package_build -> conda_package_build);
        • back up the artifacts from the source.
  • Getting the user state:
    • Same as getting the shared state, except the namespace will be that of the given user.

Please note:

  1. We need the conda-store team to review this, but it gives a general idea.
  2. Most of this flow can be done via the API, except changing the environments' delete status.
  3. We will need to create a separate RFD for conda-store.

Relevant links:

  1. https://www.nebari.dev/docs/how-tos/manual-backup
  2. https://www.keycloak.org/server/importExport
  3. https://argoproj.github.io/workflows/
  4. https://www.keycloak.org/docs-api/22.0.1/rest-api/index.html#_users
  5. https://conda.store/conda-store/references/api

Unresolved questions:

  1. Which design is most suitable?
  2. Is there a hybrid design that we can develop iteratively?

viniciusdc commented Apr 22, 2024

I just finished reading @pt247; this looks great! Some considerations below:

  • Add Kubernetes CronJobs and ArgoWorkflows as part of your Scope section of the RFD

Keycloak manages user authentication. There is a recommended way of backup and restore in the Keycloak docs - link

I would suggest doing this differently, as interacting with the kc client is troublesome, and we only care about the users and groups. This could be handled by API requests directly in a moderate way.


nebari backup user-data --backup-location <BACKUP_LOCATION>

If we do end up having this structure, I would prefer that those commands (user-data, user-creds) are not exposed directly to the user (similar to how nebari render is treated right now). The user should only handle this manually if the general backup fails midway through.


Scheduled backup of Nebari config: First, we extend the existing Nebari configuration file to provide a backup schedule to the Argo workflow template. The Argo template will encrypt the Nebari config and back it up.

We already saved the kubeconfig as a secret on Kubernetes; we could reuse that as part of this and enable versioning for that secret.

@viniciusdc

I also have a question: would we expect the S3 or storage to be managed by Nebari's terraform during the first deployment, or would the user be responsible for that? (I do prefer the latter, though we would need to make sure the cluster roles have access to that 😄 )


pt247 commented Apr 23, 2024

I also have a question: would we expect the S3 or storage to be managed by Nebari's terraform during the first deployment, or would the user be responsible for that? (I do prefer the latter, though we would need to make sure the cluster roles have access to that 😄 )

That's a good point. The backup location should not be managed by Nebari, but Nebari should have access and rights to write to the location.
I will clarify this in the RFD.


pt247 commented Apr 23, 2024

If we do end up having this structure, I would prefer that those commands (user-data, user-creds) are not exposed directly to the user (similar to how nebari render is treated right now). The user should only handle this manually if the general backup fails midway through.

You are right; it's simpler to implement a catch-all backup-everything command. But an admin might, for good reasons, be interested in backing up only specific components, for example backing up user data only.


viniciusdc commented Apr 23, 2024

Some of the main points from our most recent discussion on the matter:

  • We should first set aside the implementation details, such as using CronJobs and Argo. We should focus on the conceptual framework rather than specific execution methods.
  • An aspect to consider is the capability to manage backup and restore processes locally without assuming the necessity of Argo as a core component, which should be optional.

My 50 cents

We'll first need to discuss the data needed for state restoration and ensure each component is clearly defined in its role within the backup and restore operations. For instance:

  • Keycloak: This should primarily focus on users, groups, and IDPs. (A more detailed gathering will show up during development)
  • Conda-store: It needs to handle environments. How can we restore these elements to their previous state as part of an initial Proof of Concept? The solution to this problem does not exist yet and requires further discussion with the Conda-store team. In the meantime, what are our options?
    We could export the environments and rebuild them as part of the new instance (see the discussion below on exporting/restoring differences).
  • User NFS data: Should we store the data in an S3 bucket using cloud provider APIs? A comparative analysis and a detailed inventory of essential data would be beneficial.

Furthermore, addressing the dependencies and interactions between services during the backup and restore processes is essential. For example, restoring Keycloak user data and groups should ideally precede the restoration of corresponding directories to maintain coherence.

Finally, our discussions have highlighted the importance of individually mapping out each service's backup and restore processes before we consider how to orchestrate these processes.

flowchart TD    
    B(Orchestrator)
    C(NFS) --> B
    D(Keycloak) --> B
    E(Grafana?) --> B
    S(Conda Store) --> B

While managing other services solely through APIs is feasible, the same cannot be said for the EFS structure, which needs to be considered as its own category. As part of this RFD, we need to include the data that will be targeted within these stored components. Ideally, this would be facilitated through endpoints if we expose them somehow.

Let's leverage the existing CLI command descriptions already presented in this RFD to ensure that any system we implement in the future can communicate in a way that our CLI—or other necessary tools—can effectively manage.

Regarding data export versus backup

Exporting data in a serializable format does not necessarily ensure a complete service restoration to its previous state.

To better define these distinctions, it's essential to evaluate the behavior of each service. Exporting state data from one version of a service to another could restore the previous structural identity of the service but not suffice to promote the same state it was in. If classified as backup/restore, importing and exporting should ideally match the service's original structure level and state. Suppose the provided files fail to restore the original state. In that case, the process should not be considered a backup/restore but a mere export/import—often due to the service's limitations or the incompleteness of the files or sources used to "restore" it.

In discussing the RFD, we aim to identify and standardize these necessary components and files, ensuring that our state data are sufficient to equate importing/exporting with backup/restore as much as possible. In scenarios where the service offers robust API support and effectively handles new data, the distinction between backup and export becomes less significant and often negligible.

For example, although listing and restoring YAML files of namespaced environments from the Conda store might enable us to use these environments again (by rebuilding), this action does not replicate the original "build" of those same environments. As discussed, it also does not leverage the previous builds unless we manage to store all the available databases within it; in my opinion, I would prefer that the conda-store handled that by itself, and we could work together to develop such usability, but we also need to consider what we can do now.

However, this may only be the case for some services; for instance, Keycloak could adequately support backup and restore through simple import/export functions.

@tylergraff

The comments by @viniciusdc are well organized and point the effort in a good direction. I propose the following principles and tactical plan for implementation.

Core Considerations

Nebari is a modular and configurable collection of disparate OSS components. This implies certain principles related to the backup/restore effort:

  • Atomic "snapshot in time" of a Nebari deployment is not feasible due to asynchronous internal behavior.
  • Implementation should focus individually and incrementally on individual Nebari components.
  • Interfaces should be standardized prior to implementation, including standardizing endpoint authentication. Existing CRUD APIs should be leveraged.
  • Creation mechanisms should be tolerant of internal inconsistencies. Ex: deserialization of user content should tolerate non-existent users (and vice-versa).
  • Successful completion should be prioritized over internal consistency. Ex: create user content even if the user does not exist. This provides flexibility in the order of restore operations.
  • Artifacts should be as generic as feasible, allowing for broad applicability and use. Ex: auditing capabilities, migration of users and content between deployments.

Tactical Plan

All APIs should be implemented as REST endpoints using administrator access tokens for authentication and accessible only within the VPC.
Core atomic API capabilities:

  • List all elements
  • Read an entire existing single element
  • Create and populate an entire new single element

Bulk operations should be accomplished only by external tooling utilizing the above APIs (a minimal sketch follows). This allows for granular progress logging and ease of error-handling.
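As a sketch of that split, client-side tooling could drive the atomic operations like this; the endpoint paths are placeholders for whatever interface is eventually standardized:

# Hypothetical external tooling: bulk backup built purely on the atomic
# list-all / read-one operations, with per-element logging and error handling.
import json
import logging
import requests

API = "https://nebari.example.com/api"             # placeholder
HEADERS = {"Authorization": "Bearer ADMIN_TOKEN"}  # placeholder
log = logging.getLogger("bulk-backup")

def backup_all_users(dest_dir: str) -> None:
    users = requests.get(f"{API}/users", headers=HEADERS, timeout=30).json()
    for username in users:
        try:
            element = requests.get(f"{API}/users/{username}",
                                   headers=HEADERS, timeout=30)
            element.raise_for_status()
            with open(f"{dest_dir}/{username}.json", "w", encoding="utf-8") as fh:
                json.dump(element.json(), fh)
            log.info("backed up %s", username)
        except Exception:
            # One bad element does not abort the bulk run, and the failing
            # element is identifiable from the log.
            log.exception("failed to back up %s", username)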

Order of implementation:

1.) User accounts (highest priority because these cannot be recreated)
Schema: username -> [password, [first-name, last-name], [groups] ]

  • Consider data protection regulations e.g. GDPR, CCPA - are they applicable? What is the shortest path to compliance?
  • Spin-off task: Audit Keycloak features and configure Nebari to remove use of Keycloak features that are difficult to [de]serialize or present regulatory burden.

2.) Conda Environments (high priority as these would be very difficult to recreate)
Schema: environment name -> [ [package name, version, hash, source URL, retrieval date] ]

  • This intentionally supports environment definitions only, and not the underlying packages themselves.
  • Python package backup/restore will be complex due to idiosyncrasies of packaging systems, and should be treated separately and later.

3.) User code, notebooks, apps
Nebari should be configured to access and store user-created content via git repos. Reliability should be handled externally via integration with a git provider (github, gitlab, etc). This is a well-solved problem served by mature tooling and processes.

4.) Nebari deployment-wide asynchronous (e.g. cron) jobs
Recurring / cron jobs should be implemented within the platform as user-created apps and stored in git repos accordingly.


dcmcand commented Apr 30, 2024

Both @viniciusdc and @tylergraff have some excellent thoughts here.

I agree with @tylergraff that having a standardized interface for backups that we can implement for each service is a good plan. That will improve the devex and make things far easier as far as maintainability. @tylergraff's proposed API endpoints would certainly provide coverage, but I would suggest that we go even simpler to start. Just have a /backup/keycloak endpoint that requires an admin token to access and takes an optional S3-compatible location as an argument. If the location is given, the files are written there. If not, they are just returned to the caller. That would be the simplest implementation imo.
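For illustration, such an endpoint could look roughly like the sketch below, assuming a FastAPI-style service; the token check, the export helper, and the optional location argument are all placeholders rather than a committed design:

# Hypothetical sketch of the suggested /backup/keycloak endpoint.
from fastapi import FastAPI, Header, HTTPException

app = FastAPI()

def export_keycloak_realm() -> dict:
    # Placeholder for wrapping Keycloak's own export functionality.
    return {"realm": "nebari", "users": [], "groups": []}

@app.get("/backup/keycloak")
def backup_keycloak(authorization: str = Header(...),
                    location: str | None = None) -> dict:
    if not authorization.startswith("Bearer "):  # stand-in admin-token check
        raise HTTPException(status_code=401, detail="admin token required")
    export = export_keycloak_realm()
    if location:
        # If an S3-compatible location is given, write the export there
        # (upload omitted in this sketch); otherwise return it to the caller.
        return {"status": "written", "location": location}
    return export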

I also agree with @viniciusdc that we should utilize built-in backup mechanisms whenever possible. Keycloak already provides options for backup and restore which can be accessed through its REST API. Rather than reinvent the wheel here, we should wrap the functionality so that it implements our backup interface.

For prioritization, I also agree with @tylergraff. We should first ensure that each service has backup and restore functionality before worrying about any kind of orchestration between backups.

Users and groups are the obvious first backup target and would be really straightforward to implement since it would just be wrapping Keycloak's REST API.

After that, I would agree with conda-store next. I think conda-store backup should just be a backup of the blob storage in some form and a dump of the postgres db to start with.

Finally, the NFS file system, which I think we can just do a tarball of.

Restores could be the reverse.

This is not an end state, but it would represent an MVP implementation that users could try out and that we could learn a lot from. Being an MVP, it will also be cheaper and quicker to implement while (hopefully) avoiding going too far down any incorrect paths.

@tylergraff

I agree with re-using / wrapping existing capabilities, provided that the wrapper adopts a standardized authentication token pattern which would be used across future endpoints.

I'm not convinced of adding an optional S3 bucket for user backups. This adds S3 authentication implementation and administration. It also implies that a single operation would serialize all users to that S3 bucket. Bulk actions can introduce ornery complexities, such as: how to handle a fatal error which occurs after some of the users were backed up to S3? How would we get debug insight into (potentially) which individual user account caused the error?

My opinion is that we should implement list-all, serialize, and deserialize operations only; the latter two operate on single elements (e.g. users). Client-side tooling can perform S3 uploads separately and in a more modular fashion.


pt247 commented Apr 30, 2024

From the comments, I can conclude the following:

  1. We all agree that each component in Nebari will have a different mechanism of backup and restore.
  2. We also see the importance of starting with Keycloak, which looks like a low-hanging fruit.
  3. I agree with @tylergraff that an "Atomic "snapshot in time" of a Nebari deployment is not feasible ..."
  4. I agree on a few more points, but I would like to start with Keycloak

Let's start with the requirements for Keycloak. I have a few questions:

  1. Why is the ability to serialize/deserialize Keycloak data useful?

    Ex: deserialization of user content should tolerate non-existent users (and vice-versa).
    @tylergraff Is the plan to use nebari restore to add new users?

  2. @dcmcand has an interesting suggestion of simply adding an endpoint /backup/keycloak. I think it's a great idea, and we should do it. However, I am still not convinced that wrapping the Keycloak API is the simplest approach. The simplest approach IMHO is to simply back up the entire database and restore from that instead. Let's have a look at all the options:
    2.1. Keycloak REST API - docs
    - Pros: For the given version, we can back up and restore using the same shared codebase. Data can be serialized, making it easier to edit/amend if needed, for example adding or removing users.
    - Cons: Upgrading Keycloak can result in API changes, which will break backup and restore. We need to have a good understanding of how to replicate Keycloak using API. This includes all the relations of groups, users, etc.
    2.2. Importing and exporting realms - docs
    - Pros: It sounds straightforward to implement. (But I don't know if backing up just the realms is enough.)
    - Cons: I am not sure if this output is serializable.
    2.3. Database Dump - blog
    - Pros: Easy to implement. Data can be zipped, encrypted, and stored in object storage. Easy to make it security compliant.
    - Cons: Not exactly serializable. If users are added in the destination DB that are not in the dump, they will be deleted. Thus, having a maintenance window becomes necessary.
    @dcmcand: Should we try the Database dump approach first? Or would you recommend we try Keycloak API?

PS: @tylergraff, I am going through your last comment just now. Can you explain, in the case of Keycloak, what you would like to see in "list-all, serialize, and deserialize"?
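For reference, listing users through the Keycloak admin REST API (option 2.1) is only a couple of HTTP calls; a minimal sketch using the standard admin-cli password grant, with the URL, realm, and credentials as placeholders:

# Minimal sketch: list Keycloak users via the admin REST API (option 2.1).
# Older Keycloak versions prefix these paths with /auth.
import requests

BASE = "https://keycloak.example.com"  # placeholder
REALM = "nebari"                       # placeholder

def get_admin_token(username: str, password: str) -> str:
    resp = requests.post(
        f"{BASE}/realms/master/protocol/openid-connect/token",
        data={"grant_type": "password", "client_id": "admin-cli",
              "username": username, "password": password},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["access_token"]

def list_users(token: str) -> list[dict]:
    resp = requests.get(
        f"{BASE}/admin/realms/{REALM}/users",
        headers={"Authorization": f"Bearer {token}"},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()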


pt247 commented Apr 30, 2024

I'm not convinced of adding an optional S3 bucket for user backups. This adds S3 authentication implementation and administration. It also implies that a single operation would serialize all users to that S3 bucket. Bulk actions can introduce ornery complexities, such as: how to handle a fatal error which occurs after some of the users were backed up to S3? How would we get debug insight into (potentially) which individual user account caused the error?

I agree; whatever solution we pick, it needs to back up all or nothing. Luckily, pg_dump behaves like that. So, in case of failure, we can have the API report the status of the backup as failed, with a reason.
We can always add an option to download the backup asset locally instead of S3. Will that help?

My opinion is that we should implement list-all, serialize, and deserialize operations only; the latter two operate on single elements (e.g. users). Client-side tooling can perform S3 uploads separately and in a more modular fashion.

We can expose the Keycloak REST API to authenticated admins. This will allow admins to write client-side tooling to manage users as needed, e.g. adding or removing users.

@tylergraff

Why is the ability to serialize/deserialize Keycloak data useful?
... what you would like to see in "list-all, serialize, and deserialize"?

Let me explain my reasoning and address those together:

My team's DR approach is to incrementally re-build a new Nebari deployment which can be used productively by our customers throughout that rebuild process. We are comfortable with this and are looking to minimize the risk and time (in that order) involved. We are not looking to precisely duplicate a Nebari deployment or its contents. We see substantial risk in the precise replication of internal Nebari state: internal state is opaque to us, may itself be the root cause of a disaster, or may cause a new disaster due to opaque consistency issues with other components. We know that deploying Nebari in an incremental fashion is low risk, because it is something we do frequently.

Our current DR approach is almost entirely manual, and we would like to improve by using automation to decrease the time involved. To reduce risk, it is critical that we retain visibility into (and thus confidence in) the changes effected by automation. We desire an approach of incremental modification, which allows us to understand changes and tailor risk. We want to maximize the observability of system state, allowing the effects of modification to be understood by administrators (who are likely learning as they go). And we’d like to decouple changes, to reduce the risk of unintended consequences.

To answer your questions:

  • We already have and use the capability to add (deserialize) individual users via endpoint. This is part of our current DR approach, it is low-risk, and we would like to further build on this.

  • A new capability to list existing users gives us clear visibility into that aspect of a deployment and a starting point to reproduce access to a new deployment.

  • A new capability to serialize [a critical subset of] a user’s account gives us a solution for user backup/restore that provides flexibility, visibility, and confidence in system state and operation. This also gives us the ability to audit users and/or migrate them to other systems, which could be valuable troubleshooting tools.

  • Providing these capabilities via Nebari serialize/deserialize endpoints (vs database dump) achieves the goals outlined above, and allows for automation without the need to generate database backup images nor rely on tooling and expertise to inspect them. We also get the ability to easily migrate users to other (potentially newer) systems without performing database migrations on software for which we have minimal experience. This approach also reduces the risk that a restored database contains the root cause of the original disaster, or otherwise introduces a new disaster via internal inconsistencies which are opaque to administrators.


viniciusdc commented May 14, 2024

After reviewing the latest RFD contents and reflecting on our internal discussions and community feedback, Approach 3 seems most suited to our needs. As @tylergraff noted:

We see substantial risk in the precise replication of the internal Nebari state: the internal state is opaque to us, may itself be the root cause of a disaster, or may cause a new disaster due to opaque consistency issues with other components.

Fully replicating Nebari's state can reintroduce the problems that necessitate a restoration, making it a challenging option.

However, I also see significant merits in Approach 2, especially when we consider 'user' as the basic unit for the backup/restore process. This approach offers the flexibility to restart the process after encountering any errors or exceptions, which is a limitation of the bulk process. Nevertheless, this should not be viewed as a separate approach IMO. If we proceed with the REST API approach (Approach 3), we can incorporate both bulk and per-user import/export endpoints.

This integration allows us to optimize the workflow for backup/restore processes, which the user should consider.

In conclusion, I think everyone is on the same page regarding the serialization and endpoints approach. This should now be voted on as is, and follow-up tasks can be created to start discussing implementation details.


dcmcand commented Jun 18, 2024

Thanks to everyone for their feedback here. Based on this discussion, we will be moving forward with Approach #3.

Currently, state lives in 3 main places:

  1. Keycloak - stores users, groups, and permissions
  2. Conda-store - stores conda environments and builds
  3. User storage - stores user data, including code and datasets

We will create a backup controller within Nebari which will expose backup and restore routes for each of these services. Specifics of each service's backup and restore will be decided on a per-service basis and will be handled in individual tickets. There seems to be broad consensus that it makes sense to start with Keycloak as the first service to implement this on. @pt247 will open tickets for backup and restore for each service, and we can have specific discussions on the implementation details in those tickets.
