Fixed typos and language for clarity for Transactional mirroring.md #8031

Open
Wants to merge 1 commit into base: master
30 changes: 14 additions & 16 deletions docs/howto/mirroring.md
@@ -32,33 +32,31 @@ Unlike conventional mirroring, data isn't simply copied between regions - lakeFS

## Use cases

### Disaster Recovery

Typically, object stores provide a replication/batch copy API to allow for disaster recovery: as new objects are written, they are asynchronously copied to other geographic locations.

In the case of regional failure, users can rely on other geographic locations, which should contain relatively up-to-date state.

The problem is determining which objects had arrived by the time of the disaster and which had not. Some questions to consider while triaging a disaster:

* Have all the necessary files for a given dataset arrived?
* In cases where there are dependencies between datasets, are all dependencies also up to date?
* What is currently in-flight? What hasn't started replicating yet?

In the event of a regional disaster, business continuity might require that we have answers to these questions. The lakeFS approach to mirroring makes it easier to arrive at them: the latest commit that exists in the replica is guaranteed to be a) in a consistent state and b) fully usable. Even if the replica doesn't contain the absolute latest commit, it still reflects a known, consistent point in time.
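As a rough sketch of what this looks like in practice, one could ask the replica which commit its mirrored branch currently points to. The server URL, repository, branch, and credentials below are placeholders, and the endpoint shape should be verified against your lakeFS version's API reference:

```python
import requests

# Placeholders - substitute your replica's lakeFS endpoint, repository, and mirrored branch.
LAKEFS_REPLICA_URL = "https://lakefs.replica.example.com"
REPO = "my-repo"
BRANCH = "main"

# Ask the replica which commit its mirrored branch currently points to.
# (Endpoint and response field follow the lakeFS OpenAPI spec; verify against your server version.)
resp = requests.get(
    f"{LAKEFS_REPLICA_URL}/api/v1/repositories/{REPO}/branches/{BRANCH}",
    auth=("ACCESS_KEY_ID", "SECRET_ACCESS_KEY"),
)
resp.raise_for_status()
replica_head = resp.json()["commit_id"]

# Whatever commit this is, it is complete and internally consistent:
# every object referenced by it has fully arrived on the replica.
print(f"Replica branch '{BRANCH}' is at commit {replica_head}")
```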

### Data Locality

For certain workloads, it might be cheaper to have data available in multiple regions. For example, expensive hardware such as GPUs might fluctuate in price, so we'd want to pick the region that currently offers the best pricing. The difference could easily offset the cost of the replicated data.

The challenge is reproducibility. Say we have an ML training job that reads image files from a path in the object store. Which files existed at the time of training?

If data is constantly flowing between regions, this might be harder to answer than we think. And even if we know, how can we recreate that exact state if we want to run the process again (for example, rebuilding that model for troubleshooting)?

Using consistent commits solves this problem. With lakeFS mirroring, it is guaranteed that a commit ID, regardless of location, will always contain the exact same data.

Coming back to the ML training job example: we can train our model in region A and, a month later, feed the same commit ID into another region and get back the same results.
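To illustrate, here is a minimal sketch of how a job in any region could pin its input listing to a specific commit through the lakeFS S3 gateway, where a repository appears as a bucket and the ref (here, a commit ID) as the leading path component. The endpoint, repository, commit ID, and prefix are hypothetical:

```python
import boto3

# Placeholders - the lakeFS endpoint, repository, commit ID, and prefix are illustrative.
LAKEFS_ENDPOINT = "https://lakefs.region-b.example.com"  # this region's lakeFS S3 gateway
REPO = "ml-datasets"
COMMIT_ID = "abc123def456"  # the exact commit the model was trained on
PREFIX = f"{COMMIT_ID}/images/"  # lakeFS paths are addressed as <ref>/<path>

# The S3 gateway exposes each repository as a bucket and each ref (branch or commit) as a prefix.
s3 = boto3.client(
    "s3",
    endpoint_url=LAKEFS_ENDPOINT,
    aws_access_key_id="LAKEFS_ACCESS_KEY_ID",
    aws_secret_access_key="LAKEFS_SECRET_ACCESS_KEY",
)

# Listing under the commit ID returns exactly the files that existed at that commit,
# no matter which region serves the request.
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=REPO, Prefix=PREFIX):
    for obj in page.get("Contents", []):
        print(obj["Key"])
```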


## Setting up mirroring
@@ -70,7 +68,7 @@ For AWS S3, please refer to the [AWS S3 replication documentation](https://docs.

After setting the replication rule, new objects will be replicated to the destination bucket.
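As a rough illustration of what such a rule could look like when configured programmatically with boto3 (bucket names and the IAM role ARN below are placeholders; the AWS documentation linked above is the authoritative reference):

```python
import boto3

s3 = boto3.client("s3")

# Placeholders - replace the bucket names and IAM role ARN with your own.
# Note: versioning must be enabled on both buckets for replication to work.
SOURCE_BUCKET = "my-lakefs-storage-us-east-1"
DEST_BUCKET_ARN = "arn:aws:s3:::my-lakefs-storage-eu-west-1"
REPLICATION_ROLE_ARN = "arn:aws:iam::123456789012:role/s3-replication-role"

s3.put_bucket_replication(
    Bucket=SOURCE_BUCKET,
    ReplicationConfiguration={
        "Role": REPLICATION_ROLE_ARN,
        "Rules": [
            {
                "ID": "replicate-all-new-objects",
                "Status": "Enabled",
                "Priority": 1,
                "Filter": {"Prefix": ""},  # empty prefix: replicate the whole bucket
                "DeleteMarkerReplication": {"Status": "Disabled"},
                "Destination": {"Bucket": DEST_BUCKET_ARN},
            }
        ],
    },
)
```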

Existing objects aren't replicated automatically, so they need to be copied separately. We can use [S3 batch jobs](https://docs.aws.amazon.com/AmazonS3/latest/userguide/s3-batch-replication-batch.html) to do this.


### Creating a lakeFS user with a "replicator" policy
@@ -213,7 +211,7 @@ Deletions from garbage collection should be replicated from the source:

## RBAC

These are the required RBAC permissions for working with the new cross-region replication feature:

Creating a Mirror:
