DB Versioning (GSI-1128) #129

158 changes: 158 additions & 0 deletions 62-california-condor/technical_specification.md
# DB Versioning (California Condor)
**Epic Type:** Implementation Epic

Epic planning and implementation follow the
[Epic Planning and Marathon SOP](https://docs.ghga-dev.de/main/sops/sop001_epic_planning.html).

## Scope
### Outline:
This epic covers the implementation of the database versioning concept used to
migrate the data of a given database when a relevant schema change or systematic
content update occurs. Database versioning also provides a way to detect whether the
current database instance is at the expected version.


### Included/Required:
- Initial implementation on single service:
- Add database version
- Logic that checks database version at startup
- Outer migration logic:
- Locking mechanism
- Start migration from detected version
- Refinements and abstraction of common logic if applicable
- Apply database versioning to remaining services


## Additional Implementation Details:

### General Migration Logic

Each microservice owns its own database, so any discussion of migrations can be assumed
to concern a single microservice. According to
[the ADR](https://github.com/ghga-de/adrs/pull/28), a given service will use a
single value to denote the version of its entire database, as opposed to versioning a
collection or the schema used for a document.

The *current* database version will be stored in a dedicated collection. The *expected*
database version will be stored in application code. When a service starts up, the
first action will be to compare the actual current version number against the expected
value. If the expected version number is not found in the database version collection
and the lock is not set, the service can start the migration process.

```json
// Database version collection upon migrating a database from version 1 to version 2:
[
{
"version": 2,
"completed": ISODate("2024-11-18T09:30:00Z"),
"duration_sec": 25
},
{
"version": 1,
"completed": ISODate("2024-10-07T09:00:00Z"),
"duration_sec": 22
}
]
```
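The startup check described above can be sketched as follows. All names, including `EXPECTED_DB_VERSION`, are hypothetical; the real service would load these documents from its version collection via its database client:

```python
# Hypothetical names for illustration; the real service would load these
# documents from its dedicated version collection.
EXPECTED_DB_VERSION = 2  # hard-coded in application code


def detect_current_version(version_docs):
    """Return the highest version with a completion timestamp (0 if none)."""
    completed = [doc["version"] for doc in version_docs if doc.get("completed")]
    return max(completed, default=0)


def must_migrate(version_docs, lock_is_set):
    """Decide at startup whether this instance should begin a migration."""
    if detect_current_version(version_docs) == EXPECTED_DB_VERSION:
        return False  # database is already at the expected version
    return not lock_is_set  # only migrate if no other instance holds the lock
```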

The migration process will form a chain, where discrete migration logic exists for
every database version to migrate the data from version X to X+1. Most of the time, the
database will be current or, at most, one version behind. However, if a database restore
occurs and the restored data happens to be older, the migration process will begin at
the appropriate step in the chain and continue until the data is fully migrated.
Migration code should be preserved at least until there is no possibility of
encountering the corresponding database version again.
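A minimal sketch of such a chain, assuming a hypothetical registry that maps each source version to its single-step migration function:

```python
# Hypothetical migration registry: each entry moves the data one step
# forward, from version X to X + 1.
MIGRATIONS = {
    1: lambda data: {**data, "schema": "v2"},
    2: lambda data: {**data, "schema": "v3"},
}


def run_chain(data, current_version, target_version):
    """Apply single-step migrations until the target version is reached."""
    for version in range(current_version, target_version):
        step = MIGRATIONS.get(version)
        if step is None:
            raise RuntimeError(f"no migration registered for version {version}")
        data = step(data)
    return data
```

Because each step only needs to know about versions X and X+1, old steps keep working unchanged as new ones are appended.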


### Services with Multiple Entrypoints

Several services operate with more than one instance simultaneously because, for
example, one serves as a REST API while another consumes Kafka events. Clearly, only
one migration process should occur in these situations, rather than one per instance.
This can be solved by ensuring the migration process only runs as part of the startup
for one entrypoint, e.g. the REST API. We can "lock" the database to signal to any
other potential instances that there is already a migration in progress:

```json
// locking collection with one document
[
{
"migration_in_progress": true
}
]
```

The first instance to obtain the lock is allowed to proceed; all others wait. This has
the benefit of preventing concurrent migrations if services are scaled. An alternative
to a dedicated collection is to use the database version collection -- if the
expected database version exists but the completion timestamp is missing, that means a
migration is already underway.
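A minimal in-memory sketch of the locking behavior; a real deployment would instead rely on an atomic check-and-set on the database (e.g. an upsert) so the test-and-acquire cannot race across processes:

```python
import threading


class LockCollection:
    """In-memory stand-in for the one-document locking collection; a real
    deployment would use an atomic database operation instead."""

    def __init__(self):
        self._mutex = threading.Lock()
        self._doc = {"migration_in_progress": False}

    def try_acquire(self) -> bool:
        """Atomically set the lock; return False if it was already held."""
        with self._mutex:
            if self._doc["migration_in_progress"]:
                return False
            self._doc["migration_in_progress"] = True
            return True

    def release(self) -> None:
        with self._mutex:
            self._doc["migration_in_progress"] = False
```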

After preventing simultaneous migrations, we need to optimize how other instances
operate while a migration is underway and consider how to handle read and
write requests. One way would be to accept some downtime and
block service instances until the migration is complete. This might be a
sufficient initial solution because database sizes are small enough that migrations will
be completed quickly. If we have to eliminate downtime, we could perform shadow
migrations and write the results to temporary collections while the old service
version continues to handle requests. When the migration is complete, the old service
can be taken offline and the collections swapped out. That's a little more complex.
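The final swap step could be sketched as follows, using a plain dict as a stand-in for the set of collections; the `_migrating` suffix is a hypothetical naming convention, and a real MongoDB deployment could perform the swap with an atomic collection rename:

```python
def swap_in_shadow(collections, name):
    """Replace a collection with its migrated shadow copy.

    `collections` is a plain dict standing in for the database; the
    `<name>_migrating` suffix is a hypothetical naming convention.
    """
    shadow = f"{name}_migrating"
    if shadow not in collections:
        raise KeyError(f"no shadow collection found for {name!r}")
    collections[name] = collections.pop(shadow)  # old data is discarded
    return collections
```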

### Reverse Migrations

There might be situations where we need to apply the reverse of a migration.
If DB version 5 is applied, but we later find that we should have stayed with version 4,
then we need to move to version 6. Version 6 is not treated as a special 'undo'
version increment; the migration logic merely happens to move the data back to the
structure it had in version 4.
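A sketch of the idea with hypothetical single-step migrations: the v5 to v6 step is an ordinary forward migration whose output happens to match the v4 structure:

```python
# Hypothetical single-step migrations: version 5 added a field that
# version 6 removes again, restoring the version-4 structure.
def migrate_4_to_5(doc):
    return {**doc, "extra_flag": False}


def migrate_5_to_6(doc):
    reverted = dict(doc)
    reverted.pop("extra_flag", None)  # undo the v5 change via a forward step
    return reverted
```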

### Migration Structure

![Migration structure](./images/db%20migrations%20white%20bg.png)

In the image above, the green section indicates the top-level migration logic, most of
which can be implemented in a library like `hexkit`.
The *rounded* orange boxes represent distinct migrations between database versions.
The *square* orange boxes show common logic that can be abstracted into a library.
The red-dotted items would be performed once for each batch of documents if a collection
were to be processed in batches.
The gray box shows where per-collection migration logic would occur.

### Errors During Migration

If an error prevents a migration from finishing, then we should discard the processed
entries, unset the lock document, and log the error. The cleanup is straightforward if
migrated documents are stored in a temporary collection that can be dropped, rather than
modifying the original collection directly (in which case we would need to reverse
changes, which could be difficult). It's important that we test migrations thoroughly
to avoid extended downtime from unexpected errors.
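The cleanup path can be sketched as a small wrapper; all four arguments are hypothetical callables supplied by the service:

```python
def migrate_with_cleanup(run_migration, drop_temp_collections, unset_lock, log_error):
    """Run a migration; on failure, discard temp data and log before re-raising.

    All four arguments are hypothetical callables supplied by the service.
    """
    try:
        run_migration()
    except Exception as err:
        drop_temp_collections()  # temp collections can simply be dropped
        log_error(err)
        raise
    finally:
        unset_lock()  # always release the lock, on success or failure
```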

### Testing Migrations

Abstracted logic should be tested wherever it lives, like `hexkit`.
Tests should cover the following, but the list is not exhaustive:
- The locking mechanism
- Error handling
- Logging
- Database version detection
- Selecting the right migration to start with
- What to do when no matching migration exists

When we test individual migration code, like the code that will update collection A for
the migration from database version X to Y, we should use some mock data that represents
documents in the database.
Tests should at least verify that the migration code applies the right changes.
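For example, a test for a hypothetical single-step document migration might look like this (the field names are invented for illustration):

```python
# Hypothetical single-step migration for one collection (field names invented):
def migrate_doc_x_to_y(doc):
    """Rename `size` to `file_size` and add a default `checksum_type`."""
    migrated = dict(doc)
    migrated["file_size"] = migrated.pop("size")
    migrated.setdefault("checksum_type", "sha256")
    return migrated


def test_migrate_doc_x_to_y():
    old_doc = {"_id": "abc", "size": 1024}
    assert migrate_doc_x_to_y(old_doc) == {
        "_id": "abc",
        "file_size": 1024,
        "checksum_type": "sha256",
    }
```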


### Monitoring

Migration progress should be reported at periodic intervals, and the migration duration
should be both logged and stored in the database along with a timestamp so we can
identify performance issues early on.
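Building the stored record could look like the following sketch; the function name is hypothetical, while the document shape follows the version-collection example above:

```python
from datetime import datetime, timezone


def build_version_record(version, started_at):
    """Build the document stored in the version collection after a migration
    completes (shape follows the version-collection example; the function
    name is hypothetical)."""
    completed = datetime.now(timezone.utc)
    return {
        "version": version,
        "completed": completed,
        "duration_sec": round((completed - started_at).total_seconds()),
    }
```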


## Human Resource/Time Estimation:

Number of sprints required: 2

Number of developers required: 1