[Remote Store] Design - Dual Mode Replication during Remote Store migration #12413
Labels: enhancement, RFC, Storage:Durability, Storage:Remote
Introduction:
To support migration to RemoteStore backed nodes, we would be moving shards from DocRep backed nodes to RemoteStore backed, SegRep enabled ones. More details on the migration process are in #12246.
During this phase, there would be a period in which certain shard copies in a replication group reside on DocRep engine based nodes while the primary has already moved to a RemoteStore enabled node. We need to support a mixed replication mode to uphold the tenet that the migration process has no impact on index and search traffic.
Tenets:
Handling dual mode replication on _shrink and _split API invocation during the migration process would be handled separately and will not be a part of this enhancement story.
Proposed solution:
Today, we depend on the index metadata to determine whether an index is Remote/SegRep enabled or DocRep enabled. Since the index metadata update takes place only after all shard copies have moved over to the remote enabled nodes, the source of truth during migration will be moved from index metadata to node attributes.
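The switch in the source of truth can be illustrated with a minimal sketch. Everything here is hypothetical and simplified: the class, the enums, and the `remote_store.segment.repository` attribute key are assumptions made for illustration and are not the actual OpenSearch implementation. The gating condition mirrors the MIXED compatibility mode and migration direction referenced in this proposal.

```java
import java.util.Map;

// Hypothetical sketch: resolve the replication mode of a shard copy.
// During migration the node attributes decide; otherwise index metadata does.
final class ReplicationModeSketch {
    enum ReplicationMode { DOCUMENT, SEGMENT }
    enum CompatibilityMode { STRICT, MIXED }
    enum Direction { NONE, REMOTE_STORE, DOCREP }

    // Assumed attribute key that marks a node as RemoteStore backed.
    static final String REMOTE_ATTR = "remote_store.segment.repository";

    static boolean isRemoteNode(Map<String, String> nodeAttributes) {
        return nodeAttributes.containsKey(REMOTE_ATTR);
    }

    static ReplicationMode resolve(CompatibilityMode mode,
                                   Direction direction,
                                   boolean indexIsRemoteInMetadata,
                                   Map<String, String> nodeAttributes) {
        // Migration is considered active only when the mode is MIXED and a direction is set.
        boolean migrating = mode == CompatibilityMode.MIXED && direction != Direction.NONE;
        // While migrating, trust the node hosting the shard copy instead of index metadata,
        // because the metadata is updated only after all copies have moved.
        boolean remote = migrating ? isRemoteNode(nodeAttributes) : indexIsRemoteInMetadata;
        return remote ? ReplicationMode.SEGMENT : ReplicationMode.DOCUMENT;
    }

    public static void main(String[] args) {
        Map<String, String> remoteNode = Map.of(REMOTE_ATTR, "my-repo");
        Map<String, String> docrepNode = Map.of();
        // Index metadata still says "not remote" while the migration is in flight.
        System.out.println(resolve(CompatibilityMode.MIXED, Direction.REMOTE_STORE, false, remoteNode));
        System.out.println(resolve(CompatibilityMode.MIXED, Direction.REMOTE_STORE, false, docrepNode));
        System.out.println(resolve(CompatibilityMode.STRICT, Direction.NONE, false, remoteNode));
    }
}
```

With this shape, two copies of the same shard can resolve to different modes at the same time, which is the situation dual mode replication has to handle.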
With the new MIXED compatibility mode introduced through #11986, node attributes would be considered for determining the replication mode and remote upload/download enablement when compatibility mode is set to MIXED and the migration direction is set.
To ensure data consistency on failovers during this migration process, Peer Recovery Retention Lease (PRRL) publication would be kept unblocked during this time. This is done to ensure that we do not lose out on any sequence number based recovery when a DocRep enabled replica shard copy in the replication group is promoted to a primary. Checks would be introduced to ensure that there are no missing sequence numbers during this failover process.
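One way to picture the "no missing sequence numbers" check is the sketch below. It is only an illustration under stated assumptions: sequence numbers are taken to start at 0 and be contiguous on a healthy copy, and the class and method names are hypothetical, not the checks that the actual change set introduces.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: detect gaps in the sequence numbers a replica has processed
// before it is promoted to primary.
final class SeqNoGapCheckSketch {

    // Returns every sequence number in [0, maxSeqNo] that the copy has not processed.
    static List<Long> missingSeqNos(Collection<Long> processed, long maxSeqNo) {
        Set<Long> seen = new HashSet<>(processed);
        List<Long> missing = new ArrayList<>();
        for (long seqNo = 0; seqNo <= maxSeqNo; seqNo++) {
            if (!seen.contains(seqNo)) {
                missing.add(seqNo);
            }
        }
        return missing;
    }

    // A copy is treated as gap free only if nothing up to maxSeqNo is missing.
    static boolean hasNoGaps(Collection<Long> processed, long maxSeqNo) {
        return missingSeqNos(processed, maxSeqNo).isEmpty();
    }

    public static void main(String[] args) {
        // Operation 2 never reached this copy, so it is reported as missing.
        System.out.println(missingSeqNos(List.of(0L, 1L, 3L), 3));
        System.out.println(hasNoGaps(List.of(0L, 1L, 2L, 3L), 3));
    }
}
```

A linear scan is used here for clarity; a real check would more likely compare a tracked local checkpoint against the maximum sequence number rather than enumerate every operation.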
The following diagram explains the flow for a write request in this stage:
The entire dual mode replication change set would be divided in the following 4 charters:
GlobalCheckpointSyncAction and PublishCheckpointAction replication actions