[Design] Splitting Shards In-Place #13923
Comments
@vikasvb90, amazing work on the detailed design.
@kkewwei Thanks for going through the design! The plan is to assign new shard numbers, because during a split the parent and child shards will co-exist; therefore, new shard numbers are needed for the child shards. The side effect of this is that there can be missing numbers between shard 0 and shard n. So, this will require changes in core and in some plugins where shard IDs are determined by simply iterating from 0 to n.
@vikasvb90, I have two more questions.
We don't replay the translogs; one solution I thought of is to block all incoming operations while hardlinking.
@kkewwei I didn't understand your question completely. As far as I understand, you are referring specifically to remote store based segment replication, because in DocRep we do plan to replay translogs in child shards and filter out non-belonging operations during replay, as well as when child shards are added to the replication group of the parent shard. Lastly, I highly appreciate you taking the time to review the design and add your comments. I look forward to more such comments and feedback from you. Thanks again for your review!
It makes sense to me. |
Introduction
In RFC #12918, we requested feedback on the proposal of splitting shards in-place, without any data loss and without the need to stop write traffic. We mentioned two approaches and expressed a preference for the physical shard split approach, supported by data points taken from various benchmarks. In this document we will dive deep into the design of physically splitting shards in-place.
OpenSearch already provides the capability to split an index into a new index with a greater number of shards. This is achieved by first blocking write traffic on the index (by placing a write block on it) and then recovering all shards of the target index locally by executing a split query on each shard of the source index. At a high level, each target shard is built by hardlinking the segment files of its source shard and then deleting the documents that do not belong to it.
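As a rough illustration of the idea behind the split query (not the actual ShardSplittingQuery implementation; the hash function and method names below are simplified stand-ins), a document is kept in a target shard only if its routing hash maps to that shard:

```java
/**
 * Minimal sketch of the document-ownership check a shard split relies on.
 * The real implementation uses a Murmur3 hash of the doc id/routing and
 * Lucene-level deletes via ShardSplittingQuery; hash() below is a stand-in.
 */
final class SplitOwnershipCheck {

    /** Stand-in for the Murmur3-based routing hash used by OpenSearch. */
    static int hash(String routingKey) {
        return routingKey.hashCode(); // illustration only
    }

    /** True if the document should remain in targetShardId out of numTargetShards. */
    static boolean belongsTo(String routingKey, int targetShardId, int numTargetShards) {
        return Math.floorMod(hash(routingKey), numTargetShards) == targetShardId;
    }

    public static void main(String[] args) {
        // During a split, docs for which belongsTo(...) is false are deleted
        // from the hardlinked copy of the source shard's segments.
        System.out.println(belongsTo("doc-42", 1, 4));
    }
}
```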
High Level Approach
Since shard split is already implemented, we can reuse shard splitting components like ShardSplittingQuery and hardlinking to build in-place shard split. Major areas which need to be covered to build this are:
1. Recovering shards while handling write traffic, i.e. online recovery of child shards.
2. Filtering out non-belonging operations on child shards during recovery.
3. Changes in operation routing to route docs to the right child shards.
4. Handling shard routing changes of the splitting primary at split start, during the split and post recovery.
Online Recovery of Child Shards
Note: We will leave out details of some of the actions performed during recovery to avoid overloading this doc with too much information. The goal is to provide a general, high-level sense of how the recovery of an in-place split shard would look.
This part of the design is inspired by the peer recovery flow, which ensures safe relocation of a primary or replica while handling read and write traffic. We will first outline the stages of the existing peer recovery flow and then talk about how we can leverage them to build in-place shard split. Peer recovery starts when a new shard routing is received in a cluster state update event on the target node. A new instance of IndexShard is created on the target node and, after the shard is set to the right state, a recovery request is sent to the source node where the source shard is present. PeerRecoverySourceService receives this transport request and creates a new RecoverySourceHandler which drives the entire recovery process. RecoverySourceHandler consists of certain abstractions which are implemented by the source handlers of the respective replication types, and it also provides some default implementations common to all source handlers. Similarly, on the target node a target service handles events published by the source handler, delegating them to the respective recovery target implementations. The recovery process driven by the source handler is divided into stages, and depending on the replication type - DocRep, SegRep or SegRep + Remote Store - these stages may or may not be applicable to the recovery.
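For orientation, the sketch below captures the rough shape of the stages the source handler drives; it is an illustrative outline only, with simplified method names rather than the actual RecoverySourceHandler API.

```java
/**
 * Illustrative outline of the stages a recovery source handler drives.
 * Method names are simplified stand-ins, not the actual RecoverySourceHandler API.
 */
interface RecoveryStages {

    /** Phase 1: snapshot the commit and ship segment files to the target. */
    void sendFiles();

    /** Prepare the target engine so that it can accept replayed operations. */
    void prepareTargetForTranslog();

    /** Phase 2: replay indexing operations captured while phase 1 ran; may not apply to remote-store based SegRep. */
    void replayOperations();

    /** Finalize: mark the target in-sync, hand off primary context if relocating, and clean up. */
    void finalizeRecovery();
}
```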
Changes needed for recoveries of child shards
Online recovery of child shards is similar to peer recovery for a primary relocation, but differs in that multiple target shards with different shard IDs have to be recovered from a single source shard. To accommodate this, and to reuse the recovery mechanisms of all replication modes, we would need to re-arrange and define some abstractions and override them for in-place child shard recoveries. At a high level, these changes would be as follows:
InPlaceShardSplitRecoverySourceHandler: Like other replication modes, child shard recoveries would need a new source handler to drive the recoveries of all child shards. The send-files step of Phase 1 of the recovery process in this case will consist of hardlinking segments from the source shard directory into the child shard directories, and then splitting the segment files under each child shard directory using ShardSplittingQuery. As we saw in the previous section, some steps of recovery differ between replication modes; to handle this we would need to define abstractions which the child recovery can use to delegate to the respective replication mode handler (a rough sketch of such abstractions is shown below).
InPlaceShardSplitRecoveryTargetHandler: We will need a new target handler to handle actions triggered by the source handler and execute them on all child shards. Following are some of the events which should be handled by the new target handler:
One such event is the rejection of operations after handoff: once the shard starts operating in parent mode, write operations that still land on it are rejected and a ShardInParentModeException is sent back to the coordinator node to re-drive the bulk operations based on the new operation routing. It is possible that the coordinator is still serving from a stale cluster state; in this case, the coordinator will re-drive the operations on the parent shard again. The coordinator will eventually receive the operations back for re-drive and, if the request has not yet timed out and the new cluster state is by then available on the coordinator, they will be re-driven on the child shards.
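A rough sketch of how these source/target handler abstractions could be shaped is given below; the interfaces and method names are illustrative assumptions, not the actual InPlaceShardSplitRecoverySourceHandler / TargetHandler APIs.

```java
import java.util.List;

/**
 * Illustrative sketch of the abstractions described above; names are
 * assumptions for illustration only.
 */
final class InPlaceSplitRecoverySketch {

    /** Drives recovery of all child shards from a single parent (source) shard. */
    interface SourceHandlerSketch {
        /** Phase 1: hardlink parent segments into each child directory, then split them with ShardSplittingQuery. */
        void hardLinkAndSplitSegments(List<Integer> childShardIds);

        /** Delegates replication-mode-specific steps (DocRep / SegRep / remote store) to the respective handler. */
        void delegateToReplicationModeHandler();

        /** Finalize: add child shards to the parent's replication group and hand over traffic. */
        void finalizeSplit(List<Integer> childShardIds);
    }

    /** Applies actions triggered by the source handler on every child shard. */
    interface TargetHandlerSketch {
        /** Prepare each child shard engine to receive replayed/replicated operations. */
        void prepareChildShardsForOps(List<Integer> childShardIds);

        /** Apply an operation on the child shard that owns it; non-belonging ops are filtered (next section). */
        void indexOnOwningChild(String docId, byte[] source);

        /**
         * After handoff, the parent rejects incoming operations with
         * ShardInParentModeException so the coordinator re-drives them using
         * the new operation routing.
         */
        void rejectOnParent();
    }
}
```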
Filtering out non-belonging operations on child shards
Since each child shard will own a part of the hash space of the parent shard, and since in the recovery flow all operations will be routed from the parent primary to the tracked child shards, we will need to filter out, on each child shard, the operations that do not fall into that child shard's hash space. The additional compute due to filtering will only be incurred during recoveries - peer or store based. The two components where we will need to filter operations are the replay of operations on child shards during recovery, and the live operations replicated from the parent primary once the child shards are added to its replication group.
The compute cost of identifying whether an operation falls into the hash space of a shard depends on the number of times the shard has been split. For example, if a shard was never split, its filtering compute cost will be O(1); if a shard was split n times, the cost of finding the right child shard after resolving the parent shard ID from the doc ID hash would be O(log(n)).
It should be noted that we also need to update the shard's local checkpoint for the filtered-out operations, so that child shards can be marked as in-sync in the parent's replication group during recovery based on their checkpoints. There are three categories of operations - INDEX, DELETE and NO_OP (CREATE is deprecated). When a non-belonging INDEX operation is received, we create a new NO_OP operation and execute it in place of the INDEX op. With this, the checkpoint also gets updated on the shard for the non-belonging operation. We don't need special handling for a DELETE op, since it is already marked as a no-op if the document is not found on the shard.
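A minimal sketch of this filtering rule is shown below; the types and method names are simplified stand-ins for the engine-level operation classes, assuming a hash-range ownership check like the one described above.

```java
/**
 * Illustrative sketch of filtering non-belonging operations on a child shard.
 * Operation, HashRange and the seqNo bookkeeping are simplified stand-ins for
 * the actual engine classes; the point is the INDEX -> NO_OP conversion that
 * keeps the local checkpoint advancing for filtered operations.
 */
final class ChildShardOpFilter {

    enum OpType { INDEX, DELETE, NO_OP }

    record Operation(OpType type, String id, long seqNo) { }

    /** Inclusive lower bound, exclusive upper bound of the hash space owned by this child shard. */
    record HashRange(int fromInclusive, int toExclusive) {
        boolean contains(int hash) {
            return hash >= fromInclusive && hash < toExclusive;
        }
    }

    private final HashRange ownedRange;

    ChildShardOpFilter(HashRange ownedRange) {
        this.ownedRange = ownedRange;
    }

    /** Stand-in for the Murmur3-based routing hash. */
    private static int hash(String id) {
        return id.hashCode();
    }

    /**
     * Returns the operation to actually execute on this child shard.
     * A non-belonging INDEX becomes a NO_OP with the same seqNo so the local
     * checkpoint still advances; DELETE needs no special handling because a
     * missing document is already treated as a no-op by the engine.
     */
    Operation filter(Operation op) {
        boolean belongs = ownedRange.contains(hash(op.id()));
        if (!belongs && op.type() == OpType.INDEX) {
            return new Operation(OpType.NO_OP, op.id(), op.seqNo());
        }
        return op;
    }
}
```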
Operation Routing Changes
Currently, an operation received on a coordinator node is routed to the shard it belongs to by computing the shard ID from the operation's doc ID and routing (if provided). A Murmur3 32-bit hash is generated using these operation parameters and the routing partition size. A floor mod is then computed between the hash and the number of shards (over-simplified for understanding) to figure out the shard ID. An index setting called number_of_routing_shards is used to resize an index, and it imposes certain constraints on the number of shards of the target index. For example, an index having 5 initial shards can only be split into indices whose number of shards is a multiple of 5. Also, there is a limit on how many times an index can be split.
To route operations to child shards, we will start maintaining hash ranges on child shards as child shard metadata. We will continue to use mod-based resolution for seed shards (the initial shards computed during index creation). This means that there won't be any change in the routing execution if a shard isn't split. To split a shard, the hash range of the shard will be divided into the number of splits required. After a split is completed, for an incoming operation, the seed shard ID will be computed first by taking the mod of the operation hash against the number of seed shards, and then a binary search will be performed on the ranges of the child shards to find the child shard whose range accommodates the given hash.
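A minimal sketch of this two-step resolution, assuming a simplified stand-in for the Murmur3 routing hash and a sorted list of child hash ranges per seed shard:

```java
import java.util.List;

/**
 * Illustrative sketch of the proposed routing: mod-based resolution to a seed
 * shard, then binary search over the child shards' hash ranges. ChildRange and
 * hash() are simplified stand-ins, not the actual OperationRouting code.
 */
final class SplitAwareRouter {

    /** Child shard owning [from, to) of the seed shard's hash space; list sorted by 'from'. */
    record ChildRange(int from, int to, int shardId) { }

    /** Stand-in for the Murmur3 routing hash of the doc id / routing value. */
    private static int hash(String routingKey) {
        return routingKey.hashCode();
    }

    /**
     * Resolves the shard ID for an operation. 'childRangesBySeed' holds, for each
     * seed shard, its sorted child ranges; an empty list means the seed shard was
     * never split and keeps the default mod-based routing.
     */
    static int resolveShardId(String routingKey, int numSeedShards, List<List<ChildRange>> childRangesBySeed) {
        int h = hash(routingKey);
        int seedShard = Math.floorMod(h, numSeedShards);
        List<ChildRange> ranges = childRangesBySeed.get(seedShard);
        if (ranges.isEmpty()) {
            return seedShard; // unsplit shard: routing is unchanged
        }
        // Binary search for the range containing h: O(log n) in the number of child shards.
        int lo = 0, hi = ranges.size() - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            ChildRange r = ranges.get(mid);
            if (h < r.from()) {
                hi = mid - 1;
            } else if (h >= r.to()) {
                lo = mid + 1;
            } else {
                return r.shardId();
            }
        }
        throw new IllegalStateException("hash " + h + " not covered by any child range");
    }
}
```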
As we saw earlier, this change has no impact on shards which haven't been split. We carried out micro-benchmarks and explored alternative approaches to route operations belonging to child shards; details are present in this Design doc of the routing algorithm. The overall impact of operation routing on child shards is negligible (<<1% of indexing latency), and the distribution quality of the routing is on par with the default routing algorithm. The compute cost of finding the shard ID of a child shard is O(log(N)), where N is the number of child shards under the seed shard. For example, if a shard is split 10 times, each time into 2 shards down the tree, then the compute cost incurred for operations on these shards would be log2(20) ≈ 4.3. There are other benefits of using binary search over the mod-based approach for child shards, which are listed in the linked doc.
Shard Routing Changes
With new shards being created in an index against the parent shard, routing changes will be required to create new shard routings based on the split request parameters and the current allocations. Fundamentally, we will reuse the concept of the sticky shard routing created for a relocating primary shard. To split a shard into n child shards, we will create n child shard routings and stick them to the parent shard routing. In the end, the child shard routings will be added to the index shard routing table, the sticky routings will be removed from the parent routing, and the parent primary will be removed from the table.
We will also need to validate whether splitting a shard into n shards is possible under the current allocation constraints. Before adding the split routings to the cluster state, we will execute the allocation deciders to check that all allocation constraints are satisfied in the cluster. We will only proceed with the split if we don't get a NO decision from any allocation decider.
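A minimal sketch of that validation step, assuming a simplified decider interface rather than the actual allocation decider API:

```java
import java.util.List;

/**
 * Illustrative sketch of validating a split against allocation constraints.
 * Decider and Decision are simplified stand-ins for the actual allocation
 * decider framework; the point is that a single NO vetoes the split.
 */
final class SplitAllocationCheck {

    enum Decision { YES, NO, THROTTLE }

    interface Decider {
        Decision canAllocateChildShard(int parentShardId, int childShardId);
    }

    /** Returns true only if no decider returns NO for any of the proposed child shards. */
    static boolean canSplit(int parentShardId, List<Integer> childShardIds, List<Decider> deciders) {
        for (int childShardId : childShardIds) {
            for (Decider decider : deciders) {
                if (decider.canAllocateChildShard(parentShardId, childShardId) == Decision.NO) {
                    return false; // any NO decision blocks the split
                }
            }
        }
        return true;
    }
}
```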
Replica Recoveries
In phase 1 of this feature, we don't plan to split a shard of an index which has replicas configured. This means that in order to split a shard of an index with replica count > 0, the user will need to change the replica count to 0. Naturally, there's an availability impact with this; therefore, we will launch this feature as experimental in phase 1 and work on replica recoveries in phase 2. Please note that the high-level approach for recovering replicas still requires some research for validation and may change in the future depending on new findings.
Change in Number of Shards
So far we don’t have any use case in OpenSearch where we ever change the number of shards of an index. But with In-Place split feature, number of shards can dynamically change. Due to this, we may not have a serving shard for an ID between 0 and number of shards (n). In OpenSearch and some of the plugins, there are code blocks in which we iterate from 0 to n-1 to figure out shard IDs and then take respective actions on top of them. With shard split, we need to stop this and instead use
servingShardIDs
from index metadata to figure out current serving shards. Also, there may be cases where plugins cannot support dynamic sharding in their current state which may either be due to the way they operate but can potentially support dynamic shards or they cannot strictly allow it ever in between their workflow executions. CCR is one such plugin which cannot function with dynamic shard IDs on an index since it has 1-1 mapping between shards of an index on follower and leader clusters.To handle all of the above cases we will need to expose the capability to plugins to enable/disable splitting of shards in-place. To achieve this, we will take a predicate supplier from installed plugins which can make the core aware of whether in-place split can be allowed at runtime for the index passed in the predicate.
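A minimal sketch of both changes, assuming hypothetical servingShardIds() metadata accessors and a hypothetical plugin extension point (the real method and extension names may differ):

```java
import java.util.Set;
import java.util.function.Predicate;

/**
 * Illustrative sketch only: IndexMetadataView and SplitAwarePlugin are
 * hypothetical stand-ins for the index metadata and the plugin extension point.
 */
final class DynamicShardIdExamples {

    interface IndexMetadataView {
        String indexName();
        int numberOfShards();
        /** Serving shard IDs; may have gaps once parents are split away. */
        Set<Integer> servingShardIds();
    }

    /** Old pattern: assumes shard IDs are contiguous from 0 to n-1. */
    static void visitShardsLegacy(IndexMetadataView index) {
        for (int shardId = 0; shardId < index.numberOfShards(); shardId++) {
            System.out.println("acting on shard " + shardId); // breaks once IDs have gaps
        }
    }

    /** New pattern: iterate only over shards that are actually serving. */
    static void visitServingShards(IndexMetadataView index) {
        for (int shardId : index.servingShardIds()) {
            System.out.println("acting on shard " + shardId);
        }
    }

    /** Hypothetical plugin hook: a predicate telling core whether in-place split is allowed for an index. */
    interface SplitAwarePlugin {
        Predicate<IndexMetadataView> inPlaceSplitAllowedPredicate();
    }

    /** Example: a CCR-like plugin that never allows in-place splits on indices it follows. */
    static SplitAwarePlugin ccrLikePlugin(Set<String> followedIndices) {
        return () -> index -> !followedIndices.contains(index.indexName());
    }
}
```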
New Split Shard API
We will expose a new API for this feature. The exact format of the API endpoint will be shared in the PR itself, but at a high level the API will require the index name, the shard ID and the number of splits to trigger the split of a shard. For the new API implementation, we will add a new REST action, transport action and metadata service action.
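Purely as an illustration of the three inputs named above (the class name and validation here are hypothetical; the actual endpoint format will be defined in the PR), the request could carry:

```java
/**
 * Hypothetical request object illustrating the inputs the split-shard API needs.
 * The actual endpoint format and class names will be defined in the PR.
 */
record InPlaceSplitShardRequest(String index, int shardId, int numberOfSplits) {

    InPlaceSplitShardRequest {
        if (numberOfSplits < 2) {
            throw new IllegalArgumentException("a shard must be split into at least 2 child shards");
        }
    }
}
```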
Scope
Initially, we will target splitting a shard in the SegRep + remote store replication mode. We will deliver the work in phases, starting with this replication mode. Another reason for scoping the work initially to remote store replication is that it was rolled out recently as a durable alternative and we have observed significantly better performance in this mode than in DocRep.
Related component
Indexing:Replication
Thanks @shwetathareja for the detailed review and suggestions!