[RFC] Design Proposal routing algorithm for splitting shards In-place #13925
Labels: Indexing:Replication, Indexing, RFC, Roadmap:Cost/Performance/Scale, Scale
Introduction
In RFC #12918, we discussed potential approaches for splitting a shard in-place and requested feedback on them, as well as on use cases which may need to be additionally addressed while designing in-place shard split. We published the design for splitting a shard in-place in issue #13923, where we raised the need for devising a mechanism to route operations on child shards and explained the operation routing briefly. In this issue, we will talk about how operations are routed in OpenSearch, delve into different operation routing approaches for in-place shard splitting, discuss their pros and cons, and finally add benchmark results comparing the performance of each approach.
Operation Routing In OpenSearch
Currently, the document routing algorithm in OpenSearch is governed by the following formulae:

$shard\_num = \frac{hash(\_routing)\ \%\ num\_routing\_shards}{routing\_factor}$

$routing\_factor = \frac{num\_routing\_shards}{num\_primary\_shards}$

Here, `shard_num` is computed from the effective routing value, which is derived from the document's `_id` and is termed here as `_routing`. Custom routing patterns can be implemented by specifying a custom routing value per document. Let's try to understand the algorithm visually.
Our hash function, i.e. murmur hash, has a practical key range from $0$ to $2^{31}-1$, but by default we restrict the range to 1024, i.e. the allowable max number of shards on a single node, so that if we divide our hash space into 1024 parts, each part can be assigned to a single shard.
When we create P primary shards, the document hash space needs to be divided into equal parts. The default upper limit on the number of routing shards, i.e. 1024, may not be divisible by P, which could result in an unequal hash space distribution.
The value 1024 is nothing but the default number of partitions of the document hash space, which is frozen during index creation. A shard of an index is assigned a subset of partitions. For example, if an index is created with 8 shards then each shard consists of 128 partitions. Any scaling of shards from x to y in resize actions like shrink index and split index happens based on the number of partitions. For uniform distribution of the hash space, all shards should have an equal number of partitions; otherwise it would result in a collision of keys according to the pigeonhole principle. Therefore, the number of partitions is the largest value of the form $2^n \cdot P$ that is less than or equal to 1024. For example, for an index consisting of 5 shards, the number of partitions assigned to each shard would be 128 (640 in total). This total number of partitions is called the number of routing shards of an index.
In resize actions of OpenSearch like splitting an index, where a new index is created and recovered from an existing index, the number of partitions is recalculated. As we saw earlier, partitions should always be equally distributed among shards, so the number of shards of the new index can only be a number exactly divisible by the number of shards of the existing index. In addition, the new number of shards should be less than the number of routing shards of the existing index. The mathematical relation between at most n splits of a shard of an index having P primary shards and the upper limit of routing shards equal to 1024 can be expressed as $2^n \cdot P \le 1024$.
Similarly, the number of partitions assigned to each shard can be expressed as: $num\_routing\_shards\ (RS) / num\_primary\_shards\ (P) = (2^n \cdot P)/P = 2^n$
This is called the routing factor: $routing\_factor\ (RF) = 2^n$
Once we have RS and RF, it is easy to figure out the shard number by finding the shard whose partition range contains the hash of the routing value, given by $shard\_id = \frac{hash(\_routing)\ \%\ RS}{RF}$
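To make the relationship between RS, RF and the resulting shard id concrete, here is a minimal, self-contained Java sketch of the calculation described above. It is illustrative only: the class and method names are made up, and a simple stand-in hash is used where OpenSearch actually applies `Murmur3HashFunction` to the `_routing` value.

```java
import java.nio.charset.StandardCharsets;

public class RoutingSketch {

    // Default number of routing shards: the largest value of the form 2^n * P
    // that does not exceed the default cap of 1024.
    static int numRoutingShards(int numPrimaryShards) {
        int rs = numPrimaryShards;
        while (rs * 2 <= 1024) {
            rs *= 2;
        }
        return rs;
    }

    // shard_id = (hash(_routing) % RS) / RF, where RF = RS / P.
    static int shardId(String routing, int numPrimaryShards) {
        int rs = numRoutingShards(numPrimaryShards);
        int rf = rs / numPrimaryShards;
        return Math.floorMod(hash(routing), rs) / rf;
    }

    // Stand-in for OpenSearch's murmur3-based hash of the _routing value;
    // any reasonably uniform 32-bit hash illustrates the mechanics.
    static int hash(String routing) {
        int h = 0;
        for (byte b : routing.getBytes(StandardCharsets.UTF_8)) {
            h = 31 * h + b;
        }
        return h;
    }

    public static void main(String[] args) {
        // With P = 8: RS = 1024 and RF = 128, so each shard owns 128 partitions.
        System.out.println("RS = " + numRoutingShards(8) + ", shard = " + shardId("doc-42", 8));
    }
}
```

For 8 primary shards this prints RS = 1024, matching the 128-partitions-per-shard example above; for 5 primary shards it gives RS = 640 and RF = 128.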
Describe the solution you'd like
Operation routing design requirements for in-place shard-split:
Approaches
We will go over two approaches and discuss how each of them can effectively route an operation on child shards.
Approach 1 - Routing using Shard Hash space [Recommended]
In this approach each seed shard (a shard present at the time of index creation) is assigned a hash space. We are using the `murmurhash3_x86_32` algorithm to calculate the hash of `_routing` before deducing the respective routing shard. This means that hash values, viewed as unsigned 32-bit values, can only lie within the `0x00000000` - `0xffffffff` range. Therefore, each seed shard is assigned a hash space from `0` to `2^32 - 1`. Whenever a shard split happens, its hash space gets divided equally between the child shards. Let's consider a mental model where shards are arranged according to the start and end values of their acceptable hash values.

If a 1:4 split is performed on a primary shard, the resulting key space partition will look like this: hashes between `0x00000000` - `0x3FFFFFFF` will get routed to child shard 0, hashes between `0x40000000` - `0x7FFFFFFF` will get routed to child shard 1, and so on. We can further split any of the child shards; for example, if we split child shard 2 above into two shards, then the hash space partitions of the two new shards will be `0x80000000` - `0x9FFFFFFF` and `0xA0000000` - `0xBFFFFFFF` respectively. Information about these "high" and "low" values needs to be saved in the cluster state.

Mathematically, the routing algorithm can be defined as follows:
Note: Each seed shard maintains its own hash space and the distribution of that hash space among its child shards.

A hash space is defined as a range $Range_i = [l_i, r_i)$. Each shard has its own hash range and the ranges are comparable, i.e. $Range_i < Range_j \iff l_i < l_j$. The ranges of the shards created from a seed shard are kept sorted:

$seed\_shard\_ranges = Sorted\{Range_{child\_shard_1}, Range_{child\_shard_2}, \dots\}$

Objective: find the hash range which contains our hash value. Since the shard ranges are sorted, we can find the range whose starting point is just less than or equal to the hash value by using binary search.

We define a function $ceil(\{x_1, x_2, x_3, \dots\}, x)$ which outputs the index $i$ such that the value $x_i \in \{x_1, x_2, x_3, \dots\}$ is just less than or equal to $x$; it is based on binary search and has a complexity of $O(\log_2(len(\{x_1, x_2, x_3, \dots\})))$.
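As a concrete illustration of that function (not the actual OpenSearch code), the following Java sketch runs the binary search over the sorted lower bounds $l_i$ of the ranges:

```java
// Given the sorted lower bounds l_i of a seed shard's child ranges, return the index
// of the range whose start is the greatest value <= hash, i.e. the range containing
// the hash. Runs in O(log2(n)); assumes the first lower bound is 0.
static int containingRangeIndex(long[] sortedLowerBounds, long hash) {
    int lo = 0, hi = sortedLowerBounds.length - 1, result = 0;
    while (lo <= hi) {
        int mid = (lo + hi) >>> 1;
        if (sortedLowerBounds[mid] <= hash) {
            result = mid;   // candidate; a later range might still start <= hash
            lo = mid + 1;
        } else {
            hi = mid - 1;
        }
    }
    return result;
}
```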
Advantages:
Disadvantages:
Implementation details:
In OpenSearch we'll maintain the keyspace of each shard as SplitMetadata, which keeps a map of shard id to the range assigned to that specific shard. SplitMetadata also contains a flat sorted set of keyspace ranges assigned to each split shard per seed shard, which makes looking up the shard id a hash belongs to a much more efficient operation using binary search. We keep track of only those shards which have been split or are created as a result of a split in SplitMetadata, and therefore in the case of no split there is no overhead increase in index metadata. During a shard split, i.e. while recovery is happening, we don't keep child shard keyspace ranges in the sorted set but maintain them in an ephemeral list of keyspace ranges in the parent shard metadata. Once the recovery is complete and the shard split is marked completed, we remove the ephemeral child shard IDs and add them to the maintained sorted set for routing of documents.
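As a rough sketch of the bookkeeping described above (the class and method names below are illustrative, not the actual SplitMetadata implementation), a sorted map from range lower bound to shard id per seed shard makes the hash-to-shard lookup a single floor lookup:

```java
import java.util.Map;
import java.util.TreeMap;

// Illustrative stand-in for the per-seed-shard split bookkeeping described above.
class SeedShardSplitInfo {
    // Lower bound of each child shard's keyspace range -> child shard id.
    // Empty for a seed shard that has never been split.
    private final TreeMap<Long, Integer> lowerBoundToShardId = new TreeMap<>();

    // Record the child ranges once a split of [start, end) into the given children completes.
    void addCompletedSplit(long start, long end, int[] childShardIds) {
        long width = (end - start) / childShardIds.length;
        for (int i = 0; i < childShardIds.length; i++) {
            lowerBoundToShardId.put(start + i * width, childShardIds[i]);
        }
    }

    // Resolve a hash (0 .. 2^32 - 1) to the owning shard, falling back to the
    // seed shard itself when no split has been recorded.
    int resolveShard(long hash, int seedShardId) {
        Map.Entry<Long, Integer> entry = lowerBoundToShardId.floorEntry(hash);
        return entry == null ? seedShardId : entry.getValue();
    }
}
```

Splitting a seed shard over the full range [0x00000000, 0x100000000) into four children and then splitting the child that owns [0x80000000, 0xC0000000) again reproduces the hex boundaries from the example above.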
Approach 2 - "Recursive Routing"
Based on our current routing algorithm in OpenSearch, the routing terms are expressed as follows:

$RS = 2^n \cdot P, \qquad RF = \frac{RS}{P} = 2^n, \qquad shard\_id = \frac{hash(\_routing)\ \%\ RS}{RF}$
Consider that shard $p_i$ gets split into $P_i$ shards; we'll calculate new routing parameters, i.e. RS & RF, for the child shards.

Now, to route a document, we'll first find out which of the initial shards the doc would get routed to using the equation above. Assume the doc got routed to shard $p_i$; then, since shard $p_i$ has been split into $P_i$ shards, we'll use the new routing parameters to calculate the actual shard ($p_{ij}$) to which the doc gets routed.

This pattern continues for further splits. Say the shard $p_{ij}$ gets split into $P_j$ shards; then we'll again derive new routing parameters for that level, and at runtime the doc will get routed to shard $p_{ijk}$.
Base conditions:
- If $RF_k == 1$ for any shard $p_k$, then we won't allow further splits of that shard.
- Also, if some shard $p_k$ is a terminal shard, i.e. it hasn't been split, then the above described routing algorithm terminates at that shard.
High level visual representation
In the following figure, partitions are getting distributed evenly and we’re maintaining the routing parameters at each level.
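Before looking at the trade-offs, here is a rough Java sketch of how this recursive lookup could be structured, assuming each split shard stores the (RS, RF) parameters for its own children; all class and field names are illustrative and not part of the actual proposal's code.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of Approach 2: routing parameters are kept per split level and
// the lookup descends until it reaches a terminal (never split) shard.
class RecursiveRouter {

    static class ShardNode {
        final int shardId;
        int childRoutingShards;   // RS for this shard's children, if it has been split
        int childRoutingFactor;   // RF for this shard's children, if it has been split
        final Map<Integer, ShardNode> children = new HashMap<>(); // child ordinal -> child shard

        ShardNode(int shardId) {
            this.shardId = shardId;
        }
    }

    // Pick the seed shard using the index-level RS/RF, then keep re-routing with the
    // per-level parameters while the selected shard has itself been split.
    static int route(Map<Integer, ShardNode> seedShards, int indexRS, int indexRF, int hash) {
        ShardNode current = seedShards.get(Math.floorMod(hash, indexRS) / indexRF);
        while (!current.children.isEmpty()) {
            int ordinal = Math.floorMod(hash, current.childRoutingShards) / current.childRoutingFactor;
            current = current.children.get(ordinal);
        }
        return current.shardId;
    }
}
```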
Advantages:
Disadvantages:
Benchmarks
Setup:
Host type: m4.2xlarge
OS: Amazon Linux 2 x86_64
Region: us-west-2
vCPUs: 8
Memory: 32 GB
Architecture: x86_64
Parameters:
Number of seed shards: 1
Number of splits per shard: 2
Depth, i.e. total number of split operations performed per seed shard - 1: [0, 10)
Number of documents routed (numDocs): [0, 100000]
Benchmark mode: avgt, i.e. average time taken per routing operation in nanoseconds
Number of threads used: 1

Approach 1 → DocRoutingBenchmark.routeDocsRange
Approach 2 → DocRoutingBenchmark.routeDocsRecurring
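For reference, a JMH harness with the parameters above would typically be declared along these lines; the benchmark bodies are placeholders here, since the actual routing calls are specific to the two implementations being compared.

```java
import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Param;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.annotations.Threads;

@State(Scope.Benchmark)
@BenchmarkMode(Mode.AverageTime)      // avgt: average time per routing operation
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Threads(1)
public class DocRoutingBenchmark {

    @Param({"0", "3", "6", "9"})      // depth, sampled from [0, 10): number of successive 1:2 splits
    public int depth;

    @Param({"100000"})                // number of documents routed per invocation
    public int numDocs;

    @Benchmark
    public long routeDocsRange() {
        long acc = 0;
        for (int i = 0; i < numDocs; i++) {
            acc += i;                 // placeholder for resolving doc i's shard via Approach 1
        }
        return acc;                   // return the result to avoid dead-code elimination
    }

    @Benchmark
    public long routeDocsRecurring() {
        long acc = 0;
        for (int i = 0; i < numDocs; i++) {
            acc += i;                 // placeholder for resolving doc i's shard via Approach 2
        }
        return acc;
    }
}
```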
At a high level, we split a single shard repeatedly into two and compare the latency degradation at different depths, where depth increases with the number of splits.
Benchmark numbers for Approach 1 Vs Approach 2
Is hash space distribution uniform for child shards?
To assess the quality of the distribution of keys among shards, we define the hash distribution quality as

$$\frac{\sum_{j=0}^{m-1} \frac{p_j (p_j + 1)}{2}}{\left(\frac{n}{2m}\right)(n + 2m - 1)}$$

where $p_j$ is the number of doc ids in the j-th primary shard, m is the number of shards, and n is the total number of doc ids. The sum of $p_j(p_j + 1)/2$ estimates the number of slots that need to be visited to find a specific doc id. The denominator $(n/2m)(n + 2m - 1)$ is the number of visited slots for an ideal function that puts each doc id into a random shard. So, if the function is ideal, the formula should give 1. In reality, a good distribution is somewhere between 0.95 and 1.05.
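The quality metric can be computed directly from per-shard document counts; the helper below (not part of the OpenSearch code base) evaluates the formula above.

```java
// Hash distribution quality: sum_j p_j*(p_j+1)/2 divided by (n/2m)*(n+2m-1), where
// p_j is the doc count of shard j, m the number of shards and n the total doc count.
// Values close to 1.0 indicate a near-ideal, random-like distribution.
static double hashQuality(long[] docsPerShard) {
    double numerator = 0.0;
    long n = 0;
    for (long p : docsPerShard) {
        n += p;
        numerator += p * (p + 1) / 2.0;
    }
    double m = docsPerShard.length;
    double denominator = (n / (2.0 * m)) * (n + 2.0 * m - 1.0);
    return numerator / denominator;
}
```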
Simple benchmark
Baseline: 5 primary shards with no split, 100k keys using existing OpenSearch routing
Candidate 1: 5 primary shards split by a factor of 2, 100k keys using Approach 1
Candidate 2: 5 primary shards split by a factor of 2, 100k keys using Approach 2
hash quality of baseline: 1.02
hash quality of candidate 1: 1.03
hash quality of candidate 2: 1.03
Related component
Indexing:Replication
Thanks @vikasvb90 for all the help in refining the approaches as well as testing and benchmarking them.