
[RFC] Design Proposal routing algorithm for splitting shards In-place #13925

Open

CaptainDredge opened this issue Jun 3, 2024 · 1 comment

Labels: Indexing:Replication, Indexing, RFC, Roadmap:Cost/Performance/Scale, Scale

CaptainDredge (Contributor) commented on Jun 3, 2024
Introduction

In RFC #12918, we discussed potential approaches for splitting a shard in-place and requested feedback on them, as well as on use cases which may need to be additionally addressed while designing in-place shard split. We published the design for splitting a shard in-place in issue #13923, where we raised the need for devising a mechanism to route operations on child shards and explained the operation routing briefly. In this issue, we will talk about how operations are routed in OpenSearch, delve into different operation routing approaches for in-place shard splitting, discuss their pros and cons, and finally add benchmark results comparing the performance of each approach.

Operation Routing In OpenSearch

Currently, the document routing algorithm in OpenSearch is governed by the following formulae:

$$routing\_factor(RF) = \frac{num\_routing\_shards(RS)}{num\_primary\_shards(P)}$$

$$shard\_num = \frac{hash(\_routing) \% num\_routing\_shards(RS)}{routing\_factor(RF)}$$

Here, shard_num is the effective routing value, derived from the document’s _id, which is termed here as _routing. Custom routing patterns can be implemented by specifying a custom routing value per document.

Let’s try to understand the algorithm visually

[image: visual illustration of the routing algorithm over the document hash space]

Our hash function, murmur hash, has a practical key range from $0$ to $2^{31}-1$, but we restrict the range to 1024 partitions by default, i.e. the default upper limit on the number of routing shards, so that if we divide our hash space into 1024 parts, each part can be assigned to a single shard.

When we create P primary shards, the document hash space needs to be divided into equal parts. The default upper limit on the number of routing shards, i.e. 1024, may not be divisible by P, which could result in an unequal hash space distribution.
The value 1024 is nothing but the default number of partitions of the document hash space, which is frozen during index creation. A shard of an index is assigned a subset of partitions. For example, if an index is created with 8 shards, then each shard consists of 128 partitions. Any scaling of shards from x to y in resize actions such as shrink index and split index happens based on the number of partitions. For uniform distribution of the hash space, all shards should have an equal number of partitions; otherwise it would result in a collision of keys by the pigeonhole principle. Therefore, the total number of partitions is the highest multiple of P less than or equal to 1024. For example, for an index consisting of 5 shards, the number of partitions assigned to each shard would be 204. This total is called the number of routing shards of an index.
In resize actions of OpenSearch, such as splitting an index, where a new index is created and recovered from an existing index, the number of partitions is recalculated. As we saw earlier, partitions should always be equally distributed among shards, so the number of shards of the new index can only be a number exactly divisible by the number of shards of the existing index. In addition, the new number of shards should be less than the number of routing shards of the existing index. The mathematical relation between at most n splits of a shard of an index having P primary shards and an upper limit of routing shards equal to 1024 can be expressed as $2^n*P < 1024$.

Similarly, the number of partitions assigned to each shard can be expressed as: $num\_routing\_shards(RS)/num\_primary\_shards(P) = (2^n*P)/P = 2^n$

This is called the $routing\_factor(RF) = 2^n$

Once we have RS and RF, it is easy to figure out the shard number, i.e. the shard in which the hash of the routing value should lie, given by $shard\_id = \frac{hash(\_routing) \% RS}{RF}$
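For concreteness, here is a minimal sketch of the formula above in Java. This is illustrative only, not the actual OpenSearch implementation; the murmur3 hash of _routing is assumed to be computed elsewhere and passed in.

```java
// Minimal sketch of the current routing formula. Illustrative only,
// not the actual OpenSearch implementation.
public final class LegacyRouting {

    /**
     * @param hash             murmur3 hash of the _routing value (assumed precomputed)
     * @param numRoutingShards RS, frozen at index creation
     * @param numPrimaryShards P
     * @return the shard number the document is routed to
     */
    public static int shardNum(int hash, int numRoutingShards, int numPrimaryShards) {
        int routingFactor = numRoutingShards / numPrimaryShards;   // RF = RS / P
        int partition = Math.floorMod(hash, numRoutingShards);     // hash(_routing) % RS, kept non-negative
        return partition / routingFactor;                          // shard_num = partition / RF
    }

    public static void main(String[] args) {
        // Example: P = 8, RS = 1024 => RF = 128; a hash of 300 lands in partition 300, i.e. shard 300 / 128 = 2.
        System.out.println(shardNum(300, 1024, 8)); // prints 2
    }
}
```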

Describe the solution you'd like

Operation routing design requirements for in-place shard-split:

  1. There shouldn’t be any data loss before, during, or after a shard is split
  2. The key space (hash range) which the parent shard serves should be split/scaled such that it is uniformly distributed among the new shard(s) and the division is mutually exclusive

Approaches

We will go over two approaches and discuss how each of them can effectively route an operation on child shards.

Approach 1 - Routing using Shard Hash space [Recommended]

In this approach, each seed shard (a shard present at the time of index creation) is assigned a hash space. We use the murmurhash3_x86_32 algorithm to calculate the hash of _routing before deducing the respective routing shard, which means that hash values can only lie within the 0x00000000 - 0xffffffff range (a 32-bit value). Therefore, each seed shard is assigned a hash space from 0 to 2^32 - 1. Whenever a shard split happens, its hash space gets divided equally between the child shards. Let’s consider a mental model where shards are arranged according to the start and end values of their acceptable hash values.

If a 1:4 split is performed on a primary shard, the resulting key space partition will look like:

[image: parent shard hash space divided equally among child shards 0–3]

  • In this case, all docs whose hash falls between 0x00000000 - 0x3FFFFFFF will get routed to child shard 0, hashes between 0x40000000 - 0x7FFFFFFF will get routed to child shard 1, and so on.

We can further split any of the child shards. For example, say we split child shard 2 above into two shards; then the hash space partition will look like:

[image: hash space after further splitting child shard 2 into two new shards]

  • In this further split, child shard 2 will cease to exist, making way for shards 4 and 5, which will serve the ranges 0x80000000 - 0x9FFFFFFF and 0xA0000000 - 0xBFFFFFFF respectively. The “high” and “low” values of these ranges need to be saved in the cluster state.
  • Note that the high value of each primary shard's key space is always increasing, so one optimization we'll perform is binary searching over these values to efficiently get the routing_shard. The necessary information needs to be pre-computed and stored in the cluster state as well; otherwise the cost of constructing it on each request would gravely outweigh the benefits of binary search.

Mathematically, the routing algorithm can be defined as follows:

$routing\_factor(RF) = \frac{num\_routing\_shards(RS)}{num\_primary\_shards(P)}$
$seed\_shard\_id = \frac{hash(\_routing) \% num\_routing\_shards(RS)}{routing\_factor(RF)}$

Note: Each seed shard maintains its own hash space and the distribution of that hash space among its child shards

A hash space is defined as $Range_i = [l_i, r_i)$
Each shard has its own hash range, and ranges are comparable, i.e. $Range_i < Range_j \iff l_i < l_j$
$seed\_shard\_ranges = Sorted\{Range_{child\_shard_1}, Range_{child\_shard_2}, \ldots\}$

Objective: To find the hash range which contains our hash value.
Since shard ranges are sorted, we can find the range whose starting point is just less than or equal to the hash value using binary search.
We define a function $ceil(\{x_1, x_2, x_3, \ldots\}, x)$ which outputs the index $i$ such that $x_i \in \{x_1, x_2, x_3, \ldots\}$ is the largest value less than or equal to $x$; it is based on binary search with a complexity of $O(\log_2(len(\{x_1, x_2, x_3, \ldots\})))$

$$shard\_id = ceil(\{ Range_{child\_shard_1}, Range_{child\_shard_2}, ... \}, hash(\_routing))$$
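To make the lookup concrete, here is a minimal Java sketch of the per-seed-shard range lookup, where a TreeMap keyed by each range's lower bound plays the role of the ceil/binary search above. Class and method names, and the unsigned-hash handling, are illustrative assumptions, not the actual implementation.

```java
import java.util.TreeMap;

// Sketch of Approach 1's lookup: each seed shard keeps a sorted structure of
// child-shard ranges keyed by the range's lower bound; routing a document is a
// floor lookup on the hash. Names are illustrative, not actual OpenSearch classes.
public final class HashRangeRouter {

    // Lower bound of each child shard's range -> child shard id.
    // Hashes are treated as unsigned 32-bit values held in a long (assumption).
    private final TreeMap<Long, Integer> rangeStartToShard = new TreeMap<>();

    public void assignRange(long inclusiveStart, int shardId) {
        rangeStartToShard.put(inclusiveStart, shardId);
    }

    /** Routes a 32-bit hash to the child shard whose range contains it. */
    public int route(int hash) {
        long unsignedHash = Integer.toUnsignedLong(hash);
        // floorEntry is the binary search: greatest range start <= hash.
        return rangeStartToShard.floorEntry(unsignedHash).getValue();
    }

    public static void main(String[] args) {
        // 1:4 split of a seed shard, as in the figure above.
        HashRangeRouter router = new HashRangeRouter();
        router.assignRange(0x00000000L, 0);
        router.assignRange(0x40000000L, 1);
        router.assignRange(0x80000000L, 2);
        router.assignRange(0xC0000000L, 3);
        System.out.println(router.route(0x7FFFFFFF)); // falls in shard 1
        System.out.println(router.route(0x90000000)); // unsigned value falls in shard 2
    }
}
```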

Advantages:

  • Does not limit the number of times a shard can be split in an existing index
  • Easier to understand and maintain
  • Faster if the depth of splits is higher than log base 2 of the number of shards. For example, if a shard gets split 3 times and at each level 2 child shards are produced, this approach performs better than Approach 2, since log base 2 of 4 shards, i.e. 2, is less than the depth, i.e. 3.

Disadvantages:

  • Performs slower if the depth of splits is lower than log base 2 of the number of shards.

Implementation details:

In OpenSearch we'll maintain the key space for each shard as SplitMetadata, which keeps a map of shard id to the range assigned to that shard. SplitMetadata also contains a flat sorted set of key space ranges assigned to each split shard per seed shard, which makes looking up the shard id that a hash belongs to a much more efficient operation via binary search. We keep track of only those shards which have been split or were created as a result of a split in SplitMetadata, so in the case of no split there is no overhead increase in index metadata. During a shard split, i.e. while recovery is happening, we don't keep the child shard key space ranges in the sorted set but maintain them in an ephemeral list of key space ranges in the parent shard's metadata. Once recovery is complete and the shard split is marked completed, we remove the ephemeral child shard entries and add their ranges to the maintained sorted set used for routing documents.
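A hypothetical sketch of that bookkeeping is shown below; the class and method names are illustrative only, not the actual SplitMetadata API, and it focuses on the promote-on-completion behaviour described above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Hypothetical sketch of split bookkeeping per seed shard: completed child
// ranges live in a sorted structure used for routing, while ranges of an
// in-progress split stay in an ephemeral list until recovery finishes.
// Names are illustrative, not the actual OpenSearch SplitMetadata.
final class SeedShardSplitMetadata {

    // Completed splits: range start -> child shard id (used for routing lookups).
    private final TreeMap<Long, Integer> completedRanges = new TreeMap<>();

    // In-progress split: child ranges are kept aside and ignored by routing.
    private final List<long[]> ephemeralChildRanges = new ArrayList<>();
    private final List<Integer> ephemeralChildShards = new ArrayList<>();

    void startSplit(long[][] childRanges, int[] childShardIds) {
        for (int i = 0; i < childShardIds.length; i++) {
            ephemeralChildRanges.add(childRanges[i]);   // each entry is {start, end}
            ephemeralChildShards.add(childShardIds[i]);
        }
    }

    // Called once recovery completes: promote ephemeral ranges into the routing set.
    void markSplitCompleted() {
        for (int i = 0; i < ephemeralChildShards.size(); i++) {
            completedRanges.put(ephemeralChildRanges.get(i)[0], ephemeralChildShards.get(i));
        }
        ephemeralChildRanges.clear();
        ephemeralChildShards.clear();
    }

    // Routing only ever sees completed ranges.
    int route(long unsignedHash) {
        return completedRanges.floorEntry(unsignedHash).getValue();
    }
}
```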

Approach 2 - “Recursive Routing”

Based on our current routing algorithm in OpenSearch, the routing terms are expressed as follows:

$P = num\_initial\_primary\_shards$
$n={floor(log_2(1024/P))}$
$RF = 2^n$
$RS=RF*P$
$hash\_value = hash(\_routing)$

Consider that shard $p_i$ gets split into $P_i$ shards; we'll calculate the new routing parameters, i.e. RS & RF, for the child shards:

$RF_i = 2^{n - log_2(P_i)}$
$RS_i = {RF_i}*P_i$

Now, to route a document, we'll first find out which of the initial shards the doc gets routed to, using the following equation. Assume the doc gets routed to shard $p_i$; then

$p_i = \frac{{hash\_value}\%{RS}}{RF}$

Since shard $p_i$ has been split into $P_i$ shards, we'll use the new routing parameters to calculate the actual shard ($p_{ij}$) to which the doc gets routed:

$p_{ij} = \frac{hash\_value\%RS_i}{RF_i}$

This pattern continues for further splits. Say shard $p_{ij}$ gets split into $P_j$ shards; then the new routing parameters are:

$RF_j = 2^{floor(log_2(\frac{RF_i}{P_j}))}$
$RS_j = {RF_j}*P_j$

At runtime the doc will get routed to shard $p_{ijk}$

$p_{ijk} = \frac{hash\_value\%RS_j}{RF_j}$

Base conditions:
If $RF_k == 1$ for any shard $p_k$, then we won't allow further splits of that shard.
Also, if some shard $p_k$ is a terminal shard, i.e. it hasn't been split, then the routing algorithm described above terminates at that shard.
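A sketch of this recursion is shown below, under the assumption of a hypothetical tree node per split shard that holds its own RS/RF; this is not the proposal's actual data structure.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of Approach 2: each split shard is a node that knows its own routing
// parameters (RS, RF) and its children; routing recurses until it reaches a
// terminal (never-split) shard. Hypothetical structure, illustrative only.
final class RecursiveRouter {

    static final class ShardNode {
        final int shardId;
        final int routingShards;   // RS at this level
        final int routingFactor;   // RF at this level
        // Child index within this split -> child node; empty map means terminal shard.
        final Map<Integer, ShardNode> children = new HashMap<>();

        ShardNode(int shardId, int routingShards, int routingFactor) {
            this.shardId = shardId;
            this.routingShards = routingShards;
            this.routingFactor = routingFactor;
        }
    }

    /** Recurse from the seed-shard level down to a terminal shard. */
    static int route(ShardNode node, int hash) {
        if (node.children.isEmpty()) {
            return node.shardId;   // terminal shard: the recursion stops here
        }
        // (hash % RS) / RF at this level picks the child to descend into.
        int childIndex = Math.floorMod(hash, node.routingShards) / node.routingFactor;
        return route(node.children.get(childIndex), hash);
    }
}
```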

High level visual representation
In the following figure, partitions are distributed evenly and the routing parameters are maintained at each level.

[image: split hierarchy with routing parameters (RS, RF) maintained at each level]

Advantages:

  • Faster in cases where the depth of splits is less than log base 2 of the number of shards.

Disadvantages:

  • Limits the number of times a shard can be split in an existing index created before the OpenSearch version that supports in-place shard split, since the number of routing shards of such an index is frozen at its default value of 1024
  • Performs slower if the depth of splits is higher than log base 2 of the number of shards.
  • Quite complex to understand and maintain.

Benchmarks

Setup:

HostType: m4.2xlarge
OS: Amazon Linux 2 x86_64
region: us-west-2
vCPUs: 8
Memory: 32Gb
Architecture: x86_64

Parameters:
  • Number of seed shards: 1
  • Number of splits per shard: 2
  • Depth, i.e. total number of split operations performed per seed shard - 1: [0, 10)
  • Number of documents routed (numDocs): [0, 100000]
  • Benchmark mode: avgt, i.e. average time taken per routing operation in nanoseconds
  • Number of threads used: 1

Approach 1 → DocRoutingBenchmark.routeDocsRange
Approach 2 → DocRoutingBenchmark.routeDocsRecurring

At a high level, we split a single shard repeatedly into two and compare latency degradation at different depths; the depth increases with the number of splits.
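For reference, a minimal JMH-style skeleton of how such a benchmark can be structured is shown below ("avgt" is JMH's average-time mode, reported here in nanoseconds per operation). This is illustrative only and not the actual DocRoutingBenchmark; the parameter names and the routing call are placeholders.

```java
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Param;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;

// Illustrative JMH skeleton for comparing the two routing approaches.
// Not the actual DocRoutingBenchmark from the proposal.
@State(Scope.Benchmark)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public class RoutingBenchmarkSketch {

    @Param({"0", "1", "2", "3", "4", "5", "6", "7", "8", "9"})
    public int depth;        // number of 1:2 splits applied to the single seed shard

    @Param({"1", "100", "1000", "10000", "100000"})
    public int numDocs;

    private String[] routingIds;

    @Setup
    public void setUp() {
        routingIds = new String[numDocs];
        for (int i = 0; i < numDocs; i++) {
            routingIds[i] = "doc-" + i;
        }
        // ...build the split tree / hash ranges for the configured depth here...
    }

    @Benchmark
    public int routeDocsRangeSketch() {
        int acc = 0;
        for (String id : routingIds) {
            acc += id.hashCode() & 0x7fffffff;   // placeholder for the actual range-based routing call
        }
        return acc;   // return a value so the JIT cannot eliminate the loop
    }
}
```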

Benchmark numbers for Approach 1 vs Approach 2:

| Benchmark | depth | numDocs | Mode | Score | Units |
|---|---|---|---|---|---|
| DocRoutingBenchmark.routeDocsRange | 0 | 1 | avgt | 484.304 | ns/op |
| DocRoutingBenchmark.routeDocsRange | 0 | 100 | avgt | 6922.807 | ns/op |
| DocRoutingBenchmark.routeDocsRange | 0 | 1000 | avgt | 74009.181 | ns/op |
| DocRoutingBenchmark.routeDocsRange | 0 | 10000 | avgt | 763362.939 | ns/op |
| DocRoutingBenchmark.routeDocsRange | 0 | 100000 | avgt | 8378460.97 | ns/op |
| DocRoutingBenchmark.routeDocsRange | 1 | 1 | avgt | 496.923 | ns/op |
| DocRoutingBenchmark.routeDocsRange | 1 | 100 | avgt | 7432.15 | ns/op |
| DocRoutingBenchmark.routeDocsRange | 1 | 1000 | avgt | 82045.71 | ns/op |
| DocRoutingBenchmark.routeDocsRange | 1 | 10000 | avgt | 888808.969 | ns/op |
| DocRoutingBenchmark.routeDocsRange | 1 | 100000 | avgt | 9252085.75 | ns/op |
| DocRoutingBenchmark.routeDocsRange | 2 | 1 | avgt | 488.046 | ns/op |
| DocRoutingBenchmark.routeDocsRange | 2 | 100 | avgt | 7348.704 | ns/op |
| DocRoutingBenchmark.routeDocsRange | 2 | 1000 | avgt | 86551.268 | ns/op |
| DocRoutingBenchmark.routeDocsRange | 2 | 10000 | avgt | 919804.381 | ns/op |
| DocRoutingBenchmark.routeDocsRange | 2 | 100000 | avgt | 9889578.54 | ns/op |
| DocRoutingBenchmark.routeDocsRange | 3 | 1 | avgt | 488.343 | ns/op |
| DocRoutingBenchmark.routeDocsRange | 3 | 100 | avgt | 7621.063 | ns/op |
| DocRoutingBenchmark.routeDocsRange | 3 | 1000 | avgt | 95474.99 | ns/op |
| DocRoutingBenchmark.routeDocsRange | 3 | 10000 | avgt | 1025523.67 | ns/op |
| DocRoutingBenchmark.routeDocsRange | 3 | 100000 | avgt | 10797462.1 | ns/op |
| DocRoutingBenchmark.routeDocsRange | 4 | 1 | avgt | 494.31 | ns/op |
| DocRoutingBenchmark.routeDocsRange | 4 | 100 | avgt | 8065.714 | ns/op |
| DocRoutingBenchmark.routeDocsRange | 4 | 1000 | avgt | 103151.209 | ns/op |
| DocRoutingBenchmark.routeDocsRange | 4 | 10000 | avgt | 1099497.06 | ns/op |
| DocRoutingBenchmark.routeDocsRange | 4 | 100000 | avgt | 11711143.9 | ns/op |
| DocRoutingBenchmark.routeDocsRange | 5 | 1 | avgt | 488.817 | ns/op |
| DocRoutingBenchmark.routeDocsRange | 5 | 100 | avgt | 8279.029 | ns/op |
| DocRoutingBenchmark.routeDocsRange | 5 | 1000 | avgt | 109389.782 | ns/op |
| DocRoutingBenchmark.routeDocsRange | 5 | 10000 | avgt | 1183312.2 | ns/op |
| DocRoutingBenchmark.routeDocsRange | 5 | 100000 | avgt | 12082835.5 | ns/op |
| DocRoutingBenchmark.routeDocsRange | 6 | 1 | avgt | 502.501 | ns/op |
| DocRoutingBenchmark.routeDocsRange | 6 | 100 | avgt | 8583.495 | ns/op |
| DocRoutingBenchmark.routeDocsRange | 6 | 1000 | avgt | 118236.136 | ns/op |
| DocRoutingBenchmark.routeDocsRange | 6 | 10000 | avgt | 1252016.38 | ns/op |
| DocRoutingBenchmark.routeDocsRange | 6 | 100000 | avgt | 13199143.9 | ns/op |
| DocRoutingBenchmark.routeDocsRange | 7 | 1 | avgt | 497.333 | ns/op |
| DocRoutingBenchmark.routeDocsRange | 7 | 100 | avgt | 9107.973 | ns/op |
| DocRoutingBenchmark.routeDocsRange | 7 | 1000 | avgt | 125925.531 | ns/op |
| DocRoutingBenchmark.routeDocsRange | 7 | 10000 | avgt | 1330136.39 | ns/op |
| DocRoutingBenchmark.routeDocsRange | 7 | 100000 | avgt | 13872079.1 | ns/op |
| DocRoutingBenchmark.routeDocsRange | 8 | 1 | avgt | 508.262 | ns/op |
| DocRoutingBenchmark.routeDocsRange | 8 | 100 | avgt | 9511.286 | ns/op |
| DocRoutingBenchmark.routeDocsRange | 8 | 1000 | avgt | 139429.952 | ns/op |
| DocRoutingBenchmark.routeDocsRange | 8 | 10000 | avgt | 1453068.55 | ns/op |
| DocRoutingBenchmark.routeDocsRange | 8 | 100000 | avgt | 15055794.7 | ns/op |
| DocRoutingBenchmark.routeDocsRange | 9 | 1 | avgt | 510.916 | ns/op |
| DocRoutingBenchmark.routeDocsRange | 9 | 100 | avgt | 10152.875 | ns/op |
| DocRoutingBenchmark.routeDocsRange | 9 | 1000 | avgt | 150015.975 | ns/op |
| DocRoutingBenchmark.routeDocsRange | 9 | 10000 | avgt | 1602086.52 | ns/op |
| DocRoutingBenchmark.routeDocsRange | 9 | 100000 | avgt | 16859985.8 | ns/op |
| DocRoutingBenchmark.routeDocsRecurring | 0 | 1 | avgt | 481.096 | ns/op |
| DocRoutingBenchmark.routeDocsRecurring | 0 | 100 | avgt | 6915.725 | ns/op |
| DocRoutingBenchmark.routeDocsRecurring | 0 | 1000 | avgt | 71412.345 | ns/op |
| DocRoutingBenchmark.routeDocsRecurring | 0 | 10000 | avgt | 726282.631 | ns/op |
| DocRoutingBenchmark.routeDocsRecurring | 0 | 100000 | avgt | 8129245.07 | ns/op |
| DocRoutingBenchmark.routeDocsRecurring | 1 | 1 | avgt | 504.045 | ns/op |
| DocRoutingBenchmark.routeDocsRecurring | 1 | 100 | avgt | 9180.867 | ns/op |
| DocRoutingBenchmark.routeDocsRecurring | 1 | 1000 | avgt | 93815.905 | ns/op |
| DocRoutingBenchmark.routeDocsRecurring | 1 | 10000 | avgt | 973699.552 | ns/op |
| DocRoutingBenchmark.routeDocsRecurring | 1 | 100000 | avgt | 10224791.9 | ns/op |
| DocRoutingBenchmark.routeDocsRecurring | 2 | 1 | avgt | 530.676 | ns/op |
| DocRoutingBenchmark.routeDocsRecurring | 2 | 100 | avgt | 11601.343 | ns/op |
| DocRoutingBenchmark.routeDocsRecurring | 2 | 1000 | avgt | 117090.87 | ns/op |
| DocRoutingBenchmark.routeDocsRecurring | 2 | 10000 | avgt | 1199006.97 | ns/op |
| DocRoutingBenchmark.routeDocsRecurring | 2 | 100000 | avgt | 12630158.5 | ns/op |
| DocRoutingBenchmark.routeDocsRecurring | 3 | 1 | avgt | 554.57 | ns/op |
| DocRoutingBenchmark.routeDocsRecurring | 3 | 100 | avgt | 14004.462 | ns/op |
| DocRoutingBenchmark.routeDocsRecurring | 3 | 1000 | avgt | 141144.605 | ns/op |
| DocRoutingBenchmark.routeDocsRecurring | 3 | 10000 | avgt | 1433050.55 | ns/op |
| DocRoutingBenchmark.routeDocsRecurring | 3 | 100000 | avgt | 15022877.8 | ns/op |
| DocRoutingBenchmark.routeDocsRecurring | 4 | 1 | avgt | 585.075 | ns/op |
| DocRoutingBenchmark.routeDocsRecurring | 4 | 100 | avgt | 16373.323 | ns/op |
| DocRoutingBenchmark.routeDocsRecurring | 4 | 1000 | avgt | 165582.599 | ns/op |
| DocRoutingBenchmark.routeDocsRecurring | 4 | 10000 | avgt | 1682310.52 | ns/op |
| DocRoutingBenchmark.routeDocsRecurring | 4 | 100000 | avgt | 17376692 | ns/op |
| DocRoutingBenchmark.routeDocsRecurring | 5 | 1 | avgt | 602.198 | ns/op |
| DocRoutingBenchmark.routeDocsRecurring | 5 | 100 | avgt | 18851.1 | ns/op |
| DocRoutingBenchmark.routeDocsRecurring | 5 | 1000 | avgt | 191034.424 | ns/op |
| DocRoutingBenchmark.routeDocsRecurring | 5 | 10000 | avgt | 1932252.88 | ns/op |
| DocRoutingBenchmark.routeDocsRecurring | 5 | 100000 | avgt | 19961205.4 | ns/op |
| DocRoutingBenchmark.routeDocsRecurring | 6 | 1 | avgt | 629.351 | ns/op |
| DocRoutingBenchmark.routeDocsRecurring | 6 | 100 | avgt | 21364.155 | ns/op |
| DocRoutingBenchmark.routeDocsRecurring | 6 | 1000 | avgt | 215418.63 | ns/op |
| DocRoutingBenchmark.routeDocsRecurring | 6 | 10000 | avgt | 2187030.87 | ns/op |
| DocRoutingBenchmark.routeDocsRecurring | 6 | 100000 | avgt | 22456628.9 | ns/op |
| DocRoutingBenchmark.routeDocsRecurring | 7 | 1 | avgt | 650.714 | ns/op |
| DocRoutingBenchmark.routeDocsRecurring | 7 | 100 | avgt | 23930.799 | ns/op |
| DocRoutingBenchmark.routeDocsRecurring | 7 | 1000 | avgt | 241918.217 | ns/op |
| DocRoutingBenchmark.routeDocsRecurring | 7 | 10000 | avgt | 2421512.91 | ns/op |
| DocRoutingBenchmark.routeDocsRecurring | 7 | 100000 | avgt | 24830535.8 | ns/op |
| DocRoutingBenchmark.routeDocsRecurring | 8 | 1 | avgt | 685.33 | ns/op |
| DocRoutingBenchmark.routeDocsRecurring | 8 | 100 | avgt | 26572.623 | ns/op |
| DocRoutingBenchmark.routeDocsRecurring | 8 | 1000 | avgt | 268137.903 | ns/op |
| DocRoutingBenchmark.routeDocsRecurring | 8 | 10000 | avgt | 2706184.6 | ns/op |
| DocRoutingBenchmark.routeDocsRecurring | 8 | 100000 | avgt | 27532240.3 | ns/op |
| DocRoutingBenchmark.routeDocsRecurring | 9 | 1 | avgt | 705.721 | ns/op |
| DocRoutingBenchmark.routeDocsRecurring | 9 | 100 | avgt | 29392.19 | ns/op |
| DocRoutingBenchmark.routeDocsRecurring | 9 | 1000 | avgt | 293422.558 | ns/op |
| DocRoutingBenchmark.routeDocsRecurring | 9 | 10000 | avgt | 2952632.51 | ns/op |
| DocRoutingBenchmark.routeDocsRecurring | 9 | 100000 | avgt | 30346605.2 | ns/op |

Is hash space distribution uniform for child shards?

To assess the quality of distribution of keys among shards we define the hash distribution quality as

$$\sum_{j=0}^{m-1} \frac{p_j\left(p_j+1\right) / 2}{(n / 2 m)(n+2 m-1)}$$

where $p_j$ is the number of doc ids in the j-th primary shard, m is the number of shards, and n is the total number of doc ids. The sum of $p_j(p_j + 1)/2$ estimates the number of shards that need to be visited to find a specific doc id. The denominator $(n/2m)(n + 2m - 1)$ is the number of visited slots for an ideal function that puts each doc id into a random shard. So, if the function is ideal, the formula gives 1. In reality, a good distribution is somewhere between 0.95 and 1.05.
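A small Java sketch of how this metric can be computed from per-shard document counts is shown below; it is an illustrative helper matching the formula above, not a utility that exists in OpenSearch.

```java
// Computes the hash distribution quality described above from per-shard doc counts.
// A value close to 1.0 indicates a near-ideal (random-like) distribution.
public final class HashQuality {

    public static double quality(long[] docsPerShard) {
        int m = docsPerShard.length;          // number of shards
        long n = 0;                           // total number of doc ids
        for (long p : docsPerShard) {
            n += p;
        }
        double numerator = 0.0;
        for (long p : docsPerShard) {
            numerator += p * (p + 1) / 2.0;   // sum of p_j * (p_j + 1) / 2
        }
        double denominator = (n / (2.0 * m)) * (n + 2.0 * m - 1);
        return numerator / denominator;
    }

    public static void main(String[] args) {
        // Example: 100k docs spread almost evenly across 10 shards.
        long[] counts = {10010, 9990, 10005, 9995, 10020, 9980, 10003, 9997, 10008, 9992};
        System.out.println(quality(counts));  // close to 1.0
    }
}
```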

Simple benchmark

Baseline: 5 primary shards with no split, 100k keys using the existing OpenSearch routing
Candidate 1: 5 primary shards each split by a factor of 2, 100k keys using Approach 1
Candidate 2: 5 primary shards each split by a factor of 2, 100k keys using Approach 2

hash quality of baseline: 1.02
hash quality of candidate 1: 1.03
hash quality of candidate 2: 1.03

All possible combinations of splits from n=1 to n=2^6 shards showed that the average hash quality across all combinations with 10k keys was `1.13` with Approach 1 and `1.12` with Approach 2. Hence, the key space is distributed uniformly across the split shards with both approaches!

Related component

Indexing:Replication

Thanks @vikasvb90 for all the help in refining the approaches as well as testing and benchmarking them

@peternied
Member

[Triage]
@CaptainDredge Thanks for creating this detailed proposal
