
[RFC] Design Proposal routing algorithm for splitting shards In-place #13925

Open

CaptainDredge opened this issue Jun 3, 2024 · 1 comment

Labels: Indexing:Replication, Indexing, RFC, Roadmap:Cost/Performance/Scale, Scale

CaptainDredge (Contributor) commented on Jun 3, 2024
Introduction

In RFC #12918, we discussed potential approaches for splitting a shard in-place and requested feedback on them, as well as on use cases which may need to be additionally addressed while designing in-place shard split. We published the design for splitting a shard in-place in issue #13923, where we raised the need for devising a mechanism to route operations on child shards and explained the operation routing briefly. In this issue, we will talk about how operations are routed in OpenSearch, delve into different operation routing approaches for in-place shard splitting, discuss their pros and cons, and finally add benchmark results comparing the performance of each approach.

Operation Routing In OpenSearch

Currently, the document routing algorithm in OpenSearch is governed by the following formulae:

$$routing\_factor(RF) = \frac{num\_routing\_shards(RS)}{num\_primary\_shards(P)}$$

$$shard\_num = \frac{hash(\_routing) \% num\_routing\_shards(RS)}{routing\_factor(RF)}$$

Here, shard_num is the effective routing value, derived from the document’s _id, which is termed here as _routing. Custom routing patterns can be implemented by specifying a custom routing value per document.

Let’s try to understand the algorithm visually

[image: visual illustration of the routing algorithm over the document hash space]

Our hash function, murmur hash, has a practical key range from $0$ to $2^{31}-1$, but we restrict the range to 1024 partitions by default, i.e. the default upper limit on the number of routing shards, so that if we divide our hash space into 1024 parts, each part can be assigned to a single shard.

When we create P primary shards, the document hash space needs to be divided into equal parts. The default upper limit on the number of routing shards, i.e. 1024, may not be divisible by P, which could result in an unequal hash space distribution.
The value 1024 is nothing but the default number of partitions of the document hash space, which is frozen during index creation. A shard of an index is assigned a subset of partitions. For example, if an index is created with 8 shards, then each shard consists of 128 partitions. Any scaling of shards from x to y in resize actions such as shrink index and split index happens based on the number of partitions. For uniform distribution of the hash space, all shards should have an equal number of partitions; otherwise it would result in a collision of keys by the pigeonhole principle. Therefore, the total number of partitions is the highest multiple of P less than or equal to 1024. For example, for an index consisting of 5 shards, the number of partitions assigned to each shard would be 204. This total is called the number of routing shards of an index.
In resize actions of OpenSearch, such as splitting an index, where a new index is created and recovered from an existing index, the number of partitions is recalculated. As we saw earlier, partitions should always be equally distributed among shards, so the number of shards of the new index can only be a number exactly divisible by the number of shards of the existing index. In addition, the new number of shards should be less than the number of routing shards of the existing index. The mathematical relation between at most n splits of a shard of an index having P primary shards and an upper limit of routing shards equal to 1024 can be expressed as $2^n*P < 1024$.

Similarly, the number of partitions assigned to each shard can be expressed as: $num\_routing\_shards(RS)/num\_primary\_shards(P) = (2^n*P)/P = 2^n$

This is called the $routing\_factor(RF) = 2^n$

Once we have RS and RF, it is easy to figure out the shard number, i.e. the shard in which the hash of the routing value should lie, given by $shard\_id = \frac{hash(\_routing) \% RS}{RF}$
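For concreteness, here is a minimal sketch of the formula above in Java. This is illustrative only, not the actual OpenSearch implementation; the murmur3 hash of _routing is assumed to be computed elsewhere and passed in.

```java
// Minimal sketch of the current routing formula. Illustrative only,
// not the actual OpenSearch implementation.
public final class LegacyRouting {

    /**
     * @param hash             murmur3 hash of the _routing value (assumed precomputed)
     * @param numRoutingShards RS, frozen at index creation
     * @param numPrimaryShards P
     * @return the shard number the document is routed to
     */
    public static int shardNum(int hash, int numRoutingShards, int numPrimaryShards) {
        int routingFactor = numRoutingShards / numPrimaryShards;   // RF = RS / P
        int partition = Math.floorMod(hash, numRoutingShards);     // hash(_routing) % RS, kept non-negative
        return partition / routingFactor;                          // shard_num = partition / RF
    }

    public static void main(String[] args) {
        // Example: P = 8, RS = 1024 => RF = 128; a hash of 300 lands in partition 300, i.e. shard 300 / 128 = 2.
        System.out.println(shardNum(300, 1024, 8)); // prints 2
    }
}
```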

Describe the solution you'd like

Operation routing design requirements for in-place shard-split:

  1. There shouldn’t be any data loss before, during, or after a shard is split
  2. The key space (hash range) which the parent shard serves should be split/scaled such that it is uniformly distributed among the new shard(s) and the division is mutually exclusive

Approaches

We will go over two approaches and discuss how each of them can effectively route an operation on child shards.

Approach 1 - Routing using Shard Hash space [Recommended]

In this approach, each seed shard (a shard present at the time of index creation) is assigned a hash space. We use the murmurhash3_x86_32 algorithm to calculate the hash of _routing before deducing the respective routing shard, which means that hash values can only lie within the 0x00000000 - 0xffffffff range (a 32-bit value). Therefore, each seed shard is assigned a hash space from 0 to 2^32 - 1. Whenever a shard split happens, its hash space gets divided equally between the child shards. Let’s consider a mental model where shards are arranged according to the start and end values of their acceptable hash values.

If a 1:4 split is performed on a primary shard, the resulting key space partition will look like:

[image: parent shard hash space divided equally among child shards 0–3]

  • In this case, all docs whose hash falls between 0x00000000 - 0x3FFFFFFF will get routed to child shard 0, hashes between 0x40000000 - 0x7FFFFFFF will get routed to child shard 1, and so on.

We can further split any of the child shards. For example, say we split child shard 2 above into two shards; then the hash space partition will look like:

[image: hash space after further splitting child shard 2 into two new shards]

  • In this further split, child shard 2 will cease to exist, making way for shards 4 and 5, which will serve the ranges 0x80000000 - 0x9FFFFFFF and 0xA0000000 - 0xBFFFFFFF respectively. The “high” and “low” values of these ranges need to be saved in the cluster state.
  • Note that the high value of each primary shard's key space is always increasing, so one optimization we'll perform is binary searching over these values to efficiently get the routing_shard. The necessary information needs to be pre-computed and stored in the cluster state as well; otherwise the cost of constructing it on each request would gravely outweigh the benefits of binary search.

Mathematically, the routing algorithm can be defined as follows:

$routing\_factor(RF) = \frac{num\_routing\_shards(RS)}{num\_primary\_shards(P)}$
$seed\_shard\_id = \frac{hash(\_routing) \% num\_routing\_shards(RS)}{routing\_factor(RF)}$

Note: Each seed shard maintains its own hash space and the distribution of that hash space among its child shards

A hash space is defined as $Range_i = [l_i, r_i)$
Each shard has its own hash range, and ranges are comparable, i.e. $Range_i < Range_j \iff l_i < l_j$
$seed\_shard\_ranges = Sorted\{Range_{child\_shard_1}, Range_{child\_shard_2}, \ldots\}$

Objective: To find the hash range which contains our hash value.
Since shard ranges are sorted, we can find the range whose starting point is just less than or equal to the hash value using binary search.
We define a function $ceil(\{x_1, x_2, x_3, \ldots\}, x)$ which outputs the index $i$ such that $x_i \in \{x_1, x_2, x_3, \ldots\}$ is the largest value less than or equal to $x$; it is based on binary search with a complexity of $O(\log_2(len(\{x_1, x_2, x_3, \ldots\})))$

$$shard\_id = ceil(\{ Range_{child\_shard_1}, Range_{child\_shard_2}, ... \}, hash(\_routing))$$
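To make the lookup concrete, here is a minimal Java sketch of the per-seed-shard range lookup, where a TreeMap keyed by each range's lower bound plays the role of the ceil/binary search above. Class and method names, and the unsigned-hash handling, are illustrative assumptions, not the actual implementation.

```java
import java.util.TreeMap;

// Sketch of Approach 1's lookup: each seed shard keeps a sorted structure of
// child-shard ranges keyed by the range's lower bound; routing a document is a
// floor lookup on the hash. Names are illustrative, not actual OpenSearch classes.
public final class HashRangeRouter {

    // Lower bound of each child shard's range -> child shard id.
    // Hashes are treated as unsigned 32-bit values held in a long (assumption).
    private final TreeMap<Long, Integer> rangeStartToShard = new TreeMap<>();

    public void assignRange(long inclusiveStart, int shardId) {
        rangeStartToShard.put(inclusiveStart, shardId);
    }

    /** Routes a 32-bit hash to the child shard whose range contains it. */
    public int route(int hash) {
        long unsignedHash = Integer.toUnsignedLong(hash);
        // floorEntry is the binary search: greatest range start <= hash.
        return rangeStartToShard.floorEntry(unsignedHash).getValue();
    }

    public static void main(String[] args) {
        // 1:4 split of a seed shard, as in the figure above.
        HashRangeRouter router = new HashRangeRouter();
        router.assignRange(0x00000000L, 0);
        router.assignRange(0x40000000L, 1);
        router.assignRange(0x80000000L, 2);
        router.assignRange(0xC0000000L, 3);
        System.out.println(router.route(0x7FFFFFFF)); // falls in shard 1
        System.out.println(router.route(0x90000000)); // unsigned value falls in shard 2
    }
}
```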

Advantages:

  • Does not limit the number of times a shard can be split in an existing index
  • Easier to understand and maintain
  • Faster if the depth of splits is higher than log base 2 of the number of shards. For example, if a shard gets split 3 times and at each level 2 child shards are produced, this approach performs better than Approach 2, since log base 2 of 4 shards, i.e. 2, is less than the depth, i.e. 3.

Disadvantages:

  • Performs slower if the depth of splits is lower than log base 2 of the number of shards.

Implementation details:

In OpenSearch we'll maintain the key space for each shard as SplitMetadata, which keeps a map of shard id to the range assigned to that shard. SplitMetadata also contains a flat sorted set of key space ranges assigned to each split shard per seed shard, which makes looking up the shard id that a hash belongs to a much more efficient operation via binary search. We keep track of only those shards which have been split or were created as a result of a split in SplitMetadata, so in the case of no split there is no overhead increase in index metadata. During a shard split, i.e. while recovery is happening, we don't keep the child shard key space ranges in the sorted set but maintain them in an ephemeral list of key space ranges in the parent shard's metadata. Once recovery is complete and the shard split is marked completed, we remove the ephemeral child shard entries and add their ranges to the maintained sorted set used for routing documents.
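A hypothetical sketch of that bookkeeping is shown below; the class and method names are illustrative only, not the actual SplitMetadata API, and it focuses on the promote-on-completion behaviour described above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Hypothetical sketch of split bookkeeping per seed shard: completed child
// ranges live in a sorted structure used for routing, while ranges of an
// in-progress split stay in an ephemeral list until recovery finishes.
// Names are illustrative, not the actual OpenSearch SplitMetadata.
final class SeedShardSplitMetadata {

    // Completed splits: range start -> child shard id (used for routing lookups).
    private final TreeMap<Long, Integer> completedRanges = new TreeMap<>();

    // In-progress split: child ranges are kept aside and ignored by routing.
    private final List<long[]> ephemeralChildRanges = new ArrayList<>();
    private final List<Integer> ephemeralChildShards = new ArrayList<>();

    void startSplit(long[][] childRanges, int[] childShardIds) {
        for (int i = 0; i < childShardIds.length; i++) {
            ephemeralChildRanges.add(childRanges[i]);   // each entry is {start, end}
            ephemeralChildShards.add(childShardIds[i]);
        }
    }

    // Called once recovery completes: promote ephemeral ranges into the routing set.
    void markSplitCompleted() {
        for (int i = 0; i < ephemeralChildShards.size(); i++) {
            completedRanges.put(ephemeralChildRanges.get(i)[0], ephemeralChildShards.get(i));
        }
        ephemeralChildRanges.clear();
        ephemeralChildShards.clear();
    }

    // Routing only ever sees completed ranges.
    int route(long unsignedHash) {
        return completedRanges.floorEntry(unsignedHash).getValue();
    }
}
```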

Approach 2 - “Recursive Routing”

Based on our current routing algorithm in OpenSearch, the routing terms are expressed as follows:

$P = num\_initial\_primary\_shards$
$n={floor(log_2(1024/P))}$
$RF = 2^n$
$RS=RF*P$
$hash\_value = hash(\_routing)$

Consider that shard $p_i$ gets split into $P_i$ shards; we'll calculate the new routing parameters, i.e. RS & RF, for the child shards:

$RF_i = 2^{n - log_2(P_i)}$
$RS_i = {RF_i}*P_i$

Now, to route a document, we'll first find out which of the initial shards the doc gets routed to, using the following equation. Assume the doc gets routed to shard $p_i$; then

$p_i = \frac{{hash\_value}\%{RS}}{RF}$

Since shard $p_i$ has been split into $P_i$ shards, we'll use the new routing parameters to calculate the actual shard ($p_{ij}$) to which the doc gets routed:

$p_{ij} = \frac{hash\_value\%RS_i}{RF_i}$

This pattern continues for further splits. Say shard $p_{ij}$ gets split into $P_j$ shards; then the new routing parameters are:

$RF_j = 2^{floor(log_2(\frac{RF_i}{P_j}))}$
$RS_j = {RF_j}*P_j$

At runtime the doc will get routed to shard $p_{ijk}$

$p_{ijk} = \frac{hash\_value\%RS_j}{RF_j}$

Base conditions:
If $RF_k == 1$ for any shard $p_k$, then we won't allow further splits of that shard.
Also, if some shard $p_k$ is a terminal shard, i.e. it hasn't been split, then the routing algorithm described above terminates at that shard.
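A sketch of this recursion is shown below, under the assumption of a hypothetical tree node per split shard that holds its own RS/RF; this is not the proposal's actual data structure.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of Approach 2: each split shard is a node that knows its own routing
// parameters (RS, RF) and its children; routing recurses until it reaches a
// terminal (never-split) shard. Hypothetical structure, illustrative only.
final class RecursiveRouter {

    static final class ShardNode {
        final int shardId;
        final int routingShards;   // RS at this level
        final int routingFactor;   // RF at this level
        // Child index within this split -> child node; empty map means terminal shard.
        final Map<Integer, ShardNode> children = new HashMap<>();

        ShardNode(int shardId, int routingShards, int routingFactor) {
            this.shardId = shardId;
            this.routingShards = routingShards;
            this.routingFactor = routingFactor;
        }
    }

    /** Recurse from the seed-shard level down to a terminal shard. */
    static int route(ShardNode node, int hash) {
        if (node.children.isEmpty()) {
            return node.shardId;   // terminal shard: the recursion stops here
        }
        // (hash % RS) / RF at this level picks the child to descend into.
        int childIndex = Math.floorMod(hash, node.routingShards) / node.routingFactor;
        return route(node.children.get(childIndex), hash);
    }
}
```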

High level visual representation
In the following figure, partitions are distributed evenly and the routing parameters are maintained at each level.

[image: split hierarchy with routing parameters (RS, RF) maintained at each level]

Advantages:

  • Faster in cases where the depth of splits is less than log base 2 of the number of shards.

Disadvantages:

  • Limits the number of times a shard can be split in an existing index created before the OpenSearch version that supports in-place shard split, since the number of routing shards of such an index is frozen at its default value of 1024
  • Performs slower if the depth of splits is higher than log base 2 of the number of shards.
  • Quite complex to understand and maintain.

Benchmarks

Setup:

HostType: m4.2xlarge
OS: Amazon Linux 2 x86_64
region: us-west-2
vCPUs: 8
Memory: 32Gb
Architecture: x86_64

Parameters:
  • Number of seed shards: 1
  • Number of splits per shard: 2
  • Depth, i.e. total number of split operations performed per seed shard - 1: [0, 10)
  • Number of documents routed (numDocs): [0, 100000]
  • Benchmark mode: avgt, i.e. average time taken per routing operation in nanoseconds
  • Number of threads used: 1

Approach 1 → DocRoutingBenchmark.routeDocsRange
Approach 2 → DocRoutingBenchmark.routeDocsRecurring

At a high level, we split a single shard repeatedly into two and compare latency degradation at different depths; the depth increases with the number of splits.
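For reference, a minimal JMH-style skeleton of how such a benchmark can be structured is shown below ("avgt" is JMH's average-time mode, reported here in nanoseconds per operation). This is illustrative only and not the actual DocRoutingBenchmark; the parameter names and the routing call are placeholders.

```java
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Param;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;

// Illustrative JMH skeleton for comparing the two routing approaches.
// Not the actual DocRoutingBenchmark from the proposal.
@State(Scope.Benchmark)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public class RoutingBenchmarkSketch {

    @Param({"0", "1", "2", "3", "4", "5", "6", "7", "8", "9"})
    public int depth;        // number of 1:2 splits applied to the single seed shard

    @Param({"1", "100", "1000", "10000", "100000"})
    public int numDocs;

    private String[] routingIds;

    @Setup
    public void setUp() {
        routingIds = new String[numDocs];
        for (int i = 0; i < numDocs; i++) {
            routingIds[i] = "doc-" + i;
        }
        // ...build the split tree / hash ranges for the configured depth here...
    }

    @Benchmark
    public int routeDocsRangeSketch() {
        int acc = 0;
        for (String id : routingIds) {
            acc += id.hashCode() & 0x7fffffff;   // placeholder for the actual range-based routing call
        }
        return acc;   // return a value so the JIT cannot eliminate the loop
    }
}
```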

Benchmark numbers for Approach 1 vs Approach 2:

| Benchmark | depth | numDocs | Mode | Score | Units |
|---|---|---|---|---|---|
| DocRoutingBenchmark.routeDocsRange | 0 | 1 | avgt | 484.304 | ns/op |
| DocRoutingBenchmark.routeDocsRange | 0 | 100 | avgt | 6922.807 | ns/op |
| DocRoutingBenchmark.routeDocsRange | 0 | 1000 | avgt | 74009.181 | ns/op |
| DocRoutingBenchmark.routeDocsRange | 0 | 10000 | avgt | 763362.939 | ns/op |
| DocRoutingBenchmark.routeDocsRange | 0 | 100000 | avgt | 8378460.97 | ns/op |
| DocRoutingBenchmark.routeDocsRange | 1 | 1 | avgt | 496.923 | ns/op |
| DocRoutingBenchmark.routeDocsRange | 1 | 100 | avgt | 7432.15 | ns/op |
| DocRoutingBenchmark.routeDocsRange | 1 | 1000 | avgt | 82045.71 | ns/op |
| DocRoutingBenchmark.routeDocsRange | 1 | 10000 | avgt | 888808.969 | ns/op |
| DocRoutingBenchmark.routeDocsRange | 1 | 100000 | avgt | 9252085.75 | ns/op |
| DocRoutingBenchmark.routeDocsRange | 2 | 1 | avgt | 488.046 | ns/op |
| DocRoutingBenchmark.routeDocsRange | 2 | 100 | avgt | 7348.704 | ns/op |
| DocRoutingBenchmark.routeDocsRange | 2 | 1000 | avgt | 86551.268 | ns/op |
| DocRoutingBenchmark.routeDocsRange | 2 | 10000 | avgt | 919804.381 | ns/op |
| DocRoutingBenchmark.routeDocsRange | 2 | 100000 | avgt | 9889578.54 | ns/op |
| DocRoutingBenchmark.routeDocsRange | 3 | 1 | avgt | 488.343 | ns/op |
| DocRoutingBenchmark.routeDocsRange | 3 | 100 | avgt | 7621.063 | ns/op |
| DocRoutingBenchmark.routeDocsRange | 3 | 1000 | avgt | 95474.99 | ns/op |
| DocRoutingBenchmark.routeDocsRange | 3 | 10000 | avgt | 1025523.67 | ns/op |
| DocRoutingBenchmark.routeDocsRange | 3 | 100000 | avgt | 10797462.1 | ns/op |
| DocRoutingBenchmark.routeDocsRange | 4 | 1 | avgt | 494.31 | ns/op |
| DocRoutingBenchmark.routeDocsRange | 4 | 100 | avgt | 8065.714 | ns/op |
| DocRoutingBenchmark.routeDocsRange | 4 | 1000 | avgt | 103151.209 | ns/op |
| DocRoutingBenchmark.routeDocsRange | 4 | 10000 | avgt | 1099497.06 | ns/op |
| DocRoutingBenchmark.routeDocsRange | 4 | 100000 | avgt | 11711143.9 | ns/op |
| DocRoutingBenchmark.routeDocsRange | 5 | 1 | avgt | 488.817 | ns/op |
| DocRoutingBenchmark.routeDocsRange | 5 | 100 | avgt | 8279.029 | ns/op |
| DocRoutingBenchmark.routeDocsRange | 5 | 1000 | avgt | 109389.782 | ns/op |
| DocRoutingBenchmark.routeDocsRange | 5 | 10000 | avgt | 1183312.2 | ns/op |
| DocRoutingBenchmark.routeDocsRange | 5 | 100000 | avgt | 12082835.5 | ns/op |
| DocRoutingBenchmark.routeDocsRange | 6 | 1 | avgt | 502.501 | ns/op |
| DocRoutingBenchmark.routeDocsRange | 6 | 100 | avgt | 8583.495 | ns/op |
| DocRoutingBenchmark.routeDocsRange | 6 | 1000 | avgt | 118236.136 | ns/op |
| DocRoutingBenchmark.routeDocsRange | 6 | 10000 | avgt | 1252016.38 | ns/op |
| DocRoutingBenchmark.routeDocsRange | 6 | 100000 | avgt | 13199143.9 | ns/op |
| DocRoutingBenchmark.routeDocsRange | 7 | 1 | avgt | 497.333 | ns/op |
| DocRoutingBenchmark.routeDocsRange | 7 | 100 | avgt | 9107.973 | ns/op |
| DocRoutingBenchmark.routeDocsRange | 7 | 1000 | avgt | 125925.531 | ns/op |
| DocRoutingBenchmark.routeDocsRange | 7 | 10000 | avgt | 1330136.39 | ns/op |
| DocRoutingBenchmark.routeDocsRange | 7 | 100000 | avgt | 13872079.1 | ns/op |
| DocRoutingBenchmark.routeDocsRange | 8 | 1 | avgt | 508.262 | ns/op |
| DocRoutingBenchmark.routeDocsRange | 8 | 100 | avgt | 9511.286 | ns/op |
| DocRoutingBenchmark.routeDocsRange | 8 | 1000 | avgt | 139429.952 | ns/op |
| DocRoutingBenchmark.routeDocsRange | 8 | 10000 | avgt | 1453068.55 | ns/op |
| DocRoutingBenchmark.routeDocsRange | 8 | 100000 | avgt | 15055794.7 | ns/op |
| DocRoutingBenchmark.routeDocsRange | 9 | 1 | avgt | 510.916 | ns/op |
| DocRoutingBenchmark.routeDocsRange | 9 | 100 | avgt | 10152.875 | ns/op |
| DocRoutingBenchmark.routeDocsRange | 9 | 1000 | avgt | 150015.975 | ns/op |
| DocRoutingBenchmark.routeDocsRange | 9 | 10000 | avgt | 1602086.52 | ns/op |
| DocRoutingBenchmark.routeDocsRange | 9 | 100000 | avgt | 16859985.8 | ns/op |
| DocRoutingBenchmark.routeDocsRecurring | 0 | 1 | avgt | 481.096 | ns/op |
| DocRoutingBenchmark.routeDocsRecurring | 0 | 100 | avgt | 6915.725 | ns/op |
| DocRoutingBenchmark.routeDocsRecurring | 0 | 1000 | avgt | 71412.345 | ns/op |
| DocRoutingBenchmark.routeDocsRecurring | 0 | 10000 | avgt | 726282.631 | ns/op |
| DocRoutingBenchmark.routeDocsRecurring | 0 | 100000 | avgt | 8129245.07 | ns/op |
| DocRoutingBenchmark.routeDocsRecurring | 1 | 1 | avgt | 504.045 | ns/op |
| DocRoutingBenchmark.routeDocsRecurring | 1 | 100 | avgt | 9180.867 | ns/op |
| DocRoutingBenchmark.routeDocsRecurring | 1 | 1000 | avgt | 93815.905 | ns/op |
| DocRoutingBenchmark.routeDocsRecurring | 1 | 10000 | avgt | 973699.552 | ns/op |
| DocRoutingBenchmark.routeDocsRecurring | 1 | 100000 | avgt | 10224791.9 | ns/op |
| DocRoutingBenchmark.routeDocsRecurring | 2 | 1 | avgt | 530.676 | ns/op |
| DocRoutingBenchmark.routeDocsRecurring | 2 | 100 | avgt | 11601.343 | ns/op |
| DocRoutingBenchmark.routeDocsRecurring | 2 | 1000 | avgt | 117090.87 | ns/op |
| DocRoutingBenchmark.routeDocsRecurring | 2 | 10000 | avgt | 1199006.97 | ns/op |
| DocRoutingBenchmark.routeDocsRecurring | 2 | 100000 | avgt | 12630158.5 | ns/op |
| DocRoutingBenchmark.routeDocsRecurring | 3 | 1 | avgt | 554.57 | ns/op |
| DocRoutingBenchmark.routeDocsRecurring | 3 | 100 | avgt | 14004.462 | ns/op |
| DocRoutingBenchmark.routeDocsRecurring | 3 | 1000 | avgt | 141144.605 | ns/op |
| DocRoutingBenchmark.routeDocsRecurring | 3 | 10000 | avgt | 1433050.55 | ns/op |
| DocRoutingBenchmark.routeDocsRecurring | 3 | 100000 | avgt | 15022877.8 | ns/op |
| DocRoutingBenchmark.routeDocsRecurring | 4 | 1 | avgt | 585.075 | ns/op |
| DocRoutingBenchmark.routeDocsRecurring | 4 | 100 | avgt | 16373.323 | ns/op |
| DocRoutingBenchmark.routeDocsRecurring | 4 | 1000 | avgt | 165582.599 | ns/op |
| DocRoutingBenchmark.routeDocsRecurring | 4 | 10000 | avgt | 1682310.52 | ns/op |
| DocRoutingBenchmark.routeDocsRecurring | 4 | 100000 | avgt | 17376692 | ns/op |
| DocRoutingBenchmark.routeDocsRecurring | 5 | 1 | avgt | 602.198 | ns/op |
| DocRoutingBenchmark.routeDocsRecurring | 5 | 100 | avgt | 18851.1 | ns/op |
| DocRoutingBenchmark.routeDocsRecurring | 5 | 1000 | avgt | 191034.424 | ns/op |
| DocRoutingBenchmark.routeDocsRecurring | 5 | 10000 | avgt | 1932252.88 | ns/op |
| DocRoutingBenchmark.routeDocsRecurring | 5 | 100000 | avgt | 19961205.4 | ns/op |
| DocRoutingBenchmark.routeDocsRecurring | 6 | 1 | avgt | 629.351 | ns/op |
| DocRoutingBenchmark.routeDocsRecurring | 6 | 100 | avgt | 21364.155 | ns/op |
| DocRoutingBenchmark.routeDocsRecurring | 6 | 1000 | avgt | 215418.63 | ns/op |
| DocRoutingBenchmark.routeDocsRecurring | 6 | 10000 | avgt | 2187030.87 | ns/op |
| DocRoutingBenchmark.routeDocsRecurring | 6 | 100000 | avgt | 22456628.9 | ns/op |
| DocRoutingBenchmark.routeDocsRecurring | 7 | 1 | avgt | 650.714 | ns/op |
| DocRoutingBenchmark.routeDocsRecurring | 7 | 100 | avgt | 23930.799 | ns/op |
| DocRoutingBenchmark.routeDocsRecurring | 7 | 1000 | avgt | 241918.217 | ns/op |
| DocRoutingBenchmark.routeDocsRecurring | 7 | 10000 | avgt | 2421512.91 | ns/op |
| DocRoutingBenchmark.routeDocsRecurring | 7 | 100000 | avgt | 24830535.8 | ns/op |
| DocRoutingBenchmark.routeDocsRecurring | 8 | 1 | avgt | 685.33 | ns/op |
| DocRoutingBenchmark.routeDocsRecurring | 8 | 100 | avgt | 26572.623 | ns/op |
| DocRoutingBenchmark.routeDocsRecurring | 8 | 1000 | avgt | 268137.903 | ns/op |
| DocRoutingBenchmark.routeDocsRecurring | 8 | 10000 | avgt | 2706184.6 | ns/op |
| DocRoutingBenchmark.routeDocsRecurring | 8 | 100000 | avgt | 27532240.3 | ns/op |
| DocRoutingBenchmark.routeDocsRecurring | 9 | 1 | avgt | 705.721 | ns/op |
| DocRoutingBenchmark.routeDocsRecurring | 9 | 100 | avgt | 29392.19 | ns/op |
| DocRoutingBenchmark.routeDocsRecurring | 9 | 1000 | avgt | 293422.558 | ns/op |
| DocRoutingBenchmark.routeDocsRecurring | 9 | 10000 | avgt | 2952632.51 | ns/op |
| DocRoutingBenchmark.routeDocsRecurring | 9 | 100000 | avgt | 30346605.2 | ns/op |

Is hash space distribution uniform for child shards?

To assess the quality of distribution of keys among shards we define the hash distribution quality as

$$\sum_{j=0}^{m-1} \frac{p_j\left(p_j+1\right) / 2}{(n / 2 m)(n+2 m-1)}$$

where $p_j$ is the number of doc ids in the j-th primary shard, m is the number of shards, and n is the total number of doc ids. The sum of $p_j(p_j + 1)/2$ estimates the number of shards that need to be visited to find a specific doc id. The denominator $(n/2m)(n + 2m - 1)$ is the number of visited slots for an ideal function that puts each doc id into a random shard. So, if the function is ideal, the formula gives 1. In reality, a good distribution is somewhere between 0.95 and 1.05.
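A small Java sketch of how this metric can be computed from per-shard document counts is shown below; it is an illustrative helper matching the formula above, not a utility that exists in OpenSearch.

```java
// Computes the hash distribution quality described above from per-shard doc counts.
// A value close to 1.0 indicates a near-ideal (random-like) distribution.
public final class HashQuality {

    public static double quality(long[] docsPerShard) {
        int m = docsPerShard.length;          // number of shards
        long n = 0;                           // total number of doc ids
        for (long p : docsPerShard) {
            n += p;
        }
        double numerator = 0.0;
        for (long p : docsPerShard) {
            numerator += p * (p + 1) / 2.0;   // sum of p_j * (p_j + 1) / 2
        }
        double denominator = (n / (2.0 * m)) * (n + 2.0 * m - 1);
        return numerator / denominator;
    }

    public static void main(String[] args) {
        // Example: 100k docs spread almost evenly across 10 shards.
        long[] counts = {10010, 9990, 10005, 9995, 10020, 9980, 10003, 9997, 10008, 9992};
        System.out.println(quality(counts));  // close to 1.0
    }
}
```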

Simple benchmark

Baseline: 5 primary shards with no split, 100k keys using the existing OpenSearch routing
Candidate 1: 5 primary shards each split by a factor of 2, 100k keys using Approach 1
Candidate 2: 5 primary shards each split by a factor of 2, 100k keys using Approach 2

hash quality of baseline: 1.02
hash quality of candidate 1: 1.03
hash quality of candidate 2: 1.03

All possible combinations of splits from n=1 to n=2^6 shards showed that the average hash quality across all combinations with 10k keys was `1.13` with Approach 1 and `1.12` with Approach 2. Hence, the key space is distributed uniformly across the split shards with both approaches!

Related component

Indexing:Replication

Thanks @vikasvb90 for all the help in refining the approaches as well as testing and benchmarking them

@peternied
Member

[Triage]
@CaptainDredge Thanks for creating this detailed proposal
