Add cluster primary balance contraint for rebalancing with buffer #12656
Signed-off-by: Arpit Bandejiya <[email protected]>
❌ Gradle check result for 8aed71b: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
Compatibility status:
Checks if related components are compatible with change 9f94ba5
Incompatible components
Skipped components
Compatible components
Compatible components: [https://github.com/opensearch-project/asynchronous-search.git, https://github.com/opensearch-project/custom-codecs.git, https://github.com/opensearch-project/performance-analyzer-rca.git, https://github.com/opensearch-project/flow-framework.git, https://github.com/opensearch-project/cross-cluster-replication.git, https://github.com/opensearch-project/reporting.git, https://github.com/opensearch-project/job-scheduler.git, https://github.com/opensearch-project/security.git, https://github.com/opensearch-project/geospatial.git, https://github.com/opensearch-project/opensearch-oci-object-storage.git, https://github.com/opensearch-project/common-utils.git, https://github.com/opensearch-project/k-nn.git, https://github.com/opensearch-project/neural-search.git, https://github.com/opensearch-project/anomaly-detection.git, https://github.com/opensearch-project/security-analytics.git, https://github.com/opensearch-project/performance-analyzer.git, https://github.com/opensearch-project/notifications.git, https://github.com/opensearch-project/ml-commons.git, https://github.com/opensearch-project/observability.git, https://github.com/opensearch-project/index-management.git, https://github.com/opensearch-project/alerting.git, https://github.com/opensearch-project/sql.git]
❌ Gradle check result for a2eaddd: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
❌ Gradle check result for 761adc3: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
❌ Gradle check result for f842b41: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
❌ Gradle check result for 938ac4d: Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
❌ Gradle check result for ece3b26: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
❌ Gradle check result for 80dbcb9: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
❌ Gradle check result for 3d4d865: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
❌ Gradle check result for d7657af: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
❌ Gradle check result for 566c7ef: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
Failed test:
#11933 --> The above test is flaky
server/src/test/java/org/opensearch/cluster/routing/allocation/BalanceConfigurationTests.java
I have left some minor comments; LGTM otherwise.
server/src/main/java/org/opensearch/cluster/routing/allocation/AllocationConstraints.java
Do we need AllocationBenchmark class changes for JMH benchmarks to be checked in?
...c/main/java/org/opensearch/cluster/routing/allocation/allocator/BalancedShardsAllocator.java
The current benchmark contains iterations for a 200-node setup. I have added iterations to benchmark larger setups (up to 1000 nodes). I think we can open an issue to decide whether we really want to add iterations for larger node counts to all of the benchmarks. Let me know your thoughts.
…primary-rebalacing
Signed-off-by: Arpit-Bandejiya <[email protected]>
The backport to
To backport manually, run these commands in your terminal:
# Navigate to the root of your repository
cd $(git rev-parse --show-toplevel)
# Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add ../.worktrees/OpenSearch/backport-2.x 2.x
# Navigate to the new working tree
pushd ../.worktrees/OpenSearch/backport-2.x
# Create a new branch
git switch --create backport/backport-12656-to-2.x
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 3491bcb23d6b398117cfd11c5d273b2e83798d0b
# Push it to GitHub
git push --set-upstream origin backport/backport-12656-to-2.x
# Go back to the original working tree
popd
# Delete the working tree
git worktree remove ../.worktrees/OpenSearch/backport-2.x
Then, create a pull request where the
…ensearch-project#12656) Signed-off-by: Arpit-Bandejiya <[email protected]> (cherry picked from commit 3491bcb)
…ensearch-project#12656) Signed-off-by: Arpit-Bandejiya <[email protected]> (cherry picked from commit 3491bcb) Signed-off-by: Arpit Bandejiya <[email protected]>
…2656) (#13014) (cherry picked from commit 3491bcb) Signed-off-by: Arpit-Bandejiya <[email protected]>
…ensearch-project#12656) Signed-off-by: Arpit-Bandejiya <[email protected]> Signed-off-by: Shivansh Arora <[email protected]>
…ensearch-project#12656) Signed-off-by: Arpit-Bandejiya <[email protected]>
Description
Currently, the cluster.routing.allocation.balance.prefer_primary setting is used to balance primary shards during allocation. This change introduces primary balancing during the rebalancing phase through a new setting, cluster.routing.allocation.rebalance.prefer_primary. It also introduces a buffer that relaxes the constraint, so that we can control the degree of balance required in the rebalancing phase.
We also introduced the cluster.routing.allocation.balance.prefer_random_allocation setting to allocate to nodes randomly, instead of in round-robin fashion, when multiple nodes have MIN_WEIGHT. After extended testing (results attached below), we saw no gains from random allocation, so we are not going ahead with that change.
Related Issues
Resolves #12250
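As a rough illustration of the buffer idea in the description above, here is a minimal Java sketch. It is not the code from this PR. It assumes the constraint compares a node's primary shard count against the cluster-wide average scaled up by the buffer. The class name PrimaryBalanceSketch, both method names, and the integer-percent buffer parameter are hypothetical and chosen only for this example.

```java
public class PrimaryBalanceSketch {

    /**
     * Hypothetical helper: the largest number of primaries a node may hold
     * before it is treated as unbalanced. Computes
     * ceil(average primaries per node * (1 + bufferPercent / 100))
     * using integer arithmetic to avoid floating-point rounding surprises.
     */
    static long allowedPrimaries(long totalPrimaries, long nodeCount, long bufferPercent) {
        long numerator = totalPrimaries * (100 + bufferPercent);
        long denominator = 100 * nodeCount;
        return (numerator + denominator - 1) / denominator; // integer ceiling
    }

    /** Assumed rule: a node breaches the constraint when it exceeds the allowed count. */
    static boolean breachesConstraint(long primariesOnNode, long totalPrimaries,
                                      long nodeCount, long bufferPercent) {
        return primariesOnNode > allowedPrimaries(totalPrimaries, nodeCount, bufferPercent);
    }

    public static void main(String[] args) {
        long totalPrimaries = 1000;
        long nodeCount = 10;
        // The buffer percentages mirror the ones benchmarked in this PR (1%, 5%, 10%).
        for (long bufferPercent : new long[] {1, 5, 10}) {
            System.out.println("buffer " + bufferPercent + "% -> allowed primaries per node: "
                + allowedPrimaries(totalPrimaries, nodeCount, bufferPercent));
        }
    }
}
```

Under this assumed rule, a larger buffer tolerates more skew and should trigger fewer primary relocations, while a smaller buffer enforces a tighter balance at the cost of more shard movement.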
Benchmarking of the changes:
We used AllocationBenchmark to benchmark the change, with the test cases altered to be the following.
Results
For original algorithm:
We initially compared it with rebalancing of primary shards with 5% buffer allowed.
We also ran the benchmark with random allocation among MIN_WEIGHT nodes to see whether it produced any gains. It did not help much, and the average and maximum scores were comparatively higher than with normal allocation, so we decided not to go ahead with it.
We then also benchmarked different buffer percentages.
For 10% buffer:
For 1% buffer:
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.