
Limit RW separation to remote store enabled clusters and update recovery flow #16760

Open
wants to merge 12 commits into
base: main

Conversation

mch2
Member

@mch2 mch2 commented Dec 2, 2024

Description

This PR includes multiple changes to search replica recovery to further decouple these shards from primaries.

  1. Change search replicas to recover as an empty store instead of via PEER recovery. This runs a store recovery that syncs segments directly from the remote store and eliminates any communication with the primary (see the sketch below).
  2. Remove search replicas from the in-sync allocation ID set and update the routing table to exclude them from allAllocationIds. This ensures primaries are not tracking search replicas or validating their presence in the routing table.
  3. Simplify RW separation by limiting it to remote store enabled clusters. Versions of the above changes are still possible with primary-based node-to-node replication, but they require additional public API changes and I don't think we need them at this time.
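
As a rough illustration of change (1), here is a minimal sketch of the recovery-source selection, assuming OpenSearch's ShardRouting and RecoverySource types; the helper name and exact branching are illustrative, not the code in this PR:

```java
// Minimal sketch, not the exact PR code: search-only replicas take a store-based
// recovery source instead of PEER, so they never copy segments from the primary.
import org.opensearch.cluster.routing.RecoverySource;
import org.opensearch.cluster.routing.ShardRouting;

final class SearchReplicaRecoverySketch {
    // wasActive: whether this shard copy was previously active (e.g. being moved to unassigned).
    static RecoverySource recoverySourceFor(ShardRouting shard, boolean wasActive) {
        if (shard.primary() || shard.isSearchOnly()) {
            // Store-based recovery: start from an existing/empty local store and then
            // sync segments directly from the remote store.
            return wasActive
                ? RecoverySource.ExistingStoreRecoverySource.INSTANCE
                : RecoverySource.EmptyStoreRecoverySource.INSTANCE;
        }
        // Regular write replicas keep the peer recovery path from the primary.
        return RecoverySource.PeerRecoverySource.INSTANCE;
    }
}
```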

Related Issues

Resolves #15952

Check List

  • Functionality includes testing.
  • API changes companion pull request created, if applicable.
  • Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@github-actions github-actions bot added the v2.19.0 Issues and PRs related to version 2.19.0 label Dec 2, 2024
Contributor

github-actions bot commented Dec 2, 2024

❌ Gradle check result for a932d59: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?


Contributor

github-actions bot commented Dec 3, 2024

✅ Gradle check result for a932d59: SUCCESS


codecov bot commented Dec 3, 2024

Codecov Report

Attention: Patch coverage is 57.14286% with 15 lines in your changes missing coverage. Please review.

Project coverage is 72.18%. Comparing base (5ba909a) to head (ff24d1d).
Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
...in/java/org/opensearch/index/shard/IndexShard.java 16.66% 3 Missing and 2 partials ⚠️
...search/cluster/routing/IndexShardRoutingTable.java 42.85% 2 Missing and 2 partials ⚠️
...a/org/opensearch/cluster/routing/ShardRouting.java 33.33% 0 Missing and 2 partials ⚠️
...uster/routing/allocation/IndexMetadataUpdater.java 75.00% 0 Missing and 1 partial ⚠️
...org/opensearch/index/seqno/ReplicationTracker.java 0.00% 0 Missing and 1 partial ⚠️
...a/org/opensearch/index/shard/ReplicationGroup.java 50.00% 0 Missing and 1 partial ⚠️
...java/org/opensearch/index/shard/StoreRecovery.java 75.00% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main   #16760      +/-   ##
============================================
+ Coverage     72.11%   72.18%   +0.06%     
- Complexity    65237    65298      +61     
============================================
  Files          5318     5318              
  Lines        304003   304032      +29     
  Branches      43992    44005      +13     
============================================
+ Hits         219228   219461     +233     
+ Misses        66874    66564     -310     
- Partials      17901    18007     +106     


mch2 added 3 commits December 3, 2024 11:39
This PR includes multiple changes to search replica recovery.
1. Change search only replica copies to recover as empty store instead of PEER. This will run a store recovery that syncs segments from remote store directly and eliminate any primary communication.
2. Remove search replicas from the in-sync allocation ID set and update routing table to exclude them from allAllocationIds.  This ensures primaries aren't tracking or validating the routing table for any search replica's presence.
3. Change search replica validation to require remote store.  There are versions of the above changes that are still possible with primary based node-node replication, but I don't think they are worth making  at this time.

Signed-off-by: Marc Handalian <[email protected]>
Signed-off-by: Marc Handalian <[email protected]>
Signed-off-by: Marc Handalian <[email protected]>
@mch2 mch2 added the backport 2.x Backport to 2.x branch label Dec 3, 2024
Contributor

github-actions bot commented Dec 3, 2024

✅ Gradle check result for 8935bc7: SUCCESS

@shwetathareja
Member

Thanks @mch2 for the changes. At a high level, the changes look good.
I have a couple of clarifications:

  1. Search-replicas are not tracked in ReplicationGroup, how are we ensuring search-replicas are not lagging behind? Is there separate tracking for search-replicas?
  2. When the primary turns red, would search-replicas continue to serve? They no longer know which copy to follow from the remote store. It may break certain invariants in the code where search-replicas are still assigned without any active primary, and would require code changes to handle this case.
  3. Test node-drop and full-cluster-restart scenarios to ensure search-replicas are started automatically (without recovering from scratch) and that no changes are needed in the AsyncShardFetch logic.

Also, I just noticed that indices with auto-expand replicas shouldn't support search-replicas; otherwise the total number of copies can exceed the number of data nodes.

@mch2
Member Author

mch2 commented Dec 5, 2024

Thanks for taking a look @shwetathareja, you bring up great points.

  1. Search-replicas are not tracked in ReplicationGroup, how are we ensuring search-replicas are not lagging behind? Is there separate tracking for search-replicas?

Not yet. Regular segrep has a check to fail lagging replicas based on their replication checkpoint (basically segment infos version plus a timer) compared to the primary's active set of segments. We could implement something similar that fetches and compares checkpoints only among the search replica group and removes any outlier. To start, I am thinking we rely on a failure mechanism local to the shards that fails a copy if a single download event takes too long without making progress; will add a separate task for this.
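
As a purely hypothetical illustration of that shard-local failure mechanism (names and thresholds are mine, not from this PR), the idea is a small watchdog that marks the copy as failed when an in-flight segment download makes no progress for too long:

```java
// Hypothetical sketch of a shard-local download watchdog; not code from this PR.
import java.util.concurrent.TimeUnit;

final class SegmentDownloadWatchdog {
    private final long maxStallNanos;
    private volatile long lastProgressNanos = System.nanoTime();

    SegmentDownloadWatchdog(long maxStallMillis) {
        this.maxStallNanos = TimeUnit.MILLISECONDS.toNanos(maxStallMillis);
    }

    // Called whenever bytes are received for the in-flight segment download.
    void onProgress() {
        lastProgressNanos = System.nanoTime();
    }

    // Periodically polled; a true result would trigger failing the search replica copy.
    boolean isStalled() {
        return System.nanoTime() - lastProgressNanos > maxStallNanos;
    }
}
```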

  2. When the primary turns red, would search-replicas continue to serve? They no longer know which copy to follow from the remote store. It may break certain invariants in the code where search-replicas are still assigned without any active primary, and would require code changes to handle this case.

Search replicas would need to continue syncing up to the latest segment set uploaded by the dropped primary. I will add a test specifically for the red cluster case outside of the scale-to-zero case and see if we need any special handling.

  3. Test node-drop and full-cluster-restart scenarios to ensure search-replicas are started automatically (without recovering from scratch) and that no changes are needed in the AsyncShardFetch logic.

Ack, will add some QA tests with this PR.

Also, I just noticed that indices with auto-expand replicas shouldn't support search-replicas; otherwise the total number of copies can exceed the number of data nodes.

Yeah, I will add a task to handle this. We brought this up a while back; I was thinking of supporting auto-expand for search replicas within their own set of nodes if they exist and otherwise rejecting it, but I think blocking it outright makes sense to start.

@@ -440,7 +440,7 @@ public ShardRouting moveToUnassigned(UnassignedInfo unassignedInfo) {
         assert state != ShardRoutingState.UNASSIGNED : this;
         final RecoverySource recoverySource;
         if (active()) {
-            if (primary()) {
+            if (primary() || isSearchOnly()) {
Member


If isSearchOnly() recovery would be EmptyStore or Existing, then a search replica can turn red as well?

Member Author


Yes, this is possible. Do you mean in terms of reporting cluster health? Right now the cluster would report as yellow; we need to revisit the API to report accordingly.

mch2 added 4 commits December 9, 2024 16:20
… the AllAllocationIds set in the routing table

Signed-off-by: Marc Handalian <[email protected]>
…e store cluster.

This check had previously only checked for segrep

Signed-off-by: Marc Handalian <[email protected]>
Signed-off-by: Marc Handalian <[email protected]>
Contributor

❌ Gradle check result for 8e0240f: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

mch2 added 2 commits December 10, 2024 09:55
… a remote store cluster."

reverting this, we already check for remote store earlier.

This reverts commit 48ca1a3.
Contributor

✅ Gradle check result for ff24d1d: SUCCESS

Signed-off-by: Marc Handalian <[email protected]>
@mch2
Member Author

mch2 commented Dec 18, 2024

When the primary turns red, would search-replicas continue to serve? They no longer know which copy to follow from the remote store. It may break certain invariants in the code where search-replicas are still assigned without any active primary, and would require code changes to handle this case.

@shwetathareja, got some time to work on this. I've added a test for this case to ensure a searcher is still searchable after primary failure: first when there is no writer replica and the cluster turns red, and second when there is a writer replica, the cluster turns yellow, and the new primary writes to the store. This works without issue; however, in the first case we should still be able to restore only the writers if required and assign the primary to a new node. Restore currently fails in this case because an open index with the same name already exists in the cluster; taking a look there.

edit: this should be possible with the _remotestore/_restore API; I think I'm calling the wrong restore API in this test.

edit 2: updated the remote store restore logic to support this and the test passes.
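
For context, a hypothetical test-style snippet for restoring only the writer copies via the remote store restore API, assuming OpenSearch's RestoreRemoteStoreRequest and the usual client plumbing (the index name and helper are illustrative, not the exact test in this PR):

```java
// Hypothetical sketch, not the exact test in this PR: trigger a remote-store restore
// of the writer copies while assigned search replicas keep serving searches.
import org.opensearch.action.admin.cluster.remotestore.restore.RestoreRemoteStoreRequest;
import org.opensearch.action.admin.cluster.remotestore.restore.RestoreRemoteStoreResponse;
import org.opensearch.action.support.PlainActionFuture;
import org.opensearch.client.Client;

final class RemoteStoreRestoreSketch {
    static void restoreWritersFromRemoteStore(Client client) {
        RestoreRemoteStoreRequest request = new RestoreRemoteStoreRequest();
        request.indices("test-index");   // hypothetical index name
        request.restoreAllShards(true);  // restore primaries even though other copies are assigned
        PlainActionFuture<RestoreRemoteStoreResponse> future = PlainActionFuture.newFuture();
        client.admin().cluster().restoreRemoteStore(request, future);
        future.actionGet();
    }
}
```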

@mch2
Copy link
Member Author

mch2 commented Dec 18, 2024

I don't see an existing QA package with remote store enabled to test full-cluster / rolling restarts; looking at adding one.

Contributor

❌ Gradle check result for f915a31: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Labels
backport 2.x Backport to 2.x branch v2.19.0 Issues and PRs related to version 2.19.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[RW Separation] Change search replica recovery flow
2 participants