[Segment Replication] Change validation for SegRep during Shrink operation to be async #11352

Closed · mch2 opened this issue Nov 27, 2023 · 4 comments · Fixed by #12117

mch2 (Member) commented Nov 27, 2023

Is your feature request related to a problem? Please describe.
Today, when we invoke a Shrink request on a SegRep index, we verify that shards are up to date before accepting the request and fail fast if they are not. This causes issues for users and for ISM, as they then have to manually retry the request at some later point.

Describe the solution you'd like
Move the validation that replicas are up to date to a later point in the Shrink operation and perform it asynchronously. I think this can happen during the recovery step, waiting until all local shards are current (a rough sketch follows at the end of this description).

Describe alternatives you've considered
Leaving as is

Additional context
None
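
As a rough illustration only, here is a minimal, self-contained sketch of the kind of wait-and-validate step being proposed, assuming the caller can supply a "max bytes behind" view of replica lag (all names here are hypothetical; this is not the eventual implementation):

```java
import java.time.Duration;
import java.time.Instant;
import java.util.function.LongSupplier;

public final class AsyncShrinkValidationSketch {

    /**
     * Polls a caller-supplied "max bytes behind" metric until replicas report zero,
     * or throws once the timeout elapses. Hypothetical helper, for illustration only.
     */
    static void awaitReplicasCaughtUp(LongSupplier maxBytesBehind, Duration timeout) throws InterruptedException {
        Instant deadline = Instant.now().plus(timeout);
        while (maxBytesBehind.getAsLong() != 0L) {
            if (Instant.now().isAfter(deadline)) {
                throw new IllegalStateException("Replicas did not catch up with the primary within " + timeout);
            }
            Thread.sleep(500); // simple polling interval for the sketch
        }
    }
}
```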

mch2 added the enhancement, untriaged, API, and Indexing:Replication labels on Nov 27, 2023
Poojita-Raj (Contributor) commented:

In the event of a shrink operation, we expect users to move all the source shards onto the node performing the shrink, and we fail if they haven't been moved. In the same manner, we also want to fail if the replica shards being used as source shards are not up to date with the primary in their replication group. With document replication, no shard could be behind, so this wasn't an issue.

When a user performs a shrink operation in a segment-replication-enabled cluster and replica shards are selected as the source shards, we want to ensure that the shrink does not lose data because those replicas were not up to date.

We considered two different approaches for moving the validation check of whether the replicas are up to date:

  1. Compare the replica's latestSegmentInfosSnapshot version against the SegmentInfos version of the latestReceivedCheckpoint received from the primary. This would not require any network call. However, in the case of a primary failover, the replica might not have received the latest checkpoint yet.
     • This might result in data loss.
     • If we move to a pull-based replication model in the future, this would be an issue.
  2. Add a network call from the replica to the primary to confirm that their latestSegmentInfos versions match, which lets us confirm the replicas are up to date. The call takes place after we block any further indexing of docs, so it should be accurate.

We are moving ahead with approach 2 (sketched below) to be 100% sure that replicas are up to date; this approach also works well if we move towards a pull-based replication model in the future.
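
A minimal, self-contained sketch of the comparison behind approach 2, assuming a hypothetical client abstraction for the replica-to-primary call (none of these names are the actual OpenSearch classes):

```java
public final class ReplicaUpToDateCheckSketch {

    /** Hypothetical stand-in for "ask the primary for its latest SegmentInfos version" over the transport layer. */
    interface PrimaryVersionClient {
        long fetchPrimarySegmentInfosVersion();
    }

    /**
     * Fails if the replica's local SegmentInfos version is behind the primary's.
     * Indexing is already blocked at this point, so the primary's version is stable.
     */
    static void ensureReplicaUpToDate(long replicaSegmentInfosVersion, PrimaryVersionClient client) {
        long primaryVersion = client.fetchPrimarySegmentInfosVersion();
        if (replicaSegmentInfosVersion < primaryVersion) {
            throw new IllegalStateException(
                "Replica SegmentInfos version ["
                    + replicaSegmentInfosVersion
                    + "] is behind the primary ["
                    + primaryVersion
                    + "]; this shard cannot be used as a shrink source yet."
            );
        }
    }
}
```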

Poojita-Raj (Contributor) commented Jan 31, 2024

We need to make a decision regarding approach 2:

We complete relocation before we do a shrink, and we also block any further indexing of documents. So in the majority of cases the replicas should already be up to date when we run a shrink operation, but since the risk is data loss, it's important that we check this as well.

In the current scenario, if we leave it as is, we fail fast: right at the beginning of the call, before accepting the request, if the replicas are not up to date the shrink operation fails and the user has to retry.
Link:

```java
if (indicesStatsResponse.getIndex(sourceIndex)
    .getTotal()
    .getSegments()
    .getReplicationStats().maxBytesBehind != 0) {
    throw new IllegalStateException(
        " For index ["
            + sourceIndex
            + "] replica shards haven't caught up with primary, please retry after sometime."
    );
}
```

For a better user experience, we wanted to make this check async, i.e., wait for the replicas to be up to date and then continue with the shrink operation.
In approach 2:

  • If the shrink operation fails because the replicas are not up to date (e.g., on a timeout), it throws an IllegalStateException, which bubbles up to AllocationService and IndicesClusterStateService.
  • The allocation service treats the first failure as a failure to create the shard and then tries 3 more times to allocate it, including on different nodes. A shrink operation needs to happen on a particular node, so this is incorrect, unexpected behavior.
  • If all of these attempts fail, it fails the shard, resulting in a red cluster because the new target shards are unassigned.

We might be optimizing for an edge case, since this is not a frequent occurrence. We need to decide whether it is better to introduce a wait when a timeout could leave us with a red cluster because shards have not caught up.

mch2 (Member, Author) commented Jan 31, 2024

> The allocation service sees the first failure as a failure to create the shard and then tries 3 more times to allocate it, on different nodes as well. A shrink operation needs to happen on a particular node, so this is incorrect, unexpected behavior.

The resize operations themselves are expected to be made as part of a workflow where:

  1. The user makes an API call to create the write block.
  2. The user makes an API call to relocate shards.
  3. The user makes a resize request that triggers an async create index.
  4. The server asynchronously creates the index, allocates shards, and then finally ...
  5. issues the recovery from local shards.

The shard recovery in step 5 ensures that step 2 completed successfully by validating that a local started copy of each shard exists. If any of those shards are not active we do not wait; we simply throw the exception, and in that case the cluster will turn red because the new index cannot be created.
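
As an illustration, a minimal sketch of that step-5 validation, assuming the caller can supply the set of shard ids that are started locally (hypothetical names, not the actual recovery-from-local-shards code):

```java
import java.util.Set;

public final class LocalSourceShardCheckSketch {

    /**
     * Fails fast if any source shard does not have a started copy on this node,
     * mirroring the "do not wait, just throw" behavior described above. Hypothetical helper.
     */
    static void ensureAllSourceShardsStartedLocally(int numberOfSourceShards, Set<Integer> locallyStartedShardIds) {
        for (int shardId = 0; shardId < numberOfSourceShards; shardId++) {
            if (!locallyStartedShardIds.contains(shardId)) {
                throw new IllegalStateException(
                    "source shard [" + shardId + "] is not active on this node; cannot recover the shrunken index"
                );
            }
        }
    }
}
```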

Any shard copy that is relocated to the node will already be up to date once it is active. So the shards we would be concerned about are any that were previously on that node and did not go through a relocation cycle.

Poojita-Raj (Contributor) commented:

We need to ensure, in a completely async manner, that replicas of the source shards are up to date with their primary before we move ahead with triggering the shrink operation.

Currently, we don't have any way of guaranteeing that, after a write and refresh, all shards are in sync and up to date. This is a problem we need to deal with, but it is outside the scope of this issue.

Keeping the user experience in mind, what we want to avoid is incomplete shrink operations leaving the cluster in a red state. Instead, the plan of action is:

  1. Keep the fast-fail check as it is.
  2. Modify the exception thrown so that the end user knows how to check whether the replicas are up to date, instead of vaguely instructing them to retry later (see the sketch below).
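
As a rough sketch only, the modified exception could look something like the following, reusing the stats access from the snippet quoted earlier in the thread; the exact message and guidance are assumptions here, and the final change in the linked PR may differ:

```java
if (indicesStatsResponse.getIndex(sourceIndex)
    .getTotal()
    .getSegments()
    .getReplicationStats().maxBytesBehind != 0) {
    // Sketch only: point the user at how to verify replication lag instead of a vague "retry later".
    throw new IllegalStateException(
        "Replica shards of index ["
            + sourceIndex
            + "] have not caught up with the primary. "
            + "Check the segment replication lag for this index and retry the shrink once replicas report zero bytes behind."
    );
}
```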
