[Segment Replication] Change validation for SegRep during Shrink operation to be async #11352

Closed · mch2 opened this issue Nov 27, 2023 · 4 comments · Fixed by #12117

mch2 (Member) commented Nov 27, 2023

Is your feature request related to a problem? Please describe.
Today, when we invoke a Shrink request on a SegRep index, we verify that shards are up to date before accepting the request and fail fast if they are not. This causes issues for users and for ISM, as they then have to manually retry the request at some later point.

Describe the solution you'd like
Move the validation that replicas are up to date to a later point in the Shrink operation and perform it asynchronously. I think this can happen during the recovery step, waiting until all local shards are current (a rough sketch follows at the end of this description).

Describe alternatives you've considered
Leaving as is

Additional context
None
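
As a rough illustration only, here is a minimal, self-contained sketch of the kind of wait-and-validate step being proposed, assuming the caller can supply a "max bytes behind" view of replica lag (all names here are hypothetical; this is not the eventual implementation):

```java
import java.time.Duration;
import java.time.Instant;
import java.util.function.LongSupplier;

public final class AsyncShrinkValidationSketch {

    /**
     * Polls a caller-supplied "max bytes behind" metric until replicas report zero,
     * or throws once the timeout elapses. Hypothetical helper, for illustration only.
     */
    static void awaitReplicasCaughtUp(LongSupplier maxBytesBehind, Duration timeout) throws InterruptedException {
        Instant deadline = Instant.now().plus(timeout);
        while (maxBytesBehind.getAsLong() != 0L) {
            if (Instant.now().isAfter(deadline)) {
                throw new IllegalStateException("Replicas did not catch up with the primary within " + timeout);
            }
            Thread.sleep(500); // simple polling interval for the sketch
        }
    }
}
```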

mch2 added the enhancement, untriaged, API, and Indexing:Replication labels on Nov 27, 2023
Poojita-Raj (Contributor) commented:

In the event of a shrink operation, we expect users to move all the source shards onto the node performing the shrink, and we fail if they haven't been moved. In the same manner, we also want to fail if the replica shards being used as source shards are not up to date with the primary in their replication group. With document replication, no shard could be behind, so this wasn't an issue.

When a user performs a shrink operation in a segment-replication-enabled cluster and replica shards are selected as the source shards, we want to ensure that the shrink does not lose data because those replicas were not up to date.

We considered two different approaches for moving the validation check of whether the replicas are up to date:

  1. Compare the replica's latestSegmentInfosSnapshot version against the SegmentInfos version of the latestReceivedCheckpoint received from the primary. This would not require any network call. However, in the case of a primary failover, the replica might not have received the latest checkpoint yet.
     • This might result in data loss.
     • If we move to a pull-based replication model in the future, this would be an issue.
  2. Add a network call from the replica to the primary to confirm that their latestSegmentInfos versions match, which lets us confirm the replicas are up to date. The call takes place after we block any further indexing of docs, so it should be accurate.

We are moving ahead with approach 2 (sketched below) to be 100% sure that replicas are up to date; this approach also works well if we move towards a pull-based replication model in the future.
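
A minimal, self-contained sketch of the comparison behind approach 2, assuming a hypothetical client abstraction for the replica-to-primary call (none of these names are the actual OpenSearch classes):

```java
public final class ReplicaUpToDateCheckSketch {

    /** Hypothetical stand-in for "ask the primary for its latest SegmentInfos version" over the transport layer. */
    interface PrimaryVersionClient {
        long fetchPrimarySegmentInfosVersion();
    }

    /**
     * Fails if the replica's local SegmentInfos version is behind the primary's.
     * Indexing is already blocked at this point, so the primary's version is stable.
     */
    static void ensureReplicaUpToDate(long replicaSegmentInfosVersion, PrimaryVersionClient client) {
        long primaryVersion = client.fetchPrimarySegmentInfosVersion();
        if (replicaSegmentInfosVersion < primaryVersion) {
            throw new IllegalStateException(
                "Replica SegmentInfos version ["
                    + replicaSegmentInfosVersion
                    + "] is behind the primary ["
                    + primaryVersion
                    + "]; this shard cannot be used as a shrink source yet."
            );
        }
    }
}
```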

Poojita-Raj (Contributor) commented Jan 31, 2024

We need to make a decision regarding approach 2:

We complete relocation before we do a shrink, and we also block any further indexing of documents. So in the majority of cases the replicas should already be up to date when we run a shrink operation, but since the risk is data loss, it's important that we check this as well.

In the current scenario, if we leave it as is, we fail fast: right at the beginning of the call, before accepting the request, if the replicas are not up to date the shrink operation fails and the user has to retry.
Link:

```java
if (indicesStatsResponse.getIndex(sourceIndex)
    .getTotal()
    .getSegments()
    .getReplicationStats().maxBytesBehind != 0) {
    throw new IllegalStateException(
        " For index ["
            + sourceIndex
            + "] replica shards haven't caught up with primary, please retry after sometime."
    );
}
```

For a better user experience, we wanted to make this check async, i.e., wait for the replicas to be up to date and then continue with the shrink operation.
In approach 2:

  • If the shrink operation fails because the replicas are not up to date (e.g., on a timeout), it throws an IllegalStateException, which bubbles up to AllocationService and IndicesClusterStateService.
  • The allocation service treats the first failure as a failure to create the shard and then tries 3 more times to allocate it, including on different nodes. A shrink operation needs to happen on a particular node, so this is incorrect, unexpected behavior.
  • If all of these attempts fail, it fails the shard, resulting in a red cluster because the new target shards are unassigned.

We might be optimizing for an edge case, since this is not a frequent occurrence. We need to decide whether it is better to introduce a wait when a timeout could leave us with a red cluster because shards have not caught up.

mch2 (Member, Author) commented Jan 31, 2024

> The allocation service sees the first failure as a failure to create the shard and then tries 3 more times to allocate it, on different nodes as well. A shrink operation needs to happen on a particular node, so this is incorrect, unexpected behavior.

The resize operations themselves are expected to be made as part of a workflow where:

  1. The user makes an API call to create the write block.
  2. The user makes an API call to relocate shards.
  3. The user makes a resize request that triggers an async create index.
  4. The server asynchronously creates the index, allocates shards, and then finally ...
  5. issues the recovery from local shards.

The shard recovery in step 5 ensures that step 2 completed successfully by validating that a local started copy of each shard exists. If any of those shards are not active we do not wait; we simply throw the exception, and in that case the cluster will turn red because the new index cannot be created.
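
As an illustration, a minimal sketch of that step-5 validation, assuming the caller can supply the set of shard ids that are started locally (hypothetical names, not the actual recovery-from-local-shards code):

```java
import java.util.Set;

public final class LocalSourceShardCheckSketch {

    /**
     * Fails fast if any source shard does not have a started copy on this node,
     * mirroring the "do not wait, just throw" behavior described above. Hypothetical helper.
     */
    static void ensureAllSourceShardsStartedLocally(int numberOfSourceShards, Set<Integer> locallyStartedShardIds) {
        for (int shardId = 0; shardId < numberOfSourceShards; shardId++) {
            if (!locallyStartedShardIds.contains(shardId)) {
                throw new IllegalStateException(
                    "source shard [" + shardId + "] is not active on this node; cannot recover the shrunken index"
                );
            }
        }
    }
}
```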

Any shard copy that is relocated to the node will already be up to date once it is active. So the shards we would be concerned about are any that were previously on that node and did not go through a relocation cycle.

Poojita-Raj (Contributor) commented:

We need to ensure, in a completely async manner, that replicas of the source shards are up to date with their primary before we move ahead with triggering the shrink operation.

Currently, we don't have any way of guaranteeing that, after a write and refresh, all shards are in sync and up to date. This is a problem we need to deal with, but it is outside the scope of this issue.

Keeping the user experience in mind, what we want to avoid is incomplete shrink operations leaving the cluster in a red state. Instead, the plan of action is:

  1. Keep the fast-fail check as it is.
  2. Modify the exception thrown so that the end user knows how to check whether the replicas are up to date, instead of vaguely instructing them to retry later (see the sketch below).
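
As a rough sketch only, the modified exception could look something like the following, reusing the stats access from the snippet quoted earlier in the thread; the exact message and guidance are assumptions here, and the final change in the linked PR may differ:

```java
if (indicesStatsResponse.getIndex(sourceIndex)
    .getTotal()
    .getSegments()
    .getReplicationStats().maxBytesBehind != 0) {
    // Sketch only: point the user at how to verify replication lag instead of a vague "retry later".
    throw new IllegalStateException(
        "Replica shards of index ["
            + sourceIndex
            + "] have not caught up with the primary. "
            + "Check the segment replication lag for this index and retry the shrink once replicas report zero bytes behind."
    );
}
```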
