Gracefully cancel the replication when the replica shard is closed #8089

ankitkala · 2023-06-15T16:04:44Z

ReplicationSource defines a cancel method intended to allow source implementations to cancel and clean up any ongoing requests. The remote store implementation does not currently provide an implementation.

We need to validate cancellation with remote store completes successfully. We can reuse ITs introduced with #8478 to do this validation.

I think we will need to treat the remote store implementation as a sync call and wrap in CancellableThreads to perform interrupt, but maybe there is a similar cancel mechanism to RetryableTransportClient.cancel to support this.

mch2 · 2023-07-11T17:21:44Z

Adding description here -
ReplicationSource defines a cancel method intended to allow source implementations to cancel and clean up any ongoing requests. The remote store implementation does not currently provide an implementation.

We need to validate cancellation with remote store completes successfully. We can reuse ITs introduced with #8478 to do this validation.

I think we will need to treat the remote store implementation as a sync call and wrap in CancellableThreads to perform interrupt, but maybe there is a similar cancel mechanism to RetryableTransportClient.cancel to support this.

dreamer-89 · 2023-09-01T19:13:22Z

Discussed separately with @mch2, as we are close to 2.10.0 release (Sep 04, 2023 code freeze) and feature is pretty stable (non-flaky tests for node-node) and no data points around use-cases this change (PR #9234) will solve. We decided to not merge this in for 2.10.0. Removing 2.10.0 label from the issue and moving to backlog.

PR : #9234

CC @anasalkouz @mch2

mch2 · 2023-10-16T22:49:05Z

I'm not sure if this is an issue with cancellation as a replica, but it looks like post failover a corrupt file is breaking recovery. I don't know if this is because the replica is still writing concurrently in a previous sync, or if its not handling reading cksum properly.

org.opensearch.index.shard.IndexShardRecoveryException: Exception while copying segment files from remote segment store
	at org.opensearch.index.shard.IndexShard.syncSegmentsFromRemoteSegmentStore(IndexShard.java:4825) ~[classes/:?]
	at org.opensearch.index.shard.IndexShard.syncSegmentsFromRemoteSegmentStore(IndexShard.java:4765) ~[classes/:?]
	at org.opensearch.index.shard.IndexShard.resetEngineToGlobalCheckpoint(IndexShard.java:4702) ~[classes/:?]
	at org.opensearch.index.shard.IndexShard.lambda$updateShardState$4(IndexShard.java:701) ~[classes/:?]
	at org.opensearch.index.shard.IndexShard$5.onResponse(IndexShard.java:3996) ~[classes/:?]
	at org.opensearch.index.shard.IndexShard$5.onResponse(IndexShard.java:3966) ~[classes/:?]
	at org.opensearch.index.shard.IndexShard.lambda$asyncBlockOperations$36(IndexShard.java:3917) ~[classes/:?]
	at org.opensearch.core.action.ActionListener$1.onResponse(ActionListener.java:82) [classes/:?]
	at org.opensearch.index.shard.IndexShardOperationPermits$1.doRun(IndexShardOperationPermits.java:157) [classes/:?]
	at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:908) [classes/:?]
	at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) [classes/:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) [?:?]
	at java.lang.Thread.run(Thread.java:1589) [?:?]
Caused by: java.io.IOException: Error occurred when downloading segments from remote store
	at org.opensearch.index.shard.IndexShard.copySegmentFiles(IndexShard.java:4931) ~[classes/:?]
	at org.opensearch.index.shard.IndexShard.syncSegmentsFromRemoteSegmentStore(IndexShard.java:4805) ~[classes/:?]
	... 13 more
Caused by: java.nio.file.FileAlreadyExistsException: File "_0.cfs" was already written to.
	at org.apache.lucene.tests.store.MockDirectoryWrapper.createOutput(MockDirectoryWrapper.java:704) ~[lucene-test-framework-9.8.0.jar:9.8.0 d914b3722bd5b8ef31ccf7e8ddc638a87fd648db - 2023-09-21 21:57:47]
	at org.apache.lucene.store.FilterDirectory.createOutput(FilterDirectory.java:75) ~[lucene-core-9.8.0.jar:9.8.0 d914b3722bd5b8ef31ccf7e8ddc638a87fd648db - 2023-09-21 21:57:47]
	at org.opensearch.index.store.ByteSizeCachingDirectory.createOutput(ByteSizeCachingDirectory.java:153) ~[classes/:?]
	at org.apache.lucene.store.FilterDirectory.createOutput(FilterDirectory.java:75) ~[lucene-core-9.8.0.jar:9.8.0 d914b3722bd5b8ef31ccf7e8ddc638a87fd648db - 2023-09-21 21:57:47]
	at org.opensearch.index.store.Store$StoreDirectory.copyFileAndValidateChecksum(Store.java:1000) ~[classes/:?]
	at org.opensearch.index.store.Store$StoreDirectory.copyFrom(Store.java:981) ~[classes/:?]
	at org.opensearch.index.store.RemoteStoreFileDownloader.lambda$copyOneFile$1(RemoteStoreFileDownloader.java:132) ~[classes/:?]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:577) ~[?:?]
	at java.util.concurrent.FutureTask.run(FutureTask.java:317) ~[?:?]
	at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:849) ~[classes/:?]
	... 3 more

I log the local files inside of syncSegmentsFromRemoteSegmentStore.

[2023-10-17T01:43:37,936][INFO ][o.o.i.s.IndexShard ] [node_t3] [test-idx-1][0] File _0.cfe already exists
[2023-10-17T01:43:37,936][INFO ][o.o.i.s.IndexShard ] [node_t3] [test-idx-1][0] File _0.si already exists
[2023-10-17T01:43:37,943][WARN ][o.o.i.s.IndexShard ] [node_t3] [test-idx-1][0] Exception while reading checksum of file: _0.cfs, this can happen if file is corrupted
[2023-10-17T01:43:37,943][INFO ][o.o.i.s.IndexShard ] [node_t3] [test-idx-1][0] Local dir [_0.cfe, _0.cfs, _0.si, segments_6, write.lock]

ankitkala added bug Something isn't working untriaged labels Jun 15, 2023

anasalkouz added distributed framework and removed untriaged labels Jun 15, 2023

anasalkouz added this to Segment Replication Jun 15, 2023

github-project-automation bot moved this to Todo in Segment Replication Jun 15, 2023

anasalkouz assigned anasalkouz and mch2 and unassigned anasalkouz Jun 15, 2023

anasalkouz moved this from Todo to In Progress in Segment Replication Jun 22, 2023

mch2 removed their assignment Jul 10, 2023

dreamer-89 added enhancement Enhancement or improvement to existing feature or request Indexing:Replication Issues and PRs related to core replication framework eg segrep and removed bug Something isn't working labels Aug 4, 2023

dreamer-89 self-assigned this Aug 7, 2023

dreamer-89 added the v2.10.0 label Aug 7, 2023

dreamer-89 mentioned this issue Aug 10, 2023

[Segment Replication] Add cancellation support in RemoteStoreReplicationSource #9234

Closed

6 tasks

dreamer-89 removed the v2.10.0 label Sep 1, 2023

mch2 mentioned this issue Oct 19, 2023

Make RemoteStoreReplicationSource#getSegmentFiles asynchronous #10725

Merged

7 tasks

mch2 closed this as completed in #10725 Oct 20, 2023

github-project-automation bot moved this from In Progress to Done in Segment Replication Oct 20, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Gracefully cancel the replication when the replica shard is closed #8089

Gracefully cancel the replication when the replica shard is closed #8089

ankitkala commented Jun 15, 2023 •

edited by mch2

Loading

mch2 commented Jul 11, 2023 •

edited

Loading

dreamer-89 commented Sep 1, 2023 •

edited

Loading

mch2 commented Oct 16, 2023

Gracefully cancel the replication when the replica shard is closed #8089

Gracefully cancel the replication when the replica shard is closed #8089

Comments

ankitkala commented Jun 15, 2023 • edited by mch2 Loading

mch2 commented Jul 11, 2023 • edited Loading

dreamer-89 commented Sep 1, 2023 • edited Loading

mch2 commented Oct 16, 2023

ankitkala commented Jun 15, 2023 •

edited by mch2

Loading

mch2 commented Jul 11, 2023 •

edited

Loading

dreamer-89 commented Sep 1, 2023 •

edited

Loading