Fix flaky SegRep test testScrollCreatedOnReplica #12077
Conversation
Signed-off-by: Marc Handalian <[email protected]>
Compatibility status: Checks if related components are compatible with change f9184b9
Incompatible components: [https://github.com/opensearch-project/performance-analyzer.git, https://github.com/opensearch-project/performance-analyzer-rca.git, https://github.com/opensearch-project/cross-cluster-replication.git]
Skipped components:
Compatible components: [https://github.com/opensearch-project/asynchronous-search.git, https://github.com/opensearch-project/security-analytics.git, https://github.com/opensearch-project/anomaly-detection.git, https://github.com/opensearch-project/observability.git, https://github.com/opensearch-project/reporting.git, https://github.com/opensearch-project/opensearch-oci-object-storage.git, https://github.com/opensearch-project/job-scheduler.git, https://github.com/opensearch-project/custom-codecs.git, https://github.com/opensearch-project/common-utils.git, https://github.com/opensearch-project/notifications.git, https://github.com/opensearch-project/alerting.git, https://github.com/opensearch-project/index-management.git, https://github.com/opensearch-project/ml-commons.git, https://github.com/opensearch-project/neural-search.git, https://github.com/opensearch-project/k-nn.git, https://github.com/opensearch-project/geospatial.git, https://github.com/opensearch-project/security.git, https://github.com/opensearch-project/sql.git]
Grr, hit a failure here locally after ~1k more iterations for the remote store version of this test, where segments are still part of the latest local commit on the replica. I see this error in the logs but I'm not sure it's related... debugging:
[2024-01-30T01:03:06,153][ERROR][o.o.i.s.RemoteStoreRefreshListener] [node_t2] [test-idx-1][0] Exception in RemoteStoreRefreshListener.afterRefresh()
java.lang.AssertionError: already started
at org.opensearch.indices.replication.common.ReplicationTimer.start(ReplicationTimer.java:51) ~[classes/:?]
at org.opensearch.index.seqno.ReplicationTracker.lambda$startReplicationLagTimers$22(ReplicationTracker.java:1285) ~[classes/:?]
at java.base/java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:183) ~[?:?]
at java.base/java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:179) ~[?:?]
at java.base/java.util.HashMap$EntrySpliterator.forEachRemaining(HashMap.java:1850) ~[?:?]
at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:509) ~[?:?]
at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:499) ~[?:?]
at java.base/java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:150) ~[?:?]
at java.base/java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:173) ~[?:?]
at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) ~[?:?]
at java.base/java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:596) ~[?:?]
at org.opensearch.index.seqno.ReplicationTracker.startReplicationLagTimers(ReplicationTracker.java:1277) ~[classes/:?]
at org.opensearch.index.shard.IndexShard.onCheckpointPublished(IndexShard.java:1936) ~[classes/:?]
at org.opensearch.indices.replication.checkpoint.SegmentReplicationCheckpointPublisher.publish(SegmentReplicationCheckpointPublisher.java:39) ~[classes/:?]
at org.opensearch.index.shard.RemoteStoreRefreshListener.onSuccessfulSegmentsSync(RemoteStoreRefreshListener.java:309) ~[classes/:?]
at org.opensearch.index.shard.RemoteStoreRefreshListener$1.onResponse(RemoteStoreRefreshListener.java:229) ~[classes/:?]
at org.opensearch.index.shard.RemoteStoreRefreshListener$1.onResponse(RemoteStoreRefreshListener.java:220) ~[classes/:?]
at org.opensearch.action.LatchedActionListener.onResponse(LatchedActionListener.java:58) ~[classes/:?]
at org.opensearch.index.shard.RemoteStoreRefreshListener.uploadNewSegments(RemoteStoreRefreshListener.java:378) ~[classes/:?]
at org.opensearch.index.shard.RemoteStoreRefreshListener.syncSegments(RemoteStoreRefreshListener.java:254) [classes/:?]
at org.opensearch.index.shard.RemoteStoreRefreshListener.performAfterRefreshWithPermit(RemoteStoreRefreshListener.java:152) [classes/:?]
at org.opensearch.index.shard.ReleasableRetryableRefreshListener.runAfterRefreshWithPermit(ReleasableRetryableRefreshListener.java:160) [classes/:?]
at org.opensearch.index.shard.ReleasableRetryableRefreshListener.afterRefresh(ReleasableRetryableRefreshListener.java:66) [classes/:?]
at org.apache.lucene.search.ReferenceManager.notifyRefreshListenersRefreshed(ReferenceManager.java:275) [lucene-core-9.9.2.jar:9.9.2 a2939784c4ca60bc28bf488b5479c02fc2e5e22c - 2024-01-25 09:51:09]
at org.apache.lucene.search.ReferenceManager.doMaybeRefresh(ReferenceManager.java:182) [lucene-core-9.9.2.jar:9.9.2 a2939784c4ca60bc28bf488b5479c02fc2e5e22c - 2024-01-25 09:51:09]
at org.apache.lucene.search.ReferenceManager.maybeRefreshBlocking(ReferenceManager.java:240) [lucene-core-9.9.2.jar:9.9.2 a2939784c4ca60bc28bf488b5479c02fc2e5e22c - 2024-01-25 09:51:09]
at org.opensearch.index.engine.InternalEngine.refresh(InternalEngine.java:1771) [classes/:?]
at org.opensearch.index.engine.InternalEngine.flush(InternalEngine.java:1886) [classes/:?]
at org.opensearch.index.engine.InternalEngine.forceMerge(InternalEngine.java:2025) [classes/:?]
at org.opensearch.index.shard.IndexShard.forceMerge(IndexShard.java:1521) [classes/:?]
at org.opensearch.action.admin.indices.forcemerge.TransportForceMergeAction.shardOperation(TransportForceMergeAction.java:113) [classes/:?]
at org.opensearch.action.admin.indices.forcemerge.TransportForceMergeAction.shardOperation(TransportForceMergeAction.java:60) [classes/:?]
at org.opensearch.action.support.broadcast.node.TransportBroadcastByNodeAction$BroadcastByNodeTransportRequestHandler.onShardOperation(TransportBroadcastByNodeAction.java:495) [classes/:?]
at org.opensearch.action.support.broadcast.node.TransportBroadcastByNodeAction$BroadcastByNodeTransportRequestHandler.messageReceived(TransportBroadcastByNodeAction.java:469) [classes/:?]
at org.opensearch.action.support.broadcast.node.TransportBroadcastByNodeAction$BroadcastByNodeTransportRequestHandler.messageReceived(TransportBroadcastByNodeAction.java:456) [classes/:?]
at org.opensearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:106) [classes/:?]
at org.opensearch.transport.InboundHandler$RequestHandler.doRun(InboundHandler.java:480) [classes/:?]
at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:913) [classes/:?]
at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) [classes/:?]
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
at java.base/java.lang.Thread.run(Thread.java:840) [?:?]
Don't think that trace is the cause here. The issue is that the replica is not making a new local commit after receiving the new segments. I think this may mean we need a force flush. Going to leave this running overnight to confirm.
OK, the issue here is that the refresh triggered by the force merge is refreshing on the newly created merged segment before the commit has been made. With some added logs in InternalEngine ---
Segment 5 is the single segment we just force-merged down to, but the commit generation is still 4. A competing refresh runs after the new segments are created but before the flush has finished. The replica sees that the commit generation has not been updated, so it has no reason to commit locally. The fix here is to disable scheduled refresh. We can also reproduce this more consistently by reducing the refresh interval to 1ms.
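For context, here is a minimal sketch of the two refresh-interval settings mentioned above, assuming an integration test that builds its index settings with `Settings.builder()`. The class and method names are illustrative only, not the PR's actual test code.

```java
import org.opensearch.common.settings.Settings;

// Illustrative sketch only; names are hypothetical, not the PR's test code.
public class RefreshIntervalSketch {

    // Disable scheduled refresh so only explicit refresh/flush calls run, removing
    // the competing refresh between the force-merge's new segments and the commit.
    static Settings withoutScheduledRefresh(Settings base) {
        return Settings.builder()
            .put(base)
            .put("index.refresh_interval", -1)
            .build();
    }

    // Conversely, a 1ms interval makes the competing refresh far more likely,
    // which is how the race can be reproduced much more consistently.
    static Settings withAggressiveRefresh(Settings base) {
        return Settings.builder()
            .put(base)
            .put("index.refresh_interval", "1ms")
            .build();
    }
}
```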
Signed-off-by: Marc Handalian <[email protected]>
❌ Gradle check result for d5002a2: Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
Signed-off-by: Marc Handalian <[email protected]>
❌ Gradle check result for 8200c98: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
❕ Gradle check result for d5002a2: UNSTABLE
Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.
Signed-off-by: Marc Handalian <[email protected]>
❌ Gradle check result for f9184b9: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
This one is a bit odd - I don't think this is related to this PR, but I also don't think it's test-specific flakiness - cut #12114 with more details.
* Fix flaky test testScrollCreatedOnReplica
Signed-off-by: Marc Handalian <[email protected]>
* Disable scheduled refresh
Signed-off-by: Marc Handalian <[email protected]>
* Clean up segment collection assertions
Signed-off-by: Marc Handalian <[email protected]>
* Fix spotless
Signed-off-by: Marc Handalian <[email protected]>
---------
Signed-off-by: Marc Handalian <[email protected]>
(cherry picked from commit 247e2ee)
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
* Fix flaky test testScrollCreatedOnReplica
* Disable scheduled refresh
* Clean up segment collection assertions
* Fix spotless
---------
(cherry picked from commit 247e2ee)
Signed-off-by: Marc Handalian <[email protected]>
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
…12077)
* Fix flaky test testScrollCreatedOnReplica
Signed-off-by: Marc Handalian <[email protected]>
* Disable scheduled refresh
Signed-off-by: Marc Handalian <[email protected]>
* Clean up segment collection assertions
Signed-off-by: Marc Handalian <[email protected]>
* Fix spotless
Signed-off-by: Marc Handalian <[email protected]>
---------
Signed-off-by: Marc Handalian <[email protected]>

…12077)
* Fix flaky test testScrollCreatedOnReplica
Signed-off-by: Marc Handalian <[email protected]>
* Disable scheduled refresh
Signed-off-by: Marc Handalian <[email protected]>
* Clean up segment collection assertions
Signed-off-by: Marc Handalian <[email protected]>
* Fix spotless
Signed-off-by: Marc Handalian <[email protected]>
---------
Signed-off-by: Marc Handalian <[email protected]>
Signed-off-by: Aman Khare <[email protected]>

…12077)
* Fix flaky test testScrollCreatedOnReplica
Signed-off-by: Marc Handalian <[email protected]>
* Disable scheduled refresh
Signed-off-by: Marc Handalian <[email protected]>
* Clean up segment collection assertions
Signed-off-by: Marc Handalian <[email protected]>
* Fix spotless
Signed-off-by: Marc Handalian <[email protected]>
---------
Signed-off-by: Marc Handalian <[email protected]>

…12077)
* Fix flaky test testScrollCreatedOnReplica
Signed-off-by: Marc Handalian <[email protected]>
* Disable scheduled refresh
Signed-off-by: Marc Handalian <[email protected]>
* Clean up segment collection assertions
Signed-off-by: Marc Handalian <[email protected]>
* Fix spotless
Signed-off-by: Marc Handalian <[email protected]>
---------
Signed-off-by: Marc Handalian <[email protected]>
Signed-off-by: Shivansh Arora <[email protected]>
Description
This change fixes flakiness in a segrep test for scroll requests. The flakiness was in an assertion that all segments snapshotted by the scroll were deleted from disk after the scroll was cleared. The failures occurred because the snapshotted segments were still referenced by the latest on-disk commit point, which is always preserved.
The test was doing a lot of random merges/deletes, etc., which made it difficult to follow why those files were sneaking in as part of the latest commit.
To make this easier to reason about, I've simplified the test flow to:
I've run this ~6k times without failure.
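For anyone puzzling over why the original assertion was racy, here is a small Lucene-level sketch (not the PR's test code) of the set of files that always survives on disk: everything referenced by the latest commit point, regardless of whether any scroll still holds a reference to them.

```java
import java.nio.file.Paths;
import java.util.Collection;

import org.apache.lucene.index.SegmentInfos;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

// Sketch: list the files belonging to the latest on-disk commit. A "scroll segments
// were deleted" assertion has to exclude these, because the latest commit point is
// always preserved even after the scroll is cleared.
public class LatestCommitFilesSketch {
    public static void main(String[] args) throws Exception {
        try (Directory dir = FSDirectory.open(Paths.get(args[0]))) {
            SegmentInfos latest = SegmentInfos.readLatestCommit(dir);
            Collection<String> retained = latest.files(true); // true = include segments_N itself
            retained.forEach(System.out::println);
        }
    }
}
```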
This PR also contains a bug fix to ReplicaFileDeleter to take a Consumer instead of a BiConsumer<String, String>. We use Store::deleteQuiet to purge files from disk; it accepts a varargs of file names and does not have a reason parameter. With the BiConsumer, the literal reason string "delete unreferenced" was being passed to deleteQuiet as if it were a file to delete, and the resulting NoSuchFileException was silently swallowed. Taking a Consumer ensures we only ever pass real file names.
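To make the Consumer-vs-BiConsumer point concrete, here's a self-contained sketch (stand-in code, not the actual ReplicaFileDeleter or Store classes): a varargs method reference satisfies both functional interfaces, which is exactly how a reason string can slip in as a "file name".

```java
import java.util.function.BiConsumer;
import java.util.function.Consumer;

// Hypothetical stand-in code to illustrate the shape of the bug, not OpenSearch's classes.
public class ReplicaFileDeleterSketch {

    // Stand-in for Store::deleteQuiet: varargs of file names, no "reason" parameter,
    // and any NoSuchFileException-style failure is swallowed.
    static void deleteQuiet(String... files) {
        for (String f : files) {
            System.out.println("attempting to delete: " + f);
        }
    }

    public static void main(String[] args) {
        // Buggy wiring: a varargs method reference satisfies BiConsumer<String, String>,
        // so the reason string is folded into the varargs array and treated as a file
        // name ("delete unreferenced" would then hit a swallowed NoSuchFileException).
        BiConsumer<String, String> before = ReplicaFileDeleterSketch::deleteQuiet;
        before.accept("_5.si", "delete unreferenced");

        // Fixed wiring: a Consumer<String> can only ever receive real file names.
        Consumer<String> after = ReplicaFileDeleterSketch::deleteQuiet;
        after.accept("_5.si");
    }
}
```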
Related Issues
Resolves #10769
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.