Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix flaky SegRep test testScrollCreatedOnReplica #12077

Merged
merged 4 commits into from
Feb 21, 2024

Conversation

mch2
Copy link
Member

@mch2 mch2 commented Jan 29, 2024

Description

This change fixes flakiness with segrep test for scroll requests. The flakiness was with an assertion that all segments snapshotted by the scroll were deleted from disk after the scroll was cleared. The failures were caused because the snapshotted segments were still referenced by the latest on-disk commit point that is always preserved.

The test was doing a lot of crazy random merges/deletes etc that made it difficult to follow why those files were sneaking in as part of the latest commit.
To make this easier to reason about I've simplified the test flow to:

  1. Index 1 doc
  2. Create the scroll
  3. Index a few more docs
  4. Refresh to create segments
  5. Copy out segments
  6. force merge to 1 segment
  7. Copy out post force merge
  8. Assert segments still exist on-disk
  9. Clear scroll & assert segments are cleared

I've run this ~6k times without failure.

This PR also contains a bug fix to ReplicaFileDeleter to take a Consumer instead of a BiConsumer<String, String>. We use Store::deleteQuiet to purge files from disk which accepts a varargs of file names and does not have a reason parameter. This ensures we aren't passing the literal "delete unreferenced" as a file to be deleted. This would get passed to deleteQuiet and a NoSuchFileException would get swallowed.

Related Issues

Resolves #10769

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Failing checks are inspected and point to the corresponding known issue(s) (See: Troubleshooting Failing Builds)
  • Commits are signed per the DCO using --signoff
  • Commit changes are listed out in CHANGELOG.md file (See: Changelog)
  • Public documentation issue/PR created

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Copy link
Contributor

github-actions bot commented Jan 30, 2024

Compatibility status:

Checks if related components are compatible with change f9184b9

Incompatible components

Incompatible components: [https://github.com/opensearch-project/performance-analyzer.git, https://github.com/opensearch-project/performance-analyzer-rca.git, https://github.com/opensearch-project/cross-cluster-replication.git]

Skipped components

Compatible components

Compatible components: [https://github.com/opensearch-project/asynchronous-search.git, https://github.com/opensearch-project/security-analytics.git, https://github.com/opensearch-project/anomaly-detection.git, https://github.com/opensearch-project/observability.git, https://github.com/opensearch-project/reporting.git, https://github.com/opensearch-project/opensearch-oci-object-storage.git, https://github.com/opensearch-project/job-scheduler.git, https://github.com/opensearch-project/custom-codecs.git, https://github.com/opensearch-project/common-utils.git, https://github.com/opensearch-project/notifications.git, https://github.com/opensearch-project/alerting.git, https://github.com/opensearch-project/index-management.git, https://github.com/opensearch-project/ml-commons.git, https://github.com/opensearch-project/neural-search.git, https://github.com/opensearch-project/k-nn.git, https://github.com/opensearch-project/geospatial.git, https://github.com/opensearch-project/security.git, https://github.com/opensearch-project/sql.git]

@mch2
Copy link
Member Author

mch2 commented Jan 30, 2024

Grr, hit a failure here locally after about ~1k more iterations for the remote store version of this test where segments are still part of the latest local commit on the replica. I see this error in logs but i'm not sure its related... debugging:

[2024-01-30T01:03:06,153][ERROR][o.o.i.s.RemoteStoreRefreshListener] [node_t2] [test-idx-1][0] Exception in RemoteStoreRefreshListener.afterRefresh()
java.lang.AssertionError: already started
	at org.opensearch.indices.replication.common.ReplicationTimer.start(ReplicationTimer.java:51) ~[classes/:?]
	at org.opensearch.index.seqno.ReplicationTracker.lambda$startReplicationLagTimers$22(ReplicationTracker.java:1285) ~[classes/:?]
	at java.base/java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:183) ~[?:?]
	at java.base/java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:179) ~[?:?]
	at java.base/java.util.HashMap$EntrySpliterator.forEachRemaining(HashMap.java:1850) ~[?:?]
	at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:509) ~[?:?]
	at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:499) ~[?:?]
	at java.base/java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:150) ~[?:?]
	at java.base/java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:173) ~[?:?]
	at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) ~[?:?]
	at java.base/java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:596) ~[?:?]
	at org.opensearch.index.seqno.ReplicationTracker.startReplicationLagTimers(ReplicationTracker.java:1277) ~[classes/:?]
	at org.opensearch.index.shard.IndexShard.onCheckpointPublished(IndexShard.java:1936) ~[classes/:?]
	at org.opensearch.indices.replication.checkpoint.SegmentReplicationCheckpointPublisher.publish(SegmentReplicationCheckpointPublisher.java:39) ~[classes/:?]
	at org.opensearch.index.shard.RemoteStoreRefreshListener.onSuccessfulSegmentsSync(RemoteStoreRefreshListener.java:309) ~[classes/:?]
	at org.opensearch.index.shard.RemoteStoreRefreshListener$1.onResponse(RemoteStoreRefreshListener.java:229) ~[classes/:?]
	at org.opensearch.index.shard.RemoteStoreRefreshListener$1.onResponse(RemoteStoreRefreshListener.java:220) ~[classes/:?]
	at org.opensearch.action.LatchedActionListener.onResponse(LatchedActionListener.java:58) ~[classes/:?]
	at org.opensearch.index.shard.RemoteStoreRefreshListener.uploadNewSegments(RemoteStoreRefreshListener.java:378) ~[classes/:?]
	at org.opensearch.index.shard.RemoteStoreRefreshListener.syncSegments(RemoteStoreRefreshListener.java:254) [classes/:?]
	at org.opensearch.index.shard.RemoteStoreRefreshListener.performAfterRefreshWithPermit(RemoteStoreRefreshListener.java:152) [classes/:?]
	at org.opensearch.index.shard.ReleasableRetryableRefreshListener.runAfterRefreshWithPermit(ReleasableRetryableRefreshListener.java:160) [classes/:?]
	at org.opensearch.index.shard.ReleasableRetryableRefreshListener.afterRefresh(ReleasableRetryableRefreshListener.java:66) [classes/:?]
	at org.apache.lucene.search.ReferenceManager.notifyRefreshListenersRefreshed(ReferenceManager.java:275) [lucene-core-9.9.2.jar:9.9.2 a2939784c4ca60bc28bf488b5479c02fc2e5e22c - 2024-01-25 09:51:09]
	at org.apache.lucene.search.ReferenceManager.doMaybeRefresh(ReferenceManager.java:182) [lucene-core-9.9.2.jar:9.9.2 a2939784c4ca60bc28bf488b5479c02fc2e5e22c - 2024-01-25 09:51:09]
	at org.apache.lucene.search.ReferenceManager.maybeRefreshBlocking(ReferenceManager.java:240) [lucene-core-9.9.2.jar:9.9.2 a2939784c4ca60bc28bf488b5479c02fc2e5e22c - 2024-01-25 09:51:09]
	at org.opensearch.index.engine.InternalEngine.refresh(InternalEngine.java:1771) [classes/:?]
	at org.opensearch.index.engine.InternalEngine.flush(InternalEngine.java:1886) [classes/:?]
	at org.opensearch.index.engine.InternalEngine.forceMerge(InternalEngine.java:2025) [classes/:?]
	at org.opensearch.index.shard.IndexShard.forceMerge(IndexShard.java:1521) [classes/:?]
	at org.opensearch.action.admin.indices.forcemerge.TransportForceMergeAction.shardOperation(TransportForceMergeAction.java:113) [classes/:?]
	at org.opensearch.action.admin.indices.forcemerge.TransportForceMergeAction.shardOperation(TransportForceMergeAction.java:60) [classes/:?]
	at org.opensearch.action.support.broadcast.node.TransportBroadcastByNodeAction$BroadcastByNodeTransportRequestHandler.onShardOperation(TransportBroadcastByNodeAction.java:495) [classes/:?]
	at org.opensearch.action.support.broadcast.node.TransportBroadcastByNodeAction$BroadcastByNodeTransportRequestHandler.messageReceived(TransportBroadcastByNodeAction.java:469) [classes/:?]
	at org.opensearch.action.support.broadcast.node.TransportBroadcastByNodeAction$BroadcastByNodeTransportRequestHandler.messageReceived(TransportBroadcastByNodeAction.java:456) [classes/:?]
	at org.opensearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:106) [classes/:?]
	at org.opensearch.transport.InboundHandler$RequestHandler.doRun(InboundHandler.java:480) [classes/:?]
	at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:913) [classes/:?]
	at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) [classes/:?]
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
	at java.base/java.lang.Thread.run(Thread.java:840) [?:?]

Copy link
Contributor

✅ Gradle check result for a12eee1: SUCCESS

@mch2
Copy link
Member Author

mch2 commented Jan 30, 2024

Don't think that trace is the cause here. The issue is the replica is not making a new local commit after receiving the new segments. I think this may be that we need a force flush. Going to leave this running overnight to confirm.

@mch2
Copy link
Member Author

mch2 commented Jan 30, 2024

Ok the issue here is the refresh upon force merge is refreshing on the newly created merged into segment but the commit has not yet been made.

With some added logs in InternalEngine ---

[2024-01-30T16:16:19,237][INFO ][o.o.i.e.Engine           ] [node_t2] [test-idx-1][0] After refresh Segments gen: 4 Segments version: 23 Files: [_5.kdm, _5.fdt, _5_Lucene90_0.dvm, _5_Lucene99_0.tmd, _5.kdi, _5.fdm, _5.kdd, _5.fnm, _5_Lucene90_0.dvd, _5_Lucene99_0.tip, _5_Lucene99_0.tim, _5_Lucene99_0.doc, _5.si, _5.fdx]

Segment 5 is our newly force merge'd down to 1 segment, but commit gen is still 4. A competing refresh is running after the new segments are created but the flush has not finished. The replica will see commit gen is not yet updated so it has no reason to commit locally.

The fix here is to disable scheduled refresh. We can also repro this more consistently by reducing refresh interval to 1ms.

Signed-off-by: Marc Handalian <[email protected]>
Copy link
Contributor

❌ Gradle check result for d5002a2:

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@mch2
Copy link
Member Author

mch2 commented Jan 30, 2024

❌ Gradle check result for d5002a2:

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

#9191
#10755

Copy link
Contributor

❌ Gradle check result for 8200c98: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Copy link
Contributor

❕ Gradle check result for d5002a2: UNSTABLE

  • TEST FAILURES:
      1 org.opensearch.search.SearchWeightedRoutingIT.testMultiGetWithNetworkDisruption_FailOpenEnabled
      1 org.opensearch.remotestore.RemoteIndexPrimaryRelocationIT.testPrimaryRelocationWhileIndexing

Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.

Signed-off-by: Marc Handalian <[email protected]>
@mch2
Copy link
Member Author

mch2 commented Jan 31, 2024

❕ Gradle check result for d5002a2: UNSTABLE

  • TEST FAILURES:
      1 org.opensearch.search.SearchWeightedRoutingIT.testMultiGetWithNetworkDisruption_FailOpenEnabled
      1 org.opensearch.remotestore.RemoteIndexPrimaryRelocationIT.testPrimaryRelocationWhileIndexing

Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.

#10755
#9191

Copy link
Contributor

❌ Gradle check result for f9184b9: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@mch2
Copy link
Member Author

mch2 commented Jan 31, 2024

❌ Gradle check result for f9184b9: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

This one is a bit odd - I don't think this is related to this PR but I don't think its test specific flakiness - cut #12114 with more

Copy link
Contributor

✅ Gradle check result for f9184b9: SUCCESS

@mch2 mch2 merged commit 247e2ee into opensearch-project:main Feb 21, 2024
28 of 29 checks passed
@mch2 mch2 added the backport 2.x Backport to 2.x branch label Feb 21, 2024
opensearch-trigger-bot bot pushed a commit that referenced this pull request Feb 21, 2024
* Fix flaky test testScrollCreatedOnReplica

Signed-off-by: Marc Handalian <[email protected]>

* Disable scheduled refresh

Signed-off-by: Marc Handalian <[email protected]>

* Clean up segment collection assertions

Signed-off-by: Marc Handalian <[email protected]>

* Fix spotless

Signed-off-by: Marc Handalian <[email protected]>

---------

Signed-off-by: Marc Handalian <[email protected]>
(cherry picked from commit 247e2ee)
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
kotwanikunal pushed a commit that referenced this pull request Feb 21, 2024
* Fix flaky test testScrollCreatedOnReplica



* Disable scheduled refresh



* Clean up segment collection assertions



* Fix spotless



---------


(cherry picked from commit 247e2ee)

Signed-off-by: Marc Handalian <[email protected]>
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
peteralfonsi pushed a commit to peteralfonsi/OpenSearch that referenced this pull request Mar 1, 2024
…12077)

* Fix flaky test testScrollCreatedOnReplica

Signed-off-by: Marc Handalian <[email protected]>

* Disable scheduled refresh

Signed-off-by: Marc Handalian <[email protected]>

* Clean up segment collection assertions

Signed-off-by: Marc Handalian <[email protected]>

* Fix spotless

Signed-off-by: Marc Handalian <[email protected]>

---------

Signed-off-by: Marc Handalian <[email protected]>
amkhar pushed a commit to amkhar/OpenSearch that referenced this pull request Mar 12, 2024
…12077)

* Fix flaky test testScrollCreatedOnReplica

Signed-off-by: Marc Handalian <[email protected]>

* Disable scheduled refresh

Signed-off-by: Marc Handalian <[email protected]>

* Clean up segment collection assertions

Signed-off-by: Marc Handalian <[email protected]>

* Fix spotless

Signed-off-by: Marc Handalian <[email protected]>

---------

Signed-off-by: Marc Handalian <[email protected]>
Signed-off-by: Aman Khare <[email protected]>
rayshrey pushed a commit to rayshrey/OpenSearch that referenced this pull request Mar 18, 2024
…12077)

* Fix flaky test testScrollCreatedOnReplica

Signed-off-by: Marc Handalian <[email protected]>

* Disable scheduled refresh

Signed-off-by: Marc Handalian <[email protected]>

* Clean up segment collection assertions

Signed-off-by: Marc Handalian <[email protected]>

* Fix spotless

Signed-off-by: Marc Handalian <[email protected]>

---------

Signed-off-by: Marc Handalian <[email protected]>
shiv0408 pushed a commit to Gaurav614/OpenSearch that referenced this pull request Apr 25, 2024
…12077)

* Fix flaky test testScrollCreatedOnReplica

Signed-off-by: Marc Handalian <[email protected]>

* Disable scheduled refresh

Signed-off-by: Marc Handalian <[email protected]>

* Clean up segment collection assertions

Signed-off-by: Marc Handalian <[email protected]>

* Fix spotless

Signed-off-by: Marc Handalian <[email protected]>

---------

Signed-off-by: Marc Handalian <[email protected]>
Signed-off-by: Shivansh Arora <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport 2.x Backport to 2.x branch bug Something isn't working flaky-test Random test failure that succeeds on second run skip-changelog
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] SegmentReplicationIT.testScrollCreatedOnReplica is flaky
3 participants