[BUG] [Segment Replication] testRelocateWhileContinuouslyIndexingAndWaitingForRefresh flaky test failure #6778

Open
dreamer-89 opened this issue Mar 21, 2023 · 4 comments · Fixed by #7053
Labels
bug Something isn't working flaky-test Random test failure that succeeds on second run Indexing:Replication Issues and PRs related to core replication framework eg segrep

Comments

@dreamer-89
Member

Flaky test failure testRelocateWhileContinuouslyIndexingAndWaitingForRefresh

REPRODUCE WITH: ./gradlew ':server:internalClusterTest' --tests "org.opensearch.indices.replication.SegmentReplicationRelocationIT.testRelocateWhileContinuouslyIndexingAndWaitingForRefresh" -Dtests.seed=9BB65F799209C57E -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=ar-LY -Dtests.timezone=Europe/Volgograd -Druntime.java=19

org.opensearch.indices.replication.SegmentReplicationRelocationIT > testRelocateWhileContinuouslyIndexingAndWaitingForRefresh FAILED
    java.lang.AssertionError: Expected search hits on node: node_t1 to be at least 1000 but was: 283
        at org.junit.Assert.fail(Assert.java:89)
        at org.opensearch.indices.replication.SegmentReplicationBaseIT.lambda$waitForSearchableDocs$0(SegmentReplicationBaseIT.java:132)
        at org.opensearch.test.OpenSearchTestCase.assertBusy(OpenSearchTestCase.java:1060)
        at org.opensearch.indices.replication.SegmentReplicationBaseIT.waitForSearchableDocs(SegmentReplicationBaseIT.java:127)
        at org.opensearch.indices.replication.SegmentReplicationBaseIT.waitForSearchableDocs(SegmentReplicationBaseIT.java:122)
        at org.opensearch.indices.replication.SegmentReplicationBaseIT.waitForSearchableDocs(SegmentReplicationBaseIT.java:139)
        at org.opensearch.indices.replication.SegmentReplicationRelocationIT.testRelocateWhileContinuouslyIndexingAndWaitingForRefresh(SegmentReplicationRelocationIT.java:282)

    com.carrotsearch.randomizedtesting.UncaughtExceptionError: Captured an uncaught exception in thread: Thread[id=360, name=opensearch[node_t0][generic][T#3], state=RUNNABLE, group=TGRP-SegmentReplicationRelocationIT]

        Caused by:
        java.lang.AssertionError: shard [test-idx-1][0], node[Kfsg4NrXRZKc4SnPRbeX7g], relocating [hi-iMUoeTC-Qb5Lw0uuWwg], [P], s[RELOCATING], a[id=z00v6GiARJ2pmOa0xIFFgg, rId=Mh8_S8ChSOq5fxRJAWVdNw], expected_shard_size[230] is not a primary shard in primary mode
            at __randomizedtesting.SeedInfo.seed([9BB65F799209C57E]:0)
            at org.opensearch.index.shard.IndexShard.assertPrimaryMode(IndexShard.java:2340)
            at org.opensearch.index.shard.IndexShard.getReplicationGroup(IndexShard.java:3042)
            at org.opensearch.indices.replication.SegmentReplicationSourceHandler.sendFiles(SegmentReplicationSourceHandler.java:140)
            at org.opensearch.indices.replication.OngoingSegmentReplications.startSegmentCopy(OngoingSegmentReplications.java:130)
            at org.opensearch.indices.replication.SegmentReplicationSourceService$GetSegmentFilesRequestHandler.messageReceived(SegmentReplicationSourceService.java:157)
            at org.opensearch.indices.replication.SegmentReplicationSourceService$GetSegmentFilesRequestHandler.messageReceived(SegmentReplicationSourceService.java:154)
            at org.opensearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:106)
            at org.opensearch.transport.InboundHandler$RequestHandler.doRun(InboundHandler.java:453)
            at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:806)
            at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52)
            at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
            at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
            at java.base/java.lang.Thread.run(Thread.java:1589)

Gradle check : https://build.ci.opensearch.org/job/gradle-check/12741

@dreamer-89
Member Author

dreamer-89 commented Apr 7, 2023

I ran this test 1000 times with the failing seed above and 500 times without a seed; no run failed. Although the failure is not easy to reproduce, the test can still fail. Added a check for cancellation via #7053.
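
For readers unfamiliar with the pattern, here is a minimal sketch of what a cancellation check before a file copy could look like. It is not the code from #7053: `SourceHandler`, `cancel()`, and the `primaryMode` supplier are hypothetical stand-ins, and the sketch assumes the goal is to stop a cancelled copy before it reaches the primary-mode check that appears in the stack trace above.

```java
import java.util.concurrent.CancellationException;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.BooleanSupplier;

public class CancellationCheckSketch {

    /** Hypothetical stand-in for a source-side replication handler. */
    static final class SourceHandler {
        private final AtomicBoolean cancelled = new AtomicBoolean(false);
        // Assumption: reports whether the shard is still a primary in primary mode.
        private final BooleanSupplier primaryMode;

        SourceHandler(BooleanSupplier primaryMode) {
            this.primaryMode = primaryMode;
        }

        /** Marks the copy as cancelled, e.g. when the primary starts relocating. */
        void cancel() {
            cancelled.set(true);
        }

        String sendFiles() {
            // Check cancellation first, so a cancelled copy never reaches
            // the primary-mode check below.
            if (cancelled.get()) {
                throw new CancellationException("replication cancelled before file copy");
            }
            if (!primaryMode.getAsBoolean()) {
                throw new IllegalStateException("not a primary shard in primary mode");
            }
            return "files sent";
        }
    }

    public static void main(String[] args) {
        // The shard has already left primary mode, but the copy was cancelled first.
        SourceHandler handler = new SourceHandler(() -> false);
        handler.cancel();
        try {
            handler.sendFiles();
        } catch (CancellationException e) {
            System.out.println("skipped: " + e.getMessage());
        }
    }
}
```

In this sketch the caller sees a `CancellationException` it can handle, rather than a failed primary-mode check surfacing on a worker thread.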

@reta
Collaborator

reta commented Dec 13, 2023

The issue is not fixed: https://build.ci.opensearch.org/job/gradle-check/31146/

@dreamer-89
Member Author

Looking into it.

@dreamer-89
Member Author

I ran this test locally with and without the failing seed but could not reproduce the failure. I also ran a script to find build failures caused by flaky tests; the results show that this test failed in two builds:

2 org.opensearch.indices.replication.SegmentReplicationRelocationIT.testRelocateWhileContinuouslyIndexingAndWaitingForRefresh (30438,31146)
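
The script itself is not included in this thread. As a rough sketch of how such a tally could be produced, assuming the failed test names have already been collected per build number (the `failuresByBuild` map and the `tally` helper are hypothetical), something like this prints one line per test in the format shown above:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

public class FlakyTestTally {

    /** Groups failures by test name and formats "count name (build,build,...)". */
    static List<String> tally(Map<Integer, List<String>> failuresByBuild) {
        Map<String, List<Integer>> buildsByTest = new TreeMap<>();
        // Copy into a TreeMap so builds are visited in ascending order.
        for (Map.Entry<Integer, List<String>> entry : new TreeMap<>(failuresByBuild).entrySet()) {
            for (String test : entry.getValue()) {
                buildsByTest.computeIfAbsent(test, k -> new ArrayList<>()).add(entry.getKey());
            }
        }
        List<String> lines = new ArrayList<>();
        for (Map.Entry<String, List<Integer>> entry : buildsByTest.entrySet()) {
            String builds = entry.getValue().stream()
                .map(String::valueOf)
                .collect(Collectors.joining(","));
            lines.add(entry.getValue().size() + " " + entry.getKey() + " (" + builds + ")");
        }
        return lines;
    }

    public static void main(String[] args) {
        String test = "org.opensearch.indices.replication.SegmentReplicationRelocationIT"
            + ".testRelocateWhileContinuouslyIndexingAndWaitingForRefresh";
        // Build numbers taken from the result line above.
        Map<Integer, List<String>> failuresByBuild = Map.of(
            31146, List.of(test),
            30438, List.of(test)
        );
        tally(failuresByBuild).forEach(System.out::println);
    }
}
```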

Projects
Status: In Progress

6 participants