Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Fix flaky org.opensearch.search.PitMultiNodeTests.testCreatePitWhileNodeDropWithAllowPartialCreationFalse test #4259

Open
reta opened this issue Aug 18, 2022 · 5 comments · Fixed by #5655
Labels
bug Something isn't working flaky-test Random test failure that succeeds on second run

Comments

@reta
Copy link
Collaborator

reta commented Aug 18, 2022

Describe the bug
Another flaky test, introduced by https://github.com/opensearch-project/OpenSearch/pull/3921/files

@Bukhtawar could you please take a look?

To Reproduce

REPRODUCE WITH: ./gradlew ':server:test' --tests "org.opensearch.search.PitMultiNodeTests.testCreatePitWhileNodeDropWithAllowPartialCreationFalse" -Dtests.seed=96F3F0BE216E9128 -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=zh-Hant-TW -Dtests.timezone=America/Paramaribo -Druntime.java=17

Expected behavior
Test must always pass

Plugins
Standard

Screenshots

<h3>Error Message</h3><pre>java.lang.AssertionError</pre><h3>Stacktrace</h3><pre>java.lang.AssertionError
	at __randomizedtesting.SeedInfo.seed([96F3F0BE216E9128:D64BFD8616D1599]:0)
	at org.junit.Assert.fail(Assert.java:87)
	at org.junit.Assert.assertTrue(Assert.java:42)
	at org.junit.Assert.assertTrue(Assert.java:53)
	at org.opensearch.search.PitMultiNodeTests$1.onNodeStopped(PitMultiNodeTests.java:99)
	at org.opensearch.test.InternalTestCluster$NodeAndClient.closeForRestart(InternalTestCluster.java:1028)
	at org.opensearch.test.InternalTestCluster.restartNode(InternalTestCluster.java:1931)
	at org.opensearch.test.InternalTestCluster.restartRandomDataNode(InternalTestCluster.java:1879)
	at org.opensearch.search.PitMultiNodeTests.testCreatePitWhileNodeDropWithAllowPartialCreationFalse(PitMultiNodeTests.java:93)</pre>

Host/Environment (please complete the following information):

  • OS: CI
  • Version CI

Additional context
See please https://build.ci.opensearch.org/job/gradle-check/1902/testReport/junit/org.opensearch.search/PitMultiNodeTests/testCreatePitWhileNodeDropWithAllowPartialCreationFalse/

@bharath-techie
Copy link
Contributor

bharath-techie commented Nov 23, 2022

https://github.com/opensearch-project/OpenSearch/pull/4632/files -- this has the fix and it was merged already. I was not able to reproduce the issue with 500 iterationss in local.

Post 2.x I don't see this issue being tagged as well. We can close the issue if this issue is not seen in the latest builds.

@dbwiddis
Copy link
Member

NOTE: The fix in #4632 was for the initial report in #4089, not for this issue. This bug still exists and is referenced in comments starting here

@dbwiddis
Copy link
Member

Copying a summary of comment over from #4089 where several repro logs and other comments were made. See that issue for background.

Have spent several hours digging into this issue.

For the create failure; it's obviously a race condition and close to a Heisenbug. While regularly reproducible at least once in a run of size 500, adding logging extended successful runs with a failure on run 2463.

Commonalities when it fails:

  • the dropped node was always the cluster manager
  • there is always an uncommitted operation in the translog

Tracing through the debug logs I added, I'm at the "how did this ever work" phase of debugging. The boolean allowPartialSearchResults is never referred to in the create workflow called as part of this. In particular, the PIT creation failure occurs in this block of code:

void executeCreatePit(Task task, SearchRequest searchRequest, StepListener<SearchResponse> createPitListener) {
logger.debug(
() -> new ParameterizedMessage("Executing creation of PIT context for indices [{}]", Arrays.toString(searchRequest.indices()))
);
transportSearchAction.executeRequest(
task,
searchRequest,
TransportCreatePitAction.CREATE_PIT_ACTION,
true,
new TransportSearchAction.SinglePhaseSearchAction() {
@Override
public void executeOnShardTarget(
SearchTask searchTask,
SearchShardTarget target,
Transport.Connection connection,
ActionListener<SearchPhaseResult> searchPhaseResultActionListener
) {
searchTransportService.createPitContext(
connection,
new TransportCreatePitAction.CreateReaderContextRequest(
target.getShardId(),
PIT_INIT_KEEP_ALIVE.get(clusterService.getSettings())
),
searchTask,
ActionListener.wrap(r -> searchPhaseResultActionListener.onResponse(r), searchPhaseResultActionListener::onFailure)
);
}
},
createPitListener
);
}

The getter searchRequest.allowPartialSearchResults() is never referred to here; call hierarchy indicates it's called on multi phase requests at the beginning of AbstractSearchAsyncAction.executeNextPhase() but there is no reference to it in SinglePhaseSearchAction which is called here. The SearchRequest variable request isn't even passed to the constructor, just the search task.

Looks like this code was implemented in #3921. @bharath-techie do you have any observations/comments that can further help debug this?

@dbwiddis
Copy link
Member

Digging into this, the original commit #3291 did not cause this failure; the unneeded test was added by #4030. I'll back out the line causing the failures.

@reta
Copy link
Collaborator Author

reta commented Mar 12, 2024

java.lang.AssertionError: expected:<0> but was:<1>
	at __randomizedtesting.SeedInfo.seed([561A7A0B4C9B2562:CD8D356D0C98A1D3]:0)
	at org.junit.Assert.fail(Assert.java:89)
	at org.junit.Assert.failNotEquals(Assert.java:835)
	at org.junit.Assert.assertEquals(Assert.java:647)
	at org.junit.Assert.assertEquals(Assert.java:633)
	at org.opensearch.search.pit.PitMultiNodeIT.validatePitStats(PitMultiNodeIT.java:363)
	at org.opensearch.search.pit.PitMultiNodeIT$1.onNodeStopped(PitMultiNodeIT.java:118)
	at org.opensearch.test.InternalTestCluster$NodeAndClient.closeForRestart(InternalTestCluster.java:1072)
	at org.opensearch.test.InternalTestCluster.restartNode(InternalTestCluster.java:2032)
	at org.opensearch.test.InternalTestCluster.restartRandomDataNode(InternalTestCluster.java:1980)
	at org.opensearch.search.pit.PitMultiNodeIT.testCreatePitWhileNodeDropWithAllowPartialCreationFalse(PitMultiNodeIT.java:112)
	at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103)
	at java.base/java.lang.reflect.Method.invoke(Method.java:580)
	at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1750)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:938)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:974)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:988)
	at org.opensearch.test.OpenSearchTestClusterRule$1.evaluate(OpenSearchTestClusterRule.java:369)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at org.junit.rules.RunRules.evaluate(RunRules.java:20)
	at org.apache.lucene.tests.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:48)
	at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
	at org.apache.lucene.tests.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:45)
	at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
	at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
	at org.junit.rules.RunRules.evaluate(RunRules.java:20)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:368)
	at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:817)
	at com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:468)
	at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:947)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:832)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:883)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:894)
	at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at org.apache.lucene.tests.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:38)
	at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
	at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at org.apache.lucene.tests.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
	at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
	at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
	at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
	at org.apache.lucene.tests.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:47)
	at org.junit.rules.RunRules.evaluate(RunRules.java:20)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:368)
	at java.base/java.lang.Thread.run(Thread.java:1583)

The issue is back:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working flaky-test Random test failure that succeeds on second run
Projects
None yet
5 participants