[BUG] RemoteClusterClientTests.testConnectAndExecuteRequest is flaky #12338

Open
peternied opened this issue Feb 15, 2024 · 16 comments
Labels
bug (Something isn't working), Cluster Manager, flaky-test (Random test failure that succeeds on second run), :test (Adding or fixing a test)

Comments

@peternied
Member

Describe the bug

org.opensearch.transport.RemoteClusterClientTests.testConnectAndExecuteRequest seems to be able to get network exceptions during this workflow

2> junit.framework.AssertionFailedError: Unexpected exception type, expected ActionNotFoundTransportException but got NodeDisconnectedException[[remote_node][127.0.0.1:10600][indices:data/read/scroll] disconnected]
      at org.apache.lucene.tests.util.LuceneTestCase.expectThrows(LuceneTestCase.java:2894)
      at org.apache.lucene.tests.util.LuceneTestCase.expectThrows(LuceneTestCase.java:2875)
      at org.opensearch.transport.RemoteClusterClientTests.testConnectAndExecuteRequest(RemoteClusterClientTests.java:103)

      Caused by:
      NodeDisconnectedException[[remote_node][127.0.0.1:10600][indices:data/read/scroll] disconnected]

  java.lang.AssertionError
      at __randomizedtesting.SeedInfo.seed([C42E5842F94566E0]:0)
      at org.opensearch.transport.InboundMessage.openOrGetStreamInput(InboundMessage.java:116)
      at org.opensearch.transport.TransportLogger.format(TransportLogger.java:150)
      at org.opensearch.transport.TransportLogger.logInboundMessage(TransportLogger.java:70)
      at org.opensearch.transport.InboundHandler.inboundMessage(InboundHandler.java:123)
      at org.opensearch.transport.TcpTransport.inboundMessage(TcpTransport.java:770)
      at org.opensearch.transport.InboundPipeline.forwardFragments(InboundPipeline.java:175)
      at org.opensearch.transport.InboundPipeline.doHandleBytes(InboundPipeline.java:150)
      at org.opensearch.transport.InboundPipeline.handleBytes(InboundPipeline.java:115)
      at org.opensearch.transport.nio.MockNioTransport$MockTcpReadWriteHandler.consumeReads(MockNioTransport.java:343)
      at org.opensearch.nio.SocketChannelContext.handleReadBytes(SocketChannelContext.java:246)
      at org.opensearch.nio.BytesChannelContext.read(BytesChannelContext.java:59)
      at org.opensearch.nio.EventHandler.handleRead(EventHandler.java:152)
      at org.opensearch.transport.nio.TestEventHandler.handleRead(TestEventHandler.java:167)
      at org.opensearch.nio.NioSelector.handleRead(NioSelector.java:438)
      at org.opensearch.nio.NioSelector.processKey(NioSelector.java:264)
      at org.opensearch.nio.NioSelector.singleLoop(NioSelector.java:191)
      at org.opensearch.nio.NioSelector.runLoop(NioSelector.java:148)
      at java.base/java.lang.Thread.run(Thread.java:1583)

Related component

Storage:Remote

To Reproduce

The initial failure occurred on a developer desktop; I was not able to reproduce it on rerun.

REPRODUCE WITH: ./gradlew ':server:test' --tests "org.opensearch.transport.RemoteClusterClientTests.testConnectAndExecuteRequest" -Dtests.seed=C42E5842F94566E0 -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=nl-NL -Dtests.timezone=Pacific/Wallis -Druntime.java=21

Expected behavior

All tests should pass reliably

Additional Details

Host/Environment (please complete the following information):

% uname -a
Linux dev-dsk-petern 5.10.209-175.812.amzn2int.x86_64 #1 SMP Tue Jan 30 21:29:45 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
@linuxpi
Collaborator

linuxpi commented Feb 19, 2024

This test is not related to RemoteStore. It seems related to cross-cluster functionality; can we create a label for cross cluster?

@peternied
Member Author

[Triage - attendees 1 2 3 4 5]
@peternied Thanks for filing; consider making a pull request to resolve this issue.

@andrross
Member

andrross commented Feb 23, 2024

@peternied I've got some circumstantial evidence from local testing that suggests #11957 may have introduced this flakiness. Can you take a look?

@peternied
Member Author

1000 iterations, couldn't reproduce locally.

./gradlew ':server:test' --tests "org.opensearch.transport.RemoteClusterClientTests.testConnectAndExecuteRequest" -Dtests.seed=C42E5842F94566E0 -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=nl-NL -Dtests.timezone=Pacific/Wallis -Druntime.java=21 -Dtests.iters=1000

Reviewing how NodeDisconnectedException can be thrown, it seems it is only thrown from events triggered in ClusterConnectionManager.internalOpenConnection, where an exception occurred after the connection was opened and the finally clause was then hit. Since this issue doesn't have any more logs (doh!), there really isn't much to go on.

private void internalOpenConnection(
    DiscoveryNode node,
    ConnectionProfile connectionProfile,
    ActionListener<Transport.Connection> listener
) {
    transport.openConnection(node, connectionProfile, ActionListener.map(listener, connection -> {
        assert Transports.assertNotTransportThread("internalOpenConnection success");
        try {
            connectionListener.onConnectionOpened(connection);
        } finally {
            connection.addCloseListener(ActionListener.wrap(() -> connectionListener.onConnectionClosed(connection)));
        }
        if (connection.isClosed()) {
            throw new ConnectTransportException(node, "a channel closed while connecting");
        }
        return connection;
    }));
}

@andrross Did you have theories on how that PR impacted this test case? Otherwise I'm leaning towards closing this as 'not reproducible', and we can always reopen if it is rediscovered.
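
The ordering described above can be illustrated with a toy model. This is only a sketch under stated assumptions, not OpenSearch code and not a confirmed root cause: `ToyConnection`, `ToyDisconnectedException`, and `observedFailure` are hypothetical names, and the "action not found" reply is stood in for by an `IllegalStateException`. It shows how a close listener that fails pending requests can win the race against the reply the test expects.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;

public class CloseListenerSketch {
    /** Hypothetical stand-in for a transport connection; not an OpenSearch class. */
    static class ToyConnection {
        private final List<Runnable> closeListeners = new ArrayList<>();
        private boolean closed;

        void addCloseListener(Runnable listener) {
            if (closed) {
                listener.run(); // already closed: notify immediately
            } else {
                closeListeners.add(listener);
            }
        }

        void close() {
            if (closed) {
                return;
            }
            closed = true;
            closeListeners.forEach(Runnable::run);
        }
    }

    /** Hypothetical stand-in for NodeDisconnectedException. */
    static class ToyDisconnectedException extends RuntimeException {
        ToyDisconnectedException(String message) {
            super(message);
        }
    }

    /** Returns the simple class name of the failure the caller ends up seeing. */
    public static String observedFailure(boolean closeBeforeResponse) {
        ToyConnection connection = new ToyConnection();
        CompletableFuture<String> pending = new CompletableFuture<>();
        // Assumption: closing the connection fails any request still in flight.
        connection.addCloseListener(
            () -> pending.completeExceptionally(new ToyDisconnectedException("disconnected")));
        if (closeBeforeResponse) {
            connection.close();
        }
        // The remote reply arrives afterwards; it is ignored if the request already failed.
        pending.completeExceptionally(new IllegalStateException("action not found"));
        try {
            pending.get();
            return "none";
        } catch (ExecutionException e) {
            return e.getCause().getClass().getSimpleName();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return "interrupted";
        }
    }

    public static void main(String[] args) {
        System.out.println(observedFailure(false));
        System.out.println(observedFailure(true));
    }
}
```

In this model the caller sees the expected failure type only when the reply arrives before the connection closes, which is consistent with a failure that depends on timing and is hard to reproduce on an idle machine.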

@andrross
Member

@peternied No theories, sorry! @kotwanikunal was doing some testing with a local Jenkins instance running ./gradlew check continuously to get nice reports on test failures over time, and this failure popped up in his testing correlated with the views commit. That's all I got :(

@github-project-automation github-project-automation bot moved this from 🆕 New to ✅ Done in Cluster Manager Project Board Feb 23, 2024
@github-project-automation github-project-automation bot moved this from 🆕 New to ✅ Done in Storage Project Board Feb 23, 2024
@ajaymovva
Contributor

ajaymovva commented Mar 16, 2024

Seeing this test fail: https://build.ci.opensearch.org/job/gradle-check/35068/

@ajaymovva ajaymovva reopened this Mar 16, 2024
@github-project-automation github-project-automation bot moved this from ✅ Done to 🏗 In progress in Storage Project Board Mar 16, 2024
@github-project-automation github-project-automation bot moved this from ✅ Done to 🏗 In progress in Cluster Manager Project Board Mar 16, 2024
@ajaymovva
Contributor

ajaymovva commented Mar 16, 2024

@sachinpkale
Member

Muting the test until we provide the fix.

@sachinpkale
Member

Already muted as part of #12720

@skumawat2025
Contributor

This test is not related to RemoteStore. It is related to cross cluster.

@skumawat2025 skumawat2025 removed their assignment May 15, 2024
@mohitamg

Ran the following command

 ./gradlew ':server:test' --tests "org.opensearch.transport.RemoteClusterClientTests.testConnectAndExecuteRequest" -Dtests.seed=C42E5842F94566E0 -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=nl-NL -Dtests.timezone=Asia/Kolkata -Druntime.java=21 -Dtests.iters=100000

> Configure project :
========================= WARNING =========================
         Backwards compatibility tests are disabled!
See https://github.com/opensearch-project/OpenSearch/issues/4173
===========================================================
=======================================
OpenSearch Build Hamster says Hello!
  Gradle Version        : 8.7
  OS Info               : Mac OS X 14.5 (aarch64)
  Runtime JDK Version   : 21 (Amazon Corretto JDK)
  Runtime java.home     : /Library/Java/JavaVirtualMachines/amazon-corretto-21.jdk/Contents/Home
  Gradle JDK Version    : 21 (Amazon Corretto JDK)
  Gradle java.home      : /Library/Java/JavaVirtualMachines/amazon-corretto-21.jdk/Contents/Home
  Random Testing Seed   : C42E5842F94566E0
  In FIPS 140 mode      : false
=======================================

> Task :server:test
WARNING: Using incubator modules: jdk.incubator.vector
May 31, 2024 2:10:57 PM sun.util.locale.provider.LocaleProviderAdapter <clinit>
WARNING: COMPAT locale provider will be removed in a future release

WARNING: A terminally deprecated method in java.lang.System has been called
WARNING: System::setSecurityManager has been called by org.opensearch.bootstrap.BootstrapForTesting (file:/Users/mohitamg/Documents/GitHub/OpenSearch/test/framework/build/distributions/framework-3.0.0-SNAPSHOT.jar)
WARNING: Please consider reporting this to the maintainers of org.opensearch.bootstrap.BootstrapForTesting
WARNING: System::setSecurityManager will be removed in a future release
WARNING: A terminally deprecated method in java.lang.System has been called
WARNING: System::setSecurityManager has been called by org.gradle.api.internal.tasks.testing.worker.TestWorker (file:/Users/mohitamg/.gradle/wrapper/dists/gradle-8.7-all/aan3ydargesu18aqyqjwhr3pc/gradle-8.7/lib/plugins/gradle-testing-base-8.7.jar)
WARNING: Please consider reporting this to the maintainers of org.gradle.api.internal.tasks.testing.worker.TestWorker
WARNING: System::setSecurityManager will be removed in a future release

BUILD SUCCESSFUL in 20s
54 actionable tasks: 1 executed, 53 up-to-date
WARNING: The following functionality has been deprecated and will be removed in the next major release of the Develocity Gradle plugin. Run with '-Ddevelocity.deprecation.captureOrigin=true' to see where the deprecated functionality is being used. For assistance with migration, see https://gradle.com/help/gradle-plugin-develocity-migration.
- The deprecated "gradle.enterprise.testretry.enabled" system property has been replaced by "develocity.testretry.enabled"
- The "com.gradle.enterprise" plugin has been replaced by "com.gradle.develocity"

I also ran the above command with different iteration counts (1000, 10000, and 100000), and it passed every time.

@andrross
Member

@mohitamg Did you remove the @AwaitsFix annotation that mutes the test? I would expect it to take more than 20 seconds to run 100k iterations if the test isn't muted.

@mohitamg

mohitamg commented Jun 3, 2024

I didn't, @andrross. Should I remove it and then run?

@mohitamg

mohitamg commented Jun 3, 2024

Commented out the annotation
// @AwaitsFix(bugUrl = "#12338")
and ran 10k iterations; the run succeeded.

./gradlew ':server:test' --tests "org.opensearch.transport.RemoteClusterClientTests.testConnectAndExecuteRequest" -Dtests.seed=C42E5842F94566E0 -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=nl-NL -Dtests.timezone=Asia/Kolkata -Druntime.java=21 -Dtests.iters=10000
Starting a Gradle Daemon, 3 stopped Daemons could not be reused, use --status for details

> Configure project :
========================= WARNING =========================
         Backwards compatibility tests are disabled!
See https://github.com/opensearch-project/OpenSearch/issues/4173
===========================================================
=======================================
OpenSearch Build Hamster says Hello!
  Gradle Version        : 8.7
  OS Info               : Mac OS X 14.5 (aarch64)
  Runtime JDK Version   : 21 (Amazon Corretto JDK)
  Runtime java.home     : /Library/Java/JavaVirtualMachines/amazon-corretto-21.jdk/Contents/Home
  Gradle JDK Version    : 21 (Amazon Corretto JDK)
  Gradle java.home      : /Library/Java/JavaVirtualMachines/amazon-corretto-21.jdk/Contents/Home
  Random Testing Seed   : C42E5842F94566E0
  In FIPS 140 mode      : false
=======================================

> Task :server:compileTestJava
Note: /Users/mohitamg/Documents/GitHub/OpenSearch/server/src/test/java/org/opensearch/transport/RemoteClusterClientTests.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.

> Task :server:test
WARNING: Using incubator modules: jdk.incubator.vector
Jun 03, 2024 12:38:51 PM sun.util.locale.provider.LocaleProviderAdapter <clinit>
WARNING: COMPAT locale provider will be removed in a future release

WARNING: A terminally deprecated method in java.lang.System has been called
WARNING: System::setSecurityManager has been called by org.opensearch.bootstrap.BootstrapForTesting (file:/Users/mohitamg/Documents/GitHub/OpenSearch/test/framework/build/distributions/framework-3.0.0-SNAPSHOT.jar)
WARNING: Please consider reporting this to the maintainers of org.opensearch.bootstrap.BootstrapForTesting
WARNING: System::setSecurityManager will be removed in a future release
WARNING: A terminally deprecated method in java.lang.System has been called
WARNING: System::setSecurityManager has been called by org.gradle.api.internal.tasks.testing.worker.TestWorker (file:/Users/mohitamg/.gradle/wrapper/dists/gradle-8.7-all/aan3ydargesu18aqyqjwhr3pc/gradle-8.7/lib/plugins/gradle-testing-base-8.7.jar)
WARNING: Please consider reporting this to the maintainers of org.gradle.api.internal.tasks.testing.worker.TestWorker
WARNING: System::setSecurityManager will be removed in a future release

BUILD SUCCESSFUL in 3m 8s
54 actionable tasks: 2 executed, 52 up-to-date
WARNING: The following functionality has been deprecated and will be removed in the next major release of the Develocity Gradle plugin. Run with '-Ddevelocity.deprecation.captureOrigin=true' to see where the deprecated functionality is being used. For assistance with migration, see https://gradle.com/help/gradle-plugin-develocity-migration.
- The deprecated "gradle.enterprise.testretry.enabled" system property has been replaced by "develocity.testretry.enabled"
- The "com.gradle.enterprise" plugin has been replaced by "com.gradle.develocity"

@mohitamg

mohitamg commented Jun 3, 2024

Result for 20k iterations

 ./gradlew ':server:test' --tests "org.opensearch.transport.RemoteClusterClientTests.testConnectAndExecuteRequest" -Dtests.seed=C42E5842F94566E0 -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=nl-NL -Dtests.timezone=Asia/Kolkata -Druntime.java=21 -Dtests.iters=20000

> Configure project :
========================= WARNING =========================
         Backwards compatibility tests are disabled!
See https://github.com/opensearch-project/OpenSearch/issues/4173
===========================================================
=======================================
OpenSearch Build Hamster says Hello!
  Gradle Version        : 8.7
  OS Info               : Mac OS X 14.5 (aarch64)
  Runtime JDK Version   : 21 (Amazon Corretto JDK)
  Runtime java.home     : /Library/Java/JavaVirtualMachines/amazon-corretto-21.jdk/Contents/Home
  Gradle JDK Version    : 21 (Amazon Corretto JDK)
  Gradle java.home      : /Library/Java/JavaVirtualMachines/amazon-corretto-21.jdk/Contents/Home
  Random Testing Seed   : C42E5842F94566E0
  In FIPS 140 mode      : false
=======================================

> Task :server:test
WARNING: Using incubator modules: jdk.incubator.vector
Jun 03, 2024 1:31:05 PM sun.util.locale.provider.LocaleProviderAdapter <clinit>
WARNING: COMPAT locale provider will be removed in a future release

WARNING: A terminally deprecated method in java.lang.System has been called
WARNING: System::setSecurityManager has been called by org.opensearch.bootstrap.BootstrapForTesting (file:/Users/mohitamg/Documents/GitHub/OpenSearch/test/framework/build/distributions/framework-3.0.0-SNAPSHOT.jar)
WARNING: Please consider reporting this to the maintainers of org.opensearch.bootstrap.BootstrapForTesting
WARNING: System::setSecurityManager will be removed in a future release
WARNING: A terminally deprecated method in java.lang.System has been called
WARNING: System::setSecurityManager has been called by org.gradle.api.internal.tasks.testing.worker.TestWorker (file:/Users/mohitamg/.gradle/wrapper/dists/gradle-8.7-all/aan3ydargesu18aqyqjwhr3pc/gradle-8.7/lib/plugins/gradle-testing-base-8.7.jar)
WARNING: Please consider reporting this to the maintainers of org.gradle.api.internal.tasks.testing.worker.TestWorker
WARNING: System::setSecurityManager will be removed in a future release

BUILD SUCCESSFUL in 5m 46s
54 actionable tasks: 1 executed, 53 up-to-date
WARNING: The following functionality has been deprecated and will be removed in the next major release of the Develocity Gradle plugin. Run with '-Ddevelocity.deprecation.captureOrigin=true' to see where the deprecated functionality is being used. For assistance with migration, see https://gradle.com/help/gradle-plugin-develocity-migration.
- The deprecated "gradle.enterprise.testretry.enabled" system property has been replaced by "develocity.testretry.enabled"
- The "com.gradle.enterprise" plugin has been replaced by "com.gradle.develocity"
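
As a rough cross-check of the point that 20 seconds is too short for 100k real iterations, the build times reported in this thread can be turned into an average cost per iteration. The sketch below is only arithmetic on those reported wall-clock times; the class and method names are made up for illustration, and the figures include Gradle startup (and, for the 10k run, test compilation), so they are upper bounds rather than precise measurements.

```java
import java.util.Locale;

public class IterationTiming {
    /** Average wall-clock milliseconds per iteration, Gradle overhead included. */
    public static double millisPerIteration(int minutes, int seconds, int iterations) {
        long totalMillis = (minutes * 60L + seconds) * 1000L;
        return (double) totalMillis / iterations;
    }

    public static void main(String[] args) {
        // Build times as reported in the comments above.
        double mutedRun = millisPerIteration(0, 20, 100_000);  // test still muted
        double tenK = millisPerIteration(3, 8, 10_000);        // annotation commented out
        double twentyK = millisPerIteration(5, 46, 20_000);    // annotation commented out

        System.out.printf(Locale.ROOT, "muted 100k run: %.1f ms/iteration%n", mutedRun);
        System.out.printf(Locale.ROOT, "10k run:        %.1f ms/iteration%n", tenK);
        System.out.printf(Locale.ROOT, "20k run:        %.1f ms/iteration%n", twentyK);

        // Projected duration of 100k real iterations at the 20k-run rate.
        double projectedMinutes = twentyK * 100_000 / 1000.0 / 60.0;
        System.out.printf(Locale.ROOT, "projected 100k: about %.0f minutes%n", projectedMinutes);
    }
}
```

The unmuted runs cost roughly 17 to 19 ms per iteration, against about 0.2 ms for the earlier 100k run, so 100k real iterations would be expected to take on the order of half an hour rather than 20 seconds.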

@sandeshkr419
Contributor

I think the test failed because of non-ideal conditions, likely some sort of resource constraint that prevented the connection from being established as expected. Running the test in isolation removes the resource constraint that could be leading to the failure. I'm wondering if we can introduce a resource constraint in the test case to validate this.

Also, can running the invalid API path [indices:data/read/scroll] on constrained resources cause an OOM, or drop the node for any other reason?

What I mean is: as a fix, could we assert that the node is healthy when we hit NodeDisconnectedException and then retry the same test case?
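
The retry idea above could look roughly like the following. This is a minimal sketch, assuming a health probe and a retry budget that are not part of the existing test: `retryOnDisconnect`, `ToyDisconnectedException`, and `attemptsUntilSuccess` are hypothetical names, and the exception type is a stand-in for NodeDisconnectedException rather than the real class.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.BooleanSupplier;
import java.util.function.Supplier;

public class RetryOnDisconnect {
    /** Hypothetical stand-in for NodeDisconnectedException. */
    public static class ToyDisconnectedException extends RuntimeException {
        public ToyDisconnectedException(String message) {
            super(message);
        }
    }

    /**
     * Runs the action, retrying only when it fails with a disconnect and the
     * node still reports healthy. Any other exception propagates immediately.
     */
    public static <T> T retryOnDisconnect(int maxAttempts, BooleanSupplier nodeHealthy, Supplier<T> action) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        ToyDisconnectedException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.get();
            } catch (ToyDisconnectedException e) {
                if (!nodeHealthy.getAsBoolean()) {
                    throw e; // an unhealthy node is a real failure, not a transient one
                }
                last = e;
            }
        }
        throw last; // retry budget exhausted
    }

    /** Demo helper: counts how many calls are made before the action succeeds. */
    public static int attemptsUntilSuccess(int failuresBeforeSuccess, int maxAttempts) {
        AtomicInteger calls = new AtomicInteger();
        retryOnDisconnect(maxAttempts, () -> true, () -> {
            if (calls.incrementAndGet() <= failuresBeforeSuccess) {
                throw new ToyDisconnectedException("disconnected");
            }
            return "ok";
        });
        return calls.get();
    }

    public static void main(String[] args) {
        System.out.println(attemptsUntilSuccess(2, 3));
    }
}
```

Bounding the attempts and rethrowing when the node is unhealthy keeps a persistent disconnect visible as a test failure; whether retrying is appropriate at all depends on whether the disconnect turns out to be a test artifact or a product bug.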

Projects
Status: 🏗 In progress
Development

No branches or pull requests

9 participants