[BUG] Getting DeadlineTimeoutException error during security tests after upgrading to 3.0 #580
Comments
Is this the first run of tests in 3.0? |
@saratvemulapalli The referenced PR is the first run of the tests in 3.0, yes. The PR itself has been open for some time though, and we had seen the errors earlier, after we bumped. Do you know if there is a known root cause for how that bump has affected the client pool in this way? Or if there is a workaround for getting rid of these test errors? These errors are the only things blocking us from merging this PR in and we'd like to get |
@dblock I saw you had reviewed the OpenSearch PR to migrate the client transports. Any thoughts? |
@reta do you have any suggestions? |
@saratvemulapalli @qreshi interesting, I will take a look shortly. So far, the cause of this kind of exception has been an issue with the plugin itself (e.g. it was not installed properly), hence the timeout. |
Ok, @qreshi I cannot reproduce the issue right now. Here is what could have happened: OpenSearch and/or the plugin did not start properly in the Docker container, causing all tests to fail. |
Hey @reta thanks for looking into it. I'm not sure that's the cause because not all the tests failed. If you look at the summary of the failures in the issue description as well, only some of the tests failed. Even within the |
Oh, yes, you are right, I looked at JDK-17 first (it has more failures), but JDK-11 has only one test case |
It's definitely flaky. I re-ran the security workflow for both, and this time JDK-17 only had one test failure and JDK-11 had a different set fail. Is it possible to find out what connection pool size and lease timeout OpenSearch uses? If so, we can see if these values changed. Also, if it's possible to override those values, we can try making them more lenient and see if the failures go away. Just to rule out causes. |
@qreshi Ok, so one problem is this guy opensearch-project/common-utils#287. Regarding timeouts, SecureRestClientBuilder has methods to set all of them, for
Besides that, there is |
@qreshi there is something very fishy here, the
|
That's a good catch. That millisecond overdue time is ridiculously large as a result. |
@qreshi my apologies, I didn't get time to resume looking into the issue today but I will continue next week |
Nice, thanks for looking into this @reta. So I'm assuming something changed in the newer version of the underlying library that made the strict connection pool not tolerant enough to pass without a larger timeout? |
There are many changes indeed, this is a major version bump |
What is the bug?
With the upgrade to 3.0 in #556, the dependency on org.apache.httpcomponents:httpclient and org.apache.httpcomponents:httpcore changed to org.apache.httpcomponents.client5:httpclient5 and org.apache.httpcomponents.core5:httpcore5 respectively (these are coming from OpenSearch core).
After this upgrade we're starting to see some of the tests fail during the security test CI workflow with the following error:
Not all of the tests failed in this example workflow that is being referenced:
Given the stacktrace, it seems that the client used to make calls to the cluster during the integration tests needed to obtain a lease from a connection pool, and that lease either took too long or could not be obtained. We should check with the OpenSearch core team whether the changes to these httpcomponents-related dependencies have changed anything about the connection pool resourcing/timeouts, leading to these occasional errors. Strangely enough, this error was only seen during the security test runs in the CI. The regular integration tests did not run into this problem. Either it's flaky or the conditions of the Docker security tests make it more likely to happen.
How can one reproduce the bug?
Steps to reproduce the behavior:
Run the security integration test CI in a PR or locally run the security tests against a Docker image running OpenSearch 3.0 with security
What is the expected behavior?
The client is able to get the lease in time and perform the operation during the tests.
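To illustrate the failure mode described above, here is a minimal sketch of a bounded connection pool whose lease request can time out. It uses only the JDK and is not the httpclient5 implementation: LeaseTimeoutSketch, BoundedPool and secondLeaseSucceeds are hypothetical names invented for this example, and the pool size and timings are made-up values.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class LeaseTimeoutSketch {

    /** Hypothetical stand-in for a strict (bounded) connection pool. */
    static final class BoundedPool {
        private final Semaphore permits;

        BoundedPool(int maxConnections) {
            this.permits = new Semaphore(maxConnections, true);
        }

        /** Waits up to timeoutMillis for a free slot, otherwise fails like a lease timeout. */
        void lease(long timeoutMillis) throws InterruptedException, TimeoutException {
            if (!permits.tryAcquire(timeoutMillis, TimeUnit.MILLISECONDS)) {
                throw new TimeoutException("no connection leased within " + timeoutMillis + " ms");
            }
        }

        /** Returns a slot to the pool. */
        void release() {
            permits.release();
        }
    }

    /**
     * Takes the only connection, holds it for holdMillis on another thread,
     * and reports whether a second lease succeeds within leaseTimeoutMillis.
     */
    static boolean secondLeaseSucceeds(long holdMillis, long leaseTimeoutMillis) throws Exception {
        BoundedPool pool = new BoundedPool(1);
        pool.lease(leaseTimeoutMillis); // first caller takes the only connection
        Thread holder = new Thread(() -> {
            try {
                Thread.sleep(holdMillis); // simulate a slow request holding the connection
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                pool.release();
            }
        });
        holder.start();
        try {
            pool.lease(leaseTimeoutMillis); // second caller waits for the lease
            return true;
        } catch (TimeoutException e) {
            return false;
        } finally {
            holder.join();
        }
    }

    public static void main(String[] args) throws Exception {
        // Strict timeout: the connection is held longer than the caller is willing to wait.
        System.out.println("strict timeout leased: " + secondLeaseSucceeds(500, 50));
        // Lenient timeout: the caller waits long enough for the connection to be released.
        System.out.println("lenient timeout leased: " + secondLeaseSucceeds(50, 2000));
    }
}
```

In this sketch, the same workload fails or passes depending only on how long the caller is willing to wait for a lease. That is why making the pool size or lease timeout more lenient is one way to rule out causes, assuming the real client behaves similarly.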