
[Flaky Test] Fix Flaky Test SearchTimeoutIT.testSimpleTimeout #16828

Merged · 1 commit · Dec 19, 2024

Conversation

@kkewwei (Contributor) commented Dec 11, 2024

Description

When numDocs=1000:

  1. testSimpleTimeout can take several minutes: scoring each doc costs 500ms, so iterating over all the docs in the query phase is slow.
  2. The ReaderContext is created before executing the query phase and released after the fetch phase.
  3. The lifetime of the ReaderContext is 1min (determined by search.keep_alive_interval).
  4. If the query phase takes too long, the ReaderContext may be released before the fetch phase, so the fetch/Id phase fails, which is exactly what this test hits (see the sketch after this list).
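
For illustration, a minimal sketch of the test's shape as described above (not the exact SearchTimeoutIT source; the script id "sleep-500ms-per-doc" is a placeholder for the mock script the real test registers):

```java
import static org.opensearch.index.query.QueryBuilders.scriptQuery;

import java.util.Collections;
import java.util.concurrent.TimeUnit;
import org.opensearch.action.search.SearchResponse;
import org.opensearch.common.unit.TimeValue;
import org.opensearch.script.Script;
import org.opensearch.script.ScriptType;

// Inside an OpenSearchIntegTestCase subclass. The scripted query sleeps
// ~500ms per scored document, so with numDocs=1000 the query phase alone
// can run for minutes despite the tiny request timeout.
public void testSimpleTimeoutSketch() {
    SearchResponse response = client().prepareSearch("test")
        .setTimeout(new TimeValue(10, TimeUnit.MILLISECONDS)) // tiny request-level timeout
        .setQuery(scriptQuery(new Script(ScriptType.INLINE, "mockscript", "sleep-500ms-per-doc", Collections.emptyMap())))
        .setAllowPartialSearchResults(true)
        .get();
    assertTrue(response.isTimedOut()); // the query phase times out, results are partial
    // If the query phase overruns the 1min ReaderContext keep-alive, the context
    // is reaped before the fetch phase runs and the fetch fails -- the flaky failure.
}
```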

Related Issues

Resolves #16056, #9401

Check List

  • Functionality includes testing.
  • API changes companion pull request created, if applicable.
  • Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

✅ Gradle check result for 51c35eb: SUCCESS

codecov bot commented Dec 11, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 72.05%. Comparing base (b5f651f) to head (3a34c9d).
Report is 3 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff              @@
##               main   #16828      +/-   ##
============================================
- Coverage     72.21%   72.05%   -0.16%     
+ Complexity    65335    65231     -104     
============================================
  Files          5318     5318              
  Lines        304081   304081              
  Branches      43995    43995              
============================================
- Hits         219578   219103     -475     
- Misses        66541    66984     +443     
- Partials      17962    17994      +32     


❌ Gradle check result for d1fc8be: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@kkewwei (Contributor, Author) commented Dec 13, 2024

> ❌ Gradle check result for d1fc8be: FAILURE
>
> Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

SpecificClusterManagerNodesIT.testElectOnlyBetweenClusterManagerNodes #15944

@kkewwei (Contributor, Author) commented Dec 16, 2024

@reta @jed326, please have a look when you are free.

@jed326 (Collaborator) left a comment

Thanks @kkewwei, this was a great find!

@jed326 added the backport 2.x label Dec 16, 2024
@reta (Collaborator) commented Dec 16, 2024

> @reta @jed326, please have a look when you are free.

Thanks @kkewwei, I am actually a bit surprised by your findings.

> If the query phase takes too long, the ReaderContext may be released before the fetch phase, so the fetch/Id phase fails, which is exactly what this test hits.

The query phase has to be terminated early by the timeout, right? So it should not be much longer than the timeout itself?

❕ Gradle check result for d1fc8be: UNSTABLE

Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.

@kkewwei (Contributor, Author) commented Dec 18, 2024

> The query phase has to be terminated early by the timeout, right? So it should not be much longer than the timeout itself?

@reta Yes, the query phase has to be terminated early by the timeout, but it may run much longer than the timeout. The timeout is only checked once per bulk of docs (scoreRange, initial_interval=1024), so when a bulk contains more than 120 docs (120 × 500ms = 60s), the query phase consumes more than 1min.


In addition, perhaps we should decrease the upper interval (100m); see the sketch below.
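
For illustration, a minimal sketch of this kind of interval-based timeout check (hypothetical names and structure, not the actual OpenSearch bulk-scorer code; only the two interval constants come from the discussion above):

```java
// Hypothetical sketch of chunked timeout checking, not the real implementation.
// The deadline is only consulted between chunks, so one chunk of slow docs can
// overrun the timeout by chunkSize * perDocCost before anyone notices.
class ChunkedScorer {
    static final int INITIAL_INTERVAL = 1024;        // per the discussion above
    static final int MAX_INTERVAL = 100_000_000;     // the "upper interval (100m)"

    void scoreRange(int minDoc, int maxDoc, long deadlineNanos) {
        int interval = INITIAL_INTERVAL;
        int doc = minDoc;
        while (doc < maxDoc) {
            if (System.nanoTime() > deadlineNanos) {
                throw new RuntimeException("query timed out"); // stand-in for the real timeout exception
            }
            int upTo = (int) Math.min((long) doc + interval, maxDoc);
            while (doc < upTo) {
                scoreOneDoc(doc++);                  // ~500ms each in this flaky test
            }
            interval = Math.min(interval * 2, MAX_INTERVAL); // checks get sparser over time
        }
    }

    void scoreOneDoc(int doc) { /* scoring work */ }
}
```

At 500ms per doc, even the first 1024-doc chunk runs ~8.5 minutes between checks, and any chunk over 120 docs already exceeds the 1min keep-alive, matching the numbers above.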

@reta (Collaborator) commented Dec 18, 2024

> @reta Yes, the query phase has to be terminated early by the timeout, but it may run much longer than the timeout. The timeout is only checked once per bulk of docs (scoreRange, initial_interval=1024), so when a bulk contains more than 120 docs (120 × 500ms = 60s), the query phase consumes more than 1min.

Thanks @kkewwei, I think this is the problem, not the test, right? If the timeout does not terminate the query early within a reasonable time margin, it is not very useful.

@kkewwei (Contributor, Author) commented Dec 18, 2024

Right.

> > @reta Yes, the query phase has to be terminated early by the timeout, but it may run much longer than the timeout. The timeout is only checked once per bulk of docs (scoreRange, initial_interval=1024), so when a bulk contains more than 120 docs (120 × 500ms = 60s), the query phase consumes more than 1min.
>
> Thanks @kkewwei, I think this is the problem, not the test, right? If the timeout does not terminate the query early within a reasonable time margin, it is not very useful.

@reta timeout is a shard-level setting; there is a possibility that the actual elapsed time is several times the configured timeout. I tried to use cancel_after_time_interval instead of timeout (maybe not suitable) in #11642.

To avoid such excessive overruns, maybe we should decrease the upper interval (100m) to something like 100k?

@reta (Collaborator) commented Dec 18, 2024

> @reta timeout is a shard-level setting; there is a possibility that the actual elapsed time is several times the configured timeout. I tried to use cancel_after_time_interval instead of timeout (maybe not suitable) in #11642.

@kkewwei thanks for staying with me. timeout is a search-request-level setting (which we pass to the shards); the intuition regarding the timeout (at least the one I built from the documentation, https://www.elastic.co/guide/en/elasticsearch/reference/current/search-your-data.html#search-timeout) is that each shard has to respect the timeout (more or less). It is fine for it to be imprecise, but a search request with a timeout of 5ms should not take 1min to complete in this specific test.

I only briefly looked at the overall implementation, and it looks like we may have lost some timeout-handling guarantees, likely while implementing concurrent search. Do you mind if I push some changes into your pull request? Thank you.

Opened up #16882

@kkewwei (Contributor, Author) commented Dec 19, 2024

> > @reta timeout is a shard-level setting; there is a possibility that the actual elapsed time is several times the configured timeout. I tried to use cancel_after_time_interval instead of timeout (maybe not suitable) in #11642.
>
> @kkewwei thanks for staying with me. timeout is a search-request-level setting (which we pass to the shards); the intuition regarding the timeout (at least the one I built from the documentation, https://www.elastic.co/guide/en/elasticsearch/reference/current/search-your-data.html#search-timeout) is that each shard has to respect the timeout (more or less). It is fine for it to be imprecise, but a search request with a timeout of 5ms should not take 1min to complete in this specific test.
>
> I only briefly looked at the overall implementation, and it looks like we may have lost some timeout-handling guarantees, likely while implementing concurrent search. Do you mind if I push some changes into your pull request? Thank you.
>
> Opened up #16882

@reta Of course, please feel free to go ahead.

@kkewwei (Contributor, Author) commented Dec 19, 2024

@reta, it seems #16882 can't completely fix the flaky test. Short-circuiting only speeds up the shard's response; if the first searchLeaf call itself costs about 1min (for example, a first segment with 150 docs at 500ms each takes 75s), the case will still be hit. See the sketch below.
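
For illustration, a rough sketch of that point (hypothetical structure and method names, not the actual ContextIndexSearcher code): if the timeout check sits at leaf boundaries, a single slow leaf still runs to completion.

```java
import java.util.List;
import org.apache.lucene.index.LeafReaderContext;

// Hypothetical sketch: the timeout check fires only between leaves (segments),
// so a single slow leaf runs to completion before anyone looks at the clock.
void searchLeaves(List<LeafReaderContext> leaves) {
    for (LeafReaderContext leaf : leaves) {
        checkTimeout();   // too late if the previous leaf was already slow
        searchLeaf(leaf); // a 150-doc leaf at ~500ms/doc runs ~75s here,
                          // exceeding the 1min ReaderContext keep-alive
    }
}

void checkTimeout() { /* throws if the deadline has passed */ }
void searchLeaf(LeafReaderContext leaf) { /* scores every doc in the segment */ }
```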

@reta (Collaborator) commented Dec 19, 2024

> @reta, it seems #16882 can't completely fix the flaky test.

You are absolutely correct @kkewwei: we only check for timeouts in some places, but in general we cannot preemptively abort the process. Thanks!

❕ Gradle check result for 3a34c9d: UNSTABLE

Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.

@reta merged commit 7050ecf into opensearch-project:main Dec 19, 2024
38 checks passed
opensearch-trigger-bot pushed a commit that referenced this pull request Dec 19, 2024
Signed-off-by: kkewwei <[email protected]>
(cherry picked from commit 7050ecf)
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
reta pushed a commit that referenced this pull request Dec 20, 2024
(cherry picked from commit 7050ecf)

Signed-off-by: kkewwei <[email protected]>
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
@kkewwei deleted the fix_9401 branch December 20, 2024 03:04
Labels
autocut · backport 2.x (Backport to 2.x branch) · flaky-test (Random test failure that succeeds on second run) · >test-failure (Test failure from CI, local build, etc.)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[AUTOCUT] Gradle Check Flaky Test Report for SearchTimeoutIT
3 participants