
[Flaky Test] Fix Flaky Test SearchTimeoutIT.testSimpleTimeout #16828

Merged (1 commit, Dec 19, 2024)

Conversation

@kkewwei (Contributor) commented Dec 11, 2024

Description

When numDocs=1000:

  1. testSimpleTimeout takes several minutes: scoring each doc costs 500ms, so iterating over all the docs in the query phase is slow.
  2. The ReaderContext is created before the query phase executes and is released after the fetch phase.
  3. The lifetime of the ReaderContext is 1 minute (determined by search.keep_alive_interval).
  4. If the query phase takes too long, the ReaderContext may be released before the fetch phase, so the fetch (fetch/id) request fails, which is exactly the failure this flaky test hits. The sketch below makes the timing concrete.
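For intuition, here is a tiny standalone back-of-the-envelope check (illustrative Java only, not OpenSearch code; the numbers come from the points above):

```java
import java.util.concurrent.TimeUnit;

// Standalone calculation of the timing described above (illustrative only).
public final class KeepAliveMath {
    public static void main(String[] args) {
        long numDocs = 1000;
        long perDocMillis = 500;                             // scripted delay per scored doc
        long queryPhaseMillis = numDocs * perDocMillis;      // 500,000 ms, about 8.3 minutes
        long keepAliveMillis = TimeUnit.MINUTES.toMillis(1); // search.keep_alive_interval default per the description
        System.out.printf("query phase ~%ds, keep-alive %ds, context released before fetch: %b%n",
                queryPhaseMillis / 1000, keepAliveMillis / 1000, queryPhaseMillis > keepAliveMillis);
    }
}
```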

Related Issues

Resolves #16056 #9401

Check List

  • Functionality includes testing.
  • API changes companion pull request created, if applicable.
  • Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.


✅ Gradle check result for 51c35eb: SUCCESS


codecov bot commented Dec 11, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 72.05%. Comparing base (b5f651f) to head (3a34c9d).
Report is 3 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff              @@
##               main   #16828      +/-   ##
============================================
- Coverage     72.21%   72.05%   -0.16%     
+ Complexity    65335    65231     -104     
============================================
  Files          5318     5318              
  Lines        304081   304081              
  Branches      43995    43995              
============================================
- Hits         219578   219103     -475     
- Misses        66541    66984     +443     
- Partials      17962    17994      +32     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.


❌ Gradle check result for d1fc8be: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@kkewwei (Contributor, Author) commented Dec 13, 2024

> ❌ Gradle check result for d1fc8be: FAILURE
>
> Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

SpecificClusterManagerNodesIT.testElectOnlyBetweenClusterManagerNodes #15944

@kkewwei (Contributor, Author) commented Dec 16, 2024

@reta @jed326, please have a look when you are free.

@jed326 (Collaborator) left a comment

Thanks @kkewwei, this was a great find!

@jed326 added the backport 2.x (Backport to 2.x branch) label on Dec 16, 2024
@reta (Collaborator) commented Dec 16, 2024

> @reta @jed326, please have a look when you are free.

Thanks @kkewwei, I am actually a bit surprised by your findings.

> If the query phase takes too long, the ReaderContext may be released before the fetch phase, so the fetch (fetch/id) request fails.

The query phase has to be terminated early by the timeout, right? So it should not be much longer than the timeout itself?


❕ Gradle check result for d1fc8be: UNSTABLE

Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.

@kkewwei (Contributor, Author) commented Dec 18, 2024

> The query phase has to be terminated early by the timeout, right? So it should not be much longer than the timeout itself?

@reta Yes, the query phase has to be terminated early by the timeout, but it can run much longer than the timeout: the timeout is only checked between bulks of docs (scoreRange, initial interval = 1024). If a bulk holds more than 120 docs (at 500ms per doc), the query phase alone consumes more than 1 minute. The sketch below illustrates the overshoot.

In addition, should we decrease the upper interval (100M)?
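A minimal standalone sketch of how interval-based timeout checking overshoots. Only the 1024 initial interval and the 500ms per-doc cost come from this discussion; the doubling growth and the exact cap are assumptions for illustration, not the actual OpenSearch/Lucene constants:

```java
// Simulates scoring in growing windows, checking the timeout only between windows.
final class WindowedTimeoutSketch {

    static void scoreAll(int numDocs, long perDocMillis, long timeoutMillis) {
        long elapsed = 0;
        int window = 1024; // initial interval: docs scored before the first check
        int doc = 0;
        while (doc < numDocs) {
            int end = Math.min(doc + window, numDocs);
            elapsed += (long) (end - doc) * perDocMillis; // the whole window is scored uninterrupted
            doc = end;
            if (elapsed >= timeoutMillis) {
                System.out.printf("timed out after %d docs and ~%d ms (budget was %d ms)%n",
                        doc, elapsed, timeoutMillis);
                return;
            }
            window = Math.min(window * 2, 100_000_000); // assumed growth, capped at the upper interval
        }
        System.out.printf("completed %d docs in ~%d ms%n", numDocs, elapsed);
    }

    public static void main(String[] args) {
        // A 5ms timeout, but the first 1024-doc window alone covers all 1000 docs: ~500s pass
        // before the timeout is ever consulted.
        scoreAll(1000, 500, 5);
    }
}
```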

@reta (Collaborator) commented Dec 18, 2024

> @reta Yes, the query phase has to be terminated early by the timeout, but it can run much longer than the timeout: the timeout is only checked between bulks of docs (scoreRange, initial interval = 1024). If a bulk holds more than 120 docs (at 500ms per doc), the query phase alone consumes more than 1 minute.

Thanks @kkewwei, I think this is the problem, not the test, right? If the timeout does not terminate the query early within a reasonable margin, it is not very useful.

@kkewwei (Contributor, Author) commented Dec 18, 2024

Right.

> Thanks @kkewwei, I think this is the problem, not the test, right? If the timeout does not terminate the query early within a reasonable margin, it is not very useful.

@reta The timeout is a shard-level setting, so there is a possibility that the actual time taken is several times the timeout. I tried to use cancel_after_time_interval instead of timeout (maybe not suitable): #11642.

To avoid excessive overruns, maybe we should decrease the upper interval (100M), e.g. to 100K?

@reta (Collaborator) commented Dec 18, 2024

> @reta The timeout is a shard-level setting, so there is a possibility that the actual time taken is several times the timeout. I tried to use cancel_after_time_interval instead of timeout (maybe not suitable): #11642.

@kkewwei thanks for staying with me. The timeout is a search-request-level setting (which we pass to the shards); the intuition regarding the timeout (at least the one I built from the documentation, https://www.elastic.co/guide/en/elasticsearch/reference/current/search-your-data.html#search-timeout) is that each shard has to respect the timeout (more or less). It is fine for it to be imprecise, but a search request with a timeout of 5ms should not take 1 minute to complete in this specific test.

I only briefly looked at the overall implementation, and it looks like we may have lost some timeout handling guarantees, likely while implementing concurrent search. Do you mind if I push some changes into your pull request? Thank you.

Opened up #16882

@kkewwei (Contributor, Author) commented Dec 19, 2024

> Do you mind if I push some changes into your pull request?

@reta Of course, please feel free to go ahead.

@kkewwei (Contributor, Author) commented Dec 19, 2024

@reta, it seems #16882 can't completely solve the flaky test. Short-circuiting only speeds up the shard's response between leaves: if the first searchLeaf call itself costs 1 minute (for example, the first segment contains 150 docs and each doc waits 500ms), the test will still hit the case. The sketch below shows why.
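A minimal sketch of why the short-circuit cannot help here. The structure is an assumption for illustration, not the actual ContextIndexSearcher code: a timeout check placed between leaves cannot interrupt the first searchLeaf call itself.

```java
import java.util.List;

// Simulates a per-leaf timeout short-circuit: the check fires only between segments.
final class PerLeafShortCircuit {
    interface Leaf {
        long searchMillis(); // simulated cost of searching this segment
    }

    static long search(List<Leaf> leaves, long timeoutMillis) {
        long elapsed = 0;
        for (Leaf leaf : leaves) {
            if (elapsed >= timeoutMillis) {
                break;                      // short-circuit only takes effect from the next leaf onward
            }
            elapsed += leaf.searchMillis(); // uninterruptible while inside the leaf
        }
        return elapsed;
    }

    public static void main(String[] args) {
        // First segment: 150 docs * 500 ms = 75s, already past the 60s keep-alive,
        // even though the request's timeout is 5ms.
        long elapsed = search(List.of(() -> 150L * 500, () -> 100L * 500), 5);
        System.out.println("query phase took ~" + elapsed / 1000 + "s despite a 5 ms timeout");
    }
}
```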

@reta (Collaborator) commented Dec 19, 2024

> @reta, it seems #16882 can't completely solve the flaky test.

You are absolutely correct @kkewwei: we only check for timeouts in some places, and in general we cannot preemptively abort the process. Thanks!


❕ Gradle check result for 3a34c9d: UNSTABLE

Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.

@reta merged commit 7050ecf into opensearch-project:main on Dec 19, 2024 (38 checks passed)
opensearch-trigger-bot bot pushed a commit that referenced this pull request Dec 19, 2024
Signed-off-by: kkewwei <[email protected]>
(cherry picked from commit 7050ecf)
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
reta pushed a commit that referenced this pull request Dec 20, 2024
(cherry picked from commit 7050ecf)

Signed-off-by: kkewwei <[email protected]>
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Labels: autocut · backport 2.x (Backport to 2.x branch) · flaky-test (Random test failure that succeeds on second run) · >test-failure (Test failure from CI, local build, etc.)
Successfully merging this pull request may close these issues.

[AUTOCUT] Gradle Check Flaky Test Report for SearchTimeoutIT