Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TermsAggregatorTests flaky test fix #13567

Merged
merged 1 commit into from
May 17, 2024

Conversation

sandeshkr419
Copy link
Contributor

@sandeshkr419 sandeshkr419 commented May 6, 2024

Description

Adding documents in a single chunk in unit test to avoid multiple segments in the unit test.

The following assertion was failing:

                    assertEquals(expectedCollectCount, aggregator.getCollectCount().get());

The number of collect() calls is used to check whether the optimization has kicked in or not. The tests written with tight assertions assuminga single segment.

In local development, running this test in isolation, it is not feasible to get the test failing as it might be creating a single segment. However, in gradle CI (see linked issue) and running the test with other tests, it may happen that multiple segments are created (multiple commits after individual document addition), which is the only explanation why the collect() count assertion is failing case.

So inspired from https://github.com/opensearch-project/OpenSearch/blob/main/server/src/test/java/org/opensearch/search/aggregations/bucket/histogram/DateHistogramAggregatorTests.java#L1655-L1668 editing the test to add all documents in a single call to avoid committing changes to index and avoid multiple segments.

Did not add changelog as this is a flaky test fix.

Related Issues

Resolves #12640

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Failing checks are inspected and point to the corresponding known issue(s) (See: Troubleshooting Failing Builds)
  • Commits are signed per the DCO using --signoff
  • Commit changes are listed out in CHANGELOG.md file (See: Changelog)
  • Public documentation issue/PR created

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Copy link
Contributor

github-actions bot commented May 6, 2024

❌ Gradle check result for a358230: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Copy link
Contributor

github-actions bot commented May 6, 2024

✅ Gradle check result for 6b95eeb: SUCCESS

Copy link

codecov bot commented May 6, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 71.60%. Comparing base (b15cb0c) to head (6b95eeb).
Report is 289 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff              @@
##               main   #13567      +/-   ##
============================================
+ Coverage     71.42%   71.60%   +0.18%     
- Complexity    59978    61094    +1116     
============================================
  Files          4985     5052      +67     
  Lines        282275   287126    +4851     
  Branches      40946    41604     +658     
============================================
+ Hits         201603   205594    +3991     
- Misses        63999    64595     +596     
- Partials      16673    16937     +264     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@kkmr
Copy link
Contributor

kkmr commented May 6, 2024

Looks like SpotLess check failed. Can you take a look?

@kkmr
Copy link
Contributor

kkmr commented May 6, 2024

While you are at it, could also add some info about the validation you did for the change?

@jainankitk
Copy link
Collaborator

Adding documents in a single chunk in unit test to avoid multiple segments in the unit test.

Is the test dependent on the number of segments? Is that the assertion failing in the original bug?

@jed326
Copy link
Collaborator

jed326 commented May 7, 2024

If the test depends on a specific number of segments then can we [1] add an assertion for the segment count and [2] force merge to the correct number of segments within the test?

Not sure about the viability of [2] since I see a test path for deleted docs.

@sandeshkr419
Copy link
Contributor Author

@jed326

If the test depends on a specific number of segments then can we [1] add an assertion for the segment count and [2] force merge to the correct number of segments within the test?

[1] No its not feasible to add assertion of segment count - I could not find a utility to do so in our codebase. Introducing such utility may not be within the scope of this change.
[2] Adding documents in a single call (addDocuments()) ensures commits are not happening in between multiple calls (addDocument()).

@jainankitk @kkmr

Added more details in description now.

Tests are not failing locally (most likely running in silo). But looking at failures, it looks like if we can force the documents are added in a single call, it will result in a single segment like the one in DateHistogramTest given we already have no merge policy in place during index creation. The failures in flaky tests reported in different CI calls can only occur when there are multiple segments so this is more of a calculated change that we can do to solve the flakiness with no side-effects.

@jainankitk
Copy link
Collaborator

The failures in flaky tests reported in different CI calls can only occur when there are multiple segments so this is more of a calculated change that we can do to solve the flakiness with no side-effects.

Can you try to run the test locally until failure with and without this change? And report back the number of successful runs without failure?

@sandeshkr419
Copy link
Contributor Author

sandeshkr419 commented May 8, 2024

The failures in flaky tests reported in different CI calls can only occur when there are multiple segments so this is more of a calculated change that we can do to solve the flakiness with no side-effects.

Can you try to run the test locally until failure with and without this change? And report back the number of successful runs without failure?

So I have iterated through the test 250 times in a loop for both scenarios (with & without my change) with 0 failures in each case. So still no luck reproducing it in present way.

@jainankitk
Copy link
Collaborator

So I have iterated through the test 250 times in a loop for both scenarios (with & without my change) with 0 failures in each case. So still no luck reproducing it in present way.

Can you explicitly issue IndexWriter.commit to create multiple segments and confirm the issue? Just want to ensure that this change is helping

@sandeshkr419
Copy link
Contributor Author

sandeshkr419 commented May 9, 2024

Can you explicitly issue IndexWriter.commit to create multiple segments and confirm the issue? Just want to ensure that this change is helping

Hey @jainankitk, thanks for suggesting this. In the original code, I added indexWriter.commit(); after 2 addDocument() and I was able to fail the tests accordingly.

Meaning cases where there will 2 segments + deleted document case and optimization will kick in for only 1 of the 2 segments and not both - test will fail with the same CI exception.

So this validates my suspect of multiple segments being the issue.

@jainankitk
Copy link
Collaborator

@sandeshkr419 - Thanks for confirming the issue. Approved!

@sandeshkr419 sandeshkr419 added the backport 2.x Backport to 2.x branch label May 9, 2024
@msfroh
Copy link
Collaborator

msfroh commented May 16, 2024

No its not feasible to add assertion of segment count - I could not find a utility to do so in our codebase. Introducing such utility may not be within the scope of this change.

Since you have an IndexReader, could expectedCollectCount be a Function<IndexReader, Integer>, then for the case where the optimization kicks in, you can pass in a lambda like (r) -> r.leaves().size()? For the non-optimized case, it would be (r) -> docCount, I guess.

@sandeshkr419
Copy link
Contributor Author

I think I missed where I can check the number of segments using reader.leaves.size()

Since you have an IndexReader, could expectedCollectCount be a Function<IndexReader, Integer>, then for the case where the optimization kicks in, you can pass in a lambda like (r) -> r.leaves().size()? For the non-optimized case, it would be (r) -> docCount, I guess.

The cases which are failing are when one of the segment can use optimized flow (where (r) -> 0) and another segment use the un-optimized flow (r) -> docCount. I think I can assert expectedCollectCount to be number of LiveDocs in segments having deleted documents (unoptimized segments) as segments with 0 deletes will not contribute to in collect() calls.

@mch2 mch2 merged commit d26cd46 into opensearch-project:main May 17, 2024
47 of 62 checks passed
opensearch-trigger-bot bot pushed a commit that referenced this pull request May 17, 2024
Signed-off-by: Sandesh Kumar <[email protected]>
(cherry picked from commit d26cd46)
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
@sandeshkr419
Copy link
Contributor Author

@msfroh I see this is merged. Would you recommend reverting this back and have a multiple segment case as we discussed offline.

Tbh, I like it in present state since its much more atomic as the improvements were on segment level and we are forcing the index creation with a single segment. If required, we can have a multiple segment case separately independent of the tests in this PR.

deshsidd pushed a commit to deshsidd/OpenSearch that referenced this pull request May 17, 2024
@sandeshkr419 sandeshkr419 deleted the agg-test branch May 20, 2024 17:55
@sandeshkr419
Copy link
Contributor Author

As per offline discussion with @msfroh - will keep this change and will add a separate test to cover multi-segment scenario.

dblock pushed a commit that referenced this pull request May 20, 2024
(cherry picked from commit d26cd46)

Signed-off-by: Sandesh Kumar <[email protected]>
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
parv0201 pushed a commit to parv0201/OpenSearch that referenced this pull request Jun 10, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport 2.x Backport to 2.x branch bug Something isn't working flaky-test Random test failure that succeeds on second run Search:Aggregations skip-changelog
Projects
None yet
8 participants