
Block merging code/features into 2.12 until all flaky tests are fixed. #10371

Closed
dblock opened this issue Oct 4, 2023 · 15 comments
Labels
enhancement, Meta, untriaged

Comments

@dblock
Member

dblock commented Oct 4, 2023

Is your feature request related to a problem? Please describe.

As of this writing there are 91 flaky tests. Even with retries, your chances of getting a passing gradle check are close to none.
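
(Illustrative numbers only, not measured failure rates: if each of those 91 tests independently failed even 3% of the time, a full run would pass only about 0.97^91 ≈ 6% of the time, so even a few retries leave you more likely to fail than to pass.)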

Describe the solution you'd like

In a discussion on Slack we proposed not to merge anything into 2.x (2.12) other than fixes for flaky tests until we hit zero.

Describe alternatives you've considered

  • Run after people to fix flaky tests.
  • Create an economy where to get 1 feature in you need to fix 1 flaky test. Pay up!
  • For each PR that failed CI on an unrelated flaky test, only merge when those tests are fixed.
dblock added the enhancement and untriaged labels Oct 4, 2023
@dblock
Member Author

dblock commented Oct 4, 2023

@andrross how do I get the recent numbers on the worst offenders?

dblock changed the title from "Block merging code into 2.12 until all flaky tests are fixed." to "Block merging code/features into 2.12 until all flaky tests are fixed." Oct 4, 2023
@andrross
Member

andrross commented Oct 4, 2023

@dblock My previous flaky-test-finder.rb worked by finding the UNSTABLE builds, which means it contained tests that failed at least once but succeeded on retry. However, we now have an explicit list of tests that are allowed to retry, and the tests causing us pain are flaky but not allowed to retry, and therefore fail the gradle check on a single failure. I have a tweaked script (failed-test-finder.rb) that will find the test failures, but note there can be false positives here due to failures from PR revisions that had legit bugs that were fixed. The best signal we have here is that tests with many failures are likely the flakiest.

I'm running the script as follows to get current results:

ruby ~/failed-test-finder.rb -s 26500 -e 26950

Here is an edited list of the tests with more than 10 failures in the past 450 runs:

70 org.opensearch.backwards.IndexingIT.testSeqNoCheckpoints (26502,26503,26504,26508,26512,26515,26520,26526,26527,26529,26530,26532,26533,26543,26549,26559,26573,26574,26598,26604,26619,26620,26653,26660,26664,26672,26680,26681,26683,26689,26690,26702,26728,26730,26734,26737,26741,26743,26746,26747,26765,26766,26767,26768,26775,26776,26777,26778,26781,26782,26799,26801,26811,26815,26820,26821,26829,26831,26845,26855,26857,26862,26867,26880,26898,26899,26917,26922,26944,26945)
69 org.opensearch.backwards.IndexingIT.testIndexVersionPropagation (26502,26503,26504,26508,26512,26515,26520,26526,26527,26529,26530,26532,26533,26543,26549,26559,26574,26598,26604,26619,26620,26653,26660,26664,26672,26680,26681,26683,26689,26690,26702,26728,26730,26734,26737,26741,26743,26746,26747,26765,26766,26767,26768,26775,26776,26777,26778,26781,26782,26799,26801,26811,26815,26820,26821,26829,26831,26845,26855,26857,26862,26867,26880,26898,26899,26917,26922,26944,26945)
67 org.opensearch.backwards.IndexingIT.testIndexingWithPrimaryOnBwcNodes (26502,26503,26504,26508,26512,26515,26520,26526,26527,26529,26530,26532,26533,26543,26549,26559,26573,26574,26598,26604,26619,26660,26672,26680,26681,26683,26689,26690,26702,26728,26730,26734,26737,26741,26743,26746,26747,26765,26766,26767,26768,26775,26776,26777,26778,26781,26782,26799,26801,26811,26815,26820,26821,26829,26831,26845,26855,26857,26862,26867,26880,26898,26899,26917,26922,26944,26945)
37 org.opensearch.upgrades.RecoveryIT.testRelocationWithConcurrentIndexing (26503,26508,26520,26527,26533,26543,26559,26573,26598,26604,26619,26620,26672,26683,26702,26734,26737,26746,26765,26767,26768,26775,26776,26777,26778,26782,26799,26801,26811,26821,26829,26845,26857,26880,26898,26917,26944)
28 org.opensearch.upgrades.RefreshVersionInClusterStateIT.testRefresh (26503,26504,26508,26512,26520,26526,26527,26529,26530,26532,26543,26559,26573,26598,26604,26619,26620,26660,26664,26672,26680,26683,26689,26702,26765,26778,26815,26945)
28 org.opensearch.upgrades.SystemIndicesUpgradeIT.testSystemIndicesUpgrades (26503,26504,26508,26512,26520,26526,26527,26529,26530,26532,26543,26559,26573,26598,26604,26619,26620,26660,26664,26672,26680,26683,26689,26702,26765,26778,26815,26945)
28 org.opensearch.script.expression.MoreExpressionIT.testSpecialValueVariable {p0={"search.concurrent_segment_search.enabled":"true"}} (26539,26558,26592,26595,26682,26690,26691,26693,26697,26710,26722,26726,26740,26744,26745,26750,26751,26755,26782,26817,26832,26856,26866,26879,26886,26908,26918,26931)
28 org.opensearch.upgrades.UpgradeClusterClientYamlTestSuiteIT.classMethod (26503,26504,26508,26512,26520,26526,26527,26529,26530,26532,26543,26559,26573,26598,26604,26619,26620,26660,26664,26672,26680,26683,26689,26702,26765,26778,26815,26945)
27 org.opensearch.upgrades.RecoveryIT.testOperationBasedRecovery (26502,26503,26504,26512,26520,26526,26527,26530,26543,26549,26559,26573,26574,26598,26604,26619,26620,26664,26672,26680,26681,26689,26702,26765,26778,26815,26945)
26 org.opensearch.backwards.IndexingIT.testIndexingWithReplicaOnBwcNodes (26526,26532,26533,26573,26604,26620,26664,26680,26728,26730,26737,26741,26746,26766,26767,26768,26778,26799,26829,26831,26845,26857,26862,26880,26922,26944)
25 org.opensearch.upgrades.RecoveryIT.testRetentionLeasesEstablishedWhenPromotingPrimary (26502,26503,26504,26508,26520,26526,26527,26529,26530,26532,26549,26559,26598,26604,26660,26672,26683,26702,26737,26741,26765,26778,26815,26944,26945)
24 org.opensearch.search.scroll.SearchScrollWithFailingNodesIT.testScanScrollWithShardExceptions {p0={"search.concurrent_segment_search.enabled":"true"}} (26514,26529,26538,26545,26589,26605,26648,26683,26705,26708,26718,26723,26747,26753,26760,26770,26836,26847,26897,26905,26906,26920,26940,26942)
24 org.opensearch.search.aggregations.metrics.CardinalityWithRequestBreakerIT.testRequestBreaker {p0={"search.concurrent_segment_search.enabled":"true"}} (26521,26531,26538,26543,26577,26579,26583,26584,26587,26598,26604,26614,26621,26699,26702,26706,26800,26819,26826,26836,26876,26909,26911,26914)
24 org.opensearch.search.scroll.SearchScrollWithFailingNodesIT.testScanScrollWithShardExceptions {p0={"search.concurrent_segment_search.enabled":"false"}} (26514,26529,26538,26545,26589,26605,26648,26683,26705,26708,26718,26723,26747,26753,26760,26770,26836,26847,26897,26905,26906,26920,26940,26942)
23 org.opensearch.search.aggregations.bucket.DiversifiedSamplerIT.testNestedSamples {p0={"search.concurrent_segment_search.enabled":"true"}} (26531,26577,26584,26591,26596,26601,26605,26621,26627,26645,26670,26671,26675,26723,26748,26753,26789,26834,26853,26876,26883,26900,26915)
23 org.opensearch.upgrades.UpgradeClusterClientYamlTestSuiteIT.test {p0=mixed_cluster/10_basic/Verify custom cluster metadata still exists during upgrade} (26503,26504,26508,26512,26520,26526,26527,26529,26530,26532,26543,26559,26573,26598,26604,26660,26664,26680,26689,26702,26765,26778,26815)
21 org.opensearch.upgrades.UpgradeClusterClientYamlTestSuiteIT.test {p0=mixed_cluster/20_date_range/Insert more docs to joda index} (26508,26512,26520,26526,26529,26530,26532,26559,26598,26604,26619,26620,26660,26664,26672,26683,26689,26765,26778,26815,26945)
21 org.opensearch.upgrades.UpgradeClusterClientYamlTestSuiteIT.test {p0=mixed_cluster/10_basic/Verify that we can still find things with the template} (26503,26504,26508,26520,26526,26529,26532,26543,26559,26573,26598,26604,26619,26620,26660,26672,26680,26683,26702,26765,26778)
21 org.opensearch.upgrades.UpgradeClusterClientYamlTestSuiteIT.test {p0=mixed_cluster/10_basic/Get index works} (26503,26504,26508,26512,26526,26527,26530,26532,26543,26559,26573,26598,26619,26620,26664,26672,26680,26683,26689,26765,26945)
20 org.opensearch.upgrades.RecoveryIT.testRecovery (26504,26508,26512,26526,26530,26532,26573,26574,26660,26680,26681,26683,26689,26778,26781,26801,26855,26880,26899,26917)
20 org.opensearch.upgrades.RecoveryIT.classMethod (26504,26508,26512,26520,26526,26527,26530,26532,26543,26559,26573,26604,26620,26680,26683,26689,26702,26765,26778,26815)
20 org.opensearch.upgrades.UpgradeClusterClientYamlTestSuiteIT.test {p0=mixed_cluster/10_basic/Use the percolate query in mixed cluster} (26503,26508,26520,26527,26529,26530,26532,26543,26573,26604,26619,26620,26660,26664,26683,26702,26765,26778,26815,26945)
19 org.opensearch.upgrades.RecoveryIT.testRetentionLeasesEstablishedWhenRelocatingPrimary (26503,26515,26543,26549,26573,26619,26620,26672,26681,26702,26730,26734,26743,26767,26777,26778,26815,26862,26922)
19 org.opensearch.upgrades.UpgradeClusterClientYamlTestSuiteIT.test {p0=mixed_cluster/10_basic/Verify nodes usage works} (26504,26512,26520,26526,26527,26559,26573,26598,26604,26619,26620,26664,26672,26680,26683,26689,26702,26815,26945)
18 org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock (26522,26567,26597,26635,26653,26657,26692,26705,26727,26732,26783,26808,26830,26858,26870,26930,26932,26937)
18 org.opensearch.repositories.azure.AzureBlobContainerRetriesTests.testWriteLargeBlob (26639,26639,26639,26639,26640,26640,26640,26640,26738,26738,26738,26738,26740,26740,26740,26740,26755,26912)
18 org.opensearch.repositories.azure.AzureBlobContainerRetriesTests.testReadBlobWithRetries (26532,26639,26639,26639,26639,26640,26640,26640,26640,26738,26738,26738,26738,26740,26740,26740,26740,26795)
17 org.opensearch.repositories.azure.AzureBlobContainerRetriesTests.testReadNonexistentBlobThrowsNoSuchFileException (26639,26639,26639,26639,26640,26640,26640,26640,26738,26738,26738,26738,26740,26740,26740,26740,26799)
16 org.opensearch.repositories.azure.AzureBlobContainerRetriesTests.testReadRangeBlobWithRetries (26639,26639,26639,26639,26640,26640,26640,26640,26738,26738,26738,26738,26740,26740,26740,26740)
16 org.opensearch.repositories.azure.AzureBlobContainerRetriesTests.testWriteBlobWithRetries (26639,26639,26639,26639,26640,26640,26640,26640,26738,26738,26738,26738,26740,26740,26740,26740)
16 org.opensearch.repositories.azure.AzureBlobContainerRetriesTests.testRetryUntilFail (26639,26639,26639,26639,26640,26640,26640,26640,26738,26738,26738,26738,26740,26740,26740,26740)
15 org.opensearch.upgrades.UpgradeClusterClientYamlTestSuiteIT.test {p0=mixed_cluster/20_date_range/Insert more docs to java index} (26503,26504,26512,26527,26529,26530,26543,26660,26672,26680,26689,26702,26778,26815,26945)
15 org.opensearch.upgrades.RecoveryIT.testUpdateDoc (26515,26520,26526,26529,26532,26543,26604,26620,26672,26681,26683,26689,26765,26778,26815)
14 org.opensearch.upgrades.RecoveryIT.testRecoveryWithConcurrentIndexing (26502,26504,26515,26529,26532,26543,26549,26559,26573,26619,26620,26689,26765,26945)
14 org.opensearch.backwards.IndexingIT.testUpdateSnapshotStatus (26502,26526,26530,26532,26549,26559,26573,26574,26598,26620,26653,26664,26680,26778)
13 org.opensearch.search.aggregations.metrics.CardinalityWithRequestBreakerIT.classMethod (26501,26501,26501,26577,26587,26587,26587,26684,26684,26684,26826,26826,26826)
12 org.opensearch.upgrades.RecoveryIT.testRecoveryClosedIndex (26503,26504,26512,26526,26527,26532,26559,26598,26604,26702,26765,26815)
12 org.opensearch.action.admin.cluster.node.tasks.CancellableTasksIT.testCancelOrphanedTasks {p0={"search.concurrent_segment_search.enabled":"false"}} (26716,26747,26808,26819,26823,26853,26865,26877,26906,26913,26923,26942)
12 org.opensearch.action.admin.cluster.node.tasks.CancellableTasksIT.testCancelOrphanedTasks {p0={"search.concurrent_segment_search.enabled":"true"}} (26716,26747,26808,26819,26823,26853,26865,26877,26906,26913,26923,26942)
11 org.opensearch.upgrades.RecoveryIT.testClosedIndexNoopRecovery (26502,26504,26520,26529,26559,26573,26620,26681,26702,26765,26945)
11 org.opensearch.upgrades.RecoveryIT.testHistoryUUIDIsGenerated (26503,26520,26529,26604,26660,26672,26680,26689,26702,26778,26815)
11 org.opensearch.repositories.blobstore.BlobStoreRepositoryRemoteIndexTests.testRetrieveShallowCopySnapshotCase1 (26594,26609,26613,26633,26656,26667,26717,26781,26873,26881,26903)
10 org.opensearch.upgrades.RecoveryIT.testSoftDeletesDisabledWarning (26512,26520,26527,26529,26532,26543,26660,26683,26702,26945)

@andrross
Member

andrross commented Oct 4, 2023

I like the intent behind "block merges until all tests are fixed", but in reality we should probably aim for something like "block merges until tests reliably pass" and use a combination of fixing tests, adding limited retries to existing flaky tests, muting known bad tests, and assigning owners to all muted/retried tests in order to get there.
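
For concreteness, a minimal sketch of what muting a known-bad test usually looks like in the Lucene/OpenSearch test framework; the class name, test body, and issue link below are hypothetical:

```java
// A minimal sketch (not an actual OpenSearch test): the usual shape of a muted test.
import org.apache.lucene.tests.util.LuceneTestCase; // lives under org.apache.lucene.util on older Lucene versions
import org.opensearch.test.OpenSearchIntegTestCase;

public class SomeFlakyIT extends OpenSearchIntegTestCase {

    // Muted: the test stays in the tree but is skipped until the linked issue is
    // resolved, so it keeps an owner and a paper trail instead of randomly
    // failing unrelated PRs.
    @LuceneTestCase.AwaitsFix(bugUrl = "https://github.com/opensearch-project/OpenSearch/issues/XXXXX") // hypothetical tracking issue
    public void testSomethingOccasionallyRed() {
        // ...
    }
}
```

The explicit "allowed to retry" list mentioned earlier is a build-level concern (a filter for the Gradle test-retry plugin), so granting a class limited retries is typically a one-line build change rather than a test-code change.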

@nknize
Collaborator

nknize commented Oct 4, 2023

-1. We should never block incremental progress (even on minor branches) for PRs unrelated to the flaky tests.

I propose:

  1. Continue to mute the flaky tests to not block any incremental progress.
  2. Ensure #1715 is current, update it if it's not. (I think there's another similar issue we may need to merge with [Meta] Fix random test failures #1715 but I haven't gone hunting for it)
  3. Start assigning those issues among the core maintainers (put an @{maintainer_id} next to the issue). Make the maintainer the owner, but they don't have to be the one to fix it, just the one to seek out the expertise to fix it.

@dblock
Member Author

dblock commented Oct 5, 2023

I am not against muting tests (1), and (2) is an obvious yes.

Doing (3) means we're asking maintainers to run after people, which I am -1 on. I think given the state of affairs we need more than a carrot (a stick?) for contributors who wrote flaky tests to come back and fix them. I feel pretty strongly that this work shouldn't be dumped on others, while the authors of the flaky tests race to build more features with more flaky tests.

@nknize
Collaborator

nknize commented Oct 5, 2023

I feel pretty strongly that this work shouldn't be dumped on others, while the authors of the flaky tests race to build more features with more flaky tests.

It's not that simple. We've (informally) defined a "flaky test" as one that only "periodically" fails and can't be reproduced by the maintainer that discovered it. "If a test fails and you haven't seen it fail in a week (or ever) and I can't repro in the heat of battle it must be a flake, right?"

This loose definition has been broadly applied by ALL contributors/code authors when their PR triggers an unrelated test (that they didn't write) to fail. The author/contributor proceeds to flag that test as "flaky" by (if we're lucky) adding it to the list so it's thrown behind a retry, commenting on the PR that it's a "flake", muting the test, and re-pushing to the PR in an effort to get their contribution to pass (go green) so they can merge the contribution and move on with life.

This means the original authors (not always a maintainer) of the test now deemed "flaky" are most likely in the dark that their test actually failed. This is a bigger problem caused by us only unit and integ testing contributions on PRs. The test framework uses a Random seed!! And that randomization is all over the test harness (including Lucene DirectoryFactory choices in LuceneTestCase). So only testing scenario-based contributions like SegRep and Remote store once (or even a handful of times) is likely not going to catch esoteric corner cases until some random unrelated PR picks the right seed on the right test container under the right conditions that aren't reproducible in a developer's environment. (As a side note, this is why Lucene has "beast" testing, which we used extensively when working on BKD and Geo.)
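
To make the seed point concrete, a hedged sketch of how a randomized failure can be pinned for local reproduction; the class, test body, and seed value are made up, but the @Seed annotation comes from the randomizedtesting framework the test harness is built on:

```java
// A hedged sketch: pinning the master seed so a randomized failure can be
// replayed locally. The seed below is made up; a real one comes from the
// reproduction line printed when the test fails in CI.
import com.carrotsearch.randomizedtesting.annotations.Seed;
import org.opensearch.test.OpenSearchTestCase;

@Seed("1A2B3C4D5E6F7890") // hypothetical seed copied from a failed CI run
public class ReproduceFlakeTests extends OpenSearchTestCase {

    // With the seed pinned, the framework makes the same random choices
    // (random values, directory/codec picks, etc.) as the failing run.
    public void testThatOnlyFailsOnSomeSeeds() {
        // ...
    }
}
```

Equivalently, the "REPRODUCE WITH" line that gradle prints on a failure carries the same seed as a -Dtests.seed=... property, so the run can be replayed from the command line without touching the code.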

What we're finding now is that many of the SegRep "flakes" are not flakes at all but actual bugs that can be reproduced by diving deeper beyond the simple retry. Those original authors have been working hard to whittle down that list and not dump it on others.

Doing (3) means we're asking maintainers to run after people..

Maybe. It's also asking maintainers to maintain the codebase by notifying original authors (who are most likely unaware) when a flake is discovered in their test. This is why I said the maintainer doesn't have to fix the test themselves, just find the expertise to help. This is the responsibility of a maintainer of an OSS project.

I also propose:

  1. Turn on Jenkins random testing so that we can run the full test suite more often than just on PRs. Bonus points for posting those failures to a mailing list or a dedicated Slack channel (#core-build).

@peternied
Member

we proposed not to merge anything into 2.x (2.12) other than fixes for flaky tests until we hit zero.

I think this proposal goes against the fundamental tenets of this repository - unilaterally blocking 'feature' contributions will immediately be challenged and exceptions will be cut. If there is no enforcement mechanism then this is a signal of intent, which is nice, but it will not be applied equally or visibly to the contributions it creates friction for.

I think there might be other ways to tackle the goal of reducing flaky tests; I'll make them their own comments for 👍 👎 voting.

@peternied
Member

peternied commented Oct 5, 2023

Proposal: Requiring more PR approvals

By changing the required approval count on the main branch from 1 to 2, there are more opportunities for feedback and for improving PR quality - and therefore test quality.

Pros:

  • Trivial to implement for maintainers
  • Clearly articulated to all contributors via PR merge UX
  • By its nature, more maintainers will have a deeper understanding of the changes being merged.

Cons:

  • Maintainers will need to invest more time in approving more PRs
  • Contributors might need to respond to more feedback
  • PR approvals will take longer

@peternied
Member

peternied commented Oct 5, 2023

Proposal: Require 3 passing gradle check runs to merge PRs

This would make the pain of flaky tests harder to get around. Note: this might effectively 'block all feature development' until substantial progress is made - maybe that is a feature, not a bug.

Pros:

  • Trivial to implement for maintainers
  • Clearly articulated to all contributors via PR merge UX

Cons:

  • Contributions, no matter their size or area, will be deeply impacted
  • 3x increase in utilization of CI resources

@reta
Collaborator

reta commented Oct 12, 2023

Proposal: Require 3 passing gradle check runs to merge PRs

@peternied this is something @andrross and I have discussed; the idea is close but the process differs slightly:

  • we keep the single Gradle check as it is now
  • BUT have a manually triggerable premerge GitHub action that runs the Gradle check N times (at least 10) on Jenkins right before the pull request is merged

@peternied
Member

@nknize or @andrross is there an issue or PR associated with this proposal? Could we include those details here to see whether that would satisfy the requirements of this issue?

As it stands I don't see any proposals that have 👍's - @dblock did you have a mechanism in mind that would meet your criteria?

@dblock
Member Author

dblock commented Oct 13, 2023

As it stands I don't see any proposals that have 👍's - @dblock did you have a mechanism in mind that would meet your criteria?

My plan was to simply ask maintainers to stop merging features and comment on those along the lines of "Please go fix a flaky test from this long list to get your feature in faster." 🤷‍♂️

@shwetathareja
Member

shwetathareja commented Oct 23, 2023

Maybe. It's also asking maintainers to maintain the codebase by notifying original authors (who are most likely unaware) when a flake is discovered in their test. This is why I said the maintainer doesn't have to fix the test themselves, just find the expertise to help. This is the responsibility of a maintainer of an OSS project.

+1 to @nknize's point where maintainers help identify the author/owner to fix the flaky test. Also, prevent merging any new PRs from them until the flaky tests are fixed.

There is no clean gradle check run these days.

peternied added the Meta label Nov 30, 2023
@peternied
Member

@CEHENKLE @macohen Do you have thoughts on what we should do with this issue? It doesn't seem right to mark it as triaged since we don't have a clear outcome/owner.

@dblock
Member Author

dblock commented Nov 30, 2023

As far as this specific proposal of blocking the merge of new code/features goes, it's clearly not accepted. I appreciate everyone who contributed a comment or suggestion here about how to fix our flaky tests, but as far as this issue is concerned, the right thing to do is to close it.

Flaky tests obviously remain a problem: we went from 91 to 120 between when I opened this issue and now. If someone has a good idea about how to tackle that, please open a new issue.
