
Block merging code/features into 2.12 until all flaky tests are fixed. #10371

Closed
dblock opened this issue Oct 4, 2023 · 15 comments
Labels
enhancement, Meta, untriaged

Comments

@dblock
Member

dblock commented Oct 4, 2023

Is your feature request related to a problem? Please describe.

As of this writing there are 91 flaky tests. Even with retries, your chances of getting a passing gradle check are close to none.
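
(Illustrative numbers only, not measured failure rates: if each of those 91 tests independently failed even 3% of the time, a full run would pass only about 0.97^91 ≈ 6% of the time, so even a few retries leave you more likely to fail than to pass.)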

Describe the solution you'd like

In a discussion on Slack we proposed not to merge anything into 2.x (2.12) other than fixes for flaky tests until we hit zero.

Describe alternatives you've considered

  • Run after people to fix flaky tests.
  • Create an economy where to get 1 feature in you need to fix 1 flaky test. Pay up!
  • For each PR that failed CI on an unrelated flaky test, only merge when those tests are fixed.
dblock added the enhancement and untriaged labels Oct 4, 2023
@dblock
Member Author

dblock commented Oct 4, 2023

@andrross how do I get the recent numbers on the worst offenders?

dblock changed the title from "Block merging code into 2.12 until all flaky tests are fixed." to "Block merging code/features into 2.12 until all flaky tests are fixed." Oct 4, 2023
@andrross
Member

andrross commented Oct 4, 2023

@dblock My previous flaky-test-finder.rb worked by finding the UNSTABLE builds, which means it contained tests that failed at least once but succeeded on retry. However, we now have an explicit list of tests that are allowed to retry, and the tests causing us pain are flaky but not allowed to retry, and therefore fail the gradle check on a single failure. I have a tweaked script (failed-test-finder.rb) that will find the test failures, but note there can be false positives here due to failures from PR revisions that had legit bugs that were fixed. The best signal we have here is that tests with many failures are likely the flakiest.

I'm running the script as follows to get current results:

ruby ~/failed-test-finder.rb -s 26500 -e 26950

Here is an edited list of the tests with more than 10 failures in the past 450 runs:

70 org.opensearch.backwards.IndexingIT.testSeqNoCheckpoints (26502,26503,26504,26508,26512,26515,26520,26526,26527,26529,26530,26532,26533,26543,26549,26559,26573,26574,26598,26604,26619,26620,26653,26660,26664,26672,26680,26681,26683,26689,26690,26702,26728,26730,26734,26737,26741,26743,26746,26747,26765,26766,26767,26768,26775,26776,26777,26778,26781,26782,26799,26801,26811,26815,26820,26821,26829,26831,26845,26855,26857,26862,26867,26880,26898,26899,26917,26922,26944,26945)
69 org.opensearch.backwards.IndexingIT.testIndexVersionPropagation (26502,26503,26504,26508,26512,26515,26520,26526,26527,26529,26530,26532,26533,26543,26549,26559,26574,26598,26604,26619,26620,26653,26660,26664,26672,26680,26681,26683,26689,26690,26702,26728,26730,26734,26737,26741,26743,26746,26747,26765,26766,26767,26768,26775,26776,26777,26778,26781,26782,26799,26801,26811,26815,26820,26821,26829,26831,26845,26855,26857,26862,26867,26880,26898,26899,26917,26922,26944,26945)
67 org.opensearch.backwards.IndexingIT.testIndexingWithPrimaryOnBwcNodes (26502,26503,26504,26508,26512,26515,26520,26526,26527,26529,26530,26532,26533,26543,26549,26559,26573,26574,26598,26604,26619,26660,26672,26680,26681,26683,26689,26690,26702,26728,26730,26734,26737,26741,26743,26746,26747,26765,26766,26767,26768,26775,26776,26777,26778,26781,26782,26799,26801,26811,26815,26820,26821,26829,26831,26845,26855,26857,26862,26867,26880,26898,26899,26917,26922,26944,26945)
37 org.opensearch.upgrades.RecoveryIT.testRelocationWithConcurrentIndexing (26503,26508,26520,26527,26533,26543,26559,26573,26598,26604,26619,26620,26672,26683,26702,26734,26737,26746,26765,26767,26768,26775,26776,26777,26778,26782,26799,26801,26811,26821,26829,26845,26857,26880,26898,26917,26944)
28 org.opensearch.upgrades.RefreshVersionInClusterStateIT.testRefresh (26503,26504,26508,26512,26520,26526,26527,26529,26530,26532,26543,26559,26573,26598,26604,26619,26620,26660,26664,26672,26680,26683,26689,26702,26765,26778,26815,26945)
28 org.opensearch.upgrades.SystemIndicesUpgradeIT.testSystemIndicesUpgrades (26503,26504,26508,26512,26520,26526,26527,26529,26530,26532,26543,26559,26573,26598,26604,26619,26620,26660,26664,26672,26680,26683,26689,26702,26765,26778,26815,26945)
28 org.opensearch.script.expression.MoreExpressionIT.testSpecialValueVariable {p0={"search.concurrent_segment_search.enabled":"true"}} (26539,26558,26592,26595,26682,26690,26691,26693,26697,26710,26722,26726,26740,26744,26745,26750,26751,26755,26782,26817,26832,26856,26866,26879,26886,26908,26918,26931)
28 org.opensearch.upgrades.UpgradeClusterClientYamlTestSuiteIT.classMethod (26503,26504,26508,26512,26520,26526,26527,26529,26530,26532,26543,26559,26573,26598,26604,26619,26620,26660,26664,26672,26680,26683,26689,26702,26765,26778,26815,26945)
27 org.opensearch.upgrades.RecoveryIT.testOperationBasedRecovery (26502,26503,26504,26512,26520,26526,26527,26530,26543,26549,26559,26573,26574,26598,26604,26619,26620,26664,26672,26680,26681,26689,26702,26765,26778,26815,26945)
26 org.opensearch.backwards.IndexingIT.testIndexingWithReplicaOnBwcNodes (26526,26532,26533,26573,26604,26620,26664,26680,26728,26730,26737,26741,26746,26766,26767,26768,26778,26799,26829,26831,26845,26857,26862,26880,26922,26944)
25 org.opensearch.upgrades.RecoveryIT.testRetentionLeasesEstablishedWhenPromotingPrimary (26502,26503,26504,26508,26520,26526,26527,26529,26530,26532,26549,26559,26598,26604,26660,26672,26683,26702,26737,26741,26765,26778,26815,26944,26945)
24 org.opensearch.search.scroll.SearchScrollWithFailingNodesIT.testScanScrollWithShardExceptions {p0={"search.concurrent_segment_search.enabled":"true"}} (26514,26529,26538,26545,26589,26605,26648,26683,26705,26708,26718,26723,26747,26753,26760,26770,26836,26847,26897,26905,26906,26920,26940,26942)
24 org.opensearch.search.aggregations.metrics.CardinalityWithRequestBreakerIT.testRequestBreaker {p0={"search.concurrent_segment_search.enabled":"true"}} (26521,26531,26538,26543,26577,26579,26583,26584,26587,26598,26604,26614,26621,26699,26702,26706,26800,26819,26826,26836,26876,26909,26911,26914)
24 org.opensearch.search.scroll.SearchScrollWithFailingNodesIT.testScanScrollWithShardExceptions {p0={"search.concurrent_segment_search.enabled":"false"}} (26514,26529,26538,26545,26589,26605,26648,26683,26705,26708,26718,26723,26747,26753,26760,26770,26836,26847,26897,26905,26906,26920,26940,26942)
23 org.opensearch.search.aggregations.bucket.DiversifiedSamplerIT.testNestedSamples {p0={"search.concurrent_segment_search.enabled":"true"}} (26531,26577,26584,26591,26596,26601,26605,26621,26627,26645,26670,26671,26675,26723,26748,26753,26789,26834,26853,26876,26883,26900,26915)
23 org.opensearch.upgrades.UpgradeClusterClientYamlTestSuiteIT.test {p0=mixed_cluster/10_basic/Verify custom cluster metadata still exists during upgrade} (26503,26504,26508,26512,26520,26526,26527,26529,26530,26532,26543,26559,26573,26598,26604,26660,26664,26680,26689,26702,26765,26778,26815)
21 org.opensearch.upgrades.UpgradeClusterClientYamlTestSuiteIT.test {p0=mixed_cluster/20_date_range/Insert more docs to joda index} (26508,26512,26520,26526,26529,26530,26532,26559,26598,26604,26619,26620,26660,26664,26672,26683,26689,26765,26778,26815,26945)
21 org.opensearch.upgrades.UpgradeClusterClientYamlTestSuiteIT.test {p0=mixed_cluster/10_basic/Verify that we can still find things with the template} (26503,26504,26508,26520,26526,26529,26532,26543,26559,26573,26598,26604,26619,26620,26660,26672,26680,26683,26702,26765,26778)
21 org.opensearch.upgrades.UpgradeClusterClientYamlTestSuiteIT.test {p0=mixed_cluster/10_basic/Get index works} (26503,26504,26508,26512,26526,26527,26530,26532,26543,26559,26573,26598,26619,26620,26664,26672,26680,26683,26689,26765,26945)
20 org.opensearch.upgrades.RecoveryIT.testRecovery (26504,26508,26512,26526,26530,26532,26573,26574,26660,26680,26681,26683,26689,26778,26781,26801,26855,26880,26899,26917)
20 org.opensearch.upgrades.RecoveryIT.classMethod (26504,26508,26512,26520,26526,26527,26530,26532,26543,26559,26573,26604,26620,26680,26683,26689,26702,26765,26778,26815)
20 org.opensearch.upgrades.UpgradeClusterClientYamlTestSuiteIT.test {p0=mixed_cluster/10_basic/Use the percolate query in mixed cluster} (26503,26508,26520,26527,26529,26530,26532,26543,26573,26604,26619,26620,26660,26664,26683,26702,26765,26778,26815,26945)
19 org.opensearch.upgrades.RecoveryIT.testRetentionLeasesEstablishedWhenRelocatingPrimary (26503,26515,26543,26549,26573,26619,26620,26672,26681,26702,26730,26734,26743,26767,26777,26778,26815,26862,26922)
19 org.opensearch.upgrades.UpgradeClusterClientYamlTestSuiteIT.test {p0=mixed_cluster/10_basic/Verify nodes usage works} (26504,26512,26520,26526,26527,26559,26573,26598,26604,26619,26620,26664,26672,26680,26683,26689,26702,26815,26945)
18 org.opensearch.cluster.MinimumClusterManagerNodesIT.testThreeNodesNoClusterManagerBlock (26522,26567,26597,26635,26653,26657,26692,26705,26727,26732,26783,26808,26830,26858,26870,26930,26932,26937)
18 org.opensearch.repositories.azure.AzureBlobContainerRetriesTests.testWriteLargeBlob (26639,26639,26639,26639,26640,26640,26640,26640,26738,26738,26738,26738,26740,26740,26740,26740,26755,26912)
18 org.opensearch.repositories.azure.AzureBlobContainerRetriesTests.testReadBlobWithRetries (26532,26639,26639,26639,26639,26640,26640,26640,26640,26738,26738,26738,26738,26740,26740,26740,26740,26795)
17 org.opensearch.repositories.azure.AzureBlobContainerRetriesTests.testReadNonexistentBlobThrowsNoSuchFileException (26639,26639,26639,26639,26640,26640,26640,26640,26738,26738,26738,26738,26740,26740,26740,26740,26799)
16 org.opensearch.repositories.azure.AzureBlobContainerRetriesTests.testReadRangeBlobWithRetries (26639,26639,26639,26639,26640,26640,26640,26640,26738,26738,26738,26738,26740,26740,26740,26740)
16 org.opensearch.repositories.azure.AzureBlobContainerRetriesTests.testWriteBlobWithRetries (26639,26639,26639,26639,26640,26640,26640,26640,26738,26738,26738,26738,26740,26740,26740,26740)
16 org.opensearch.repositories.azure.AzureBlobContainerRetriesTests.testRetryUntilFail (26639,26639,26639,26639,26640,26640,26640,26640,26738,26738,26738,26738,26740,26740,26740,26740)
15 org.opensearch.upgrades.UpgradeClusterClientYamlTestSuiteIT.test {p0=mixed_cluster/20_date_range/Insert more docs to java index} (26503,26504,26512,26527,26529,26530,26543,26660,26672,26680,26689,26702,26778,26815,26945)
15 org.opensearch.upgrades.RecoveryIT.testUpdateDoc (26515,26520,26526,26529,26532,26543,26604,26620,26672,26681,26683,26689,26765,26778,26815)
14 org.opensearch.upgrades.RecoveryIT.testRecoveryWithConcurrentIndexing (26502,26504,26515,26529,26532,26543,26549,26559,26573,26619,26620,26689,26765,26945)
14 org.opensearch.backwards.IndexingIT.testUpdateSnapshotStatus (26502,26526,26530,26532,26549,26559,26573,26574,26598,26620,26653,26664,26680,26778)
13 org.opensearch.search.aggregations.metrics.CardinalityWithRequestBreakerIT.classMethod (26501,26501,26501,26577,26587,26587,26587,26684,26684,26684,26826,26826,26826)
12 org.opensearch.upgrades.RecoveryIT.testRecoveryClosedIndex (26503,26504,26512,26526,26527,26532,26559,26598,26604,26702,26765,26815)
12 org.opensearch.action.admin.cluster.node.tasks.CancellableTasksIT.testCancelOrphanedTasks {p0={"search.concurrent_segment_search.enabled":"false"}} (26716,26747,26808,26819,26823,26853,26865,26877,26906,26913,26923,26942)
12 org.opensearch.action.admin.cluster.node.tasks.CancellableTasksIT.testCancelOrphanedTasks {p0={"search.concurrent_segment_search.enabled":"true"}} (26716,26747,26808,26819,26823,26853,26865,26877,26906,26913,26923,26942)
11 org.opensearch.upgrades.RecoveryIT.testClosedIndexNoopRecovery (26502,26504,26520,26529,26559,26573,26620,26681,26702,26765,26945)
11 org.opensearch.upgrades.RecoveryIT.testHistoryUUIDIsGenerated (26503,26520,26529,26604,26660,26672,26680,26689,26702,26778,26815)
11 org.opensearch.repositories.blobstore.BlobStoreRepositoryRemoteIndexTests.testRetrieveShallowCopySnapshotCase1 (26594,26609,26613,26633,26656,26667,26717,26781,26873,26881,26903)
10 org.opensearch.upgrades.RecoveryIT.testSoftDeletesDisabledWarning (26512,26520,26527,26529,26532,26543,26660,26683,26702,26945)

@andrross
Member

andrross commented Oct 4, 2023

I like the intent behind "block merges until all tests are fixed", but in reality we should probably aim for something like "block merges until tests reliably pass" and use a combination of fixing tests, adding limited retries to existing flaky tests, muting known bad tests, and assigning owners to all muted/retried tests in order to get there.
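
For concreteness, a minimal sketch of what muting a known-bad test usually looks like in the Lucene/OpenSearch test framework; the class name, test body, and issue link below are hypothetical:

```java
// A minimal sketch (not an actual OpenSearch test): the usual shape of a muted test.
import org.apache.lucene.tests.util.LuceneTestCase; // lives under org.apache.lucene.util on older Lucene versions
import org.opensearch.test.OpenSearchIntegTestCase;

public class SomeFlakyIT extends OpenSearchIntegTestCase {

    // Muted: the test stays in the tree but is skipped until the linked issue is
    // resolved, so it keeps an owner and a paper trail instead of randomly
    // failing unrelated PRs.
    @LuceneTestCase.AwaitsFix(bugUrl = "https://github.com/opensearch-project/OpenSearch/issues/XXXXX") // hypothetical tracking issue
    public void testSomethingOccasionallyRed() {
        // ...
    }
}
```

The explicit "allowed to retry" list mentioned earlier is a build-level concern (a filter for the Gradle test-retry plugin), so granting a class limited retries is typically a one-line build change rather than a test-code change.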

@nknize
Collaborator

nknize commented Oct 4, 2023

-1. We should never block incremental progress (even on minor branches) for PRs unrelated to the flaky tests.

I propose:

  1. Continue to mute the flaky tests to not block any incremental progress.
  2. Ensure #1715 is current, update it if it's not. (I think there's another similar issue we may need to merge with [Meta] Fix random test failures #1715 but I haven't gone hunting for it)
  3. Start assigning those issues among the core maintainers (put an @{maintainer_id} next to the issue). Make the maintainer the owner, but they don't have to be the one to fix it, just the one to seek out the expertise to fix it.

@dblock
Member Author

dblock commented Oct 5, 2023

I am not against muting tests (1), and (2) is an obvious yes.

Doing (3) means we're asking maintainers to run after people, which I am -1 on. I think given the state of affairs we need more than a carrot (a stick?) for contributors who wrote flaky tests to come back and fix them. I feel pretty strongly that this work shouldn't be dumped on others, while the authors of the flaky tests race to build more features with more flaky tests.

@nknize
Collaborator

nknize commented Oct 5, 2023

I feel pretty strongly that this work shouldn't be dumped on others, while the authors of the flaky tests race to build more features with more flaky tests.

It's not that simple. We've (informally) defined a "flaky test" as one that only "periodically" fails and can't be reproduced by the maintainer that discovered it. "If a test fails and you haven't seen it fail in a week (or ever) and I can't repro in the heat of battle it must be a flake, right?"

This loose definition has been broadly applied by ALL contributors/code authors when their PR triggers an unrelated test (that they didn't write) to fail. The author/contributor proceeds to flag that test as "flaky" by (if we're lucky) adding it to the list so it's thrown behind a retry, commenting on the PR that it's a "flake", muting the test, and re-pushing to the PR in an effort to get their contribution to pass (go green) so they can merge the contribution and move on with life.

This means the original authors (not always a maintainer) of the test now deemed "flaky" are most likely in the dark that their test actually failed. This is a bigger problem caused by us only unit and integ testing contributions on PRs. The test framework uses a Random seed!! And that randomization is all over the test harness (including Lucene DirectoryFactory choices in LuceneTestCase). So only testing scenario-based contributions like SegRep and Remote store once (or even a handful of times) is likely not going to catch esoteric corner cases until some random unrelated PR picks the right seed on the right test container under the right conditions that aren't reproducible in a developer's environment. (As a side note, this is why Lucene has "beast" testing, which we used extensively when working on BKD and Geo.)
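
To make the seed point concrete, a hedged sketch of how a randomized failure can be pinned for local reproduction; the class, test body, and seed value are made up, but the @Seed annotation comes from the randomizedtesting framework the test harness is built on:

```java
// A hedged sketch: pinning the master seed so a randomized failure can be
// replayed locally. The seed below is made up; a real one comes from the
// reproduction line printed when the test fails in CI.
import com.carrotsearch.randomizedtesting.annotations.Seed;
import org.opensearch.test.OpenSearchTestCase;

@Seed("1A2B3C4D5E6F7890") // hypothetical seed copied from a failed CI run
public class ReproduceFlakeTests extends OpenSearchTestCase {

    // With the seed pinned, the framework makes the same random choices
    // (random values, directory/codec picks, etc.) as the failing run.
    public void testThatOnlyFailsOnSomeSeeds() {
        // ...
    }
}
```

Equivalently, the "REPRODUCE WITH" line that gradle prints on a failure carries the same seed as a -Dtests.seed=... property, so the run can be replayed from the command line without touching the code.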

What we're finding now is that many of the SegRep "flakes" are not flakes at all but actual bugs that can be reproduced by diving deeper beyond the simple retry. Those original authors have been working hard to whittle down that list and not dump it on others.

Doing (3) means we're asking maintainers to run after people..

Maybe. It's also asking maintainers to maintain the codebase by notifying original authors (who are most likely unaware) when a flake is discovered in their test. This is why I said the maintainer doesn't have to fix the test themselves, just find the expertise to help. This is the responsibility of a maintainer of an OSS project.

I also propose:

  1. Turn on Jenkins random testing so that we can run the full test suite more often than just on PRs. Bonus points for posting those failures to a mailing list or a dedicated Slack channel (#core-build).

@peternied
Member

we proposed not to merge anything into 2.x (2.12) other than fixes for flaky tests until we hit zero.

I think this proposal goes against the fundamental tenets of this repository - unilaterally blocking 'feature' contributions will immediately be challenged and exceptions will be cut. If there is no enforcement mechanism then this is a signal of intent, which is nice, but it will not be applied equally or visibly to the contributions it creates friction for.

I think there might be other ways to tackle the goal of reducing flaky tests; I'll make them their own comments for 👍 👎 voting.

@peternied
Member

peternied commented Oct 5, 2023

Proposal: Requiring more PR approvals

By changing the required approval count on the main branch from 1 to 2, there are more opportunities for feedback and for improving PR quality - and therefore test quality.

Pros:

  • Trivial to implement for maintainers
  • Clearly articulated to all contributors via PR merge UX
  • By its nature, more maintainers will have a deeper understanding of the changes being merged.

Cons:

  • Maintainers will need to invest more time in approving more PRs
  • Contributors might need to respond to more feedback
  • PR approvals will take longer

@peternied
Member

peternied commented Oct 5, 2023

Proposal: Require 3 passing gradle check runs to merge PRs

This would make the pain of flaky tests harder to get around. Note: this might effectively 'block all feature development' until substantial progress is made - maybe that is a feature, not a bug.

Pros:

  • Trivial to implement for maintainers
  • Clearly articulated to all contributors via PR merge UX

Cons:

  • Contributions, no matter their size or area, will be deeply impacted
  • 3x increase in utilization of CI resources

@reta
Collaborator

reta commented Oct 12, 2023

Proposal: Require 3 passing gradle check runs to merge PRs

@peternied this is something @andrross and I have discussed; the idea is close but the process differs slightly:

  • we keep the single Gradle check as it is now
  • BUT have a manually triggerable premerge GitHub action that runs the Gradle check N times (at least 10) on Jenkins right before the pull request is merged

@peternied
Member

@nknize or @andrross is there an issue or PR associated with this proposal? Could we include those details here to see whether that would satisfy the requirements of this issue?

As it stands I don't see any proposals that have 👍's - @dblock did you have a mechanism in mind that would meet your criteria?

@dblock
Member Author

dblock commented Oct 13, 2023

As it stands I don't see any proposals that have 👍's - @dblock did you have a mechanism in mind that would meet your criteria?

My plan was to simply ask maintainers to stop merging features and comment on those along the lines of "Please go fix a flaky test from this long list to get your feature in faster." 🤷‍♂️

@shwetathareja
Member

shwetathareja commented Oct 23, 2023

Maybe. It's also asking maintainers to maintain the codebase by notifying original authors (who are most likely unaware) when a flake is discovered in their test. This is why I said the maintainer doesn't have to fix the test themselves, just find the expertise to help. This is the responsibility of a maintainer of an OSS project.

+1 to @nknize's point where maintainers help identify the author/owner to fix the flaky test. Also, prevent merging any new PRs from them until the flaky tests are fixed.

There is no clean gradle check run these days.

peternied added the Meta label Nov 30, 2023
@peternied
Member

@CEHENKLE @macohen Do you have thoughts on what we should do with this issue? It doesn't seem right to mark it as triaged since we don't have a clear outcome/owner.

@dblock
Member Author

dblock commented Nov 30, 2023

As far as this specific proposal of blocking the merge of new code/features goes, it's clearly not accepted. I appreciate everyone who contributed a comment or suggestion here about how to fix our flaky tests, but as far as this issue is concerned, the right thing to do is to close it.

Flaky tests obviously remain a problem: we went from 91 to 120 between when I opened this issue and now. If someone has a good idea about how to tackle that, please open a new issue.
