Block merging code/features into 2.12 until all flaky tests are fixed. #10371
Comments
@andrross how do I get the recent numbers on the worst offenders?
@dblock My previous flaky-test-finder.rb worked by finding the UNSTABLE builds, meaning builds that contained tests that failed at least once but succeeded on retry. However, we now have an explicit list of tests that are allowed to retry, and the tests causing us pain are flaky but not allowed to retry, and therefore fail the gradle check on a single failure. I have a tweaked script (failed-test-finder.rb) that will find the test failures, but note there can be false positives here due to failures from PR revisions that had legitimate bugs that were later fixed. The best signal we have is that tests with many failures are likely the flakiest. I'm running the above script like below to get current results:
Here is an edited list of the tests with more than 10 failures in the past 450 runs:
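For context, a rough sketch of the failure-counting approach described above could look like the following. This is not the actual failed-test-finder.rb; the CI job URL, report endpoint, JSON field names, and build-number range are all assumptions made purely for illustration.

```ruby
#!/usr/bin/env ruby
# Hypothetical sketch: count how often each test failed across the last N
# "gradle check" CI runs. The job URL, testReport endpoint, and JSON field
# names below are assumptions, not the real script.
require 'net/http'
require 'json'
require 'uri'

CI_JOB = 'https://build.ci.opensearch.org/job/gradle-check' # assumed job URL
LATEST = 12_345                                              # assumed latest build number
RUNS   = 450                                                 # how far back to look

failure_counts = Hash.new(0)

((LATEST - RUNS + 1)..LATEST).each do |build|
  # A Jenkins-style testReport endpoint is assumed here.
  uri = URI("#{CI_JOB}/#{build}/testReport/api/json")
  response = Net::HTTP.get_response(uri)
  next unless response.is_a?(Net::HTTPSuccess)

  report = JSON.parse(response.body)
  report.fetch('suites', []).each do |suite|
    suite.fetch('cases', []).each do |test_case|
      next unless test_case['status'] == 'FAILED'
      failure_counts["#{test_case['className']}.#{test_case['name']}"] += 1
    end
  end
end

# Worst offenders first. A test that fails across many unrelated PRs is the
# most likely to be genuinely flaky; some counted failures will still be
# legitimate bugs in the PR under test (false positives).
failure_counts.sort_by { |_, count| -count }
              .select  { |_, count| count > 10 }
              .each    { |test, count| puts format('%4d  %s', count, test) }
```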
I like the intent behind "block merges until all tests are fixed", but in reality we should probably aim for something like "block merges until tests reliably pass" and get there through a combination of fixing tests, adding limited retries to existing flaky tests, muting known bad tests, and assigning owners to all muted/retried tests.
-1. We should never block incremental progress (even on minor branches) for PRs unrelated to the flaky tests. I propose:
I am not against muting tests (1), and (2) is an obvious yes. Doing (3) means we're asking maintainers to run after people, which I am a -1 on. Given the state of affairs, I think we need more than a carrot (a stick?) for contributors who wrote flaky tests to come back and fix them. I feel pretty strongly that this work shouldn't be dumped on others while the authors of the flaky tests race to build more features with more flaky tests.
It's not that simple. We've (informally) defined a "flaky test" as one that only "periodically" fails and can't be reproduced by the maintainer who discovered it. "If a test fails and you haven't seen it fail in a week (or ever), and I can't repro it in the heat of battle, it must be a flake, right?" This loose definition has been broadly applied by ALL contributors/code authors when their PR triggers an unrelated test (that they didn't write) to fail. The author/contributor proceeds to flag that test as "flaky" by (if we're lucky) adding it to the list so it's thrown behind a retry, commenting on the PR that it's a "flake", muting the test, and re-pushing to the PR in an effort to get their contribution to pass (go green) so they can merge it and move on with life. This means the original authors (not always a maintainer) of the now-deemed "flaky test" are most likely in the dark that their test actually failed.

This is a bigger problem caused by us only unit and integration testing contributions on PRs. The test framework uses a random seed!! And that randomization is all over the test harness (including Lucene DirectoryFactory choices in LuceneTestCase). So testing scenario-based contributions like SegRep and Remote Store only once (or even a handful of times) is likely not going to catch esoteric corner cases until some random unrelated PR picks the right seed on the right test container under the right conditions, conditions that aren't reproducible in a developer's environment. (As a side note, this is why Lucene has "beast" testing, which we used extensively when working on BKD and Geo.)

What we're finding now is that many of the SegRep "flakes" are not flakes at all but actual bugs that can be reproduced by diving deeper beyond the simple retry. Those original authors have been working hard to whittle down that list and not dump it on others.
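For anyone chasing one of these, a minimal illustration of re-running with a fixed seed and of "beast"-style repetition, using the randomized-testing properties the gradle build typically exposes; the module, test class, and seed values below are placeholders, not taken from this issue.

```sh
# Re-run a single test with the seed reported in the CI failure
# (module, test class, and seed here are placeholders).
./gradlew ':server:internalClusterTest' \
  --tests "org.opensearch.indices.replication.SegmentReplicationIT" \
  -Dtests.seed=DEADBEEF12345678

# "Beast" the test: repeat it many times so that rare seeds and interleavings
# a single PR run would never hit have a chance to show up.
./gradlew ':server:internalClusterTest' \
  --tests "org.opensearch.indices.replication.SegmentReplicationIT" \
  -Dtests.iters=100
```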
Maybe. It's also asking maintainers to maintain the codebase by notifying original authors (who are most likely unaware) when a flake is discovered in their test. This is why I said the maintainer doesn't have to fix the test themselves, just find the expertise to help. That is the responsibility of a maintainer of an OSS project. I also propose:
I think this proposal goes against the fundamental tenets of this repository: unilaterally blocking 'feature' contributions will immediately be challenged and exceptions will be cut. If there is no enforcement mechanism, then this is a signal of intentions, which is nice, but it is not applied equally or visible to incoming contributions, which creates friction. I think there might be other ways to tackle the goal of reducing flaky tests; I'll make them their own comments for 👍 👎
Proposal: Requiring more PR approvals
By changing the required approval count on the main branch from 1 to 2, there are more opportunities for feedback and improved PR quality, and therefore improved test quality. Pros:
Cons:
Proposal: Require 3 passing gradle check runs to merge PRs
This would make the pain of flaky tests harder to get around. Note: this might effectively 'block all feature development' until substantial progress is made; maybe that is a feature, not a bug. Pros:
Cons:
@peternied this is something we have discussed with @andrross; the idea is close enough, but the process differs slightly:
@nknize or @andrross is there an issue or PR associated with this proposal? Could we see about including those details here, to see if that would satisfy the requirements of this issue? As it stands I don't see any proposals that have 👍's. @dblock, did you have a mechanism in mind that would meet your criteria?
My plan was to simply ask maintainers to stop merging features and to comment on them along the lines of "Please go fix a flaky test from this long list to get your feature in faster." 🤷♂️
+1 to @nknize's point about maintainers helping identify the author/owner to fix the flaky test. Also, prevent merging any new PRs from them until their flaky tests are fixed. There are no clean gradle check runs these days.
As far as this specific proposal of blocking the merge of new code/features goes, I think it's clearly not accepted. I appreciate everyone who contributed a comment or suggestion here about how to fix our flaky tests, but as far as this issue goes, the right thing to do is to close it. Flaky tests obviously remain a problem: we went from 91 to 120 between when I opened this issue and now, as I write this. If someone has a good idea about how to tackle that, please open a new issue.
Is your feature request related to a problem? Please describe.
As of the time of writing, there are 91 flaky tests. Even with retries, your chances of getting a passing gradle check are close to zero.
Describe the solution you'd like
In a discussion on Slack, we proposed not to merge anything into 2.x (2.12) other than fixes for flaky test failures until we hit zero.
Describe alternatives you've considered