From f8cd32e1c5bc0407609ee236de0bce6fdbe47288 Mon Sep 17 00:00:00 2001
From: Joyee Cheung
Date: Thu, 28 Sep 2023 22:24:38 +0200
Subject: [PATCH] doc: update README and include protocol for handling
 reliability issues (#678)

---
 README.md | 210 +++++++++++++++++++++++++++---------------------------
 1 file changed, 105 insertions(+), 105 deletions(-)

diff --git a/README.md b/README.md
index efa1b5d..84018d6 100644
--- a/README.md
+++ b/README.md
@@ -5,19 +5,17 @@ This repo is used for tracking flaky tests on the Node.js CI and fixing them.
 **Current status**: work in progress. Please go to the issue tracker to
 discuss!
 
-
-- [Updating this repo](#updating-this-repo)
-- [The Goal](#the-goal)
+- [Node.js Core CI Reliability](#nodejs-core-ci-reliability)
+  - [Updating this repo](#updating-this-repo)
+  - [The Goal](#the-goal)
   - [The Definition of Green](#the-definition-of-green)
-- [CI Health History](#ci-health-history)
-- [Handling Failed CI runs](#handling-failed-ci-runs)
-  - [Flaky Tests](#flaky-tests)
-    - [Identifying Flaky Tests](#identifying-flaky-tests)
-    - [When Discovering a Potential New Flake on the CI](#when-discovering-a-potential-new-flake-on-the-ci)
-  - [Infrastructure failures](#infrastructure-failures)
-  - [Build File Failures](#build-file-failures)
-- [TODO](#todo)
-
+  - [CI Health History](#ci-health-history)
+  - [Protocols in improving CI reliability](#protocols-in-improving-ci-reliability)
+    - [Identifying flaky JS tests](#identifying-flaky-js-tests)
+    - [Handling flaky JS tests](#handling-flaky-js-tests)
+    - [Identifying infrastructure issues](#identifying-infrastructure-issues)
+    - [Handling infrastructure issues](#handling-infrastructure-issues)
+  - [TODO](#todo)
 
 ## Updating this repo
 
@@ -50,104 +48,106 @@ Make the CI green again.
 
 ## CI Health History
 
-See https://nodejs-ci-health.mmarchini.me/#/job-summary
-
-| UTC Time         | RUNNING | SUCCESS | UNSTABLE | ABORTED | FAILURE | Green Rate |
-| ---------------- | ------- | ------- | -------- | ------- | ------- | ---------- |
-| 2018-06-01 20:00 | 1       | 1       | 15       | 11      | 72      | 1.13%      |
-| 2018-06-03 11:36 | 3       | 6       | 21       | 10      | 60      | 6.89%      |
-| 2018-06-04 15:00 | 0       | 9       | 26       | 10      | 55      | 10.00%     |
-| 2018-06-15 17:42 | 1       | 27      | 4        | 17      | 51      | 32.93%     |
-| 2018-06-24 18:11 | 0       | 27      | 2        | 8       | 63      | 29.35%     |
-| 2018-07-08 19:40 | 1       | 35      | 2        | 4       | 58      | 36.84%     |
-| 2018-07-18 20:46 | 2       | 38      | 4        | 5       | 51      | 40.86%     |
-| 2018-07-24 22:30 | 2       | 46      | 3        | 4       | 45      | 48.94%     |
-| 2018-08-01 19:11 | 4       | 17      | 2        | 2       | 75      | 18.09%     |
-| 2018-08-14 15:42 | 5       | 22      | 0        | 14      | 59      | 27.16%     |
-| 2018-08-22 13:22 | 2       | 29      | 4        | 9       | 56      | 32.58%     |
-| 2018-10-31 13:28 | 0       | 40      | 13       | 4       | 43      | 41.67%     |
-| 2018-11-19 10:32 | 0       | 48      | 8        | 5       | 39      | 50.53%     |
-| 2018-12-08 20:37 | 2       | 18      | 4        | 3       | 73      | 18.95%     |
-
-## Handling Failed CI runs
-
-### Flaky Tests
-
-TODO: automate all of this in ncu-ci
-
-#### Identifying Flaky Tests
-
-When checking the CI results of a PR, if there is one or more failed tests (with
-`not ok` as the TAP result):
-
-1. If the failed test is not related to the PR (does not touch the modified
-   code path), search the test name in the issue tracker of this repo. If there
-   is an existing issue, add a reply there using the [reproduction template](./templates/repro.txt),
-   and open a pull request updating `flakes.json`.
-2. If there are no new existing issues about the test, run the CI again. If the
-   failure disappears in the next run, then it is potential flake. See
-   [When discovering a potential flake on the CI](#when-discovering-a-potential-new-flake-on-the-ci)
-   on what to do for a new flake.
-3. If the failure reproduces in the next run, it is likely that the failure is
-   related to the PR. Do not re-run CI without code changes in the next 24
-   hours, try to debug the failure.
-4. If the cause of the failure still cannot be identified 24 hours later, and
-   the code has not been changed, start a CI run and see if the failure
-   disappears. Go back to step 3 if the failure still reproduces, and go to
-   step 2 if the failure disappears.
-
-#### When Discovering a Potential New Flake on the CI
-
-1. Open an issue in this repo using [the flake issue template](./templates/flake.txt):
+[A GitHub workflow](.github/workflows/reliability_report.yml) is run every day
+to produce a reliability report of the `node-test-pull-request` CI and post it
+to [the issue tracker](https://github.com/nodejs/reliability/issues).
+
+## Protocols in improving CI reliability
+
+Most work starts with opening the issue tracker of this repository and
+reading the latest report. If the report is missing, check
+[the actions page](https://github.com/nodejs/reliability/actions) for
+details: GitHub's API restricts the length of issue messages, so when a
+report is too long the workflow can fail to post it as an issue, but it
+still leaves a summary on the actions page.
+
+### Identifying flaky JS tests
+
+1. Check out the `JSTest Failure` section of the latest reliability report.
+   It contains information about the JS tests that failed in more than one
+   pull request across the last 100 `node-test-pull-request` CI runs. The
+   more pull requests a test fails in, the higher it is ranked and the
+   more likely it is a flake.
+2. Search for the name of the test in [the Node.js issue tracker](https://github.com/nodejs/node/issues)
+   and see if there is already an issue about it (see the search sketch
+   after this list). If there is, check whether the failures are similar
+   and comment with updates if necessary.
+3. If the flake isn't already tracked by an issue, continue to look into
+   it. In the report for a JS test, check the pull requests in which it
+   failed and see if they have anything in common. If the pull requests
+   appear to be unrelated, it is more likely that the test is a flake.
+4. Search the historical reliability reports in the reliability issue
+   tracker for the name of the test, and see how long the flake has been
+   showing up. Gather information from the historical reports, and
+   [open an issue](https://github.com/nodejs/node/issues/new?assignees=&labels=flaky-test&projects=&template=4-report-a-flaky-test.yml)
+   in the Node.js issue tracker to track the flake.
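+
+Steps 2 and 4 boil down to issue searches, which can also be done from a
+script. The following is a minimal sketch, not part of ncu-ci or any
+official tooling, that queries GitHub's search API for an example test
+name in both trackers; Node.js 18+ is assumed for the global `fetch`:
+
+```ts
+// Sketch: look for existing reports about a test in both issue trackers.
+// The test name used below is only an example.
+interface IssueSearchResult {
+  total_count: number;
+  items: { title: string; html_url: string; state: string }[];
+}
+
+async function searchIssues(repo: string, testName: string): Promise<IssueSearchResult> {
+  const query = encodeURIComponent(`repo:${repo} is:issue "${testName}"`);
+  const res = await fetch(`https://api.github.com/search/issues?q=${query}`, {
+    headers: {
+      accept: 'application/vnd.github+json',
+      // GitHub's REST API requires a User-Agent header.
+      'user-agent': 'reliability-flake-search',
+    },
+  });
+  if (!res.ok) throw new Error(`GitHub API returned ${res.status}`);
+  return res.json() as Promise<IssueSearchResult>;
+}
+
+async function main() {
+  const testName = 'sequential/test-example-flake'; // example name
+  for (const repo of ['nodejs/node', 'nodejs/reliability']) {
+    const { total_count, items } = await searchIssues(repo, testName);
+    console.log(`${repo}: ${total_count} issue(s) mention ${testName}`);
+    for (const { state, title, html_url } of items.slice(0, 5)) {
+      console.log(`  [${state}] ${title} (${html_url})`);
+    }
+  }
+}
+
+main().catch(console.error);
+```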
+
+### Handling flaky JS tests
+
+1. If the flake only started showing up in the last month, check the
+   historical reports to see precisely when it first appeared. Look at
+   commits that landed on the target branch around the same time using
+   `https://github.com/nodejs/node/commits?since=YYYY-MM-DD` (see the
+   sketch after this list) and see if any pull request looks related. If
+   one or more related pull requests can be found, ping the author or
+   reviewer of the pull request, or the team in charge of the related
+   subsystem, in the tracking issue or in private, to see if they can
+   come up with a fix to deflake the test.
+2. If the test has been flaky for more than a month and no one is actively
+   working on it, it is unlikely to go away on its own, and it's time
+   to mark it as flaky. For example, if `parallel/some-flaky-test.js`
+   has been flaky on Windows in the CI, after making sure that there is an
+   issue tracking it, open a pull request to add the following entry to
+   [`test/parallel/parallel.status`](https://github.com/nodejs/node/tree/main/test/parallel/parallel.status):
+
+   ```
+   [$system==win32]
+   # https://github.com/nodejs/node/issues/
+   some-flaky-test: PASS,FLAKY
+   ```
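+
+For step 1 above, browsing `https://github.com/nodejs/node/commits?since=YYYY-MM-DD`
+is usually enough. For reference, the same data can also be pulled from
+GitHub's commit-listing REST API. The sketch below is not part of any
+official tooling; the date window is only an example, and Node.js 18+ is
+assumed for the global `fetch`:
+
+```ts
+// Sketch: list nodejs/node commits that landed around the date a flake
+// first showed up. The date window below is only an example.
+interface CommitEntry {
+  sha: string;
+  html_url: string;
+  commit: { message: string; author: { name: string; date: string } };
+}
+
+async function commitsBetween(since: string, until: string): Promise<CommitEntry[]> {
+  const url = new URL('https://api.github.com/repos/nodejs/node/commits');
+  url.searchParams.set('since', since);
+  url.searchParams.set('until', until);
+  url.searchParams.set('per_page', '100');
+  // Add url.searchParams.set('sha', 'v20.x') to target a release branch
+  // instead of the default branch.
+  const res = await fetch(url.toString(), {
+    headers: {
+      accept: 'application/vnd.github+json',
+      // GitHub's REST API requires a User-Agent header.
+      'user-agent': 'reliability-flake-triage',
+    },
+  });
+  if (!res.ok) throw new Error(`GitHub API returned ${res.status}`);
+  return res.json() as Promise<CommitEntry[]>;
+}
+
+// Example window: the flake started showing up in reports around 2023-09-20.
+commitsBetween('2023-09-18T00:00:00Z', '2023-09-21T00:00:00Z').then((commits) => {
+  for (const { sha, commit, html_url } of commits) {
+    const subject = commit.message.split('\n')[0];
+    console.log(`${sha.slice(0, 10)} ${commit.author.date} ${subject}`);
+    console.log(`  ${html_url}`);
+  }
+});
+```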
+
+### Identifying infrastructure issues
+
+In the reliability reports, `Jenkins Failure`, `Git Failure` and
+`Build Failure` are generally infrastructure issues and can be
+handled by the `nodejs/build` team. Typical infrastructure
+issues include:
 
-   - Title should be `Investigate path/under/the/test/directory/without/extension`,
-     for example `Investigate async-hooks/test-zlib.zlib-binding.deflate`.
-
-2. Add the `Flaky Test` label and relevant subsystem labels (TODO: create
-   useful labels).
-
-3. Open a pull request updating `flakes.json`.
-
-4. Notify the subsystem team related to the flake.
-
-### Infrastructure failures
-
-When the CI run fails because:
-
-- There are network connection issues
-- There are tests fail with `ENOSPAC` (No space left on device)
 - The CI machine has trouble pulling source code from the repository
-
-Do the following:
-
-1. Search in this repo with the error message and see if there is any open
-   issue about this.
-2. If there is an existing issue, wait until the problem gets fixed.
-3. If there are no similar issues, open a new one with
-   [the build infra issue template](./templates/infra.txt).
-4. Add label `Build Infra`.
-5. Notify the `@nodejs/build-infra` team in the issue.
-
-### Build File Failures
-
-When the CI run of a PR that does not touch the build files ends with build
-failures (e.g. the run ends before the test runner has a chance to run):
-
-1. Search in this repo with the error message that contains keywords like
-   `fatal`, `error`, etc.
-2. If there is a similar issue, add a reply there using the
-   [reproduction template](./templates/build-file-repro.txt).
-3. If there are no similar issues, open a new one with
-   [the build file issue template](./templates/build-file.txt).
-4. Add label `Build Files`.
-5. Notify the `@nodejs/build-files` team in the issue.
+- The CI machine has trouble communicating with the Jenkins server
+- Build timing out
+- Parent job fails to trigger sub builds
+
+Sometimes infrastructure issues can show up in the tests too: for example,
+tests can fail with `ENOSPC` (No space left on device), in which case the
+machine needs to be cleaned up to free disk space.
+
+Some infrastructure issues go away on their own, but if the same kind
+of infrastructure issue has been failing multiple pull requests and
+persists for more than a day, it's time to take action.
+
+### Handling infrastructure issues
+
+Check the [Node.js build issue tracker](https://github.com/nodejs/build/issues)
+to see if there is already an open issue about the problem. If there isn't,
+open a new issue or ask around in the `#nodejs-build` channel in the
+OpenJS Slack.
+
+When reporting infrastructure issues, it's important to include
+information about the particular machines where the issues happen.
+On the Jenkins job page of the failed CI build whose logs show the
+infrastructure issue (not to be confused with the parent build that
+triggers the sub build that has the issue), there is normally a line
+in the top-right corner similar to
+`Took 16 sec on test-equinix-ubuntu2004_container-armv7l-1`.
+In this case, `test-equinix-ubuntu2004_container-armv7l-1`
+is the machine having infrastructure issues, and this name should be
+included in the report.
 
 ## TODO
 
-- [ ] Settle down on the flake database schema
 - [ ] Read the flake database in ncu-ci so people can quickly tell if
-  a failure is a flake
+      a failure is a flake
 - [ ] Automate the report process in ncu-ci
 - [ ] Migrate existing issues in nodejs/node and nodejs/build, close outdated
-  ones.
-- [ ] Automate CI health history tracking
+      ones.