feat: don't page unless you're absolutely sure it's worth waking someone up for #5087

johnrwatson · 2024-12-08T19:38:51Z

I'm not particularly happy with this indirect assertion system, but logically it ties together and should hold us in good stead moving forward to prevent unnecessary pages.

It works off the premise that we should page someone from CI only when:

We have a valid test failure result
It's the correct environment

Previously we had a bunch of other failure scenarios which would have also paged, such as:

NPM outage
GitHub orchestration or infrastructure issue on their side
Cypress results not uploading correctly in CI
Other networking issue/blip
Other 3rd party dependency issue

Now, every time the system decides whether to page someone, the CI runner MUST have a valid failure artifact or marker pulled from the CI workflow.

ALL failures, regardless of cause, will still post into #_alerts_internal to let us know there was an orchestration issue of some kind. This should cut a lot of the noise from paging people with no actionable result.
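
The decision rule described above can be sketched in shell roughly as follows. This is a minimal illustration, not the workflow's actual code: the function name `decide_action`, the `*.failed` marker naming, the marker directory, and the environment name `production` are all assumptions made for the example.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: decide whether a CI failure should page or only alert.
# Assumptions (not from the PR): failure markers are files named *.failed
# inside a directory, and the page-worthy environment is "production".

decide_action() {
  local marker_dir="$1"               # where failure markers were downloaded
  local environment="$2"              # environment the tests ran against
  local paging_env="${3:-production}" # environment that is allowed to page

  # Look for at least one failure marker; a missing directory yields nothing.
  local first_marker
  first_marker=$(find "$marker_dir" -type f -name '*.failed' 2>/dev/null | head -n 1)

  if [ -n "$first_marker" ] && [ "$environment" = "$paging_env" ]; then
    echo "page"
  else
    # Every failure is still reported to the internal alerts channel.
    echo "alert-only"
  fi
}
```

With this shape, an NPM outage or a failed artifact upload leaves no marker behind, so the result is `alert-only` regardless of environment.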

For the API tests, to assert whether the test actually failed while running our code, I made it throw exit code 53, which is a random code I devised that should not match any other orchestration failure from deno, dependencies or otherwise.
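
As a rough sketch of how a sentinel exit code can separate a real test failure from everything else, consider the following. Only the value 53 comes from this PR; the helper names `classify_exit_code` and `run_and_classify` are hypothetical and do not appear in the workflow.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: classify how a test command ended by its exit code.
TEST_FAILURE_CODE=53  # sentinel described in this PR

classify_exit_code() {
  case "$1" in
    0) echo "success" ;;
    "$TEST_FAILURE_CODE") echo "test-failure" ;;
    # Anything else (crash, network error, missing dependency) is treated
    # as an orchestration problem rather than a failing test.
    *) echo "orchestration-failure" ;;
  esac
}

run_and_classify() {
  local code=0
  # Send the command's own output to stderr so stdout carries only the label.
  "$@" >&2 || code=$?
  classify_exit_code "$code"
}
```

For example, `run_and_classify sh -c 'exit 53'` prints `test-failure`, while `run_and_classify sh -c 'exit 1'` prints `orchestration-failure`.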

Also fixed an issue with the Slack notification for failed tests not working because it didn't have the environment to pull the Slack token.

Developing this was a proper grind

sprutton1 · 2024-12-10T15:39:33Z

.github/workflows/e2e-validation.yml

            # Check the exit code
            if [ -z "$exit_code" ]; then
              echo "Cypress Test task succeeded!"
              break
            fi

+            if [ $n -ge $max_retries ]; then


Will this ever run? I can't remember if until [ $n -ge $max_retries ] is inclusive of the last run or if it won't run when n>$max_retries

Great catch, I'll move it back
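
For reference, the `until` behaviour in question can be checked with a small standalone experiment, separate from the workflow itself. `until` evaluates its condition before each iteration, so the body never starts with `n` already at `max_retries`; whether a `-ge` check inside the body can fire depends on whether `n` is incremented before or after that check. The function names below are hypothetical.

```shell
#!/usr/bin/env bash
# Illustrative only: count how often an in-body "-ge" check fires.

check_before_increment() {
  local n=0 max_retries=3 fired=0
  until [ "$n" -ge "$max_retries" ]; do
    # n is at most max_retries - 1 here, so this is never true.
    if [ "$n" -ge "$max_retries" ]; then fired=$((fired + 1)); fi
    n=$((n + 1))
  done
  echo "$fired"
}

check_after_increment() {
  local n=0 max_retries=3 fired=0
  until [ "$n" -ge "$max_retries" ]; do
    n=$((n + 1))
    # On the final iteration n equals max_retries, so this fires once.
    if [ "$n" -ge "$max_retries" ]; then fired=$((fired + 1)); fi
  done
  echo "$fired"
}

check_before_increment  # prints 0
check_after_increment   # prints 1
```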

sprutton1 · 2024-12-10T15:42:21Z

.github/workflows/run-api-test.yml

+          echo "last_exit_code=$exit_code" >> "$GITHUB_ENV"
+          exit "$last_exit_code"
+
+      - name: Upload artifact if exit code 53


What does 53 mean?

See description

For the API tests, to assert whether the test actually failed while running our code, I made it throw exit code 53, which is a random code I devised that should not match any other orchestration failure from deno, dependencies or otherwise.

britmyerss · 2024-12-10T20:00:01Z

.github/workflows/e2e-validation.yml

+              break
+            fi
+          done
+          # If at least one valid failure marker is present, then page


one thought here - we could send different metadata to firehydrant and have firehydrant do the routing if it's page-able... I'm like 50/50 on this though, so your call

The tricky bit here was actually isolating when a signal should be sent at all into firehydrant. We could possibly split it so that for each CI failure there are various failure modes, i.e.:

If there is a github failure we pass a "github failure" metadata key

If there is a cypress orchestration failure we send that as a metadata key

If unmatched, we page(?)

It feels like we would spend a lot of time trying to improve our filter accuracy so that it catches and reports on different failure modes more accurately.

It certainly would help with metrics, i.e. we'd see exactly how often certain failure modes happen, but I think a ping into #_alerts_internal should be sufficient for now. If this doesn't work out I think what you suggested is probably the way to go
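
For illustration only, the metadata-key routing discussed above might look something like the sketch below. Nothing here is from the PR: the function name `route_failure`, the log patterns, and the key names are all invented, and a real classifier would need far more careful matching.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: map a CI failure log to a metadata key and a page flag.
# Patterns and key names are assumptions made for this example.

route_failure() {
  local log_file="$1"
  if grep -qi 'github' "$log_file"; then
    echo "failure_mode=github-failure page=false"
  elif grep -qi 'cypress.*upload' "$log_file"; then
    echo "failure_mode=cypress-orchestration-failure page=false"
  else
    # Unmatched failures fall through to paging, as floated in the discussion.
    echo "failure_mode=unmatched page=true"
  fi
}
```

The trade-off is the one raised in the thread: every new failure mode needs its own pattern, so accuracy depends on how much effort goes into maintaining the filters.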

github-actions bot added the A-ci Area: CI configuration files and scripts label Dec 8, 2024

johnrwatson temporarily deployed to tools December 8, 2024 19:39 — with GitHub Actions Inactive

johnrwatson temporarily deployed to tools December 8, 2024 19:47 — with GitHub Actions Inactive

johnrwatson temporarily deployed to tools December 8, 2024 19:49 — with GitHub Actions Inactive

johnrwatson had a problem deploying to tools December 8, 2024 19:49 — with GitHub Actions Failure

johnrwatson temporarily deployed to tools December 8, 2024 19:52 — with GitHub Actions Inactive

johnrwatson had a problem deploying to tools December 8, 2024 19:53 — with GitHub Actions Failure

johnrwatson temporarily deployed to tools December 10, 2024 00:26 — with GitHub Actions Inactive

johnrwatson temporarily deployed to tools December 10, 2024 00:28 — with GitHub Actions Inactive

johnrwatson had a problem deploying to tools December 10, 2024 00:28 — with GitHub Actions Failure

johnrwatson temporarily deployed to tools December 10, 2024 00:28 — with GitHub Actions Inactive

johnrwatson had a problem deploying to tools December 10, 2024 00:29 — with GitHub Actions Failure

johnrwatson temporarily deployed to tools December 10, 2024 00:29 — with GitHub Actions Inactive

johnrwatson had a problem deploying to tools December 10, 2024 00:29 — with GitHub Actions Failure

johnrwatson temporarily deployed to tools December 10, 2024 00:29 — with GitHub Actions Inactive

johnrwatson temporarily deployed to tools December 10, 2024 00:30 — with GitHub Actions Inactive

johnrwatson force-pushed the feat/do-not-page-zack-when-failed-npm branch 2 times, most recently from fc54c2a to 96ee6f0 Compare December 10, 2024 00:38

johnrwatson requested a review from britmyerss December 10, 2024 12:45

sprutton1 reviewed Dec 10, 2024

feat: dont page if no artifacts were created from the test to reduce test flake (1367202)

johnrwatson force-pushed the feat/do-not-page-zack-when-failed-npm branch from d5045ce to 1367202 Compare December 10, 2024 18:56

johnrwatson requested a review from sprutton1 December 10, 2024 18:56

britmyerss reviewed Dec 10, 2024

britmyerss approved these changes Dec 10, 2024

johnrwatson added this pull request to the merge queue Dec 10, 2024

Merged via the queue into main with commit af6db4d Dec 10, 2024
7 checks passed

johnrwatson deleted the feat/do-not-page-zack-when-failed-npm branch December 10, 2024 20:10
