feat: don't page unless you're absolutely sure it's worth waking someone up for #5087

Merged
merged 1 commit into from
Dec 10, 2024

Conversation

johnrwatson
Contributor

@johnrwatson johnrwatson commented Dec 8, 2024

I'm not particularly happy with this indirect assertion system, but logically it ties together and should hold us in good stead moving forward to prevent unnecessary pages.

It works off the premise that we should only page someone from CI when both of these are true:

  • We have a valid test failure result
  • It's the correct environment

Previously we had a bunch of other failure scenarios which would have also paged, such as:

  • NPM outage
  • GitHub orchestration or infrastructure issue on their side
  • Cypress results not uploading correctly in CI
  • Other networking issue/blip
  • Other 3rd party dependency issue

Now, whenever the system decides whether to page someone, the CI runner MUST have a valid failure artifact or marker pulled from the CI workflow.
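The gating rule described above can be sketched as a small shell function. This is a minimal sketch, not the code from this PR: the function name `should_page`, the marker directory layout, the `*.failed` file suffix, and the `production` environment name are all assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: page only when a valid failure marker exists
# AND the run targeted the environment we page for.

# should_page MARKER_DIR ENVIRONMENT [PAGEABLE_ENV]
# Returns 0 (page) or 1 (do not page).
should_page() {
  local marker_dir="$1" environment="$2" pageable_env="${3:-production}"

  # Wrong environment: never page.
  [ "$environment" = "$pageable_env" ] || return 1

  # No marker directory means the tests never produced a verdict
  # (for example an NPM outage or a runner problem), so do not page.
  [ -d "$marker_dir" ] || return 1

  # At least one failure marker file must be present.
  local marker
  for marker in "$marker_dir"/*.failed; do
    [ -e "$marker" ] && return 0
  done
  return 1
}
```

A caller could then write `if should_page ./markers "$ENVIRONMENT"; then ...; fi`, where `./markers` is a hypothetical directory holding the downloaded artifacts.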

ALL failures, regardless of cause, will still post into #_alerts_internal to let us know there was an orchestration issue of some kind. This should save a lot of noise from paging people with no actionable result.
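One way to picture the split between the Slack notice and the page is sketched below. It rests on assumptions: `alert_targets` and its two arguments are hypothetical names, the environment check is left out for brevity, and only the channel name `#_alerts_internal` comes from the description.

```shell
# Hypothetical sketch: decide where a finished CI run should alert.
# alert_targets JOB_RESULT HAS_VALID_MARKER
#   JOB_RESULT        "success" or "failure"
#   HAS_VALID_MARKER  "true" when a valid failure marker was found
# Prints one destination per line.
alert_targets() {
  local job_result="$1" has_valid_marker="$2"

  # Successful runs alert nobody.
  [ "$job_result" = "failure" ] || return 0

  # Every failure, whatever the cause, is posted to Slack.
  echo "slack:#_alerts_internal"

  # Only failures backed by a valid marker also page.
  if [ "$has_valid_marker" = "true" ]; then
    echo "page"
  fi
}
```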

For the API tests, to assert whether the test actually failed while running our code, I made it throw exit code 53, which is a random code I devised that should not match any other orchestration failure from deno, dependencies or otherwise.
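A sketch of how a wrapper might use the sentinel exit code to tell a genuine test failure apart from an orchestration failure. Only the value 53 comes from the description; the function name `classify_exit_code`, its labels, and the stand-in command are assumptions.

```shell
# Hypothetical sketch: interpret the exit code of the API test run.
TEST_FAILURE_EXIT_CODE=53   # sentinel value described in this PR

# classify_exit_code CODE
# Prints "passed", "test-failure", or "orchestration-failure".
classify_exit_code() {
  case "$1" in
    0) echo "passed" ;;
    "$TEST_FAILURE_EXIT_CODE") echo "test-failure" ;;
    *) echo "orchestration-failure" ;;
  esac
}

# Usage: capture the status without aborting under `set -e`.
# `sh -c 'exit 53'` stands in for the real test command.
status=0
sh -c 'exit 53' || status=$?
classify_exit_code "$status"   # prints: test-failure
```

Any other non-zero code (a crashed runtime, a failed dependency install) falls through to `orchestration-failure`, which is the case that should not page.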


Also fixed an issue with the Slack notification for failed tests not working because it didn't have the environment to pull the Slack token.

Developing this was a proper grind

Screenshot 2024-12-09 at 23 04 05

@github-actions github-actions bot added the A-ci Area: CI configuration files and scripts label Dec 8, 2024
@johnrwatson johnrwatson force-pushed the feat/do-not-page-zack-when-failed-npm branch 2 times, most recently from fc54c2a to 96ee6f0 Compare December 10, 2024 00:38
# Check the exit code
if [ -z "$exit_code" ]; then
  echo "Cypress Test task succeeded!"
  break
fi

if [ $n -ge $max_retries ]; then
Contributor

Will this ever run? I can't remember if until [ $n -ge $max_retries ] is inclusive of the last run or if it won't run when n>$max_retries

Contributor Author

Great catch, I'll move it back
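For reference, `until` evaluates its condition before each iteration, so inside the loop body `$n` is still below `$max_retries` unless the body has already incremented it. The sketch below illustrates this with a hypothetical loop; where the real workflow increments `n` is not visible in this excerpt, so both placements are shown as assumptions.

```shell
# Hypothetical sketch: does an in-loop `[ $n -ge $max_retries ]` ever fire?
# count_hits INCREMENT_POSITION   ("before" or "after" the in-loop check)
# Prints how many times the in-loop check was true.
count_hits() {
  local position="$1" max_retries=3 n=0 hits=0
  until [ "$n" -ge "$max_retries" ]; do
    if [ "$position" = "before" ]; then n=$((n + 1)); fi
    if [ "$n" -ge "$max_retries" ]; then
      hits=$((hits + 1))
    fi
    if [ "$position" = "after" ]; then n=$((n + 1)); fi
  done
  echo "$hits"
}

count_hits before   # prints 1: the check fires on the final attempt
count_hits after    # prints 0: the check can never be true
```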

echo "last_exit_code=$exit_code" >> "$GITHUB_ENV"
exit "$last_exit_code"

- name: Upload artifact if exit code 53
Contributor

What does 53 mean?

Contributor Author

See description

For the API tests, to assert whether the test actually failed while running our code, I made it throw exit code 53, which is a random code I devised that should not match any other orchestration failure from deno, dependencies or otherwise.

@johnrwatson johnrwatson force-pushed the feat/do-not-page-zack-when-failed-npm branch from d5045ce to 1367202 Compare December 10, 2024 18:56
break
fi
done
# If at least one valid failure marker is present, then page
Contributor

one thought here - we could send different metadata to firehydrant and have firehydrant do the routing if it's page-able... I'm like 50/50 on this though, so your call

Contributor Author

The tricky bit here was actually isolating when a signal should be sent at all into firehydrant. We could possibly split it so that for each CI failure there are various failure modes, i.e.:

  • If there is a github failure we pass a "github failure" metadata key
  • If there is a cypress orchestration failure we send that as a metadata key
  • If unmatched, we page(?)

It feels like we would spend a lot of time trying to increase our filter accuracy to more acutely catch + report on different failure modes.

It certainly would help with metrics, i.e. we'd see exactly how often certain failure modes happen, but I think a ping into #_alerts_internal should be sufficient for now. If this doesn't work out I think what you suggested is probably the way to go
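If the metadata route is ever taken, the classification step might look roughly like this. It is purely illustrative: the function name, the log patterns, and the metadata values are invented for the sketch and are not FireHydrant fields or real log lines from this pipeline.

```shell
# Hypothetical sketch: map a CI log to a failure-mode metadata value.
# classify_failure LOG_FILE
# Prints "github-failure", "cypress-orchestration-failure", or "unmatched".
classify_failure() {
  local log_file="$1"
  # The patterns below are placeholders, not real messages.
  if grep -qi "github infrastructure error" "$log_file"; then
    echo "github-failure"
  elif grep -qi "cypress upload failed" "$log_file"; then
    echo "cypress-orchestration-failure"
  else
    # Per the thread, whether an unmatched failure should page is still open.
    echo "unmatched"
  fi
}
```

The cost noted in the thread shows up here directly: every new failure mode needs another pattern, and a stale pattern silently pushes a failure into `unmatched`.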

@johnrwatson johnrwatson added this pull request to the merge queue Dec 10, 2024
Merged via the queue into main with commit af6db4d Dec 10, 2024
7 checks passed
@johnrwatson johnrwatson deleted the feat/do-not-page-zack-when-failed-npm branch December 10, 2024 20:10
Labels
A-ci Area: CI configuration files and scripts
3 participants