-
Notifications
You must be signed in to change notification settings - Fork 285
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: don't page unless you're absolutely sure it's worth waking someone up for #5087
Conversation
fc54c2a
to
96ee6f0
Compare
.github/workflows/e2e-validation.yml
Outdated
# Check the exit code | ||
if [ -z "$exit_code" ]; then | ||
echo "Cypress Test task succeeded!" | ||
break | ||
fi | ||
|
||
if [ $n -ge $max_retries ]; then |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will this ever run? I can't remember if until [ $n -ge $max_retries ]
is inclusive of the last run or if it will won't run when n>$max_retries
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great catch, I'll move it back
echo "last_exit_code=$exit_code" >> "$GITHUB_ENV" | ||
exit "$last_exit_code" | ||
|
||
- name: Upload artifact if exit code 53 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does 53 mean?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See description
For the API tests, to assert whether the test actually failed while running our code, I made it throw exit code 53, which is a random code I devised that should not match any other orchestration failure from deno, dependencies or otherwise.
d5045ce
to
1367202
Compare
break | ||
fi | ||
done | ||
# If at least one valid failure marker is present, then page |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
one thought here - we could send different metadata to firehydrant and have firehydrant do the routing if it's page-able... I'm like 50/50 on this though, so your call
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The tricky bit here was actually isolating when a signal should be sent at all into firehydrant. We could possibly split it so that for each CI failure there are various failure modes, i.e.:
- If there is a github failure we pass a "github failure" metadata key
- If there is a cypress orchestration failure we send that as a metadata key
- If unmatched, we page(?)
It feels like we would spend a lot of time trying to increase our filter accuracy to more actuely catch + report on different failure modes.
It certainly would help with metrics, i.e. we'd see exactly how often certain failure modes happen, but I think a ping into #_alerts_internal should be sufficient for now. If this doesn't work out I think what you suggested is probably the way to go
I'm not particularly happy with this indirect assertion system, but logically it ties together and should hold us in good stead moving forward to prevent unnecessary pages.
It works off the premise that the only reason we should page someone from CI is:
Previously we had a bunch of other failure scenarios which would have also paged, such as:
Now for every time the system wants to assert whether to page someone, the CI runner MUST have a valid failure artifact or marker pulled from the CI workflow.
ALL
failures, regardless of cause will still post into #_alerts_internal to let us know there was an orchestration issue of some manner. Should save a lot of noise in paging people with no actionable result.For the API tests, to assert whether the test actually failed while running our code, I made it throw exit code 53, which is a random code I devised that should not match any other orchestration failure from deno, dependencies or otherwise.
Also fixed an issue with the Slack notification for failed tests not working because it didn't have the environment to pull the Slack token.
Developing this was a proper grind