What I would want is an OpenTelemetry version of Honeycomb's
@mambax RunsOn is a replacement for the philips-labs tool and ships with CloudWatch metrics for all workflow jobs, broken down by minutes consumed, repository, workflow name, job name, instance type, conclusion (success/failure/canceled), etc.
I am just reposting this discussion here as advised by @yaananth.
To be honest, since I last posted this I feel the need for it within our company more than ever. We plan to provision hundreds of runners for any team, and it might indeed end in a bloodbath without monitoring of who uses what.
We provide our teams with runners on AWS using philips-labs/terraform-aws-github-runner, a Terraform module for scalable GitHub Actions runners on AWS.
This makes them independent and they can just request one of them with
Cool, no?
What is not cool, though, is that (until now) we have zero transparency/observability. It is a question of when, not if, the first teams will hog the runners.
Yes, they autoscale, but we are not ready to burn money just because someone thinks they need to install modules 7 times, codes an endless loop, or (worst case) mines some 🪙.
Also, they should break down their tests into smaller, fast-feedback bites.
Now, what we lack, and what I cannot find anywhere out there, is some way to observe the runners. I mean requirements along these lines:
Which job is executed the most?
Which job fails the most?
Which job takes the longest?
Which step of which job takes the longest or fails the most often?
Etc., etc. Let’s just say I want to observe the runners, on a per-runner basis. I know there is, e.g., timeout-minutes for steps, but it’s the wrong way around.
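To make the questions above a bit more concrete, here is a minimal sketch of the kind of aggregation I have in mind. It assumes job records shaped like the job objects the GitHub REST API returns (name, conclusion, started_at, completed_at). The sample data and the helper names (summarize, top) are made up for illustration, and how the records get collected from the runners is left open.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical sample records; the field names mirror GitHub's job objects,
# the values are invented for this sketch.
SAMPLE_JOBS = [
    {"name": "build", "conclusion": "success",
     "started_at": "2024-01-01T10:00:00Z", "completed_at": "2024-01-01T10:05:00Z"},
    {"name": "build", "conclusion": "success",
     "started_at": "2024-01-01T11:00:00Z", "completed_at": "2024-01-01T11:05:00Z"},
    {"name": "build", "conclusion": "failure",
     "started_at": "2024-01-01T12:00:00Z", "completed_at": "2024-01-01T12:02:00Z"},
    {"name": "test", "conclusion": "failure",
     "started_at": "2024-01-01T10:00:00Z", "completed_at": "2024-01-01T10:20:00Z"},
    {"name": "test", "conclusion": "failure",
     "started_at": "2024-01-01T11:00:00Z", "completed_at": "2024-01-01T11:10:00Z"},
    {"name": "lint", "conclusion": "success",
     "started_at": "2024-01-01T10:00:00Z", "completed_at": "2024-01-01T10:01:00Z"},
]

TIME_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def duration_seconds(job):
    """Wall-clock duration of one job record in seconds."""
    start = datetime.strptime(job["started_at"], TIME_FORMAT)
    end = datetime.strptime(job["completed_at"], TIME_FORMAT)
    return (end - start).total_seconds()


def summarize(jobs):
    """Aggregate run count, failure count and total duration per job name."""
    stats = defaultdict(lambda: {"runs": 0, "failures": 0, "total_seconds": 0.0})
    for job in jobs:
        entry = stats[job["name"]]
        entry["runs"] += 1
        if job["conclusion"] == "failure":
            entry["failures"] += 1
        entry["total_seconds"] += duration_seconds(job)
    return dict(stats)


def top(stats, score):
    """Name of the job with the highest score."""
    return max(stats, key=lambda name: score(stats[name]))


stats = summarize(SAMPLE_JOBS)
most_run = top(stats, lambda s: s["runs"])
most_failing = top(stats, lambda s: s["failures"])
slowest = top(stats, lambda s: s["total_seconds"] / s["runs"])  # by average duration

print("executed the most:", most_run)
print("fails the most:", most_failing)
print("takes the longest on average:", slowest)
```

The same grouping could be keyed by repository, team, or step instead of job name; the hard part is getting the records in the first place, which is exactly what I am asking about.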
I want to observe which teams “violate” our guidelines and mentor them into the pattern. Of course, a “hard limit” for jobs is an option, but then again this takes away all freedom for special cases.
What “Runner Observability” exists out there?
Thank you 🤗