Observability of runners #1158

mambax · 2021-06-16T14:15:51Z

I am just reposting this discussion here as advised by @yaananth.

To be honest, since I last posted this, I feel the need for it within our company more than ever. We plan to provision hundreds of runners for any team that wants them, and it might well end in a bloodbath without monitoring of who uses what.

We provide our teams with runners on AWS using philips-labs/terraform-aws-github-runner, a Terraform module for scalable GitHub Actions runners on AWS.

This makes the teams independent: they can simply request a runner with

runs-on: [aws]

Cool no?

What is not cool, though, is that (until now) we have zero transparency/observability. It is a question of when, not if, the first teams will hog the runners.

Yes, they autoscale, but we are not ready to burn money just because someone thinks they need to install modules 7 times or codes an endless loop (or, worst case, mines some 🪙).
Also, teams should break down their tests into smaller, fast-feedback bites.

What we lack, and what I cannot find anywhere, is some way to observe the runners. I mean requirements along these lines:

Which job is executed the most?
Which job fails the most?
Which job takes the longest?
Which step from which job takes the longest, fails the most often?
And so on. Let’s just say I want to observe the runners, on a per-runner basis. I know there is, e.g., timeout-minutes for steps, but it’s the wrong way around.
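As a rough illustration of the questions above, here is a minimal Python sketch of the kind of aggregation meant. It assumes job records have already been collected somehow into a list of dictionaries; the field names `job`, `conclusion`, and `duration_s` and the sample data are hypothetical and not taken from any GitHub API.

```python
from collections import defaultdict

# Hypothetical job records; field names and values are made up for illustration.
JOBS = [
    {"job": "build", "conclusion": "success", "duration_s": 120},
    {"job": "build", "conclusion": "success", "duration_s": 110},
    {"job": "build", "conclusion": "failure", "duration_s": 95},
    {"job": "test", "conclusion": "failure", "duration_s": 300},
    {"job": "test", "conclusion": "failure", "duration_s": 310},
    {"job": "e2e", "conclusion": "success", "duration_s": 900},
]


def summarize(jobs):
    """Aggregate run count, failure count, and total duration per job name."""
    stats = defaultdict(lambda: {"runs": 0, "failures": 0, "total_s": 0})
    for record in jobs:
        entry = stats[record["job"]]
        entry["runs"] += 1
        if record["conclusion"] == "failure":
            entry["failures"] += 1
        entry["total_s"] += record["duration_s"]
    return dict(stats)


def top(stats, metric):
    """Return the job name with the highest value of the given metric."""
    return max(stats, key=lambda name: metric(stats[name]))


stats = summarize(JOBS)
most_executed = top(stats, lambda s: s["runs"])
most_failing = top(stats, lambda s: s["failures"])
slowest_on_average = top(stats, lambda s: s["total_s"] / s["runs"])

print("most executed:", most_executed)
print("fails the most:", most_failing)
print("slowest on average:", slowest_on_average)
```

The same grouping could be keyed on a (job, step) pair or on a runner name to answer the per-step and per-runner variants of these questions.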

I want to observe which teams “violate” our guidelines and mentor them toward the intended pattern. Of course, a “hard limit” for jobs is an option, but then again this takes away all freedom for special cases.

What “Runner Observability” exists out there?

Thank you 🤗

ringods · 2021-11-04T08:57:34Z

What I would want is an OpenTelemetry version of Honeycomb's buildevents tool.

https://github.com/honeycombio/buildevents

1 reply

lizthegrey Nov 11, 2021

A version of this exists (https://github.com/packethost/otel-cli), but it would need to be wired in.

crohr · 2024-04-12T12:51:02Z

@mambax RunsOn is a replacement for the philips-labs tool and ships with CloudWatch metrics for all the workflow jobs, with dimensions across minutes consumed, repository, workflow name, job name, instance type, conclusion (success/failure/canceled), etc.
