Scaling delay for pod provisioning with higher job spikes #3276

Open · 4 tasks done
ventsislav-georgiev opened this issue Feb 8, 2024 · 7 comments
Labels
bug Something isn't working gha-runner-scale-set Related to the gha-runner-scale-set mode

@ventsislav-georgiev

Controller Version

0.8.2

Deployment Method

Helm

Checks

  • This isn't a question or user support case (For Q&A and community support, go to Discussions).
  • I've read the Changelog before submitting this issue and I'm sure it's not due to any recently-introduced backward-incompatible changes

To Reproduce

Reproducible by modifying the minimum runners (minRunners) of an AutoscalingRunnerSet from 0 to 30, 100, and 400.
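
For context, a minimal sketch of the relevant gha-runner-scale-set Helm values, assuming an org-level scale set; the GitHub URL and secret name are placeholders and only minRunners/maxRunners matter for the repro:

```yaml
# Illustrative values.yaml for the gha-runner-scale-set chart.
# githubConfigUrl and githubConfigSecret are placeholders,
# not values taken from this issue.
githubConfigUrl: https://github.com/my-org
githubConfigSecret: github-app-credentials
minRunners: 400   # bumped from 0 to 30, then 100, then 400 to reproduce
maxRunners: 400
```

Re-applying the chart with the bumped value (e.g. `helm upgrade --reuse-values --set minRunners=400` on the installed release) should be enough to trigger the scale-out.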

Describe the bug

We are experiencing scaling issues during periods of higher demand. When our CI triggers a large number of jobs, the gha-runner-scale-set has a hard time spinning up pods, and they appear to be stuck in a pending state.

Here are some screen captures of scaling from 0 to 30, to 100 and to 400 runners:

0 to 30 (took 15s to create 30 pods)

30target_15s_to_30pod.mov

0 to 100 (took 40s to create 30 pods)

100target_40s_to_30pod.mov

0 to 400 (took 2m 8s to create 30 pods)

400target_2m8s_to_30pod.mov

Describe the expected behavior

The time to bring up the first pods should be the same regardless of the target size. Otherwise CI slows down at exactly the moment it is needed most.

Additional Context

-

Controller Logs

https://gist.github.com/ventsislav-georgiev/f318f84b6bc6e801d733907087ce287c

Runner Pod Logs

[Irrelevant]
@ventsislav-georgiev ventsislav-georgiev added bug Something isn't working gha-runner-scale-set Related to the gha-runner-scale-set mode needs triage Requires review from the maintainers labels Feb 8, 2024
Contributor

github-actions bot commented Feb 8, 2024

Hello! Thank you for filing an issue.

The maintainers will triage your issue shortly.

In the meantime, please take a look at the troubleshooting guide for bug reports.

If this is a feature request, please review our contribution guidelines.

@xunholy
Contributor

xunholy commented Feb 14, 2024

@ventsislav-georgiev we're observing similar behaviour. It seems to batch 50 jobs from the queue every 60s, and while the min runners are picking up and completing jobs, it only brings in 50 newly queued jobs within that window, so you'll always see 50 or fewer runners start within a minute.

We don't yet fully understand why this is happening, but we have raised it with GitHub.

@nikola-jokic nikola-jokic removed the needs triage Requires review from the maintainers label Feb 19, 2024
@xunholy
Contributor

xunholy commented Mar 24, 2024

@ventsislav-georgiev we've been looking internally at what/how this can be resolved, with no real success. I'm hoping that GitHub can provide more feedback here...

We've gone as far as confirming this isn't network bandwidth, compute, or storage related. We do have a forward proxy that we use to reach GitHub, which I'll double-check this week; however, it should be able to handle far more load than 50 runner instances, and I haven't seen any error logs.

@nikola-jokic any update from GitHub's side? I know we internally escalated a support ticket, which was unfortunately closed with an unsatisfactory answer: https://support.github.com/ticket/enterprise/1617/2658092

@diegotecbr

We are experiencing the same behavior in version 0.9.2. How are these issues being handled?

@int128
Contributor

int128 commented Oct 1, 2024

I tried splitting the runners as below, and it slightly reduced runner startup time (see the workflow sketch after the numbers). In our repository, 400+ jobs can run at once.

Workflow changes like:

  • Before
    • Test jobs run on runner A
    • Deploy jobs run on runner A
    • Other jobs run on runner A
  • After
    • Test jobs run on runner A
    • Deploy jobs run on runner B
    • Other jobs run on runner C

Startup time changes:

  • Before: 90s (75%tile), 136s (90%tile)
  • After: 71s (75%tile), 122s (90%tile)
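
A minimal sketch of the "After" layout, assuming three separately installed runner scale sets; the names runner-a/runner-b/runner-c and the job commands are placeholders, not names from this thread:

```yaml
# Sketch only: scale-set names and commands are placeholders.
name: ci
on: push
jobs:
  test:
    runs-on: runner-a   # test jobs stay on the original scale set
    steps:
      - run: make test
  deploy:
    runs-on: runner-b   # deploy jobs moved to a second scale set
    steps:
      - run: make deploy
  other:
    runs-on: runner-c   # remaining jobs moved to a third scale set
    steps:
      - run: make lint
```

The likely reason this helps is that each scale set runs its own listener, so queued jobs are spread across several listener sessions instead of funnelling through a single one.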

@krzysztof-magosa

I suggest checking whether you are affected by actions/runner-container-hooks#167.
In our case, ARC waits for files to be copied before spawning new containers, and that is the main delaying factor.

@int128
Contributor

int128 commented Nov 2, 2024

I discussed this issue with @mumoshu. He wrote a patch f58dd76 to reduce the total time taken to reconcile an EphemeralRunner object.

I tested the patch in our organization. According to the listener metrics, job startup duration has improved with the patch. Here are the distribution graphs of job startup duration.

[Image: distribution graphs of job startup duration before and after the patch]
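
For anyone who wants to reproduce the measurement, a sketch of a Prometheus recording rule over the listener's gha_job_startup_duration_seconds histogram, assuming listener metrics are enabled and scraped (the rule names here are arbitrary):

```yaml
# Sketch: p75/p90 of job startup duration from the listener histogram.
# Assumes gha_job_startup_duration_seconds_bucket is being scraped.
groups:
  - name: arc-job-startup
    rules:
      - record: arc:job_startup_duration_seconds:p75
        expr: |
          histogram_quantile(0.75,
            sum(rate(gha_job_startup_duration_seconds_bucket[5m])) by (le))
      - record: arc:job_startup_duration_seconds:p90
        expr: |
          histogram_quantile(0.90,
            sum(rate(gha_job_startup_duration_seconds_bucket[5m])) by (le))
```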
