ARC scalesets runners - evicted ephemeral runner pod treated as "viable" and hangs workflows assigned to it #2656
Replies: 3 comments 1 reply
-
I've seen something similar to this. When setting up runners I was having some issues with GitHub authentication, which meant that the job failed. After cancelling the job the listener would never accept new jobs - I did try manually deleting the pods, but this didn't seem to solve the issue reliably (it worked sometimes); what did work was doing a helm delete and then recreating the scale set with helm as well (rough sketch of the commands below, after the logs).

If I've understood your notes about the root cause correctly, I think I agree - it seems to be something to do with the internal state kept by the controller and/or the listener. They seem to think a runner is still available when it's actually in a failing state (and, as I remember, the pod was already deleted) - I'm basing this on the logs though, I didn't look at the code. I was scaling nodes down to zero in my tests, but I did see the same issue after keeping at least one node available.

These are some of the logs that looked relevant (side note - I was using v2.292 of the runner here, as I copied it from the ARC repo's Windows Dockerfile and didn't notice it was an old version; this turned out to be the cause of my auth issue, since 2.292 dealt with JIT tokens differently 😒):

```
# this is an attempt to run a pipeline after a previous one failed - is the issue here that internal stats are wrong and it thinks a runner or scale set still exists?
# no cluster scaling happens below (windows nodepool = 0 nodes) and no runner pod appears via kubectl get pods -n $runnerNamespace
2023-05-28T06:35:53Z INFO EphemeralRunnerSet Ephemeral runner counts {"ephemeralrunnerset": "arc-runners-default/arc-runner-win-2-292-0-vbmtr", "pending": 0, "running": 0, "finished": 0, "failed": 1, "deleting": 0}
2023-05-28T06:35:53Z INFO EphemeralRunnerSet Scaling comparison {"ephemeralrunnerset": "arc-runners-default/arc-runner-win-2-292-0-vbmtr", "current": 1, "desired": 0}
2023-05-28T06:35:53Z INFO EphemeralRunnerSet Deleting ephemeral runners (scale down) {"ephemeralrunnerset": "arc-runners-default/arc-runner-win-2-292-0-vbmtr", "count": 1}
2023-05-28T06:35:53Z INFO EphemeralRunnerSet No pending or running ephemeral runners running at this time for scale down {"ephemeralrunnerset": "arc-runners-default/arc-runner-win-2-292-0-vbmtr"}
2023-05-28T06:35:53Z INFO AutoscalingRunnerSet Find existing ephemeral runner set {"autoscalingrunnerset": "arc-runners-default/arc-runner-win-2-292-0", "name": "arc-runner-win-2-292-0-vbmtr", "specHash": "5f474959bd"}
2023-05-28T06:47:33Z INFO EphemeralRunnerSet Ephemeral runner counts {"ephemeralrunnerset": "arc-runners-default/arc-runner-win-2-292-0-vbmtr", "pending": 0, "running": 0, "finished": 0, "failed": 1, "deleting": 0}
2023-05-28T06:47:33Z INFO EphemeralRunnerSet Scaling comparison {"ephemeralrunnerset": "arc-runners-default/arc-runner-win-2-292-0-vbmtr", "current": 1, "desired": 1}
2023-05-28T06:47:33Z INFO AutoscalingRunnerSet Find existing ephemeral runner set {"autoscalingrunnerset": "arc-runners-default/arc-runner-win-2-292-0", "name": "arc-runner-win-2-292-0-vbmtr", "specHash": "5f474959bd"}
...
```
```
# these are the logs after cancelling the pipeline (appears to suggest the delete operation wasn't registered by ARC but it does say none were found):
2023-05-28T07:23:39Z INFO EphemeralRunnerSet Ephemeral runner counts {"ephemeralrunnerset": "arc-runners-default/arc-runner-win-2-292-0-vbmtr", "pending": 0, "running": 0, "finished": 0, "failed": 1, "deleting": 0}
2023-05-28T07:23:39Z INFO EphemeralRunnerSet Scaling comparison {"ephemeralrunnerset": "arc-runners-default/arc-runner-win-2-292-0-vbmtr", "current": 1, "desired": 0}
2023-05-28T07:23:39Z INFO EphemeralRunnerSet Deleting ephemeral runners (scale down) {"ephemeralrunnerset": "arc-runners-default/arc-runner-win-2-292-0-vbmtr", "count": 1}
2023-05-28T07:23:39Z INFO EphemeralRunnerSet No pending or running ephemeral runners running at this time for scale down {"ephemeralrunnerset": "arc-runners-default/arc-runner-win-2-292-0-vbmtr"}
2023-05-28T07:23:39Z INFO AutoscalingRunnerSet Find existing ephemeral runner set {"autoscalingrunnerset": "arc-runners-default/arc-runner-win-2-292-0", "name": "arc-runner-win-2-292-0-vbmtr", "specHash": "5f474959bd"}
...

# these are from the win listener logs corresponding with the above:
2023-05-28T06:47:26Z INFO service process message. {"messageId": 43, "messageType": "RunnerScaleSetJobMessages"}
2023-05-28T06:47:26Z INFO service current runner scale set statistics. {"available jobs": 1, "acquired jobs": 0, "assigned jobs": 0, "running jobs": 0, "registered runners": 0, "busy runners": 0, "idle runners": 0}
2023-05-28T06:47:26Z INFO service process batched runner scale set job messages. {"messageId": 43, "batchSize": 1}
2023-05-28T06:47:26Z INFO service job available message received. {"RequestId": 56}
2023-05-28T06:47:26Z INFO auto_scaler acquiring jobs. {"request count": 1, "requestIds": "[56]"}
2023-05-28T06:47:27Z INFO auto_scaler acquired jobs. {"requested": 1, "acquired": 1}
2023-05-28T06:47:28Z INFO auto_scaler deleted message. {"messageId": 43}
2023-05-28T06:47:28Z INFO service waiting for message...
2023-05-28T06:47:33Z INFO service process message. {"messageId": 44, "messageType": "RunnerScaleSetJobMessages"}
2023-05-28T06:47:33Z INFO service current runner scale set statistics. {"available jobs": 0, "acquired jobs": 0, "assigned jobs": 1, "running jobs": 0, "registered runners": 0, "busy runners": 0, "idle runners": 0}
2023-05-28T06:47:33Z INFO service process batched runner scale set job messages. {"messageId": 44, "batchSize": 1}
2023-05-28T06:47:33Z INFO service job assigned message received. {"RequestId": 56}
2023-05-28T06:47:33Z INFO auto_scaler acquiring jobs. {"request count": 0, "requestIds": "[]"}
2023-05-28T06:47:33Z INFO service try scale runner request up/down base on assigned job count {"assigned job": 1, "decision": 1, "min": 0, "max": 2147483647, "currentRunnerCount": 0}
2023-05-28T06:47:33Z INFO KubernetesManager Created merge patch json for EphemeralRunnerSet update {"json": "{\"spec\":{\"replicas\":1}}"}
2023-05-28T06:47:33Z INFO KubernetesManager Ephemeral runner set scaled. {"namespace": "arc-runners-default", "name": "arc-runner-win-2-292-0-vbmtr", "replicas": 1}
2023-05-28T06:47:35Z INFO auto_scaler deleted message. {"messageId": 44}
2023-05-28T06:47:35Z INFO service waiting for message...
2023-05-28T06:55:56Z INFO refreshing_client message queue token is expired during GetNextMessage, refreshing...
2023-05-28T06:55:56Z INFO refreshing token {"githubConfigUrl": "https://github.com/xyzrepo/actions-tests"}
2023-05-28T06:55:56Z INFO getting runner registration token {"registrationTokenURL": "https://api.github.com/repos/xyzrepo/actions-tests/actions/runners/registration-token"}
2023-05-28T06:55:56Z INFO getting Actions tenant URL and JWT {"registrationURL": "https://api.github.com/actions/runner-registration"}
...

# this is the listener when the job is cancelled - stats seem corrupt - line 2 shows all 0 but further down currentRunnerCount = 1
2023-05-28T07:23:39Z INFO service process message. {"messageId": 45, "messageType": "RunnerScaleSetJobMessages"}
2023-05-28T07:23:39Z INFO service current runner scale set statistics. {"available jobs": 0, "acquired jobs": 0, "assigned jobs": 0, "running jobs": 0, "registered runners": 0, "busy runners": 0, "idle runners": 0}
2023-05-28T07:23:39Z INFO service process batched runner scale set job messages. {"messageId": 45, "batchSize": 1}
2023-05-28T07:23:39Z INFO service job completed message received. {"RequestId": 56, "Result": "canceled", "RunnerId": 0, "RunnerName": ""}
2023-05-28T07:23:39Z INFO auto_scaler acquiring jobs. {"request count": 0, "requestIds": "[]"}
2023-05-28T07:23:39Z INFO service try scale runner request up/down base on assigned job count {"assigned job": 0, "decision": 0, "min": 0, "max": 2147483647, "currentRunnerCount": 1}
2023-05-28T07:23:39Z INFO KubernetesManager Created merge patch json for EphemeralRunnerSet update {"json": "{\"spec\":{\"replicas\":null}}"}
2023-05-28T07:23:39Z INFO KubernetesManager Ephemeral runner set scaled. {"namespace": "arc-runners-default", "name": "arc-runner-win-2-292-0-vbmtr", "replicas": 0}
2023-05-28T07:23:41Z INFO auto_scaler deleted message. {"messageId": 45}
2023-05-28T07:23:41Z INFO service waiting for message...
```
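In case it helps anyone hitting the same thing, the reset that worked for me was roughly the following - a minimal sketch, not an exact transcript: the release name and namespace are from my setup, the secret name (github-app-secret) is a placeholder, and you should pin whichever chart version you actually run:

```bash
# remove the stuck scale set (the controller then cleans up its listener and ephemeral runners)
helm uninstall arc-runner-win-2-292-0 -n arc-runners-default

# recreate the scale set pointing at the same repo / org
helm install arc-runner-win-2-292-0 \
  --namespace arc-runners-default \
  --set githubConfigUrl="https://github.com/xyzrepo/actions-tests" \
  --set githubConfigSecret=github-app-secret \
  oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set
```

After recreating the scale set this way, the listener started accepting jobs again, which manually deleting the pods alone hadn't reliably achieved.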
-
Hey guys - has the above behavior been fixed, or is that still the norm as of now?
-
Same problem for me.
-
tl;dr
Using the new autoscaling scale set mode of Actions Runner Controller, I've noticed that if a runner pod gets evicted by the Kubernetes cluster, the controller appears to count the "evicted" / "failed" pod as part of the target number of pods, and any workflow assigned to it hangs indefinitely.
Q. Is this a bug / known issue or am I doing something wrong?
Long Version
I have an ARC runner scale set installation running on an Azure Kubernetes Service (AKS) instance using these images:
Evicted runner pod
Everything works fine, except if a pod gets evicted by the cluster due to memory pressure while running a workflow:
When this happens, the evicted pod doesn't automatically terminate and just sits there waiting to be deleted - either manually, or by the garbage collector when it reaches the `--terminated-pod-gc-threshold` setting (which is apparently 12500 by default!). In the meantime, the runner controller (and/or listener) seems to treat the evicted pod as a viable runner, because the next workflow that gets started is assigned to this pod and never starts. E.g.:
If I start a second workflow, a new runner pod gets created fine, and it processes the workflow and terminates as expected, leaving just the evicted pod in the namespace again.
There's clearly something a bit out of sorts, because when it's just the evicted runner pod left in the namespace, the GitHub portal shows an inconsistent state for the jobs and runners.
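For anyone trying to reproduce this, the leftover evicted pod is easy to spot because evicted pods end up in the Failed phase. A minimal sketch, assuming the runner namespace from my setup (arc-runners-default) - substitute your own namespace and pod name:

```bash
# evicted pods sit in the Failed phase with reason "Evicted"
kubectl get pods -n arc-runners-default --field-selector=status.phase=Failed

# the eviction reason / message (e.g. memory pressure) is recorded on the pod status
kubectl get pod <evicted-pod-name> -n arc-runners-default \
  -o jsonpath='{.status.reason}{" - "}{.status.message}{"\n"}'
```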
Workaround - manually delete evicted pod
If I manually delete the evicted pod I see this sequence:
and everything resumes normal processing - the controller detects there's a workflow waiting in the queue (the same one that was previously blocked) and spins up a new runner pod to process it.
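For reference, the manual cleanup is just an ordinary pod delete. A minimal sketch, assuming the same namespace and a placeholder pod name as above (the second command is a blunter option that clears every Failed pod in the runner namespace):

```bash
# delete the single evicted runner pod
kubectl delete pod <evicted-pod-name> -n arc-runners-default

# or clear out all Failed (e.g. evicted) pods in the runner namespace in one go
kubectl delete pods -n arc-runners-default --field-selector=status.phase=Failed
```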
Root cause
I'm pretty sure this is triggered by a pod getting evicted - I've tried to work through the source code to identify the root cause, but haven't really narrowed it down yet.
However, it looks like the ephemeral runner controller counts "evicted" pods (with a status of "failed") in the total number of required pods, but I can't be 100% sure this is the actual cause of the problem:
https://github.com/actions/actions-runner-controller/blob/aac811f210782d1a35e33ffcfe12db69ebe8e447/controllers/actions.github.com/ephemeralrunnerset_controller.go#LL183C2-L191C7
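For anyone digging into this, the counts the controller is reconciling against can also be inspected from the cluster side via the ARC custom resources. A minimal sketch - the resource names assume the CRDs installed by the gha-runner-scale-set-controller chart, and the exact output columns vary by version:

```bash
# per-runner state (including failed runners) is tracked on the EphemeralRunner objects
kubectl get ephemeralrunners.actions.github.com -n arc-runners-default

# the owning EphemeralRunnerSet / AutoscalingRunnerSet show desired vs. current replicas
kubectl describe ephemeralrunnersets.actions.github.com -n arc-runners-default
kubectl get autoscalingrunnersets.actions.github.com -n arc-runners-default
```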
If anyone has seen this same behaviour or has any insights into the cause (and / or fix) I'd appreciate a pointer...
(More info / logs also available if it will help)
Cheers,
M