ARC scalesets runners - evicted ephemeral runner pod treated as "viable" and hangs workflows assigned to it #2656
Replies: 3 comments 1 reply
-
I've seen something similar to this. When setting up runners I was having some issues with GitHub authentication, which meant that the job failed. After cancelling the job the listener would never accept new jobs - I did try manually deleting the pods, but this didn't seem to solve the issue reliably (it worked sometimes); what did work was doing a helm delete and then recreating the scale set with helm as well (rough sketch of the commands below, after the logs).

If I've understood your notes about the root cause correctly, I think I agree - it seems to be something to do with the internal state kept by the controller and/or the listener. They seem to think a runner is still available when it's actually in a failing state (and, as I remember, the pod was already deleted) - I'm basing this on the logs though, I didn't look at the code. I was scaling nodes down to zero in my tests, but I did see the same issue after keeping at least one node available.

These are some of the logs that looked relevant (side note - I was using v2.292 of the runner here, as I copied it from the ARC repo's Windows Dockerfile and didn't notice it was an old version; this turned out to be the cause of my auth issue, since 2.292 dealt with JIT tokens differently 😒):

```
# this is an attempt to run a pipeline after a previous one failed - is the issue here that internal stats are wrong and it thinks a runner or scale set still exists?
# no cluster scaling happens below (windows nodepool = 0 nodes) and no runner pod appears via kubectl get pods -n $runnerNamespace
2023-05-28T06:35:53Z INFO EphemeralRunnerSet Ephemeral runner counts {"ephemeralrunnerset": "arc-runners-default/arc-runner-win-2-292-0-vbmtr", "pending": 0, "running": 0, "finished": 0, "failed": 1, "deleting": 0}
2023-05-28T06:35:53Z INFO EphemeralRunnerSet Scaling comparison {"ephemeralrunnerset": "arc-runners-default/arc-runner-win-2-292-0-vbmtr", "current": 1, "desired": 0}
2023-05-28T06:35:53Z INFO EphemeralRunnerSet Deleting ephemeral runners (scale down) {"ephemeralrunnerset": "arc-runners-default/arc-runner-win-2-292-0-vbmtr", "count": 1}
2023-05-28T06:35:53Z INFO EphemeralRunnerSet No pending or running ephemeral runners running at this time for scale down {"ephemeralrunnerset": "arc-runners-default/arc-runner-win-2-292-0-vbmtr"}
2023-05-28T06:35:53Z INFO AutoscalingRunnerSet Find existing ephemeral runner set {"autoscalingrunnerset": "arc-runners-default/arc-runner-win-2-292-0", "name": "arc-runner-win-2-292-0-vbmtr", "specHash": "5f474959bd"}
2023-05-28T06:47:33Z INFO EphemeralRunnerSet Ephemeral runner counts {"ephemeralrunnerset": "arc-runners-default/arc-runner-win-2-292-0-vbmtr", "pending": 0, "running": 0, "finished": 0, "failed": 1, "deleting": 0}
2023-05-28T06:47:33Z INFO EphemeralRunnerSet Scaling comparison {"ephemeralrunnerset": "arc-runners-default/arc-runner-win-2-292-0-vbmtr", "current": 1, "desired": 1}
2023-05-28T06:47:33Z INFO AutoscalingRunnerSet Find existing ephemeral runner set {"autoscalingrunnerset": "arc-runners-default/arc-runner-win-2-292-0", "name": "arc-runner-win-2-292-0-vbmtr", "specHash": "5f474959bd"}
...
```
```
# these are the logs after cancelling the pipeline (appears to suggest the delete operation wasn't registered by ARC but it does say none were found):
2023-05-28T07:23:39Z INFO EphemeralRunnerSet Ephemeral runner counts {"ephemeralrunnerset": "arc-runners-default/arc-runner-win-2-292-0-vbmtr", "pending": 0, "running": 0, "finished": 0, "failed": 1, "deleting": 0}
2023-05-28T07:23:39Z INFO EphemeralRunnerSet Scaling comparison {"ephemeralrunnerset": "arc-runners-default/arc-runner-win-2-292-0-vbmtr", "current": 1, "desired": 0}
2023-05-28T07:23:39Z INFO EphemeralRunnerSet Deleting ephemeral runners (scale down) {"ephemeralrunnerset": "arc-runners-default/arc-runner-win-2-292-0-vbmtr", "count": 1}
2023-05-28T07:23:39Z INFO EphemeralRunnerSet No pending or running ephemeral runners running at this time for scale down {"ephemeralrunnerset": "arc-runners-default/arc-runner-win-2-292-0-vbmtr"}
2023-05-28T07:23:39Z INFO AutoscalingRunnerSet Find existing ephemeral runner set {"autoscalingrunnerset": "arc-runners-default/arc-runner-win-2-292-0", "name": "arc-runner-win-2-292-0-vbmtr", "specHash": "5f474959bd"}
...

# these are from the win listener logs corresponding with the above:
2023-05-28T06:47:26Z INFO service process message. {"messageId": 43, "messageType": "RunnerScaleSetJobMessages"}
2023-05-28T06:47:26Z INFO service current runner scale set statistics. {"available jobs": 1, "acquired jobs": 0, "assigned jobs": 0, "running jobs": 0, "registered runners": 0, "busy runners": 0, "idle runners": 0}
2023-05-28T06:47:26Z INFO service process batched runner scale set job messages. {"messageId": 43, "batchSize": 1}
2023-05-28T06:47:26Z INFO service job available message received. {"RequestId": 56}
2023-05-28T06:47:26Z INFO auto_scaler acquiring jobs. {"request count": 1, "requestIds": "[56]"}
2023-05-28T06:47:27Z INFO auto_scaler acquired jobs. {"requested": 1, "acquired": 1}
2023-05-28T06:47:28Z INFO auto_scaler deleted message. {"messageId": 43}
2023-05-28T06:47:28Z INFO service waiting for message...
2023-05-28T06:47:33Z INFO service process message. {"messageId": 44, "messageType": "RunnerScaleSetJobMessages"}
2023-05-28T06:47:33Z INFO service current runner scale set statistics. {"available jobs": 0, "acquired jobs": 0, "assigned jobs": 1, "running jobs": 0, "registered runners": 0, "busy runners": 0, "idle runners": 0}
2023-05-28T06:47:33Z INFO service process batched runner scale set job messages. {"messageId": 44, "batchSize": 1}
2023-05-28T06:47:33Z INFO service job assigned message received. {"RequestId": 56}
2023-05-28T06:47:33Z INFO auto_scaler acquiring jobs. {"request count": 0, "requestIds": "[]"}
2023-05-28T06:47:33Z INFO service try scale runner request up/down base on assigned job count {"assigned job": 1, "decision": 1, "min": 0, "max": 2147483647, "currentRunnerCount": 0}
2023-05-28T06:47:33Z INFO KubernetesManager Created merge patch json for EphemeralRunnerSet update {"json": "{\"spec\":{\"replicas\":1}}"}
2023-05-28T06:47:33Z INFO KubernetesManager Ephemeral runner set scaled. {"namespace": "arc-runners-default", "name": "arc-runner-win-2-292-0-vbmtr", "replicas": 1}
2023-05-28T06:47:35Z INFO auto_scaler deleted message. {"messageId": 44}
2023-05-28T06:47:35Z INFO service waiting for message...
2023-05-28T06:55:56Z INFO refreshing_client message queue token is expired during GetNextMessage, refreshing...
2023-05-28T06:55:56Z INFO refreshing token {"githubConfigUrl": "https://github.com/xyzrepo/actions-tests"}
2023-05-28T06:55:56Z INFO getting runner registration token {"registrationTokenURL": "https://api.github.com/repos/xyzrepo/actions-tests/actions/runners/registration-token"}
2023-05-28T06:55:56Z INFO getting Actions tenant URL and JWT {"registrationURL": "https://api.github.com/actions/runner-registration"}
...

# this is the listener when the job is cancelled - stats seem corrupt - line 2 shows all 0 but further down currentRunnerCount = 1
2023-05-28T07:23:39Z INFO service process message. {"messageId": 45, "messageType": "RunnerScaleSetJobMessages"}
2023-05-28T07:23:39Z INFO service current runner scale set statistics. {"available jobs": 0, "acquired jobs": 0, "assigned jobs": 0, "running jobs": 0, "registered runners": 0, "busy runners": 0, "idle runners": 0}
2023-05-28T07:23:39Z INFO service process batched runner scale set job messages. {"messageId": 45, "batchSize": 1}
2023-05-28T07:23:39Z INFO service job completed message received. {"RequestId": 56, "Result": "canceled", "RunnerId": 0, "RunnerName": ""}
2023-05-28T07:23:39Z INFO auto_scaler acquiring jobs. {"request count": 0, "requestIds": "[]"}
2023-05-28T07:23:39Z INFO service try scale runner request up/down base on assigned job count {"assigned job": 0, "decision": 0, "min": 0, "max": 2147483647, "currentRunnerCount": 1}
2023-05-28T07:23:39Z INFO KubernetesManager Created merge patch json for EphemeralRunnerSet update {"json": "{\"spec\":{\"replicas\":null}}"}
2023-05-28T07:23:39Z INFO KubernetesManager Ephemeral runner set scaled. {"namespace": "arc-runners-default", "name": "arc-runner-win-2-292-0-vbmtr", "replicas": 0}
2023-05-28T07:23:41Z INFO auto_scaler deleted message. {"messageId": 45}
2023-05-28T07:23:41Z INFO service waiting for message...
```
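In case it helps anyone hitting the same thing, the reset that worked for me was roughly the following - a minimal sketch, not an exact transcript: the release name and namespace are from my setup, the secret name (github-app-secret) is a placeholder, and you should pin whichever chart version you actually run:

```bash
# remove the stuck scale set (the controller then cleans up its listener and ephemeral runners)
helm uninstall arc-runner-win-2-292-0 -n arc-runners-default

# recreate the scale set pointing at the same repo / org
helm install arc-runner-win-2-292-0 \
  --namespace arc-runners-default \
  --set githubConfigUrl="https://github.com/xyzrepo/actions-tests" \
  --set githubConfigSecret=github-app-secret \
  oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set
```

After recreating the scale set this way, the listener started accepting jobs again, which manually deleting the pods alone hadn't reliably achieved.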
-
Hey guys - has the above behavior been fixed, or is that still the norm as of now?
-
Same problem for me.
-
tl;dr
Using the new autoscaling scale set mode of Actions Runner Controller, I've noticed that if a runner pod gets evicted by the Kubernetes cluster, the controller appears to count the "evicted" / "failed" pod as part of the target number of pods, and any workflow assigned to it hangs indefinitely.
Q. Is this a bug / known issue or am I doing something wrong?
Long Version
I have an ARC runner scale set installation running on an Azure Kubernetes Service (AKS) instance using these images:
Evicted runner pod
Everything works fine, except if a pod gets evicted by the cluster due to memory pressure while running a workflow:
When this happens, the evicted pod doesn't automatically terminate and just sits there waiting to be deleted - either manually, or by the garbage collector when it reaches the `--terminated-pod-gc-threshold` setting (which is apparently 12500 by default!). In the meantime, the runner controller (and/or listener) seems to treat the evicted pod as a viable runner, because the next workflow that gets started is assigned to this pod and never starts. E.g.:
If I start a second workflow, a new runner pod gets created fine, and it processes the workflow and terminates as expected, leaving just the evicted pod in the namespace again.
There's clearly something a bit out of sorts, because when it's just the evicted runner pod left in the namespace, the GitHub portal shows an inconsistent state for the jobs and runners.
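For anyone trying to reproduce this, the leftover evicted pod is easy to spot because evicted pods end up in the Failed phase. A minimal sketch, assuming the runner namespace from my setup (arc-runners-default) - substitute your own namespace and pod name:

```bash
# evicted pods sit in the Failed phase with reason "Evicted"
kubectl get pods -n arc-runners-default --field-selector=status.phase=Failed

# the eviction reason / message (e.g. memory pressure) is recorded on the pod status
kubectl get pod <evicted-pod-name> -n arc-runners-default \
  -o jsonpath='{.status.reason}{" - "}{.status.message}{"\n"}'
```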
Workaround - manually delete evicted pod
If I manually delete the evicted pod I see this sequence:
and everything resumes normal processing - the controller detects there's a workflow waiting in the queue (the same one that was previously blocked) and spins up a new runner pod to process it.
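For reference, the manual cleanup is just an ordinary pod delete. A minimal sketch, assuming the same namespace and a placeholder pod name as above (the second command is a blunter option that clears every Failed pod in the runner namespace):

```bash
# delete the single evicted runner pod
kubectl delete pod <evicted-pod-name> -n arc-runners-default

# or clear out all Failed (e.g. evicted) pods in the runner namespace in one go
kubectl delete pods -n arc-runners-default --field-selector=status.phase=Failed
```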
Root cause
I'm pretty sure this is triggered by a pod getting evicted - I've tried to work through the source code to identify the root cause, but haven't really narrowed it down yet.
However, it looks like the ephemeral runner controller counts "evicted" pods (with a status of "failed") in the total number of required pods, but I can't be 100% sure this is the actual cause of the problem:
https://github.com/actions/actions-runner-controller/blob/aac811f210782d1a35e33ffcfe12db69ebe8e447/controllers/actions.github.com/ephemeralrunnerset_controller.go#LL183C2-L191C7
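For anyone digging into this, the counts the controller is reconciling against can also be inspected from the cluster side via the ARC custom resources. A minimal sketch - the resource names assume the CRDs installed by the gha-runner-scale-set-controller chart, and the exact output columns vary by version:

```bash
# per-runner state (including failed runners) is tracked on the EphemeralRunner objects
kubectl get ephemeralrunners.actions.github.com -n arc-runners-default

# the owning EphemeralRunnerSet / AutoscalingRunnerSet show desired vs. current replicas
kubectl describe ephemeralrunnersets.actions.github.com -n arc-runners-default
kubectl get autoscalingrunnersets.actions.github.com -n arc-runners-default
```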
If anyone has seen this same behaviour or has any insights into the cause (and / or fix) I'd appreciate a pointer...
(More info / logs also available if it will help)
Cheers,
M