ARC should handle OOM killed runners #143

Open
4 tasks done
antoineozenne-at-leocare opened this issue Mar 4, 2024 · 5 comments
Labels
bug Something isn't working k8s

Comments

antoineozenne-at-leocare commented Mar 4, 2024

Controller Version

0.8.0

Deployment Method

Helm

Checks

  • This isn't a question or user support case (For Q&A and community support, go to Discussions).
  • I've read the Changelog before submitting this issue and I'm sure it's not due to any recently-introduced backward-incompatible changes

To Reproduce

1. Deploy a release of `gha-runner-scale-set` with an `ACTIONS_RUNNER_CONTAINER_HOOK_TEMPLATE` to customize the resource requests and limits of the runner.
2. Run a job in GitHub that causes this runner to be OOMKilled.
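
For reference, step 1 could look roughly like the sketch below. This is not the reporter's actual template. It assumes the hook template is a `PodTemplate`-style document in which the container named `$job` targets the workflow job container, and that `ACTIONS_RUNNER_CONTAINER_HOOK_TEMPLATE` holds the path to that file. The function name and the resource values are placeholders. The output is JSON, which YAML parsers generally accept.

```python
import json


def build_hook_template(memory: str, cpu: str) -> dict:
    """Build a pod template that sets requests and limits on the job container.

    Assumption: the container named "$job" is merged into the workflow job
    container. The resource values passed in are illustrative only.
    """
    resources = {
        "requests": {"memory": memory, "cpu": cpu},
        "limits": {"memory": memory, "cpu": cpu},
    }
    return {
        "apiVersion": "v1",
        "kind": "PodTemplate",
        "spec": {"containers": [{"name": "$job", "resources": resources}]},
    }


if __name__ == "__main__":
    # Save this output to a file and reference that file from the runner
    # container (assumed usage, see the lead-in above).
    print(json.dumps(build_hook_template("512Mi", "500m"), indent=2))
```

A deliberately low memory limit in such a template is one way to trigger the OOMKilled state described in step 2.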

Describe the bug

When the runner is OOMKilled, nothing happens and the pod stays in OOMKilled status. The controller doesn't seem to handle this case, and the job eventually times out.

Describe the expected behavior

I think ARC should handle the case where the runner is OOMKilled by stopping the job in GitHub with an error status.

Additional Context

kubectl get pods -n arc-runners
# NAME                                                           READY   STATUS      RESTARTS   AGE
# arc-runner-set-aks-stg-fc-001-at-f2g6l-runner-59x5m            1/1     Running     0          13h
# arc-runner-set-aks-stg-fc-001-at-f2g6l-runner-59x5m-workflow   0/1     OOMKilled   0          136m
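
The stuck state in the listing above can in principle be detected from the pod's status. The following is a minimal sketch, not how ARC or the hooks work today. It assumes a pod object in the shape returned by `kubectl get pod <name> -o json`, and the function name and sample data are hypothetical.

```python
from typing import List


def oom_killed_containers(pod: dict) -> List[str]:
    """Return names of containers whose current or last state is OOMKilled."""
    names = []
    statuses = pod.get("status", {}).get("containerStatuses") or []
    for status in statuses:
        # A container that was restarted after being killed reports the
        # kill under "lastState" instead of "state".
        for key in ("state", "lastState"):
            terminated = (status.get(key) or {}).get("terminated") or {}
            if terminated.get("reason") == "OOMKilled":
                names.append(status["name"])
                break
    return names


if __name__ == "__main__":
    # Hypothetical pod status, trimmed to the fields the function reads.
    sample_pod = {
        "status": {
            "containerStatuses": [
                {
                    "name": "job",
                    "state": {"terminated": {"reason": "OOMKilled", "exitCode": 137}},
                }
            ]
        }
    }
    print(oom_killed_containers(sample_pod))
```

A component watching the workflow pod could use a check like this to fail the job early instead of waiting for the timeout.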

Controller Logs

2024-03-04T00:23:29Z	INFO	EphemeralRunnerSet	Created new ephemeral runner	{"ephemeralrunnerset": {"name":"arc-runner-set-aks-stg-fc-001-at-f2g6l","namespace":"arc-runners"}, "runner": "arc-runner-set-aks-stg-fc-001-at-f2g6l-runner-59x5m"}
2024-03-04T00:23:29Z	INFO	EphemeralRunner	Adding runner registration finalizer	{"ephemeralrunner": {"name":"arc-runner-set-aks-stg-fc-001-at-f2g6l-runner-59x5m","namespace":"arc-runners"}}
2024-03-04T00:23:29Z	INFO	EphemeralRunner	Successfully added runner registration finalizer	{"ephemeralrunner": {"name":"arc-runner-set-aks-stg-fc-001-at-f2g6l-runner-59x5m","namespace":"arc-runners"}}
2024-03-04T00:23:29Z	INFO	EphemeralRunner	Adding finalizer	{"ephemeralrunner": {"name":"arc-runner-set-aks-stg-fc-001-at-f2g6l-runner-59x5m","namespace":"arc-runners"}}
2024-03-04T00:23:29Z	INFO	EphemeralRunner	Successfully added finalizer	{"ephemeralrunner": {"name":"arc-runner-set-aks-stg-fc-001-at-f2g6l-runner-59x5m","namespace":"arc-runners"}}
2024-03-04T00:23:29Z	INFO	EphemeralRunner	Adding finalizer	{"ephemeralrunner": {"name":"arc-runner-set-aks-stg-fc-001-at-f2g6l-runner-59x5m","namespace":"arc-runners"}}
2024-03-04T00:23:29Z	INFO	EphemeralRunner	Successfully added finalizer	{"ephemeralrunner": {"name":"arc-runner-set-aks-stg-fc-001-at-f2g6l-runner-59x5m","namespace":"arc-runners"}}
2024-03-04T00:23:29Z	INFO	EphemeralRunner	Creating new ephemeral runner registration and updating status with runner config	{"ephemeralrunner": {"name":"arc-runner-set-aks-stg-fc-001-at-f2g6l-runner-59x5m","namespace":"arc-runners"}}
2024-03-04T00:23:29Z	INFO	EphemeralRunner	Creating ephemeral runner JIT config	{"ephemeralrunner": {"name":"arc-runner-set-aks-stg-fc-001-at-f2g6l-runner-59x5m","namespace":"arc-runners"}}
2024-03-04T00:23:31Z	INFO	EphemeralRunner	Created ephemeral runner JIT config	{"ephemeralrunner": {"name":"arc-runner-set-aks-stg-fc-001-at-f2g6l-runner-59x5m","namespace":"arc-runners"}, "runnerId": 5715}
2024-03-04T00:23:31Z	INFO	EphemeralRunner	Updating ephemeral runner status with runnerId and runnerJITConfig	{"ephemeralrunner": {"name":"arc-runner-set-aks-stg-fc-001-at-f2g6l-runner-59x5m","namespace":"arc-runners"}}
2024-03-04T00:23:31Z	INFO	EphemeralRunner	Updated ephemeral runner status with runnerId and runnerJITConfig	{"ephemeralrunner": {"name":"arc-runner-set-aks-stg-fc-001-at-f2g6l-runner-59x5m","namespace":"arc-runners"}}
2024-03-04T00:23:31Z	INFO	EphemeralRunner	Creating new ephemeral runner secret for jitconfig.	{"ephemeralrunner": {"name":"arc-runner-set-aks-stg-fc-001-at-f2g6l-runner-59x5m","namespace":"arc-runners"}}
2024-03-04T00:23:31Z	INFO	EphemeralRunner	Creating new secret for ephemeral runner	{"ephemeralrunner": {"name":"arc-runner-set-aks-stg-fc-001-at-f2g6l-runner-59x5m","namespace":"arc-runners"}}
2024-03-04T00:23:31Z	INFO	EphemeralRunner	Created new secret spec for ephemeral runner	{"ephemeralrunner": {"name":"arc-runner-set-aks-stg-fc-001-at-f2g6l-runner-59x5m","namespace":"arc-runners"}}
2024-03-04T00:23:31Z	INFO	EphemeralRunner	Created ephemeral runner secret	{"ephemeralrunner": {"name":"arc-runner-set-aks-stg-fc-001-at-f2g6l-runner-59x5m","namespace":"arc-runners"}, "secretName": "arc-runner-set-aks-stg-fc-001-at-f2g6l-runner-59x5m"}
2024-03-04T00:23:31Z	INFO	EphemeralRunner	Creating new EphemeralRunner pod.	{"ephemeralrunner": {"name":"arc-runner-set-aks-stg-fc-001-at-f2g6l-runner-59x5m","namespace":"arc-runners"}}
2024-03-04T00:23:31Z	INFO	EphemeralRunner	Creating new pod for ephemeral runner	{"ephemeralrunner": {"name":"arc-runner-set-aks-stg-fc-001-at-f2g6l-runner-59x5m","namespace":"arc-runners"}}
2024-03-04T00:23:31Z	INFO	EphemeralRunner	Created new pod spec for ephemeral runner	{"ephemeralrunner": {"name":"arc-runner-set-aks-stg-fc-001-at-f2g6l-runner-59x5m","namespace":"arc-runners"}}
2024-03-04T00:23:31Z	INFO	EphemeralRunner	Created ephemeral runner pod	{"ephemeralrunner": {"name":"arc-runner-set-aks-stg-fc-001-at-f2g6l-runner-59x5m","namespace":"arc-runners"}, "runnerScaleSetId": 9, "runnerName": "arc-runner-set-aks-stg-fc-001-at-f2g6l-runner-59x5m", "runnerId": 5715, "configUrl": "https://github.com/XXX", "podName": "arc-runner-set-aks-stg-fc-001-at-f2g6l-runner-59x5m"}
2024-03-04T00:23:31Z	INFO	EphemeralRunner	Waiting for runner container status to be available	{"ephemeralrunner": {"name":"arc-runner-set-aks-stg-fc-001-at-f2g6l-runner-59x5m","namespace":"arc-runners"}}
2024-03-04T00:23:31Z	INFO	EphemeralRunner	Waiting for runner container status to be available	{"ephemeralrunner": {"name":"arc-runner-set-aks-stg-fc-001-at-f2g6l-runner-59x5m","namespace":"arc-runners"}}
2024-03-04T00:23:59Z	INFO	EphemeralRunner	Waiting for runner container status to be available	{"ephemeralrunner": {"name":"arc-runner-set-aks-stg-fc-001-at-f2g6l-runner-59x5m","namespace":"arc-runners"}}
2024-03-04T00:23:59Z	INFO	EphemeralRunner	Ephemeral runner container is still running	{"ephemeralrunner": {"name":"arc-runner-set-aks-stg-fc-001-at-f2g6l-runner-59x5m","namespace":"arc-runners"}}
2024-03-04T00:23:59Z	INFO	EphemeralRunner	Updating ephemeral runner status with pod phase	{"ephemeralrunner": {"name":"arc-runner-set-aks-stg-fc-001-at-f2g6l-runner-59x5m","namespace":"arc-runners"}, "phase": "Pending", "reason": "", "message": ""}
2024-03-04T00:23:59Z	INFO	EphemeralRunner	Updated ephemeral runner status with pod phase	{"ephemeralrunner": {"name":"arc-runner-set-aks-stg-fc-001-at-f2g6l-runner-59x5m","namespace":"arc-runners"}}
2024-03-04T00:23:59Z	INFO	EphemeralRunner	Ephemeral runner container is still running	{"ephemeralrunner": {"name":"arc-runner-set-aks-stg-fc-001-at-f2g6l-runner-59x5m","namespace":"arc-runners"}}
2024-03-04T00:24:13Z	INFO	EphemeralRunner	Ephemeral runner container is still running	{"ephemeralrunner": {"name":"arc-runner-set-aks-stg-fc-001-at-f2g6l-runner-59x5m","namespace":"arc-runners"}}
2024-03-04T00:24:13Z	INFO	EphemeralRunner	Updating ephemeral runner status with pod phase	{"ephemeralrunner": {"name":"arc-runner-set-aks-stg-fc-001-at-f2g6l-runner-59x5m","namespace":"arc-runners"}, "phase": "Running", "reason": "", "message": ""}
2024-03-04T00:24:13Z	INFO	EphemeralRunner	Updated ephemeral runner status with pod phase	{"ephemeralrunner": {"name":"arc-runner-set-aks-stg-fc-001-at-f2g6l-runner-59x5m","namespace":"arc-runners"}}
2024-03-04T00:24:13Z	INFO	EphemeralRunner	Ephemeral runner container is still running	{"ephemeralrunner": {"name":"arc-runner-set-aks-stg-fc-001-at-f2g6l-runner-59x5m","namespace":"arc-runners"}}
2024-03-04T11:27:43Z	INFO	EphemeralRunner	Ephemeral runner container is still running	{"ephemeralrunner": {"name":"arc-runner-set-aks-stg-fc-001-at-f2g6l-runner-59x5m","namespace":"arc-runners"}}

Runner Pod Logs

...
[WORKER 2024-03-04 13:49:04Z INFO HostContext] Well known directory 'Bin': '/home/runner/bin'
[WORKER 2024-03-04 13:49:04Z INFO HostContext] Well known directory 'Root': '/home/runner'
[WORKER 2024-03-04 13:49:04Z INFO HostContext] Well known directory 'Work': '/home/runner/_work'
[RUNNER 2024-03-04 13:49:14Z INFO JobDispatcher] Successfully renew job request 93068, job is valid till 03/04/2024 13:59:14
[WORKER 2024-03-04 13:49:14Z INFO HostContext] Well known directory 'Bin': '/home/runner/bin'
[WORKER 2024-03-04 13:49:14Z INFO HostContext] Well known directory 'Root': '/home/runner'
[WORKER 2024-03-04 13:49:14Z INFO HostContext] Well known directory 'Work': '/home/runner/_work'
...
antoineozenne-at-leocare added the bug (Something isn't working) label Mar 4, 2024

github-actions bot commented Mar 4, 2024

Hello! Thank you for filing an issue.

The maintainers will triage your issue shortly.

In the meantime, please take a look at the troubleshooting guide for bug reports.

If this is a feature request, please review our contribution guidelines.

nikola-jokic transferred this issue from actions/actions-runner-controller Mar 8, 2024
@nikola-jokic
Contributor

Moved issue to hooks, since the hook should be responsible for maintaining resources that it creates ☺️

@halradaideh

This happened here as well: the runner went OOM and the workflow just froze up.

@nikola-jokic
Contributor

Hey everyone,

The main problem is that we do not use the Kubernetes scheduler to schedule workflow pods, because they need to land on the same machine as the runner. There is an option to use the kube scheduler; however, it requires a ReadWriteMany volume.
Under these constraints, we can't do anything else. There is a great PR and a suggestion on how to work around the issue. Hopefully, we can dedicate time in the future to test it and double-check that it works. However, OOMKilled is raised by k8s, so the best thing you can do at this time is to ensure your nodes can handle the load, or use a ReadWriteMany volume to allow workflow pods to be scheduled on different nodes.
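
Because the workflow pod is placed without the scheduler, nothing checks in advance whether the node has room for it. One way to approximate "ensure your nodes can handle the load" is a manual capacity check, sketched below. This is a hedged illustration only: the function names are hypothetical, the figures are made up, and the parser covers only the common binary and decimal suffixes of Kubernetes memory quantities (not exponent notation).

```python
# Multipliers for the common Kubernetes memory quantity suffixes.
_UNITS = {
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12,
}


def parse_memory(quantity: str) -> int:
    """Convert a memory quantity such as "2Gi" or "500M" to bytes."""
    # Two-letter suffixes are listed first, so "Mi" is matched before "M".
    for suffix in ("Ki", "Mi", "Gi", "Ti", "k", "M", "G", "T"):
        if quantity.endswith(suffix):
            return int(float(quantity[: -len(suffix)]) * _UNITS[suffix])
    return int(quantity)  # no suffix: a plain byte count


def fits_on_node(allocatable: str, already_requested: str, workflow_request: str) -> bool:
    """Check whether a workflow pod's memory request fits in the node's remaining memory."""
    used = parse_memory(already_requested) + parse_memory(workflow_request)
    return used <= parse_memory(allocatable)


if __name__ == "__main__":
    # Illustrative numbers: an 8Gi node with 6Gi already requested.
    print(fits_on_node("8Gi", "6Gi", "3Gi"))
    print(fits_on_node("8Gi", "6Gi", "1Gi"))
```

The inputs would come from the node's allocatable memory and the sum of requests of pods already on it. A check like this only covers node-level pressure. A container that exceeds its own memory limit is still killed regardless of free node memory.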

@halradaideh

I think I am facing a different issue.

I am using ARC with the dind template, as explained in the documentation. The pod resources created by the scale set are limited, to something like one core and 2 GB of RAM. The issue happens when the workflow requests more resources than the scale set runner has defined, which causes Kubernetes to kill the pod.

Instead of showing something useful in the action logs, such as an OOM status, Kubernetes reschedules the pod, and it gets stuck waiting for IPV from the dispatcher. At the same time, the action logs get stuck, and I have to force-kill the action.
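
As an illustration of the kind of "OOM status" this comment asks for, the sketch below maps a process exit code to a readable message. It is not part of ARC or the runner. It relies on the common convention that a process killed by a signal exits with 128 plus the signal number (so 137 means SIGKILL, which is what the kernel OOM killer sends), and the function name is hypothetical. A SIGKILL can have other causes, so the message stays tentative.

```python
SIGKILL = 9  # signal number sent by the kernel OOM killer


def describe_exit(code: int) -> str:
    """Turn an exit code into a short human-readable explanation."""
    if code < 0:
        # Python's subprocess module reports death-by-signal as -signum.
        sig = -code
    elif 128 < code <= 128 + 64:
        # Shells and container runtimes report 128 + signum.
        sig = code - 128
    else:
        sig = 0
    if sig == SIGKILL:
        return "killed by SIGKILL (possibly the OOM killer)"
    if sig:
        return f"terminated by signal {sig}"
    return "ok" if code == 0 else f"failed with exit code {code}"


if __name__ == "__main__":
    for code in (0, 1, 137, -9, 143):
        print(code, "->", describe_exit(code))
```

Surfacing a message like this in the job log would at least tell the user why the step stopped, instead of leaving the workflow hanging.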
