Scale set terminates the runner pod when minRunners=0 if the pod was in Pending state for a long time #3816

Open
4 tasks done
amir-bialek opened this issue Nov 20, 2024 · 2 comments
Labels
bug (Something isn't working), gha-runner-scale-set (Related to the gha-runner-scale-set mode), needs triage (Requires review from the maintainers)

amir-bialek commented Nov 20, 2024

Checks

Controller Version

0.9.1

Deployment Method

Helm

Checks

  • This isn't a question or user support case (For Q&A and community support, go to Discussions).
  • I've read the Changelog before submitting this issue and I'm sure it's not due to any recently-introduced backward-incompatible changes

To Reproduce

Running with `minRunners=0`:

1. Configure an EKS cluster with the following:
   - GitHub ARC controller.
   - One scale-set with `minRunners=0`.
   - Cluster Autoscaler enabled.

2. Trigger a GitHub Actions workflow that requires a runner.

3. Observe the following sequence of events:
   - The scale-set listener receives the job and deploys a new pod.
   - The pod enters a **pending state** due to no available nodes.

4. Wait for the **Cluster Autoscaler** to respond:
   - The Autoscaler starts a scale-up cycle for the node group.

5. The pod 'disappears'.

7. A new node is deployed.

8. The pod 'reappears'. Observed behavior of the pod:
   - The pod is scheduled on the newly provisioned node.
   - It pulls the necessary image.
   - The init container runs.
   - The main container starts running.
   - The pod does not run the workflow job; instead, it is terminated.

Running with `minRunners=1`:
Everything is the same, except that at point 5 the pod stays in Pending, and at point 8 it runs the job.
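
Step 2 can be reproduced with any workflow that targets the scale set. A minimal sketch, assuming a manually triggered workflow; the file name, workflow name, and job name are hypothetical, and `runs-on` uses the `runnerScaleSetName` from the values listed under Additional Context:

```yaml
# Hypothetical file: .github/workflows/scale-set-repro.yml
name: scale-set-repro
on:
  workflow_dispatch:   # manual trigger, so the run can be started on demand
jobs:
  repro:
    # In scale-set mode, runs-on is the name of the runner scale set.
    runs-on: my_scale_Set
    steps:
      # RUNNER_NAME is a default environment variable set by the runner.
      - run: echo "picked up by $RUNNER_NAME"
```

With no free node in the cluster, dispatching this workflow should leave the runner pod in Pending until the Cluster Autoscaler adds a node.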

Describe the bug

The controller terminates the runner pod and does not reschedule it.
The workflow on GitHub shows "waiting for runner to come back online".

Describe the expected behavior

The runner-pod should run the workflow-job.
The controller should not terminate the runner-pod.

Additional Context

The default values.yaml is overridden with:

githubConfigUrl: "my_org_repo"
githubConfigSecret: github-token

minRunners: 0
maxRunners: 5


runnerScaleSetName: "my_scale_Set"

controllerServiceAccount:
  namespace: github-controller
  name: github-runner-controller

runnerGroup: "default"

template:
  spec:
    tolerations:
    - key: "need-gpu"
      operator: "Equal"
      value: "yes"
      effect: "NoSchedule"
    
    imagePullSecrets:
    - name: registry
    initContainers:
      - name: init-share-repo
        image: alpine/git:v2.45.2
        command: ["/bin/sh", "-c"]
        args:
        - sh "/tmp/data-script/runme.sh"
        env:
          - name: READ_TOKEN
            valueFrom:
              secretKeyRef:
                name: github-read-token
                key: token

        volumeMounts:
          - name: ed-share-folder
            mountPath: /tmp/shared-repos
          - name: data-script
            mountPath: /tmp/data-script

    containers:
      - name: runner
        image: my_custom_image
        command: ["/home/runner/run.sh"]
        imagePullPolicy: IfNotPresent

        resources: 
          requests:
            nvidia.com/gpu: 1
            cpu: "7000m"
            memory: "20Gi"
          limits:
            nvidia.com/gpu: 1
        volumeMounts:
          - name: empty-docker-config
            mountPath: /home/runner/.docker
          - name: docker-login-config
            mountPath: /home/runner/.docker/config.json
            subPath: config.json
            readOnly: true
          - name: ed-share-folder
            mountPath: /some/path
    volumes:
    - name: data-script
      configMap:
        name: data-script-cm
    - name: docker-login-config
      secret:
        secretName: some-secret
        items:
        - key: .dockerconfigjson
          path: config.json
    - name: empty-docker-config
      emptyDir: {}
    - name: ed-share-folder
      emptyDir: {}


containerMode:
  type: "dind"
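
For the comparison run with `minRunners=1` described under To Reproduce, only the minimum runner count changes. A minimal sketch, assuming every other value stays exactly as listed above:

```yaml
# Comparison run: identical to the values above except for minRunners.
minRunners: 1   # the pod stays in Pending and later runs the job
maxRunners: 5
```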

Controller Logs

https://gist.github.com/amir-bialek/9a9bd3ab45847b4dd285b86cf51ea069

Runner Pod Logs

Not relevant: the issue comes from the controller.

amir-bialek added the bug, gha-runner-scale-set, and needs triage labels Nov 20, 2024

Hello! Thank you for filing an issue.

The maintainers will triage your issue shortly.

In the meantime, please take a look at the troubleshooting guide for bug reports.

If this is a feature request, please review our contribution guidelines.

amir-bialek (Author) commented

Similar issue here:
#2850
