Skip to content

Spin up an autoscaling stack of Buildkite Agents on Kubernetes

License

Notifications You must be signed in to change notification settings

Hivebrite/agent-stack-k8s

 
 

Repository files navigation

Buildkite Agent Stack for Kubernetes

Build status

Overview

A Kubernetes controller that runs Buildkite steps as Kubernetes jobs.

How does it work

The controller uses the Buildkite GraphQL API to watch for scheduled work that uses the kubernetes plugin.

When a job is available, the controller will create a pod to acquire and run the job. It converts the PodSpec in the kubernetes plugin into a pod by:

  • adding an init container to:
    • copy the agent binary onto the workspace volume
  • adding a container to run the buildkite agent
  • adding a container to clone the source repository
  • modifying the user-specified containers to:
    • overwrite the entrypoint to the agent binary
    • run with the working directory set to the workspace

The entrypoint rewriting and ordering logic is heavily inspired by the approach used in Tekton.

Architecture

sequenceDiagram
    participant bc as buildkite controller
    participant gql as Buildkite GraphQL API
    participant bapi as Buildkite API
    participant kubernetes
    bc->>gql: Get scheduled builds & jobs
    gql-->>bc: {build: jobs: [{uuid: "abc"}]}
    kubernetes->>pod: start
    bc->>kubernetes: watch for pod completions
    bc->>kubernetes: create pod with agent sidecar
    kubernetes->>pod: create
    pod->>bapi: agent accepts & starts job
    pod->>pod: run sidecars
    pod->>pod: agent bootstrap
    pod->>pod: run user pods to completion
    pod->>bapi: upload artifacts, exit code
    pod->>pod: agent exit
    kubernetes->>bc: pod completion event
    bc->>kubernetes: cleanup finished pods
Loading

Installation

Requirements

Deploy with Helm

The simplest way to get up and running is by deploying our Helm chart:

helm upgrade --install agent-stack-k8s oci://ghcr.io/buildkite/helm/agent-stack-k8s \
    --create-namespace \
    --namespace buildkite \
    --set config.org=<your Buildkite org slug> \
    --set agentToken=<your Buildkite agent token> \
    --set graphqlToken=<your Buildkite GraphQL-enabled API token>

If you are using Buildkite Clusters to isolate sets of pipelines from each other, you will need to specify the cluster's UUID in the configuration for the controller. This may be done using a flag on the helm command like so: --set config.cluster-uuid=<your cluster's UUID>, or an entry in a values file.

# values.yaml
config:
  cluster-uuid: beefcafe-abbe-baba-abba-deedcedecade

The cluster's UUID may be obtained by navigating to the clusters page, clicking on the relevant cluster and then clicking on "Settings". It will be in a section titled "GraphQL API Integration".

We're using Helm's support for OCI-based registries, which means you'll need Helm version 3.8.0 or newer.

This will create an agent-stack-k8s installation that will listen to the kubernetes queue. See the --tags option for specifying a different queue.

Options

Usage:
  agent-stack-k8s [flags]
  agent-stack-k8s [command]

Available Commands:
  completion  Generate the autocompletion script for the specified shell
  help        Help about any command
  lint        A tool for linting Buildkite pipelines
  version     Prints the version

Flags:
      --agent-token-secret string   name of the Buildkite agent token secret (default "buildkite-agent-token")
      --buildkite-token string      Buildkite API token with GraphQL scopes
      --cluster-uuid string         UUID of the Buildkite Cluster. The agent token must be for the Buildkite Cluster.
  -f, --config string               config file path
      --debug                       debug logs
  -h, --help                        help for agent-stack-k8s
      --image string                The image to use for the Buildkite agent (default "ghcr.io/buildkite/agent-stack-k8s/agent:latest")
      --job-ttl duration            time to retain kubernetes jobs after completion (default 10m0s)
      --max-in-flight int           max jobs in flight, 0 means no max (default 25)
      --namespace string            kubernetes namespace to create resources in (default "default")
      --org string                  Buildkite organization name to watch
      --profiler-address string     Bind address to expose the pprof profiler (e.g. localhost:6060)
      --tags strings                A comma-separated list of agent tags. The "queue" tag must be unique (e.g. "queue=kubernetes,os=linux") (default [queue=kubernetes])

Use "agent-stack-k8s [command] --help" for more information about a command.

Configuration can also be provided by a config file (--config or CONFIG), or environment variables. In the examples folder there is a sample YAML config and a sample dotenv config.

Externalize Secrets

You can also have an external provider create a secret for you in the namespace before deploying the chart with helm. If the secret is pre-provisioned, replace the agentToken and graphqlToken arguments with:

--set agentStackSecret=<secret-name>

The format of the required secret can be found in this file.

Other Installation Methods

You can also use this chart as a dependency:

dependencies:
- name: agent-stack-k8s
  version: "0.5.0"
  repository: "oci://ghcr.io/buildkite/helm"

or use it as a template:

helm template oci://ghcr.io/buildkite/helm/agent-stack-k8s -f my-values.yaml

Available versions and their digests can be found on the releases page.

Sample Buildkite Pipelines

For simple commands, you merely have to target the queue you configured agent-stack-k8s with.

steps:
- label: Hello World!
  command: echo Hello World!
  agents:
    queue: kubernetes

For more complicated steps, you have access to the PodSpec Kubernetes API resource that will be used in a Kubernetes Job. For now, this is nested under a kubernetes plugin. But unlike other Buildkite plugins, there is no corresponding plugin repository. Rather, this is syntax that is interpreted by the agent-stack-k8s controller.

steps:
- label: Hello World!
  agents:
    queue: kubernetes
  plugins:
  - kubernetes:
      podSpec:
        containers:
        - image: alpine:latest
          command:
          - echo
          - Hello World!

Note that in a podSpec, a command should be YAML list that will be combined into a single command for a container to run.

Almost any container image may be used, but it MUST have a POSIX shell available to be executed at /bin/sh.

It's also possible to specify an entire script as a command

steps:
- label: Hello World!
  agents:
    queue: kubernetes
  plugins:
  - kubernetes:
      podSpec:
        containers:
        - image: alpine:latest
          command:
          - |-
            set -euo pipefail

            echo Hello World! > hello.txt
            cat hello.txt | buildkite-agent annotate

If you have a multi-line command, specifying the args as well could lead to confusion, so we recommend just using command.

More samples can be found in the integration test fixtures directory.

Cloning repos via SSH

To use SSH to clone your repos, you'll need to add a secret reference via an EnvFrom to your pipeline to specify where to mount your SSH private key from. Place this object under a gitEnvFrom key in the kubernetes plugin (see the example below).

You should create a secret in your namespace with an environment variable name that's recognised by docker-ssh-env-config. A script from this project is included in the default entrypoint of the default buildkite/agent Docker image. It will process the value of the secret and write out a private key to the ~/.ssh directory of the checkout container.

However this key will not be available in your job containers. If you need to use git ssh credentials in your job containers, we recommend one of the following options:

  1. Use a container image that's based on the default buildkite/agent docker image and preserve the default entrypoint by not overriding the command in the job spec.
  2. Include or reproduce the functionality of the ssh-env-config.sh script in the entrypoint for your job container image

Example secret creation for ssh cloning

You most likely want to use a more secure method of managing k8s secrets. This example is illustrative only.

Supposing a SSH private key has been created and its public key has been registered with the remote repository provider (e.g. GitHub).

kubectl create secret generic my-git-ssh-credentials --from-file=SSH_PRIVATE_DSA_KEY="$HOME/.ssh/id_ecdsa"

Then the following pipeline will be able to clone a git repository that requires ssh credentials.

steps:
  - label: build image
    agents:
      queue: kubernetes
    plugins:
      - kubernetes:
          gitEnvFrom:
            - secretRef:
                name: my-git-ssh-credentials # <----
          podSpec:
            containers:
              - image: gradle:latest
                command: [gradle]
                args:
                  - jib
                  - --image=ttl.sh/example:1h

Pod Spec Patch

Rather than defining the entire Pod Spec in a step, there is the option to define a strategic merge patch in the controller. Agent Stack K8s will first generate a K8s Job with a PodSpec from a Buildkite Job and then apply the patch in the controller. It will then apply the patch specified in its config file, which is derived from the value in the helm installation. This can replace much of the functionality of some of the other fields in the plugin, like gitEnvFrom.

Eliminate gitEnvFrom

Here's an example demonstrating how one would eliminate the need to specify gitEnvFrom from every step, but still checkout private repositories.

First, deploy the helm chart with a values.yaml file.

# values.yaml
agentStackSecret: <name of predefined secrets for k8s>
config:
  org: <your-org-slug>
  pod-spec-patch:
    containers:
    - name: checkout         # <---- this is needed so that the secret will only be mounted on the checkout container
      envFrom:
      - secretRef:
          name: git-checkout # <---- this is the same secret name you would have put in `gitEnvFrom` in the kubernetes plugin

You may use the -f or --values arguments to helm upgrade to specify a values.yaml file.

helm upgrade --install agent-stack-k8s oci://ghcr.io/buildkite/helm/agent-stack-k8s \
    --create-namespace \
    --namespace buildkite \
    --values values.yaml \
    --version <agent-stack-k8s version>

Now, with this setup, we don't even need to specify the kubernetes plugin to use Agent Stack K8s with a private repo

# pipelines.yaml
agents:
  queue: kubernetes
steps:
- name: Hello World!
  commands:
  - echo -n Hello!
  - echo " World!"

- name: Hello World in one command
  command: |-
    echo -n Hello!
    echo " World!"

Custom Images

You can specify a different image to use for a step in a step level podSpecPatch. Previously this could be done with a step level podSpec.

# pipelines.yaml
agents:
  queue: kubernetes
steps:
- name: Hello World!
  commands:
  - echo -n Hello!
  - echo " World!"
  plugins:
  - kubernetes:
      podSpecPatch:
      - name: container-0
        image: alpine:latest

- name: Hello World from alpine!
  commands:
  - echo -n Hello
  - echo " from alpine!"
  plugins:
  - kubernetes:
      podSpecPatch:
      - name: container-0      # <---- You must specify this as exactly `container-0` for now.
        image: alpine:latest   #       We are experimenting with ways to make it more ergonomic

Default Resources

In the helm values, you can specify default resources to be used by the containers in Pods that are launched to run Jobs.

# values.yaml
agentStackSecret: <name of predefend secrets for k8s>
config:
  org: <your-org-slug>
  pod-spec-patch:
    initContainers:
    - name: copy-agent
    requests:
      cpu: 100m
      memory: 50Mi
    limits:
      memory: 100Mi
    containers:
    - name: agent          # this container acquires the job
      resources:
        requests:
          cpu: 100m
          memory: 50Mi
        limits:
          memory: 1Gi
    - name: checkout       # this container clones the repo
      resources:
        requests:
          cpu: 100m
          memory: 50Mi
        limits:
          memory: 1Gi
    - name: container-0    # the job runs in a container with this name by default
      resources:
        requests:
          cpu: 100m
          memory: 50Mi
        limits:
          memory: 1Gi

and then every job that's handled by this installation of agent-stack-k8s will default to these values. To override it for a step, use a step level podSpecPatch.

# pipelines.yaml
agents:
  queue: kubernetes
steps:
- name: Hello from a container with more resources
  command: echo Hello World!
  plugins:
  - kubernetes:
      podSpecPatch:
        containers:
        - name: container-0    # <---- You must specify this as exactly `container-0` for now.
          resources:           #       We are experimenting with ways to make it more ergonomic
            requests:
              cpu: 1000m
              memory: 50Mi
            limits:
              memory: 1Gi

- name: Hello from a container with default resources
  command: echo Hello World!

Sidecars

Sidecar containers can be added to your job by specifying them under the top-level sidecars key. See this example for a simple job that runs nginx as a sidecar, and accesses the nginx server from the main job.

There is no guarantee that your sidecars will have started before your job, so using retries or a tool like wait-for-it is a good idea to avoid flaky tests.

Extra volume mounts

In some situations, for example if you want to use git mirrors you may want to attach extra volume mounts (in addition to the /workspace one) in all the pod containers.

See this example, that will declare a new volume in the podSpec and mount it in all the containers. The benefit, is to have the same mounted path in all containers, including the checkout container.

Skipping checkout

For some steps, you may wish to avoid checkout (cloning a source repository). This can be done with the checkout block under the kubernetes plugin:

steps:
- label: Hello World!
  agents:
    queue: kubernetes
  plugins:
  - kubernetes:
      checkout:
        skip: true # prevents scheduling the checkout container

Overriding flags for git clone/fetch

git clone and git fetch flags can be overridden per-step (similar to BUILDKITE_GIT_CLONE_FLAGS and BUILDLKITE_GIT_FETCH_FLAGS env vars) with the checkout block also:

steps:
- label: Hello World!
  agents:
    queue: kubernetes
  plugins:
  - kubernetes:
      checkout:
        cloneFlags: -v --depth 1
        fetchFlags: -v --prune --tags

Validating your pipeline

With the unstructured nature of Buildkite plugin specs, it can be frustratingly easy to mess up your configuration and then have to debug why your agent pods are failing to start. To help prevent this sort of error, there's a linter that uses JSON schema to validate the pipeline and plugin configuration.

This currently can't prevent every sort of error, you might still have a reference to a Kubernetes volume that doesn't exist, or other errors of that sort, but it will validate that the fields match the API spec we expect.

Our JSON schema can also be used with editors that support JSON Schema by configuring your editor to validate against the schema found here.

Debugging

Use the log-collector script in the utils folder to collect logs for agent-stack-k8s.

Prerequisites

  • kubectl binary
  • kubectl setup and authenticated to correct k8s cluster

Inputs to the script

k8s namespace where you deployed agent stack k8s and where you expect their k8s jobs to run.

Buildkite job id for which you saw issues.

Data/logs gathered:

The script will collect kubectl describe of k8s job, pod and agent stack k8s controller pod.

It will also capture kubectl logs of k8s pod for the Buildkite job, agent stack k8s controller pod and package them in a tar archive which you can send via email to [email protected].

Open questions

  • How to deal with stuck jobs? Timeouts?
  • How to deal with pod failures (not job failures)?
    • Report failure to buildkite from controller?
    • Emit pod logs to buildkite? If agent isn't starting correctly
    • Retry?

About

Spin up an autoscaling stack of Buildkite Agents on Kubernetes

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • Go 87.4%
  • Shell 7.8%
  • Smarty 3.5%
  • Just 1.2%
  • Ruby 0.1%