This repository contains the code of the GitHub Actions Runner, modified to spawn preemptible GCP instances with Singularity containers and to perform run steps within them.
The software was designed to run in Google Compute Engine. Therefore, it is necessary to prepare some virtual infrastructure prior to installing the runner.
The repositories listed below contain the definitions of the required components:
- github-actions-runner-scalerunner - the image used by preemptible GCP instances that serve as workers (one worker per job).
- github-actions-runner-terraform - a Terraform module used to create the virtual network, firewall rules, cloud NAT and coordinator instance for the runner.
The manual below assumes that Debian Buster is used to deploy the runner.
The following packages must be installed:
- build-essential
- Terraform
- Google Cloud SDK
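On a fresh Debian Buster machine, the prerequisites can be installed roughly as follows. This is only a sketch: `build-essential` comes from the distribution, while Terraform and the Google Cloud SDK come from vendor APT repositories, so verify the repository details against the current HashiCorp and Google Cloud documentation:
# Base packages from the Debian repositories.
sudo apt-get update
sudo apt-get install -y build-essential curl gnupg lsb-release software-properties-common
# Terraform from the HashiCorp APT repository.
curl -fsSL https://apt.releases.hashicorp.com/gpg | sudo apt-key add -
sudo apt-add-repository "deb [arch=amd64] https://apt.releases.hashicorp.com $(lsb_release -cs) main"
# Google Cloud SDK from the Google Cloud APT repository.
echo "deb [signed-by=/usr/share/keyrings/cloud.google.gpg] https://packages.cloud.google.com/apt cloud-sdk main" | sudo tee /etc/apt/sources.list.d/google-cloud-sdk.list
curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | sudo gpg --dearmor -o /usr/share/keyrings/cloud.google.gpg
# Install both packages.
sudo apt-get update && sudo apt-get install -y terraform google-cloud-sdk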
With all prerequisites in place, follow the steps below to install the software:
Install the Google Cloud SDK and set up the project:
# Authenticate with GCP.
gcloud auth login
# Create a GCP project for your runner.
export PROJECT=example-runner-project
gcloud projects create $PROJECT
gcloud config set project $PROJECT
# At this point, billing needs to be enabled.
# To do this, follow the instructions from the link below:
# https://cloud.google.com/billing/docs/how-to/modify-project
# Enable the necessary APIs in your project.
gcloud services enable compute.googleapis.com
gcloud services enable storage-component.googleapis.com
gcloud services enable storage.googleapis.com
gcloud services enable storage-api.googleapis.com
# Create and setup a service account.
export SERVICE_ACCOUNT_ID=runner-manager
gcloud iam service-accounts create $SERVICE_ACCOUNT_ID
export FULL_SA_MAIL=$SERVICE_ACCOUNT_ID@$PROJECT.iam.gserviceaccount.com
gcloud projects add-iam-policy-binding $PROJECT \
--member="serviceAccount:$FULL_SA_MAIL" \
--role="roles/compute.admin"
gcloud projects add-iam-policy-binding $PROJECT \
--member="serviceAccount:$FULL_SA_MAIL" \
--role="roles/iam.serviceAccountCreator"
gcloud projects add-iam-policy-binding $PROJECT \
--member="serviceAccount:$FULL_SA_MAIL" \
--role="roles/iam.serviceAccountUser"
gcloud projects add-iam-policy-binding $PROJECT \
--member="serviceAccount:$FULL_SA_MAIL" \
--role="roles/iam.serviceAccountKeyAdmin"
gcloud projects add-iam-policy-binding $PROJECT \
--member="serviceAccount:$FULL_SA_MAIL" \
--role="roles/resourcemanager.projectIamAdmin"
# Create and download SA key.
# WARNING: the export below will be used by Terraform later.
export GOOGLE_APPLICATION_CREDENTIALS=$HOME/$SERVICE_ACCOUNT_ID.json
gcloud iam service-accounts keys create $GOOGLE_APPLICATION_CREDENTIALS \
--iam-account=$FULL_SA_MAIL
# Create a GCP bucket for worker image.
export BUCKET=$PROJECT-worker-bucket
gsutil mb gs://$BUCKET
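Before moving on, it can be useful to sanity-check the setup; the commands below only read the current state:
# Confirm the active project, the service account and the bucket.
gcloud config get-value project
gcloud iam service-accounts list --filter="email:$FULL_SA_MAIL"
gsutil ls -b gs://$BUCKET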
Build and upload the worker image:
# Clone the repository
git clone https://github.com/antmicro/github-actions-runner-scalerunner.git
cd github-actions-runner-scalerunner
# Compile bzImage
cd buildroot && make BR2_EXTERNAL=../overlay/ scalenode_gcp_defconfig && make
cd ..
# Prepare a disk for GCP
./make_gcp_image.sh
# Upload the resulting tar archive
./upload_gcp_image.sh $PROJECT $BUCKET
# (optional) If you need ARM64 support, perform a full rebuild with ARM64 defconfig.
rm -rf output/*
cd buildroot && make clean && make BR2_EXTERNAL=../overlay/ scalenode_gcp_arm64_defconfig && make
# Prepare a disk for GCP
./make_gcp_image.sh
# Upload the resulting tar archive
./upload_gcp_image.sh $PROJECT $BUCKET
Set up the virtual infrastructure using Terraform.
If you need ARM64 support, make sure to fill out the `gcp_arm64_worker_image_name` variable.
git clone https://github.com/antmicro/github-actions-runner-terraform.git
cd github-actions-runner-terraform
terraform init && terraform apply
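Module variables can also be passed non-interactively on the command line; for example, to enable ARM64 support (the image name below is a placeholder, use the name produced by the upload step):
# Supply the ARM64 worker image name explicitly instead of being prompted for it.
terraform apply -var "gcp_arm64_worker_image_name=<ARM64_WORKER_IMAGE_NAME>"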
Connect to the coordinator instance created in the previous step:
gcloud compute ssh <COORDINATOR_INSTANCE> --zone <COORDINATOR_ZONE>
Install and configure the runner on the coordinator instance according to the instructions below.
The registration token (the `$TOKEN` variable) can be obtained from the Runners settings page in the repository settings (https://github.com/$REPOSITORY_ORG/$REPOSITORY_NAME/settings/actions/runners/new) or using the Self-hosted runners API.
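For reference, a registration token can also be fetched from the API with a personal access token that has sufficient scope (a sketch; `$GITHUB_PAT` is a placeholder for a token of your own):
# Request a short-lived runner registration token for the repository.
# The response JSON contains the token and its expiry time.
curl -X POST \
  -H "Authorization: Bearer $GITHUB_PAT" \
  -H "Accept: application/vnd.github+json" \
  https://api.github.com/repos/$REPOSITORY_ORG/$REPOSITORY_NAME/actions/runners/registration-token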
# Update repositories and install wget.
sudo apt -qqy update && sudo apt -qqy install wget
# Download and run the installation script.
wget -O - https://raw.githubusercontent.com/antmicro/runner/vm-runners/scripts/install.sh | sudo bash
# The runner software runs as the 'runner' user, so let's sudo into it.
sudo -i -u runner
cd /home/runner/github-actions-runner
# Init and update submodules
git submodule update --init --recursive
# Copy the .vm_specs.json file and adjust the parameters accordingly.
# For ARM64 support, make sure to add some t2a-standard-* instances to the allowed machine types.
cp .vm_specs.example.json .vm_specs.json
vim .vm_specs.json
# Register the runner in the desired repository.
./config.sh --url https://github.com/$REPOSITORY_ORG/$REPOSITORY_NAME --token $TOKEN --num $SLOTS
The default behavior of the coordinator is to spawn worker machines in its own zone (which is configured using the `gcp_zone` parameter).
However, certain workloads may trigger the `ZONE_RESOURCE_POOL_EXHAUSTED` error, which is caused by a physical lack of available resources within a certain zone (see the support page for more details).
If such an error occurs, the software will attempt to spawn the machine in neighboring zones within the region. This behavior can be further expanded by defining a list of additional regions (see the `gcp_auxiliary_zones` parameter).
WARNING: read on if you're planning to use the external disk feature.
For external disks to work in this arrangement, it is necessary to manually replicate them in all zones within the home region (and auxiliary regions if applicable). Otherwise, jobs requiring an external disk will be constrained to zones where the disk and its replicas can be found.
Consider the example of replicating a balanced persistent disk called `auxdisk` located in `europe-west4-a` to `europe-west4-b`.
# Create a snapshot of the disk located in the home zone.
gcloud compute snapshots create auxdisk-snapshot-1 \
--source-disk auxdisk \
--source-disk-zone europe-west4-a \
--project my-cool-project
# Create a disk from the snapshot in another zone.
# Notice that we cannot assign the same name to it.
# We'll associate it by specifying the original name in the "gha-replica-for" label instead.
gcloud compute disks create another-auxdisk \
--zone europe-west4-b \
--labels gha-replica-for=auxdisk \
--source-snapshot auxdisk-snapshot-1 \
--project my-cool-project
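Once the replicas exist, they can also be listed from the GCP side by filtering on the `gha-replica-for` label (the label value is the original disk name):
# List the original disk and all of its replicas across zones.
gcloud compute disks list \
  --filter="name=auxdisk OR labels.gha-replica-for=auxdisk" \
  --project my-cool-project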
It is possible to check the availability of a disk by running `python3 vm_command.py --mode get_disks -d auxdisk` on the coordinator machine (replacing the value of the `-d` argument with the name of the disk to check).
Example output of such an invocation might look as follows:
runner@foo-runner:~/github-actions-runner/virt$ python3 vm_command.py --mode get_disks -d auxdisk
{'europe-west4-a': {'autoDelete': 'false', 'deviceName': 'aux', 'mode': 'READ_ONLY', 'source': 'projects/foo/zones/europe-west4-a/disks/auxdisk'}, 'europe-west4-b': {'autoDelete': 'false', 'deviceName': 'aux', 'mode': 'READ_ONLY', 'source': 'projects/foo/zones/europe-west4-b/disks/another-auxdisk'}}
By default, timestamped runner logs are stored in `*_diag` directories under `$THIS_REPO_PATH/_layout`.
It is possible, however, to point the runner to store logs on an external disk.
A helper script is available which creates, formats and mounts such a disk.
In order to ensure persistence, a corresponding entry will be added to `/etc/fstab`.
To enable this feature, simply run `./scripts/setup_ext_log_disk.sh`.
After completing this step, restart the runner and the new mount point (`/var/log/runner`) will be picked up automatically.
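In practice this boils down to two commands on the coordinator (a sketch, assuming the runner was started as the `gha-main@$SLOTS` systemd service described later in this document):
# Create, format and mount the external log disk, adding an /etc/fstab entry.
./scripts/setup_ext_log_disk.sh
# Restart the runner so that logs start going to /var/log/runner.
sudo systemctl restart gha-main@$SLOTS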
To make sure there aren't any stale long-running runner VMs, it is possible to enable a cron job that automatically removes any auto-spawned instances running for more than 12h.
To enable this feature, simply run `./scripts/install_stale_vm_remover.sh`.
By default, logs are stored for 10 days before they are deleted.
It is possible to enable a cron job that compresses all log files that are at least 2 days old and removes old archives until there is enough free disk space.
To enable this feature, simply run `./scripts/install_compress_log_files_cron.sh`.
After completing this step, logs will be automatically compressed every day at 3 AM.
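The installed schedule can be inspected afterwards by listing the crontab of the user the script was run as (the exact entry depends on the script):
# Show the installed cron entries; the compression job should be scheduled at 3 AM.
crontab -l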
Certain environment variables set at the global level influence the VM initialization step.
By convention, they are prefixed with `GHA_`.
The table below describes their purpose.
Environment variable | Type | Description |
---|---|---|
`GHA_EXTERNAL_DISK` | string | Name of an external Compute Engine disk |
`GHA_PREEMPTIBLE` | bool | Whether the machine should be preemptible |
`GHA_MACHINE_TYPE` | string | Compute Engine machine type |
`GHA_SA` | string | Machine service account suffix |
`GHA_SSH_TUNNEL_CONFIG` | base64 string | OpenSSH configuration file for tunneling |
`GHA_SSH_TUNNEL_KEY` | base64 string | OpenSSH private key file |
`GHA_SSH_TUNNEL_CONFIG_SECRET_NAME` | string | Name of a GCP Secret Manager secret containing the OpenSSH configuration file for tunneling |
`GHA_SSH_TUNNEL_KEY_SECRET_NAME` | string | Name of a GCP Secret Manager secret containing the OpenSSH private key file |
`GHA_CUSTOM_LINE_PREFIX` | string | Custom line prefix for logs; if empty or not specified, the time (in HH:mm:ss format) is used |
Spawning ARM64 machines requires the following steps to have been completed:
- The worker image for ARM64 has been built and uploaded to your GCP project.
- The `gcp_arm64_worker_image_name` variable in Terraform has been set or the `WORKER_IMAGE_ARM64` metadata variable has been set manually on the coordinator machine.
- At least one T2A instance type has been added to the allowed machine types.
After ensuring the checklist above, set the `GHA_MACHINE_TYPE` variable in your workflow to a Tau T2A machine, e.g. `t2a-standard-4`.
It is possible to establish a secure tunnel to an SSH-enabled host in order to forward some ports.
First, prepare a configuration file according to the OpenSSH client configuration syntax.
An example configuration file may look as follows:
Host some-host
    HostName example.com
    User test
    StrictHostKeyChecking no
    ExitOnForwardFailure yes
    LocalForward localhost:8080 127.0.0.1:80
This will forward the HTTP port from example.com to port 8080 on the worker machine.
The forwarded port will be available within the job container (this will allow you to, for example, run `wget localhost:8080`).
Apart from preparing the configuration file, it is necessary to prepare a private key for authentication with the remote host.
There are two ways of exposing these files:
- Encode both files in Base64 (this can be done by running `cat <filename> | base64 -w0`), store them in GitHub Actions Encrypted secrets and expose them in the workflow file.
- Store them as secrets in GCP Secret Manager with two labels (`gha_runner_exposed: 1` and `gha_runner_namespace: $REPOSITORY_NAME`) and reference their names in the workflow file (see the sketch after this list).
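A sketch of creating such secrets with gcloud, matching the secret names used in the workflow example below; `tunnel_config` and `tunnel_key` are placeholder file names, and whether their contents need to be Base64-encoded here as well (as in the first method) should be verified for your setup:
# Enable the Secret Manager API once per project.
gcloud services enable secretmanager.googleapis.com
# Store the OpenSSH configuration and private key as labeled secrets.
gcloud secrets create my_tunnel_config \
  --data-file=tunnel_config \
  --labels=gha_runner_exposed=1,gha_runner_namespace=$REPOSITORY_NAME
gcloud secrets create my_tunnel_key \
  --data-file=tunnel_key \
  --labels=gha_runner_exposed=1,gha_runner_namespace=$REPOSITORY_NAME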
In the event that both methods are used in the workflow file, Secret Manager takes precedence.
An example workflow file leveraging this feature may look as follows:
on: [push]
name: test
jobs:
  centos:
    container: centos:7
    runs-on: [self-hosted, Linux, X64]
    env:
      GHA_SSH_TUNNEL_KEY: "${{ secrets.GHA_SSH_TUNNEL_KEY }}"
      GHA_SSH_TUNNEL_CONFIG: "${{ secrets.GHA_SSH_TUNNEL_CONFIG }}"
      GHA_SSH_TUNNEL_CONFIG_SECRET_NAME: "my_tunnel_config"
      GHA_SSH_TUNNEL_KEY_SECRET_NAME: "my_tunnel_key"
    steps:
      - run: yum -y install wget
      - run: wget http://localhost:8080 && cat index.html
In order to start the runners manually, run `SCALE=<number of slots> supervisord -n -c supervisord.conf`.
Start the runner by running `sudo systemctl start gha-main@$SLOTS`, replacing `$SLOTS` with the number of runner slots you'd like to allocate.
If you want the software to start automatically, run the command above with the `enable` action instead of `start`.
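For example, to allocate four slots and have them start automatically after a reboot (the slot count is only an illustration):
# Enable at boot and start immediately with 4 runner slots.
sudo systemctl enable gha-main@4
sudo systemctl start gha-main@4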