Metal3 Prow dashboard: https://prow.apps.test.metal3.io
The configuration files for Prow are in the config directory. The prow jobs have been split out into multiple files to make managing them easier. They can be found under config/jobs. They are all collected into a ConfigMap called job-config that Prow can access.
The jobs are sorted based on GitHub organization and repository. For repositories with release branches, we keep one file per branch. This makes it easy to add tests for a new branch by copying an existing file, and to remove tests for a branch by deleting the corresponding file.
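As an illustration of the layout (the file names here are hypothetical), it looks roughly like this:

config/jobs/
  metal3-io/
    baremetal-operator.yaml
    baremetal-operator-release-0.8.yaml
    ...
  Nordix/
    ...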
- To merge, patches must have /approve and /lgtm, which apply the approved and lgtm labels. For repositories in the Nordix organization, two approving reviews are also needed to avoid accidental merges when an approver puts lgtm and approves (which adds approved) in one review.
- The use of /approve and /lgtm is controlled by the OWNERS file in each repository. See the OWNERS spec for more details about how to manage access to all or part of a repo with this file (a minimal example is shown below this list).
- Tests will run automatically for PRs authored by public members of the metal3-io GitHub organization. Members of the GitHub org can run /ok-to-test for PRs authored by those not in the GitHub org.
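As a minimal sketch of an OWNERS file (the usernames are placeholders):

approvers:
  - some-maintainer
reviewers:
  - some-reviewer
  - another-reviewer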
See the Prow command help for more information about who can run each prow command.
Prow is deployed in a Kubernetes cluster in Xerces. The cluster is created using the Cluster API provider for OpenStack (CAPO), using the configuration in capo-cluster. You will need access to (or be able to create) the following credentials in order to deploy or manage the Kubernetes cluster and the Prow instance:
- OpenStack credentials, used by CAPO to manage the OpenStack resources.
- S3 credentials (can be created using the OpenStack API).
- An HMAC token for webhook validation.
- A GitHub token for accessing GitHub.
- A separate GitHub token for the cherry-pick bot.
- A token and username for accessing Jenkins, when triggering Jenkins jobs from Prow.
In addition to this, we rely on a GitHub bot account (metal3-io-bot, owner of the GitHub token) and a separate GitHub bot metal3-cherrypick-bot, for cherry picking pull requests. A webhook must be configured in GitHub to send events to Prow and a DNS record must be configured so that https://prow.apps.test.metal3.io points to the IP where Prow can be accessed.
The DNS records are managed by CNCF for us. Any changes to them can be done through the service desk. All project maintainers have (or can request) access. See their FAQ for more details.
You will need the following CLI tools in order to deploy and/or manage Prow:
- clusterctl
- kubectl
- kind
- openstack (can be installed directly from the Ubuntu repositories)
- s3cmd (can be installed directly from the Ubuntu repositories)
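For example, the last two can be installed like this on Ubuntu (package names as found in the Ubuntu repositories):

sudo apt-get update
sudo apt-get install -y python3-openstackclient s3cmd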
There are four folders with kustomizations (capo-cluster, cluster-resources, infra and manifests). The capo-cluster folder contains everything needed for creating the Kubernetes cluster itself. In cluster-resources, you will find things the cluster needs to integrate with the cloud, i.e. the external cloud-provider for OpenStack and the CSI plugin for Cinder. It also has the CNI (Calico), since that is needed to get a healthy cluster, and the cluster autoscaler. The infra folder contains "optional" add-ons and configuration, e.g. an ingress controller, a ClusterIssuer for getting Let's Encrypt certificates, a LoadBalancer Service for ingress and a StorageClass for Cinder volumes.
The deployment manifests for Prow (manifests) are based on the getting started guide and the starter-s3.yaml manifests. They have been separated out into multiple files and tied together using kustomizations. This makes it easy to keep secrets outside of git while still allowing for simple one-liners for applying all of it.
The getting started guide is (as of writing) focused on setting up Prow using a GitHub app, but Prow actually supports many authentication methods. We use API tokens created from GitHub bot accounts.
We use ingress-nginx as ingress controller. It is fronted by a LoadBalancer Service, i.e. a loadbalancer in Xerces. The Service is configured to grab the correct IP automatically and to avoid deleting it even if the Service is deleted. See infra/service.yaml. For securing it with TLS we rely on cert-manager and the Let's Encrypt HTTP01 challenge, as seen in infra/cluster-issuer-http.yaml.
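As a rough sketch of what such a ClusterIssuer looks like (the actual manifest is infra/cluster-issuer-http.yaml; the names here are illustrative):

apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    privateKeySecretRef:
      name: letsencrypt-account-key
    solvers:
      - http01:
          ingress:
            class: nginx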
The Kubernetes cluster where Prow runs needs pre-built images for the Nodes. We use image-builder for this.
Here is how to build a node image directly in Xerces. See the image-builder book for more details. Start by creating a JSON file with relevant parameters for the image. Here is an example:
{
"source_image": "54f49763-5f17-475e-9ad2-67dc8cd9a9ff",
"networks": "29fd620e-8145-43a2-8140-5cec6a69f344",
"flavor": "c4m4",
"floating_ip_network": "internet",
"ssh_username": "ubuntu",
"volume_type": "",
"kubernetes_deb_version": "1.30.6-1.1",
"kubernetes_rpm_version": "1.30.6",
"kubernetes_semver": "v1.30.6",
"kubernetes_series": "v1.30"
}
Then build the image like this:
cd images/capi
PACKER_VAR_FILES=var_file.json make build-openstack-ubuntu-2204
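Once the build finishes, you can verify that the image was uploaded to Xerces, for example:

openstack image list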
- Create application credentials for use by the OpenStack cloud provider.

# Note! If the user that created the credentials is removed,
# so are the application credentials!
openstack application credential create prow-capo-cloud-controller

The output of this command should give you an application credential ID and secret. Store them in a safe place. In the rest of this document they will be referred to as ${APP_CRED_ID} and ${APP_CRED_SECRET}.

- Set up S3 object storage.

openstack --os-interface public ec2 credentials create

Store the access key and secret key in a safe place. They will also be needed for deploying Prow. They will be referred to as ${S3_ACCESS_KEY} and ${S3_SECRET_KEY}.

- Generate an HMAC token.

openssl rand -hex 20

It will be referred to as ${HMAC_TOKEN}.

- Create a Jenkins token by logging in to Jenkins using the [email protected] account and adding an API token in the "Configure" tab for the user. It will be referred to as ${JENKINS_TOKEN}.

- Create bot accounts. The bot accounts are normal accounts on GitHub. Both of them have Gmail accounts connected to the GitHub accounts.

- Create a personal access token for the GitHub bot account. This should be done from the metal3-io-bot GitHub bot account. You can follow this link to create the token. When generating the token, make sure you have only the following scopes checked:
  - repo scope for full control of all repositories
  - workflow
  - admin:org_hook for a GitHub org

The token will be referred to as ${GITHUB_TOKEN}.

- Create a personal access token for the cherry-picker bot. This should be done from the metal3-cherrypick-bot GitHub bot account. You can follow this link to create the token. When generating the token, make sure you have only the following scopes checked:
  - repo
  - workflow
  - admin:org/read:org (not all of admin:org, just the sub-item read:org)

The token will be referred to as ${CHERRYPICK_TOKEN}.

- Create a GitHub webhook for https://prow.apps.test.metal3.io/hook using the HMAC token generated earlier. Add the URL and token as below. Select "Send me everything", and for Content type: application/json.
Files with credentials and other sensitive information are not stored in this repository. You will need to add them manually before you can apply any manifests and build the kustomizations. CAPO needs access to the OpenStack API and so does the external cloud-provider. Prow needs a GitHub token for accessing GitHub, an HMAC token for validating webhook requests, and S3 credentials for storing logs and similar.
If you are deploying from scratch or rotating credentials, please make sure to save them in a secure place after creating them. If there is an existing instance, you most likely just have to take the credentials from the secure place and generate the files below from them.
Please set the following environment variables with the relevant credentials. Then you will be able to just copy and paste the snippets below.
APP_CRED_ID
APP_CRED_SECRET
S3_ACCESS_KEY
S3_SECRET_KEY
HMAC_TOKEN
GITHUB_TOKEN
CHERRYPICK_TOKEN
JENKINS_TOKEN
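For example (all values are placeholders):

export APP_CRED_ID="replace-me"
export APP_CRED_SECRET="replace-me"
export S3_ACCESS_KEY="replace-me"
export S3_SECRET_KEY="replace-me"
export HMAC_TOKEN="replace-me"
export GITHUB_TOKEN="replace-me"
export CHERRYPICK_TOKEN="replace-me"
export JENKINS_TOKEN="replace-me"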
Now you are ready to create the files.
- Create clouds.yaml secret used by the CAPO controllers to manage the infrastructure.

cat > capo-cluster/clouds.yaml <<EOF
clouds:
  prow:
    auth:
      auth_url: https://xerces.ericsson.net:5000
      auth_type: v3applicationcredential
      application_credential_id: ${APP_CRED_ID}
      application_credential_secret: ${APP_CRED_SECRET}
      user_domain_name: xerces
    project_name: EST_Metal3_CI
    project_id: 51faa3170dfc4990b6654346c2bf2243
    region_name: RegionOne
EOF
- Create cloud.conf secret for the cloud provider integration.

cat > cluster-resources/cloud.conf <<EOF
[Global]
auth-url=https://xerces.ericsson.net:5000
application-credential-id=${APP_CRED_ID}
application-credential-secret=${APP_CRED_SECRET}
region=RegionOne
project-name=EST_Metal3_CI
project-id=51faa3170dfc4990b6654346c2bf2243
EOF

- Get the kubeconfig and save it as capo-cluster/kubeconfig.yaml.

- Create S3 credentials secret.

# Create service-account.json
cat > manifests/overlays/metal3/service-account.json <<EOF
{
  "region": "RegionOne",
  "access_key": "${S3_ACCESS_KEY}",
  "endpoint": "xerces.ericsson.net:7480",
  "insecure": false,
  "s3_force_path_style": true,
  "secret_key": "${S3_SECRET_KEY}"
}
EOF

- Create configuration file for s3cmd. This is used for creating the buckets. The command will create the file .s3cfg.

s3cmd --config .s3cfg --configure
# Provide Access key and Secret key from the openstack credentials created above.
# Leave Default Region as US.
# Set S3 Endpoint xerces.ericsson.net:7480
# Set DNS-style bucket+hostname:port template to the same xerces.ericsson.net:7480
# Default values for the rest. And save settings.

- Save the HMAC token as manifests/overlays/metal3/hmac-token.

echo "${HMAC_TOKEN}" > manifests/overlays/metal3/hmac-token

- Save the metal3-io-bot token as manifests/overlays/metal3/github-token.

echo "${GITHUB_TOKEN}" > manifests/overlays/metal3/github-token

- Save the metal3-cherrypick-bot token as manifests/overlays/metal3/cherrypick-bot-github-token.

echo "${CHERRYPICK_TOKEN}" > manifests/overlays/metal3/cherrypick-bot-github-token

- Save the Jenkins token as manifests/overlays/metal3/jenkins-token.

echo "${JENKINS_TOKEN}" > manifests/overlays/metal3/jenkins-token
For accessing an existing instance, you can simply get the relevant credentials and files from the password manager and put them in the correct places (see the section for generating secret files). Check the IP of the bastion in Xerces and set it in the environment variable BASTION_FLOATING_IP. After this you just need to set up a proxy for accessing the API through the bastion:
ssh -N -D 127.0.0.1:6443 "ubuntu@${BASTION_FLOATING_IP}"
The above command will just "hang". That is expected, since it is just forwarding the traffic.
In another terminal you should now be able to use kubectl to access the cluster:
export KUBECONFIG=capo-cluster/kubeconfig.yaml
kubectl get nodes
When deploying completely from scratch, you will need to first create the necessary GitHub bot accounts and webhook configuration. In addition, you need to create the credentials, generate secret files from them, and build node images (see sections above). You may also have to create a keypair with the Metal3 CI ssh key.
- Create a bootstrap cluster (kind).

kind create cluster
clusterctl init --infrastructure=openstack:v0.11.1 --core=cluster-api:v1.8.5 \
  --bootstrap=kubeadm:v1.8.5 --control-plane=kubeadm:v1.8.5

- Create the cluster.

kubectl apply -k capo-cluster

- Temporarily allow access from your public IP so that CAPI can access the cluster. See the section below about updating the bastion for more details.

# Check public IP
curl -s ifconfig.me
# Add it to spec.apiServerLoadBalancer.allowedCIDRs
kubectl edit openstackcluster prow

- Get the kubeconfig and set up a proxy for accessing the API through the bastion.

BASTION_FLOATING_IP="$(kubectl get openstackcluster prow -o jsonpath="{.status.bastion.floatingIP}")"
clusterctl get kubeconfig prow > capo-cluster/kubeconfig.yaml
export KUBECONFIG=capo-cluster/kubeconfig.yaml
kubectl config set clusters.prow.proxy-url socks5://localhost:6443
# In a separate terminal, set up proxy forwarding
ssh -N -D 127.0.0.1:6443 "ubuntu@${BASTION_FLOATING_IP}"

- Install the cloud-provider, cluster-autoscaler and CNI. See the CAPO book, the CSI Cinder plugin docs and the external-cloud-provider docs for more information.

kubectl apply -k cluster-resources

All nodes and pods should become ready at this point, except the cluster autoscaler, which needs the CAPI CRDs to be installed.

- Make the cluster self-hosted.

clusterctl init --infrastructure=openstack:v0.11.1 --core=cluster-api:v1.8.5 \
  --bootstrap=kubeadm:v1.8.5 --control-plane=kubeadm:v1.8.5
unset KUBECONFIG
clusterctl move --to-kubeconfig=capo-cluster/kubeconfig.yaml
export KUBECONFIG=capo-cluster/kubeconfig.yaml

- Remove the temporary access from your public IP and delete the kind cluster.

kubectl edit openstackcluster prow
kind delete cluster
- Add ingress-controller, ClusterIssuer and StorageClass.

kubectl apply -k infra

- Set up S3 object storage buckets.

s3cmd --config .s3cfg mb s3://prow-logs
s3cmd --config .s3cfg mb s3://tide

- Deploy Prow.

# Create the CRDs. These are applied separately and using server side apply
# since they are so huge that the "last applied" annotation that would be
# added with a normal apply, becomes larger than the allowed limit.
# TODO: This is pinned for now because of https://github.com/kubernetes-sigs/prow/issues/181.
kubectl apply --server-side=true -f https://raw.githubusercontent.com/kubernetes-sigs/prow/e67659d368fd013492a9ce038d801ba8998b7d10/config/prow/cluster/prowjob-crd/prowjob_customresourcedefinition.yaml
# Deploy all prow components
kubectl apply -k manifests/overlays/metal3

- Create config (updated automatically by Prow after this).

kubectl -n prow create configmap config --from-file=config/config.yaml
kubectl -n prow create configmap plugins --from-file=config/plugins.yaml
kubectl -n prow create configmap label-config --from-file=config/labels.yaml
kubectl -n prow create configmap job-config \
  --from-file=config/jobs/metal3-io \
  --from-file=config/jobs/Nordix \
  --from-file=config/jobs
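After this, a quick sanity check is to verify that the Prow components start and that the ConfigMaps exist, for example:

kubectl -n prow get pods
kubectl -n prow get configmaps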
Metal3 Prow is currently working with two GitHub organizations (orgs): metal3-io and Nordix. For Nordix we have enabled Prow only for two repositories, namely metal3-dev-tools and metal3-clusterapi-docs. We don't foresee enabling Metal3 Prow for any other GitHub org, but we might need to enable Prow for other repositories in the existing GitHub orgs. In any case, follow the steps below to enable Prow:
- Add/check the metal3-io-bot user in the GitHub org with admin access. Check the image.
- Enable the Prow webhook as described in the GitHub configuration section. For metal3-io the webhook is enabled at org level. For the two repositories in the Nordix org we have enabled them on the individual repositories. Keep in mind that the HMAC token and hook URL are the same as described in the GitHub configuration. The webhook should look happy (green tick), as shown in the image below, once you have configured it correctly and communication has been established between GitHub and the Prow hook. It should point to https://prow.apps.test.metal3.io/hook.
- Check the general settings and branch protection settings of the GitHub repository and make sure the settings are correct. Take any existing repository which has Prow enabled as an example (e.g. Nordix/metal3-dev-tools).
- Add the OWNERS file in the repository.
- Add the repository entry and related configurations in the files prow/config/config.yaml and prow/config/plugins.yaml in the metal3-io/project-infra repository. An example PR is here.
- One small tweak might still be needed. We have experienced that the default merge_method of Prow, which is merge, didn't work for Nordix repos. The other two options for merge_method are rebase and squash. We have enabled rebase for Nordix repos but kept merge for the metal3-io org. An example is shown in this PR, and a sketch of the configuration is shown below this list.
- At this point you should see the Prow tests you have configured as presubmits for the repository running on open PRs, and tide is enabled and waiting for appropriate labels.
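As a rough sketch of the merge_method setting in config.yaml (the repository is used only as an example), assuming the standard tide configuration keys:

tide:
  merge_method:
    Nordix/metal3-dev-tools: rebase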
This section is about recurring operations for keeping Prow and all of the related components up to date and healthy.
A general piece of advice about how to apply changes: it is a good idea to check what changes would be made before applying them, by using kubectl diff. It works with kustomizations just like the kubectl apply command:
kubectl diff -k <path-to-kustomization>
# Example
kubectl diff -k capo-cluster
Even when you think the change is straightforward, it is well worth checking first. Maybe you don't have the latest changes locally? Or perhaps someone made changes in the cluster without committing them to the manifests. A simple kubectl diff will save you from surprises many times.
When it is time to upgrade to a new Kubernetes version (or just update the node image), you will first need to build a new node image (see section above). After this you need to create new OpenStackMachineTemplates for the image and switch the KubeadmControlPlane and MachineDeployment to these new templates. If the Kubernetes version has changed you also need to update the Kubernetes version in the KubeadmControlPlane and MachineDeployment. The relevant files are kubeadmcontrolplane.yaml, machinedeployment.yaml and openstackmachinetemplate.yaml. Apply the changes and then create a PR with the changes.
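As an illustration (the names and version below are hypothetical), switching the control plane to a new template and Kubernetes version in kubeadmcontrolplane.yaml could look roughly like this; the MachineDeployment is updated in the same way:

apiVersion: controlplane.cluster.x-k8s.io/v1beta1
kind: KubeadmControlPlane
metadata:
  name: prow-control-plane
spec:
  version: v1.30.6                        # new Kubernetes version
  machineTemplate:
    infrastructureRef:
      apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
      kind: OpenStackMachineTemplate
      name: prow-control-plane-v1-30-6    # new template that references the new image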
# Check that the changes are as you expect
kubectl diff -k capo-cluster
# Apply
kubectl apply -k capo-cluster
Note: This operation may disrupt running prow jobs when the Node where they run is deleted. The jobs are not automatically restarted. Therefore it is best to do upgrades at times when no jobs are running.
Verify that the new Machines and Nodes successfully join the cluster and are healthy.
kubectl get machines
kubectl get nodes
If everything looks good, consider deleting older OpenStackMachineTemplates, but keep the last used templates in case a rollback is needed. Remember to also delete the images in Xerces when they are no longer needed.
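For example, you can list the templates, remove the ones that are no longer referenced, and clean up unused images in Glance:

kubectl get openstackmachinetemplates
kubectl delete openstackmachinetemplate <old-template-name>
openstack image list
openstack image delete <old-image-name>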
The bastion host just runs a normal Ubuntu image. When it needs to be updated, just upload the new image to Xerces. The tricky part is then to replace the bastion host. This is a sensitive operation since it could lock us out of the cluster if we are not careful. We can work around it by, for example, temporarily allowing direct access to the Kubernetes API from your own public IP. The same procedure (but without changing the image) can also be used when rebooting or doing other operations that could potentially cause issues.
Find your public IP:
curl ifconfig.me
Now edit the openstackcluster.yaml to add the IP to the allowed CIDRs:
---
spec:
apiServerLoadBalancer:
enabled: true
allowedCIDRs:
- 10.6.0.0/24
- <your-ip-here>/32
Apply the changes and try accessing the API directly to make sure it is working:
kubectl diff -k capo-cluster
kubectl apply -k capo-cluster
# Stop the port forward and then try to access it directly
kubectl config unset clusters.prow.proxy-url
kubectl cluster-info
We are now ready to update the bastion image (or do other disruptive operations, like rebooting). This is done by disabling the bastion, then changing the image and finally enabling it again. Edit the openstackcluster.yaml again so that spec.bastion.enabled is set to false and apply the changes. Next, edit the spec.bastion.image to the new image and set spec.bastion.enabled back to true, then apply again.
Once the new bastion is up, we must ensure that we can access the API through it. Add the proxy URL back and start the port-forward in a separate terminal:
# In a separate terminal, set up proxy forwarding
BASTION_FLOATING_IP="$(kubectl get openstackcluster prow -o jsonpath="{.status.bastion.floatingIP}")"
ssh -N -D 127.0.0.1:6443 "ubuntu@${BASTION_FLOATING_IP}"
# Then add back the proxy-url and check that it is working
kubectl config set clusters.prow.proxy-url socks5://localhost:6443
kubectl cluster-info
Finally, remove your public IP from the allowed CIDRs and apply again.
If the worst would happen and we lock ourselves out, it is possible to modify the OpenStack resources directly to fix it:
openstack loadbalancer listener unset --allowed-cidrs <listener ID>
Remember to delete old images when they are no longer needed.
If a Node fails for some reason, e.g. it is overwhelmed by workload that uses up all the CPU resources, then it can be hard to get back to a working state. Generally, the procedure is to delete the Machine corresponding to the failed Node.
kubectl delete machine <name-of-machine>
However, this process will not progress if the Node is unresponsive. The reason for this is that it is impossible to drain it since it doesn't respond.
It is possible to tell Cluster API to proceed if the drain doesn't succeed within some time limit, but this is risky if there is workload that cannot tolerate multiple replicas or similar. It could still be running on the Node even though the kubelet is failing to report its status. Fortunately we can inspect the Node and see what is going on and even shut it down using the OpenStack API if needed.
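For example (the server name is a placeholder), the OpenStack CLI can be used to inspect the instance and shut it down:

openstack server list
openstack console log show <server-name>
openstack server stop <server-name>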
Once you are relatively sure that it is safe to remove it, edit the Machine to add nodeDrainTimeout and nodeVolumeDetachTimeout. Example:
apiVersion: cluster.x-k8s.io/v1beta1
kind: Machine
metadata:
name: machine-name
namespace: default
spec:
nodeDrainTimeout: 10m
nodeVolumeDetachTimeout: 15m
After this, the Machine and Node will be deleted. A new Machine and Node will be created automatically to replace them.
We run the cluster-autoscaler for automatically scaling the MachineDeployment that the tests run on. This means that you should NOT scale the MachineDeployment manually.
The autoscaler is configured through annotations on the MachineDeployment. These annotations set limits for how many replicas there should be. For example:
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineDeployment
metadata:
name: prow-md-0
annotations:
cluster.x-k8s.io/cluster-api-autoscaler-node-group-min-size: "1"
cluster.x-k8s.io/cluster-api-autoscaler-node-group-max-size: "5"
For more details on how the autoscaler works with Cluster API, check these docs.
If there is any issue with the autoscaler, remove the annotations. This will stop it from making changes so that the MachineDeployment can be manually scaled instead.
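For example, since the MachineDeployment is named prow-md-0, the annotations can be removed like this (a trailing dash removes the annotation):

kubectl annotate machinedeployment prow-md-0 \
  cluster.x-k8s.io/cluster-api-autoscaler-node-group-min-size- \
  cluster.x-k8s.io/cluster-api-autoscaler-node-group-max-size-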
Scaling the MachineDeployment manually is then as easy as this:
kubectl scale md prow-md-0 --replicas=3
Here prow-md-0 is the name of the MachineDeployment and 3 is the desired number of replicas. To check the status of the MachineDeployment use kubectl get md.