All documentation in these guides assumes you have already downloaded both the Azure CLI and `aks-engine`. Follow the quickstart guide before continuing.
This guide assumes you have already deployed a cluster using `aks-engine`. For more details on how to do that, see the deploy documentation.
This document provides guidance on how to upgrade the Kubernetes version for an existing AKS Engine cluster and recommendations for adopting `aks-engine upgrade` as a tool.
In order to ensure that your `aks-engine upgrade` operation runs smoothly, there are a few things you should be aware of before getting started.
- You will need access to the API model (`apimodel.json`) that was generated by `aks-engine deploy` or `aks-engine generate` (by default this file is placed into a relative directory that looks like `_output/<clustername>/`).
- `aks-engine upgrade` expects an API model that conforms to the current state of the cluster. In other words, the Azure resources inside the resource group deployed by `aks-engine` should be in the same state as when they were originally created by `aks-engine`. If you perform manual operations on your Azure IaaS resources (other than successful `aks-engine scale`, `aks-engine update`, or `aks-engine upgrade` operations) DO NOT use `aks-engine upgrade`, as the aks-engine-generated ARM template won't be reconcilable against the state of the Azure resources that reside in the resource group. Some examples of manual operations that will prevent upgrade from working successfully:
  - renaming resources
  - executing follow-up CustomScriptExtensions against VMs after a cluster has been created: a VM or VMSS instance may only have a single CustomScriptExtension attached to it; follow-up CustomScriptExtension operations will essentially "replace" the CustomScriptExtension defined by aks-engine at cluster creation time, and `aks-engine upgrade` will not be able to recognize the VM resource.

  `aks-engine upgrade` relies on some resources (such as VMs) to be named in accordance with the original `aks-engine` deployment. In summary, the set of Azure resources in the resource group is mutually reconcilable by `aks-engine upgrade` only if it has been exclusively created and managed as the result of a series of successive ARM template deployments originating from various AKS Engine commands that have run to completion successfully.
- `aks-engine upgrade` allows upgrading the Kubernetes version to any AKS Engine-supported patch release in the current minor release channel that is greater than the current version on the cluster (e.g., from `1.21.4` to `1.21.5`), or to the next AKS Engine-supported minor version (e.g., from `1.21.5` to `1.22.2`). (Or, see `aks-engine upgrade --force` if you want to bypass AKS Engine "supported version requirements".) In practice, the next AKS Engine-supported minor version will commonly be a single minor version ahead of the current cluster version. However, if the cluster has not been upgraded in a significant amount of time, the "next" minor version may no longer be supported by aks-engine. In such a case, your long-lived cluster will be upgradable to the nearest minor version that `aks-engine` supports at the time of upgrade (e.g., from `1.17.18` to `1.19.15`). To get the list of all available Kubernetes versions and upgrades, run the `get-versions` command: `aks-engine get-versions`. To get the versions of Kubernetes that your particular cluster version is upgradable to, provide its current Kubernetes version in the `--version` arg: `aks-engine get-versions --version 1.19.14`.
- `aks-engine upgrade` relies upon a working connection to the cluster control plane during upgrade, both (1) to validate successful upgrade progress, and (2) to cordon and drain nodes before upgrading them, in order to minimize operational downtime of any running cluster workloads. If you are upgrading a private cluster, you must run `aks-engine upgrade` from a host VM that has network access to the control plane, for example a jumpbox VM that resides in the same VNET as the master VMs. For more information on private clusters, refer to this documentation.
- If using `aks-engine upgrade` in production, it is recommended to stage an upgrade test on a cluster that was built to the same specifications (built with the same cluster configuration + the same version of the `aks-engine` command line tool) as your production cluster before performing the upgrade, especially if the cluster configuration is "interesting", or in other words differs significantly from defaults. The reason for this is that AKS Engine supports many different cluster configurations and the extent of E2E testing that the AKS Engine team runs cannot practically cover every possible configuration. Therefore, it is recommended that you ensure in a staging environment that your specific cluster configuration is upgradable using `aks-engine upgrade` before attempting this potentially destructive operation on your production cluster.
- `aks-engine upgrade` is backwards compatible. If you deployed with `aks-engine` version `0.27.x`, you can run upgrade with version `0.29.y`. In fact, it is recommended that you use the latest available `aks-engine` version when running an upgrade operation. This will ensure that you get the latest available software and bug fixes in your upgraded cluster.
- `aks-engine upgrade` will automatically re-generate your cluster configuration to best pair with the desired new version of Kubernetes, and/or the version of `aks-engine` that is used to execute `aks-engine upgrade`. To use an example of both:
  - When you upgrade to (for example) Kubernetes 1.21 from 1.20, AKS Engine will automatically change your control plane configuration (e.g., `coredns`, `metrics-server`, `kube-proxy`) so that the cluster component configurations have a close, known-working affinity with 1.21.
  - When you perform an upgrade, even if it is a Kubernetes patch release upgrade such as 1.21.4 to 1.21.5, but you use a newer version of `aks-engine`, a newer version of `etcd` (for example) may have been validated and configured as default since the version of `aks-engine` used to build the cluster was released. So, for example, without any explicit user direction, the newly upgraded cluster will now be running etcd v3.2.26 instead of v3.2.25. This is by design.

In summary, using `aks-engine upgrade` means you will freshen and re-pave the entire stack that underlies Kubernetes to reflect the best-known, recent implementation of Azure IaaS + OS + OS config + Kubernetes config.
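Before choosing an `--upgrade-version`, it can help to confirm which Kubernetes version your API model currently records. This is not a step required by `aks-engine upgrade`, just a convenience; the sketch below assumes the `jq` tool is installed and that your API model lives at the default `_output/mycluster/apimodel.json` path.

```sh
# Print the Kubernetes version currently recorded in the generated API model
jq -r '.properties.orchestratorProfile.orchestratorVersion' \
  _output/mycluster/apimodel.json

# List the versions this cluster can be upgraded to
# (replace 1.19.14 with the version printed above)
aks-engine get-versions --version 1.19.14
```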
| Parameter | Required | Description |
|---|---|---|
| `--api-model` | yes | Relative path to the API model (cluster definition) that declares the desired cluster configuration. |
| `--kubeconfig` | no | Path to kubeconfig; if not provided, it will be generated on the fly from the API model data. |
| `--upgrade-version` | yes | Version of Kubernetes to upgrade to. |
| `--force` | no | Force upgrading the cluster to the desired version, regardless of version support. Allows same-version upgrades and downgrades. |
| `--control-plane-only` | no | Upgrade control plane VMs only, do not upgrade node pools (unsupported on air-gapped clouds). |
| `--cordon-drain-timeout` | no | How long to wait for each VM to be cordoned, in minutes (default -1, i.e., no timeout). |
| `--vm-timeout` | no | How long to wait for each VM to be upgraded, in minutes (default -1, i.e., no timeout). |
| `--upgrade-windows-vhd` | no | Upgrade the image reference of all Windows nodes to a new AKS Engine-validated image, if available (default is true). |
| `--azure-env` | no | The target Azure cloud to deploy to (default "AzurePublicCloud"). |
| `--subscription-id` | yes | The subscription id the cluster is deployed in. |
| `--resource-group` | yes | The resource group the cluster is deployed in. |
| `--location` | yes | The location to deploy to. |
| `--client-id` | depends | The Service Principal Client ID. Required if the auth-method is set to `client_secret` or `client_certificate`. |
| `--client-secret` | depends | The Service Principal Client secret. Required if the auth-method is set to `client_secret`. |
| `--certificate-path` | depends | The path to the file which contains the client certificate. Required if the auth-method is set to `client_certificate`. |
| `--identity-system` | no | Identity system (default is `azure_ad`). |
| `--auth-method` | no | The authentication method used. Default value is `client_secret`. Other supported values are: `cli`, `client_certificate`, and `device`. |
| `--private-key-path` | no | Path to private key (used with `--auth-method=client_certificate`). |
| `--language` | no | Language to return error messages in. Default value is "en-us". |
During the upgrade, aks-engine successively visits virtual machines that constitute the cluster (first the master nodes, then the agent nodes) and performs the following operations:
Control plane nodes:
- cordon the node and drain existing workloads
- delete the VM
- create new VM and install desired Kubernetes version
- add the new VM to the cluster (custom annotations, labels, taints, etc. are retained automatically)
Agent nodes:
- create new VM and install desired Kubernetes version
- add the new VM to the cluster
- evict any pods that might be scheduled onto this node by Kubernetes before copying custom node properties
- copy the custom annotations, labels, and taints of the old node to the new node
- cordon the old node and drain existing workloads
- delete the VM
Once you have read all the requirements, run `aks-engine upgrade` with the appropriate arguments:

```sh
./bin/aks-engine upgrade \
  --subscription-id <subscription id> \
  --api-model <generated apimodel.json> \
  --location <resource group location> \
  --resource-group <resource group name> \
  --upgrade-version <desired Kubernetes version>
```
For example,
```sh
./bin/aks-engine upgrade \
  --subscription-id xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx \
  --api-model _output/mycluster/apimodel.json \
  --location westus \
  --resource-group test-upgrade \
  --upgrade-version 1.21.5
```
If you use Key Vault for secrets, you must specify a local kubeconfig file to connect to the cluster because aks-engine is currently unable to read secrets from a Key Vault during an upgrade.
```sh
./bin/aks-engine upgrade \
  --api-model _output/mycluster/apimodel.json \
  --location westus \
  --resource-group test-upgrade \
  --upgrade-version 1.21.5 \
  --kubeconfig ./path/to/kubeconfig.json
```
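While an upgrade is running, it can be useful to watch the cluster from a separate terminal. This is not part of `aks-engine upgrade` itself, just a convenient observation step; the sketch below assumes `kubectl` is configured against the cluster being upgraded.

```sh
# Nodes are replaced one at a time; the VERSION column shows which nodes have
# already been moved to the target Kubernetes version.
kubectl get nodes -o wide --watch
```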
The upgrade operation is a long-running, successive set of ARM deployments, and for large clusters is more susceptible to one of those deployments failing. This follows from the design principle of upgrade enumerating, one at a time, through each node in the cluster: a transient Azure resource allocation error could thus interrupt the successful progression of the overall transaction. At present, the upgrade operation is implemented to "fail fast"; if a well-formed upgrade operation fails before completing, it can be manually retried by invoking the exact same command line arguments as were sent originally. The upgrade operation will enumerate through the cluster nodes, skipping any nodes that have already been upgraded to the desired Kubernetes version. Those nodes that match the original Kubernetes version will then, one at a time, be cordoned, drained, and upgraded to the desired version. Put another way, an upgrade command is designed to be idempotent across retry scenarios.
At this time, we don't recommend using `aks-engine upgrade` on clusters running the `cluster-autoscaler` addon that have Availability Set (non-VMSS) node pools.
The upgrade operation takes an optional `--force` argument:

```
-f, --force   force upgrading the cluster to desired version. Allows same version upgrades and downgrades.
```
In some situations, you might want to bypass the AKS Engine validation of your API model versions and cluster node versions. This is at your own risk and you should assess the potential harm of using this flag.
The `--force` parameter instructs the upgrade process to:

- bypass the usual version validation
- include all of your cluster's nodes (masters and agents) in the upgrade process; nodes that are already on the target version will not be skipped
- allow any Kubernetes version, including non-supported or deprecated versions
- accept downgrade operations
Note: If you pass in a version that AKS Engine literally cannot install (e.g., a version of Kubernetes that does not exist), you may break your cluster.
For each node, the cluster will follow the same process described in the Under the hood section above.
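As one illustration of what `--force` permits, the sketch below forces a same-version upgrade, which re-paves every node at the version the cluster is already running. The subscription, resource group, and version values are placeholders in the style of the earlier examples, not values to copy verbatim.

```sh
# Forced same-version upgrade: every node (masters and agents) is rebuilt,
# even though the Kubernetes version does not change.
./bin/aks-engine upgrade \
  --subscription-id xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx \
  --api-model _output/mycluster/apimodel.json \
  --location westus \
  --resource-group test-upgrade \
  --upgrade-version 1.21.5 \
  --force
```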
Note that `aks-engine upgrade` is not a tool for modifying cluster configuration: it was designed to exclusively update the Kubernetes version running on a cluster, without affecting any other cluster config (especially IaaS resources). Because under the hood `aks-engine upgrade` is actually removing and adding new VMs, various configuration changes may be delivered to the new VMs (such as the VM size), but these changes should be considered experimental and thoroughly tested in a staging environment before being integrated into a production workflow. Specifically, changes to the VNET, Load Balancer, and other network-related configuration are not supported as modifiable by `aks-engine upgrade`. If you need to change the Load Balancer config, for example, you will need to build a new cluster.
We actually recommend that you only use `aks-engine upgrade --control-plane-only`. There are a few reasons:

- The `aks-engine upgrade` workflow has been implemented in such a way that assumes the underlying nodes are pets, and not cattle. Each node is carefully accounted for during the operation, and every effort is made to "put the cluster back together" as if the nodes simply went away for a few minutes, but then came back. (This is in fact not what's happening under the hood, as the original VMs are destroyed and replaced with entirely new VMs; only the data disks are actually preserved.) Such an approach is appropriate for control plane VMs, because they are actually defined by AKS Engine as more or less static resources. However, individual worker nodes are not statically defined: the nodes participating in a cluster are designed to be ephemeral in response to changing operational realities. `aks-engine upgrade` does its best to minimize operational cluster downtime, but there will be some amount of interruption due to the fact that VMs are deleted, then added, behind a distributed control plane (we're assuming you're running 3 or 5 control plane VMs). Because a small amount of disruption is unavoidable given the architectural constraints of `aks-engine upgrade`, it is more suitable to absorb that disruption in the control plane, which is probably not user-impacting (unless your users are Kubernetes cluster administrators!). You may be able to afford a small maintenance window to update your control plane, while your existing production workloads continue to serve traffic reliably. Of course production traffic is not static, and any temporary control plane unavailability will disrupt the dynamic attributes of your cluster that ultimately serve user traffic. We do recommend upgrading the control plane during an appropriate time when it is more preferable for your cluster to be put into a "static" mode.
- A Kubernetes cluster is likely to run a variety of production workloads, each with its own requirements for downtime maintenance. Running a cluster-wide operation like `aks-engine upgrade` has the result of forcing you to schedule a maintenance window for your control plane and all production environments simultaneously.
- More flexible node pool-specific tooling is available to upgrade the various parts of your production-serving nodes. See the addpool, update, and scale documentation to help you develop cluster workflows for managing node pools distinct from the control plane.
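For reference, a control plane-only upgrade uses the same arguments as a full upgrade plus the `--control-plane-only` flag. The values below are placeholders in the style of the earlier examples, not values to copy verbatim.

```sh
# Upgrade only the control plane VMs; agent node pools are left untouched.
./bin/aks-engine upgrade \
  --subscription-id xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx \
  --api-model _output/mycluster/apimodel.json \
  --location westus \
  --resource-group test-upgrade \
  --upgrade-version 1.22.2 \
  --control-plane-only
```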
`aks-engine upgrade --control-plane-only` is not expected to work as intended on air-gapped clouds. Unless you are forcing an upgrade to the current orchestrator version, `aks-engine upgrade --control-plane-only` on air-gapped clouds is expected to break the `kube-proxy` daemonset, as agents will be required to pull the newer `kube-proxy` container image. A cluster can be recovered from this bad state by running a full cluster upgrade (control plane and agents).
tl;dr Upgrade your control plane first!
If, following our guidance, you employ `aks-engine upgrade --control-plane-only` to upgrade your control plane distinctly from your worker nodes, and a combination of `aks-engine addpool` and `aks-engine update` to upgrade worker nodes, the natural question is: which should I do first?

The Kubernetes project documents that the control plane may be up to two minor versions ahead of kubelet, but not vice versa. What this means is that you should not run a newer version of Kubernetes on a node than is running on the control plane; see the Kubernetes version skew policy documentation for details. Another example from the kubeadm community project outlines its upgrade process, which specifies upgrading the control plane first.
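A quick way to sanity-check version skew before touching worker node pools is to compare node versions with `kubectl`. This is only an observational aid; the `node-role.kubernetes.io/master` label selector below is an assumption about how control plane nodes are labeled in your cluster, and should be verified first.

```sh
# Show the Kubernetes version reported by every node (control plane and workers)
kubectl get nodes -o wide

# Assumed label: list only control plane nodes to confirm they are already on the
# newer version before upgrading worker node pools
kubectl get nodes -l node-role.kubernetes.io/master
```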
Can I use `aks-engine upgrade --control-plane-only` to change the control plane configuration irrespective of updating the Kubernetes version?

Yes, but with caveats. Essentially you may use the `aks-engine upgrade --control-plane-only` functionality to replace your control plane VMs, one at a time, with newer VMs rendered from updated API model configuration. You should always stage such changes, however, by building a staging cluster (reproducing at a minimum the version of `aks-engine` used to build your production cluster, and the API model JSON used as input; in a best-case scenario it will be in the same location as well). Here are a few useful possibilities that will work:
- Updating the VM SKU by changing the `properties.masterProfile.vmSize` value
- Certain configurable/tuneable kubelet properties in `properties.masterProfile.kubernetesConfig.kubeletConfig`, e.g.:
  - `"--feature-gates"`
  - `"--node-status-update-frequency"`
  - `"--pod-max-pids"`
  - `"--register-with-taints"`
  - `"--image-gc-high-threshold"` or `"--image-gc-low-threshold"`

  Generally, don't change any listening ports or filepaths, as those may have static dependencies elsewhere.
- Again, certain configurable/tuneable component properties in:
  - `properties.orchestratorProfile.kubernetesConfig.controllerManagerConfig`
  - `properties.orchestratorProfile.kubernetesConfig.cloudControllerManagerConfig`
  - `properties.orchestratorProfile.kubernetesConfig.apiServerConfig`
  - `properties.orchestratorProfile.kubernetesConfig.schedulerConfig`
- Control plane VM runtime kernel configuration via `properties.masterProfile.kubernetesConfig.sysctldConfig`
You may not change the following values; doing so may break your cluster!

- DO NOT CHANGE the number of VMs in your control plane via `masterProfile.count`
- DO NOT CHANGE the static IP address range of your control plane via `masterProfile.firstConsecutiveStaticIP`

These types of configuration changes are advanced; only do this if you're a confident, expert Kubernetes cluster administrator!
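To tie the above together, the sketch below shows one way to roll out a control plane configuration change (for example, a new `properties.masterProfile.vmSize`) without moving to a new Kubernetes version. Keeping the current version requires `--force`, since same-version upgrades are otherwise rejected; the version and resource values shown are assumptions in the style of the earlier examples, and the whole flow should be validated on a staging cluster first.

```sh
# 1. Edit the API model (e.g., change properties.masterProfile.vmSize), then:
# 2. Re-pave only the control plane at the version the cluster already runs
#    (--force allows the same-version upgrade).
./bin/aks-engine upgrade \
  --subscription-id xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx \
  --api-model _output/mycluster/apimodel.json \
  --location westus \
  --resource-group test-upgrade \
  --upgrade-version 1.21.5 \
  --control-plane-only \
  --force
```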
If any of the upgraded nodes is not able to reach the `Ready` state, then `aks-engine upgrade` will exit with a message similar to the one below:

```
Error validating upgraded master VM: k8s-master-12345678-0
Error: upgrading cluster: Node was not ready within 20m0s
```

There are a variety of reasons why cluster nodes might not be able to come back online after an upgrade. Looking at the kubelet logs (`sudo journalctl -u kubelet`) may be a good first investigative step.
Once you identify the problem and update your API model, you can recreate the `NotReady` node by (1) changing the `orchestrator` tag of the virtual machine so it does not match the target upgrade version, and (2) retrying `aks-engine upgrade`. Failing to update the `orchestrator` tag will result in `aks-engine upgrade` treating the `NotReady` node as already upgraded, and consequently ignoring it and moving on to the next node.
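The commands below sketch that recovery flow. The `az vm update` invocation and the `Kubernetes:<version>` tag value format are assumptions for illustration; verify the actual tag value on a healthy, not-yet-upgraded VM in your resource group before changing anything.

```sh
# SSH to the NotReady node and inspect kubelet logs for the root cause
sudo journalctl -u kubelet --no-pager | tail -n 200

# Hypothetical example: reset the orchestrator tag on the affected VM so that a
# retried `aks-engine upgrade` no longer treats it as already upgraded. Set it to
# any value that does not match the target upgrade version (e.g., the prior version).
az vm update \
  --resource-group test-upgrade \
  --name k8s-master-12345678-0 \
  --set "tags.orchestrator=Kubernetes:1.21.4"

# Then retry the original `aks-engine upgrade` command with the exact same arguments.
```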