36. Multi-cluster

Date: 2022-09-16

Status

✅ Accepted

Context

At the time of writing (16th September, 2022); Cloud Platform hosts user applications in a single Kubernetes cluster. On 12th April, 2022, we decommissioned the old kOps cluster in favour of utilising EKS (see ADR-022 - EKS).

At the time of decommissioning the kOps cluster, Cloud Platform hosted 409 namespaces in the new EKS-based Kubernetes cluster.

The key reasoning behind utilising a single Kubernetes cluster for service teams is described in ADR-021, namely:

Strong isolation is already required between apps from different teams (via namespaces, network policies), so there is no difference for isolating environments
Maintaining clusters for each environment is a cost in effort
You risk the clusters diverging. So you might miss problems when testing on the dev/staging clusters, because they aren't the same as prod.

However, outside of service teams applications; the Cloud Platform does already use multiple clusters for 'management', and ephemeral 'test' purposes, and further information regarding multi-cluster and its challenges are described in ADR-021.

The Cloud Platform team's use of EKS has matured and usage of the EKS-based cluster has grown since April from 409 namespaces, to 457 at the time of writing and will continue to grow in the future. Due to this, we believe now is a good time to start separating out workloads into a multi-cluster environment, as the nuances and costs of EKS are more widely understood.

It is worth noting that the current cluster (at the time of writing) is not considered a "large cluster" by Kubernetes, which stipulates these considerations for a "large cluster":

No more than 110 pods per node
No more than 5000 nodes
No more than 150000 total pods
No more than 300000 total containers

The Cloud Platform is not near these limits.

Decision

We have decided to:

adopt a multi-cluster approach
start adopting multi-cluster by splitting out workloads by environment type, i.e. production and non-production
treat the non-production cluster as a second production cluster within the team, though there may be slight differences such as:
- compute power, where non-production workloads don't require as intensive power as production workloads
- hours of operation, for e.g. we may turn off the non-production cluster out of hours or on weekends if there is appetite to do so with our users to save on cost
we will allow users to self-select their preferred environment initially, though this may change or automate in the future

Consequences

By running multiple clusters, we can:

derisk upgrading Kubernetes versions as non-production workloads will be separate from production workloads
reduce the blast radius by creating a hard separation between clusters
reduce the cluster single-points of failure, such as ingress controllers
increase our separation outside of Kubernetes between workloads
improve our resilience
we can customise the non-production cluster to meet non-production needs, and the production cluster to meet production needs

However, this does come at both a monetary and team cost:

we expect the cost of running the Cloud Platform to increase, not significantly but not insignificantly
there will be less efficient resource usage as some nodes may not fill up
there will be an increased complexity of cluster management

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

036-multi-cluster.md

036-multi-cluster.md

36. Multi-cluster

Status

Context

Decision

Consequences

Files

036-multi-cluster.md

Latest commit

History

036-multi-cluster.md

File metadata and controls

36. Multi-cluster

Status

Context

Decision

Consequences