Everyone knows that stateful workloads like PostgreSQL cannot run in Kubernetes. Why do you say the contrary?
An independent research survey commissioned by the Data on Kubernetes Community in September 2021 revealed that half of the respondents run most of their production workloads on Kubernetes. 90% of them believe that Kubernetes is ready for stateful workloads, and 70% of them run databases in production. Databases like Postgres. However, according to them, significant challenges remain, such as the knowledge gap (Kubernetes and Cloud Native, in general, have a steep learning curve) and the quality of Kubernetes operators. The latter is the reason why we believe that an operator like CloudNativePG highly contributes to the success of your project.
For database fanatics like us, a real game-changer has been the introduction of the support for local persistent volumes in Kubernetes 1.14 in April 2019.
CloudNativePG is built on immutable application containers. What does it mean?
According to the microservice architectural pattern, a container is designed to run a single application or process. As a result, such container images are built to run the main application as the single entry point (the so-called PID 1 process).
In Kubernetes terms, the application is referred to as workload. Workloads can be stateless like a web application server or stateful like a database. Mapping this concept to PostgreSQL, an immutable application container is a single "postgres" process that is running and tied to a single and specific version - the one in the immutable container image.
No other processes such as SSH or systemd, or syslog are allowed.
Immutable Application Containers are in contrast with Mutable System Containers, which are still a very common way to interpret and use containers.
Immutable means that a container won't be modified during its life: no updates, no patches, no configuration changes. If you must update the application code or apply a patch, you build a new image and redeploy it. Immutability makes deployments safer and more repeatable.
For more information, please refer to "Why EDB chose immutable application containers".
What does Cloud Native mean?
The Cloud Native Computing Foundation defines the term "Cloud Native". However, since the start of the Cloud Native PostgreSQL/CloudNativePG operator at 2ndQuadrant, the development team has been interpreting Cloud Native as three main concepts:
- An existing, healthy, genuine, and prosperous DevOps culture, founded on people, as well as principles and processes, which enables teams and organizations (as teams of teams) to continuously change so to innovate and accelerate the delivery of outcomes and produce value for the business in safer, more efficient, and more engaging ways
- A microservice architecture that is based on Immutable Application Containers
- A way to manage and orchestrate these containers, such as Kubernetes
Currently, the standard de facto for container orchestration is Kubernetes, which automates the deployment, administration and scalability of Cloud Native Applications.
Another definition of Cloud Native that resonates with us is the one defined by Ibryam and Huß in "Kubernetes Patterns", published by O'Reilly:
Principles, Patterns, Tools to automate containerized microservices at scale
Can I run CloudNativePG on bare metal Kubernetes?
Yes, definitely. You can run Kubernetes on bare metal. And you can dedicate one or more physical worker nodes with locally attached storage to PostgreSQL workloads for maximum and predictable I/O performance.
The actual Cloud Native PostgreSQL project, from which CloudNativePG originated, was born after a pilot project in 2019 that benchmarked storage and PostgreSQL on the same bare metal server, first directly in Linux, and then inside Kubernetes. As expected, the experiment showed only negligible performance impact introduced by the container running in Kubernetes through local persistent volumes, allowing the Cloud Native initiative to continue.
Why should I use PostgreSQL replication instead of file system replication?
Please read the "Architecture: Synchronizing the state" section.
Why should I use an operator instead of running PostgreSQL as a container?
The most basic approach to running PostgreSQL in Kubernetes is to have a pod, which is the smallest unit of deployment in Kubernetes, running a Postgres container with no replica. The volume hosting the Postgres data directory is mounted on the pod, and it usually resides on network storage. In this case, Kubernetes restarts the pod in case of a problem or moves it to another Kubernetes node.
The most sophisticated approach is to run PostgreSQL using an operator. An operator is an extension of the Kubernetes controller and defines how a complex application works in business continuity contexts. The operator pattern is currently state of the art in Kubernetes for this purpose. An operator simulates the work of a human operator in an automated and programmatic way.
Postgres is a complex application, and an operator not only needs to deploy a cluster (the first step), but also properly react after unexpected events. The typical example is that of a failover.
An operator relies on Kubernetes for capabilities like self-healing, scalability, replication, high availability, backup, recovery, updates, access, resource control, storage management, and so on. It also facilitates the integration of a PostgreSQL cluster in the log management and monitoring infrastructure.
CloudNativePG enables the definition of the desired state of a PostgreSQL cluster via declarative configuration. Kubernetes continuously makes sure that the current state of the infrastructure matches the desired one through reconciliation loops initiated by the Kubernetes controller. If the desired state and the actual state don't match, reconciliation loops trigger self-healing procedures. That's where an operator like CloudNativePG comes into play.
Are there any other operators for Postgres out there?
Yes, of course. And our advice is that you look at all of them and compare them with CloudNativePG before making your decision. You will see that most of these operators use an external failover management tool (Patroni or similar) and rely on StatefulSets.
Here is a non exhaustive list, in chronological order from their publication on GitHub:
- Crunchy Data Postgres Operator (2017)
- Zalando Postgres Operator (2017)
- Stackgres (2020)
- Percona Operator for PostgreSQL (2021)
- Kubegres (2021)
Feel free to report any relevant missing entry as a PR.
!!! Info The Data on Kubernetes Community (which includes some of our maintainers) is working on an independent and vendor neutral project to list the operators called Operator Feature Matrix.
You say that CloudNativePG is a fully declarative operator. What do you mean by that?
The easiest way is to explain declarative configuration through an example that highlights the differences with imperative configuration. In an imperative context, the state is defined as a series of tasks to be executed in sequence. So, we can get a three-node PostgreSQL cluster by creating the first instance, configuring the replication, cloning a second instance, and the third one.
In a declarative approach, the state of a system is defined using configuration, namely: there's a PostgreSQL 13 cluster with two replicas. This approach highly simplifies change management operations, and when these are stored in source control systems like Git, it enables the Infrastructure as Code capability. And Kubernetes takes it farther than deployment, as it makes sure that our request is fulfilled at any time.
What are the required skills to run PostgreSQL on Kubernetes?
Running PostgreSQL on Kubernetes requires both PostgreSQL and Kubernetes skills in your DevOps team. The best experience is when database administrators familiarize themselves with Kubernetes core concepts and are able to interact with Kubernetes administrators.
Our advice is for everyone that wants to fully exploit Cloud Native PostgreSQL to acquire the "Certified Kubernetes Administrator (CKA)" status from the CNCF certification program.
Why isn't CloudNativePG using StatefulSets?
CloudNativePG does not rely on StatefulSet
resources, and
instead manages the underlying PVCs directly by leveraging the selected
storage class for dynamic provisioning. Please refer to the
"Custom Pod Controller" section for details and reasons behind
this decision.
What happens to the PostgreSQL clusters when the operator pod dies or it is not available for a certain amount of time?
The CloudNativePG operator, among other things, is responsible for self-healing capabilities. As such, they might not be available during an outage of the operator.
However, assuming that the outage does not affect the nodes where PostgreSQL clusters are running, the database will continue to serve normal operations, through the relevant Kubernetes services. Moreover, the instance manager, which runs inside each PostgreSQL pod will still work, making sure that the database server is up, including accessory services like logging, export of metrics, continuous archiving of WAL files, etc.
To summarize:
an outage of the operator does not necessarily imply a PostgreSQL database outage; it's like running a database without a DBA or system administrator.
What are the reasons behind CloudNativePG not relying on a failover management tool like Patroni, repmgr, or Stolon?
Although part of the team that develops CloudNativePG has been heavily involved in repmgr in the past, we decided to take a different approach and directly extend the Kubernetes controller and rely on the Kubernetes API server to hold the status of a Postgres cluster, and use it as the only source of truth to:
- control High Availability of a Postgres cluster primarily via automated failover and switchover, coordinating itself with the instance manager
- control the Kubernetes services, that is the entry points for your applications
Should I manually resync a former primary with the new one following a failover?
No. The operator does that automatically for you, and relies on pg_rewind
to
synchronize the former primary with the new one.
Why should I use PostgreSQL?
We believe that PostgreSQL is the equivalent in the database area of what Linux represents in the operating system space. The current latest major version of Postgres is version 16, which ships out of the box:
- native streaming replication, both physical and logical
- continuous hot backup and point in time recovery
- declarative partitioning for horizontal table partitioning, which is a very well-known technique in the database area to improve vertical scalability on a single instance
- extensibility, with extensions like PostGIS for geographical databases
- parallel queries for vertical scalability
- JSON support, unleashing the multi-model hybrid database for both structured and unstructured data queried via standard SQL
And so on ...
How many databases should be hosted in a single PostgreSQL instance?
Our recommendation is to dedicate a single PostgreSQL cluster (intended as primary and multiple standby servers) to a single database, entirely managed by a single microservice application. However, by leveraging the "postgres" superuser, it is possible to create as many users and databases as desired (subject to the available resources).
The reason for this recommendation lies in the Cloud Native concept, based on microservices. In a pure microservice architecture, the microservice itself should own the data it manages exclusively. These could be flat files, queues, key-value stores, or, in our case, a PostgreSQL relational database containing both structured and unstructured data. The general idea is that only the microservice can access the database, including schema management and migrations.
CloudNativePG has been designed to work this way out of the box, by default creating an application user and an application database owned by the aforementioned application user.
Reserving a PostgreSQL instance to a single microservice owned database, enhances:
- resource management: in PostgreSQL, CPU, and memory constrained resources are generally handled at the instance level, not the database level, making it easier to integrate it with Kubernetes resource management policies at the pod level
- physical continuous backup and Point-In-Time-Recovery (PITR): given that PostgreSQL handles continuous backup and recovery at the instance level, having one database per instance simplifies PITR operations, differentiates retention policy management, and increases data protection of backups
- application updates: enable each application to decide their update policies without impacting other databases owned by different applications
- database updates: each application can decide which PostgreSQL version to use, and independently, when to upgrade to a different major version of PostgreSQL and at what conditions (e.g., cutover time)
Is there an upper limit in database size for not considering Kubernetes?
No, as Kubernetes is no different from virtual machines and bare metal as far as this is regarded. Practically, however, it depends on the available resources of your Kubernetes cluster. Our advice with very large databases (VLDB) is to consider a shared nothing architecture, where a Kubernetes worker node is dedicated to a single Postgres instance, with dedicated storage. We proved that this extreme architectural pattern works when we benchmarked running PostgreSQL on bare metal Kubernetes with local persistent volumes. A current limitation of CloudNativePG, which will be overcome in version 1.22, is the lack of support for tablespaces. Once tablespaces are available, horizontal partitioning can be easily implemented.
How can I specify a time zone in the PostgreSQL cluster?
PostgreSQL has an extensive support for time zones, as explained in the official documentation:
Although time zones can even be used at session, transaction and even as part
of a query in PostgreSQL, a very common way is to set them up globally. With
CloudNativePG you can configure the cluster level time zone in the
spec.postgresql.parameters
section as in the following example:
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
name: pg-italy
spec:
instances: 1
postgresql:
parameters:
timezone: "Europe/Rome"
storage:
size: 1Gi
The time zone can be verified with:
$ kubectl exec -ti pg-italy-1 -c postgres -- psql -x -c "SHOW timezone"
-[ RECORD 1 ]---------
TimeZone | Europe/Rome
What is the recommended architecture for best business continuity outcomes?
As covered in the "Architecture" section, the main recommendation is to adopt shared nothing architectures as much as possible, by leveraging the native capabilities and resources that Kubernetes provides in a single cluster, namely:
- availability zones: spread your instances across different availability zones in the same Kubernetes cluster
- worker nodes: as a consequence, make sure that your Postgres instances reside on different Kubernetes worker nodes
- storage: use dedicated storage for each worker node running Postgres
Use at least one standby, preferably at least two, so that you can configure synchronous replication in the cluster, introducing RPO=0 for high availability.
If you do not have availability zones - normally the case of on-premise installations - separate on worker nodes and storage.
Properly setup continuous backup on a local/regional object store.
The same architecture that is in a single Kubernetes cluster can be replicated in another Kubernetes cluster (normally in another geographical area or region) through the replica cluster feature, providing disaster recovery and high availability at global scale.
You can use the WAL archive in the primary object store to feed the replica in the other region, without having to provide a streaming connection, if the default maximum RPO of 5 minutes is enough for you.
How can instances be stopped or started?
Please look at "Fencing" or "Hibernation".
What are the global objects such as roles and databases that are automatically created by CloudNativePG?
The operator automatically creates a user for the application (by default
called app
) and a database for the application (by default called app
)
which is owned by the aforementioned user.
This way, the database is ready for a microservice adoption, as developers
can control migrations using the app
user, without requiring superuser
access.
Teams can then create another user for read-write operations through the
"Declarative role management" feature
and assign the required GRANT
to the tables.