This test is designed to answer a few practical questions:
- Can a single-node CNPG cluster recover from a primary failure by relying on redundant volume replicas?
- Is a 3-node CNPG cluster over-engineering, or is it a justified precaution?
- Can CNPG guarantee consistency and recoverability after a failure with no PostgreSQL replicas, using only volume replicas?
- Which technology, CSI volume replication or PostgreSQL streaming replication, allows a CNPG primary to recover faster?
To run inside the Kluster-Pi, this test requires:
- the latest version of the chosen CSI
- the latest version of CNPG operator
- the latest version of MetalLB
- the storage class for the PostgreSQL-replication test to have volume replication disabled
- the storage class for the CSI-replication test to have volume replication enabled
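Assuming Longhorn is the chosen CSI (as the manifests below suggest), the two storage classes could look like the following sketch; the class names and parameter values are illustrative assumptions, not necessarily the ones used in this repository.

```yaml
# Illustrative Longhorn storage classes (names and values are assumptions).
# Volume replication disabled: a single Longhorn replica per volume,
# intended for the PostgreSQL streaming-replication test.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: longhorn-single-replica
provisioner: driver.longhorn.io
parameters:
  numberOfReplicas: "1"
  staleReplicaTimeout: "2880"
---
# Volume replication enabled: three Longhorn replicas per volume,
# intended for the CSI volume-replication test.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: longhorn-three-replicas
provisioner: driver.longhorn.io
parameters:
  numberOfReplicas: "3"
  staleReplicaTimeout: "2880"
```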
The YAML manifests in this directory define the CNPG clusters used by the test.
The cluster_pg-based-replicas.yaml
manifest, as the name suggests, creates a CNPG cluster with 3 nodes:
one primary and two replicas in streaming replication.
The storage class used by this cluster does not make use of volume replication.
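As a minimal sketch only, assuming the CNPG v1 API, such a manifest has this shape; the actual name, size, and storage class in cluster_pg-based-replicas.yaml may differ:

```yaml
# Sketch of a 3-instance CNPG cluster (values are assumptions).
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: cluster-pg-based-replicas
spec:
  instances: 3                              # one primary + two streaming replicas
  storage:
    storageClass: longhorn-single-replica   # no volume replication
    size: 2Gi
```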
The cluster_longhorn-based-replicas.yaml
manifest, as the name suggests, creates a CNPG cluster with a single primary node.
The storage class used by this cluster makes use of volume replication across the other nodes through Longhorn.
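Again as an illustrative sketch, the single-node variant differs mainly in the instance count and the storage class it references:

```yaml
# Sketch of a single-instance CNPG cluster backed by replicated volumes
# (values are assumptions).
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: cluster-longhorn-based-replicas
spec:
  instances: 1                              # single primary, no PostgreSQL replicas
  storage:
    storageClass: longhorn-three-replicas   # Longhorn keeps 3 volume replicas
    size: 2Gi
```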
The resiliency-test.sh
script in this directory is designed to run against a single CNPG cluster at a time.
It simulates a failure of the PostgreSQL primary by deleting the CNPG primary pod and cordoning the k3s node that hosted it.
This forces the CNPG operator either to recreate the primary on a different k3s node, in the case of the single-node CNPG cluster, or to
promote the most advanced replica, in the case of the 3-node CNPG cluster.
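The failure injection boils down to a couple of kubectl calls; the following is a rough sketch of the idea, not the exact code in resiliency-test.sh:

```bash
#!/usr/bin/env bash
# Sketch of the failure injection (not the exact resiliency-test.sh code).
set -euo pipefail

CLUSTER_NAME="$1"

# Find the current primary pod from the CNPG Cluster status.
PRIMARY_POD=$(kubectl get cluster "$CLUSTER_NAME" \
  -o jsonpath='{.status.currentPrimary}')

# Find the k3s node hosting it, cordon the node, then kill the pod.
NODE=$(kubectl get pod "$PRIMARY_POD" -o jsonpath='{.spec.nodeName}')
kubectl cordon "$NODE"
kubectl delete pod "$PRIMARY_POD" --wait=false
```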
To measure the downtime, the test constantly writes a timestamp into a table of the target database. The reported downtime is the difference between the last timestamp written before the failure and the first timestamp written after the primary resumes accepting writes.
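A write loop of this kind is enough to capture those two timestamps; it is an illustrative sketch, and the table name and connection details are assumptions:

```bash
# Sketch of the downtime probe (table name and credentials are assumptions).
LB_IP="$2"   # load balancer IP passed to the script

# Keep inserting the current server timestamp; inserts that fail during the
# outage are ignored, so the gap in the table marks the downtime window.
while true; do
  psql "host=${LB_IP} user=app dbname=app" \
    -c "INSERT INTO downtime_probe (ts) VALUES (now());" || true
  sleep 0.5
done
```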
To run the test:
- Apply one of the cluster manifests:
kubectl apply -f <cluster_*.yaml>
- Gather the load balancer IP:
kubectl get svc -o custom-columns=:.metadata.name,:.status.loadBalancer.ingress[].ip | grep db-lb
- Get the cluster name:
kubectl get cluster
- Run the script:
bash resiliency-test.sh <cluster-name> <load-balancer-IP>
- Compare the results:
head ./downtime*
Who's gonna win?