Implemented Prometheus Rule for automated alerts
Signed-off-by: Itay Grudev <[email protected]>
itay-grudev committed Feb 22, 2024
1 parent 36ffc89 commit cbe499c
Showing 16 changed files with 539 additions and 18 deletions.
49 changes: 49 additions & 0 deletions charts/cluster/docs/runbooks/CNPGClusterHACritical.md
@@ -0,0 +1,49 @@
CNPGClusterHACritical
=====================

Meaning
-------

The `CNPGClusterHACritical` alert is triggered when the CloudNativePG cluster has no ready standby replicas.

This can happen during a normal failover or an automated minor version upgrade in a cluster with 2 or fewer
instances. The replaced instance may need some time to catch up with the cluster primary instance.

This alert will always fire if your cluster is configured to run with only 1 instance. In that case you may want to
silence it, for example with the `amtool` sketch below.
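
If Alertmanager is part of your monitoring stack, a silence can be added with `amtool`. This is only a sketch: the
Alertmanager URL, duration, and comment are placeholders.

```bash
amtool silence add alertname=CNPGClusterHACritical \
  --alertmanager.url=<alertmanager-url> \
  --comment="Single-instance cluster: no standby replicas by design" \
  --duration=720h
```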

Impact
------

Having no available replicas puts your cluster at severe risk if the primary instance fails. The primary instance is
still online and able to serve queries, although connections to the `-ro` endpoint will fail.

Diagnosis
---------

Use the [CloudNativePG Grafana Dashboard](https://grafana.com/grafana/dashboards/20417-cloudnativepg/).

Get the status of the CloudNativePG cluster instances:

```bash
kubectl get pods -A -l "cnpg.io/podRole=instance" -o wide
```

Check the logs of the affected CloudNativePG instances:

```bash
kubectl logs --namespace <namespace> pod/<instance-pod-name>
```

Check the CloudNativePG operator logs:

```bash
kubectl logs --namespace cnpg-system -l "app.kubernetes.io/name=cloudnative-pg"
```
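
For a consolidated view of the cluster (instances, replication status, certificates), the `cnpg` kubectl plugin can
also be used, assuming it is installed:

```bash
kubectl cnpg status <cluster-name> --namespace <namespace>
```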

Mitigation
----------

Refer to the [CloudNativePG Failure Modes](https://cloudnative-pg.io/documentation/current/failure_modes/)
and [CloudNativePG Troubleshooting](https://cloudnative-pg.io/documentation/current/troubleshooting/) documentation for
more information on how to troubleshoot and mitigate this issue.
51 changes: 51 additions & 0 deletions charts/cluster/docs/runbooks/CNPGClusterHAWarning.md
@@ -0,0 +1,51 @@
CNPGClusterHAWarning
====================

Meaning
-------

The `CNPGClusterHAWarning` alert is triggered when the number of ready standby replicas in the CloudNativePG cluster
falls below `2`.

This alert will always fire if your cluster is configured to run with fewer than `3` instances. In that case you may
want to silence it.

Impact
------

Having fewer than two available replicas puts your cluster at risk if another instance fails. The cluster is still able
to operate normally, although the `-ro` and `-r` endpoints operate at reduced capacity.

This can happen during a normal failover or an automated minor version upgrade. The replaced instance may need some
time to catch up with the cluster primary instance, which will trigger the alert if the operation takes more than 5 minutes.

At `0` available ready replicas, a `CNPGClusterHACritical` alert will be triggered.

Diagnosis
---------

Use the [CloudNativePG Grafana Dashboard](https://grafana.com/grafana/dashboards/20417-cloudnativepg/).

Get the status of the CloudNativePG cluster instances:

```bash
kubectl get pods -A -l "cnpg.io/podRole=instance" -o wide
```

Check the logs of the affected CloudNativePG instances:

```bash
kubectl logs --namespace <namespace> pod/<instance-pod-name>
```

Check the CloudNativePG operator logs:

```bash
kubectl logs --namespace cnpg-system -l "app.kubernetes.io/name=cloudnative-pg"
```
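
A quick way to see how many instances each cluster reports as ready is to list the CloudNativePG `Cluster` resources;
the ready/instances columns shown assume a recent operator version:

```bash
kubectl get clusters.postgresql.cnpg.io --all-namespaces
```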

Mitigation
----------

Refer to the [CloudNativePG Failure Modes](https://cloudnative-pg.io/documentation/current/failure_modes/)
and [CloudNativePG Troubleshooting](https://cloudnative-pg.io/documentation/current/troubleshooting/) documentation for
more information on how to troubleshoot and mitigate this issue.
24 changes: 24 additions & 0 deletions charts/cluster/docs/runbooks/CNPGClusterHighConnectionsCritical.md
@@ -0,0 +1,24 @@
CNPGClusterHighConnectionsCritical
==================================

Meaning
-------

This alert is triggered when the number of connections to the CNPG cluster instance exceeds 95% of its capacity.

Impact
------

At 100% capacity, the CNPG cluster instance will not be able to accept new connections. This will result in a service
disruption.

Diagnosis
---------

Use the [CloudNativePG Grafana Dashboard](https://grafana.com/grafana/dashboards/20417-cloudnativepg/).
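
To see the current connection usage directly on the primary, compare the number of backends with `max_connections`;
this sketch assumes the default `-rw` service naming used by CloudNativePG:

```bash
kubectl exec --namespace <namespace> --stdin --tty services/<cluster_name>-rw -- psql -c \
  "SELECT count(*) AS connections, current_setting('max_connections') AS max_connections FROM pg_stat_activity;"
```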

Mitigation
----------

* Increase the maximum number of connections by raising the `max_connections` PostgreSQL parameter.
* Use connection pooling by enabling PgBouncer to reduce the number of direct connections to the database.
24 changes: 24 additions & 0 deletions charts/cluster/docs/runbooks/CNPGClusterHighConnectionsWarning.md
@@ -0,0 +1,24 @@
CNPGClusterHighConnectionsWarning
=================================

Meaning
-------

This alert is triggered when the number of connections to the CNPG cluster instance exceeds 85% of its capacity.

Impact
------

At 100% capacity, the CNPG cluster instance will not be able to accept new connections. This will result in a service
disruption.

Diagnosis
---------

Use the [CloudNativePG Grafana Dashboard](https://grafana.com/grafana/dashboards/20417-cloudnativepg/).

Mitigation
----------

* Increase the maximum number of connections by raising the `max_connections` PostgreSQL parameter.
* Use connection pooling by enabling PgBouncer to reduce the number of direct connections to the database (see the sketch below).
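
If the database is deployed through this chart, PgBouncer can be enabled via the `pooler.enabled` value. This is a
hedged sketch that assumes the chart was installed from the `cnpg` Helm repository as `cnpg/cluster`; the release name
and namespace are placeholders:

```bash
helm upgrade <release-name> cnpg/cluster \
  --namespace <namespace> \
  --reuse-values \
  --set pooler.enabled=true
```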
31 changes: 31 additions & 0 deletions charts/cluster/docs/runbooks/CNPGClusterHighReplicationLag.md
@@ -0,0 +1,31 @@
CNPGClusterHighReplicationLag
=============================

Meaning
-------

This alert is triggered when the replication lag of the CNPG cluster is too high.

Impact
------

High replication lag can cause the cluster replicas to fall out of sync, and queries to the `-r` and `-ro` endpoints
may return stale data. In the event of a failover, there may be data loss for the time period of the lag.

Diagnosis
---------

Use the [CloudNativePG Grafana Dashboard](https://grafana.com/grafana/dashboards/20417-cloudnativepg/).

High replication lag can be caused by a number of factors, including:
* Network issues
* High load on the primary or the replicas
* Long-running queries
* Suboptimal PostgreSQL configuration, in particular a low `max_wal_senders` setting

You can inspect the current replication status directly on the primary:

```bash
kubectl exec --namespace <namespace> --stdin --tty services/<cluster_name>-rw -- psql -c "SELECT * from pg_stat_replication;"
```
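
If long-running queries are suspected, you can also list the longest active queries on the primary. This is only a
sketch using the same placeholder service name:

```bash
kubectl exec --namespace <namespace> --stdin --tty services/<cluster_name>-rw -- psql -c \
  "SELECT pid, state, now() - query_start AS duration, left(query, 60) AS query FROM pg_stat_activity WHERE state <> 'idle' ORDER BY duration DESC LIMIT 10;"
```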

Mitigation
----------
27 changes: 27 additions & 0 deletions charts/cluster/docs/runbooks/CNPGClusterInstancesOnSameNode.md
@@ -0,0 +1,27 @@
CNPGClusterInstancesOnSameNode
==============================

Meaning
-------

The `CNPGClusterInstancesOnSameNode` alert is raised when two or more database pods are scheduled on the same node.

Impact
------

A failure or scheduled downtime of a single node can then lead to service disruption and/or data loss.

Diagnosis
---------

Use the [CloudNativePG Grafana Dashboard](https://grafana.com/grafana/dashboards/20417-cloudnativepg/).

```bash
kubectl get pods -A -l "cnpg.io/podRole=instance" -o wide
```

Mitigation
----------

1. Verify that you have more than one node without taints that would prevent pods from being scheduled there (the command below lists node taints).
2. Verify your [affinity](https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/) configuration.
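
To quickly list the taints on each node (a generic `kubectl` sketch, not specific to CloudNativePG):

```bash
kubectl get nodes -o custom-columns='NAME:.metadata.name,TAINTS:.spec.taints[*].key'
```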
25 changes: 25 additions & 0 deletions charts/cluster/docs/runbooks/CNPGClusterLowDiskSpaceCritical.md
@@ -0,0 +1,25 @@
CNPGClusterLowDiskSpaceCritical
===============================

Meaning
-------

This alert is triggered when the disk space on the CNPG cluster is running low. It can be raised for any of the following volumes:

* Data PVC
* WAL PVC
* Tablespace PVC

Impact
------

Excessive disk space usage can lead to fragmentation, negatively impacting performance. Reaching 100% disk usage will
result in downtime and data loss.

Diagnosis
---------

Use the [CloudNativePG Grafana Dashboard](https://grafana.com/grafana/dashboards/20417-cloudnativepg/).
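
To check the actual usage inside an affected instance pod, run `df` against the mounted volumes; the mount path below
assumes the default CloudNativePG data volume location:

```bash
kubectl exec --namespace <namespace> <instance-pod-name> -- df -h /var/lib/postgresql/data
```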

Mitigation
----------
25 changes: 25 additions & 0 deletions charts/cluster/docs/runbooks/CNPGClusterLowDiskSpaceWarning.md
@@ -0,0 +1,25 @@
CNPGClusterLowDiskSpaceWarning
==============================

Meaning
-------

This alert is triggered when the disk space on the CNPG cluster is running low. It can be raised for any of the following volumes:

* Data PVC
* WAL PVC
* Tablespace PVC

Impact
------

Excessive disk space usage can lead to fragmentation, negatively impacting performance. Reaching 100% disk usage will
result in downtime and data loss.

Diagnosis
---------

Use the [CloudNativePG Grafana Dashboard](https://grafana.com/grafana/dashboards/20417-cloudnativepg/).

Mitigation
----------
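
If your storage class supports online volume expansion, one possible mitigation is to grow the cluster storage through
the chart's `cluster.storage.size` value. This is a hedged sketch assuming the chart was installed from the `cnpg` Helm
repository as `cnpg/cluster`; the release name, namespace, and size are placeholders:

```bash
helm upgrade <release-name> cnpg/cluster \
  --namespace <namespace> \
  --reuse-values \
  --set cluster.storage.size=<new-size>   # e.g. 20Gi
```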
43 changes: 43 additions & 0 deletions charts/cluster/docs/runbooks/CNPGClusterOffline.md
@@ -0,0 +1,43 @@
CNPGClusterOffline
==================

Meaning
-------

The `CNPGClusterOffline` alert is triggered when there are no ready CloudNativePG instances.

Impact
------

Having an offline cluster means your applications will not be able to access the database, leading to potential service
disruption.

Diagnosis
---------

Use the [CloudNativePG Grafana Dashboard](https://grafana.com/grafana/dashboards/20417-cloudnativepg/).

Get the status of the CloudNativePG cluster instances:

```bash
kubectl get pods -A -l "cnpg.io/podRole=instance" -o wide
```

Check the logs of the affected CloudNativePG instances:

```bash
kubectl logs --namespace <namespace> pod/<instance-pod-name>
```

Check the CloudNativePG operator logs:

```bash
kubectl logs --namespace cnpg-system -l "app.kubernetes.io/name=cloudnative-pg"
```
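
Recent events in the cluster namespace often point at the root cause (failed scheduling, storage problems, crash
loops). A generic check, sorted by time:

```bash
kubectl get events --namespace <namespace> --sort-by=.lastTimestamp
```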

Mitigation
----------

Refer to the [CloudNativePG Failure Modes](https://cloudnative-pg.io/documentation/current/failure_modes/)
and [CloudNativePG Troubleshooting](https://cloudnative-pg.io/documentation/current/troubleshooting/) documentation for
more information on how to troubleshoot and mitigate this issue.
37 changes: 37 additions & 0 deletions charts/cluster/docs/runbooks/CNPGClusterZoneSpreadWarning.md
@@ -0,0 +1,37 @@
CNPGClusterZoneSpreadWarning
============================

Meaning
-------

The `CNPGClusterZoneSpreadWarning` alert is raised when pods are not evenly distributed across availability zones. To be
more accurate, the alert is raised when the number of availability zones in use is lower than the number of pods and
lower than `3` (i.e. `pods > zones < 3`).

Impact
------

The uneven distribution of pods across availability zones can lead to a single point of failure if a zone goes down.

Diagnosis
---------

Use the [CloudNativePG Grafana Dashboard](https://grafana.com/grafana/dashboards/20417-cloudnativepg/).

Get the status of the CloudNativePG cluster instances:

```bash
kubectl get pods -A -l "cnpg.io/podRole=instance" -o wide
```

Get the nodes and their respective zones:

```bash
kubectl get nodes --label-columns topology.kubernetes.io/zone
```

Mitigation
----------

1. Verify that you have schedulable nodes in more than one availability zone, i.e. nodes without taints that would prevent the pods from being scheduled there.
2. Verify your [affinity](https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/) configuration.
3. Delete the pods, and their respective PVCs, that are not in the desired availability zone and allow the operator to repair the cluster (a sketch is shown below).
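
A minimal sketch of step 3, assuming the instance to move is `<cluster-name>-2`. Depending on your configuration there
may also be a separate WAL PVC to remove, and the re-created instance is scheduled according to your affinity rules:

```bash
# The PVC deletion stays pending until the pod that uses it is gone.
kubectl delete pvc <cluster-name>-2 --namespace <namespace>
kubectl delete pod <cluster-name>-2 --namespace <namespace>
```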
3 changes: 2 additions & 1 deletion charts/cluster/examples/custom-queries.yaml
@@ -4,6 +4,7 @@ mode: standalone
cluster:
instances: 1
monitoring:
enabled: true
customQueries:
- name: "pg_cache_hit"
query: |
@@ -20,4 +21,4 @@ cluster:
description: "Cache hit ratio"

backups:
enabled: false
enabled: false
29 changes: 15 additions & 14 deletions charts/cluster/templates/NOTES.txt
@@ -42,20 +42,21 @@ Configuration
{{ $scheduledBackups = printf "%s, %s" $scheduledBackups .name }}
{{- end -}}

╭───────────────────┬────────────────────────────────────────────╮
│ Configuration │ Value │
┝━━━━━━━━━━━━━━━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┥
│ Cluster mode │ {{ (printf "%-42s" .Values.mode) }} │
│ Type │ {{ (printf "%-42s" .Values.type) }} │
│ Image │ {{ include "cluster.color-info" (printf "%-42s" (include "cluster.imageName" .)) }} │
│ Instances │ {{ include (printf "%s%s" "cluster.color-" $redundancyColor) (printf "%-42s" (toString .Values.cluster.instances)) }} │
│ Backups │ {{ include (printf "%s%s" "cluster.color-" (ternary "ok" "error" .Values.backups.enabled)) (printf "%-42s" (ternary "Enabled" "Disabled" .Values.backups.enabled)) }} │
│ Backup Provider │ {{ (printf "%-42s" (title .Values.backups.provider)) }} │
│ Scheduled Backups │ {{ (printf "%-42s" $scheduledBackups) }} │
│ Storage │ {{ (printf "%-42s" .Values.cluster.storage.size) }} │
│ Storage Class │ {{ (printf "%-42s" (default "Default" .Values.cluster.storage.storageClass)) }} │
│ PGBouncer │ {{ (printf "%-42s" (ternary "Enabled" "Disabled" .Values.pooler.enabled)) }} │
╰───────────────────┴────────────────────────────────────────────╯
╭───────────────────┬────────────────────────────────────────────────────────╮
│ Configuration │ Value │
┝━━━━━━━━━━━━━━━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┥
│ Cluster mode │ {{ (printf "%-54s" .Values.mode) }} │
│ Type │ {{ (printf "%-54s" .Values.type) }} │
│ Image │ {{ include "cluster.color-info" (printf "%-54s" (include "cluster.imageName" .)) }} │
│ Instances │ {{ include (printf "%s%s" "cluster.color-" $redundancyColor) (printf "%-54s" (toString .Values.cluster.instances)) }} │
│ Backups │ {{ include (printf "%s%s" "cluster.color-" (ternary "ok" "error" .Values.backups.enabled)) (printf "%-54s" (ternary "Enabled" "Disabled" .Values.backups.enabled)) }} │
│ Backup Provider │ {{ (printf "%-54s" (title .Values.backups.provider)) }} │
│ Scheduled Backups │ {{ (printf "%-54s" $scheduledBackups) }} │
│ Storage │ {{ (printf "%-54s" .Values.cluster.storage.size) }} │
│ Storage Class │ {{ (printf "%-54s" (default "Default" .Values.cluster.storage.storageClass)) }} │
│ PGBouncer │ {{ (printf "%-54s" (ternary "Enabled" "Disabled" .Values.pooler.enabled)) }} │
│ Monitoring │ {{ include (printf "%s%s" "cluster.color-" (ternary "ok" "error" .Values.cluster.monitoring.enabled)) (printf "%-54s" (ternary "Enabled" "Disabled" .Values.cluster.monitoring.enabled)) }} │
╰───────────────────┴────────────────────────────────────────────────────────╯

{{ if not .Values.backups.enabled }}
{{- include "cluster.color-error" "Warning! Backups not enabled. Recovery will not be possible! Do not use this configuration in production.\n" }}
