Implemented Prometheus Rule for automated alerts
Signed-off-by: Itay Grudev <[email protected]>
itay-grudev committed Feb 22, 2024
1 parent 36ffc89 commit 1902828
Showing 16 changed files with 539 additions and 18 deletions.
49 changes: 49 additions & 0 deletions charts/cluster/docs/runbooks/CNPGClusterHACritical.md
@@ -0,0 +1,49 @@
CNPGClusterHACritical
=====================

Meaning
-------

The `CNPGClusterHACritical` alert is triggered when the CloudNativePG cluster has no ready standby replicas.

This can happen during a normal failover or automated minor version upgrades in a cluster with two or fewer
instances. The replaced instance may need some time to catch up with the cluster primary instance.

This alert will always trigger if your cluster is configured to run with only one instance. In this case you
may want to silence it.
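
As a sketch, a silence can be created with Alertmanager's `amtool` (the URL, duration, and comment are assumptions to adapt to your environment):

```bash
amtool silence add alertname=CNPGClusterHACritical \
  --alertmanager.url=http://alertmanager.monitoring:9093 \
  --comment="Single-instance cluster: no standby replicas expected" \
  --duration=720h
```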

Impact
------

Having no available replicas puts your cluster at severe risk if the primary instance fails. The primary instance is
still online and able to serve queries, although connections to the `-ro` endpoint will fail.

Diagnosis
---------

Use the [CloudNativePG Grafana Dashboard](https://grafana.com/grafana/dashboards/20417-cloudnativepg/).

Get the status of the CloudNativePG cluster instances:

```bash
kubectl get pods -A -l "cnpg.io/podRole=instance" -o wide
```
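
If the [cnpg kubectl plugin](https://cloudnative-pg.io/documentation/current/kubectl-plugin/) is installed, it gives a more detailed overview (names are placeholders):

```bash
kubectl cnpg status <cluster_name> --namespace <namespace>
```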

Check the logs of the affected CloudNativePG instances:

```bash
kubectl logs --namespace <namespace> pod/<instance-pod-name>
```

Check the CloudNativePG operator logs:

```bash
kubectl logs --namespace cnpg-system -l "app.kubernetes.io/name=cloudnative-pg"
```

Mitigation
----------

Refer to the [CloudNativePG Failure Modes](https://cloudnative-pg.io/documentation/current/failure_modes/)
and [CloudNativePG Troubleshooting](https://cloudnative-pg.io/documentation/current/troubleshooting/) documentation for
more information on how to troubleshoot and mitigate this issue.
51 changes: 51 additions & 0 deletions charts/cluster/docs/runbooks/CNPGClusterHAWarning.md
@@ -0,0 +1,51 @@
CNPGClusterHAWarning
====================

Meaning
-------

The `CNPGClusterHAWarning` alert is triggered when the number of ready standby replicas in the CloudNativePG cluster drops below `2`.

This alert will always be triggered if your cluster is configured to run with fewer than `3` instances. In this case you
may want to silence it.

Impact
------

Having less than two available replicas puts your cluster at risk if another instance fails. The cluster is still able
to operate normally, although the `-ro` and `-r` endpoints operate at reduced capacity.

This can happen during a normal failover or automated minor version upgrades. The replaced instance may need some time
to catch up with the cluster primary instance, which will trigger the alert if the operation takes more than 5 minutes.

At `0` available ready replicas, a `CNPGClusterHACritical` alert will be triggered.

Diagnosis
---------

Use the [CloudNativePG Grafana Dashboard](https://grafana.com/grafana/dashboards/20417-cloudnativepg/).

Get the status of the CloudNativePG cluster instances:

```bash
kubectl get pods -A -l "cnpg.io/podRole=instance" -o wide
```

Check the logs of the affected CloudNativePG instances:

```bash
kubectl logs --namespace <namespace> pod/<instance-pod-name>
```

Check the CloudNativePG operator logs:

```bash
kubectl logs --namespace cnpg-system -l "app.kubernetes.io/name=cloudnative-pg"
```

Mitigation
----------

Refer to the [CloudNativePG Failure Modes](https://cloudnative-pg.io/documentation/current/failure_modes/)
and [CloudNativePG Troubleshooting](https://cloudnative-pg.io/documentation/current/troubleshooting/) documentation for
more information on how to troubleshoot and mitigate this issue.
24 changes: 24 additions & 0 deletions charts/cluster/docs/runbooks/CNPGClusterHighConnectionsCritical.md
@@ -0,0 +1,24 @@
CNPGClusterHighConnectionsCritical
==================================

Meaning
-------

This alert is triggered when the number of connections to the CNPG cluster instance exceeds 95% of its capacity.

Impact
------

At 100% capacity, the CNPG cluster instance will not be able to accept new connections. This will result in a service
disruption.

Diagnosis
---------

Use the [CloudNativePG Grafana Dashboard](https://grafana.com/grafana/dashboards/20417-cloudnativepg/).

Mitigation
----------

* Increase the maximum number of connections by raising the `max_connections` PostgreSQL parameter.
* Use connection pooling by enabling PgBouncer to reduce the number of connections to the database.
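
Both mitigations can be applied through this chart; a sketch, assuming a release named `my-cluster` installed from a `cloudnative-pg/cluster` chart alias (`max_connections` appears as a commented example under `cluster.postgresql` in `values.yaml`):

```bash
helm upgrade my-cluster cloudnative-pg/cluster \
  --reuse-values \
  --set cluster.postgresql.max_connections=300 \
  --set pooler.enabled=true
```

Note that raising `max_connections` increases PostgreSQL memory usage, so prefer pooling where possible.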
24 changes: 24 additions & 0 deletions charts/cluster/docs/runbooks/CNPGClusterHighConnectionsWarning.md
@@ -0,0 +1,24 @@
CNPGClusterHighConnectionsWarning
=================================

Meaning
-------

This alert is triggered when the number of connections to the CNPG cluster instance exceeds 80% of its capacity.

Impact
------

At 100% capacity, the CNPG cluster instance will not be able to accept new connections. This will result in a service
disruption.

Diagnosis
---------

Use the [CloudNativePG Grafana Dashboard](https://grafana.com/grafana/dashboards/20417-cloudnativepg/).
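
Current usage can also be checked directly; a sketch, following the same placeholder conventions as the other runbooks:

```bash
kubectl exec --namespace <namespace> --stdin --tty services/<cluster_name>-rw -- \
  psql -c "SELECT count(*) AS connections, current_setting('max_connections') AS max_connections FROM pg_stat_activity;"
```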

Mitigation
----------

* Increase the maximum number of connections by raising the `max_connections` PostgreSQL parameter.
* Use connection pooling by enabling PgBouncer to reduce the number of connections to the database.
31 changes: 31 additions & 0 deletions charts/cluster/docs/runbooks/CNPGClusterHighReplicationLag.md
@@ -0,0 +1,31 @@
CNPGClusterHighReplicationLag
=============================

Meaning
-------

This alert is triggered when the replication lag of the CNPG cluster is too high.

Impact
------

High replication lag can cause the cluster replicas to fall out of sync. Queries to the `-r` and `-ro` endpoints may
return stale data. In the event of a failover, data committed within the lag window may be lost.

Diagnosis
---------

Use the [CloudNativePG Grafana Dashboard](https://grafana.com/grafana/dashboards/20417-cloudnativepg/).

High replication lag can be caused by a number of factors, including:

* Network issues
* High load on the primary or replicas
* Long-running queries
* Suboptimal PostgreSQL configuration, in particular a low `max_wal_senders` setting.

Check the replication status from the primary:

```bash
kubectl exec --namespace <namespace> --stdin --tty services/<cluster_name>-rw -- psql -c "SELECT * FROM pg_stat_replication;"
```
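
To look for long-running queries, one of the causes listed above, a sketch (same placeholders as above):

```bash
kubectl exec --namespace <namespace> --stdin --tty services/<cluster_name>-rw -- \
  psql -c "SELECT pid, now() - query_start AS duration, state, query FROM pg_stat_activity WHERE state <> 'idle' ORDER BY duration DESC LIMIT 5;"
```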

Mitigation
----------
27 changes: 27 additions & 0 deletions charts/cluster/docs/runbooks/CNPGClusterInstancesOnSameNode.md
@@ -0,0 +1,27 @@
CNPGClusterInstancesOnSameNode
==============================

Meaning
-------

The `CNPGClusterInstancesOnSameNode` alert is raised when two or more database pods are scheduled on the same node.

Impact
------

A failure or scheduled downtime of a single node will lead to a potential service disruption and/or data loss.

Diagnosis
---------

Use the [CloudNativePG Grafana Dashboard](https://grafana.com/grafana/dashboards/20417-cloudnativepg/).

Get the status of the cluster pods and the nodes they are scheduled on:

```bash
kubectl get pods -A -l "cnpg.io/podRole=instance" -o wide
```

Mitigation
----------

1. Verify that you have more than one node without taints that would prevent pods from being scheduled there.
2. Verify your [affinity](https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/) configuration.
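
As a sketch, anti-affinity can be tightened through the chart's `cluster.affinity` values, assuming they map to the CloudNativePG `.spec.affinity` API (`required` anti-affinity prevents co-scheduling outright):

```bash
helm upgrade my-cluster cloudnative-pg/cluster \
  --reuse-values \
  --set cluster.affinity.topologyKey=kubernetes.io/hostname \
  --set cluster.affinity.podAntiAffinityType=required
```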
25 changes: 25 additions & 0 deletions charts/cluster/docs/runbooks/CNPGClusterLowDiskSpaceCritical.md
@@ -0,0 +1,25 @@
CNPGClusterLowDiskSpaceCritical
===============================

Meaning
-------

This alert is triggered when disk space on the CNPG cluster is running low. Any of the following volumes can trigger it:

* Data PVC
* WAL PVC
* Tablespace PVC

Impact
------

Excessive disk usage can lead to fragmentation, negatively impacting performance. Reaching 100% disk usage will result
in downtime and data loss.

Diagnosis
---------

Use the [CloudNativePG Grafana Dashboard](https://grafana.com/grafana/dashboards/20417-cloudnativepg/).
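
To inspect actual usage on an instance, a sketch (pod name is a placeholder; the data volume is assumed to be mounted at `/var/lib/postgresql/data`):

```bash
kubectl exec --namespace <namespace> <instance-pod-name> -- df -h /var/lib/postgresql/data
```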

Mitigation
----------
25 changes: 25 additions & 0 deletions charts/cluster/docs/runbooks/CNPGClusterLowDiskSpaceWarning.md
@@ -0,0 +1,25 @@
CNPGClusterLowDiskSpaceWarning
==============================

Meaning
-------

This alert is triggered when disk space on the CNPG cluster is running low. Any of the following volumes can trigger it:

* Data PVC
* WAL PVC
* Tablespace PVC

Impact
------

Excessive disk usage can lead to fragmentation, negatively impacting performance. Reaching 100% disk usage will result
in downtime and data loss.

Diagnosis
---------

Use the [CloudNativePG Grafana Dashboard](https://grafana.com/grafana/dashboards/20417-cloudnativepg/).
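
To identify the affected PVCs, a sketch (the `cnpg.io/cluster` label is an assumption based on the labels CloudNativePG applies to its resources):

```bash
kubectl get pvc --namespace <namespace> -l cnpg.io/cluster=<cluster_name>
```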

Mitigation
----------
43 changes: 43 additions & 0 deletions charts/cluster/docs/runbooks/CNPGClusterOffline.md
@@ -0,0 +1,43 @@
CNPGClusterOffline
==================

Meaning
-------

The `CNPGClusterOffline` alert is triggered when there are no ready CloudNativePG instances.

Impact
------

Having an offline cluster means your applications will not be able to access the database, leading to potential service
disruption.

Diagnosis
---------

Use the [CloudNativePG Grafana Dashboard](https://grafana.com/grafana/dashboards/20417-cloudnativepg/).

Get the status of the CloudNativePG cluster instances:

```bash
kubectl get pods -A -l "cnpg.io/podRole=instance" -o wide
```

Check the logs of the affected CloudNativePG instances:

```bash
kubectl logs --namespace <namespace> pod/<instance-pod-name>
```

Check the CloudNativePG operator logs:

```bash
kubectl logs --namespace cnpg-system -l "app.kubernetes.io/name=cloudnative-pg"
```
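
Check the status reported by the `Cluster` resource itself (a sketch; names are placeholders, and it assumes the CNPG `Cluster` CRD resolves unambiguously):

```bash
kubectl describe cluster --namespace <namespace> <cluster_name>
```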

Mitigation
----------

Refer to the [CloudNativePG Failure Modes](https://cloudnative-pg.io/documentation/current/failure_modes/)
and [CloudNativePG Troubleshooting](https://cloudnative-pg.io/documentation/current/troubleshooting/) documentation for
more information on how to troubleshoot and mitigate this issue.
37 changes: 37 additions & 0 deletions charts/cluster/docs/runbooks/CNPGClusterZoneSpreadWarning.md
@@ -0,0 +1,37 @@
CNPGClusterZoneSpreadWarning
============================

Meaning
-------

The `CNPGClusterZoneSpreadWarning` alert is raised when pods are not evenly distributed across availability zones. More
precisely, it is raised when the cluster's pods span fewer availability zones than there are instances, and fewer than `3`.

Impact
------

The uneven distribution of pods across availability zones can lead to a single point of failure if a zone goes down.

Diagnosis
---------

Use the [CloudNativePG Grafana Dashboard](https://grafana.com/grafana/dashboards/20417-cloudnativepg/).

Get the status of the CloudNativePG cluster instances:

```bash
kubectl get pods -A -l "cnpg.io/podRole=instance" -o wide
```

Get the nodes and their respective zones:

```bash
kubectl get nodes --label-columns topology.kubernetes.io/zone
```

Mitigation
----------

1. Verify that each availability zone has at least one schedulable node without taints that would prevent pods from being scheduled there.
2. Verify your [affinity](https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/) configuration.
3. Delete the pods and their respective PVCs that are not in the desired availability zone and allow the operator to repair the cluster, as sketched below.
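
A sketch of step 3, assuming instance `<cluster_name>-2` is the misplaced one (the operator will recreate the instance according to the affinity rules):

```bash
kubectl delete pvc/<cluster_name>-2 pod/<cluster_name>-2 --namespace <namespace>
```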
3 changes: 2 additions & 1 deletion charts/cluster/examples/custom-queries.yaml
@@ -4,6 +4,7 @@ mode: standalone
cluster:
instances: 1
monitoring:
enabled: true
customQueries:
- name: "pg_cache_hit"
query: |
@@ -20,4 +21,4 @@ cluster:
description: "Cache hit ratio"

backups:
enabled: false
enabled: false
29 changes: 15 additions & 14 deletions charts/cluster/templates/NOTES.txt
@@ -42,20 +42,21 @@ Configuration
{{ $scheduledBackups = printf "%s, %s" $scheduledBackups .name }}
{{- end -}}

╭───────────────────┬────────────────────────────────────────────╮
│ Configuration │ Value │
┝━━━━━━━━━━━━━━━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┥
│ Cluster mode │ {{ (printf "%-42s" .Values.mode) }} │
│ Type │ {{ (printf "%-42s" .Values.type) }} │
│ Image │ {{ include "cluster.color-info" (printf "%-42s" (include "cluster.imageName" .)) }} │
│ Instances │ {{ include (printf "%s%s" "cluster.color-" $redundancyColor) (printf "%-42s" (toString .Values.cluster.instances)) }} │
│ Backups │ {{ include (printf "%s%s" "cluster.color-" (ternary "ok" "error" .Values.backups.enabled)) (printf "%-42s" (ternary "Enabled" "Disabled" .Values.backups.enabled)) }} │
│ Backup Provider │ {{ (printf "%-42s" (title .Values.backups.provider)) }} │
│ Scheduled Backups │ {{ (printf "%-42s" $scheduledBackups) }} │
│ Storage │ {{ (printf "%-42s" .Values.cluster.storage.size) }} │
│ Storage Class │ {{ (printf "%-42s" (default "Default" .Values.cluster.storage.storageClass)) }} │
│ PGBouncer │ {{ (printf "%-42s" (ternary "Enabled" "Disabled" .Values.pooler.enabled)) }} │
╰───────────────────┴────────────────────────────────────────────╯
╭───────────────────┬────────────────────────────────────────────────────────╮
│ Configuration │ Value │
┝━━━━━━━━━━━━━━━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┥
│ Cluster mode │ {{ (printf "%-54s" .Values.mode) }} │
│ Type │ {{ (printf "%-54s" .Values.type) }} │
│ Image │ {{ include "cluster.color-info" (printf "%-54s" (include "cluster.imageName" .)) }} │
│ Instances │ {{ include (printf "%s%s" "cluster.color-" $redundancyColor) (printf "%-54s" (toString .Values.cluster.instances)) }} │
│ Backups │ {{ include (printf "%s%s" "cluster.color-" (ternary "ok" "error" .Values.backups.enabled)) (printf "%-54s" (ternary "Enabled" "Disabled" .Values.backups.enabled)) }} │
│ Backup Provider │ {{ (printf "%-54s" (title .Values.backups.provider)) }} │
│ Scheduled Backups │ {{ (printf "%-54s" $scheduledBackups) }} │
│ Storage │ {{ (printf "%-54s" .Values.cluster.storage.size) }} │
│ Storage Class │ {{ (printf "%-54s" (default "Default" .Values.cluster.storage.storageClass)) }} │
│ PGBouncer │ {{ (printf "%-54s" (ternary "Enabled" "Disabled" .Values.pooler.enabled)) }} │
│ Monitoring │ {{ include (printf "%s%s" "cluster.color-" (ternary "ok" "error" .Values.cluster.monitoring.enabled)) (printf "%-54s" (ternary "Enabled" "Disabled" .Values.cluster.monitoring.enabled)) }} │
╰───────────────────┴────────────────────────────────────────────────────────╯

{{ if not .Values.backups.enabled }}
{{- include "cluster.color-error" "Warning! Backups not enabled. Recovery will not be possible! Do not use this configuration in production.\n" }}
1 change: 1 addition & 0 deletions charts/cluster/templates/_helpers.tpl
@@ -48,6 +48,7 @@ Selector labels
{{- define "cluster.selectorLabels" -}}
app.kubernetes.io/name: {{ include "cluster.name" . }}
app.kubernetes.io/instance: {{ .Release.Name }}
app.kubernetes.io/part-of: cloudnative-pg
{{- end }}

{{/*
2 changes: 1 addition & 1 deletion charts/cluster/templates/cluster.yaml
@@ -54,7 +54,7 @@ spec:
{{ end }}

monitoring:
enablePodMonitor: {{ .Values.cluster.monitoring.enablePodMonitor }}
enablePodMonitor: {{ and .Values.cluster.monitoring.enabled .Values.cluster.monitoring.podMonitor.enabled }}
{{- if not (empty .Values.cluster.monitoring.customQueries) }}
customQueriesConfigMap:
- name: {{ include "cluster.fullname" . }}-monitoring
177 changes: 177 additions & 0 deletions charts/cluster/templates/prometheus-rule.yaml
@@ -0,0 +1,177 @@
{{- if and .Values.cluster.monitoring.enabled .Values.cluster.monitoring.prometheusRule.enabled -}}
{{- $value := "{{ $value }}" -}}
{{- $namespace := .Release.Namespace -}}
{{- $cluster := printf "%s/%s" $namespace (include "cluster.fullname" .)}}
{{- $labels := dict "job" "{{ $labels.job }}" "node" "{{ $labels.node }}" "pod" "{{ $labels.pod }}" -}}
{{- $podSelector := printf "%s-([1-9][0-9]*)$" (include "cluster.fullname" .) -}}
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
labels:
{{- include "cluster.labels" . | nindent 4 }}
{{- with .Values.cluster.additionalLabels }}
{{ toYaml . | nindent 4 }}
{{- end }}
name: {{ include "cluster.fullname" . }}-alert-rules
spec:
groups:
- name: cloudnative-pg/{{ include "cluster.fullname" . }}
rules:
- alert: CNPGClusterHAWarning
annotations:
summary: CNPG Cluster less than 2 standby replicas.
description: |-
CloudNativePG Cluster "{{ $labels.job }}" has only {{ $value }} standby replicas, putting
your cluster at risk if another instance fails. The cluster is still able to operate normally, although
the `-ro` and `-r` endpoints operate at reduced capacity.
This can happen during a normal failover or automated minor version upgrades. The replaced instance may
need some time to catch up with the cluster primary instance.
This alert will always be triggered if your cluster is configured to run with fewer than 3 instances.
In this case you may want to silence it.
runbook_url: https://github.com/cloudnative-pg/charts/blob/main/charts/cluster/docs/runbooks/CNPGClusterHAWarning.md
expr: |
max by (job) (cnpg_pg_replication_streaming_replicas{namespace="{{ $namespace }}"} - cnpg_pg_replication_is_wal_receiver_up{namespace="{{ $namespace }}"}) < 2
for: 5m
labels:
severity: warning
- alert: CNPGClusterHACritical
annotations:
summary: CNPG Cluster has no standby replicas!
description: |-
CloudNativePG Cluster "{{ $labels.job }}" has no ready standby replicas. Your cluster is at severe
risk of data loss and downtime if the primary instance fails.
The primary instance is still online and able to serve queries, although connections to the `-ro` endpoint
will fail. The `-r` endpoint is operating at reduced capacity and all traffic is being served by the primary.
This can happen during a normal failover or automated minor version upgrades in a cluster with 2 or fewer
instances. The replaced instance may need some time to catch up with the cluster primary instance.
This alert will always trigger if your cluster is configured to run with only 1 instance. In this
case you may want to silence it.
runbook_url: https://github.com/cloudnative-pg/charts/blob/main/charts/cluster/docs/runbooks/CNPGClusterHACritical.md
expr: |
max by (job) (cnpg_pg_replication_streaming_replicas{namespace="{{ $namespace }}"} - cnpg_pg_replication_is_wal_receiver_up{namespace="{{ $namespace }}"}) < 1
for: 5m
labels:
severity: critical
- alert: CNPGClusterOffline
annotations:
summary: CNPG Cluster has no running instances!
description: |-
CloudNativePG Cluster "{{ $labels.job }}" has no ready instances.
Having an offline cluster means your applications will not be able to access the database, leading to
potential service disruption and/or data loss.
runbook_url: https://github.com/cloudnative-pg/charts/blob/main/charts/cluster/docs/runbooks/CNPGClusterOffline.md
expr: |
({{ .Values.cluster.instances }} - count(cnpg_collector_up{namespace=~"{{ $namespace }}",pod=~"{{ $podSelector }}"}) OR vector(0)) > 0
for: 5m
labels:
severity: critical
- alert: CNPGClusterZoneSpreadWarning
annotations:
summary: CNPG Cluster instances in the same zone.
description: |-
CloudNativePG Cluster "{{ $cluster }}" has instances in the same availability zone.
A disaster in one availability zone will lead to a potential service disruption and/or data loss.
runbook_url: https://github.com/cloudnative-pg/charts/blob/main/charts/cluster/docs/runbooks/CNPGClusterZoneSpreadWarning.md
expr: |
{{ .Values.cluster.instances }} > count(count by (label_topology_kubernetes_io_zone) (kube_pod_info{namespace=~"{{ $namespace }}", pod=~"{{ $podSelector }}"} * on(node,instance) group_left(label_topology_kubernetes_io_zone) kube_node_labels)) < 3
for: 5m
labels:
severity: warning
- alert: CNPGClusterInstancesOnSameNode
annotations:
summary: CNPG Cluster instances are located on the same node.
description: |-
CloudNativePG Cluster "{{ $cluster }}" has {{ $value }}
instances on the same node {{ $labels.node }}.
A failure or scheduled downtime of a single node will lead to a potential service disruption and/or data loss.
runbook_url: https://github.com/cloudnative-pg/charts/blob/main/charts/cluster/docs/runbooks/CNPGClusterInstancesOnSameNode.md
expr: |
count by (node) (kube_pod_info{namespace=~"{{ $namespace }}", pod=~"{{ $podSelector }}"}) > 1
for: 5m
labels:
severity: warning
- alert: CNPGClusterHighReplicationLag
annotations:
summary: CNPG Cluster high replication lag
description: |-
CloudNativePG Cluster "{{ $cluster }}" is experiencing a high replication lag of
{{ "{{ $value }}" }}ms.
High replication lag indicates network issues, busy instances, slow queries or suboptimal configuration.
runbook_url: https://github.com/cloudnative-pg/charts/blob/main/charts/cluster/docs/runbooks/CNPGClusterHighReplicationLag.md
expr: |
max(cnpg_pg_replication_lag{namespace=~"{{ $namespace }}",pod=~"{{ $podSelector }}"}) * 1000 > 1000
for: 5m
labels:
severity: warning
- alert: CNPGClusterHighConnectionsWarning
annotations:
summary: CNPG Instance is approaching the maximum number of connections.
description: |-
CloudNativePG Cluster "{{ $cluster }}" instance {{ $labels.pod }} is using {{ "{{ $value }}" }}% of
the maximum number of connections.
runbook_url: https://github.com/cloudnative-pg/charts/blob/main/charts/cluster/docs/runbooks/CNPGClusterHighConnectionsWarning.md
expr: |
sum by (pod) (cnpg_backends_total{namespace=~"{{ $namespace }}", pod=~"{{ $podSelector }}"}) / max by (pod) (cnpg_pg_settings_setting{name="max_connections", namespace=~"{{ $namespace }}", pod=~"{{ $podSelector }}"}) * 100 > 80
for: 5m
labels:
severity: warning
- alert: CNPGClusterHighConnectionsCritical
annotations:
summary: CNPG Instance maximum number of connections critical!
description: |-
CloudNativePG Cluster "{{ $cluster }}" instance {{ $labels.pod }} is using {{ "{{ $value }}" }}% of
the maximum number of connections.
runbook_url: https://github.com/cloudnative-pg/charts/blob/main/charts/cluster/docs/runbooks/CNPGClusterHighConnectionsCritical.md
expr: |
sum by (pod) (cnpg_backends_total{namespace=~"{{ $namespace }}", pod=~"{{ $podSelector }}"}) / max by (pod) (cnpg_pg_settings_setting{name="max_connections", namespace=~"{{ $namespace }}", pod=~"{{ $podSelector }}"}) * 100 > 95
for: 5m
labels:
severity: critical
- alert: CNPGClusterLowDiskSpaceWarning
annotations:
summary: CNPG Instance is running out of disk space.
description: |-
CloudNativePG Cluster "{{ $cluster }}" is running low on disk space. Check attached PVCs.
runbook_url: https://github.com/cloudnative-pg/charts/blob/main/charts/cluster/docs/runbooks/CNPGClusterLowDiskSpaceWarning.md
expr: |
max(max by(persistentvolumeclaim) (1 - kubelet_volume_stats_available_bytes{namespace="{{ $namespace }}", persistentvolumeclaim=~"{{ $podSelector }}"} / kubelet_volume_stats_capacity_bytes{namespace="{{ $namespace }}", persistentvolumeclaim=~"{{ $podSelector }}"})) > 0.7 OR
max(max by(persistentvolumeclaim) (1 - kubelet_volume_stats_available_bytes{namespace="{{ $namespace }}", persistentvolumeclaim=~"{{ $podSelector }}-wal"} / kubelet_volume_stats_capacity_bytes{namespace="{{ $namespace }}", persistentvolumeclaim=~"{{ $podSelector }}-wal"})) > 0.7 OR
max(sum by (namespace,persistentvolumeclaim) (kubelet_volume_stats_used_bytes{namespace="{{ $namespace }}", persistentvolumeclaim=~"{{ $podSelector }}-tbs.*"})
/
sum by (namespace,persistentvolumeclaim) (kubelet_volume_stats_capacity_bytes{namespace="{{ $namespace }}", persistentvolumeclaim=~"{{ $podSelector }}-tbs.*"})
*
on(namespace, persistentvolumeclaim) group_left(volume)
kube_pod_spec_volumes_persistentvolumeclaims_info{pod=~"{{ $podSelector }}"}
) > 0.7
for: 5m
labels:
severity: warning
- alert: CNPGClusterLowDiskSpaceCritical
annotations:
summary: CNPG Instance is running out of disk space!
description: |-
CloudNativePG Cluster "{{ $cluster }}" is running extremely low on disk space. Check attached PVCs!
runbook_url: https://github.com/cloudnative-pg/charts/blob/main/charts/cluster/docs/runbooks/CNPGClusterLowDiskSpaceCritical.md
expr: |
max(max by(persistentvolumeclaim) (1 - kubelet_volume_stats_available_bytes{namespace="{{ $namespace }}", persistentvolumeclaim=~"{{ $podSelector }}"} / kubelet_volume_stats_capacity_bytes{namespace="{{ $namespace }}", persistentvolumeclaim=~"{{ $podSelector }}"})) > 0.9 OR
max(max by(persistentvolumeclaim) (1 - kubelet_volume_stats_available_bytes{namespace="{{ $namespace }}", persistentvolumeclaim=~"{{ $podSelector }}-wal"} / kubelet_volume_stats_capacity_bytes{namespace="{{ $namespace }}", persistentvolumeclaim=~"{{ $podSelector }}-wal"})) > 0.9 OR
max(sum by (namespace,persistentvolumeclaim) (kubelet_volume_stats_used_bytes{namespace="{{ $namespace }}", persistentvolumeclaim=~"{{ $podSelector }}-tbs.*"})
/
sum by (namespace,persistentvolumeclaim) (kubelet_volume_stats_capacity_bytes{namespace="{{ $namespace }}", persistentvolumeclaim=~"{{ $podSelector }}-tbs.*"})
*
on(namespace, persistentvolumeclaim) group_left(volume)
kube_pod_spec_volumes_persistentvolumeclaims_info{pod=~"{{ $podSelector }}"}
) > 0.9
for: 5m
labels:
severity: critical
{{ end }}
9 changes: 7 additions & 2 deletions charts/cluster/values.yaml
@@ -132,7 +132,11 @@ cluster:
superuserSecret: ""

monitoring:
enablePodMonitor: false
enabled: false
podMonitor:
enabled: true
prometheusRule:
enabled: true
customQueries: []
# - name: "pg_cache_hit_ratio"
# query: "SELECT current_database() as datname, sum(heap_blks_hit) / (sum(heap_blks_hit) + sum(heap_blks_read)) as ratio FROM pg_statio_user_tables;"
@@ -146,7 +150,8 @@ cluster:

# -- Configuration of the PostgreSQL server
# See: https://cloudnative-pg.io/documentation/current/cloudnative-pg.v1/#postgresql-cnpg-io-v1-PostgresConfiguration
postgresql:
postgresql: {}
# max_connections: 300

# -- BootstrapInitDB is the configuration of the bootstrap process when initdb is used
# See: https://cloudnative-pg.io/documentation/current/bootstrap/
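
To turn the new alerts on for an existing release, a sketch (release name and repo alias are assumptions; `podMonitor.enabled` and `prometheusRule.enabled` already default to `true`), followed by a check that the rule object was created:

```bash
helm upgrade my-cluster cloudnative-pg/cluster \
  --reuse-values \
  --set cluster.monitoring.enabled=true

# Verify the PrometheusRule was rendered (requires the Prometheus Operator CRDs)
kubectl get prometheusrules --namespace <namespace>
```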
