Implemented Prometheus Rule for automated alerts
Signed-off-by: Itay Grudev <[email protected]>

1 parent 36ffc89, commit cbe499c

Showing 16 changed files with 539 additions and 18 deletions.

@@ -0,0 +1,49 @@

CNPGClusterHACritical
=====================

Meaning
-------

The `CNPGClusterHACritical` alert is triggered when the CloudNativePG cluster has no ready standby replicas.

This can happen during a normal failover or an automated minor version upgrade in a cluster with 2 or fewer
instances. The replaced instance may need some time to catch up with the cluster primary instance.

This alert will always trigger if your cluster is configured to run with only 1 instance. In this case you
may want to silence it.

Impact
------

Having no available replicas puts your cluster at severe risk if the primary instance fails. The primary instance is
still online and able to serve queries, although connections to the `-ro` endpoint will fail.

Diagnosis
---------

Use the [CloudNativePG Grafana Dashboard](https://grafana.com/grafana/dashboards/20417-cloudnativepg/).

Get the status of the CloudNativePG cluster instances:

```bash
kubectl get pods -A -l "cnpg.io/podRole=instance" -o wide
```
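
To see at a glance how many standby replicas are actually ready, the pod list can be reduced to a per-cluster count. The following is a minimal sketch, not part of the chart: the helper name `count_ready_replicas` is hypothetical, and it assumes the pods carry the `cnpg.io/cluster` and `cnpg.io/instanceRole` labels.

```bash
# Hypothetical helper (not part of the chart).
# Reads tab-separated lines: namespace, cluster, instance role, Ready status.
# Prints "<namespace>/<cluster> <ready replica count>" for every cluster seen.
#
# Possible input, assuming the cnpg.io/cluster and cnpg.io/instanceRole labels are set:
#   kubectl get pods -A -l "cnpg.io/podRole=instance" -o jsonpath='{range .items[*]}{.metadata.namespace}{"\t"}{.metadata.labels.cnpg\.io/cluster}{"\t"}{.metadata.labels.cnpg\.io/instanceRole}{"\t"}{.status.conditions[?(@.type=="Ready")].status}{"\n"}{end}' | count_ready_replicas
count_ready_replicas() {
  awk -F '\t' '
    # Register every cluster so that clusters with zero ready replicas are listed too.
    { key = $1 "/" $2; if (!(key in ready)) ready[key] = 0 }
    # Count only replicas whose Ready condition is True.
    $3 == "replica" && $4 == "True" { ready[key]++ }
    END { for (k in ready) print k, ready[k] }
  ' | sort
}
```

A cluster reported with `0` matches the condition described in this runbook.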

Check the logs of the affected CloudNativePG instances:

```bash
kubectl logs --namespace <namespace> pod/<instance-pod-name>
```

Check the CloudNativePG operator logs:

```bash
kubectl logs --namespace cnpg-system -l "app.kubernetes.io/name=cloudnative-pg"
```

Mitigation
----------

Refer to the [CloudNativePG Failure Modes](https://cloudnative-pg.io/documentation/current/failure_modes/)
and [CloudNativePG Troubleshooting](https://cloudnative-pg.io/documentation/current/troubleshooting/) documentation for
more information on how to troubleshoot and mitigate this issue.

@@ -0,0 +1,51 @@

CNPGClusterHAWarning
====================

Meaning
-------

The `CNPGClusterHAWarning` alert is triggered when the CloudNativePG cluster has fewer than `2` ready standby replicas.

This alert will always be triggered if your cluster is configured to run with fewer than `3` instances. In this case you
may want to silence it.

Impact
------

Having fewer than two available replicas puts your cluster at risk if another instance fails. The cluster is still able
to operate normally, although the `-ro` and `-r` endpoints operate at reduced capacity.

This can happen during a normal failover or an automated minor version upgrade. The replaced instance may need some time
to catch up with the cluster primary instance, which will trigger the alert if the operation takes more than 5 minutes.

At `0` available ready replicas, a `CNPGClusterHACritical` alert will be triggered.

Diagnosis
---------

Use the [CloudNativePG Grafana Dashboard](https://grafana.com/grafana/dashboards/20417-cloudnativepg/).

Get the status of the CloudNativePG cluster instances:

```bash
kubectl get pods -A -l "cnpg.io/podRole=instance" -o wide
```

Check the logs of the affected CloudNativePG instances:

```bash
kubectl logs --namespace <namespace> pod/<instance-pod-name>
```

Check the CloudNativePG operator logs:

```bash
kubectl logs --namespace cnpg-system -l "app.kubernetes.io/name=cloudnative-pg"
```

Mitigation
----------

Refer to the [CloudNativePG Failure Modes](https://cloudnative-pg.io/documentation/current/failure_modes/)
and [CloudNativePG Troubleshooting](https://cloudnative-pg.io/documentation/current/troubleshooting/) documentation for
more information on how to troubleshoot and mitigate this issue.

24 changes: 24 additions & 0 deletions
charts/cluster/docs/runbooks/CNPGClusterHighConnectionsCritical.md
@@ -0,0 +1,24 @@

CNPGClusterHighConnectionsCritical
==================================

Meaning
-------

This alert is triggered when the number of connections to the CNPG cluster instance exceeds 85% of its capacity.

Impact
------

At 100% capacity, the CNPG cluster instance will not be able to accept new connections. This will result in a service
disruption.

Diagnosis
---------

Use the [CloudNativePG Grafana Dashboard](https://grafana.com/grafana/dashboards/20417-cloudnativepg/).
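
If the dashboard is not available, the usage can be estimated directly from the instance. The sketch below is an illustration rather than the alert's exact expression: `connection_usage_pct` is a hypothetical helper, and the queries in the comments assume `psql` can be run inside the instance pod.

```bash
# Hypothetical helper (not part of the chart).
# Prints the integer percentage of max_connections currently in use.
#   $1: number of client connections, $2: value of max_connections
#
# Possible way to obtain the two values (assumes psql works inside the pod):
#   used=$(kubectl exec --namespace <namespace> pod/<instance-pod-name> -- \
#     psql -At -c "SELECT count(*) FROM pg_stat_activity WHERE backend_type = 'client backend';")
#   max=$(kubectl exec --namespace <namespace> pod/<instance-pod-name> -- \
#     psql -At -c "SHOW max_connections;")
#   connection_usage_pct "$used" "$max"
connection_usage_pct() {
  used=$1
  max=$2
  if [ "$max" -le 0 ]; then
    echo "max_connections must be positive" >&2
    return 1
  fi
  # Integer division is precise enough for a threshold check.
  echo $(( used * 100 / max ))
}
```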

Mitigation
----------

* Increase the maximum number of connections by raising the `max_connections` PostgreSQL parameter.
* Use connection pooling by enabling PgBouncer to reduce the number of connections to the database.

24 changes: 24 additions & 0 deletions
charts/cluster/docs/runbooks/CNPGClusterHighConnectionsWarning.md
@@ -0,0 +1,24 @@

CNPGClusterHighConnectionsWarning
=================================

Meaning
-------

This alert is triggered when the number of connections to the CNPG cluster instance exceeds 85% of its capacity.

Impact
------

At 100% capacity, the CNPG cluster instance will not be able to accept new connections. This will result in a service
disruption.

Diagnosis
---------

Use the [CloudNativePG Grafana Dashboard](https://grafana.com/grafana/dashboards/20417-cloudnativepg/).

Mitigation
----------

* Increase the maximum number of connections by raising the `max_connections` PostgreSQL parameter.
* Use connection pooling by enabling PgBouncer to reduce the number of connections to the database.
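
Before choosing between the two options, it can help to know how many of the open connections are merely idle, since those are the ones a pooler can consolidate. The following sketch is one possible way to check; `idle_connection_pct` is a hypothetical helper, and the query in the comment assumes `psql` can be run inside the instance pod.

```bash
# Hypothetical helper (not part of the chart).
# Reads "state|count" lines (psql -At output) and prints the percentage of
# connections that are in the "idle" state.
#
# Possible input (assumes psql works inside the pod):
#   kubectl exec --namespace <namespace> pod/<instance-pod-name> -- \
#     psql -At -c "SELECT state, count(*) FROM pg_stat_activity GROUP BY state;" | idle_connection_pct
idle_connection_pct() {
  awk -F '|' '
    { total += $2 }
    $1 == "idle" { idle += $2 }
    END {
      if (total == 0) { print 0; exit }
      printf "%d\n", idle * 100 / total
    }
  '
}
```

A high idle share suggests that pooling is likely to help more than raising `max_connections`.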

31 changes: 31 additions & 0 deletions
charts/cluster/docs/runbooks/CNPGClusterHighReplicationLag.md
@@ -0,0 +1,31 @@

CNPGClusterHighReplicationLag
=============================

Meaning
-------

This alert is triggered when the replication lag of the CNPG cluster is too high.

Impact
------

High replication lag can cause the cluster replicas to become out of sync. Queries to the `-r` and `-ro` endpoints may return stale data.
In the event of a failover, there may be data loss for the time period of the lag.

Diagnosis
---------

Use the [CloudNativePG Grafana Dashboard](https://grafana.com/grafana/dashboards/20417-cloudnativepg/).

High replication lag can be caused by a number of factors, including:

* Network issues
* High load on the primary or replicas
* Long-running queries
* Suboptimal Postgres configuration, in particular a low `max_wal_senders` value.

Inspect the replication status on the primary:

```bash
kubectl exec --namespace <namespace> --stdin --tty services/<cluster_name>-rw -- psql -c "SELECT * from pg_stat_replication;"
```
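
The raw `pg_stat_replication` output can be narrowed down to the replicas that are furthest behind. This is a sketch under assumptions: `lagging_replicas` is a hypothetical helper, the byte threshold is arbitrary, and the lag is measured in WAL bytes, which may differ from the metric the alert itself uses.

```bash
# Hypothetical helper (not part of the chart).
# Reads "application_name|lag_bytes" lines and prints the replicas whose
# replay lag exceeds the threshold given as $1 (in bytes).
#
# Possible input, run against the primary (PostgreSQL 10 or later):
#   kubectl exec --namespace <namespace> services/<cluster_name>-rw -- \
#     psql -At -c "SELECT application_name, pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) FROM pg_stat_replication;" \
#     | lagging_replicas 16777216
lagging_replicas() {
  awk -F '|' -v limit="$1" '$2 + 0 > limit + 0 { print $1 }'
}
```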

Mitigation
----------

27 changes: 27 additions & 0 deletions
charts/cluster/docs/runbooks/CNPGClusterInstancesOnSameNode.md
@@ -0,0 +1,27 @@

CNPGClusterInstancesOnSameNode
==============================

Meaning
-------

The `CNPGClusterInstancesOnSameNode` alert is raised when two or more database pods are scheduled on the same node.

Impact
------

A failure or scheduled downtime of a single node will lead to a potential service disruption and/or data loss.

Diagnosis
---------

Use the [CloudNativePG Grafana Dashboard](https://grafana.com/grafana/dashboards/20417-cloudnativepg/).

```bash
kubectl get pods -A -l "cnpg.io/podRole=instance" -o wide
```
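
With many clusters, spotting co-located pods in the `-o wide` output by eye is error-prone. The following sketch groups instances by cluster and node; `shared_nodes` is a hypothetical helper, and it assumes the pods carry the `cnpg.io/cluster` label.

```bash
# Hypothetical helper (not part of the chart).
# Reads tab-separated lines: namespace, cluster, pod, node.
# Prints "<namespace>/<cluster><TAB><node><TAB><pod count>" for every node
# that hosts more than one instance of the same cluster.
#
# Possible input, assuming the cnpg.io/cluster label is set:
#   kubectl get pods -A -l "cnpg.io/podRole=instance" -o jsonpath='{range .items[*]}{.metadata.namespace}{"\t"}{.metadata.labels.cnpg\.io/cluster}{"\t"}{.metadata.name}{"\t"}{.spec.nodeName}{"\n"}{end}' | shared_nodes
shared_nodes() {
  awk -F '\t' '
    { key = $1 "/" $2 "\t" $4; count[key]++ }
    END { for (k in count) if (count[k] > 1) print k "\t" count[k] }
  ' | sort
}
```

Empty output means no two instances of the same cluster share a node.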

Mitigation
----------

1. Verify that more than one node is free of taints that would prevent pods from being scheduled there.
2. Verify your [affinity](https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/) configuration.

25 changes: 25 additions & 0 deletions
charts/cluster/docs/runbooks/CNPGClusterLowDiskSpaceCritical.md
@@ -0,0 +1,25 @@

CNPGClusterLowDiskSpaceCritical
===============================

Meaning
-------

This alert is triggered when the disk space on the CNPG cluster is running low. It can be triggered by any of the
following volumes:

* Data PVC
* WAL PVC
* Tablespace PVC

Impact
------

Excessive disk space usage can lead to fragmentation, negatively impacting performance. Reaching 100% disk usage will result
in downtime and data loss.

Diagnosis
---------

Use the [CloudNativePG Grafana Dashboard](https://grafana.com/grafana/dashboards/20417-cloudnativepg/).
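
Disk usage can also be checked from inside an instance pod. This is a sketch under assumptions: `full_mounts` is a hypothetical helper, `df` is assumed to be available in the instance image, and the mount paths in the comment are assumptions that may differ in your deployment.

```bash
# Hypothetical helper (not part of the chart).
# Reads `df -P` output and prints the mount points whose usage is at or
# above the percentage given as $1.
#
# Possible input (the paths are assumptions, adjust them to your volumes):
#   kubectl exec --namespace <namespace> pod/<instance-pod-name> -- \
#     df -P /var/lib/postgresql/data /var/lib/postgresql/wal | full_mounts 80
full_mounts() {
  awk -v limit="$1" '
    # Skip the header line; column 5 is the capacity, column 6 the mount point.
    NR > 1 {
      use = $5
      sub(/%/, "", use)
      if (use + 0 >= limit + 0) print $6, use "%"
    }
  '
}
```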

Mitigation
----------

25 changes: 25 additions & 0 deletions
charts/cluster/docs/runbooks/CNPGClusterLowDiskSpaceWarning.md
@@ -0,0 +1,25 @@

CNPGClusterLowDiskSpaceWarning
==============================

Meaning
-------

This alert is triggered when the disk space on the CNPG cluster is running low. It can be triggered by any of the
following volumes:

* Data PVC
* WAL PVC
* Tablespace PVC

Impact
------

Excessive disk space usage can lead to fragmentation, negatively impacting performance. Reaching 100% disk usage will result
in downtime and data loss.

Diagnosis
---------

Use the [CloudNativePG Grafana Dashboard](https://grafana.com/grafana/dashboards/20417-cloudnativepg/).

Mitigation
----------

@@ -0,0 +1,43 @@

CNPGClusterOffline
==================

Meaning
-------

The `CNPGClusterOffline` alert is triggered when there are no ready CloudNativePG instances.

Impact
------

Having an offline cluster means your applications will not be able to access the database, leading to potential service
disruption.

Diagnosis
---------

Use the [CloudNativePG Grafana Dashboard](https://grafana.com/grafana/dashboards/20417-cloudnativepg/).

Get the status of the CloudNativePG cluster instances:

```bash
kubectl get pods -A -l "cnpg.io/podRole=instance" -o wide
```
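
To narrow the list down to the pods that need attention, the instances whose `Ready` condition is not `True` can be filtered out. A minimal sketch follows; `not_ready_instances` is a hypothetical helper, not something the chart provides.

```bash
# Hypothetical helper (not part of the chart).
# Reads tab-separated lines: namespace, pod, phase, Ready status.
# Prints "<namespace>/<pod> (<phase>)" for every pod that is not ready.
#
# Possible input:
#   kubectl get pods -A -l "cnpg.io/podRole=instance" -o jsonpath='{range .items[*]}{.metadata.namespace}{"\t"}{.metadata.name}{"\t"}{.status.phase}{"\t"}{.status.conditions[?(@.type=="Ready")].status}{"\n"}{end}' | not_ready_instances
not_ready_instances() {
  awk -F '\t' '$4 != "True" { print $1 "/" $2, "(" $3 ")" }'
}
```

The pods it prints are the ones whose logs are worth checking first.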

Check the logs of the affected CloudNativePG instances:

```bash
kubectl logs --namespace <namespace> pod/<instance-pod-name>
```

Check the CloudNativePG operator logs:

```bash
kubectl logs --namespace cnpg-system -l "app.kubernetes.io/name=cloudnative-pg"
```

Mitigation
----------

Refer to the [CloudNativePG Failure Modes](https://cloudnative-pg.io/documentation/current/failure_modes/)
and [CloudNativePG Troubleshooting](https://cloudnative-pg.io/documentation/current/troubleshooting/) documentation for
more information on how to troubleshoot and mitigate this issue.

37 changes: 37 additions & 0 deletions
charts/cluster/docs/runbooks/CNPGClusterZoneSpreadWarning.md
@@ -0,0 +1,37 @@

CNPGClusterZoneSpreadWarning
============================

Meaning
-------

The `CNPGClusterZoneSpreadWarning` alert is raised when pods are not evenly distributed across availability zones. More
precisely, the alert is raised when the pods span fewer zones than there are pods and fewer than `3` zones
(`pods > zones < 3`).

Impact
------

The uneven distribution of pods across availability zones can lead to a single point of failure if a zone goes down.

Diagnosis
---------

Use the [CloudNativePG Grafana Dashboard](https://grafana.com/grafana/dashboards/20417-cloudnativepg/).

Get the status of the CloudNativePG cluster instances:

```bash
kubectl get pods -A -l "cnpg.io/podRole=instance" -o wide
```

Get the nodes and their respective zones:

```bash
kubectl get nodes --label-columns topology.kubernetes.io/zone
```
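
The two listings above can be joined to show, per cluster, how many pods there are and how many zones they cover. This is a sketch rather than the alert's own expression: `zone_spread` is a hypothetical helper, the file names are placeholders, and it assumes the pods carry the `cnpg.io/cluster` label.

```bash
# Hypothetical helper (not part of the chart).
#   $1: file with "node<TAB>zone" lines
#   $2: file with "namespace<TAB>cluster<TAB>node" lines
# Prints "<namespace>/<cluster> pods=<n> zones=<m>" for every cluster.
#
# Possible way to produce the two files:
#   kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.metadata.labels.topology\.kubernetes\.io/zone}{"\n"}{end}' > nodes.tsv
#   kubectl get pods -A -l "cnpg.io/podRole=instance" -o jsonpath='{range .items[*]}{.metadata.namespace}{"\t"}{.metadata.labels.cnpg\.io/cluster}{"\t"}{.spec.nodeName}{"\n"}{end}' > pods.tsv
#   zone_spread nodes.tsv pods.tsv
zone_spread() {
  awk -F '\t' '
    # First file: remember the zone of each node.
    NR == FNR { zone[$1] = $2; next }
    # Second file: count pods and distinct zones per cluster.
    {
      key = $1 "/" $2
      pods[key]++
      if (!((key, zone[$3]) in seen)) { seen[key, zone[$3]] = 1; zones[key]++ }
    }
    END { for (k in pods) print k, "pods=" pods[k], "zones=" zones[k] }
  ' "$1" "$2" | sort
}
```

A cluster with more pods than zones and fewer than `3` zones matches the condition described under Meaning.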

Mitigation
----------

1. Verify that each availability zone has more than one node free of taints that would prevent pods from being scheduled there.
2. Verify your [affinity](https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/) configuration.
3. Delete the pods that are not in the desired availability zone, together with their PVCs, and allow the operator to repair the cluster.