From 7357634c0e744d66d087a0cafdaf39c12e7f0f5f Mon Sep 17 00:00:00 2001
From: Gabriele Bartolini
Date: Fri, 1 Mar 2024 10:00:10 +0100
Subject: [PATCH] docs: minimal changes

Signed-off-by: Gabriele Bartolini
---
 .../cluster/docs/runbooks/CNPGClusterHACritical.md |  4 ++--
 .../cluster/docs/runbooks/CNPGClusterHAWarning.md  |  2 +-
 .../runbooks/CNPGClusterHighConnectionsCritical.md |  6 +++---
 .../runbooks/CNPGClusterHighConnectionsWarning.md  |  6 +++---
 .../docs/runbooks/CNPGClusterHighReplicationLag.md |  4 ++--
 .../runbooks/CNPGClusterInstancesOnSameNode.md     |  1 +
 .../runbooks/CNPGClusterLowDiskSpaceCritical.md    | 14 ++++++++++----
 .../runbooks/CNPGClusterLowDiskSpaceWarning.md     | 14 ++++++++++----
 8 files changed, 32 insertions(+), 19 deletions(-)

diff --git a/charts/cluster/docs/runbooks/CNPGClusterHACritical.md b/charts/cluster/docs/runbooks/CNPGClusterHACritical.md
index bb3025171..8be576c32 100644
--- a/charts/cluster/docs/runbooks/CNPGClusterHACritical.md
+++ b/charts/cluster/docs/runbooks/CNPGClusterHACritical.md
@@ -6,10 +6,10 @@ Meaning

The `CNPGClusterHACritical` alert is triggered when the CloudNativePG cluster has no ready standby replicas.

-This can happen during a normal fail-over or automated minor version upgrades in a cluster with 2 or less
+This can happen during either a normal failover or automated minor version upgrades in a cluster with two or fewer
instances. The replaced instance may need some time to catch-up with the cluster primary instance.

-This alarm will be always trigger if your cluster is configured to run with only 1 instance. In this case you
+This alert will always be triggered if your cluster is configured to run with only one instance. In this case, you
may want to silence it.

Impact
diff --git a/charts/cluster/docs/runbooks/CNPGClusterHAWarning.md b/charts/cluster/docs/runbooks/CNPGClusterHAWarning.md
index 9d19633cd..80acfad96 100644
--- a/charts/cluster/docs/runbooks/CNPGClusterHAWarning.md
+++ b/charts/cluster/docs/runbooks/CNPGClusterHAWarning.md
@@ -15,7 +15,7 @@ Impact

Having less than two available replicas puts your cluster at risk if another instance fails. The cluster is still able
to operate normally, although the `-ro` and `-r` endpoints operate at reduced capacity.

-This can happen during a normal fail-over or automated minor version upgrades. The replaced instance may need some time
+This can happen during a normal failover or automated minor version upgrades. The replaced instance may need some time
to catch-up with the cluster primary instance which will trigger the alert if the operation takes more than 5 minutes.

At `0` available ready replicas, a `CNPGClusterHACritical` alert will be triggered.
diff --git a/charts/cluster/docs/runbooks/CNPGClusterHighConnectionsCritical.md b/charts/cluster/docs/runbooks/CNPGClusterHighConnectionsCritical.md
index 7d62a5f52..2003421b9 100644
--- a/charts/cluster/docs/runbooks/CNPGClusterHighConnectionsCritical.md
+++ b/charts/cluster/docs/runbooks/CNPGClusterHighConnectionsCritical.md
@@ -4,12 +4,12 @@ CNPGClusterHighConnectionsCritical
Meaning
-------

-This alert is triggered when the number of connections to the CNPG cluster instance exceeds 95% of its capacity.
+This alert is triggered when the number of connections to the CloudNativePG cluster instance exceeds 95% of its capacity.

Impact
------

-At 100% capacity, the CNPG cluster instance will not be able to accept new connections. This will result in a service
+At 100% capacity, the CloudNativePG cluster instance will not be able to accept new connections. This will result in a service
disruption.

Diagnosis
---------
@@ -20,5 +20,5 @@ Use the [CloudNativePG Grafana Dashboard](https://grafana.com/grafana/dashboards
Mitigation
----------

-* Increase the maximum number of connections by increasing the `max_connections` Postgresql parameter.
+* Increase the maximum number of connections by raising the `max_connections` PostgreSQL parameter (see the sketch below).
* Use connection pooling by enabling PgBouncer to reduce the number of connections to the database.
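+
+A minimal sketch of both mitigations. The resource names and values below are
+illustrative, not taken from this chart, and the chart may expose equivalent
+settings through its own values:
+
+```yaml
+apiVersion: postgresql.cnpg.io/v1
+kind: Cluster
+metadata:
+  name: cluster-example
+spec:
+  instances: 3
+  storage:
+    size: 10Gi
+  postgresql:
+    parameters:
+      # Raising max_connections requires a restart of the instances,
+      # which the operator performs in a rolling fashion.
+      max_connections: "200"
+---
+# Optional: a PgBouncer pooler in front of the cluster's read-write service.
+apiVersion: postgresql.cnpg.io/v1
+kind: Pooler
+metadata:
+  name: cluster-example-pooler-rw
+spec:
+  cluster:
+    name: cluster-example
+  instances: 1
+  type: rw
+  pgbouncer:
+    poolMode: session
+```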
diff --git a/charts/cluster/docs/runbooks/CNPGClusterHighConnectionsWarning.md b/charts/cluster/docs/runbooks/CNPGClusterHighConnectionsWarning.md
index b8f7397ea..636579f75 100644
--- a/charts/cluster/docs/runbooks/CNPGClusterHighConnectionsWarning.md
+++ b/charts/cluster/docs/runbooks/CNPGClusterHighConnectionsWarning.md
@@ -4,12 +4,12 @@ CNPGClusterHighConnectionsWarning
Meaning
-------

-This alert is triggered when the number of connections to the CNPG cluster instance exceeds 85% of its capacity.
+This alert is triggered when the number of connections to the CloudNativePG cluster instance exceeds 85% of its capacity.

Impact
------

-At 100% capacity, the CNPG cluster instance will not be able to accept new connections. This will result in a service
+At 100% capacity, the CloudNativePG cluster instance will not be able to accept new connections. This will result in a service
disruption.

Diagnosis
---------
@@ -20,5 +20,5 @@ Use the [CloudNativePG Grafana Dashboard](https://grafana.com/grafana/dashboards
Mitigation
----------

-* Increase the maximum number of connections by increasing the `max_connections` Postgresql parameter.
+* Increase the maximum number of connections by raising the `max_connections` PostgreSQL parameter.
* Use connection pooling by enabling PgBouncer to reduce the number of connections to the database.
diff --git a/charts/cluster/docs/runbooks/CNPGClusterHighReplicationLag.md b/charts/cluster/docs/runbooks/CNPGClusterHighReplicationLag.md
index 86e1adde4..78963ce09 100644
--- a/charts/cluster/docs/runbooks/CNPGClusterHighReplicationLag.md
+++ b/charts/cluster/docs/runbooks/CNPGClusterHighReplicationLag.md
@@ -4,7 +4,7 @@ CNPGClusterHighReplicationLag
Meaning
-------

-This alert is triggered when the replication lag of the CNPG cluster exceed `1s`.
+This alert is triggered when the replication lag of the CloudNativePG cluster exceeds `1s`.

Impact
------
@@ -21,7 +21,7 @@ High replication lag can be caused by a number of factors, including:
* Network issues
* High load on the primary or replicas
* Long running queries
-* Suboptimal Postgres configuration, in particular small numbers of `max_wal_senders`.
+* Suboptimal PostgreSQL configuration, in particular a low `max_wal_senders` setting.

```bash
kubectl exec --namespace <namespace> --stdin --tty services/<cluster-name>-rw -- psql -c "SELECT * from pg_stat_replication;"
```
diff --git a/charts/cluster/docs/runbooks/CNPGClusterInstancesOnSameNode.md b/charts/cluster/docs/runbooks/CNPGClusterInstancesOnSameNode.md
index 89f27cc13..df309ffa9 100644
--- a/charts/cluster/docs/runbooks/CNPGClusterInstancesOnSameNode.md
+++ b/charts/cluster/docs/runbooks/CNPGClusterInstancesOnSameNode.md
@@ -25,3 +25,4 @@ Mitigation

1. Verify you have more than a single node with no taints, preventing pods to be scheduled there.
2. Verify your [affinity](https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/) configuration.
+3. For more information, refer to the ["Scheduling"](https://cloudnative-pg.io/documentation/current/scheduling/) section of the documentation; a minimal anti-affinity sketch follows below.
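+
+A minimal sketch of the relevant `Cluster` affinity settings (the resource
+name and values are illustrative):
+
+```yaml
+apiVersion: postgresql.cnpg.io/v1
+kind: Cluster
+metadata:
+  name: cluster-example
+spec:
+  instances: 3
+  storage:
+    size: 10Gi
+  affinity:
+    # Spread instances across nodes; "required" makes this a hard
+    # scheduling constraint rather than a preference.
+    enablePodAntiAffinity: true
+    podAntiAffinityType: required
+    topologyKey: kubernetes.io/hostname
+```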
diff --git a/charts/cluster/docs/runbooks/CNPGClusterLowDiskSpaceCritical.md b/charts/cluster/docs/runbooks/CNPGClusterLowDiskSpaceCritical.md
index 2d6daab8f..5b7355275 100644
--- a/charts/cluster/docs/runbooks/CNPGClusterLowDiskSpaceCritical.md
+++ b/charts/cluster/docs/runbooks/CNPGClusterLowDiskSpaceCritical.md
@@ -4,11 +4,11 @@ CNPGClusterLowDiskSpaceCritical
Meaning
-------

-This alert is triggered when the disk space on the CNPG cluster exceeds 90%. It can be triggered by either:
+This alert is triggered when disk usage on the CloudNativePG cluster exceeds 90%. It can be triggered by any of:

-* Data PVC
-* WAL PVC
-* Tablespace PVC
+* the PVC hosting the `PGDATA` (`storage` section)
+* the PVC hosting WAL files (`walStorage` section), where applicable
+* any PVC hosting a tablespace (`tablespaces` section)

Impact
------
@@ -23,3 +23,9 @@ Diagnosis

Mitigation
----------
+
+If you experience issues with the WAL (Write-Ahead Logging) volume and have
+set up continuous archiving, ensure that WAL archiving is functioning
+correctly. This is crucial to avoid a buildup of WAL files in the `pg_wal`
+folder. Monitor the `cnpg_collector_pg_wal_archive_status` metric, specifically
+ensuring that the number of `ready` files does not increase linearly.
diff --git a/charts/cluster/docs/runbooks/CNPGClusterLowDiskSpaceWarning.md b/charts/cluster/docs/runbooks/CNPGClusterLowDiskSpaceWarning.md
index d91b1d74f..36e56acf1 100644
--- a/charts/cluster/docs/runbooks/CNPGClusterLowDiskSpaceWarning.md
+++ b/charts/cluster/docs/runbooks/CNPGClusterLowDiskSpaceWarning.md
@@ -4,11 +4,11 @@ CNPGClusterLowDiskSpaceWarning
Meaning
-------

-This alert is triggered when the disk space on the CNPG cluster exceeds 70%. It can be triggered by either:
+This alert is triggered when disk usage on the CloudNativePG cluster exceeds 70%. It can be triggered by any of:

-* Data PVC
-* WAL PVC
-* Tablespace PVC
+* the PVC hosting the `PGDATA` (`storage` section)
+* the PVC hosting WAL files (`walStorage` section), where applicable
+* any PVC hosting a tablespace (`tablespaces` section)

Impact
------
@@ -23,3 +23,9 @@ Diagnosis

Mitigation
----------
+
+If you experience issues with the WAL (Write-Ahead Logging) volume and have
+set up continuous archiving, ensure that WAL archiving is functioning
+correctly. This is crucial to avoid a buildup of WAL files in the `pg_wal`
+folder. Monitor the `cnpg_collector_pg_wal_archive_status` metric, specifically
+ensuring that the number of `ready` files does not increase linearly.
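+
+As a sketch, assuming the Prometheus Operator CRDs are available in your
+monitoring stack, a rule along these lines can catch a stalled archiver. The
+alert name, threshold, and duration are illustrative; verify the metric's
+labels against your deployment:
+
+```yaml
+apiVersion: monitoring.coreos.com/v1
+kind: PrometheusRule
+metadata:
+  name: cnpg-wal-archiving
+spec:
+  groups:
+    - name: cnpg-wal-archiving
+      rules:
+        - alert: CNPGWalArchivingStalled
+          # Fires when WAL segments keep waiting in pg_wal/archive_status
+          # instead of being archived and marked "done".
+          expr: cnpg_collector_pg_wal_archive_status{value="ready"} > 10
+          for: 15m
+          labels:
+            severity: warning
+```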