Skip to content

Commit

Permalink
fix(cluster): Prometheus Rule CNPGClusterOffline false positives (#291)
Browse files Browse the repository at this point in the history
* Bug Fix: Direct matching instead of pattern matching for namespace in PrometheusRules
* Bug Fix: CNPGClusterOffline false positives

Signed-off-by: Itay Grudev <[email protected]>
  • Loading branch information
itay-grudev authored May 25, 2024
1 parent 917b767 commit 7da043f
Show file tree
Hide file tree
Showing 6 changed files with 6 additions and 6 deletions.
Original file line number Diff line number Diff line change
Expand Up @@ -8,7 +8,7 @@ annotations:
the maximum number of connections.
runbook_url: https://github.com/cloudnative-pg/charts/blob/main/charts/cluster/docs/runbooks/CNPGClusterHighConnectionsCritical.md
expr: |
sum by (pod) (cnpg_backends_total{namespace=~"{{ .namespace }}", pod=~"{{ .podSelector }}"}) / max by (pod) (cnpg_pg_settings_setting{name="max_connections", namespace=~"{{ .namespace }}", pod=~"{{ .podSelector }}"}) * 100 > 95
sum by (pod) (cnpg_backends_total{namespace="{{ .namespace }}", pod=~"{{ .podSelector }}"}) / max by (pod) (cnpg_pg_settings_setting{name="max_connections", namespace="{{ .namespace }}", pod=~"{{ .podSelector }}"}) * 100 > 95
for: 5m
labels:
severity: critical
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -8,7 +8,7 @@ annotations:
the maximum number of connections.
runbook_url: https://github.com/cloudnative-pg/charts/blob/main/charts/cluster/docs/runbooks/CNPGClusterHighConnectionsWarning.md
expr: |
sum by (pod) (cnpg_backends_total{namespace=~"{{ .namespace }}", pod=~"{{ .podSelector }}"}) / max by (pod) (cnpg_pg_settings_setting{name="max_connections", namespace=~"{{ .namespace }}", pod=~"{{ .podSelector }}"}) * 100 > 80
sum by (pod) (cnpg_backends_total{namespace="{{ .namespace }}", pod=~"{{ .podSelector }}"}) / max by (pod) (cnpg_pg_settings_setting{name="max_connections", namespace="{{ .namespace }}", pod=~"{{ .podSelector }}"}) * 100 > 80
for: 5m
labels:
severity: warning
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -10,7 +10,7 @@ annotations:
High replication lag indicates network issues, busy instances, slow queries or suboptimal configuration.
runbook_url: https://github.com/cloudnative-pg/charts/blob/main/charts/cluster/docs/runbooks/CNPGClusterHighReplicationLag.md
expr: |
max(cnpg_pg_replication_lag{namespace=~"{{ .namespace }}",pod=~"{{ .podSelector }}"}) * 1000 > 1000
max(cnpg_pg_replication_lag{namespace="{{ .namespace }}",pod=~"{{ .podSelector }}"}) * 1000 > 1000
for: 5m
labels:
severity: warning
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -10,7 +10,7 @@ annotations:
A failure or scheduled downtime of a single node will lead to a potential service disruption and/or data loss.
runbook_url: https://github.com/cloudnative-pg/charts/blob/main/charts/cluster/docs/runbooks/CNPGClusterInstancesOnSameNode.md
expr: |
count by (node) (kube_pod_info{namespace=~"{{ .namespace }}", pod=~"{{ .podSelector }}"}) > 1
count by (node) (kube_pod_info{namespace="{{ .namespace }}", pod=~"{{ .podSelector }}"}) > 1
for: 5m
labels:
severity: warning
Expand Down
2 changes: 1 addition & 1 deletion charts/cluster/prometheus_rules/cluster-offline.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -10,7 +10,7 @@ annotations:
potential service disruption and/or data loss.
runbook_url: https://github.com/cloudnative-pg/charts/blob/main/charts/cluster/docs/runbooks/CNPGClusterOffline.md
expr: |
({{ .Values.cluster.instances }} - count(cnpg_collector_up{namespace=~"{{ .namespace }}",pod=~"{{ .podSelector }}"}) OR vector(0)) > 0
(count(cnpg_collector_up{namespace="{{ .namespace }}",pod=~"{{ .podSelector }}"}) OR on() vector(0)) == 0
for: 5m
labels:
severity: critical
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -9,7 +9,7 @@ annotations:
A disaster in one availability zone will lead to a potential service disruption and/or data loss.
runbook_url: https://github.com/cloudnative-pg/charts/blob/main/charts/cluster/docs/runbooks/CNPGClusterZoneSpreadWarning.md
expr: |
{{ .Values.cluster.instances }} > count(count by (label_topology_kubernetes_io_zone) (kube_pod_info{namespace=~"{{ .namespace }}", pod=~"{{ .podSelector }}"} * on(node,instance) group_left(label_topology_kubernetes_io_zone) kube_node_labels)) < 3
{{ .Values.cluster.instances }} > count(count by (label_topology_kubernetes_io_zone) (kube_pod_info{namespace="{{ .namespace }}", pod=~"{{ .podSelector }}"} * on(node,instance) group_left(label_topology_kubernetes_io_zone) kube_node_labels)) < 3
for: 5m
labels:
severity: warning
Expand Down

0 comments on commit 7da043f

Please sign in to comment.