Various updates and quality of life changes #405

Open
wants to merge 28 commits into
base: master
8d3ae04
smartctl_exporter publishes both drive_trip and current drive tempera…
guruevi Feb 24, 2024
6fe429e
Add an option to run GitHub Action manually
guruevi Feb 24, 2024
fbca1a1
Add an option to force running the action for testing purposes
guruevi Feb 24, 2024
e3bc917
Set variables correctly
guruevi Feb 24, 2024
79960ae
Set variables correctly
guruevi Feb 24, 2024
59dc6dc
Publish
samber Feb 24, 2024
d6ef8e7
Clean up some more metrics
guruevi Feb 25, 2024
b660faf
Publish
samber Feb 25, 2024
87ee129
Minor bug fixes
guruevi Feb 25, 2024
46b9ccf
Merge remote-tracking branch 'guruevi/master'
guruevi Feb 25, 2024
45a711f
Publish
samber Feb 25, 2024
4604336
Removed queries that throw errors when systems are upgraded. Also fix…
guruevi Feb 25, 2024
c026db7
Publish
samber Feb 25, 2024
224e6d0
Refined some more queries
guruevi Mar 6, 2024
7e0d009
Publish
samber Mar 6, 2024
bfd04e6
Merge branch 'samber:master' into master
guruevi Mar 13, 2024
a68beee
PostgreSQL now has optimized autovacuum behavior
guruevi Mar 13, 2024
351e45c
Merge remote-tracking branch 'guruevi/master'
guruevi Mar 13, 2024
8789b86
Publish
samber Mar 13, 2024
c823aca
PostgreSQL now has optimized autovacuum behavior
guruevi Apr 11, 2024
76a86c3
Publish
samber Apr 11, 2024
0c2876e
Merge branch 'samber:master' into master
guruevi Apr 11, 2024
51d0484
Merge remote-tracking branch 'samber/master'
guruevi Jul 2, 2024
6e48cba
Publish
samber Jul 2, 2024
54e2b09
Query fails if instance names are not unique across jobs. This fixes it.
guruevi Jul 2, 2024
84a9260
Merge remote-tracking branch 'origin/master'
guruevi Jul 2, 2024
9766507
Publish
samber Jul 2, 2024
860055d
Merge branch 'master' into master
guruevi Oct 3, 2024
3 changes: 1 addition & 2 deletions .github/workflows/dist.yml
@@ -1,15 +1,14 @@
name: Publish

on:
workflow_dispatch:
push:
branches:
- master

jobs:
publish:
name: Publish
# Check if the PR is not from a fork
if: github.repository_owner == 'samber'
runs-on: ubuntu-latest
steps:
- name: Checkout Repo
4 changes: 2 additions & 2 deletions CONTRIBUTING.md
@@ -32,8 +32,8 @@ Or with Docker:
docker run --rm -it -p 4000:4000 -v $(pwd):/srv/jekyll jekyll/jekyll jekyll serve
```

-Or with Docker-Compose:
+Or with Docker Compose:

```
-docker-compose up -d
+docker compose up -d
```
221 changes: 117 additions & 104 deletions _data/rules.yml

Large diffs are not rendered by default.

144 changes: 63 additions & 81 deletions dist/rules/host-and-hardware/node-exporter.yml

Large diffs are not rendered by default.

16 changes: 8 additions & 8 deletions dist/rules/postgresql/postgres-exporter.yml
@@ -32,7 +32,7 @@ groups:
description: "Postgresql exporter is showing errors. A query may be buggy in query.yaml\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"

- alert: PostgresqlTableNotAutoVacuumed
-expr: '(pg_stat_user_tables_last_autovacuum > 0) and (time() - pg_stat_user_tables_last_autovacuum) > 60 * 60 * 24 * 10'
+expr: '((pg_stat_user_tables_n_tup_del + pg_stat_user_tables_n_tup_upd + pg_stat_user_tables_n_tup_hot_upd) > pg_settings_autovacuum_vacuum_threshold) and (time() - pg_stat_user_tables_last_autovacuum) > 864000'
for: 0m
labels:
severity: warning
@@ -41,7 +41,7 @@ groups:
description: "Table {{ $labels.relname }} has not been auto vacuumed for 10 days\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"

- alert: PostgresqlTableNotAutoAnalyzed
-expr: '(pg_stat_user_tables_last_autoanalyze > 0) and (time() - pg_stat_user_tables_last_autoanalyze) > 24 * 60 * 60 * 10'
+expr: '((pg_stat_user_tables_n_tup_del + pg_stat_user_tables_n_tup_upd + pg_stat_user_tables_n_tup_hot_upd) > pg_settings_autovacuum_analyze_threshold) and (time() - pg_stat_user_tables_last_autoanalyze) > 864000'
for: 0m
labels:
severity: warning
@@ -53,7 +53,7 @@ groups:
expr: 'sum by (instance, job, server) (pg_stat_activity_count) > min by (instance, job, server) (pg_settings_max_connections * 0.8)'
for: 2m
labels:
-severity: warning
+severity: critical
annotations:
summary: Postgresql too many connections (instance {{ $labels.instance }})
description: "PostgreSQL instance has too many connections (> 80%).\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
@@ -62,7 +62,7 @@ groups:
expr: 'sum by (datname) (pg_stat_activity_count{datname!~"template.*|postgres"}) < 5'
for: 2m
labels:
-severity: warning
+severity: critical
annotations:
summary: Postgresql not enough connections (instance {{ $labels.instance }})
description: "PostgreSQL instance should have more connections (> 5)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
@@ -86,8 +86,8 @@ groups:
description: "Ratio of transactions being aborted compared to committed is > 2 %\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"

- alert: PostgresqlCommitRateLow
-expr: 'rate(pg_stat_database_xact_commit[1m]) < 10'
-for: 2m
+expr: 'increase(pg_stat_database_xact_commit{datname!~"template.*|postgres",datid!="0"}[5m]) < 5'
+for: 5m
labels:
severity: critical
annotations:
@@ -140,7 +140,7 @@ groups:
description: "PostgreSQL dead tuples is too large\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"

- alert: PostgresqlConfigurationChanged
-expr: '{__name__=~"pg_settings_.*"} != ON(__name__) {__name__=~"pg_settings_([^t]|t[^r]|tr[^a]|tra[^n]|tran[^s]|trans[^a]|transa[^c]|transac[^t]|transact[^i]|transacti[^o]|transactio[^n]|transaction[^_]|transaction_[^r]|transaction_r[^e]|transaction_re[^a]|transaction_rea[^d]|transaction_read[^_]|transaction_read_[^o]|transaction_read_o[^n]|transaction_read_on[^l]|transaction_read_onl[^y]).*"} OFFSET 5m'
+expr: 'changes(label_replace({__name__=~"pg_settings_.*"},"name","$1","__name__", "(.+)")[1h:]) > 0'
for: 0m
labels:
severity: info
@@ -155,7 +155,7 @@ groups:
severity: critical
annotations:
summary: Postgresql SSL compression active (instance {{ $labels.instance }})
-description: "Database connections with SSL compression enabled. This may add significant jitter in replication delay. Replicas should turn off SSL compression via `sslcompression=0` in `recovery.conf`.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
+description: "Database allows connections with SSL compression enabled.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"

- alert: PostgresqlTooManyLocksAcquired
expr: '((sum (pg_locks_count)) / (pg_settings_max_locks_per_transaction * pg_settings_max_connections)) > 0.20'
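
The `864000` literal in the two autovacuum rules above is the same ten days that the old `60 * 60 * 24 * 10` spelled out, in seconds. A `promtool test rules` unit test is one way to check that time clause in isolation. The fragment below is a minimal sketch and not part of this PR: the `instance` and `relname` label values and the sample timestamps are made up, and it exercises only the `time()` half of the expression, not the tuple-count comparison. It relies on promtool starting its test clock at the Unix epoch, so `time()` equals `eval_time` in seconds.

```yaml
# Hypothetical promtool unit test (run with: promtool test rules <this file>).
# No rule_files are loaded; only the time clause of the expression is evaluated.
tests:
  - interval: 1d
    input_series:
      # Illustrative table whose last autovacuum happened one hour after the epoch.
      - series: 'pg_stat_user_tables_last_autovacuum{instance="db1", relname="orders"}'
        values: '3600+0x11'
    promql_expr_test:
      # After 9 days: 777600 - 3600 = 774000 s, below the 864000 s threshold.
      - expr: '(time() - pg_stat_user_tables_last_autovacuum) > 864000'
        eval_time: 9d
        exp_samples: []
      # After 11 days: 950400 - 3600 = 946800 s, above the threshold.
      - expr: '(time() - pg_stat_user_tables_last_autovacuum) > 864000'
        eval_time: 11d
        exp_samples:
          - labels: '{instance="db1", relname="orders"}'
            value: 946800
```

The subtraction drops the metric name, which is why the expected sample carries labels only.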
4 changes: 2 additions & 2 deletions dist/rules/prometheus-self-monitoring/embedded-exporter.yml
@@ -32,7 +32,7 @@ groups:
description: "A Prometheus job does not have living target anymore.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"

- alert: PrometheusTargetMissingWithWarmupTime
-expr: 'sum by (instance, job) ((up == 0) * on (instance) group_right(job) (node_time_seconds - node_boot_time_seconds > 600))'
+expr: 'sum by (instance, job) ((up == 0) * on (instance) group_left (__name__) (node_time_seconds - node_boot_time_seconds > 600))'
for: 0m
labels:
severity: critical
@@ -248,7 +248,7 @@ groups:
description: "Prometheus encountered {{ $value }} TSDB WAL truncation failures\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"

- alert: PrometheusTimeseriesCardinality
-expr: 'label_replace(count by(__name__) ({__name__=~".+"}), "name", "$1", "__name__", "(.+)") > 10000'
+expr: '(label_replace(count by (__name__) ({__name__=~".+"}), "name", "$1", "__name__", "(.+)") unless on (__name__) ({__name__=~"node_cpu.*|node_systemd_unit_state"})) > 10000'
for: 0m
labels:
severity: warning
65 changes: 46 additions & 19 deletions dist/rules/s.m.a.r.t-device-monitoring/smartctl-exporter.yml
@@ -5,46 +5,73 @@ groups:
rules:

- alert: SmartDeviceTemperatureWarning
-expr: 'smartctl_device_temperature > 60'
-for: 2m
+expr: '(avg_over_time(smartctl_device_temperature{temperature_type="current"} [5m]) unless on (instance, device) smartctl_device_temperature{temperature_type="drive_trip"}) > 60'
+for: 0m
labels:
severity: warning
annotations:
-summary: Smart device temperature warning (instance {{ $labels.instance }})
-description: "Device temperature warning (instance {{ $labels.instance }})\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
+summary: SMART device temperature warning (instance {{ $labels.instance }})
+description: "Device temperature warning on {{ $labels.instance }} drive {{ $labels.device }} over 60°C\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"

- alert: SmartDeviceTemperatureCritical
-expr: 'smartctl_device_temperature > 80'
-for: 2m
+expr: '(max_over_time(smartctl_device_temperature{temperature_type="current"} [5m]) unless on (instance, device) smartctl_device_temperature{temperature_type="drive_trip"}) > 70'
+for: 0m
labels:
severity: critical
annotations:
-summary: Smart device temperature critical (instance {{ $labels.instance }})
-description: "Device temperature critical (instance {{ $labels.instance }})\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
+summary: SMART device temperature critical (instance {{ $labels.instance }})
+description: "Device temperature critical on {{ $labels.instance }} drive {{ $labels.device }} over 70°C\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"

+- alert: SmartDeviceTemperatureOverTripValue
+expr: 'max_over_time(smartctl_device_temperature{temperature_type="current"} [10m]) >= on(device, instance) smartctl_device_temperature{temperature_type="drive_trip"}'
+for: 0m
+labels:
+severity: critical
+annotations:
+summary: SMART device temperature over trip value (instance {{ $labels.instance }})
+description: "Device temperature over trip value on {{ $labels.instance }} drive {{ $labels.device }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
+
+- alert: SmartDeviceTemperatureNearingTripValue
+expr: 'max_over_time(smartctl_device_temperature{temperature_type="current"} [10m]) >= on(device, instance) (smartctl_device_temperature{temperature_type="drive_trip"} * .80)'
+for: 0m
+labels:
+severity: warning
+annotations:
+summary: SMART device temperature nearing trip value (instance {{ $labels.instance }})
+description: "Device temperature at 80% of trip value on {{ $labels.instance }} drive {{ $labels.device }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
+
+- alert: SmartStatus
+expr: 'smartctl_device_smart_status != 1'
+for: 0m
+labels:
+severity: critical
+annotations:
+summary: SMART status (instance {{ $labels.instance }})
+description: "Device has a SMART status failure on {{ $labels.instance }} drive {{ $labels.device }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
+
- alert: SmartCriticalWarning
expr: 'smartctl_device_critical_warning > 0'
-for: 15m
+for: 0m
labels:
severity: critical
annotations:
-summary: Smart critical warning (instance {{ $labels.instance }})
-description: "device has critical warning (instance {{ $labels.instance }})\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
+summary: SMART critical warning (instance {{ $labels.instance }})
+description: "Disk controller has critical warning on {{ $labels.instance }} drive {{ $labels.device }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"

- alert: SmartMediaErrors
expr: 'smartctl_device_media_errors > 0'
-for: 15m
+for: 0m
labels:
severity: critical
annotations:
-summary: Smart media errors (instance {{ $labels.instance }})
-description: "device has media errors (instance {{ $labels.instance }})\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
+summary: SMART media errors (instance {{ $labels.instance }})
+description: "Disk controller detected media errors on {{ $labels.instance }} drive {{ $labels.device }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"

-- alert: SmartNvmeWearoutIndicator
-expr: 'smartctl_device_available_spare{device=~"nvme.*"} < smartctl_device_available_spare_threshold{device=~"nvme.*"}'
-for: 15m
+- alert: SmartWearoutIndicator
+expr: 'smartctl_device_available_spare < smartctl_device_available_spare_threshold'
+for: 0m
labels:
severity: critical
annotations:
-summary: Smart NVME Wearout Indicator (instance {{ $labels.instance }})
-description: "NVMe device is wearing out (instance {{ $labels.instance }})\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
+summary: SMART Wearout Indicator (instance {{ $labels.instance }})
+description: "Device is wearing out on {{ $labels.instance }} drive {{ $labels.device }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
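
The temperature rules in this file split drives into two groups: those that export a `drive_trip` series are judged relative to that trip value, and the rest fall back to the fixed 60°C and 70°C thresholds through `unless on (instance, device)`. The `promtool test rules` fragment below is a minimal sketch of that split and is not part of this PR. The host and device names and the temperatures are made up, the expressions are copied from the rules above, and it has not been run against this branch.

```yaml
# Hypothetical promtool unit test (run with: promtool test rules <this file>).
evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # sda exports both its current temperature and a drive_trip value.
      - series: 'smartctl_device_temperature{instance="host1", device="sda", temperature_type="current"}'
        values: '65+0x10'
      - series: 'smartctl_device_temperature{instance="host1", device="sda", temperature_type="drive_trip"}'
        values: '70+0x10'
      # sdb exports only its current temperature.
      - series: 'smartctl_device_temperature{instance="host1", device="sdb", temperature_type="current"}'
        values: '65+0x10'

    promql_expr_test:
      # Fixed threshold: sda is removed by `unless` because it has a drive_trip
      # series, so only sdb is compared against 60.
      - expr: '(avg_over_time(smartctl_device_temperature{temperature_type="current"} [5m]) unless on (instance, device) smartctl_device_temperature{temperature_type="drive_trip"}) > 60'
        eval_time: 10m
        exp_samples:
          - labels: '{device="sdb", instance="host1", temperature_type="current"}'
            value: 65

      # Trip-relative threshold: 65 >= 70 * 0.80 = 56, so sda matches.
      # sdb has no drive_trip series to match against and is dropped.
      - expr: 'max_over_time(smartctl_device_temperature{temperature_type="current"} [10m]) >= on(device, instance) (smartctl_device_temperature{temperature_type="drive_trip"} * .80)'
        eval_time: 10m
        exp_samples:
          - labels: '{device="sda", instance="host1"}'
            value: 65
```

In the second test the one-to-one `on(device, instance)` match keeps only those two labels in the result, so the alert built from that expression would not carry `temperature_type` in `{{ $labels }}`.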