This repo guides you through installing Grafana, Prometheus, the Prometheus Blackbox exporter, and Node Exporter via Helm charts to monitor a Kubernetes cluster.
The setup uses the Grafana Helm Chart and the Prometheus Helm Chart. Feel free to customize the values file, then run the setup script:
./setup.sh
Panel Title: CPU Usage (%)
Monitor for high CPU usage. The node_cpu_seconds_total metric is a counter of the seconds each CPU core has spent in each mode (user, system, idle, and so on). You can use the rate() function to calculate its per-second rate of increase over a specified time interval.
PromQL Query
rate(node_cpu_seconds_total{cpu="1", namespace="staging"}[5m])
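This query returns the per-second rate of CPU time for core 1 in the staging namespace, averaged over the past 5 minutes, with one series per CPU mode. Each value is a fraction between 0 and 1, so to alert on high CPU usage you can compare against a threshold such as 0.8 (80% usage). Because the result includes the idle mode, filter it out (for example with mode!="idle") before applying the threshold.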
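As a rough sketch of the 80% threshold described above, the expression below derives a usage percentage from the idle mode instead of summing the busy modes. It assumes the series carry an `instance` label and averages across all cores of each instance; the `cpu` and `namespace` selectors from the original query are dropped here and can be added back if your labels include them.

```promql
# Percent CPU used per instance = 100 * (1 - idle fraction), averaged over 5 minutes.
# Fires when usage is above 80%. The "instance" grouping label is an assumption.
100 * (1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))) > 80
```

Without the trailing `> 80`, the same expression can back the panel itself.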
Panel Title: Memory Used
Track available memory. Low available memory is a sign of high memory usage and can become critical if not addressed, as it may lead to memory exhaustion and swapping, which degrades system performance.
PromQL Query
node_memory_MemAvailable_bytes
Alert
Trigger an alert when available memory falls below a percentage of total memory. If available memory is less than 10% of total memory, most of the memory is being used by applications and processes.
node_memory_MemAvailable_bytes{instance=~"your-instance-regex"} / node_memory_MemTotal_bytes{instance=~"your-instance-regex"} < 0.10
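Since the panel is titled "Memory Used" while the queries above report available memory, one possible way to plot usage directly is to invert the ratio. This is only a sketch; it assumes both metrics share the same label set so the division matches series one to one.

```promql
# Percent of memory in use = 100 * (1 - available / total).
# A value above 90 corresponds to less than 10% of memory being available.
100 * (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)
```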
Panel Title: Network
A good approach is to alert on either a sudden drop in network traffic (which might indicate a network issue or a service outage) or sustained high network traffic (which might indicate network saturation or a DDoS attack).
PromQL Query
To detect a network receive rate that is too low, compare the rate() of received bytes against a lower bound (here 1000 bytes per second):
rate(node_network_receive_bytes_total{node=~"ip-172-31-95-55\\.ec2\\.internal|ip-172-31-36-214\\.ec2\\.internal"}[5m]) < 1000
To detect a network receive rate that is too high, use the same expression with a greater-than comparison against an upper bound (here 10000 bytes per second; adjust it to your traffic):
rate(node_network_receive_bytes_total{node=~"ip-172-31-95-55\\.ec2\\.internal|ip-172-31-36-214\\.ec2\\.internal"}[5m]) > 10000
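The queries above return one series per network interface. If a single value per node is more useful, a variation along these lines sums the interfaces and skips loopback. It assumes the standard `device` label from Node Exporter and the `node` label used in the queries above; the 10000 bytes per second bound is carried over as a placeholder.

```promql
# Total receive rate per node across all interfaces except loopback.
# Fires when the summed rate is above the (placeholder) upper bound.
sum by (node) (rate(node_network_receive_bytes_total{device!="lo"}[5m])) > 10000
```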
Panel Title: Disk Usage
The node_disk_io_now metric indicates the number of I/O operations currently in progress. Set an alert if the number of I/O operations exceeds a certain threshold for a sustained period.
PromQL Query
node_disk_io_now > 100
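The query above compares the instantaneous value, so a brief spike would match it. One way to approximate the "sustained period" condition is to average the gauge over a window first. The 10-minute window below is an assumption chosen for illustration, not a value from this repo.

```promql
# Average number of in-flight I/O operations over the last 10 minutes.
# Short spikes are smoothed out; only a sustained backlog exceeds the threshold.
avg_over_time(node_disk_io_now[10m]) > 100
```

In a Prometheus alerting rule, a `for:` duration on the rule achieves a similar effect with the original expression.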
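This is where the Blackbox exporter comes in: it probes endpoints over protocols such as HTTP, HTTPS, DNS, TCP, and ICMP and exposes the results as Prometheus metrics.
You can import dashboard 7587, or browse other dashboards in the Grafana dashboard library.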