for my Udacity SRE nanodegree project - setup a monitoring stack

Mbaoma/monitoring-stack

monitoring-stack

This repo walks through installing Grafana, Prometheus, the Prometheus Blackbox exporter, and node exporter via Helm charts to monitor a Kubernetes cluster.

Install Helm charts

Charts used: the Grafana Helm chart and the Prometheus Helm chart. Feel free to customize the values file, then run the setup script:

./setup.sh

Metrics to track

Cluster Specific (Host Metrics)

Panel Title: CPU Usage (%)

Monitor for high CPU usage. node_cpu_seconds_total is a counter of the seconds each CPU has spent in each mode, so use the rate() function to calculate how fast it grows over a specified time interval.

PromQL Query

rate(node_cpu_seconds_total{cpu="1", namespace="staging"}[5m])

This query returns the per-second rate at which CPU 1 accumulates time in each mode, averaged over the past 5 minutes, for series in the staging namespace. To set up an alert for high CPU usage, compare the usage rate against a threshold, such as 80% usage (0.8 in the query).
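Because node_cpu_seconds_total has one series per CPU and per mode, a single usage figure is often derived from the idle mode. The two expressions below are a minimal sketch, assuming node exporter's standard `mode` and `instance` labels are present on your series (neither appears in the query above) and reusing the 80% threshold mentioned above:

```promql
# Fraction of CPU time not spent idle, averaged across all cores of each instance.
# Assumes the standard node exporter `mode` label; multiply by 100 for a percentage panel.
1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))

# Alert condition: more than 80% busy over the last 5 minutes.
1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) > 0.8
```

Add the `namespace="staging"` matcher back inside the selector if your series carry that label.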

Panel Title: Memory Used

Trigger an alert when available memory falls below a critical threshold. Low available memory is a sign of high memory usage. Left unaddressed, it can lead to memory exhaustion and swapping, which degrade system performance.

PromQL Query

node_memory_MemAvailable_bytes

Alert: Trigger an alert when available memory falls below a percentage of total memory. If available memory is less than 10% of total memory, most of the memory is being used by applications and processes.

node_memory_MemAvailable_bytes{instance=~"your-instance-regex"} / node_memory_MemTotal_bytes{instance=~"your-instance-regex"} < 0.10
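The panel is titled Memory Used, while the panel query plots available memory. One way to plot usage directly is sketched here, assuming both node exporter metrics carry the same label set so the division matches series one-to-one:

```promql
# Percentage of memory in use per instance.
100 * (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)
```

In this form, a value above 90 corresponds to the same condition as the 10% available-memory alert.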

Panel Title: Network

A good approach is to alert on either a sudden drop in network traffic (which might indicate a network issue or a service outage) or sustained high network traffic (which might indicate network saturation or a DDoS attack).

PromQL Query

To detect a network receive rate that is too low, use the rate() function with a lower bound:

rate(node_network_receive_bytes_total{node=~"ip-172-31-95-55\\.ec2\\.internal|ip-172-31-36-214\\.ec2\\.internal"}[5m]) < 1000

To detect a network receive rate that is too high, use the rate() function with an upper bound:

rate(node_network_receive_bytes_total{node=~"ip-172-31-95-55\\.ec2\\.internal|ip-172-31-36-214\\.ec2\\.internal"}[5m]) > 10000
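A fixed threshold catches traffic that is low or high in absolute terms, but not a sudden change. The following is a sketch of a relative check that compares the current receive rate with the rate one hour earlier. The one-hour offset and the 0.5 factor are illustrative assumptions, not values from this repo:

```promql
# Fires when the current receive rate is below half of what it was an hour ago.
# `offset 1h` and the 0.5 factor are placeholder choices; tune them to your traffic pattern.
rate(node_network_receive_bytes_total[5m])
  < 0.5 * rate(node_network_receive_bytes_total[5m] offset 1h)
```

The same `node=~"..."` matcher used in the queries above can be added to both selectors to limit the check to specific nodes.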

Panel Title: Disk Usage

node_disk_io_now indicates the number of I/O operations currently in progress. Set an alert if the number of I/O operations exceeds a certain threshold for a sustained period.

PromQL Query

node_disk_io_now > 100
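To express "for a sustained period" in the query itself, one option is to require that the gauge stayed above the threshold across a whole window. A sketch, assuming a 10-minute window (an illustrative value):

```promql
# True only if every sample of node_disk_io_now in the last 10 minutes was above 100.
min_over_time(node_disk_io_now[10m]) > 100
```

An alternative is to keep the plain comparison and set a `for` duration on the Prometheus alerting rule, so the alert fires only after the condition has held that long.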

Application Specific (Synthetic monitoring)

This is where the Blackbox exporter comes in. You can import dashboard 7587 or browse other dashboards in the Grafana dashboard library.
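The Blackbox exporter publishes probe results as metrics, which can be queried like the host metrics above. The two expressions below are a minimal sketch, assuming probes are already configured and scraped by Prometheus. The 2-second latency threshold is an illustrative assumption:

```promql
# Targets whose most recent probe failed.
probe_success == 0

# Targets whose probe took longer than 2 seconds end to end.
probe_duration_seconds > 2
```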

Alerting (send alerts to a Slack channel)

Add a webhook.
