Grafana and Prometheus Monitoring Setup

Overview

This documentation provides a comprehensive guide on setting up Grafana and Prometheus for monitoring. It covers installation steps, configuration details, instructions for accessing and managing dashboards, and troubleshooting tips for maintaining the monitoring setup.

Objectives

Install and Configure Prometheus, Grafana, cAdvisor and Node Exporter:
- Set up Prometheus for metric collection and Grafana for data visualization.
Create and Configure Grafana Dashboards:
- Develop dashboards to visualize metrics.
Set Up Alerts Based on Collected Metrics:
- Configure alerting to notify the team of potential issues.
Ensure Proper Data Retention and Access Control:
- Manage data storage and user access.
Provide Troubleshooting Tips:
- Offer guidance for resolving common issues in the monitoring setup.

Installing Prometheus, Grafana, Cadvisor and Node Exporter

The script below installs all four components on a Debian-based host and registers them as systemd services.

#!/bin/bash

# Exit immediately if a command exits with a non-zero status
set -e

# Update and install dependencies (sudo is used here as in the rest of the script)
sudo apt-get update
sudo apt-get install -y wget curl tar adduser libfontconfig1

# Create users
sudo adduser --system --group --no-create-home prometheus
sudo adduser --system --group --no-create-home node_exporter
sudo adduser --system --group --home /var/lib/grafana grafana

# Install Prometheus
PROMETHEUS_VERSION="2.37.0"
wget https://github.com/prometheus/prometheus/releases/download/v${PROMETHEUS_VERSION}/prometheus-${PROMETHEUS_VERSION}.linux-amd64.tar.gz
tar xvf prometheus-${PROMETHEUS_VERSION}.linux-amd64.tar.gz
cd prometheus-${PROMETHEUS_VERSION}.linux-amd64/
sudo mv prometheus promtool /usr/local/bin/
sudo mkdir -p /etc/prometheus /var/lib/prometheus
sudo mv consoles/ console_libraries/ prometheus.yml /etc/prometheus/
cd ..
rm -rf prometheus-${PROMETHEUS_VERSION}.linux-amd64*

# Set Prometheus ownership
sudo chown -R prometheus:prometheus /etc/prometheus /var/lib/prometheus

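# Install Grafana
# Note: apt-key is deprecated on recent Debian/Ubuntu releases; a signed-by keyring is the preferred replacement there.
wget -q -O - https://packages.grafana.com/gpg.key | sudo apt-key add -
# Overwrite (rather than append to) the list file so re-running the script does not duplicate the entry
echo "deb https://packages.grafana.com/oss/deb stable main" | sudo tee /etc/apt/sources.list.d/grafana.list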
sudo apt-get update
sudo apt-get install -y grafana

# Install Node Exporter
NODE_EXPORTER_VERSION="1.3.1"
wget https://github.com/prometheus/node_exporter/releases/download/v${NODE_EXPORTER_VERSION}/node_exporter-${NODE_EXPORTER_VERSION}.linux-amd64.tar.gz
tar xvf node_exporter-${NODE_EXPORTER_VERSION}.linux-amd64.tar.gz
sudo mv node_exporter-${NODE_EXPORTER_VERSION}.linux-amd64/node_exporter /usr/local/bin/
rm -rf node_exporter-${NODE_EXPORTER_VERSION}.linux-amd64*

# Install cAdvisor (latest version as of the last update)
CADVISOR_VERSION="v0.47.0"  # Update this to the latest version if needed
sudo apt-get install -y libseccomp2
wget https://github.com/google/cadvisor/releases/download/${CADVISOR_VERSION}/cadvisor-${CADVISOR_VERSION}-linux-amd64
sudo mv cadvisor-${CADVISOR_VERSION}-linux-amd64 /usr/local/bin/cadvisor
sudo chmod +x /usr/local/bin/cadvisor

# Create systemd service files

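# Prometheus
# The quoted delimiter makes the shell write the unit file verbatim, backslash continuations included
cat << 'EOF' | sudo tee /etc/systemd/system/prometheus.service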
[Unit]
Description=Prometheus
Wants=network-online.target
After=network-online.target

[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/local/bin/prometheus \
    --config.file /etc/prometheus/prometheus.yml \
    --storage.tsdb.path /var/lib/prometheus/ \
    --web.console.templates=/etc/prometheus/consoles \
    --web.console.libraries=/etc/prometheus/console_libraries

[Install]
WantedBy=multi-user.target
EOF

# Node Exporter
cat << EOF | sudo tee /etc/systemd/system/node_exporter.service
[Unit]
Description=Node Exporter
Wants=network-online.target
After=network-online.target

[Service]
User=node_exporter
Group=node_exporter
Type=simple
ExecStart=/usr/local/bin/node_exporter

[Install]
WantedBy=multi-user.target
EOF

# cAdvisor
cat << EOF | sudo tee /etc/systemd/system/cadvisor.service
[Unit]
Description=cAdvisor
Wants=network-online.target
After=network-online.target

[Service]
User=root
Group=root
Type=simple
ExecStart=/usr/local/bin/cadvisor

[Install]
WantedBy=multi-user.target
EOF

# Reload systemd and start services
sudo systemctl daemon-reload
sudo systemctl enable prometheus node_exporter cadvisor grafana-server
sudo systemctl start prometheus node_exporter cadvisor grafana-server

echo "Installation complete. Please check the status of the services to ensure they are running correctly."
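
The prometheus.yml shipped in the release tarball only scrapes Prometheus itself, so Node Exporter and cAdvisor still need to be added as targets. The function below is a minimal sketch of one way to do that. It assumes the upstream default ports (9090 for Prometheus, 9100 for Node Exporter, 8080 for cAdvisor); the job names and the 15s scrape interval are illustrative choices, not values taken from our setup.

```shell
#!/bin/bash
# Sketch: write a Prometheus config that scrapes all three local endpoints.
# Ports are the upstream defaults; job names and interval are illustrative.
write_scrape_config() {
    local target_file="$1"
    cat > "$target_file" << 'EOF'
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: prometheus
    static_configs:
      - targets: ['localhost:9090']

  - job_name: node_exporter
    static_configs:
      - targets: ['localhost:9100']

  - job_name: cadvisor
    static_configs:
      - targets: ['localhost:8080']
EOF
}

# Example (run manually):
#   write_scrape_config /tmp/prometheus.yml
#   promtool check config /tmp/prometheus.yml
#   sudo install -o prometheus -g prometheus -m 644 /tmp/prometheus.yml /etc/prometheus/prometheus.yml
#   sudo systemctl restart prometheus
```

Validating the file with promtool before copying it into place avoids restarting Prometheus with a config it cannot parse.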

Configure Grafana to Use Prometheus as a Data Source:

Access Grafana via your web browser. Grafana listens on port 3000 by default (http://localhost:3000/), but because we already had an application running on that port, we configured Grafana to use port 3050 instead (http://localhost:3050/).
Navigate to Configuration > Data Sources > Add data source.
Select Prometheus and set the URL to http://localhost:9090. Click Save & Test.
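
Both steps above can also be done from the command line. The sketch below shows one possible approach: the first function rewrites the http_port setting in grafana.ini, and the second writes a data source provisioning file that Grafana reads at startup. The function names are hypothetical, and the paths in the example comments assume a package install of Grafana.

```shell
#!/bin/bash
# Sketch: set Grafana's HTTP port and provision Prometheus as a data source.

# Rewrite the http_port setting in a grafana.ini file.
# The stock file ships the setting commented out as ";http_port = 3000".
set_grafana_port() {
    local ini_file="$1" port="$2"
    sed -i -E "s/^[;#]?[[:space:]]*http_port[[:space:]]*=.*/http_port = ${port}/" "$ini_file"
}

# Write a data source provisioning file into the given directory.
write_prometheus_datasource() {
    local provisioning_dir="$1"
    cat > "${provisioning_dir}/prometheus.yml" << 'EOF'
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://localhost:9090
    isDefault: true
EOF
}

# Example (run as root, then restart Grafana to apply both changes):
#   set_grafana_port /etc/grafana/grafana.ini 3050
#   write_prometheus_datasource /etc/grafana/provisioning/datasources
#   systemctl restart grafana-server
```

Provisioning keeps the data source definition in a file, which makes the setup repeatable on a rebuilt host; the UI steps above remain the simpler option for a one-off install.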

Import Node Exporter Dashboard:

In Grafana, go to the Dashboards page.
Click the “+” icon and select “Import”.
Enter Dashboard ID 1860 and click “Load”.
Choose the Prometheus data source and click “Import”.

Setting up an Application Metrics Dashboard

Using cAdvisor:

In Grafana, click the “+” icon and select “Dashboard”.
Click “Import” and then add the dashboard JSON or ID.

Configuring Dynamic Variables:

Navigate to the desired dashboard, click on the settings icon, select the variables tab, and create a new dynamic variable.
In this case, we are creating a dynamic variable to reflect the environments of our containerized applications, such as dev, staging, and prod.

Testing the Dynamic Variable:

After saving, select the variable from the drop-down at the top of the dashboard to confirm that it lists the containers for each environment.
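
If the drop-down comes back empty, it can help to ask Prometheus directly which values exist for the label behind the variable. The sketch below uses the Prometheus label values endpoint. The label name env is a hypothetical placeholder, since this page does not record which label carries the environment; in Grafana the matching variable query would be label_values(env) under the same assumption.

```shell
#!/bin/bash
# Sketch: list the values Prometheus has recorded for one label.
# "env" in the example is a hypothetical label name; use whatever label
# your containers actually carry.

# Build the label values endpoint URL for a Prometheus base URL and label.
label_values_url() {
    local prometheus_url="$1" label="$2"
    printf '%s/api/v1/label/%s/values\n' "${prometheus_url%/}" "$label"
}

# Fetch the values as JSON.
list_label_values() {
    curl -s "$(label_values_url "$1" "$2")"
}

# Example:
#   list_label_values http://localhost:9090 env
```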

Configuring Alerting

Configuring Grafana Alerts

Create and Configure Alerts in Grafana:

Open the home menu bar and navigate to the Alert tab.
Click "Create Alert."
Define conditions based on Prometheus queries (e.g., rate(myapp_request_count[1m]) > 100).
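
Before saving a rule, it is worth checking that the expression returns data. A minimal sketch, assuming Prometheus is reachable at the URL you pass in, is to evaluate the expression through the Prometheus HTTP query API; the function name prom_query is hypothetical.

```shell
#!/bin/bash
# Sketch: evaluate a PromQL expression through Prometheus's HTTP API.
prom_query() {
    local prometheus_url="$1" expression="$2"
    # -G sends the URL-encoded data as a GET query string
    curl -s -G "${prometheus_url%/}/api/v1/query" --data-urlencode "query=${expression}"
}

# Example:
#   prom_query http://localhost:9090 'rate(myapp_request_count[1m]) > 100'
```

An empty result list means the condition is not currently met (or the metric does not exist), which is useful to know before wiring the rule to notifications.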

Configure Alert Evaluation and Frequency:

Set the evaluation interval (e.g., every minute) and the duration for which the alert condition must be met (e.g., 5 minutes).

Set Up Notification Channels:

Navigate to Alerting > Contact Points and click the Add contact point button.
Configure the notification channel (e.g., Slack) with your webhook URL and other details.
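
A quick way to confirm the webhook works before relying on Grafana to use it is to post a test message by hand. The sketch below assumes a standard Slack incoming webhook, which accepts a JSON body with a text field. The function names and the message text are illustrative, and the webhook URL is whatever Slack issued for your channel.

```shell
#!/bin/bash
# Sketch: post a test message to a Slack incoming webhook.

# Build the JSON body for a single-line message.
slack_payload() {
    local text
    # Escape backslashes and double quotes so the text is valid inside a JSON string
    text=$(printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g')
    printf '{"text": "%s"}\n' "$text"
}

# Send the message to the webhook URL supplied by the caller.
send_slack_test() {
    local webhook_url="$1" text="$2"
    curl -s -X POST -H 'Content-type: application/json' \
        --data "$(slack_payload "$text")" "$webhook_url"
}

# Example:
#   send_slack_test "$SLACK_WEBHOOK_URL" 'Test alert from the monitoring host'
```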

Link Alerts to Notification Channels:

In the alert configuration, associate the alert with the notification channel created.

Managed Alerts and Notifications:

We configured Grafana managed alerts to notify the team via our Slack configuration webhook URL, which sends the alerts to the #devops-alerts channel.
Alerts were created for Low Disk, High CPU, Container Down, and High Memory.

Set Up Data Retention Policy

Set Data Retention in Prometheus:

By default, Prometheus retains data for 15 days. We extended the retention period to 30 days and capped storage at 5GB by adding the following flags: --storage.tsdb.retention.time=30d --storage.tsdb.retention.size=5GB. When both flags are set, Prometheus removes old data as soon as either limit is reached. The updated service file looks like this:

[Unit]
Description=Prometheus
Wants=network-online.target
After=network-online.target

[Service]
User=prometheus
ExecStart=/usr/local/bin/prometheus \
  --config.file /etc/prometheus/prometheus.yml \
  --storage.tsdb.path /var/lib/prometheus/ \
  --storage.tsdb.retention.time=30d \
  --storage.tsdb.retention.size=5GB
Restart=always

[Install]
WantedBy=multi-user.target
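
After editing the unit file, systemd has to reload it and Prometheus has to be restarted before the new retention settings apply. The helper below is a small sketch (the function name is hypothetical) that prints the retention flags found in a unit file, so you can confirm the edit landed before restarting.

```shell
#!/bin/bash
# Sketch: print the retention flags found in a systemd unit file.
retention_flags() {
    local unit_file="$1"
    # "--" stops grep from treating the pattern as an option
    grep -o -- '--storage\.tsdb\.retention\.[a-z]*=[^ \\]*' "$unit_file"
}

# Example:
#   retention_flags /etc/systemd/system/prometheus.service
#   sudo systemctl daemon-reload
#   sudo systemctl restart prometheus
#   # Confirm what the running process actually received:
#   curl -s http://localhost:9090/api/v1/status/flags
```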

Manage User Roles and Permissions in Grafana

User Roles and Permissions in Grafana:

Log in to Grafana and navigate to the home menu bar. Expand the Administration section to create a team or add users.
We created a team called FE DevOps and added our team members to the team.
Team leads were given admin access, while other members were assigned viewer access.

Troubleshooting Tips 🔍

Low Disk Troubleshooting Tips

When dealing with low disk space in Docker containers, consider the following steps:

Clean Up Unused Resources:
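- Use docker system prune -a to remove stopped containers, unused networks, unused images, and the build cache. Add the --volumes flag to also prune unused volumes, which are left in place by default.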
- Schedule this as a regular task to keep your system clean.
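
One way to schedule the cleanup without pruning more often than necessary is to run it only when disk usage crosses a threshold. The sketch below assumes Docker's default data directory (/var/lib/docker); the 80 percent threshold, the function names, and the script path in the cron comment are all illustrative.

```shell
#!/bin/bash
# Sketch: prune Docker resources only when a filesystem crosses a usage threshold.

# Print the used-space percentage (digits only) of the filesystem holding a path.
disk_usage_percent() {
    df -P "$1" | awk 'NR == 2 { gsub(/%/, "", $5); print $5 }'
}

# Prune when usage is at or above the threshold (percent).
prune_if_full() {
    local path="$1" threshold="$2" usage
    usage=$(disk_usage_percent "$path")
    if [ "$usage" -ge "$threshold" ]; then
        # -f skips the confirmation prompt so the command can run unattended
        docker system prune -a -f
    fi
}

# Example:
#   prune_if_full /var/lib/docker 80
# Example cron entry (hypothetical script path), daily at 03:00:
#   0 3 * * * /usr/local/bin/docker-prune.sh
```

Because prune -a removes every image not used by a container, images for stopped services will be pulled again the next time they start.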

High CPU Troubleshooting Tips

To troubleshoot high CPU usage, follow these steps:

Limit CPU Usage:
- You can limit the CPU usage of a Docker container using the --cpus flag. For example:
```
docker run --cpus="1.0" --name my_container <docker_image_name>
```
  Adjust the value (e.g., 1.0 for 100% of a single core) as needed.
Monitor with docker stats:
- Use docker stats to check CPU usage. The CPU % column shows the percentage of the host's CPU that the container is using.
- If you notice high usage, investigate further.
Troubleshoot Inside the Container:
- Open a shell inside the container using docker exec -it YOUR-CONTAINER-ID /bin/bash (use /bin/sh if the image does not include bash).
- Run top to identify processes consuming CPU resources. This helps pinpoint the issue.
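
When many containers are running, the interactive docker stats view can be hard to scan. The sketch below takes a single sample and prints the busiest containers first. The function name is hypothetical, and the default of five rows is an arbitrary choice.

```shell
#!/bin/bash
# Sketch: list the containers using the most CPU, highest first.
top_cpu_containers() {
    local count="${1:-5}"
    # --no-stream takes a single sample instead of refreshing continuously
    docker stats --no-stream --format '{{.CPUPerc}} {{.Name}}' \
        | LC_ALL=C sort -rn \
        | head -n "$count"
}

# Example:
#   top_cpu_containers 3
```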

Container Down Troubleshooting Tips

Override Entrypoint with a Shell:
- Create and start a container from the same failing image.
- Override the entrypoint with a shell (e.g., sh or bash):
```
docker run -it --entrypoint sh <image_name>
```
- This drops you into a shell session within the container, allowing you to run your script and investigate why it's exiting unexpectedly.
Check Container Status and Logs:
- Use docker ps -a to find the most recent stopped container.
- Check its exit code, shown in the STATUS column (for example, Exited (137)), which often indicates why the container stopped. Then review the container's output with docker logs YOUR-CONTAINER-ID.
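
The exit code check can be scripted. The sketch below maps the exit codes with a conventional meaning (the ones Docker reserves, plus 128 plus the signal number for signal deaths) to a likely cause. The function names are hypothetical, and the wording of each explanation is ours rather than Docker's.

```shell
#!/bin/bash
# Sketch: translate common container exit codes into a likely cause.
explain_exit_code() {
    case "$1" in
        0)   echo "exited normally" ;;
        125) echo "docker run itself failed" ;;
        126) echo "container command could not be invoked" ;;
        127) echo "container command not found" ;;
        137) echo "killed by SIGKILL (often the OOM killer or docker kill)" ;;
        139) echo "crashed with SIGSEGV (segmentation fault)" ;;
        143) echo "terminated by SIGTERM (for example docker stop)" ;;
        *)   echo "no conventional meaning; check the logs" ;;
    esac
}

# Read the exit code of a stopped container and explain it.
container_exit_reason() {
    local container="$1" code
    code=$(docker inspect --format '{{.State.ExitCode}}' "$container")
    echo "exit code ${code}: $(explain_exit_code "$code")"
}

# Example:
#   container_exit_reason my_container
#   docker logs --tail 50 my_container
```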

High Memory Troubleshooting Tips

To troubleshoot high memory usage, follow these steps:

Identify Memory-Hungry Containers:
- Use docker stats to find memory-intensive containers.
Consider Reducing the Number of Containers:
- Reduce the number of containers running on the same host to alleviate memory pressure.
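
If a container keeps disappearing under memory pressure, it helps to confirm whether the kernel's out-of-memory killer stopped it. The sketch below reads the OOMKilled field from docker inspect. The function name is hypothetical, and the 512m limit in the example comment is an illustrative value, not a recommendation.

```shell
#!/bin/bash
# Sketch: report whether the kernel OOM killer stopped a container.
# Returns success (0) when Docker recorded an OOM kill for the container.
was_oom_killed() {
    [ "$(docker inspect --format '{{.State.OOMKilled}}' "$1")" = "true" ]
}

# Example:
#   if was_oom_killed my_container; then
#       echo "my_container ran out of memory"
#   fi
#   # Cap memory on the next run (512m is an illustrative value):
#   docker run --memory="512m" --name my_container <docker_image_name>
```

Setting a per-container limit means a single leaking container is stopped on its own instead of starving its neighbors on the host.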

Troubleshooting Tips for High CPU Usage in Dynamic Environments

To troubleshoot high CPU usage in dynamic environments, follow these steps:

Limit CPU Usage:
- Limit the CPU usage of a Docker container using the --cpus flag. For example:
```
docker run --cpus="1.0" --name my_container <docker_image_name>
```
  Adjust the value as needed.
Monitor with docker stats:
- Use docker stats to check CPU usage. Investigate further if you notice high usage.
Troubleshoot Inside the Container:
- Open a shell inside the container using docker exec -it YOUR-CONTAINER-ID /bin/bash (use /bin/sh if the image does not include bash).
- Run top to identify processes consuming CPU resources.

Troubleshooting Tips for High Memory Usage in Dynamic Environments

To troubleshoot high memory usage, follow these steps:

Identify Memory-Hungry Containers:
- Use docker stats to find memory-intensive containers.
Consider Reducing the Number of Containers:
- Reduce the number of containers running on the same host to alleviate memory pressure.
