Comprehensive Monitoring Setup with Grafana and Prometheus

Overview

This document is a step-by-step guide to setting up a monitoring solution with Prometheus and Grafana. Monitoring and alerting are a core part of DevOps operations, and having both in your project gives you evidence, rather than an assumption, that your services are working properly.

Objectives

Install and configure Prometheus and Grafana.
Create and configure Grafana dashboards.
Set up alerts based on collected metrics.
Ensure proper data retention and access control.

Installing Prometheus and Grafana
- Prometheus Installation
- Grafana Installation
Configuring the Monitoring Dashboards
- Setting up a Server Metrics dashboard
- Setting up an Application Metrics Dashboard
Configuring Alerting
- Configuring Alert Manager
- Setting up Alerting on Grafana Dashboard
Data Retention and Access Control
- Configuring Data Retention
- Configuring Access Control

Installation of Prometheus and Grafana

Prometheus Installation

You can install Prometheus manually, running each command step by step, or with a script that runs all the steps at once. The script below installs Prometheus and sets it up as a systemd service.

#!/bin/bash

PROMETHEUS_VERSION="2.45.6"
PROMETHEUS_USER="prometheus"
RETENTION_PERIOD="15d"

# Install the tools needed to download and unpack the release
sudo apt-get update
sudo apt-get install -y wget tar

# Create a dedicated system user with no home directory and no login shell
sudo useradd --no-create-home --shell /bin/false $PROMETHEUS_USER

# Download and unpack the release, then link the binaries onto the PATH
wget https://github.com/prometheus/prometheus/releases/download/v$PROMETHEUS_VERSION/prometheus-$PROMETHEUS_VERSION.linux-amd64.tar.gz
tar xvf prometheus-$PROMETHEUS_VERSION.linux-amd64.tar.gz
sudo mv prometheus-$PROMETHEUS_VERSION.linux-amd64 /usr/local/prometheus
sudo ln -s /usr/local/prometheus/prometheus /usr/local/bin/
sudo ln -s /usr/local/prometheus/promtool /usr/local/bin/

# Create the config and data directories and hand them to the Prometheus user
sudo mkdir -p /etc/prometheus /var/lib/prometheus
sudo cp /usr/local/prometheus/prometheus.yml /etc/prometheus/
sudo chown -R $PROMETHEUS_USER:$PROMETHEUS_USER /etc/prometheus /var/lib/prometheus

sudo tee /etc/systemd/system/prometheus.service > /dev/null <<EOF
[Unit]
Description=Prometheus
Wants=network-online.target
After=network-online.target

[Service]
User=$PROMETHEUS_USER
ExecStart=/usr/local/bin/prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.path=/var/lib/prometheus/ \
  --storage.tsdb.retention.time=$RETENTION_PERIOD

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable prometheus
sudo systemctl start prometheus
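
Once the service has started, it is worth confirming that Prometheus is actually serving requests before moving on. The sketch below polls the /-/ready management endpoint. The function name wait_for_ready, the attempt count, and the delay are illustrative choices, and the URL in the usage note assumes Prometheus is listening on its default port 9090 on the same host.

```shell
# Poll the Prometheus readiness endpoint until it answers or attempts run out.
# Arguments: base URL, number of attempts (default 10), delay in seconds (default 2).
wait_for_ready() {
  local url="$1" attempts="${2:-10}" delay="${3:-2}"
  local i=1
  while [ "$i" -le "$attempts" ]; do
    # -f makes curl return non-zero on HTTP errors such as 503 (not ready yet)
    if curl -fsS "${url}/-/ready" > /dev/null 2>&1; then
      echo "ready after ${i} attempt(s)"
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "not ready after ${attempts} attempt(s)" >&2
  return 1
}
```

A typical call would be wait_for_ready http://localhost:9090. If it fails, sudo systemctl status prometheus is a reasonable next place to look.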

Grafana Installation

Run the script below to install Grafana and enable it to start on boot via systemd.

#!/bin/bash

sudo apt-get update
sudo apt-get install -y apt-transport-https software-properties-common wget

sudo wget -q -O /usr/share/keyrings/grafana.key https://apt.grafana.com/gpg.key
echo "deb [signed-by=/usr/share/keyrings/grafana.key] https://apt.grafana.com stable main" | sudo tee -a /etc/apt/sources.list.d/grafana.list
sudo apt-get update
sudo apt-get install -y grafana

sudo systemctl start grafana-server
sudo systemctl enable grafana-server

Setting up Node Exporter

Node Exporter is an essential component if you want to monitor server metrics. It exposes host-level information such as CPU usage, memory usage, disk I/O, and network traffic, which Prometheus then scrapes. You can install it on the server by running the script below:

NODE_EXPORTER_USER="node_exporter"
NODE_EXPORTER_VERSION="1.7.0"

sudo useradd --no-create-home --shell /bin/false $NODE_EXPORTER_USER

# Download and install Node Exporter
wget https://github.com/prometheus/node_exporter/releases/download/v$NODE_EXPORTER_VERSION/node_exporter-$NODE_EXPORTER_VERSION.linux-amd64.tar.gz
tar xvf node_exporter-$NODE_EXPORTER_VERSION.linux-amd64.tar.gz
sudo mv node_exporter-$NODE_EXPORTER_VERSION.linux-amd64/node_exporter /usr/local/bin/
sudo chown $NODE_EXPORTER_USER:$NODE_EXPORTER_USER /usr/local/bin/node_exporter

# Node Exporter systemd service
sudo tee /etc/systemd/system/node_exporter.service > /dev/null <<EOF
[Unit]
Description=Node Exporter
Wants=network-online.target
After=network-online.target
[Service]
User=$NODE_EXPORTER_USER
ExecStart=/usr/local/bin/node_exporter
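[Install]
WantedBy=default.target
EOF

# Register the new unit, enable it on boot, and start it
sudo systemctl daemon-reload
sudo systemctl enable node_exporter
sudo systemctl start node_exporter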
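
Prometheus only collects these metrics once Node Exporter is listed as a scrape target. The sketch below prints a scrape job for it. The helper name node_exporter_scrape_job is hypothetical, the target assumes Node Exporter's default port 9100 on the same host, and the job name is chosen to match the job="node_exporter" label used in the dashboard queries later in this guide.

```shell
# Print a scrape job for Node Exporter, indented to sit under scrape_configs.
# Argument: host:port of the exporter (assumed default: localhost:9100).
node_exporter_scrape_job() {
  local target="${1:-localhost:9100}"
  cat <<EOF
  - job_name: 'node_exporter'
    static_configs:
      - targets: ['${target}']
EOF
}
```

One way to use it is node_exporter_scrape_job | sudo tee -a /etc/prometheus/prometheus.yml, followed by a Prometheus restart. Appending only works when scrape_configs is the last section of the file, as it is in the default prometheus.yml; otherwise paste the job under scrape_configs by hand.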

Configuring the Monitoring Dashboards

A server monitoring pipeline is incomplete without a dashboard that visualizes the metrics and system conditions. Grafana lets you build dashboards from the data that Prometheus scrapes, and the Grafana community publishes ready-made dashboards that are free to use. You simply import an available dashboard and then adjust it to your needs.

This project uses the Node Exporter Full dashboard available via dashboard ID 1860. This dashboard was designed for full monitoring of Linux servers using data scraped by Node Exporter.

Step-by-step process on importing the Node Exporter Dashboard

Navigate to the Dashboards page on the Grafana web portal and click the New button at the right side of the screen. This opens a dropdown; click Import to proceed.

Click on import

Now on the import dashboard page, enter 1860 in the input field that has the placeholder "Grafana.com dashboard URL or ID". Complete the process by clicking the "Load" button.

importing a dashboard

The dashboard takes a few seconds to load. When it is fully loaded, it brings up a dashboard showing the most important system metrics.

Node Exporter Full Dashboard

Configuring Visualizations

Add Panels to Dashboards

Configure Grafana to Use Prometheus as a Data Source

Add Prometheus Data Source:
- Open Grafana in your browser (http://<server-address>:3000/).
- Log in and navigate to Configuration > Data Sources > Add data source.
- Select Prometheus and set the URL to http://localhost:9090. Click Save & Test.

Open or Create a Dashboard

Log in to Grafana.
Click on the + icon on the left sidebar and select Dashboard.
Click Add new panel or open an existing dashboard to add a new panel.

Configure Visualizations

Query Configuration:
- In the new panel, go to the Query section.
- Select Prometheus as the data source.
- Enter the appropriate Prometheus query to fetch the metrics you want to visualize. For example: rate(node_cpu_seconds_total{job="node_exporter", mode="idle"}[5m])
- Use the query editor to refine and test your query.
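
A query can also be checked outside Grafana by sending it to the Prometheus HTTP API, which helps separate query mistakes from data source problems. This is a minimal sketch: the helper name promql_query is hypothetical, and the default URL assumes Prometheus on localhost:9090 as configured above.

```shell
# Run an instant PromQL query against the Prometheus HTTP API and print the JSON reply.
# Arguments: the PromQL expression, then an optional Prometheus base URL.
promql_query() {
  local expr="$1" base_url="${2:-http://localhost:9090}"
  # -G sends the URL-encoded data as a query string on a GET request
  curl -fsS -G --data-urlencode "query=${expr}" "${base_url}/api/v1/query"
}
```

For example: promql_query 'rate(node_cpu_seconds_total{job="node_exporter", mode="idle"}[5m])'. A successful reply with an empty result list usually means the label matchers select no series, for instance because the job label in the query does not match the job name in prometheus.yml.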

Set Up Variables for Dynamic Dashboard Updates

Create Variables:
- Click on the Dashboard settings (gear icon) and navigate to Variables.
- Click Add variable.
- Define the variable settings, such as the name, type, and data source query. For example, to create a variable for project names: label_values(node_exporter_build_info, project)
Use Variables in Panels:
- In the panel query, replace fixed values with the variable. For example: rate(node_cpu_seconds_total{job="node_exporter", project="$project", mode="idle"}[5m])
- This makes the panels dynamically update based on the selected variable value.

Save Dashboards and Ensure Accessibility

Save the Dashboard:
- Click the Save dashboard (disk icon) at the top.
- Provide a name and description for the dashboard and click Save.
Ensure Accessibility:
- Share the dashboard with relevant team members.
- Set appropriate permissions for viewing or editing the dashboard.
- Go to Dashboard settings > Permissions to configure access control.

Alerting Configuration

Setting Up Alerts

Create and Configure Alerts in Grafana

Create an Alert:
- Open the panel where you want to add an alert.
- Click on the Edit button (the pencil icon).
- Go to the Alert tab.

Define Alert Conditions:

Click on Create Alert.
Define the alert conditions based on your Prometheus queries. For example:
- To alert on high request rate, you could use rate(myapp_request_count[1m]) > 100.
- To alert on high request duration, you could use histogram_quantile(0.95, sum(rate(myapp_request_duration_seconds_bucket[5m])) by (le)) > 1.

Set Alert Evaluation and Frequency:

Configure how often the alert rule should be evaluated (e.g., every 1 minute).
Set the alert condition to trigger when it has been met for a specified duration (e.g., 5 minutes).

Configure Alert Notifications:

Click on Add Notification Channel.
Select an existing notification channel or create a new one by going to Alerting > Notification channels > New Channel.
Configure the notification channel (Slack) with the necessary details.
Save the notification channel.

Save the Alert:

Save the alert configuration and apply the changes to the panel.

Configure Notification Channels

Configuring an Incoming Webhook for Sending Messages

Incoming webhooks are a way to post messages from applications into Slack. Creating an incoming Webhook gives you a unique URL to which you send a JSON payload with the message text and some options.

Getting Started with Incoming Webhooks

The Slack API is used to configure incoming webhooks. The following steps set up an incoming webhook for posting messages:

Click on the Create your Slack app button.
Click on Create New App, then select From scratch to use the configuration UI to manually add basic info, scopes, settings and features to your app.
Give the app the name Kimiko-Telex-Team-Alerts and select HNG11 as the workspace.
Click on Incoming Webhooks, activate incoming webhooks by toggling the button, and click on Add New Webhook to Workspace.
Select the #devops-alerts channel you want alerts to be sent to and click on Allow.
The Webhook URL is now created. Proceed to the steps below:

Add Notification Channels:

Navigate to Alerting > Notification channels > New Channel.
Add the details for your notification channel. For example, to configure Slack:
- Name: Slack Notifications
- Type: Slack
- Webhook URL: https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXXXXXX (Replace with your slack webhook URL)
- Mention the channel: #alerts
- Save the notification channel.
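
If notifications do not arrive, it helps to test the webhook on its own, independent of Grafana. The sketch below posts a JSON payload with a text field to the webhook URL. The helper name send_slack_test and the default message are illustrative, and the text is inserted without JSON escaping, so keep it free of double quotes and backslashes.

```shell
# Post a plain-text test message to a Slack incoming webhook.
# Arguments: the webhook URL, then an optional message.
send_slack_test() {
  local webhook_url="$1" text="${2:-Test message from the monitoring setup}"
  # NOTE: the text is not JSON-escaped; this is only meant for simple test strings
  curl -fsS -X POST -H 'Content-type: application/json' \
    --data "{\"text\": \"${text}\"}" "$webhook_url"
}
```

Call it with your own webhook URL, for example send_slack_test "https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXXXXXX". The message should appear in the channel that was selected when the webhook was created.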

Link Notification Channels to Alerts:

In the alert configuration, link the alert to the notification channel you just created.

Alertmanager Configuration (an alternative to using Grafana for alerting)

Save the snippet below in a file, make it executable (chmod +x "file.sh") and run the file.

#!/bin/bash

ALERTMANAGER_VERSION="0.27.0"
ALERTMANAGER_USER="alertmanager"

sudo useradd --no-create-home --shell /bin/false $ALERTMANAGER_USER

wget https://github.com/prometheus/alertmanager/releases/download/v$ALERTMANAGER_VERSION/alertmanager-$ALERTMANAGER_VERSION.linux-amd64.tar.gz
tar xvf alertmanager-$ALERTMANAGER_VERSION.linux-amd64.tar.gz
sudo mv alertmanager-$ALERTMANAGER_VERSION.linux-amd64 /usr/local/alertmanager
sudo ln -s /usr/local/alertmanager/alertmanager /usr/local/bin/
sudo mkdir -p /etc/alertmanager /var/lib/alertmanager
sudo cp /usr/local/alertmanager/alertmanager.yml /etc/alertmanager/
sudo chown -R $ALERTMANAGER_USER:$ALERTMANAGER_USER /etc/alertmanager /var/lib/alertmanager

sudo tee /etc/alertmanager/alertmanager.yml > /dev/null <<EOF
global:
  resolve_timeout: 1m
  slack_api_url: 'https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXXXXXX'

route:
  receiver: 'Kimiko-Slack-Alert'

receivers:
- name: 'Kimiko-Slack-Alert'
  slack_configs:
  - channel: '#devops-alerts'
    send_resolved: true
EOF

sudo tee /etc/systemd/system/alertmanager.service > /dev/null <<EOF
[Unit]
Description=Alertmanager
Wants=network-online.target
After=network-online.target

[Service]
User=$ALERTMANAGER_USER
ExecStart=/usr/local/bin/alertmanager \
  --config.file=/etc/alertmanager/alertmanager.yml \
  --storage.path=/var/lib/alertmanager/

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable alertmanager
sudo systemctl start alertmanager

This script automates the installation and configuration of Alertmanager, a component of the Prometheus monitoring system that handles alerts sent by client applications such as the Prometheus server. It integrates with various notification services to send alerts to your team.

#!/bin/bash

ALERTMANAGER_VERSION="0.27.0"
ALERTMANAGER_USER="alertmanager"

Defines the version of Alertmanager to install and the system user to be created for running the Alertmanager service.

sudo useradd --no-create-home --shell /bin/false $ALERTMANAGER_USER

Creates a system user named alertmanager without a home directory and disables login shell for security.

wget https://github.com/prometheus/alertmanager/releases/download/v$ALERTMANAGER_VERSION/alertmanager-$ALERTMANAGER_VERSION.linux-amd64.tar.gz
tar xvf alertmanager-$ALERTMANAGER_VERSION.linux-amd64.tar.gz
sudo mv alertmanager-$ALERTMANAGER_VERSION.linux-amd64 /usr/local/alertmanager
sudo ln -s /usr/local/alertmanager/alertmanager /usr/local/bin/

Downloads the specified version of Alertmanager, extracts the files, moves them to /usr/local/alertmanager, and creates symbolic links to make the Alertmanager binary accessible system-wide.

sudo mkdir -p /etc/alertmanager /var/lib/alertmanager
sudo cp /usr/local/alertmanager/alertmanager.yml /etc/alertmanager/
sudo chown -R $ALERTMANAGER_USER:$ALERTMANAGER_USER /etc/alertmanager /var/lib/alertmanager

Creates configuration and data directories for Alertmanager, copies the default configuration file, and sets appropriate permissions to ensure the Alertmanager user owns these directories.

sudo tee /etc/alertmanager/alertmanager.yml > /dev/null <<EOF
global:
  resolve_timeout: 1m
  slack_api_url: 'https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXXXXXX'

route:
  receiver: 'Kimiko-Slack-Alert'

receivers:
- name: 'Kimiko-Slack-Alert'
  slack_configs:
  - channel: '#devops-alerts'
    send_resolved: true
EOF

Configures Alertmanager to use a Slack webhook for sending alerts. It sets the global configuration, routes alerts to the Kimiko-Slack-Alert receiver, and defines the Slack notification channel #devops-alerts.

sudo tee /etc/systemd/system/alertmanager.service > /dev/null <<EOF
[Unit]
Description=Alertmanager
Wants=network-online.target
After=network-online.target

[Service]
User=$ALERTMANAGER_USER
ExecStart=/usr/local/bin/alertmanager \
  --config.file=/etc/alertmanager/alertmanager.yml \
  --storage.path=/var/lib/alertmanager/

[Install]
WantedBy=multi-user.target
EOF

Creates a systemd service file for Alertmanager to manage it as a system service. This ensures Alertmanager starts on boot and can be controlled using systemctl.

sudo systemctl daemon-reload
sudo systemctl enable alertmanager
sudo systemctl start alertmanager

Reloads the systemd configuration to recognize the new service, enables Alertmanager to start on boot, and starts the Alertmanager service.
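
Alertmanager only routes alerts that it receives, so Prometheus still needs alerting rules and a pointer to Alertmanager. The sketch below writes an example rule file and prints the prometheus.yml sections that reference it. The helper names, the alert name HighRequestRate, the severity label, and the rule file path are all illustrative; the expression reuses the request-rate example from the Grafana alerting section, and localhost:9093 assumes Alertmanager's default port.

```shell
# Write an example alerting rule file to the given path.
# The quoted 'EOF' keeps {{ $labels.instance }} literal for Prometheus templating.
write_example_rules() {
  local rules_file="$1"
  cat > "$rules_file" <<'EOF'
groups:
  - name: example-app-alerts
    rules:
      - alert: HighRequestRate
        expr: rate(myapp_request_count[1m]) > 100
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High request rate on {{ $labels.instance }}"
EOF
}

# Print the prometheus.yml sections that load the rules and point at Alertmanager.
# Arguments: rule file path, then optional Alertmanager host:port.
print_alerting_fragment() {
  local rules_file="$1" alertmanager="${2:-localhost:9093}"
  cat <<EOF
rule_files:
  - "${rules_file}"

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['${alertmanager}']
EOF
}
```

For example, write_example_rules /tmp/example_rules.yml followed by promtool check rules /tmp/example_rules.yml validates the rule syntax. The output of print_alerting_fragment should be merged into prometheus.yml by hand rather than appended, because the default file already contains alerting and rule_files keys.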

Testing Alerts

Test and Verify Alerts

Test alert rules to ensure that notifications are sent correctly when thresholds are breached.
Verify that alerts are actionable and provide relevant information for troubleshooting.
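
One way to test the notification path end to end is to push a synthetic alert straight into Alertmanager and watch for it in Slack. This sketch assumes Alertmanager is reachable on localhost:9093 and uses its v2 HTTP API; the helper names, the alert name, and the label values are made up for the test.

```shell
# Print a JSON array containing one synthetic alert.
# Arguments: optional alert name and severity (both illustrative).
build_test_alert() {
  local name="${1:-TestAlert}" severity="${2:-warning}"
  cat <<EOF
[
  {
    "labels": { "alertname": "${name}", "severity": "${severity}" },
    "annotations": { "summary": "Synthetic alert to verify Slack delivery" }
  }
]
EOF
}

# Send the synthetic alert to Alertmanager.
# Arguments: optional Alertmanager base URL, alert name, severity.
send_test_alert() {
  local alertmanager_url="${1:-http://localhost:9093}"
  build_test_alert "${2:-TestAlert}" "${3:-warning}" |
    curl -fsS -X POST -H 'Content-Type: application/json' \
      --data-binary @- "${alertmanager_url}/api/v2/alerts"
}
```

Running send_test_alert with the Alertmanager configuration above should produce a message in #devops-alerts, possibly after a short grouping delay. Because the payload sets no end time, Alertmanager can mark the alert as resolved once resolve_timeout passes without the alert being sent again.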

Application Performance metrics using PM2 Prometheus Exporter

Here, we go through setting up the PM2 Prometheus Exporter to monitor our application metrics. The PM2 Prometheus Exporter exposes process metrics from PM2-managed applications in a format compatible with Prometheus. This setup gives us insight into our application's performance by monitoring CPU usage, memory usage, process uptime, and more.

Prerequisites

PM2: Installed and running to manage your application processes.
Node.js: Required for installing the PM2 Prometheus Exporter.
Prometheus: Set up to scrape metrics from the PM2 exporter.
Grafana (optional): For visualizing metrics collected by Prometheus.

Our Go application is already managed by PM2, so all of the prerequisites listed above have already been installed.

Step-by-Step Setup

1. Install PM2

Ensure that PM2 is installed globally on your system:

npm install -g pm2

2. Install PM2 Prometheus Exporter

Install the PM2 Prometheus Exporter globally using npm:

npm install -g pm2-prometheus-exporter

3. Start the PM2 Prometheus Exporter

Run the PM2 Prometheus Exporter to expose metrics on the default port (9209):

pm2-prometheus-exporter

It can also be started using PM2 to manage its lifecycle:

pm2 start pm2-prometheus-exporter --name pm2-exporter

4. Configure Prometheus to Scrape Metrics

Edit the Prometheus configuration file (prometheus.yml) to include the PM2 exporter as a scrape target:

global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'pm2'
    static_configs:
      - targets: ['localhost:9209']  # Default port for PM2 Prometheus Exporter

After updating the configuration, restart Prometheus to apply the changes:

# Restart Prometheus
sudo systemctl restart prometheus
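
A syntax error in prometheus.yml prevents Prometheus from starting again after a restart, so it is safer to validate the file first. The sketch below wraps the restart in a promtool check; the function name apply_prometheus_config is hypothetical, and the default path matches the one used by the installation script.

```shell
# Validate the Prometheus config and restart the service only if the check passes.
# Argument: optional path to the config file.
apply_prometheus_config() {
  local config="${1:-/etc/prometheus/prometheus.yml}"
  if promtool check config "$config"; then
    sudo systemctl restart prometheus
  else
    echo "config check failed; Prometheus was not restarted" >&2
    return 1
  fi
}
```

Calling apply_prometheus_config with no arguments checks /etc/prometheus/prometheus.yml and restarts the service only when promtool reports the file as valid.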

5. Verify Metrics

Open the browser and navigate to http://localhost:9209/metrics. You should see metrics in Prometheus format, including:

pm2_cpu_usage_percent: CPU usage percentage by PM2-managed processes.
pm2_memory_usage_rss_bytes: Memory usage (RSS) in bytes.
pm2_uptime_seconds: Process uptime in seconds.
pm2_restart_count_total: Total number of process restarts.
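
The same check can be done from the terminal. The sketch below reads Prometheus exposition text on standard input and lists the distinct metric names that start with a given prefix; the helper name metric_names is hypothetical, and the port in the usage note is the exporter default mentioned above.

```shell
# List the distinct metric names with the given prefix from exposition text on stdin.
# Comment lines (# HELP, # TYPE) are dropped before matching.
metric_names() {
  local prefix="$1"
  grep -v '^#' | grep -o "^${prefix}[A-Za-z0-9_]*" | sort -u
}
```

For example, curl -s http://localhost:9209/metrics | metric_names pm2_ prints one line per PM2 metric that the exporter currently exposes, which makes a missing metric easy to spot.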

6. Visualize Metrics in Grafana

Import a PM2 Dashboard:
- Click the + button on the left sidebar and select Import.
- Enter the Grafana dashboard ID for PM2 (e.g., 8145) and click Load.
- Select your Prometheus data source and click Import.
Customize Your Dashboard:
- Adjust the dashboard panels to visualize the metrics relevant to your application.
- Monitor CPU, memory, and uptime metrics to ensure optimal performance.

7. Variables for dynamic dashboard updates

When an application runs in multiple environments, such as development, staging, and production, dynamic variables in Grafana let us build a single dashboard that displays data from any of them. Once defined, a variable can be used in panel queries by replacing static values with a placeholder such as $environment. The data in each panel then updates automatically based on the environment selected from the dropdown menu at the top of the dashboard.

Example Query for Uptime Using Environment Variable: For a metric like pm2_uptime_seconds, which tracks the uptime of the application managed by PM2, we can use a query with an environment variable to filter the data:

pm2_uptime_seconds{environment="$environment"}

Explanation:

pm2_uptime_seconds: This metric represents the uptime of our application in seconds. It's collected by the PM2 Prometheus Exporter.
{environment="$environment"}: This part of the query filters the metric by the environment variable. The $environment variable is a dynamic variable defined in our Grafana dashboard that allows us to switch between different environments like dev, staging, and prod.

Example Queries for Grafana Panels

CPU Usage:

  avg(pm2_cpu_usage_percent) by (app_name)

Memory Usage (RSS):

  avg(pm2_memory_usage_rss_bytes) by (app_name)

Uptime:

  pm2_uptime_seconds

Restart Count:

  pm2_restart_count_total

Best Practices

Automate Exporter Start: Use PM2 or a system service (like systemd) to ensure the PM2 Prometheus Exporter starts automatically on boot.

Configuring Data Retention in Prometheus

Setting Up Data Retention Policies in Prometheus

To manage how long historical metrics are stored in Prometheus and ensure data retention aligns with project requirements and compliance needs, we performed the following steps:

Define the Retention Period: We specified the retention period for storing metrics data using the RETENTION_PERIOD variable in our setup script. In this setup script, we set the retention period to 15 days (15d).

RETENTION_PERIOD="15d"

Update Prometheus Configuration: In the Prometheus systemd service file, we included the --storage.tsdb.retention.time flag to set the data retention policy. This flag tells Prometheus how long to keep the data before deleting it.

sudo tee /etc/systemd/system/prometheus.service > /dev/null <<EOF
[Unit]
Description=Prometheus
Wants=network-online.target
After=network-online.target

[Service]
User=prometheus
ExecStart=/usr/local/bin/prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.path=/var/lib/prometheus/ \
  --storage.tsdb.retention.time=$RETENTION_PERIOD

[Install]
WantedBy=multi-user.target
EOF
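
The retention period also determines how much disk space Prometheus needs. The Prometheus storage documentation gives a rough estimate: retention time in seconds, multiplied by ingested samples per second, multiplied by bytes per sample, with roughly 1 to 2 bytes per sample. The helper below is a back-of-the-envelope sketch of that formula; its name and the sample rate in the usage note are assumptions, not measurements from this project.

```shell
# Estimate TSDB disk usage in bytes.
# Arguments: retention in days, ingested samples per second,
# and optional bytes per sample (assumed default: 2, the upper end of the estimate).
estimate_tsdb_bytes() {
  local retention_days="$1" samples_per_second="$2" bytes_per_sample="${3:-2}"
  echo $(( retention_days * 86400 * samples_per_second * bytes_per_sample ))
}
```

For instance, estimate_tsdb_bytes 15 1000 prints 2592000000, or about 2.6 GB, for the 15-day retention above at an assumed 1,000 samples per second. The real ingestion rate has to be measured on the running server before relying on the number.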

Access Control

Configure User Roles and Permissions in Grafana

To ensure proper access control in Grafana, the following steps were followed to configure user roles and permissions:

Log in to Grafana:
- Access your Grafana instance through the browser and log in with admin credentials.
Add Users:
- Go to Configuration (gear icon) > Users.
- Click Add User, enter user details, and either set a password or send an invitation email.
Assign User Roles:
- Navigate to Server Admin (shield icon) > Users.
- Assign organization roles such as Admin, Editor, or Viewer.
Set Permissions for Dashboards:
- Open the dashboard you want to manage.
- Click Dashboard settings (gear icon) > Permissions.
- Add permission rules for specific users, teams, or roles, and set permissions (View, Edit, Admin).

By configuring user roles and permissions, you ensure that only authorised users can view or modify dashboards and alert configurations, enhancing the security and management of your Grafana instance.