8. Comprehensive monitoring with Prometheus and Grafana
This document provides a detailed guide for setting up a comprehensive monitoring solution using Grafana and Prometheus. Monitoring and alerting are a core part of DevOps operations, and having them in your project improves your confidence that things are working properly.
- Install and configure Prometheus and Grafana.
- Create and configure Grafana dashboards.
- Set up alerts based on collected metrics.
- Ensure proper data retention and access control.

- Installing Prometheus and Grafana
  - Prometheus Installation
  - Grafana Installation
- Configuring the Monitoring Dashboards
  - Setting up a Server Metrics dashboard
  - Setting up an Application Metrics Dashboard
- Configuring Alerting
  - Configuring Alert Manager
  - Setting up Alerting on Grafana Dashboard
- Data Retention and Access Control
  - Configuring Data Retention
  - Configuring Access Control
You can install Prometheus manually, running each command step by step, or use a script that installs it and its dependencies in one go. The script below installs Prometheus and sets it up as a systemd service.
#!/bin/bash
PROMETHEUS_VERSION="2.45.6"
PROMETHEUS_USER="prometheus"
RETENTION_PERIOD="15d"
sudo apt-get update
sudo apt-get install -y wget tar
sudo useradd --no-create-home --shell /bin/false $PROMETHEUS_USER
wget https://github.com/prometheus/prometheus/releases/download/v$PROMETHEUS_VERSION/prometheus-$PROMETHEUS_VERSION.linux-amd64.tar.gz
tar xvf prometheus-$PROMETHEUS_VERSION.linux-amd64.tar.gz
sudo mv prometheus-$PROMETHEUS_VERSION.linux-amd64 /usr/local/prometheus
sudo ln -s /usr/local/prometheus/prometheus /usr/local/bin/
sudo ln -s /usr/local/prometheus/promtool /usr/local/bin/
sudo mkdir -p /etc/prometheus /var/lib/prometheus
sudo cp /usr/local/prometheus/prometheus.yml /etc/prometheus/
sudo chown -R $PROMETHEUS_USER:$PROMETHEUS_USER /etc/prometheus /var/lib/prometheus
sudo tee /etc/systemd/system/prometheus.service > /dev/null <<EOF
[Unit]
Description=Prometheus
Wants=network-online.target
After=network-online.target
[Service]
User=$PROMETHEUS_USER
ExecStart=/usr/local/bin/prometheus \
--config.file=/etc/prometheus/prometheus.yml \
--storage.tsdb.path=/var/lib/prometheus/ \
--storage.tsdb.retention.time=$RETENTION_PERIOD
[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload
sudo systemctl enable prometheus
sudo systemctl start prometheus
Run the script below to install Grafana and register it to restart on reboot via systemd.
#!/bin/bash
sudo apt-get update
sudo apt-get install -y apt-transport-https software-properties-common wget
sudo wget -q -O /usr/share/keyrings/grafana.key https://apt.grafana.com/gpg.key
echo "deb [signed-by=/usr/share/keyrings/grafana.key] https://apt.grafana.com stable main" | sudo tee -a /etc/apt/sources.list.d/grafana.list
sudo apt-get update
sudo apt-get install -y grafana
sudo systemctl start grafana-server
sudo systemctl enable grafana-server
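After both services start, a quick probe can confirm they are actually answering requests rather than just running. The helper below is a minimal sketch and not part of the original scripts: wait_until and RETRY_DELAY are hypothetical names, and the URLs in the usage comments assume the default ports (9090 for Prometheus, 3000 for Grafana) on the local machine.

```shell
#!/bin/bash
# wait_until ATTEMPTS COMMAND...  (hypothetical helper, not from the scripts above)
# Re-runs COMMAND until it succeeds or ATTEMPTS is exhausted.
# RETRY_DELAY is an assumed knob: seconds to sleep between attempts (default 2).
wait_until() {
  local attempts=$1
  local i=1
  shift
  while [ "$i" -le "$attempts" ]; do
    if "$@" > /dev/null 2>&1; then
      return 0
    fi
    # No need to sleep after the final failed attempt
    if [ "$i" -lt "$attempts" ]; then
      sleep "${RETRY_DELAY:-2}"
    fi
    i=$((i + 1))
  done
  return 1
}

# Usage (assumes default ports on the local machine):
#   wait_until 10 curl -sf http://localhost:9090/-/ready && echo "Prometheus is ready"
#   wait_until 10 curl -sf http://localhost:3000/api/health && echo "Grafana is healthy"
```

Retrying matters here because systemctl start returns as soon as the process is launched, which can be a few seconds before the HTTP listener is up.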
Node Exporter is a critical component of a monitoring system if you want server metrics. It exposes host-level information such as CPU usage, memory usage, disk I/O, and network traffic for Prometheus to scrape. You can install it on the server by running the script below:
NODE_EXPORTER_USER="node_exporter"
NODE_EXPORTER_VERSION="1.7.0"
sudo useradd --no-create-home --shell /bin/false $NODE_EXPORTER_USER
# Download and install Node Exporter
wget https://github.com/prometheus/node_exporter/releases/download/v$NODE_EXPORTER_VERSION/node_exporter-$NODE_EXPORTER_VERSION.linux-amd64.tar.gz
tar xvf node_exporter-$NODE_EXPORTER_VERSION.linux-amd64.tar.gz
sudo mv node_exporter-$NODE_EXPORTER_VERSION.linux-amd64/node_exporter /usr/local/bin/
sudo chown $NODE_EXPORTER_USER:$NODE_EXPORTER_USER /usr/local/bin/node_exporter
# Node Exporter systemd service
sudo tee /etc/systemd/system/node_exporter.service > /dev/null <<EOF
[Unit]
Description=Node Exporter
Wants=network-online.target
After=network-online.target
[Service]
User=$NODE_EXPORTER_USER
ExecStart=/usr/local/bin/node_exporter
[Install]
WantedBy=default.target
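EOF
# Register and start the service so the exporter runs now and after reboots
sudo systemctl daemon-reload
sudo systemctl enable node_exporter
sudo systemctl start node_exporter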
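Node Exporter listens on port 9100 by default, but Prometheus collects nothing from it until a scrape job points there. The sketch below prints such a job. scrape_job is a hypothetical helper, and the job name node_exporter is chosen only because the dashboard queries later in this guide filter on job="node_exporter".

```shell
#!/bin/bash
# scrape_job JOB_NAME TARGET  (hypothetical helper)
# Prints a scrape_configs entry, indented to sit under the existing
# "scrape_configs:" key in prometheus.yml.
scrape_job() {
  local job=$1
  local target=$2
  printf '  - job_name: "%s"\n' "$job"
  printf '    static_configs:\n'
  printf '      - targets: ["%s"]\n' "$target"
}

# Usage (assumes scrape_configs is the last top-level key in the file,
# as in the default prometheus.yml shipped with the release):
#   scrape_job node_exporter localhost:9100 | sudo tee -a /etc/prometheus/prometheus.yml
#   promtool check config /etc/prometheus/prometheus.yml
#   sudo systemctl restart prometheus
```

Running promtool check config before the restart catches indentation mistakes while the old configuration is still being served.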
A server monitoring pipeline is never complete without visualization of the metrics and system conditions in a dashboard. Grafana allows one to create beautiful dashboards that display the rich information that Prometheus scrapes. The Grafana community has made some dashboards available freely for use. You simply need to import an available dashboard and further configure it.
This project uses the Node Exporter Full dashboard available via dashboard ID 1860. This dashboard was designed for full monitoring of Linux servers using data scraped by Node Exporter.
Navigate to the Dashboards page in the Grafana web portal and click the New button on the right side of the screen. In the dropdown that appears, click "Import" to proceed.
Now on the import dashboard page, enter 1860 in the input field that has the placeholder "Grafana.com dashboard URL or ID". Complete the process by clicking the "Load" button.
The dashboard takes a few seconds to load, after which it displays the most important system metrics.
- Configure Grafana to Use Prometheus as a Data Source
- Add Prometheus Data Source:
- Open Grafana in your browser (http://<server-ip>:3000/).
- Log in and navigate to Configuration > Data Sources > Add data source.
- Select Prometheus and set the URL to http://localhost:9090. Click Save & Test.
- Open or Create a Dashboard
- Log in to Grafana.
- Click on the + icon on the left sidebar and select Dashboard.
- Click Add new panel or open an existing dashboard to add a new panel.
- Configure Visualizations
- Query Configuration:
- In the new panel, go to the Query section.
- Select Prometheus as the data source.
- Enter the appropriate Prometheus query to fetch the metrics you want to visualize.
For example:
rate(node_cpu_seconds_total{job="node_exporter", mode="idle"}[5m])
- Use the query editor to refine and test your query.
- Set Up Variables for Dynamic Dashboard Updates
- Create Variables:
- Click on the Dashboard settings (gear icon) and navigate to Variables.
- Click Add variable.
- Define the variable settings, such as the name, type, and data source query. For example, to create a variable for project names:
label_values(node_exporter_build_info, project)
- Use Variables in Panels:
- In the panel query, replace fixed values with the variable. For example:
rate(node_cpu_seconds_total{job="node_exporter", project="$project", mode="idle"}[5m])
- This makes the panels dynamically update based on the selected variable value.
- Save Dashboards and Ensure Accessibility
- Save the Dashboard:
- Click the Save dashboard (disk icon) at the top.
- Provide a name and description for the dashboard and click Save.
- Ensure Accessibility:
- Share the dashboard with relevant team members.
- Set appropriate permissions for viewing or editing the dashboard.
- Go to Dashboard settings > Permissions to configure access control.
- Create and Configure Alerts in Grafana
- Create an Alert:
- Open the panel where you want to add an alert.
- Click on the Edit button (the pencil icon).
- Go to the Alert tab.
- Define Alert Conditions:
- Click on Create Alert.
- Define the alert conditions based on your Prometheus queries. For example:
- To alert on high request rate, you could use:
rate(myapp_request_count[1m]) > 100
- To alert on high request duration, you could use:
histogram_quantile(0.95, sum(rate(myapp_request_duration_seconds_bucket[5m])) by (le)) > 1
- Set Alert Evaluation and Frequency:
- Configure how often the alert rule should be evaluated (e.g., every 1 minute).
- Set the alert condition to trigger when it has been met for a specified duration (e.g., 5 minutes).
- Configure Alert Notifications:
- Click on Add Notification Channel.
- Select an existing notification channel or create a new one by going to Alerting > Notification channels > New Channel.
- Configure the notification channel (Slack) with the necessary details.
- Save the notification channel.
- Save the Alert:
- Save the alert configuration and apply the changes to the panel.
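The example query in the Configure Visualizations step returns the per-core idle rate, which is the inverse of what a CPU panel or alert usually needs. As a sketch, the helper below builds a busy-CPU percentage expression that can be evaluated against the Prometheus HTTP API before it is pasted into Grafana. cpu_busy_percent_expr is a hypothetical name, and the job label is assumed to match your scrape configuration.

```shell
#!/bin/bash
# cpu_busy_percent_expr JOB  (hypothetical helper)
# The idle-mode rate is the fraction of each second a core spent idle, so
# 100 - (average idle fraction * 100) gives the busy CPU percent per instance.
cpu_busy_percent_expr() {
  local job=$1
  printf '100 - (avg by (instance) (rate(node_cpu_seconds_total{job="%s", mode="idle"}[5m])) * 100)\n' "$job"
}

# Usage (assumes Prometheus listens on localhost:9090):
#   curl -sG http://localhost:9090/api/v1/query \
#     --data-urlencode "query=$(cpu_busy_percent_expr node_exporter)"
```

Evaluating the expression through the API first shows whether it returns any series at all, which is a common reason for a panel or alert staying empty.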
Incoming webhooks are a way to post messages from applications into Slack. Creating an incoming webhook gives you a unique URL to which you send a JSON payload with the message text and some options.
The Slack API is used to configure incoming webhooks. The following steps were taken to configure the incoming webhook used for posting messages:
- Click on the Create your Slack app button.
- Click on Create New App, then select From scratch to use the configuration UI to manually add basic info, scopes, settings and features to your app.
- Give the App Name Kimiko-Telex-Team-Alerts and select HNG11 as the workspace.
- Click on Incoming webhooks, activate the incoming webhooks by toggling the button and click on Add New Webhook to Workspace.
- Select the #devops-alerts channel you want alerts to be sent to and click on Allow.
- The Webhook URL is now created. Then proceed to the steps below:
- Add Notification Channels:
- Navigate to Alerting > Notification channels > New Channel.
- Add the details for your notification channel. For example, to configure Slack:
- Name: Slack Notifications
- Type: Slack
- Webhook URL: https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXXXXXX (Replace with your slack webhook URL)
- Mention the channel: #alerts
- Save the notification channel.
- Link Notification Channels to Alerts:
- In the alert configuration, link the alert to the notification channel you just created.
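Before relying on the channel, it may be worth posting to the webhook by hand to confirm that the URL and channel are correct. Slack incoming webhooks accept a JSON body with a text field, and the sketch below builds one. slack_payload is a hypothetical helper and SLACK_WEBHOOK_URL is an assumed environment variable holding your own webhook URL.

```shell
#!/bin/bash
# slack_payload MESSAGE  (hypothetical helper)
# Prints a minimal incoming-webhook body: {"text":"MESSAGE"}.
# Only backslashes and double quotes are escaped; a message containing
# newlines or control characters would need a real JSON encoder such as jq.
slack_payload() {
  local escaped
  # Escape backslashes first, then double quotes, so the result stays valid JSON
  escaped=$(printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g')
  printf '{"text":"%s"}\n' "$escaped"
}

# Usage (assumes SLACK_WEBHOOK_URL is exported in the current shell):
#   curl -s -X POST -H 'Content-type: application/json' \
#     --data "$(slack_payload 'Test message from the monitoring server')" \
#     "$SLACK_WEBHOOK_URL"
```

Keeping the URL in an environment variable rather than in the script also avoids committing the webhook secret to the repository.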
Save the snippet below in a file, make it executable (chmod +x "file.sh") and run the file.
#!/bin/bash
ALERTMANAGER_VERSION="0.27.0"
ALERTMANAGER_USER="alertmanager"
sudo useradd --no-create-home --shell /bin/false $ALERTMANAGER_USER
wget https://github.com/prometheus/alertmanager/releases/download/v$ALERTMANAGER_VERSION/alertmanager-$ALERTMANAGER_VERSION.linux-amd64.tar.gz
tar xvf alertmanager-$ALERTMANAGER_VERSION.linux-amd64.tar.gz
sudo mv alertmanager-$ALERTMANAGER_VERSION.linux-amd64 /usr/local/alertmanager
sudo ln -s /usr/local/alertmanager/alertmanager /usr/local/bin/
sudo mkdir -p /etc/alertmanager /var/lib/alertmanager
sudo cp /usr/local/alertmanager/alertmanager.yml /etc/alertmanager/
sudo chown -R $ALERTMANAGER_USER:$ALERTMANAGER_USER /etc/alertmanager /var/lib/alertmanager
sudo tee /etc/alertmanager/alertmanager.yml > /dev/null <<EOF
global:
  resolve_timeout: 1m
  slack_api_url: 'https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXXXXXX'
route:
  receiver: 'Kimiko-Slack-Alert'
receivers:
  - name: 'Kimiko-Slack-Alert'
    slack_configs:
      - channel: '#devops-alerts'
        send_resolved: true
EOF
sudo tee /etc/systemd/system/alertmanager.service > /dev/null <<EOF
[Unit]
Description=Alertmanager
Wants=network-online.target
After=network-online.target
[Service]
User=$ALERTMANAGER_USER
ExecStart=/usr/local/bin/alertmanager \
--config.file=/etc/alertmanager/alertmanager.yml \
--storage.path=/var/lib/alertmanager/
[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload
sudo systemctl enable alertmanager
sudo systemctl start alertmanager
This script automates the installation and configuration of Alertmanager, a component of the Prometheus monitoring system that handles alerts sent by client applications such as the Prometheus server. It integrates with various notification services to send alerts to your team.
#!/bin/bash
ALERTMANAGER_VERSION="0.27.0"
ALERTMANAGER_USER="alertmanager"
- Defines the version of Alertmanager to install and the system user that will run the Alertmanager service.
sudo useradd --no-create-home --shell /bin/false $ALERTMANAGER_USER
- Creates a system user named alertmanager without a home directory and disables login shell for security.
wget https://github.com/prometheus/alertmanager/releases/download/v$ALERTMANAGER_VERSION/alertmanager-$ALERTMANAGER_VERSION.linux-amd64.tar.gz
tar xvf alertmanager-$ALERTMANAGER_VERSION.linux-amd64.tar.gz
sudo mv alertmanager-$ALERTMANAGER_VERSION.linux-amd64 /usr/local/alertmanager
sudo ln -s /usr/local/alertmanager/alertmanager /usr/local/bin/
- Downloads the specified version of Alertmanager, extracts the files, moves them to /usr/local/alertmanager, and creates symbolic links to make the Alertmanager binary accessible system-wide.
sudo mkdir -p /etc/alertmanager /var/lib/alertmanager
sudo cp /usr/local/alertmanager/alertmanager.yml /etc/alertmanager/
sudo chown -R $ALERTMANAGER_USER:$ALERTMANAGER_USER /etc/alertmanager /var/lib/alertmanager
- Creates configuration and data directories for Alertmanager, copies the default configuration file, and sets appropriate permissions to ensure the Alertmanager user owns these directories.
sudo tee /etc/alertmanager/alertmanager.yml > /dev/null <<EOF
global:
  resolve_timeout: 1m
  slack_api_url: 'https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXXXXXX'
route:
  receiver: 'Kimiko-Slack-Alert'
receivers:
  - name: 'Kimiko-Slack-Alert'
    slack_configs:
      - channel: '#devops-alerts'
        send_resolved: true
EOF
- Configures Alertmanager to use a Slack webhook for sending alerts. It sets the global configuration, routes alerts to the Kimiko-Slack-Alert receiver, and defines the Slack notification channel #devops-alerts.
sudo tee /etc/systemd/system/alertmanager.service > /dev/null <<EOF
[Unit]
Description=Alertmanager
Wants=network-online.target
After=network-online.target
[Service]
User=$ALERTMANAGER_USER
ExecStart=/usr/local/bin/alertmanager \
--config.file=/etc/alertmanager/alertmanager.yml \
--storage.path=/var/lib/alertmanager/
[Install]
WantedBy=multi-user.target
EOF
- Creates a systemd service file for Alertmanager to manage it as a system service. This ensures Alertmanager starts on boot and can be controlled using systemctl.
sudo systemctl daemon-reload
sudo systemctl enable alertmanager
sudo systemctl start alertmanager
- Reloads the systemd configuration to recognize the new service, enables Alertmanager to start on boot, and starts the Alertmanager service.
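Alertmanager only routes and delivers alerts; the conditions that fire them are evaluated by Prometheus from a rule file. The scripts in this guide do not show that file, so the following is a sketch under assumptions: alert_rule_group is a hypothetical helper, /etc/prometheus/rules.yml is an assumed path, and the expression is the request-rate example from the Grafana alert steps.

```shell
#!/bin/bash
# alert_rule_group GROUP ALERT EXPR FOR  (hypothetical helper)
# Prints a Prometheus rule file containing a single alerting rule that
# fires once EXPR has held true for the FOR duration.
alert_rule_group() {
  local group=$1
  local alert=$2
  local expr=$3
  local hold=$4
  cat <<EOF
groups:
  - name: $group
    rules:
      - alert: $alert
        expr: $expr
        for: $hold
        labels:
          severity: warning
EOF
}

# Usage (the rules path is an assumption):
#   alert_rule_group app HighRequestRate 'rate(myapp_request_count[1m]) > 100' 5m \
#     | sudo tee /etc/prometheus/rules.yml
#   promtool check rules /etc/prometheus/rules.yml
```

For the rule to reach Alertmanager, prometheus.yml would also need the rule file listed under rule_files and localhost:9093 (Alertmanager's default port) listed as a target under alerting.alertmanagers, followed by a Prometheus restart.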
- Test alert rules to ensure that notifications are sent correctly when thresholds are breached.
- Verify that alerts are actionable and provide relevant information for troubleshooting.
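One way to exercise the Slack route without waiting for a real threshold breach is to post a synthetic alert to Alertmanager's v2 API. The sketch below only builds the request body; test_alert_json is a hypothetical helper, and the usage comment assumes Alertmanager is listening on its default port 9093.

```shell
#!/bin/bash
# test_alert_json ALERTNAME SUMMARY  (hypothetical helper)
# Prints a one-element alert list in the shape the Alertmanager v2 API accepts.
# Arguments are inserted verbatim, so keep them free of quotes and backslashes.
test_alert_json() {
  local name=$1
  local summary=$2
  printf '[{"labels":{"alertname":"%s","severity":"warning"},"annotations":{"summary":"%s"}}]\n' "$name" "$summary"
}

# Usage (assumes Alertmanager on localhost:9093):
#   curl -s -X POST http://localhost:9093/api/v2/alerts \
#     -H 'Content-Type: application/json' \
#     -d "$(test_alert_json TestAlert 'Manual test from the command line')"
```

If the route and webhook are correct, a message should appear in #devops-alerts after Alertmanager's grouping delay. Because the configuration sets send_resolved: true, a resolved message should follow once the synthetic alert expires.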
Here, we will go through setting up the PM2 Prometheus Exporter to monitor our application metrics. The PM2 Prometheus Exporter is a tool that exposes process metrics from PM2-managed applications in a format compatible with Prometheus. This setup gives us insight into our application's performance by monitoring CPU usage, memory usage, process uptime, and more.
- PM2: Installed and running to manage your application processes.
- Node.js: Required for installing the PM2 Prometheus Exporter.
- Prometheus: Set up to scrape metrics from the PM2 exporter.
- Grafana (optional): For visualizing metrics collected by Prometheus.
Our Go application is already managed by PM2, so everything listed above is already installed.
Ensure that PM2 is installed globally on your system:
npm install -g pm2
Install the PM2 Prometheus Exporter globally using npm:
npm install -g pm2-prometheus-exporter
Run the PM2 Prometheus Exporter to expose metrics on the default port (9209):
pm2-prometheus-exporter
It can also be started using PM2 to manage its lifecycle:
pm2 start pm2-prometheus-exporter --name pm2-exporter
Edit the Prometheus configuration file (prometheus.yml) to include the PM2 exporter as a scrape target:
global:
  scrape_interval: 15s
scrape_configs:
  - job_name: 'pm2'
    static_configs:
      - targets: ['localhost:9209'] # Default port for PM2 Prometheus Exporter
After updating the configuration, reload Prometheus to apply the changes:
# Restart Prometheus
sudo systemctl restart prometheus
Open the browser and navigate to http://localhost:9209/metrics. You should see metrics in Prometheus format, including:
- pm2_cpu_usage_percent: CPU usage percentage by PM2-managed processes.
- pm2_memory_usage_rss_bytes: Memory usage (RSS) in bytes.
- pm2_uptime_seconds: Process uptime in seconds.
- pm2_restart_count_total: Total number of process restarts.
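Scanning the raw metrics page by eye gets tedious once several processes are running. As a sketch, the helper below reduces the exposition text to the unique metric names under a given prefix, which makes it easy to confirm which pm2_ metrics your exporter version actually exposes. metric_names is a hypothetical name.

```shell
#!/bin/bash
# metric_names PREFIX  (hypothetical helper)
# Reads Prometheus text exposition format on stdin and prints the unique
# metric names that start with PREFIX, skipping comment and blank lines.
metric_names() {
  awk -v prefix="$1" '
    /^#/ || NF == 0 { next }
    {
      name = $1
      sub(/[{].*/, "", name)   # drop the label set, if any
      if (index(name, prefix) == 1) print name
    }
  ' | sort -u
}

# Usage (assumes the exporter is on its default port):
#   curl -s http://localhost:9209/metrics | metric_names pm2_
```

If a name from the list above is missing from the output, dashboards and queries that reference it will stay empty, so this is a useful check before importing a dashboard.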
- Import a PM2 Dashboard:
  - Click the + button on the left sidebar and select Import.
  - Enter the Grafana dashboard ID for PM2 (e.g., 8145) and click Load.
  - Select your Prometheus data source and click Import.
- Customize Your Dashboard:
  - Adjust the dashboard panels to visualize the metrics relevant to your application.
  - Monitor CPU, memory, and uptime metrics to ensure optimal performance.
When monitoring an application running in multiple environments, such as development, staging and production, dynamic variables in Grafana enable us to create a single dashboard that can display data from different environments. Once defined, these variables can be used in panel queries by replacing static values with variable placeholders, such as $environment. This allows the data displayed in each panel to update automatically based on the environment selected from the dropdown menu at the top of the dashboard.
Example Query for Uptime Using Environment Variable: For a metric like pm2_uptime_seconds, which tracks the uptime of the application managed by PM2, we can use a query with an environment variable to filter the data:
pm2_uptime_seconds{environment="$environment"}
Explanation:
- pm2_uptime_seconds: This metric represents the uptime of our application in seconds. It's collected by the PM2 Prometheus Exporter.
- {environment="$environment"}: This part of the query filters the metric by the environment variable. The $environment variable is a dynamic variable defined in our Grafana dashboard that allows us to switch between different environments like dev, staging, and prod.
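The filter only matches if the scraped series actually carry an environment label, and the exporter metrics listed earlier do not show one. One possible way to supply it, sketched below, is to attach the label at scrape time in prometheus.yml. labelled_target is a hypothetical helper, and the label value prod is only an example.

```shell
#!/bin/bash
# labelled_target TARGET LABEL VALUE  (hypothetical helper)
# Prints a static_configs entry that attaches LABEL="VALUE" to every series
# scraped from TARGET, indented to sit under a job's "static_configs:" key.
labelled_target() {
  local target=$1
  local label=$2
  local value=$3
  printf '      - targets: ["%s"]\n' "$target"
  printf '        labels:\n'
  printf '          %s: "%s"\n' "$label" "$value"
}

# Usage: print the entry for a production PM2 exporter, then paste it under
# the pm2 job in prometheus.yml in place of the unlabelled target.
#   labelled_target localhost:9209 environment prod
```

With the label in place, the dashboard variable itself could be populated with a query such as label_values(pm2_uptime_seconds, environment), so the dropdown lists exactly the environments Prometheus has seen.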
- CPU Usage:
avg(pm2_cpu_usage_percent) by (app_name)
- Memory Usage (RSS):
avg(pm2_memory_usage_rss_bytes) by (app_name)
- Uptime:
pm2_uptime_seconds
- Restart Count:
pm2_restart_count_total
- Automate Exporter Start: Use PM2 or a system service (like systemd) to ensure the PM2 Prometheus Exporter starts automatically on boot.
To manage how long historical metrics are stored in Prometheus and ensure data retention aligns with project requirements and compliance needs, we performed the following steps:
- Define the Retention Period: We specified how long metrics are kept using the RETENTION_PERIOD variable in the Prometheus setup script, which sets it to 15 days (15d).
RETENTION_PERIOD="15d"
- Update Prometheus Configuration: In the Prometheus systemd service file, we included the --storage.tsdb.retention.time flag to set the data retention policy. This flag tells Prometheus how long to keep the data before deleting it.
sudo tee /etc/systemd/system/prometheus.service > /dev/null <<EOF
[Unit]
Description=Prometheus
Wants=network-online.target
After=network-online.target
[Service]
User=prometheus
ExecStart=/usr/local/bin/prometheus \
--config.file=/etc/prometheus/prometheus.yml \
--storage.tsdb.path=/var/lib/prometheus/ \
--storage.tsdb.retention.time=$RETENTION_PERIOD
[Install]
WantedBy=multi-user.target
EOF
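Because the value is interpolated into the unit file as-is, a typo such as 15days would likely only surface when the service fails to start. A small guard in the setup script can catch that earlier. The sketch below is an assumption, not part of the original script: is_valid_retention is a hypothetical helper that accepts only the simple number-plus-unit form.

```shell
#!/bin/bash
# is_valid_retention VALUE  (hypothetical helper)
# Succeeds when VALUE is a single number followed by one Prometheus duration
# unit (ms, s, m, h, d, w, y). Compound values such as 1h30m are rejected
# here for simplicity, even though Prometheus may accept them.
is_valid_retention() {
  printf '%s' "$1" | grep -Eq '^[0-9]+(ms|s|m|h|d|w|y)$'
}

# Usage: guard the variable before it is written into the unit file.
#   RETENTION_PERIOD="15d"
#   is_valid_retention "$RETENTION_PERIOD" || { echo "bad retention: $RETENTION_PERIOD" >&2; exit 1; }
```

After restarting Prometheus, the effective value can be confirmed by looking for storage.tsdb.retention.time in the output of http://localhost:9090/api/v1/status/flags.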
To ensure proper access control in Grafana, the following steps were followed to configure user roles and permissions:
- Log in to Grafana:
  - Access your Grafana instance through the browser and log in with admin credentials.
- Add Users:
  - Go to Configuration (gear icon) > Users.
  - Click Add User, enter user details, and either set a password or send an invitation email.
- Assign User Roles (Global Roles):
  - Navigate to Server Admin (shield icon) > Users.
  - Assign global roles such as Admin, Editor, or Viewer.
- Set Permissions for Dashboards:
  - Open the dashboard you want to manage.
  - Click Dashboard settings (gear icon) > Permissions.
  - Add permission rules for specific users, teams, or roles, and set permissions (View, Edit, Admin).
By configuring user roles and permissions, you ensure that only authorised users can view or modify dashboards and alert configurations, enhancing the security and management of your Grafana instance.
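When there are many users, the same role assignment can be scripted against Grafana's HTTP API instead of clicked through the UI. The sketch below only builds and validates the request body. org_role_payload is a hypothetical helper, and GRAFANA_USER, GRAFANA_PASSWORD and USER_ID in the usage comment are placeholders you would supply yourself.

```shell
#!/bin/bash
# org_role_payload ROLE  (hypothetical helper)
# Prints the JSON body for a role change, rejecting anything other than the
# three roles named in the steps above.
org_role_payload() {
  case $1 in
    Admin|Editor|Viewer)
      printf '{"role":"%s"}\n' "$1"
      ;;
    *)
      echo "unknown role: $1" >&2
      return 1
      ;;
  esac
}

# Usage (assumes Grafana on localhost:3000 and an admin account):
#   curl -s -X PATCH "http://localhost:3000/api/org/users/$USER_ID" \
#     -u "$GRAFANA_USER:$GRAFANA_PASSWORD" \
#     -H 'Content-Type: application/json' \
#     -d "$(org_role_payload Viewer)"
```

Validating the role name locally avoids sending a request that Grafana would reject, and it keeps a typo from silently leaving a user with broader access than intended.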