Grafana and Prometheus Monitoring Setup
This documentation provides a comprehensive guide on setting up Grafana and Prometheus for monitoring. It covers installation steps, configuration details, instructions for accessing and managing dashboards, and troubleshooting tips for maintaining the monitoring setup.
- Install and Configure Prometheus, Grafana, cAdvisor and Node Exporter: Set up Prometheus for metric collection and Grafana for data visualization.
- Create and Configure Grafana Dashboards: Develop dashboards to visualize metrics.
- Set Up Alerts Based on Collected Metrics: Configure alerting to notify the team of potential issues.
- Ensure Proper Data Retention and Access Control: Manage data storage and user access.
- Troubleshooting: Tips for maintaining the monitoring setup.
- Installing Prometheus, Grafana, cAdvisor and Node Exporter
- Configuring the Monitoring Dashboards
- Configuring Alerting
- Data Retention and Access Control
- Manage Users and Roles in Grafana
- Troubleshooting tips
#!/bin/bash
# Exit immediately if a command exits with a non-zero status
set -e
# Update and install dependencies
sudo apt-get update
sudo apt-get install -y wget curl tar adduser libfontconfig1
# Create users
sudo adduser --system --group --no-create-home prometheus
sudo adduser --system --group --no-create-home node_exporter
sudo adduser --system --group --home /var/lib/grafana grafana
# Install Prometheus
PROMETHEUS_VERSION="2.37.0"
wget https://github.com/prometheus/prometheus/releases/download/v${PROMETHEUS_VERSION}/prometheus-${PROMETHEUS_VERSION}.linux-amd64.tar.gz
tar xvf prometheus-${PROMETHEUS_VERSION}.linux-amd64.tar.gz
cd prometheus-${PROMETHEUS_VERSION}.linux-amd64/
sudo mv prometheus promtool /usr/local/bin/
sudo mkdir -p /etc/prometheus /var/lib/prometheus
sudo mv consoles/ console_libraries/ prometheus.yml /etc/prometheus/
cd ..
rm -rf prometheus-${PROMETHEUS_VERSION}.linux-amd64*
# Set Prometheus ownership
sudo chown -R prometheus:prometheus /etc/prometheus /var/lib/prometheus
# Install Grafana
wget -q -O - https://packages.grafana.com/gpg.key | sudo apt-key add -
echo "deb https://packages.grafana.com/oss/deb stable main" | sudo tee -a /etc/apt/sources.list.d/grafana.list
sudo apt-get update
sudo apt-get install -y grafana
# Install Node Exporter
NODE_EXPORTER_VERSION="1.3.1"
wget https://github.com/prometheus/node_exporter/releases/download/v${NODE_EXPORTER_VERSION}/node_exporter-${NODE_EXPORTER_VERSION}.linux-amd64.tar.gz
tar xvf node_exporter-${NODE_EXPORTER_VERSION}.linux-amd64.tar.gz
sudo mv node_exporter-${NODE_EXPORTER_VERSION}.linux-amd64/node_exporter /usr/local/bin/
rm -rf node_exporter-${NODE_EXPORTER_VERSION}.linux-amd64*
# Install cAdvisor (latest version as of the last update)
CADVISOR_VERSION="v0.47.0" # Update this to the latest version if needed
sudo apt-get install -y libseccomp2
wget https://github.com/google/cadvisor/releases/download/${CADVISOR_VERSION}/cadvisor-${CADVISOR_VERSION}-linux-amd64
sudo mv cadvisor-${CADVISOR_VERSION}-linux-amd64 /usr/local/bin/cadvisor
sudo chmod +x /usr/local/bin/cadvisor
# Create systemd service files
# Prometheus
cat << EOF | sudo tee /etc/systemd/system/prometheus.service
[Unit]
Description=Prometheus
Wants=network-online.target
After=network-online.target
[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/local/bin/prometheus \
--config.file /etc/prometheus/prometheus.yml \
--storage.tsdb.path /var/lib/prometheus/ \
--web.console.templates=/etc/prometheus/consoles \
--web.console.libraries=/etc/prometheus/console_libraries
[Install]
WantedBy=multi-user.target
EOF
# Node Exporter
cat << EOF | sudo tee /etc/systemd/system/node_exporter.service
[Unit]
Description=Node Exporter
Wants=network-online.target
After=network-online.target
[Service]
User=node_exporter
Group=node_exporter
Type=simple
ExecStart=/usr/local/bin/node_exporter
[Install]
WantedBy=multi-user.target
EOF
# cAdvisor
cat << EOF | sudo tee /etc/systemd/system/cadvisor.service
[Unit]
Description=cAdvisor
Wants=network-online.target
After=network-online.target
[Service]
User=root
Group=root
Type=simple
ExecStart=/usr/local/bin/cadvisor
[Install]
WantedBy=multi-user.target
EOF
# Reload systemd and start services
sudo systemctl daemon-reload
sudo systemctl enable prometheus node_exporter cadvisor grafana-server
sudo systemctl start prometheus node_exporter cadvisor grafana-server
echo "Installation complete. Please check the status of the services to ensure they are running correctly."
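The stock `prometheus.yml` shipped in the release tarball only scrapes Prometheus itself, so Node Exporter and cAdvisor metrics will not appear until their targets are added. The following is a minimal sketch, not part of the original script. It assumes the default ports (9100 for Node Exporter, 8080 for cAdvisor) and that `scrape_configs:` is the last top-level section of the file, as in the stock config. The function name `add_scrape_jobs` is illustrative.

```shell
# Append scrape jobs for Node Exporter and cAdvisor to a Prometheus config.
# Assumption: scrape_configs: is the last top-level key and its list items
# use a two-space indent, as in the stock prometheus.yml.
add_scrape_jobs() {
  config_file="$1"
  # Skip if the jobs were already added, so the function is safe to re-run.
  if grep -q 'job_name: "node_exporter"' "$config_file"; then
    echo "scrape jobs already present in $config_file"
    return 0
  fi
  cat >> "$config_file" << 'EOF'

  - job_name: "node_exporter"
    static_configs:
      - targets: ["localhost:9100"]

  - job_name: "cadvisor"
    static_configs:
      - targets: ["localhost:8080"]
EOF
}

# Usage (as root):
#   add_scrape_jobs /etc/prometheus/prometheus.yml
#   promtool check config /etc/prometheus/prometheus.yml
#   systemctl restart prometheus
```

Running `promtool check config` before the restart catches indentation mistakes before Prometheus refuses to start.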
Configure Grafana to Use Prometheus as a Data Source:
- Access Grafana via your web browser. The default address is http://localhost:3000/; because another application already uses port 3000 on our host, we configured Grafana to listen on port 3050 instead (http://localhost:3050/).
- Navigate to Configuration > Data Sources > Add data source.
- Select Prometheus and set the URL to http://localhost:9090. Click Save & Test.
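The port change and the data source can also be applied from the shell instead of the UI. This is a sketch under assumptions: it presumes the Debian package layout (`/etc/grafana/grafana.ini` and `/etc/grafana/provisioning/datasources/`) and GNU `sed`, and the function names are hypothetical. If the ini file has no `http_port` line at all, the first function changes nothing.

```shell
# Set Grafana's HTTP port in a grafana.ini-style file.
# The packaged file ships the setting commented out as ";http_port = 3000".
set_grafana_port() {
  ini_file="$1"
  port="$2"
  sed -i -E "s/^[;#]?[[:space:]]*http_port[[:space:]]*=.*/http_port = ${port}/" "$ini_file"
}

# Write a provisioning file that registers Prometheus as the default data source.
write_prometheus_datasource() {
  target_file="$1"
  cat > "$target_file" << 'EOF'
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://localhost:9090
    isDefault: true
EOF
}

# Usage (as root):
#   set_grafana_port /etc/grafana/grafana.ini 3050
#   write_prometheus_datasource /etc/grafana/provisioning/datasources/prometheus.yaml
#   systemctl restart grafana-server
```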
Import Node Exporter Dashboard:
- In Grafana, go to the Dashboards page.
- Click the “+” icon and select “Import”.
- Enter Dashboard ID 1860 and click “Load”.
- Choose the Prometheus data source and click “Import”.
Using cAdvisor:
- In Grafana, click the “+” icon and select “Dashboard”.
- Click “Import” and then add the dashboard JSON or ID.
Configuring Dynamic Variables:
- Navigate to the desired dashboard, click on the settings icon, select the variables tab, and create a new dynamic variable.
- In this case, we are creating a dynamic variable to reflect the environments of our containerized applications, such as dev, staging, and prod.
Testing the Dynamic Variable:
- From the image below, our dynamic variable lists the containers based on the environments.
Create and Configure Alerts in Grafana:
- Open the home menu bar, navigate to the Alert tab, and click "Create Alert."
- Define conditions based on Prometheus queries (e.g., `rate(myapp_request_count[1m]) > 100`).
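Before saving a rule, it can help to run the expression against Prometheus directly and see whether it currently returns any series. The sketch below uses the Prometheus HTTP API endpoint `/api/v1/query`. It assumes Prometheus listens on `localhost:9090`, and `myapp_request_count` is the example metric from the step above. The helper names are illustrative.

```shell
PROM_URL="${PROM_URL:-http://localhost:9090}"

# Run an instant PromQL query and print the raw JSON response.
promql_query() {
  curl -s -G "${PROM_URL}/api/v1/query" --data-urlencode "query=$1"
}

# Read a query response on stdin; succeed only if the query succeeded and
# returned at least one series. An empty result is serialized as "result":[].
has_series() {
  response=$(cat)
  case "$response" in
    *'"status":"success"'*) ;;
    *) return 1 ;;
  esac
  case "$response" in
    *'"result":[]'*) return 1 ;;
    *) return 0 ;;
  esac
}

# Usage:
#   promql_query 'rate(myapp_request_count[1m]) > 100' | has_series \
#     && echo "condition is currently true"
```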
Configure Alert Evaluation and Frequency:
- Set the evaluation interval (e.g., every minute) and the duration for which the alert condition must be met (e.g., 5 minutes).
Set Up Notification Channels:
- Navigate to Alerting > Contact Points and click the `Add contact point` button.
- Configure the notification channel (e.g., Slack) with your webhook URL and other details.
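A quick way to confirm the webhook works before wiring it into Grafana is to post a test message to it. This is a minimal sketch that assumes a Slack incoming webhook whose URL is stored in the `SLACK_WEBHOOK_URL` environment variable. The variable and function names are hypothetical.

```shell
# Build a minimal Slack incoming-webhook payload.
# The message must not contain double quotes or backslashes, because this
# sketch does no JSON escaping.
slack_payload() {
  printf '{"text":"%s"}' "$1"
}

# Post a test message to the webhook in SLACK_WEBHOOK_URL.
send_slack_test() {
  curl -s -X POST -H 'Content-Type: application/json' \
    --data "$(slack_payload "$1")" "$SLACK_WEBHOOK_URL"
}

# Usage:
#   export SLACK_WEBHOOK_URL='<your webhook URL>'
#   send_slack_test 'Test message from the monitoring setup'
```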
Link Alerts to Notification Channels:
- In the alert configuration, associate the alert with the notification channel created.
Managed Alerts and Notifications:
- We configured Grafana managed alerts to notify the team via our Slack configuration webhook URL, which sends the alerts to the #devops-alerts channel.
- Alerts were created for Low Disk, High CPU, Container Down, and High Memory.
Set Data Retention in Prometheus:
- By default, Prometheus retains data for 15 days. We set the retention period to 30 days and capped storage at 5GB by adding the following flags: `--storage.tsdb.retention.time=30d --storage.tsdb.retention.size=5GB`. When both limits are set, whichever is reached first applies. The updated service file looks like this:
[Unit]
Description=Prometheus
Wants=network-online.target
After=network-online.target
[Service]
User=prometheus
ExecStart=/usr/local/bin/prometheus \
--config.file /etc/prometheus/prometheus.yml \
--storage.tsdb.path /var/lib/prometheus/ \
--storage.tsdb.retention.time=30d \
--storage.tsdb.retention.size=5GB
Restart=always
[Install]
WantedBy=multi-user.target
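An edited unit file only takes effect after systemd reloads it and Prometheus restarts. The sketch below applies the change and then reads the active flags back from the Prometheus HTTP API endpoint `/api/v1/status/flags`. It assumes Prometheus listens on `localhost:9090`, and the function names are illustrative.

```shell
# Reload systemd and restart Prometheus so the edited unit file takes effect.
apply_prometheus_unit() {
  sudo systemctl daemon-reload
  sudo systemctl restart prometheus
}

# Read the JSON from /api/v1/status/flags on stdin and print only the
# retention flags, one per line.
retention_flags() {
  grep -oE '"storage\.tsdb\.retention\.(time|size)":"[^"]*"'
}

# Usage:
#   apply_prometheus_unit
#   curl -s http://localhost:9090/api/v1/status/flags | retention_flags
```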
User Roles and Permissions in Grafana:
- Log in to Grafana and navigate to the home menu bar. Expand the Administration section to create a team or add users.
- We created a team called FE DevOps and added our team members to the team.
- Team leads were given admin access, while other members were assigned viewer access.
When dealing with low disk space in Docker containers, consider the following steps:
- Clean Up Unused Resources:
  - Use `docker system prune -a` to remove stopped containers, unused networks, unused images, and build cache. Volumes are only removed when `--volumes` is added.
  - Schedule this as a regular task to keep your system clean.
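One way to schedule the cleanup is a file in `/etc/cron.d`. The sketch below is an assumption-laden example, not the team's actual schedule: the weekly Sunday 03:00 timing, the `/usr/bin/docker` path, and the file name `docker-prune` are all placeholders to adjust.

```shell
# Print a cron.d entry that prunes unused Docker data every Sunday at 03:00.
# -f skips the confirmation prompt, which a non-interactive job cannot answer.
prune_cron_entry() {
  echo '0 3 * * 0 root /usr/bin/docker system prune -af'
}

# Usage (as root):
#   docker system df                              # see what is using space first
#   prune_cron_entry > /etc/cron.d/docker-prune
```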
To troubleshoot high CPU usage, follow these steps:
- Limit CPU Usage:
  - You can limit the CPU usage of a Docker container using the `--cpus` flag. For example: `docker run --cpus="1.0" --name my_container <docker_image_name>`. Adjust the value (e.g., `1.0` for 100% of a single core) as needed.
- Monitor with `docker stats`:
  - Use `docker stats` to check CPU usage. The `CPU %` column shows the percentage of the host's CPU that the container is using.
  - If you notice high usage, investigate further.
- Troubleshoot Inside the Container:
  - Open a shell in the container using `docker exec -it YOUR-CONTAINER-ID /bin/bash`.
  - Run `top` to identify processes consuming CPU resources. This helps pinpoint the issue.
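With many containers running, a ranked snapshot is easier to read than the live `docker stats` view. The sketch below sorts `name cpu%` lines by usage. It assumes container names contain no spaces, and `sort_by_cpu` is an illustrative helper name.

```shell
# Sort "name cpu%" lines read from stdin by CPU usage, highest first.
# LC_ALL=C makes sort treat "." as the decimal separator on every system.
sort_by_cpu() {
  LC_ALL=C sort -k2,2nr
}

# Usage:
#   docker stats --no-stream --format '{{.Name}} {{.CPUPerc}}' | sort_by_cpu | head -n 5
#
# A container that is already running can be capped without recreating it:
#   docker update --cpus 1.0 my_container
```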
To troubleshoot a container that is down or keeps exiting, follow these steps:
- Override Entrypoint with a Shell:
  - Create and start a container from the same failing image.
  - Override the entrypoint with a shell (e.g., `sh` or `bash`): `docker run -it --entrypoint sh <image_name>`
  - This drops you into a shell session within the container, allowing you to run your script and investigate why it's exiting unexpectedly.
- Check Container Status and Logs:
  - Use `docker ps -a` to find the most recent stopped container.
  - Check its exit code, which often indicates why the container stopped, and check the logs with `docker logs`.
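The exit code is more useful once it is mapped to a likely cause. The sketch below covers the codes Docker reserves (125 to 127) and the common `128 + signal` values. The function name is illustrative, and anything outside these values is application-specific.

```shell
# Translate a container exit code into a likely cause.
explain_exit_code() {
  case "$1" in
    0)   echo "exited normally" ;;
    125) echo "docker run itself failed (for example, a bad flag)" ;;
    126) echo "container command could not be invoked" ;;
    127) echo "container command not found" ;;
    137) echo "killed by SIGKILL (128+9), often the OOM killer or docker kill" ;;
    139) echo "segmentation fault, SIGSEGV (128+11)" ;;
    143) echo "terminated by SIGTERM (128+15), for example docker stop" ;;
    *)   echo "application-specific exit code; check the logs" ;;
  esac
}

# Usage:
#   code=$(docker inspect --format '{{.State.ExitCode}}' YOUR-CONTAINER-ID)
#   explain_exit_code "$code"
#   docker logs --tail 50 YOUR-CONTAINER-ID
```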
To troubleshoot high memory usage, follow these steps:
- Identify Memory-Hungry Containers:
  - Use `docker stats` to find memory-intensive containers.
- Consider Reducing the Number of Containers:
  - Reduce the number of containers running on the same host to alleviate memory pressure.
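To pick out only the containers worth investigating, the `docker stats` output can be filtered by a memory threshold. This is a sketch with assumptions: the 80% threshold in the usage line is arbitrary, container names are assumed to contain no spaces, and `over_mem_threshold` is a hypothetical helper name.

```shell
# Read "name mem%" lines on stdin and print those at or above a threshold.
over_mem_threshold() {
  threshold="$1"
  LC_ALL=C awk -v limit="$threshold" '{
    pct = $2
    sub(/%/, "", pct)                  # strip the trailing percent sign
    if (pct + 0 >= limit + 0) print $1, $2
  }'
}

# Usage:
#   docker stats --no-stream --format '{{.Name}} {{.MemPerc}}' | over_mem_threshold 80
```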
To troubleshoot high CPU usage in dynamic environments, follow these steps:
- Limit CPU Usage:
  - Limit the CPU usage of a Docker container using the `--cpus` flag. For example: `docker run --cpus="1.0" --name my_container <docker_image_name>`. Adjust the value as needed.
- Monitor with `docker stats`:
  - Use `docker stats` to check CPU usage. Investigate further if you notice high usage.
- Troubleshoot Inside the Container:
  - Open a shell in the container using `docker exec -it YOUR-CONTAINER-ID /bin/bash`.
  - Run `top` to identify processes consuming CPU resources.
Made with ❤️ by Ravencodes | AugustHottie | CodeReaper0 | bySegunMoses | Suesue | DrInTech22 courtesy of @HNG-Internship