Setting up Monitoring with Grafana and Prometheus

Grafana and Prometheus are essential tools for monitoring applications and infrastructure.

Prometheus collects and stores metrics, providing real-time monitoring and alerting.
Grafana visualizes these metrics, creating dashboards that help track performance.

This documentation covers the installation and configuration of Grafana and Prometheus, including steps for setting up dashboards, configuring alerts, accessing and managing dashboards, and best practices for maintenance.

Prometheus Setup

Step 1: Download and Install Prometheus

Download Prometheus:

wget https://github.com/prometheus/prometheus/releases/download/v2.53.1/prometheus-2.53.1.linux-amd64.tar.gz
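Before extracting, it can be worth checking the archive against its published checksum. The following is a minimal sketch, assuming sha256sum is installed. verify_sha256 is a hypothetical helper, and in practice the expected hash would be taken from the sha256sums.txt file published with the release. The demonstration runs on a scratch file so it is safe to try anywhere.

```shell
# Sketch: compare a file's SHA-256 digest with an expected value.
# verify_sha256 is a hypothetical helper, not part of Prometheus.
verify_sha256() {
    file_to_check="$1"
    expected_hash="$2"
    actual_hash="$(sha256sum "$file_to_check" | awk '{ print $1 }')"
    [ "$actual_hash" = "$expected_hash" ]
}

# Self-contained demonstration on a scratch file.
demo_file="$(mktemp)"
printf 'example archive contents\n' > "$demo_file"
demo_hash="$(sha256sum "$demo_file" | awk '{ print $1 }')"

if verify_sha256 "$demo_file" "$demo_hash"; then
    echo "checksum ok"
else
    echo "checksum mismatch"
fi
```

For the real download, pass the archive name and the matching hash from sha256sums.txt instead of the scratch file.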

Extract the downloaded file:

tar xvfz prometheus-*.tar.gz

Move Prometheus binaries to the desired location:

sudo mv prometheus-2.53.1.linux-amd64 /usr/local/prometheus

Create symbolic links for easy access:

sudo ln -s /usr/local/prometheus/prometheus /usr/local/bin/prometheus
sudo ln -s /usr/local/prometheus/promtool /usr/local/bin/promtool

Create configuration directory and copy the configuration file:

sudo mkdir /etc/prometheus
sudo cp /usr/local/prometheus/prometheus.yml /etc/prometheus/

Step 2: Create Prometheus User and Set Permissions

Create a Prometheus user:

sudo useradd --no-create-home --shell /bin/false prometheus

Set ownership of Prometheus directories:

sudo chown -R prometheus:prometheus /usr/local/prometheus
sudo chown -R prometheus:prometheus /etc/prometheus

Step 3: Configure Prometheus as a Systemd Service

Create the Prometheus service file:

sudo tee /etc/systemd/system/prometheus-langlearnbe.service > /dev/null <<'EOL'
[Unit]
Description=Prometheus
Wants=network-online.target
After=network-online.target

[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/local/bin/prometheus \
    --config.file /etc/prometheus/langlearn-be.yml \
    --storage.tsdb.path /var/lib/prometheus/ \
    --web.console.templates=/usr/local/prometheus/consoles \
    --web.console.libraries=/usr/local/prometheus/console_libraries \
    --web.listen-address="127.0.0.1:9099" \
    --storage.tsdb.retention.time=15d

[Install]
WantedBy=multi-user.target
EOL

Create data storage directory and set permissions:

sudo mkdir /var/lib/prometheus
sudo chown prometheus:prometheus /var/lib/prometheus

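Reload systemd, then start and enable the Prometheus service. The service reads /etc/prometheus/langlearn-be.yml, which is created in Step 5, so it will not start cleanly until that file exists (Step 10 restarts it):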

sudo systemctl daemon-reload
sudo systemctl start prometheus-langlearnbe
sudo systemctl enable prometheus-langlearnbe

Step 4: Install and Configure Node Exporter

Download and extract Node Exporter, then run it (it runs in the foreground and listens on port 9100 by default):

wget https://github.com/prometheus/node_exporter/releases/download/v1.8.2/node_exporter-1.8.2.linux-amd64.tar.gz
tar xvfz node_exporter-1.8.2.linux-amd64.tar.gz
cd node_exporter-1.8.2.linux-amd64
./node_exporter
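Running Node Exporter from the terminal stops it when the session ends. One option is to run it as a systemd service, mirroring the Prometheus service in Step 3. The sketch below only prints a unit file for review. The binary path /usr/local/bin/node_exporter, the node_exporter user, and the unit contents are assumptions that are not part of the original setup, and node_exporter_unit is a hypothetical helper.

```shell
# Sketch: print a systemd unit for Node Exporter to stdout.
# The binary path and the node_exporter user are assumptions.
node_exporter_unit() {
    bin_path="${1:-/usr/local/bin/node_exporter}"
    cat <<EOF
[Unit]
Description=Node Exporter
Wants=network-online.target
After=network-online.target

[Service]
User=node_exporter
Group=node_exporter
Type=simple
ExecStart=${bin_path} --web.listen-address=127.0.0.1:9100

[Install]
WantedBy=multi-user.target
EOF
}

# Print the unit for review; nothing is written to disk here.
node_exporter_unit
```

To install it, the output could be piped to sudo tee /etc/systemd/system/node_exporter.service, followed by sudo systemctl daemon-reload and sudo systemctl enable --now node_exporter, once the binary has been copied into place and the user exists.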

Step 5: Prometheus Configuration for Monitoring

Create Prometheus configuration file:

sudo tee /etc/prometheus/langlearn-be.yml > /dev/null <<EOL
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9099']

  - job_name: 'node'
    static_configs:
      - targets: ['localhost:9100']

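  - job_name: 'application'
    # targets takes host:port only; the URL path goes in metrics_path.
    # Replace <PROMETHEUS_SECRET> with the secret used in the Laravel route.
    metrics_path: '/prometheus/metrics/<PROMETHEUS_SECRET>'
    static_configs:
      - targets: ['langlearn-deployment:80']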
EOL
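Each entry under targets is a host and port, while any URL path belongs in metrics_path. As a rough pre-flight check, the sketch below tests whether a string has the host:port shape used throughout this guide. is_valid_target is a hypothetical helper, not a Prometheus tool, and it deliberately ignores IPv6 literals.

```shell
# Sketch: succeed only when a scrape target looks like host:port.
# is_valid_target is a hypothetical helper; IPv6 literals are not handled.
is_valid_target() {
    printf '%s\n' "$1" | grep -Eq '^[A-Za-z0-9._-]+:[0-9]+$'
}

for target in 'localhost:9100' 'langlearn-deployment/prometheus/metrics'; do
    if is_valid_target "$target"; then
        echo "ok:      $target"
    else
        echo "invalid: $target"
    fi
done
```

The promtool check config command remains the authoritative way to validate the whole file.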

Step 6: Install and Configure PHP APCu

Install APCu extension:

sudo apt-get install php8.2-apcu -y

Enable APCu extension:

sudo bash -c 'echo "extension=apcu.so" > /etc/php/8.2/mods-available/apcu.ini'
sudo phpenmod apcu

Restart PHP-FPM and Nginx:

sudo systemctl restart php8.2-fpm
sudo systemctl restart nginx

Step 7: Laravel Integration with Prometheus

Install Prometheus Client for PHP:

composer require promphp/prometheus_client_php

Create Prometheus Service Provider:

php artisan make:provider PrometheusServiceProvider

Add the following code to app/Providers/PrometheusServiceProvider.php:

<?php

namespace App\Providers;

use Illuminate\Support\ServiceProvider;
use Prometheus\CollectorRegistry;
use Prometheus\Storage\APC;

class PrometheusServiceProvider extends ServiceProvider
{
    public function register(): void
    {
        $this->app->singleton(CollectorRegistry::class, function () {
            $adapter = new APC();
            return new CollectorRegistry($adapter);
        });
    }

    public function boot(): void
    {
        //
    }
}

Register the service provider in config/app.php:

sed -i '/App\\Providers\\RouteServiceProvider::class,/a \ \ \ \ \ \ \ \ App\\Providers\\PrometheusServiceProvider::class,' config/app.php

Create Middleware for Monitoring Application Traffic and Response Times:

php artisan make:middleware Prometheus/PrometheusMetricsMiddleware

Add the following code to app/Http/Middleware/Prometheus/PrometheusMetricsMiddleware.php:

<?php

namespace App\Http\Middleware\Prometheus;

use Closure;
use Prometheus\CollectorRegistry;

class PrometheusMetricsMiddleware
{
    protected $registry;

    public function __construct(CollectorRegistry $registry)
    {
        $this->registry = $registry;
    }

    public function handle($request, Closure $next)
    {
        if (!defined('LARAVEL_START')) {
            define('LARAVEL_START', microtime(true));
        }

        $response = $next($request);

        $counter = $this->registry->getOrRegisterCounter(
            'app',
            'requests_total',
            'Total number of requests',
            ['method', 'endpoint']
        );
        $counter->inc([$request->method(), $request->path()]);

        $histogram = $this->registry->getOrRegisterHistogram(
            'app',
            'request_duration_seconds',
            'Request duration in seconds',
            ['method', 'endpoint']
        );
        $histogram->observe(microtime(true) - LARAVEL_START, [$request->method(), $request->path()]);

        return $response;
    }
}

Add the middleware to the api group in app/Http/Kernel.php:

'api' => [
    \Illuminate\Routing\Middleware\ThrottleRequests::class.':api',
    \Illuminate\Routing\Middleware\SubstituteBindings::class,
    \App\Http\Middleware\Prometheus\PrometheusMetricsMiddleware::class,
],

Configure Exception Handling for Monitoring:

cat <<'EOL' > app/Exceptions/Handler.php
<?php

namespace App\Exceptions;

use Illuminate\Foundation\Exceptions\Handler as ExceptionHandler;
use Illuminate\Support\Facades\App;
use Prometheus\CollectorRegistry;
use Throwable;

class Handler extends ExceptionHandler
{
    protected $dontFlash = [
        'current_password',
        'password',
        'password_confirmation',
    ];

    public function register(): void
    {
        $this->reportable(function (Throwable $e) {
            $request = request();
            $registry = App::make(CollectorRegistry::class);
            $counter = $registry->getOrRegisterCounter(
                'app', 
                'errors_total', 
                'Total number of errors', 
                ['type', 'endpoint']
            );
            $counter->inc([get_class($e), $request->path()]);
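            // Laravel still logs the exception through its default logging
            // stack after this callback returns, so no manual logger call
            // or rethrow is needed here.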
        });
    }
}
EOL

Create Metrics Controller:

php artisan make:controller Prometheus/MetricsController

Add the following code to app/Http/Controllers/Prometheus/MetricsController.php:

<?php

namespace App\Http\Controllers\Prometheus;

use App\Http\Controllers\Controller;
use Prometheus\CollectorRegistry;
use Prometheus\RenderTextFormat;

class MetricsController extends Controller
{
    protected $registry;

    public function __construct(CollectorRegistry $registry)
    {
        $this->registry = $registry;
    }

    public function metrics()
    {
        $renderer = new RenderTextFormat();
        $metrics = $renderer->render($this->registry->getMetricFamilySamples());

        return response($metrics, 200, ['Content-Type' => RenderTextFormat::MIME_TYPE]);
    }
}

Create Middleware to Restrict Metrics Access to Localhost:

Add the following code to app/Http/Middleware/Prometheus/AllowLocalhostOnly.php:

<?php

namespace App\Http\Middleware\Prometheus;

use Closure;
use Illuminate\Http\Request;
use Symfony\Component\HttpFoundation\Response;

class AllowLocalhostOnly
{
    public function handle(Request $request, Closure $next): Response
    {
        if (!$this->isLocalhost($request->ip())) {
            return response('Unauthorized.', 401);
        }
        return $next($request);
    }

    protected function isLocalhost($ip)
    {
        return in_array($ip, ['127.0.0.1', '::1']);
    }
}


Add Route for Metrics:

Add the following route to your routes file (for example routes/web.php) so that only localhost can reach the Prometheus metrics:

use App\Http\Controllers\Prometheus\MetricsController;
use App\Http\Middleware\Prometheus\AllowLocalhostOnly;

Route::middleware([AllowLocalhostOnly::class])->group(function () {
    Route::get('/prometheus/metrics/' . env('PROMETHEUS_SECRET', 'tzUDnGrzj5j3vp0HSz0HKj3LNrYf1cj9'), [MetricsController::class, 'metrics']);
});

Step 8: Nginx Configuration

Add the following Nginx server blocks to configure custom local domain names and directories:

Configuration for langlearn-api:

server {
    server_name langlearn-api;
    root /var/www/langlearnai-be/api/public;
    index index.php index.html;

    location / {
        try_files $uri $uri/ /index.php?$query_string;
    }

    location ~ \.php$ {
        include snippets/fastcgi-php.conf;
        fastcgi_pass unix:/var/run/php/php8.2-fpm.sock;
        fastcgi_param SCRIPT_FILENAME $realpath_root$fastcgi_script_name;
        include fastcgi_params;
    }

    location ~ /\.ht {
        deny all;
    }
}

Configuration for langlearn-deployment:

server {
    server_name langlearn-deployment;
    root /var/www/langlearnai-be/deployment/public;
    index index.php index.html;

    location / {
        try_files $uri $uri/ /index.php?$query_string;
    }

    location ~ \.php$ {
        include snippets/fastcgi-php.conf;
        fastcgi_pass unix:/var/run/php/php8.2-fpm.sock;
        fastcgi_param SCRIPT_FILENAME $realpath_root$fastcgi_script_name;
        include fastcgi_params;
    }

    location ~ /\.ht {
        deny all;
    }
}

Configuration for langlearn-staging:

server {
    server_name langlearn-staging;
    root /var/www/langlearnai-be/staging/public;
    index index.php index.html;

    location / {
        try_files $uri $uri/ /index.php?$query_string;
    }

    location ~ \.php$ {
        include snippets/fastcgi-php.conf;
        fastcgi_pass unix:/var/run/php/php8.2-fpm.sock;
        fastcgi_param SCRIPT_FILENAME $realpath_root$fastcgi_script_name;
        include fastcgi_params;
    }

    location ~ /\.ht {
        deny all;
    }
}

Step 9: Update /etc/hosts File

Add the following entries to your /etc/hosts file to complete the setup:

127.0.0.1 langlearn-api
127.0.0.1 langlearn-deployment
127.0.0.1 langlearn-staging
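Appending these lines by hand is fine, but re-running a setup script would add duplicates. The sketch below adds an entry only when the host name is not already mapped. add_host_entry and has_host_entry are hypothetical helpers, and the demonstration writes to a scratch file rather than /etc/hosts.

```shell
# Sketch: add "ip name" to a hosts file only if the name is not mapped yet.
# Both functions are hypothetical helpers for illustration.
has_host_entry() {
    # Exit 0 when any non-comment line lists the name as a host field.
    awk -v n="$2" '$1 !~ /^#/ { for (i = 2; i <= NF; i++) if ($i == n) found = 1 }
                   END { exit !found }' "$1"
}

add_host_entry() {
    hosts_file="$1"
    entry_ip="$2"
    entry_name="$3"
    if ! has_host_entry "$hosts_file" "$entry_name"; then
        printf '%s %s\n' "$entry_ip" "$entry_name" >> "$hosts_file"
    fi
}

# Demonstration on a scratch file; the second call for each host is a no-op.
scratch="$(mktemp)"
for host in langlearn-api langlearn-deployment langlearn-staging; do
    add_host_entry "$scratch" 127.0.0.1 "$host"
    add_host_entry "$scratch" 127.0.0.1 "$host"
done
cat "$scratch"
```

To apply this to the real file, the functions would need to run in a root shell with /etc/hosts as the first argument.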

Step 10: Restart Services

Restart the necessary services to apply the changes:

sudo systemctl restart php8.2-fpm
sudo systemctl restart nginx
sudo systemctl restart prometheus-langlearnbe
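After the restart, it helps to confirm that the metrics endpoint answers before looking at Prometheus. The sketch below only builds the URL that Prometheus is expected to scrape. metrics_url is a hypothetical helper, EXAMPLE_SECRET is a placeholder for the value of PROMETHEUS_SECRET, and the path assumes the route is not registered in routes/api.php (Laravel would then add an /api prefix).

```shell
# Sketch: build the metrics URL from a host name and the route secret.
# metrics_url is a hypothetical helper; EXAMPLE_SECRET is a placeholder.
metrics_url() {
    url_host="$1"
    url_secret="$2"
    printf 'http://%s/prometheus/metrics/%s\n' "$url_host" "$url_secret"
}

metrics_url langlearn-deployment EXAMPLE_SECRET
```

On the server, passing the result to curl -s should return plain-text metrics, and http://localhost:9099/api/v1/targets reports the health of each scrape target as seen by Prometheus.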

Grafana Setup

Install the prerequisite packages:

sudo apt-get install -y apt-transport-https software-properties-common wget

Import GPG Key

sudo mkdir -p /etc/apt/keyrings/
wget -q -O - https://apt.grafana.com/gpg.key | gpg --dearmor | sudo tee /etc/apt/keyrings/grafana.gpg > /dev/null

Add a repository for stable releases

echo "deb [signed-by=/etc/apt/keyrings/grafana.gpg] https://apt.grafana.com/ stable main" | sudo tee -a /etc/apt/sources.list.d/grafana.list

Add a repository for beta releases (optional; skip this unless you want beta versions):

echo "deb [signed-by=/etc/apt/keyrings/grafana.gpg] https://apt.grafana.com/ beta main" | sudo tee -a /etc/apt/sources.list.d/grafana.list

Update the list of available packages:

sudo apt-get update

Install Grafana OSS:

sudo apt-get install grafana

Verify that the service is running

sudo systemctl status grafana-server

If it is not running, reload systemd and start it:

sudo systemctl daemon-reload
sudo systemctl start grafana-server
sudo systemctl status grafana-server

Configure the Grafana server to start at boot

sudo systemctl enable grafana-server.service

Log in at http://<ip>:3000 (3000 is Grafana's default port).

The default username and password are admin/admin. Grafana prompts you to change the password on first login.

Connect Grafana to Prometheus

To add the Prometheus data source, complete the following steps:

Click Connections in the left-side menu.
Under Connections, click Add new connection.
Enter Prometheus in the search bar.
Select Prometheus.
Click Add new data source in the upper right.

You will be taken to the Settings tab, where you set up the Prometheus connection.

In the Connection section, enter http://localhost:9099 as the Prometheus server URL. Prometheus listens on 127.0.0.1:9099 in this setup, so Grafana must run on the same host.
Click Save & test.

Explore Metrics and Create a Dashboard

Open the menu and click Explore.
Select Prometheus as the data source.
Run a query against the collected metrics.
Open the menu and click Dashboards.
Click Create dashboard.
Click Add visualization.
Select Prometheus as the data source.
Switch the query editor to Code mode (the toggle on the right side of the query editor) and paste this query for CPU metrics: rate(node_cpu_seconds_total{mode="user"}[1m])

For a prebuilt dashboard

Click Import dashboard.
Enter the Grafana dashboard URL or ID, for example https://grafana.com/grafana/dashboards/1860, and click Load.
Select Prometheus as the data source and click Import.

Configure a Slack Contact Point

Click Alerting in the menu.
Click Contact points to set up Slack notifications.
Set a name (the alert rules in this document use HNG Slack notification).
Select Slack as the integration.
Provide either a Slack API token or a webhook URL.
Click Save.

Configure Alert

Navigate to the dashboard that contains the panel you want to alert on.
Click the panel that displays CPU metrics, then click the edit (pencil) icon.
Open the Alert tab.
Click to create a new alert rule.
Enter a name in the alert rule name field.
Input the query and define the alert conditions.
Click Save rule and exit.

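Alert Rules in a .yml File

The following file defines the alert rules in Grafana's alerting provisioning format. It can be loaded from Grafana's provisioning/alerting directory (typically /etc/grafana/provisioning/alerting/ on Debian-based installs). The datasourceUid, dashboardUid, and panelId values are specific to the original installation and need to match your own.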

apiVersion: 1
groups:
    - orgId: 1
      name: Server Metrics
      folder: Server Metrics
      interval: 1m
      rules:
        - uid: bdt7z9wlz5xj4c
          title: CPU Usage
          condition: H
          data:
            - refId: A
              relativeTimeRange:
                from: 86400
                to: 0
              datasourceUid: edt4gy6d7zkzka
              model:
                datasource:
                    type: prometheus
                    uid: edt4gy6d7zkzka
                editorMode: code
                exemplar: false
                expr: 100 * (1 - avg(rate(node_cpu_seconds_total{mode="idle"}[1m])) by (instance))
                format: time_series
                instant: true
                interval: ""
                intervalFactor: 1
                intervalMs: 15000
                legendFormat: Busy System
                maxDataPoints: 43200
                range: false
                refId: A
                step: 240
            - refId: H
              relativeTimeRange:
                from: 600
                to: 0
              datasourceUid: __expr__
              model:
                conditions:
                    - evaluator:
                        params:
                            - 80
                        type: gt
                      operator:
                        type: and
                      query:
                        params:
                            - H
                      reducer:
                        params: []
                        type: last
                      type: query
                datasource:
                    type: __expr__
                    uid: __expr__
                expression: A
                intervalMs: 1000
                maxDataPoints: 43200
                refId: H
                type: threshold
          dashboardUid: rYdddlPWk
          panelId: 77
          noDataState: OK
          execErrState: Error
          for: 1m
          annotations:
            __dashboardUid__: rYdddlPWk
            __panelId__: "77"
            resolved: "- Description: The current CPU usage dropped below 80%. \n- Value:  {{ humanize (index $values \"A\").Value }}%"
            summary: "- Description: The current CPU usage exceeded 80%. \n- Value:  {{ humanize (index $values \"A\").Value }}%"
          labels:
            type: CPU Utilization Alert
          isPaused: false
          notification_settings:
            receiver: HNG Slack notification
        - uid: fdt9ih6gmttdsf
          title: Disk Usage
          condition: C
          data:
            - refId: D
              relativeTimeRange:
                from: 600
                to: 0
              datasourceUid: edt4gy6d7zkzka
              model:
                datasource:
                    type: prometheus
                    uid: edt4gy6d7zkzka
                editorMode: code
                expr: max(100 - ((node_filesystem_avail_bytes * 100) / node_filesystem_size_bytes)) by (instance)
                instant: true
                intervalMs: 1000
                legendFormat: __auto
                maxDataPoints: 43200
                range: false
                refId: D
            - refId: C
              relativeTimeRange:
                from: 600
                to: 0
              datasourceUid: __expr__
              model:
                conditions:
                    - evaluator:
                        params:
                            - 80
                        type: gt
                      operator:
                        type: and
                      query:
                        params:
                            - C
                      reducer:
                        params: []
                        type: last
                      type: query
                datasource:
                    type: __expr__
                    uid: __expr__
                expression: D
                intervalMs: 1000
                maxDataPoints: 43200
                refId: C
                type: threshold
          dashboardUid: rYdddlPWk
          panelId: 152
          noDataState: OK
          execErrState: Error
          for: 1m
          annotations:
            __dashboardUid__: rYdddlPWk
            __panelId__: "152"
            resolved: "- Description: The current disk usage has dropped below 80%. \n- Value:  {{ humanize (index $values \"D\").Value }}% In use"
            summary: "- Description: The current disk usage is now above 80%. \n- Value:  {{ humanize (index $values \"D\").Value }}% In use"
          labels:
            type: High Disk Usage
          isPaused: false
          notification_settings:
            receiver: HNG Slack notification
        - uid: ddt802h77g4xsa
          title: Memory Usage
          condition: G
          data:
            - refId: A
              relativeTimeRange:
                from: 86400
                to: 0
              datasourceUid: edt4gy6d7zkzka
              model:
                datasource:
                    type: prometheus
                    uid: edt4gy6d7zkzka
                editorMode: code
                expr: (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100
                format: time_series
                instant: true
                interval: ""
                intervalFactor: 1
                intervalMs: 15000
                legendFormat: RAM Total
                maxDataPoints: 43200
                range: false
                refId: A
                step: 240
            - refId: B
              relativeTimeRange:
                from: 600
                to: 0
              datasourceUid: edt4gy6d7zkzka
              model:
                datasource:
                    type: prometheus
                    uid: edt4gy6d7zkzka
                editorMode: code
                expr: node_memory_MemAvailable_bytes / 1024 / 1024 / 1024
                instant: true
                intervalMs: 1000
                legendFormat: __auto
                maxDataPoints: 43200
                range: false
                refId: B
            - refId: C
              relativeTimeRange:
                from: 600
                to: 0
              datasourceUid: edt4gy6d7zkzka
              model:
                datasource:
                    type: prometheus
                    uid: edt4gy6d7zkzka
                editorMode: code
                expr: node_memory_MemTotal_bytes / 1024 / 1024 / 1024
                instant: true
                intervalMs: 1000
                legendFormat: __auto
                maxDataPoints: 43200
                range: false
                refId: C
            - refId: G
              relativeTimeRange:
                from: 86400
                to: 0
              datasourceUid: __expr__
              model:
                conditions:
                    - evaluator:
                        params:
                            - 10
                        type: lt
                      operator:
                        type: and
                      query:
                        params:
                            - G
                      reducer:
                        params: []
                        type: last
                      type: query
                datasource:
                    type: __expr__
                    uid: __expr__
                expression: A
                intervalMs: 1000
                maxDataPoints: 43200
                refId: G
                type: threshold
          dashboardUid: rYdddlPWk
          panelId: 78
          noDataState: OK
          execErrState: Error
          for: 1m
          annotations:
            __dashboardUid__: rYdddlPWk
            __panelId__: "78"
            resolved: |-
                - Description: The available memory on the system has increased to more than 10% of the total memory.
                - Value: {{ humanize (index $values "A").Value }}% ({{ humanize (index $values "B").Value }} GB remaining out of {{ humanize (index $values "C").Value }} GB total)
            summary: |-
                - Description: The available memory on the system has dropped below 10% of the total memory.
                - Value: {{ humanize (index $values "A").Value }}% ({{ humanize (index $values "B").Value }} GB remaining out of {{ humanize (index $values "C").Value }} GB total)
          labels:
            type: Available RAM alert
          isPaused: false
          notification_settings:
            receiver: HNG Slack notification
    - orgId: 1
      name: Application
      folder: Application Error
      interval: 1m
      rules:
        - uid: ddt8rvismshz4b
          title: Application Error
          condition: C
          data:
            - refId: A
              relativeTimeRange:
                from: 300
                to: 0
              datasourceUid: edt4gy6d7zkzka
              model:
                datasource:
                    type: prometheus
                    uid: edt4gy6d7zkzka
                disableTextWrap: false
                editorMode: code
                exemplar: false
                expr: rate(app_errors_total{}[5m])
                fullMetaSearch: false
                includeNullMetadata: true
                instant: false
                interval: ""
                intervalMs: 15000
                legendFormat: __auto
                maxDataPoints: 43200
                range: true
                refId: A
                useBackend: false
            - refId: B
              relativeTimeRange:
                from: 300
                to: 0
              datasourceUid: __expr__
              model:
                conditions:
                    - evaluator:
                        params:
                            - 0
                            - 0
                        type: gt
                      operator:
                        type: and
                      query:
                        params: []
                      reducer:
                        params: []
                        type: avg
                      type: query
                datasource:
                    name: Expression
                    type: __expr__
                    uid: __expr__
                expression: A
                intervalMs: 1000
                maxDataPoints: 43200
                reducer: sum
                refId: B
                type: reduce
            - refId: C
              relativeTimeRange:
                from: 300
                to: 0
              datasourceUid: __expr__
              model:
                conditions:
                    - evaluator:
                        params:
                            - 0.0167
                            - 0
                        type: gt
                      operator:
                        type: and
                      query:
                        params: []
                      reducer:
                        params: []
                        type: avg
                      type: query
                datasource:
                    name: Expression
                    type: __expr__
                    uid: __expr__
                expression: B
                intervalMs: 1000
                maxDataPoints: 43200
                refId: C
                type: threshold
          dashboardUid: rYdddlPWk
          panelId: 325
          noDataState: OK
          execErrState: Error
          for: 1m
          annotations:
            __dashboardUid__: rYdddlPWk
            __panelId__: "325"
            resolved: "- Description: No error has been detected here in the past 5 minutes.\n- Details: \n{{ $ran := false }}\n{{ range $k, $v := $values -}}\n  {{ if not $ran -}}\n     {{ $v.Labels }}\n    {{ $ran = true }}\n  {{ end -}}\n{{ end -}}"
            summary: "- Description: The error rate for this application has surpassed the threshold of 10 errors within a 5-minute period for this specific error.\n- Details: \n{{ $ran := false }}\n{{ range $k, $v := $values -}}\n  {{ if not $ran -}}\n     {{ $v.Labels }}\n    {{ $ran = true }}\n  {{ end -}}\n{{ end -}}"
          labels:
            type: Application Error
          isPaused: false
          notification_settings:
            receiver: HNG Slack notification
    - orgId: 1
      name: No Data
      folder: Application Error
      interval: 5m
      rules:
        - uid: edtd3j7f6iry8a
          title: Application Down
          condition: B
          data:
            - refId: A
              relativeTimeRange:
                from: 600
                to: 0
              datasourceUid: edt4gy6d7zkzka
              model:
                datasource:
                    type: prometheus
                    uid: edt4gy6d7zkzka
                editorMode: code
                exemplar: false
                expr: up
                instant: true
                intervalMs: 1000
                legendFormat: __auto
                maxDataPoints: 43200
                range: false
                refId: A
            - refId: B
              relativeTimeRange:
                from: 600
                to: 0
              datasourceUid: __expr__
              model:
                conditions:
                    - evaluator:
                        params:
                            - 1
                            - 0
                        type: lt
                      operator:
                        type: and
                      query:
                        params: []
                      reducer:
                        params: []
                        type: avg
                      type: query
                datasource:
                    name: Expression
                    type: __expr__
                    uid: __expr__
                expression: A
                hide: false
                intervalMs: 1000
                maxDataPoints: 43200
                refId: B
                type: threshold
          dashboardUid: rYdddlPWk
          panelId: 325
          noDataState: OK
          execErrState: Alerting
          for: 5m
          annotations:
            __dashboardUid__: rYdddlPWk
            __panelId__: "325"
            resolved: |-
                - Description:  The below address is back up
                - Details:
                {{ $ran := false }}
                {{ range $k, $v := $values -}}
                  {{ if not $ran -}}
                    {{ $k }}: {{ $v.Labels }}
                    {{ $ran = true }}
                  {{ end -}}
                {{ end -}}
            summary: |-
                - Description:  The below address is down
                - Details:
                {{ $ran := false }}
                {{ range $k, $v := $values -}}
                  {{ if not $ran -}}
                    {{ $k }}: {{ $v.Labels }}
                    {{ $ran = true }}
                  {{ end -}}
                {{ end -}}
          labels:
            type: Application is Down
          isPaused: false
          notification_settings:
            receiver: HNG Slack notification
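The Application Error rule compares rate(app_errors_total[5m]), which is a per-second rate, against 0.0167. To relate a per-second threshold to a number of events in a time window, multiply it by the window length in seconds. The sketch below does that arithmetic. events_per_window is a hypothetical helper, and because the rule also applies a sum reduce step before the threshold, the result is only a rough guide to when the alert fires.

```shell
# Sketch: convert a per-second rate threshold into events per window.
# events_per_window is a hypothetical helper that uses awk for decimal math.
events_per_window() {
    rate_per_second="$1"
    window_seconds="$2"
    awk -v r="$rate_per_second" -v w="$window_seconds" \
        'BEGIN { printf "%.2f\n", r * w }'
}

# 0.0167 errors per second over a 5-minute (300 s) window
events_per_window 0.0167 300   # prints 5.01
```

Going the other way, a target of N events per window corresponds to a per-second threshold of N divided by the window length in seconds.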

Data Retention

By default, Prometheus retains data for 15 days. To change the retention period, adjust the --storage.tsdb.retention.time flag in the service file from Step 3 (for example, 30d) and restart the service.
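Retention also determines how much disk space the data directory needs. The Prometheus storage documentation gives a rough sizing formula: retention time in seconds, multiplied by ingested samples per second, multiplied by bytes per sample (about 1 to 2 bytes on average). The sketch below applies it. estimate_tsdb_bytes is a hypothetical helper, and the ingestion rate and bytes-per-sample figures are illustrative assumptions rather than measurements from this setup.

```shell
# Sketch: rough TSDB disk estimate.
#   needed_bytes = retention_seconds * samples_per_second * bytes_per_sample
# estimate_tsdb_bytes is a hypothetical helper; inputs are whole numbers.
estimate_tsdb_bytes() {
    retention_days="$1"
    samples_per_second="$2"
    bytes_per_sample="$3"
    echo $(( retention_days * 86400 * samples_per_second * bytes_per_sample ))
}

# 15 days of retention, 1000 samples/s, 2 bytes per sample
estimate_tsdb_bytes 15 1000 2   # prints 2592000000 (about 2.6 GB)
```

The actual ingestion rate can be read from Prometheus itself with a query such as rate(prometheus_tsdb_head_samples_appended_total[5m]).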

Troubleshooting Tips

1. Check Service Status

Ensure Prometheus, Grafana, and related services are running:

sudo systemctl status prometheus-langlearnbe
sudo systemctl status grafana-server

2. Verify Configuration Files

Check the Prometheus configuration file (/etc/prometheus/langlearn-be.yml) for correctness.
Inspect the logs for errors or warnings:

sudo journalctl -u prometheus-langlearnbe
sudo journalctl -u grafana-server
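The promtool binary linked in Step 1 can validate the configuration file before a restart. A minimal sketch, assuming promtool is on the PATH: check_prom_config is a hypothetical wrapper that skips the check when promtool or the file is unavailable, so it is safe to run anywhere.

```shell
# Sketch: validate a Prometheus config with promtool when both are available.
# check_prom_config is a hypothetical wrapper around "promtool check config".
check_prom_config() {
    config_file="$1"
    if ! command -v promtool >/dev/null 2>&1; then
        echo "skipped: promtool not found"
        return 0
    fi
    if [ ! -r "$config_file" ]; then
        echo "skipped: cannot read $config_file"
        return 0
    fi
    promtool check config "$config_file"
}

check_prom_config /etc/prometheus/langlearn-be.yml
```

A non-zero exit status from promtool indicates a syntax or semantic error in the file, which is worth fixing before restarting the service.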

3. Verify Network Connectivity

Ensure Prometheus can reach its scrape targets and that the targets are returning valid metrics.

4. Monitor Resource Usage

Ensure Prometheus has adequate CPU and memory resources to operate effectively.

5. Check Alerting Configuration

If alerts are not firing as expected, verify the alert rules and contact points in Grafana Alerting (or your Alertmanager configuration, if you use Alertmanager).

Best Practices

1. Regular Backups

Regularly back up Prometheus configuration files and data to prevent data loss.

2. Version Control

Store your configuration files in a version control system like Git.

3. Automate Configuration Management

Use tools like Ansible, Puppet, or Chef to manage your Prometheus setup.

4. Monitor Prometheus

Set up alerts to monitor the health and performance of Prometheus itself.

5. Keep Software Updated

Regularly update Prometheus and related tools to the latest stable versions.

6. Documentation

Document your monitoring setup, including configuration files, architecture, and troubleshooting steps.

7. Security

Limit access to the Prometheus web UI and metrics endpoints.

8. Redundancy

Set up redundant Prometheus instances to avoid single points of failure and use load balancers to distribute the load.

9. Alert Management

Regularly test alerting rules and tune thresholds to avoid alert fatigue.

Made with ❤️ by Olat-nji | Ujusophy | tulbadex | Darkk-kami | Otie16 courtesy of @HNG-Internship
