Setting up Monitoring with Grafana and Prometheus

Njoku Ujunwa Sophia edited this page Aug 1, 2024 · 6 revisions

Grafana and Prometheus are essential tools for monitoring applications and infrastructure.

  • Prometheus collects and stores metrics, providing real-time monitoring and alerting.

  • Grafana visualizes these metrics, creating dashboards that help track performance.

This documentation covers the installation and configuration of Grafana and Prometheus, including steps for setting up dashboards, configuring alerts, accessing and managing dashboards, and best practices for maintenance.


Prometheus Setup

Step 1: Download and Install Prometheus

  1. Download Prometheus:
wget https://github.com/prometheus/prometheus/releases/download/v2.53.1/prometheus-2.53.1.linux-amd64.tar.gz
  1. Extract the downloaded file:
tar xvfz prometheus-*.tar.gz
  1. Move Prometheus binaries to the desired location:
sudo mv prometheus-2.53.1.linux-amd64 /usr/local/prometheus
  1. Create symbolic links for easy access:
sudo ln -s /usr/local/prometheus/prometheus /usr/local/bin/prometheus
sudo ln -s /usr/local/prometheus/promtool /usr/local/bin/promtool
  1. Create configuration directory and copy the configuration file:
sudo mkdir /etc/prometheus
sudo cp /usr/local/prometheus/prometheus.yml /etc/prometheus/
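
Before extracting a downloaded archive it can be worth checking its SHA-256 digest. The sketch below is one way to do that; `verify_sha256` is a hypothetical helper name, and the usage lines assume you have also downloaded the `sha256sums.txt` file from the same release page.

```shell
# verify_sha256 FILE EXPECTED_HEX
# Succeeds only when FILE exists and its SHA-256 digest equals EXPECTED_HEX.
verify_sha256() {
    local file="$1" expected="$2" actual
    actual="$(sha256sum "$file" 2>/dev/null | awk '{print $1}')"
    [ -n "$actual" ] && [ "$actual" = "$expected" ]
}

# Usage (assumes sha256sums.txt sits next to the archive):
# expected="$(grep 'prometheus-2.53.1.linux-amd64.tar.gz' sha256sums.txt | awk '{print $1}')"
# verify_sha256 prometheus-2.53.1.linux-amd64.tar.gz "$expected" || echo "checksum mismatch" >&2
```

If the check fails, download the archive again rather than extracting it.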

Step 2: Create Prometheus User and Set Permissions

  1. Create a Prometheus user:
sudo useradd --no-create-home --shell /bin/false prometheus
  1. Set ownership of Prometheus directories:
sudo chown -R prometheus:prometheus /usr/local/prometheus
sudo chown -R prometheus:prometheus /etc/prometheus

Step 3: Configure Prometheus as a Systemd Service

  1. Create the Prometheus service file:
sudo tee /etc/systemd/system/prometheus-langlearnbe.service > /dev/null <<EOL
[Unit]
Description=Prometheus
Wants=network-online.target
After=network-online.target

[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/local/bin/prometheus \
    --config.file /etc/prometheus/langlearn-be.yml \
    --storage.tsdb.path /var/lib/prometheus/ \
    --web.console.templates=/usr/local/prometheus/consoles \
    --web.console.libraries=/usr/local/prometheus/console_libraries \
    --web.listen-address="127.0.0.1:9099" \
    --storage.tsdb.retention.time=15d

[Install]
WantedBy=multi-user.target
EOL
  1. Create data storage directory and set permissions:
sudo mkdir /var/lib/prometheus
sudo chown prometheus:prometheus /var/lib/prometheus
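  1. Reload systemd, then start and enable the Prometheus service. The service starts cleanly only once /etc/prometheus/langlearn-be.yml from Step 5 exists; Step 10 restarts it after everything is in place: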
sudo systemctl daemon-reload
sudo systemctl start prometheus-langlearnbe
sudo systemctl enable prometheus-langlearnbe
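
After starting the service it may take a moment before Prometheus answers requests. The sketch below retries a probe command a few times; `wait_until` is a hypothetical helper, and the usage line assumes `curl` is installed and that Prometheus listens on 127.0.0.1:9099 as configured in the service file. `/-/healthy` is Prometheus's built-in health endpoint.

```shell
# wait_until ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND up to ATTEMPTS times, sleeping DELAY_SECONDS between tries.
# Returns 0 as soon as COMMAND succeeds, 1 if every attempt fails.
wait_until() {
    local attempts="$1" delay="$2" i=1
    shift 2
    while [ "$i" -le "$attempts" ]; do
        "$@" >/dev/null 2>&1 && return 0
        i=$((i + 1))
        # Only sleep when another attempt is still to come.
        [ "$i" -le "$attempts" ] && sleep "$delay"
    done
    return 1
}

# Usage:
# wait_until 10 2 curl -fsS http://127.0.0.1:9099/-/healthy || echo "Prometheus is not healthy" >&2
```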

Step 4: Install and Configure Node Exporter

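  1. Download and extract Node Exporter:
wget https://github.com/prometheus/node_exporter/releases/download/v1.8.2/node_exporter-1.8.2.linux-amd64.tar.gz
tar xvfz node_exporter-1.8.2.linux-amd64.tar.gz
  1. Start Node Exporter. It runs in the foreground and serves metrics on port 9100, so keep the terminal open or run it under a process supervisor such as systemd:
cd node_exporter-1.8.2.linux-amd64
./node_exporter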
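
To confirm that Node Exporter is actually serving metrics, you can look for a known metric name in its output. The following is a minimal sketch; `has_metric` is a hypothetical helper that reads the Prometheus text exposition format on stdin, and the usage line assumes `curl` is installed and Node Exporter is on its default port 9100.

```shell
# has_metric NAME
# Reads Prometheus text exposition format on stdin and succeeds if at least
# one sample line for metric NAME is present. HELP/TYPE comment lines start
# with '#', so they never match.
has_metric() {
    grep -Eq "^$1(\{| )"
}

# Usage:
# curl -fsS http://localhost:9100/metrics | has_metric node_cpu_seconds_total \
#     && echo "node exporter is serving CPU metrics"
```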

Step 5: Prometheus Configuration for Monitoring

  1. Create Prometheus configuration file:
sudo tee /etc/prometheus/langlearn-be.yml > /dev/null <<EOL
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9099']

  - job_name: 'node'
    static_configs:
      - targets: ['localhost:9100']

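  - job_name: 'application'
    # A target is host:port only; the URL path goes in metrics_path.
    # Append the secret segment used by the Laravel metrics route to this path.
    # Port 80 assumes the Nginx server block listens on its default port.
    metrics_path: /prometheus/metrics
    static_configs:
      - targets: ['langlearn-deployment:80']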
EOL
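
Prometheus builds the scrape URL for each target from the job's scheme (default `http`), a target in `host:port` form, and the job's `metrics_path` (default `/metrics`). A target that contains a path is rejected when the configuration is loaded. The sketch below mirrors that composition so you can print the URL and probe it by hand; `scrape_url` is a hypothetical helper, not part of Prometheus.

```shell
# scrape_url SCHEME TARGET METRICS_PATH
# Prints the URL Prometheus would scrape for a target, or fails when the
# target contains a path (targets must be host:port only).
scrape_url() {
    local scheme="$1" target="$2" path="$3"
    case "$target" in
        */*) echo "target must be host:port, got: $target" >&2; return 1 ;;
    esac
    printf '%s://%s%s\n' "$scheme" "$target" "$path"
}

# Usage (assumes curl is installed):
# curl -fsS "$(scrape_url http localhost:9100 /metrics)" | head
```

Running `promtool check config /etc/prometheus/langlearn-be.yml` performs the real validation of the whole file.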

Step 6: Install and Configure PHP APCu

  1. Install APCu extension:
sudo apt-get install php8.2-apcu -y
  1. Enable APCu extension:
sudo bash -c 'echo "extension=apcu.so" > /etc/php/8.2/mods-available/apcu.ini'
sudo phpenmod apcu
  1. Restart PHP-FPM and Nginx:
sudo systemctl restart php8.2-fpm
sudo systemctl restart nginx

Step 7: Laravel Integration with Prometheus

  1. Install Prometheus Client for PHP:
composer require promphp/prometheus_client_php
  1. Create Prometheus Service Provider:
php artisan make:provider PrometheusServiceProvider

// Add the following code to app/Providers/PrometheusServiceProvider.php

<?php

namespace App\Providers;

use Illuminate\Support\ServiceProvider;
use Prometheus\CollectorRegistry;
use Prometheus\Storage\APC;

class PrometheusServiceProvider extends ServiceProvider
{
    public function register(): void
    {
        $this->app->singleton(CollectorRegistry::class, function () {
            $adapter = new APC();
            return new CollectorRegistry($adapter);
        });
    }

    public function boot(): void
    {
        //
    }
}
  1. Register the service provider in config/app.php:
sed -i '/App\\Providers\\RouteServiceProvider::class,/a \ \ \ \ \ \ \ \ App\\Providers\\PrometheusServiceProvider::class,' config/app.php
  1. Create Middleware for Monitoring Application Traffic and Response Times:
php artisan make:middleware Prometheus/PrometheusMetricsMiddleware

// Add the following code to app/Http/Middleware/Prometheus/PrometheusMetricsMiddleware.php

<?php

namespace App\Http\Middleware\Prometheus;

use Closure;
use Prometheus\CollectorRegistry;

class PrometheusMetricsMiddleware
{
    protected $registry;

    public function __construct(CollectorRegistry $registry)
    {
        $this->registry = $registry;
    }

    public function handle($request, Closure $next)
    {
        if (!defined('LARAVEL_START')) {
            define('LARAVEL_START', microtime(true));
        }

        $response = $next($request);

        $counter = $this->registry->getOrRegisterCounter(
            'app',
            'requests_total',
            'Total number of requests',
            ['method', 'endpoint']
        );
        $counter->inc([$request->method(), $request->path()]);

        $histogram = $this->registry->getOrRegisterHistogram(
            'app',
            'request_duration_seconds',
            'Request duration in seconds',
            ['method', 'endpoint']
        );
        $histogram->observe(microtime(true) - LARAVEL_START, [$request->method(), $request->path()]);

        return $response;
    }
}
  1. Add Middleware to app/Http/Kernel.php:
'api' => [
    \Illuminate\Routing\Middleware\ThrottleRequests::class.':api',
    \Illuminate\Routing\Middleware\SubstituteBindings::class,
    \App\Http\Middleware\Prometheus\PrometheusMetricsMiddleware::class,
],
  1. Configure Exception Handling for Monitoring:
cat <<'EOL' > app/Exceptions/Handler.php
<?php

namespace App\Exceptions;

use Illuminate\Foundation\Exceptions\Handler as ExceptionHandler;
use Illuminate\Support\Facades\App;
use Prometheus\CollectorRegistry;
use Throwable;

class Handler extends ExceptionHandler
{
    protected $dontFlash = [
        'current_password',
        'password',
        'password_confirmation',
    ];

    public function register(): void
    {
        $this->reportable(function (Throwable $e) {
            $request = request();
            $registry = App::make(CollectorRegistry::class);
            $counter = $registry->getOrRegisterCounter(
                'app', 
                'errors_total', 
                'Total number of errors', 
                ['type', 'endpoint']
            );
            $counter->inc([get_class($e), $request->path()]);
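            // No logger()->error() or rethrow here: once this callback returns,
            // Laravel continues with its default reporting and logs the exception itself.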
        });
    }
}
EOL
  1. Create Metrics Controller:
php artisan make:controller Prometheus/MetricsController

// Add the following code to app/Http/Controllers/Prometheus/MetricsController.php

<?php

namespace App\Http\Controllers\Prometheus;

use App\Http\Controllers\Controller;
use Prometheus\CollectorRegistry;
use Prometheus\RenderTextFormat;

class MetricsController extends Controller
{
    protected $registry;

    public function __construct(CollectorRegistry $registry)
    {
        $this->registry = $registry;
    }

    public function metrics()
    {
        $renderer = new RenderTextFormat();
        $metrics = $renderer->render($this->registry->getMetricFamilySamples());

        return response($metrics, 200, ['Content-Type' => RenderTextFormat::MIME_TYPE]);
    }
}
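  1. Create Middleware to Restrict Metrics Access to Localhost:
php artisan make:middleware Prometheus/AllowLocalhostOnly

// Add the following code to app/Http/Middleware/Prometheus/AllowLocalhostOnly.php
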
<?php

namespace App\Http\Middleware\Prometheus;

use Closure;
use Illuminate\Http\Request;
use Symfony\Component\HttpFoundation\Response;

class AllowLocalhostOnly
{
    public function handle(Request $request, Closure $next): Response
    {
        if (!$this->isLocalhost($request->ip())) {
            return response('Unauthorized.', 401);
        }
        return $next($request);
    }

    protected function isLocalhost($ip)
    {
        return in_array($ip, ['127.0.0.1', '::1']);
    }
}


  1. Add the Metrics Route, Restricted to Localhost by the Middleware Above:
use App\Http\Controllers\Prometheus\MetricsController;
use App\Http\Middleware\Prometheus\AllowLocalhostOnly;

// The fallback secret below is published on this page; set your own PROMETHEUS_SECRET in .env.
Route::middleware([AllowLocalhostOnly::class])->group(function () {
    Route::get('/prometheus/metrics/' . env('PROMETHEUS_SECRET', 'tzUDnGrzj5j3vp0HSz0HKj3LNrYf1cj9'), [MetricsController::class, 'metrics']);
});

Step 8: Nginx Configuration

  1. Add the following Nginx server blocks to configure custom local domain names and directories:

Configuration for langlearn-api:

server {
    server_name langlearn-api;
    root /var/www/langlearnai-be/api/public;
    index index.php index.html;

    location / {
        try_files $uri $uri/ /index.php?$query_string;
    }

    location ~ \.php$ {
        include snippets/fastcgi-php.conf;
        fastcgi_pass unix:/var/run/php/php8.2-fpm.sock;
        fastcgi_param SCRIPT_FILENAME $realpath_root$fastcgi_script_name;
        include fastcgi_params;
    }

    location ~ /\.ht {
        deny all;
    }
}

Configuration for langlearn-deployment:

server {
    server_name langlearn-deployment;
    root /var/www/langlearnai-be/deployment/public;
    index index.php index.html;

    location / {
        try_files $uri $uri/ /index.php?$query_string;
    }

    location ~ \.php$ {
        include snippets/fastcgi-php.conf;
        fastcgi_pass unix:/var/run/php/php8.2-fpm.sock;
        fastcgi_param SCRIPT_FILENAME $realpath_root$fastcgi_script_name;
        include fastcgi_params;
    }

    location ~ /\.ht {
        deny all;
    }
}

Configuration for langlearn-staging:

server {
    server_name langlearn-staging;
    root /var/www/langlearnai-be/staging/public;
    index index.php index.html;

    location / {
        try_files $uri $uri/ /index.php?$query_string;
    }

    location ~ \.php$ {
        include snippets/fastcgi-php.conf;
        fastcgi_pass unix:/var/run/php/php8.2-fpm.sock;
        fastcgi_param SCRIPT_FILENAME $realpath_root$fastcgi_script_name;
        include fastcgi_params;
    }

    location ~ /\.ht {
        deny all;
    }
}

Step 9: Update /etc/hosts File

  1. Add the following entries to your /etc/hosts file to complete the setup:
127.0.0.1 langlearn-api
127.0.0.1 langlearn-deployment
127.0.0.1 langlearn-staging
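
If this setup is scripted, appending the entries blindly adds duplicates on every run. The sketch below adds an entry only when the name is not already mapped; `add_host_entry` is a hypothetical helper. Writing to /etc/hosts needs root, so run it from a root shell or point it at another file while testing.

```shell
# add_host_entry IP NAME [HOSTS_FILE]
# Appends "IP NAME" to HOSTS_FILE (default /etc/hosts) unless a non-comment
# line already maps some address to NAME.
add_host_entry() {
    local ip="$1" name="$2" file="${3:-/etc/hosts}"
    if grep -Eq "^[^#]*[[:space:]]$name([[:space:]]|\$)" "$file"; then
        return 0
    fi
    printf '%s %s\n' "$ip" "$name" >> "$file"
}

# Usage (as root):
# for host in langlearn-api langlearn-deployment langlearn-staging; do
#     add_host_entry 127.0.0.1 "$host"
# done
```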

Step 10: Restart Services

  1. Restart the necessary services to apply the changes:
sudo systemctl restart php8.2-fpm
sudo systemctl restart nginx
sudo systemctl restart prometheus-langlearnbe

Grafana Setup

  1. Install the prerequisite packages:
sudo apt-get install -y apt-transport-https software-properties-common wget
  1. Import GPG Key
sudo mkdir -p /etc/apt/keyrings/
wget -q -O - https://apt.grafana.com/gpg.key | gpg --dearmor | sudo tee /etc/apt/keyrings/grafana.gpg > /dev/null
  1. Add a repository for stable releases
echo "deb [signed-by=/etc/apt/keyrings/grafana.gpg] https://apt.grafana.com/ stable main" | sudo tee -a /etc/apt/sources.list.d/grafana.list
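  1. (Optional) Add a repository for beta releases. Skip this step unless you want pre-release builds, because apt installs the newest version it can see: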
echo "deb [signed-by=/etc/apt/keyrings/grafana.gpg] https://apt.grafana.com/ beta main" | sudo tee -a /etc/apt/sources.list.d/grafana.list
  1. Update the list of available packages:
sudo apt-get update
  1. Install Grafana OSS (the package name is lowercase):
sudo apt-get install -y grafana
  1. Verify that the service is running
sudo systemctl status grafana-server
  • If it is not running, start it:
sudo systemctl daemon-reload
sudo systemctl start grafana-server
sudo systemctl status grafana-server
  1. Configure the Grafana server to start at boot
sudo systemctl enable grafana-server.service
  1. Log in at http://<ip>:3000 (3000 is the default port).
  • The default username and password are admin/admin. Grafana prompts you to change the password on first login.

Connect the Grafana Server to Prometheus

To add the Prometheus data source, complete the following steps:

  • Click Connections in the left-side menu.
  • Under Connections, click Add new connection.
  • Enter Prometheus in the search bar.
  • Select Prometheus.
  • Click Add new data source in the upper right.

You will be taken to the Settings tab where you will set up your Prometheus configuration.

  1. In the Connection section, use http://localhost:9099. Prometheus listens on 127.0.0.1:9099 in this setup, so http://<ip>:9099/ works only if you change --web.listen-address in the service file.

  2. Click Save & test.

  3. Click the menu and open Explore.

  4. Select Prometheus as the data source.

  5. Run a query against the metrics.

  6. Click the menu and go to Dashboards.

  7. Click Create dashboard.

  8. Click Add visualization.

  9. Select Prometheus as the data source.

  10. Select Code on the right side of the query editor and paste this query for CPU metrics: rate(node_cpu_seconds_total{mode="user"}[1m])

For a prebuilt dashboard

  1. On the Dashboards page, click Import dashboard.

  2. Enter the Grafana dashboard URL https://grafana.com/grafana/dashboards/1860 (Node Exporter Full) or its ID, 1860, and click Load.

  3. Select Prometheus as the data source.

  4. Click Import.

For Slack notifications

  1. Click Alerting in the menu dropdown.

  2. Click Contact points to set up the Slack notification.

  3. Set a name. The alert rules below use the receiver name HNG Slack notification.

  4. Select Slack as the integration.

  5. Use either an API secret (token) or, as Grafana's Slack integration also allows, a webhook URL.

  6. Click Save.


Configure Alert

  1. Navigate to Dashboards.

  2. Open the dashboard that contains the panel you want to alert on.

  3. Click the panel that displays CPU metrics.

  4. Click the edit (pencil) icon to edit the panel.

  5. Open the Alert tab.

  6. Click to create a new alert rule.

  7. Enter a name in the alert rule name field.

  8. Input the query.

  9. Define the alert conditions.

  10. Click Save rule and exit.

Alert Rule in a .yml file

apiVersion: 1
groups:
    - orgId: 1
      name: Server Metrics
      folder: Server Metrics
      interval: 1m
      rules:
        - uid: bdt7z9wlz5xj4c
          title: CPU Usage
          condition: H
          data:
            - refId: A
              relativeTimeRange:
                from: 86400
                to: 0
              datasourceUid: edt4gy6d7zkzka
              model:
                datasource:
                    type: prometheus
                    uid: edt4gy6d7zkzka
                editorMode: code
                exemplar: false
                expr: 100 * (1 - avg(rate(node_cpu_seconds_total{mode="idle"}[1m])) by (instance))
                format: time_series
                instant: true
                interval: ""
                intervalFactor: 1
                intervalMs: 15000
                legendFormat: Busy System
                maxDataPoints: 43200
                range: false
                refId: A
                step: 240
            - refId: H
              relativeTimeRange:
                from: 600
                to: 0
              datasourceUid: __expr__
              model:
                conditions:
                    - evaluator:
                        params:
                            - 80
                        type: gt
                      operator:
                        type: and
                      query:
                        params:
                            - H
                      reducer:
                        params: []
                        type: last
                      type: query
                datasource:
                    type: __expr__
                    uid: __expr__
                expression: A
                intervalMs: 1000
                maxDataPoints: 43200
                refId: H
                type: threshold
          dashboardUid: rYdddlPWk
          panelId: 77
          noDataState: OK
          execErrState: Error
          for: 1m
          annotations:
            __dashboardUid__: rYdddlPWk
            __panelId__: "77"
            resolved: "- Description: The current CPU usage dropped below 80%. \n- Value:  {{ humanize (index $values \"A\").Value }}%"
            summary: "- Description: The current CPU usage exceeded 80%. \n- Value:  {{ humanize (index $values \"A\").Value }}%"
          labels:
            type: CPU Utilization Alert
          isPaused: false
          notification_settings:
            receiver: HNG Slack notification
        - uid: fdt9ih6gmttdsf
          title: Disk Usage
          condition: C
          data:
            - refId: D
              relativeTimeRange:
                from: 600
                to: 0
              datasourceUid: edt4gy6d7zkzka
              model:
                datasource:
                    type: prometheus
                    uid: edt4gy6d7zkzka
                editorMode: code
                expr: max(100 - ((node_filesystem_avail_bytes * 100) / node_filesystem_size_bytes)) by (instance)
                instant: true
                intervalMs: 1000
                legendFormat: __auto
                maxDataPoints: 43200
                range: false
                refId: D
            - refId: C
              relativeTimeRange:
                from: 600
                to: 0
              datasourceUid: __expr__
              model:
                conditions:
                    - evaluator:
                        params:
                            - 80
                        type: gt
                      operator:
                        type: and
                      query:
                        params:
                            - C
                      reducer:
                        params: []
                        type: last
                      type: query
                datasource:
                    type: __expr__
                    uid: __expr__
                expression: D
                intervalMs: 1000
                maxDataPoints: 43200
                refId: C
                type: threshold
          dashboardUid: rYdddlPWk
          panelId: 152
          noDataState: OK
          execErrState: Error
          for: 1m
          annotations:
            __dashboardUid__: rYdddlPWk
            __panelId__: "152"
            resolved: "- Description: The current disk usage has dropped below 80%. \n- Value:  {{ humanize (index $values \"D\").Value }}% In use"
            summary: "- Description: The current disk usage is now above 80%. \n- Value:  {{ humanize (index $values \"D\").Value }}% In use"
          labels:
            type: High Disk Usage
          isPaused: false
          notification_settings:
            receiver: HNG Slack notification
        - uid: ddt802h77g4xsa
          title: Memory Usage
          condition: G
          data:
            - refId: A
              relativeTimeRange:
                from: 86400
                to: 0
              datasourceUid: edt4gy6d7zkzka
              model:
                datasource:
                    type: prometheus
                    uid: edt4gy6d7zkzka
                editorMode: code
                expr: (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100
                format: time_series
                instant: true
                interval: ""
                intervalFactor: 1
                intervalMs: 15000
                legendFormat: RAM Total
                maxDataPoints: 43200
                range: false
                refId: A
                step: 240
            - refId: B
              relativeTimeRange:
                from: 600
                to: 0
              datasourceUid: edt4gy6d7zkzka
              model:
                datasource:
                    type: prometheus
                    uid: edt4gy6d7zkzka
                editorMode: code
                expr: node_memory_MemAvailable_bytes / 1024 / 1024 / 1024
                instant: true
                intervalMs: 1000
                legendFormat: __auto
                maxDataPoints: 43200
                range: false
                refId: B
            - refId: C
              relativeTimeRange:
                from: 600
                to: 0
              datasourceUid: edt4gy6d7zkzka
              model:
                datasource:
                    type: prometheus
                    uid: edt4gy6d7zkzka
                editorMode: code
                expr: node_memory_MemTotal_bytes / 1024 / 1024 / 1024
                instant: true
                intervalMs: 1000
                legendFormat: __auto
                maxDataPoints: 43200
                range: false
                refId: C
            - refId: G
              relativeTimeRange:
                from: 86400
                to: 0
              datasourceUid: __expr__
              model:
                conditions:
                    - evaluator:
                        params:
                            - 10
                        type: lt
                      operator:
                        type: and
                      query:
                        params:
                            - G
                      reducer:
                        params: []
                        type: last
                      type: query
                datasource:
                    type: __expr__
                    uid: __expr__
                expression: A
                intervalMs: 1000
                maxDataPoints: 43200
                refId: G
                type: threshold
          dashboardUid: rYdddlPWk
          panelId: 78
          noDataState: OK
          execErrState: Error
          for: 1m
          annotations:
            __dashboardUid__: rYdddlPWk
            __panelId__: "78"
            resolved: |-
                - Description: The available memory on the system has increased to more than 10% of the total memory.
                - Value: {{ humanize (index $values "A").Value }}% ({{ humanize (index $values "B").Value }} GB remaining out of {{ humanize (index $values "C").Value }} GB total)
            summary: |-
                - Description: The available memory on the system has dropped below 10% of the total memory.
                - Value: {{ humanize (index $values "A").Value }}% ({{ humanize (index $values "B").Value }} GB remaining out of {{ humanize (index $values "C").Value }} GB total)
          labels:
            type: Available RAM alert
          isPaused: false
          notification_settings:
            receiver: HNG Slack notification
    - orgId: 1
      name: Application
      folder: Application Error
      interval: 1m
      rules:
        - uid: ddt8rvismshz4b
          title: Application Error
          condition: C
          data:
            - refId: A
              relativeTimeRange:
                from: 300
                to: 0
              datasourceUid: edt4gy6d7zkzka
              model:
                datasource:
                    type: prometheus
                    uid: edt4gy6d7zkzka
                disableTextWrap: false
                editorMode: code
                exemplar: false
                expr: rate(app_errors_total{}[5m])
                fullMetaSearch: false
                includeNullMetadata: true
                instant: false
                interval: ""
                intervalMs: 15000
                legendFormat: __auto
                maxDataPoints: 43200
                range: true
                refId: A
                useBackend: false
            - refId: B
              relativeTimeRange:
                from: 300
                to: 0
              datasourceUid: __expr__
              model:
                conditions:
                    - evaluator:
                        params:
                            - 0
                            - 0
                        type: gt
                      operator:
                        type: and
                      query:
                        params: []
                      reducer:
                        params: []
                        type: avg
                      type: query
                datasource:
                    name: Expression
                    type: __expr__
                    uid: __expr__
                expression: A
                intervalMs: 1000
                maxDataPoints: 43200
                reducer: sum
                refId: B
                type: reduce
            - refId: C
              relativeTimeRange:
                from: 300
                to: 0
              datasourceUid: __expr__
              model:
                conditions:
                    - evaluator:
                        params:
                            - 0.0167
                            - 0
                        type: gt
                      operator:
                        type: and
                      query:
                        params: []
                      reducer:
                        params: []
                        type: avg
                      type: query
                datasource:
                    name: Expression
                    type: __expr__
                    uid: __expr__
                expression: B
                intervalMs: 1000
                maxDataPoints: 43200
                refId: C
                type: threshold
          dashboardUid: rYdddlPWk
          panelId: 325
          noDataState: OK
          execErrState: Error
          for: 1m
          annotations:
            __dashboardUid__: rYdddlPWk
            __panelId__: "325"
            resolved: "- Description: No error has been detected here in the past 5 minutes.\n- Details: \n{{ $ran := false }}\n{{ range $k, $v := $values -}}\n  {{ if not $ran -}}\n     {{ $v.Labels }}\n    {{ $ran = true }}\n  {{ end -}}\n{{ end -}}"
            summary: "- Description: The error rate for this application has surpassed the threshold of 10 errors within a 5-minute period for this specific error.\n- Details: \n{{ $ran := false }}\n{{ range $k, $v := $values -}}\n  {{ if not $ran -}}\n     {{ $v.Labels }}\n    {{ $ran = true }}\n  {{ end -}}\n{{ end -}}"
          labels:
            type: Application Error
          isPaused: false
          notification_settings:
            receiver: HNG Slack notification
    - orgId: 1
      name: No Data
      folder: Application Error
      interval: 5m
      rules:
        - uid: edtd3j7f6iry8a
          title: Application Down
          condition: B
          data:
            - refId: A
              relativeTimeRange:
                from: 600
                to: 0
              datasourceUid: edt4gy6d7zkzka
              model:
                datasource:
                    type: prometheus
                    uid: edt4gy6d7zkzka
                editorMode: code
                exemplar: false
                expr: up
                instant: true
                intervalMs: 1000
                legendFormat: __auto
                maxDataPoints: 43200
                range: false
                refId: A
            - refId: B
              relativeTimeRange:
                from: 600
                to: 0
              datasourceUid: __expr__
              model:
                conditions:
                    - evaluator:
                        params:
                            - 1
                            - 0
                        # up is 0 or 1, so the rule must fire when the value is below 1
                        type: lt
                      operator:
                        type: and
                      query:
                        params: []
                      reducer:
                        params: []
                        type: avg
                      type: query
                datasource:
                    name: Expression
                    type: __expr__
                    uid: __expr__
                expression: A
                hide: false
                intervalMs: 1000
                maxDataPoints: 43200
                refId: B
                type: threshold
          dashboardUid: rYdddlPWk
          panelId: 325
          noDataState: OK
          execErrState: Alerting
          for: 5m
          annotations:
            __dashboardUid__: rYdddlPWk
            __panelId__: "325"
            resolved: |-
                - Description:  The below address is back up
                - Details:
                {{ $ran := false }}
                {{ range $k, $v := $values -}}
                  {{ if not $ran -}}
                    {{ $k }}: {{ $v.Labels }}
                    {{ $ran = true }}
                  {{ end -}}
                {{ end -}}
            summary: |-
                - Description:  The below address is down
                - Details:
                {{ $ran := false }}
                {{ range $k, $v := $values -}}
                  {{ if not $ran -}}
                    {{ $k }}: {{ $v.Labels }}
                    {{ $ran = true }}
                  {{ end -}}
                {{ end -}}
          labels:
            type: Application is Down
          isPaused: false
          notification_settings:
            receiver: HNG Slack notification


Data Retention

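By default, Prometheus retains data for 15 days. The service file in Step 3 sets this explicitly with --storage.tsdb.retention.time=15d; change that value and restart the service to keep data for a longer or shorter period.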
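
When choosing a retention period it helps to estimate the disk space it implies. The Prometheus storage documentation gives the rule of thumb: needed disk space = retention time in seconds × ingested samples per second × bytes per sample, with roughly 1 to 2 bytes per sample. The sketch below applies that formula; `estimate_tsdb_bytes` is a hypothetical helper and the input figures are illustrative, not measurements from this setup.

```shell
# estimate_tsdb_bytes RETENTION_DAYS SAMPLES_PER_SECOND BYTES_PER_SAMPLE
# Prints the approximate on-disk size in bytes for the given retention.
# All arguments must be whole numbers (shell arithmetic is integer-only).
estimate_tsdb_bytes() {
    local days="$1" rate="$2" bytes="$3"
    echo $(( days * 86400 * rate * bytes ))
}

# Usage: 15 days at 1000 samples/s and 2 bytes per sample
# estimate_tsdb_bytes 15 1000 2
```

The current ingestion rate can be read from Prometheus itself with the query rate(prometheus_tsdb_head_samples_appended_total[5m]).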


Troubleshooting Tips

1. Check Service Status

  • Ensure Prometheus and Grafana are running, using the service names from this setup:
sudo systemctl status prometheus-langlearnbe
sudo systemctl status grafana-server

2. Verify Configuration Files

  • Check the Prometheus configuration file (/etc/prometheus/langlearn-be.yml) for correctness.
  • Inspect logs for errors or warnings:
sudo journalctl -u prometheus-langlearnbe
sudo journalctl -u grafana-server

3. Verify Network Connectivity

  • Ensure Prometheus can reach its scrape targets and that the targets are returning valid metrics.

4. Monitor Resource Usage

  • Ensure Prometheus has adequate CPU and memory resources to operate effectively.

5. Check Alerting Configuration

  • If alerts are not firing as expected, verify the alert rules and contact points in Grafana Alerting (or in Alertmanager, if you use it).


Best Practices

1. Regular Backups

  • Regularly back up Prometheus configuration files and data to prevent data loss.

2. Version Control

  • Store your configuration files in a version control system like Git.

3. Automate Configuration Management

  • Use tools like Ansible, Puppet, or Chef to manage your Prometheus setup.

4. Monitor Prometheus

  • Set up alerts to monitor the health and performance of Prometheus itself.

5. Keep Software Updated

  • Regularly update Prometheus and related tools to the latest stable versions.

6. Documentation

  • Document your monitoring setup, including configuration files, architecture, and troubleshooting steps.

7. Security

  • Limit access to the Prometheus web UI and metrics endpoints.

8. Redundancy

  • Set up redundant Prometheus instances to avoid single points of failure and use load balancers to distribute the load.

9. Alert Management

  • Regularly test alerting rules and tune thresholds to avoid alert fatigue.
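
As one way to act on the Regular Backups practice above, the sketch below archives a configuration directory into a timestamped tarball. `backup_config` is a hypothetical helper and the destination path in the usage line is an assumption. It covers configuration files only; backing up the TSDB data directory while Prometheus is running needs a different approach.

```shell
# backup_config SRC_DIR DEST_DIR
# Writes DEST_DIR/<name>-YYYYmmdd-HHMMSS.tar.gz containing SRC_DIR and
# prints the path of the archive it created.
backup_config() {
    local src="$1" dest="$2" stamp archive
    stamp="$(date +%Y%m%d-%H%M%S)"
    archive="$dest/$(basename "$src")-$stamp.tar.gz"
    mkdir -p "$dest" || return 1
    # -C keeps paths in the archive relative to SRC_DIR's parent.
    tar -czf "$archive" -C "$(dirname "$src")" "$(basename "$src")" || return 1
    printf '%s\n' "$archive"
}

# Usage (the destination directory is an example):
# backup_config /etc/prometheus /var/backups/prometheus
```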