Prometheus

Prometheus

Part 01: The Basics

Prometheus is an open-source monitoring and alerting tool. It collects data about applications and systems and allows you to visualize the data and issue alerts based on the data.

Monitoring is an important component of DevOps automation. To manage a robust and complex infrastructure, you need to be able to quickly and easily understand what is happening in your systems.

High Level Use Cases:

Metric Collection: Collect important metrics about your systems and applications in one place.
Visualization: Build dashboards that provide an overview of the health of your systems.
Alerting: Send you an email or mobile notification when something is broken.

Prometheus Architecture

Prometheus Architecture Diagram:

The components of a Prometheus system are:

Prometheus Server: A central server that gathers metrics and makes them available.
Exporters: Agents that expose data about systems and applications for collecting by the Prometheus server.
Client Libraries: Easily turn your custom application into an exporter that exposes metrics in a format Prometheus can consume.
Prometheus Pushgateway: Allows pushing metrics to Prometheus for certain specific use cases. Pushgateway acts as middle-man.
Alert Manager: Sends alerts triggered by metric data.
Visualization Tools: Provide useful ways to view metric data. These are not necessarily part of Prometheus.

Prometheus collects metrics using a pull model. This means the Prometheus server pulls metric data from exporters - agents do not push data to the Prometheus server.

Prometheus Limitation

While Prometheus is a great tool for a variety of use cases, it is important to understand when it is not the best tool:

100% Accuracy (e.g., per-request billing): Prometheus is designed to operate even under failure conditions. This means it will continue to provide data even if new data is not available due to failure and outages. If you need 100% up-to-the-minute accuracy, such as in the case of per request billing, Prometheus may not be the best tool to use.
Non Time-Series Data (e.g., log aggregation): Prometheus is built to monitor time-series metrics, especially data that is numeric. It is not the best choice for collecting more generic types of data such as system logs.

Part 02: Installation and Configuration

Installation Options:

Use pre-compiled binaries from here.

#!/bin/bash

sudo useradd -M -r -s /bin/false prometheus
sudo mkdir /etc/prometheus /var/lib/prometheus
# Now download the latest version
tar xzf PROMETHEUS.tar.gz
sudo cp PROMETHEUS/{prometheus,promtool} /usr/local/bin/
sudo chown prometheus:prometheus /usr/local/bin/{prometheus,promtool}
sudo cp -r PROMETHEUS/{consoles,console_libraries} /etc/prometheus/
sudo cp PROMETHEUS/prometheus.yml /etc/prometheus/prometheus.yml
sudo chown -R prometheus:prometheus /etc/prometheus
sudo chown prometheus:prometheus /var/lib/prometheus
# Test prometheus: prometheus --config.file=/etc/prometheus/prometheus.yml

# We need to turn prometheus to systemd service
sudo cat >> /etc/systemd/system/prometheus.service << EOF
[Unit]
Description=Prometheus Time Series Collection and Processing Server
Wants=network-online.target
After=network-online.target

[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/local/bin/prometheus \
    --config.file /etc/prometheus/prometheus.yml \
    --storage.tsdb.path /var/lib/prometheus/ \
    --web.console.templates=/etc/prometheus/consoles \
    --web.console.libraries=/etc/prometheus/console_libraries
ExecReload=/bin/kill -HUP $MAINPID

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl start prometheus
sudo systemctl enable prometheus
# Test prometheus service: sudo systemctl enable prometheus
# The default port of prometheus is 9090

Build from source.
Run with Docker using pre-built images.

Configuration

Prometheus configuration file format is YAML.
Reloading the Configuration:
1. Restarting Prometheus: systemctl restart prometheus
2. Reloading Prometheus: systemctl reload prometheus
3. Send SIGHUP to Prometheus Process: killall -HUP prometheus
Read configuration file from URL: curl localhost:9090/api/v1/status/config

Exporters:

A Prometheus exporter is any application that exposes metric data in a format that can be controlled (or "scraped") by the Prometheus server.

Configure Node Exporter:

# Create a user on the server you want to monitor
sudo useradd -M -r -s /bin/false node_exporter

# Download Node Exporter Binary from Prometheus website and extract it
wget "LINK"
tar xvzf node_exporter.tar.gz

# Copy the binary file to suitable location for the system
sudo cp node_exporter /usr/local/bin/

# Change ownership of the binary to node_exporter user
sudo chown node_exporter:node_exporter /usr/local/bin/node_exporter

# Create a system service for the node_exporter
sudo cat >> /etc/systemd/system/node_exporter.service << EOF
[Unit]
Description= Prometheus Node Exporter
Wants=network-online.target
After=network-online.target

[Service]
User=node_exporter
Group=node_exporter
Type=simple
ExecStart=/usr/local/bin/node_exporter

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl start node_exporter
sudo systemctl enable node_exporter

# Check connection of the Node Exporter
curl localhost:9100/metrics

Create a service for init based servers (Centos 6)

# Command Usage: sudo service node_exporter status|stop|start
sudo cat >> /etc/init.d/node_exporter << EOF
#!/bin/bash
OPTIONS=`cat /etc/sysconfig/node_exporter`
RETVAL=0
PROG="node_exporter"
EXEC="/usr/local/bin/node_exporter"
LOCKFILE="/var/lock/subsys/$PROG"
LOGFILE=/var/log/node_exporter.log
ErrLOGFILE=/var/log/node_exporter_error.log

# Source function library.
if [ -f /etc/rc.d/init.d/functions ]; then
  . /etc/rc.d/init.d/functions
else
  echo "/etc/rc.d/init.d/functions is not exists"
  exit 0
fi

start() {
  if [ -f $LOCKFILE ]
  then
    echo "$PROG is already running!"
  else
    echo -n "Starting $PROG: "
    nohup $EXEC $OPTIONS > $LOGFILE 2> $ErrLOGFILE &
    RETVAL=$?
    [ $RETVAL -eq 0 ] && touch $LOCKFILE && success || failure
    echo
    return $RETVAL
  fi
}

stop() {
  echo -n "Stopping $PROG: "
  killproc $EXEC
  RETVAL=$?
  [ $RETVAL -eq 0 ] && rm -r $LOCKFILE && success || failure
  echo
}

restart ()
{
  stop
  sleep 1
  start
}

case "$1" in
  start)
    start
    ;;
  stop)
    stop
    ;;
  status)
    status $PROG
    ;;
  restart)
    restart
    ;;
  *)
    echo "Usage: $0 {start|stop|restart|status}"
    exit 1
esac
exit $RETVAL

Scrape Config:

The scrape_config section of the Prometheus config file provides a list of targets the Prometheus server will scrape, such as a Node Exporter running on a Linux machine.
Prometheus server will scrape these targets periodically to collect metric data.

We need to add the Node Exporter specification to Prometheus configuration file:

# Start of /etc/prometheus/prometheus.yml config file
.
.
.
scrape_configs:
  # Prometheus server
  - job_name: 'prometheus'
    static_configs:
    - targets: ['localhost:9090']
  # Node Exporter Servers
  - job_name: 'NAME_OF_SERVER'
    static_configs:
    - targets: ['PRIVATE_IP:9100']
.
.
.
# End of /etc/prometheus/prometheus.yml config file

Don't forget to restart/reload the prometheus service.

Part 03: Prometheus Data (Data Model for Storing Data + Query Language to Interact with Prometheus Data)

Prometheus is built around storing time-series data.
Time-series data consists of a series of values associated with different points in time.
Every metric in Prometheus tracks a particular value over time.

Metric Names

Every metric in Prometheus has a metric name. THe metric name refers to the general feature of a system or application that is being measured.

An example of a metric name: node_cpu_seconds_total

Note that the metric name merely refers to the feature being measured. Metric names do not point to a specific data value but potentially a collection of many values.

Metric Labels

Prometheus uses labels to provide a dimensional data model. This means we can use labels to specify additional things, such as which node's CPU usage is being represented.

A unique combination of a metric name and a set of labels identifies a particular set of time-series data. This example uses a label (It can have more labels) called CPU to refer to usage of a specific CPU: node_cpu_seconds_total{cpu="0"}

Metric Types

Metric types refer to different ways in which exporters represent the metric data they provide.

Metric types are not represented in any special way in a Prometheus server, but it is important to understand them in order to properly interpret your metrics.

Counter : A counter is a single number that can only increase or be reset to zero. Counters represent cumulative values. Examples: Number of application restarts, Number of HTTP requests served by an application, etc. Example in querying: node_cpu_seconds_total[5m]
Gauge : A gauge is a single number that can increase and decrease over time. Examples: CPU/Memory usage, current active threads, number of concurrent HTTP requests, etc. Example in querying: node_memory_MemAvailable_bytes
Histogram : A histogram counts the number of observation/events that fall into a set of configurable buckets, each with its own separate time series. A histogram will use labels to differentiate between buckets. The below example provides the number of HTTP requests whose duration falls into each bucket. Histogram also include separate metric names to expose the _sum of all observed values and the total _count of events.
- prometheus_http_request_duration_seconds_bucket{le="0.3"}
- prometheus_http_request_duration_seconds_bucket{le="1.0"}
- prometheus_http_request_duration_seconds_sum
- prometheus_http_request_duration_seconds_count
Summary : A summary is similar to a histogram, but it exposes metrics in the form of quantiles instead of buckets. While buckets divide values based on specific boundaries, quantiles divide values based on the percentiles into which they fall. Like histograms, summaries also expose the _sum and _count metrics. This value represents the number of HTTP requests whose duration falls within the 95th percentile of all requests or the top 5% longest requests: prometheus_http_request_duration_seconds{quantile="0.95"} or go_gc_duration_seconds

Querying

Querying allows you to access and work with your metric data in Prometheus.

You can use PromQL (Prometheus Query Language) to write queries and retrieve useful information from the metric data collected by Prometheus.

You can use Prometheus queries in a variety of ways to obtain and work with data:

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

Prometheus

Part 01: The Basics

Prometheus Architecture

Prometheus Limitation

Part 02: Installation and Configuration

Installation Options:

Configuration

Part 03: Prometheus Data (Data Model for Storing Data + Query Language to Interact with Prometheus Data)

Metric Names

Metric Labels

Metric Types

Querying

Selectors

Part 04: Visualization

Part 05: Collecting Metrics

Part 06: Alerting

Part 07: Advanced Concepts (HA and Federation)

Part 08: Security

Part 09: Client Libraries

Part 10: References

Files

README.md

Latest commit

History

README.md

File metadata and controls

Prometheus

Part 01: The Basics

Prometheus Architecture

Prometheus Limitation

Part 02: Installation and Configuration

Installation Options:

Configuration

Part 03: Prometheus Data (Data Model for Storing Data + Query Language to Interact with Prometheus Data)

Metric Names

Metric Labels

Metric Types

Querying

Selectors

Part 04: Visualization

Part 05: Collecting Metrics

Part 06: Alerting

Part 07: Advanced Concepts (HA and Federation)

Part 08: Security

Part 09: Client Libraries

Part 10: References