- Prometheus
- Part 01: The Basics
- Part 02: Installation and Configuration
- Part 03: Prometheus Data (Data Model for Storing Data + Query Language to Interact with Prometheus Data)
- Part 04: Visualization
- Part 05: Collecting Metrics
- Part 06: Alerting
- Part 07: Advanced Concepts (HA and Federation)
- Part 08: Security
- Part 09: Client Libraries
- Part 10: References
Prometheus is an open-source monitoring and alerting tool. It collects data about applications and systems and allows you to visualize the data and issue alerts based on the data.
Monitoring is an important component of DevOps automation. To manage a robust and complex infrastructure, you need to be able to quickly and easily understand what is happening in your systems.
High Level Use Cases:
- Metric Collection: Collect important metrics about your systems and applications in one place.
- Visualization: Build dashboards that provide an overview of the health of your systems.
- Alerting: Send you an email or mobile notification when something is broken.
Prometheus Architecture Diagram:
The components of a Prometheus system are:
- Prometheus Server: A central server that gathers metrics and makes them available.
- Exporters: Agents that expose data about systems and applications for collecting by the Prometheus server.
- Client Libraries: Easily turn your custom application into an exporter that exposes metrics in a format Prometheus can consume.
- Prometheus Pushgateway: Allows pushing metrics to Prometheus for certain specific use cases. Pushgateway acts as middle-man.
- Alert Manager: Sends alerts triggered by metric data.
- Visualization Tools: Provide useful ways to view metric data. These are not necessarily part of Prometheus.
Prometheus collects metrics using a pull model. This means the Prometheus server pulls metric data from exporters - agents do not push data to the Prometheus server.
While Prometheus is a great tool for a variety of use cases, it is important to understand when it is not the best tool:
- 100% Accuracy (e.g., per-request billing): Prometheus is designed to operate even under failure conditions. This means it will continue to provide data even if new data is not available due to failure and outages. If you need 100% up-to-the-minute accuracy, such as in the case of per request billing, Prometheus may not be the best tool to use.
- Non Time-Series Data (e.g., log aggregation): Prometheus is built to monitor time-series metrics, especially data that is numeric. It is not the best choice for collecting more generic types of data such as system logs.
- Use pre-compiled binaries from here.
#!/bin/bash sudo useradd -M -r -s /bin/false prometheus sudo mkdir /etc/prometheus /var/lib/prometheus # Now download the latest version tar xzf PROMETHEUS.tar.gz sudo cp PROMETHEUS/{prometheus,promtool} /usr/local/bin/ sudo chown prometheus:prometheus /usr/local/bin/{prometheus,promtool} sudo cp -r PROMETHEUS/{consoles,console_libraries} /etc/prometheus/ sudo cp PROMETHEUS/prometheus.yml /etc/prometheus/prometheus.yml sudo chown -R prometheus:prometheus /etc/prometheus sudo chown prometheus:prometheus /var/lib/prometheus # Test prometheus: prometheus --config.file=/etc/prometheus/prometheus.yml # We need to turn prometheus to systemd service sudo cat >> /etc/systemd/system/prometheus.service << EOF [Unit] Description=Prometheus Time Series Collection and Processing Server Wants=network-online.target After=network-online.target [Service] User=prometheus Group=prometheus Type=simple ExecStart=/usr/local/bin/prometheus \ --config.file /etc/prometheus/prometheus.yml \ --storage.tsdb.path /var/lib/prometheus/ \ --web.console.templates=/etc/prometheus/consoles \ --web.console.libraries=/etc/prometheus/console_libraries ExecReload=/bin/kill -HUP $MAINPID [Install] WantedBy=multi-user.target EOF sudo systemctl daemon-reload sudo systemctl start prometheus sudo systemctl enable prometheus # Test prometheus service: sudo systemctl enable prometheus # The default port of prometheus is 9090
- Build from source.
- Run with Docker using pre-built images.
- Prometheus configuration file format is YAML.
- Reloading the Configuration:
- Restarting Prometheus:
systemctl restart prometheus
- Reloading Prometheus:
systemctl reload prometheus
- Send SIGHUP to Prometheus Process:
killall -HUP prometheus
- Restarting Prometheus:
- Read configuration file from URL:
curl localhost:9090/api/v1/status/config
Exporters:
- A Prometheus exporter is any application that exposes metric data in a format that can be controlled (or "scraped") by the Prometheus server.
- Configure Node Exporter:
# Create a user on the server you want to monitor sudo useradd -M -r -s /bin/false node_exporter # Download Node Exporter Binary from Prometheus website and extract it wget "LINK" tar xvzf node_exporter.tar.gz # Copy the binary file to suitable location for the system sudo cp node_exporter /usr/local/bin/ # Change ownership of the binary to node_exporter user sudo chown node_exporter:node_exporter /usr/local/bin/node_exporter # Create a system service for the node_exporter sudo cat >> /etc/systemd/system/node_exporter.service << EOF [Unit] Description= Prometheus Node Exporter Wants=network-online.target After=network-online.target [Service] User=node_exporter Group=node_exporter Type=simple ExecStart=/usr/local/bin/node_exporter [Install] WantedBy=multi-user.target EOF sudo systemctl daemon-reload sudo systemctl start node_exporter sudo systemctl enable node_exporter # Check connection of the Node Exporter curl localhost:9100/metrics
- Create a service for init based servers (Centos 6)
# Command Usage: sudo service node_exporter status|stop|start sudo cat >> /etc/init.d/node_exporter << EOF #!/bin/bash OPTIONS=`cat /etc/sysconfig/node_exporter` RETVAL=0 PROG="node_exporter" EXEC="/usr/local/bin/node_exporter" LOCKFILE="/var/lock/subsys/$PROG" LOGFILE=/var/log/node_exporter.log ErrLOGFILE=/var/log/node_exporter_error.log # Source function library. if [ -f /etc/rc.d/init.d/functions ]; then . /etc/rc.d/init.d/functions else echo "/etc/rc.d/init.d/functions is not exists" exit 0 fi start() { if [ -f $LOCKFILE ] then echo "$PROG is already running!" else echo -n "Starting $PROG: " nohup $EXEC $OPTIONS > $LOGFILE 2> $ErrLOGFILE & RETVAL=$? [ $RETVAL -eq 0 ] && touch $LOCKFILE && success || failure echo return $RETVAL fi } stop() { echo -n "Stopping $PROG: " killproc $EXEC RETVAL=$? [ $RETVAL -eq 0 ] && rm -r $LOCKFILE && success || failure echo } restart () { stop sleep 1 start } case "$1" in start) start ;; stop) stop ;; status) status $PROG ;; restart) restart ;; *) echo "Usage: $0 {start|stop|restart|status}" exit 1 esac exit $RETVAL
Scrape Config:
- The scrape_config section of the Prometheus config file provides a list of targets the Prometheus server will scrape, such as a Node Exporter running on a Linux machine.
- Prometheus server will scrape these targets periodically to collect metric data.
- We need to add the Node Exporter specification to Prometheus configuration file:
# Start of /etc/prometheus/prometheus.yml config file . . . scrape_configs: # Prometheus server - job_name: 'prometheus' static_configs: - targets: ['localhost:9090'] # Node Exporter Servers - job_name: 'NAME_OF_SERVER' static_configs: - targets: ['PRIVATE_IP:9100'] . . . # End of /etc/prometheus/prometheus.yml config file
- Don't forget to restart/reload the prometheus service.
Part 03: Prometheus Data (Data Model for Storing Data + Query Language to Interact with Prometheus Data)
- Prometheus is built around storing time-series data.
- Time-series data consists of a series of values associated with different points in time.
- Every metric in Prometheus tracks a particular value over time.
Every metric in Prometheus has a metric name. THe metric name refers to the general feature of a system or application that is being measured.
An example of a metric name: node_cpu_seconds_total
Note that the metric name merely refers to the feature being measured. Metric names do not point to a specific data value but potentially a collection of many values.
Prometheus uses labels to provide a dimensional data model. This means we can use labels to specify additional things, such as which node's CPU usage is being represented.
A unique combination of a metric name and a set of labels identifies a particular set of time-series data. This example uses a label (It can have more labels) called CPU to refer to usage of a specific CPU: node_cpu_seconds_total{cpu="0"}
Metric types refer to different ways in which exporters represent the metric data they provide.
Metric types are not represented in any special way in a Prometheus server, but it is important to understand them in order to properly interpret your metrics.
- Counter : A counter is a single number that can only increase or be reset to zero. Counters represent cumulative values. Examples: Number of application restarts, Number of HTTP requests served by an application, etc. Example in querying:
node_cpu_seconds_total[5m]
- Gauge : A gauge is a single number that can increase and decrease over time. Examples: CPU/Memory usage, current active threads, number of concurrent HTTP requests, etc. Example in querying:
node_memory_MemAvailable_bytes
- Histogram : A histogram counts the number of observation/events that fall into a set of configurable buckets, each with its own separate time series. A histogram will use labels to differentiate between buckets. The below example provides the number of HTTP requests whose duration falls into each bucket. Histogram also include separate metric names to expose the _sum of all observed values and the total _count of events.
prometheus_http_request_duration_seconds_bucket{le="0.3"}
prometheus_http_request_duration_seconds_bucket{le="1.0"}
prometheus_http_request_duration_seconds_sum
prometheus_http_request_duration_seconds_count
- Summary : A summary is similar to a histogram, but it exposes metrics in the form of quantiles instead of buckets. While buckets divide values based on specific boundaries, quantiles divide values based on the percentiles into which they fall. Like histograms, summaries also expose the _sum and _count metrics. This value represents the number of HTTP requests whose duration falls within the 95th percentile of all requests or the top 5% longest requests:
prometheus_http_request_duration_seconds{quantile="0.95"}
orgo_gc_duration_seconds
Querying allows you to access and work with your metric data in Prometheus.
You can use PromQL (Prometheus Query Language) to write queries and retrieve useful information from the metric data collected by Prometheus.
You can use Prometheus queries in a variety of ways to obtain and work with data:
- Expression browser
- Prometheus HTTP API
- Visualization tools such as Grafana