Use case: If you have a service running on many servers, want to expose throughput-like metrics with thousands of events per second to Prometheus, and can reach the servers through SSH, this is for you. Just add a CSV-like log writer to your software, attach this exporter to the servers, and point Prometheus at this service. Instead of transferring megabytes of Prometheus protocol data over the network, this service extracts and accumulates metrics incrementally from the attached services, reducing traffic to a fraction. The time resolution of the metrics depends only on the sampling frequency.
It is capable of processing at least 100 servers with thousands of requests per second on a small VM (peak usage 16 cores, average usage 4 cores, average RAM usage below 300MB, average incoming SSH traffic below 600kB/s, not including resource requirements for Prometheus itself).
"CSV-like" in this case means "space-separated, double-quote delimited format"; this piece of software was primarily developed for parsing access logs, but if needed, it can be extended to parse any CSV-based format that FastCsvParser can handle.
Metrics are exposed at `host:5000/metrics`.
The configuration format is defined as follows. Identifiers prefixed with a `$` are names that can be chosen freely. Anything enclosed in `[...]` is optional, and text enclosed in `<...>` describes the expected type. Please note that the tilde (`~`) is equivalent to `null`.
```
global:
  # The metrics' time-to-live.
  ttl: <seconds = 60>
  [background-resilience: <integer = 1>]   # how many "ttl" time spans to keep the metrics in background
                                           # after they exceed their ttl
  [long-term-resilience: <integer = 10>]   # how many "ttl" time spans to keep long-term metrics in background
                                           # after they exceed their ttl
  # If prefix is set, all metrics (including process metrics)
  # will be exposed as "prefix:metric-name".
  [prefix: <string>]
  [histograms: <buckets-list>]
  format: <metrics-list>
  [script: <script-name>]
  # If a script is given, but no reload-interval, it is executed only once at startup.
  [reload-interval: <seconds>]
  ssh:
    [<connection-settings>]

environments: <list-of-environments>

<buckets-list> :=
  # Numbers do not need to be ordered, an implicit "+Inf" will be added.
  # If the values are not set, the default will be used, which is
  # [.005, .01, .025, .05, .075, .1, .25, .5, .75, 1, 2.5, 5, 7.5, 10]
  $bucket_name_1: <~ | list-of-numbers>
  $bucket_name_2: <~ | list-of-numbers>
  ...

<metrics-list> :=
  $metric_name_1: <type>
  $metric_name_2: <type>

<type> :=
  # Example: clf_number + request_bytes_sent
  (number | clf_number | label | request_header) [+ $bucket_name]

<list-of-environments> :=
  $environment_name_1:
    hosts: <hostnames-or-ip-list>
    [connection: <connection-settings>]
  $environment_name_2:
    hosts: <hostnames-or-ip-list>
    [connection: <connection-settings>]
  ...

# Note that a few restrictions exist for the connection settings.
# 1. "file" and "user" are defined as required, but this only means
#    that they must be set either at the global "ssh" level or at
#    the environment level.
# 2. At least one of "password" or "pkey" must be set.
# 3. If one of the settings is not set explicitly on environment level,
#    the value is inherited from the global "ssh" level.
<connection-settings> :=
  file: <string>
  user: <string>
  [password: <string>]
  [pkey: <string>]
  [connect-timeout: <seconds = 30>]
  [read-timeout-ms: <ms = 60000>]
```
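To make the grammar more concrete, here is a minimal configuration sketch; the host names, user, key path and log file path are placeholders, not defaults:

```yaml
global:
  ttl: 60
  prefix: accesslog                      # metrics become "accesslog:metric-name"
  format:
    - remote_host: label
    - body_bytes_sent: clf_number
  ssh:
    file: /var/log/service/access.log    # placeholder: the log file to follow on each host
    user: scraper                        # placeholder user
    pkey: /home/scraper/.ssh/id_rsa      # placeholder private key path
environments:
  production:
    hosts:
      - web-01.example.com
      - web-02.example.com
  staging:
    hosts:
      - staging-01.example.com
    connection:
      user: deploy                       # placeholder: overrides the global ssh user
```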
The supported `type` values are:

- `number` for floating point values, exposed as a `counter` or as a `histogram`.
- `clf_number`, which behaves the same as `number`, except that a single `-` will be mapped to zero.
- `label` will use the column's value as a label.
- `request_header` expects an HTTP request header; a value of `POST /foo HTTP/1.1` will emit the labels `request_method="POST"`, `request_uri="/foo"` and `request_http_version="HTTP/1.1"`.
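The optional `+ $bucket_name` suffix ties a numeric column to a bucket list declared under `histograms`. A sketch (the bucket boundaries are arbitrary examples):

```yaml
global:
  histograms:
    request_bytes_sent: [100, 1000, 10000, 100000]      # example boundaries; "+Inf" is added implicitly
  format:
    - body_bytes_sent: clf_number + request_bytes_sent  # exposed as a histogram using the buckets above
```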
Each metric (except process metrics) will contain the label `environment`. Scalar metrics will be accumulated; successfully processed lines will be counted in `lines_parsed`, with its labels set to the CSV line's labels. If a line cannot be parsed, it will be counted in `parser_errors`; since the entry could not be reliably parsed at that point, it will only carry the `environment` label.
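As an illustration of what this looks like on `/metrics` (the prefix, label names and values depend entirely on your configuration, and the exact metric suffixes may differ):

```
accesslog:lines_parsed_total{environment="production",status="200",remote_host="198.51.100.7"} 1234.0
accesslog:parser_errors_total{environment="production"} 3.0
```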
For example, to parse access logs, you could use this:

```yaml
format:  # based on "%h %l %u %t \"%r\" %>s %b"
  - remote_host: label
  - ~                               # ignore remote logname
  - remote_user: label
  - ~                               # ignore timestamp
  - request_header: request_header  # special parser that emits the labels "request_http_version", "request_uri" and "request_method"
  - status: label
  - body_bytes_sent: clf_number     # maps a single dash to zero, otherwise behaves like "number"
```
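As a worked example (the line itself is made up), consider this input under the format above:

```
198.51.100.7 - frank "10/Oct/2023:13:55:36 +0000" "GET /index.html HTTP/1.1" 200 2326
```

This would set the labels `remote_host="198.51.100.7"`, `remote_user="frank"`, `request_method="GET"`, `request_uri="/index.html"`, `request_http_version="HTTP/1.1"` and `status="200"`, add 2326 to the `body_bytes_sent` counter, and increment `lines_parsed`; the logname and timestamp columns are ignored.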
Place your `scrapeconfig.yml` either in `/etc`, or provide the environment variable `SCRAPECONFIG` with a config file path; see here for a config file example, showing all of its features.
A Docker image, containing `python3` and `curl`, is available here.
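As a deployment sketch (the image name is a placeholder; adjust the mount to wherever your config lives), the exporter could be started like this:

```sh
docker run -d \
  -p 5000:5000 \
  -v /path/to/scrapeconfig.yml:/etc/scrapeconfig.yml \
  your-registry/scrapeconfig-exporter:latest
```

Alternatively, set `SCRAPECONFIG` to point at a different config path inside the container.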
Metrics track when they were last updated. If a metric doesn't change within the TTL specified in the config file (which defaults to 60 seconds), it will no longer be exposed via `/metrics`; this is the first phase of garbage collection and avoids excessive traffic. If a metric is in the first phase of garbage collection and doesn't receive an update for another `background-resilience` periods of the specified TTL, it will be fully evicted from processing.
A few metrics (namely, the `parser_errors` and `lines_parsed` metrics) are in "long-term mode"; they will be evicted after `long-term-resilience` periods of the TTL; the `connected` metric will never be evicted.
Practice has shown that this multi-phase metric garbage collection is strictly necessary to avoid excessive response sizes and to keep the process from being clogged up by dead metrics. It has no known serious impact on the metrics' values, though.
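With the default settings, and assuming a metric stops receiving updates entirely, the timeline works out as follows:

```yaml
global:
  ttl: 60                    # after 60s without updates: hidden from /metrics (first phase)
  background-resilience: 1   # after another 1 * 60s:  regular metrics are fully evicted (~120s total)
  long-term-resilience: 10   # after another 10 * 60s: lines_parsed / parser_errors are evicted (~660s total)
                             # the "connected" metric is never evicted
```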
Performance matters, and in most cases the exported metrics are not directly usable as-is. The following Prometheus recording rules have been tested in high-traffic situations and sped up Prometheus queries immensely. Adjust as necessary; these are only examples.
```yaml
groups:
  - name: access_logs_precompute
    interval: 5s
    rules:
      - record: "prefix:lines_parsed_per_second"
        expr: "irate(prefix:lines_parsed_total[1m])"
      - record: "prefix:body_bytes_sent_per_second"
        expr: "irate(prefix:body_bytes_sent_total[1m])"
      - record: "prefix:request_length_per_second"
        expr: "irate(prefix:request_length_total[1m])"
      - record: "prefix:request_time_per_second"
        expr: "irate(prefix:request_time_total[1m])"
```