Skip to content

[Web service] prometheus metrics exporter

HouzuoGuo edited this page Feb 24, 2021 · 8 revisions

Introduction

Hosted by laitos web server, the endpoint serves metrics information collected from the following sources in the prometheus-exporter format:

  • All web service handlers: time to first byte, processing duration, size of response.
  • Program resource usage: CPU time consumed, number of context switches, time spent on run queue and wait queue.
  • All web proxy requests: time to first byte, connection duration, size of response.

Configuration

Under the JSON key HTTPHandlers, add a string property called PrometheusMetricsEndpoint, value being the URL location of the service.

Keep the location a secret to yourself and make it difficult to guess. Here is an example:

{
    ...

    "HTTPHandlers": {
        ...

        "PrometheusMetricsEndpoint": "/my-precious-metrics",

        ...
    },

    ...
}

Run

Modify the laitos program launch command by adding the parameter -prominteg to it. The parameter works as the master switch to turn on all points of integration with prometheus:

sudo ./laitos -prominteg -config <CONFIG FILE> -daemons ...,httpd,...

The service is hosted by web server, therefore remember to run web server.

Usage

Install prometheus and edit its configuration file (often located at /etc/prometheus/prometheus.yml), tell prometheus to periodically download the exporter data from this endpoint:

...
scrape_configs:
    - job_name: 'laitos'
      scrape_interval: 20s
      scrape_timeout: 5s
      scheme: https # or https
      metrics_path: '/my-precious-metrics'
      static_configs:
          - targets: ['laitos-server.example.com:443', 'another-laitos-server.example.com:80']

Tips

Visit prometheus web UI (or Grafana dashboard if they are integrated), and try out the following equations for plotting program resource usage:

  • Percentage of involuntary context switches, 3-minutes running average: (sum(rate(laitos_proc_num_involuntary_switches[3m])) by (instance) / (sum(rate(laitos_proc_num_involuntary_switches[3m])) by (instance) + sum(rate(laitos_proc_num_voluntary_switches[3m])) by (instance))) * 100
  • Seconds of CPU time spent by laitos server (including children) in user and kernel mode, 3-minutes running average: sum(rate(laitos_proc_num_kernel_mode_sec_incl_children[3m]) + rate(laitos_proc_num_user_mode_sec_incl_children[3m])) by (instance)
  • Percentage of time spent as runnable according to OS scheduler (higher is better), 3-minutes running average: (sum(rate(laitos_proc_num_run_sec[3m])) by (instance) / (sum(rate(laitos_proc_num_run_sec[3m])) by (instance) + sum(rate(laitos_proc_num_wait_sec[3m])) by (instance))) * 100

And try out these for plotting web server stats:

  • Time-to-first-byte across all handlers at 95% quantile, 3-minutes running average: histogram_quantile(0.95, sum(rate(laitos_httpd_response_time_to_first_byte_seconds_bucket[3m])) by (le, instance))
  • Processing duration (including IO) across all handlers at 95% quantile, 3-minutes running average: histogram_quantile(0.95, sum(rate(laitos_httpd_handler_duration_seconds_bucket[3m])) by (le, instance))
  • Size of HTTP response across all handlers at 95% quantile, 3-minutes running average: histogram_quantile(0.95, sum(rate(laitos_httpd_response_size_bytes_bucket[3m])) by (le, instance))

And try out these for plotting web proxy stats:

  • Number of proxy requests per minute, 1-minute running average: sum(rate(laitos_httpproxy_response_size_bytes_count[1m])) by (instance)
  • Bytes transferred to proxy clients per minute, 1-minute running average: sum(rate(laitos_httpproxy_response_size_bytes_sum[1m])) by (instance)
  • Top 10 proxy destinations by data transfer (total MBs over 3hrs): topk(10, sum by (host) (rate(laitos_httpproxy_response_size_bytes_sum[180m]))) * 180 * 60 / 1048576
  • Top 10 proxy destinations by num of connections (total over 3 hours): topk(10, sum by (host) (rate(laitos_httpproxy_response_size_bytes_count[180m]))) * 180 * 60
  • Top 10 proxy destinations by connection duration (total seconds over 3 hours): topk(10, sum by (host) (rate(laitos_httpproxy_handler_duration_seconds_sum[180m]))) * 180 * 60
  • Size of proxy response across all destinations at 90% quantile, 3-minutes running average: histogram_quantile(0.90, sum(rate(laitos_httpproxy_response_size_bytes_bucket[3m])) by (le, instance))
  • Time-to-first-byte across all proxy destinations at 50% quantile, 3-minutes running average: histogram_quantile(0.50, sum(rate(laitos_httpproxy_response_time_to_first_byte_seconds_bucket[3m])) by (le, instance))
  • Processing duration (including IO) across all proxy destinations at 50% quantile, 3-minutes running average: histogram_quantile(0.50, sum(rate(laitos_httpproxy_handler_duration_seconds_bucket[3m])) by (le, instance))
Clone this wiki locally