Merge pull request #5 from vsaveliev/hotfix/alerts_messages_and_dashboards

Hotfix/alerts messages and dashboards
vsaveliev authored Nov 29, 2017
2 parents 5af17eb + aeb3874 commit 7a3b403
Showing 39 changed files with 1,338 additions and 335 deletions.
5 changes: 2 additions & 3 deletions inventory/group_vars/all.yml
@@ -44,7 +44,6 @@ k8s_prometheus_namespace: prometheus
# Additional Kubernetes namespaces
k8s_namespaces:
- dev
- '{{ k8s_prometheus_namespace }}'

# On-prem LB services
# List of services which use TCP LB for k8s masters/nodes
@@ -399,7 +398,7 @@ gce_credentials_file: '{{ ansible_env.HOME }}/gcloud.json'
gce_project_id: my-project-id

# Slack webhook URL for Prometheus alerts
prometheus_slack_api_url: ''
k8s_prometheus_slack_api_url: ''

# Slack channel for Prometheus alerts
prometheus_slack_channel: ''
k8s_prometheus_slack_channel: ''
1 change: 1 addition & 0 deletions playbooks/system/firewall.yml
@@ -16,6 +16,7 @@
- 10250/tcp # kubelet port
- 10255/tcp # kubelet port
- 4194/tcp # cAdvisor port
- 9100/tcp # Prometheus node-exporter port
nodes_ports:
- 10250/tcp # kubelet port
- 10255/tcp # kubelet port
127 changes: 102 additions & 25 deletions roles/prometheus/README.md
@@ -1,7 +1,9 @@
Prometheus role
=========
===============

This role installs Prometheus for Kubernetes cluster (endpoints, pods, nodes, istio, ...)
This role installs Prometheus for a Kubernetes cluster (endpoints, pods, nodes, istio, ...) with some basic alerts and dashboards.

[Official documentation](https://prometheus.io/docs/introduction/overview/)

[![Contributions Welcome](https://img.shields.io/badge/contributions-welcome-brightgreen.svg?style=flat)](https://github.com/k8s-community/cluster-deploy/issues)

@@ -13,36 +15,111 @@ No special requirements.
Role Variables
--------------

Available variables are listed below, along with default values (see `defaults/main.yml`):
All available parameters are listed in `defaults/main.yml`, along with their default values and a description of what each one is for. By default, all Prometheus components are created in the `prometheus` namespace (which is created if it does not exist). Node exporter runs on all nodes, including master nodes.

Prometheus version:
```yaml
k8s_prometheus_image_tag: v1.5.1
```
Domain name for prometheus (if it's empty so ingress object isn't created):
```yaml
k8s_prometheus_name: ''
```
How to create new alert
-----------------------

Domain name for prometheus alert manager (if it's empty so ingress isn't created):
```yaml
k8s_prometheus_alertmanager_name: ''
```
New alerts can be added to the `templates/alerts` directory (see the existing alerts for reference). For example:

Domain name for prometheus push gateway (if it's empty so ingress isn't created):
```yaml
k8s_prometheus_pushgateway_name: ''
```
    #
    # Alert when a deployment does not have enough replicas
    #
    - alert: DeploymentReplicasMismatch
      expr: (kube_deployment_spec_replicas != kube_deployment_status_replicas_available)
        or (kube_deployment_spec_replicas unless kube_deployment_status_replicas_unavailable)
      for: 5m
      labels:
        notify: sre
        severity: warning
      annotations:
        summary: "{{ $labels.kubernetes_namespace }}/{{ $labels.deployment }}: Deployment is failed"
        description: "{{ $labels.kubernetes_namespace }}/{{ $labels.deployment }}: Deployment is failed - observed replicas != intended replicas"

New alerts
--------------
It helps other people if each alert has a short description at the top. You can use different labels for alerts; we suggest following the recommendations in `templates/alerts/common.conf`.

New alerts can be added in templates/alerts directory.
If you create a new alerts file in `templates/alerts`, you also need to add a line with its file name to `templates/server.yaml`. Template:

New scrape configs
--------------
{% raw %}
#
# Some alerts for something
#
- name: some-alerts
rules:

#
# Alert on something
#
- alert: SomethingWrong
....

{% endraw %}

More details about alerts on: [Official documentation about alerts](https://prometheus.io/docs/alerting/rules/)
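
As a hedged illustration of the template above, a complete alerts file might look like the following sketch. The group name and the `TargetDown` alert are hypothetical and not part of this commit; only the `notify` and `severity` labels follow the role's existing example. `up` is the metric Prometheus records for every scrape target, with `job` and `instance` labels attached.

```yaml
{% raw %}
#
# Alerts about scrape targets (hypothetical example file)
#
- name: target-alerts
  rules:

  #
  # Alert when a scrape target has been unreachable for 5 minutes
  #
  - alert: TargetDown
    expr: up == 0
    for: 5m
    labels:
      notify: sre
      severity: warning
    annotations:
      summary: "{{ $labels.job }}/{{ $labels.instance }}: target is down"
      description: "{{ $labels.job }}/{{ $labels.instance }}: Prometheus has not been able to scrape this target for 5 minutes"
{% endraw %}
```

The `{% raw %}` / `{% endraw %}` pair keeps Ansible from interpreting the Prometheus `{{ ... }}` placeholders while rendering the template.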

What configs should app have
----------------------------

You will have to make some changes to your manifests / charts if you want to monitor your apps. The changes are described below.

Ingress should have:

    annotations:
      prometheus.io/probe: 'true'

With this annotation, the blackbox exporter probes your app over HTTPS and also checks SSL certificate expiration.

Service should have:

    annotations:
      prometheus.io/scrape: 'true'
      prometheus.io/probe: 'true'
      # defaults (set these values only if they should differ)
      prometheus.io/port: '8080'
      prometheus.io/path: '/metrics'

The blackbox exporter probes your app over HTTP to verify that other apps can reach it inside the Kubernetes cluster. Metrics are also scraped from each pod of your app, so that you can build custom alerts on them later. By default, the system monitors only 5XX HTTP codes for apps.
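
For orientation, a Service carrying these annotations could look roughly like the sketch below. The Service name, namespace, selector, ports and the non-default metrics port and path are placeholders chosen for illustration; only the annotation keys come from this README.

```yaml
# Hypothetical Service; name, namespace, selector and ports are placeholders
apiVersion: v1
kind: Service
metadata:
  name: my-app
  namespace: dev
  annotations:
    prometheus.io/scrape: 'true'
    prometheus.io/probe: 'true'
    # only needed because this example deviates from the defaults
    prometheus.io/port: '9090'
    prometheus.io/path: '/internal/metrics'
spec:
  selector:
    app: my-app
  ports:
    - name: http
      port: 80
      targetPort: 8080
```

The annotation values are quoted because Kubernetes annotations must be strings.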

How to create new scrape configs
--------------------------------

New scrape configs can be added to the `templates/scrape_configs` directory (see the existing scrape configs for reference). If you create a new file, you also need to add it to `templates/server.yaml`. Template:

{% raw %}
# A scrape configuration for something.
#
- job_name: some-thing
...

{% endraw %}

All details about scrape config on: [Official documentation about scrape configs](https://prometheus.io/docs/operating/configuration/#<scrape_config>)
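
A filled-in version of the template above might look like this minimal sketch. The job name and target address are hypothetical; `scrape_interval`, `metrics_path` and `static_configs` are standard fields of a Prometheus scrape config.

```yaml
{% raw %}
# A scrape configuration for a hypothetical exporter.
#
- job_name: some-exporter
  scrape_interval: 30s
  metrics_path: /metrics
  static_configs:
    - targets:
        - some-exporter.dev.svc:9101
{% endraw %}
```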

How to add dashboard in Grafana
-----------------------------------

1. Create a new dashboard manually via `Dashboards --> New` or find an existing one on [grafana.com](https://grafana.com/dashboards).
2. Export (download) it to your computer.
3. Copy the content of the downloaded JSON file.
4. Create a new file in the `templates/grafana-dashboards` directory. Template (`templates/grafana-dashboards/dashboard-template.json`):

{% raw %}
    {
      "dashboard": {
        ... <copied content data> ...
      },
      "overwrite": true,
      "inputs": [
        {
          "name": "DS_PROMETHEUS",
          "type": "datasource",
          "pluginId": "prometheus",
          "value": "prometheus"
        }
      ]
    }{% endraw %}

New scrape configs can be added in templates/scrape_configs directory.
5. Add a line with the file name to `templates/grafana.yaml`.

Example Playbook
----------------
65 changes: 51 additions & 14 deletions roles/prometheus/defaults/main.yml
@@ -1,45 +1,82 @@
---
# Kubernetes configs path
k8s_conf_dir: /etc/kubernetes
k8s_addons_dir: '{{ k8s_conf_dir }}/addons'
k8s_prometheus_dir: '{{ k8s_conf_dir }}/addons/prometheus'

# Master hosts names
k8s_master_hosts: {}

# Prometheus host names (for Prometheus, AlertManager, PushGateway, Grafana)
# Retention days (how long to store data)
k8s_prometheus_retention_days: 7

# Prometheus host name (it will be created if it doesn't exist)
k8s_prometheus_name: ''
# AlertManager host name (it will be created if it doesn't exist)
k8s_prometheus_alertmanager_name: ''
# PushGateway host name (it will be created if it doesn't exist)
k8s_prometheus_pushgateway_name: ''
# Grafana host name (it will be created if it doesn't exist)
k8s_prometheus_grafana_name: ''
# Blackbox exporter host name (it will be created if it doesn't exist)
k8s_prometheus_blackbox_name: ''

# Prometheus namespace
k8s_prometheus_namespace: default
# Prometheus namespace (it will be created if it doesn't exist)
k8s_prometheus_namespace: prometheus

# Prometheus images
# Prometheus image
k8s_prometheus_image: prom/prometheus
k8s_prometheus_image_tag: v2.0.0-rc.1
k8s_prometheus_image_tag: v2.0.0
# Alert manager image
k8s_prometheus_alertmanager_image: prom/alertmanager
k8s_prometheus_alertmanager_image_tag: v0.9.1
# Kube state metrics image
k8s_prometheus_ksm_image: gcr.io/google_containers/kube-state-metrics
k8s_prometheus_ksm_image_tag: v0.5.0
# Node exporter image
k8s_prometheus_nodeexport_image: prom/node-exporter
k8s_prometheus_nodeexport_image_tag: v0.15.0
k8s_prometheus_nodeexport_image_tag: v0.15.1
# Black box exporter image
k8s_prometheus_blackbox_image: prom/blackbox-exporter
k8s_prometheus_blackbox_image_tag: v0.10.0

# Prometheus additional components

# Push gateway image
k8s_prometheus_pushgateway_image: prom/pushgateway
k8s_prometheus_pushgateway_image_tag: v0.4.0
# Grafana image (+ grafana watcher)
k8s_prometheus_grafana_image: grafana/grafana
k8s_prometheus_grafana_image_tag: 4.5.2
k8s_prometheus_grafana_watcher_image: quay.io/coreos/grafana-watcher
k8s_prometheus_grafana_watcher_image_tag: v0.0.8

k8s_prometheus_blackbox_image: prom/blackbox-exporter
k8s_prometheus_blackbox_image_tag: v0.10.0

# Prometheus additional images
# Config reload image
k8s_prometheus_configreload_image: jimmidyson/configmap-reload
k8s_prometheus_configreload_image_tag: v0.1

# Prometheus scrape configs for Istio
k8s_prometheus_scrape_istio_metrics: false

# Alerts default route is stub
k8s_prometheus_alerts_default_route: 'null'

# Prometheus alert configs for Slack
prometheus_slack_api_url: ''
prometheus_slack_channel: ''
k8s_prometheus_slack_alerts_enabled: false
k8s_prometheus_slack_api_url: ''
k8s_prometheus_slack_channel: ''
k8s_prometheus_slack_message_title: '{% raw %}[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}]{{.CommonAnnotations.summary}}{% endraw %}'
# Be careful with tabs: they can break the structure of the YAML file
k8s_prometheus_slack_message_body: |
{% raw %}{{ range .Alerts }}
*Alert:* {{ .Annotations.summary }} - `{{ .Labels.severity }}`
*Description:* {{ .Annotations.description }}
*Details*:
{{ range .Labels.SortedPairs }} • {{ .Name }}: `{{ .Value }}`
{{ end }}
{{ end }}{% endraw %}
# Prometheus alert configs for Telegram
k8s_prometheus_telegram_alerts_enabled: false
k8s_prometheus_telegram_webhook: ''

# Prometheus scrape configs for Cockroachdb
k8s_prometheus_scrape_cockroachdb_metrics: false
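
To show how these defaults might be overridden, here is a minimal sketch of inventory settings that turn on Slack notifications. The webhook URL and channel name are placeholders, not real values; the variable names are the renamed `k8s_prometheus_*` ones introduced by this commit.

```yaml
# inventory/group_vars/all.yml (placeholder values for illustration)
k8s_prometheus_slack_alerts_enabled: true
k8s_prometheus_slack_api_url: 'https://hooks.slack.com/services/XXX/YYY/ZZZ'
k8s_prometheus_slack_channel: '#alerts'
```

With `k8s_prometheus_alerts_default_route` left at its default, alerts that match no specific route still go to the `null` receiver.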
9 changes: 5 additions & 4 deletions roles/prometheus/tasks/main.yml
@@ -5,21 +5,22 @@
state: directory
mode: 0755
with_items:
- '{{ k8s_addons_dir }}'
- '{{ k8s_prometheus_dir }}'

- name: Prometheus
template:
src: "{{ item }}"
dest: "{{ k8s_addons_dir }}/{{ item }}"
dest: "{{ k8s_prometheus_dir }}/{{ item }}"
with_items:
- config.yaml
- prometheus.yaml

- name: Deploy script for Prometheus
template:
src: deploy-prometheus.sh
dest: "{{ k8s_addons_dir }}/deploy-prometheus.sh"
dest: "{{ k8s_prometheus_dir }}/deploy-prometheus.sh"
mode: 0755

- name: Run deploy script for Prometheus
command: "{{ k8s_addons_dir }}/deploy-prometheus.sh"
command: "{{ k8s_prometheus_dir }}/deploy-prometheus.sh"
when: inventory_hostname in k8s_master_hosts[0]
62 changes: 45 additions & 17 deletions roles/prometheus/templates/alert-manager.yaml
@@ -10,30 +10,58 @@ metadata:
name: prometheus-alertmanager
data:
alertmanager.yml: |-
{% if k8s_prometheus_slack_alerts_enabled %}
global:
slack_api_url: '{{ prometheus_slack_api_url }}'
slack_api_url: '{{ k8s_prometheus_slack_api_url }}'
{% endif %}
receivers:
- name: slack-receiver
- name: 'null'
{% if k8s_prometheus_slack_alerts_enabled %}
- name: slack
slack_configs:
- channel: '{{ prometheus_slack_channel }}'
- channel: '{{ k8s_prometheus_slack_channel }}'
send_resolved: true
{% raw %}
title: '[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}]{{.CommonAnnotations.summary}}'
title: '{{ k8s_prometheus_slack_message_title }}'
text: >-
{{ range .Alerts }}
*Alert:* {{ .Annotations.summary }} - `{{ .Labels.severity }}`
*Description:* {{ .Annotations.description }}
*Details*:
{{ range .Labels.SortedPairs }} • {{ .Name }}: `{{ .Value }}`
{{ end }}
{{ end }}
{% endraw %}
{{ k8s_prometheus_slack_message_body }}
{% endif %}
{% if k8s_prometheus_telegram_alerts_enabled %}
- name: telegram
webhook_configs:
- send_resolved: True
url: {{ k8s_prometheus_telegram_webhook }}
{% endif %}
route:
group_wait: 10s
group_interval: 5m
receiver: slack-receiver
repeat_interval: 3h
group_interval: 1h
receiver: {{ k8s_prometheus_alerts_default_route }}
repeat_interval: 4h
routes:
- receiver: 'null'
match:
alertname: DeadMansSwitch
{% if k8s_prometheus_slack_alerts_enabled %}
- receiver: slack
match_re:
notify: ^sre|dev$
continue: true
{% endif %}
{% if k8s_prometheus_telegram_alerts_enabled %}
- receiver: telegram
match:
notify: sre
{% endif %}
---
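
As a rough sketch of what the updated template could render to, assume Slack alerts are enabled, Telegram alerts are disabled, and the default route is left at `'null'`. The webhook URL and channel are placeholders, the message body is omitted for brevity, and where this view shows two values for one key the later one is assumed to be the new value.

```yaml
# Hypothetical rendered alertmanager.yml (Slack enabled, Telegram disabled)
global:
  slack_api_url: 'https://hooks.slack.com/services/XXX/YYY/ZZZ'
receivers:
  - name: 'null'
  - name: slack
    slack_configs:
      - channel: '#alerts'
        send_resolved: true
        title: '[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}]{{.CommonAnnotations.summary}}'
        # text: rendered from k8s_prometheus_slack_message_body (omitted here)
route:
  group_wait: 10s
  group_interval: 1h
  # quoted here so YAML reads it as the receiver name rather than a null value;
  # the template itself emits the variable unquoted
  receiver: 'null'
  repeat_interval: 4h
  routes:
    - receiver: 'null'
      match:
        alertname: DeadMansSwitch
    - receiver: slack
      match_re:
        notify: ^sre|dev$
      continue: true
```

In this sketch, an alert labelled `notify: sre` or `notify: dev` goes to Slack, `DeadMansSwitch` is silenced through the `null` receiver, and everything else falls through to the default receiver.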

@@ -149,7 +177,7 @@ spec:
- name: storage-volume
emptyDir: {}

{% if k8s_prometheus_alertmanager_name is defined and k8s_prometheus_alertmanager_name != '' %}
{% if k8s_prometheus_alertmanager_name != '' %}
---

apiVersion: extensions/v1beta1