If you would like to help me with a proper Russian-to-English translation, please feel free to send PRs or contact me directly. Thanks.
Checker is intended to be a universal daemon that performs periodic checks (health checks) of various IT systems,
sends alerts, and performs actions when a check status changes.
Configuration storage is implemented using the Koanf library.
By default, the configuration is loaded from the `config.yaml` file in the current directory.
You can use the fly.io free tier to run a Checker proof of concept or for personal use: https://fly.io/docs/speedrun/
You can find example configuration files in the `docs/examples` folder.
`google.yaml` is a very simple config that only checks google.com with log output and defines no alerting methods.
This configuration is used when running the default service on Heroku.
`bigconfig.yaml` contains a more complete example of health checks for various services, divided into two virtual projects.
The project CI pipeline includes a Docker image build step, which needs the `REGISTRY_LOGIN` and `REGISTRY_PASSWORD` secret variables to log in to Docker Hub.
`REGISTRY_LOGIN` should contain your Docker Hub login, and `REGISTRY_PASSWORD` your password or (better) a personal access token.
$ ./checker
Usage:
  checker [command]

Available Commands:
  check        run scheduler and execute checks
  completion   generate the autocompletion script for the specified shell
  gentoken     generate auth token
  help         Help about any command
  list         list config elements
  singlecheck  execute single check by UUID
  testcfg      unmarshal config file into config structure
  version      Print the version number of Checker

Flags:
  -b, --botsEnabled                 Whether to enable active bot (default true)
  -u, --checkUUID string            UUID to check with SingleCheck
  -c, --config string               config file
  -f, --configformat string         config file format (default "yaml")
  -s, --configsource string         config file source: file, consul, s3, env
  -w, --configwatchtimeout string   config watch period (default "5s")
  -D, --debugLevel string           Debug level: Debug,Info,Warn,Error,Fatal,Panic (default "warn")
  -h, --help                        help for checker
  -l, --logformat string            log format: text/json (default "text")
  -W, --watchConfig                 Whether to watch config file changes on disk (default true)

Use "checker [command] --help" for more information about a command.
The configuration file can be in any format supported by the Koanf library. Parameters can also be loaded from `CHECKER_*` environment variables (see the Koanf documentation).
The `-s` switch changes where the config is loaded from: instead of a file, it can come from Consul, S3, or the `CHECKER_CONFIG` environment variable.
For S3, settings are taken from these ENV variables:
- AWS_ACCESS_KEY_ID - key ID
- AWS_SECRET_ACCESS_KEY - secret key
- AWS_REGION - region
- AWS_BUCKET - bucket name
- AWS_OBJECT_KEY - path to the object from the bucket root
For Consul, two ENV variables are read: `CONSUL_ADDR`, the URL of the Consul server, and `CONSUL_PATH`, the path to the KV key that holds the config.
The KV key must contain the complete configuration in one of the formats `yaml`, `json`, `toml`, or `hcl` (set by the `-f` switch). Loading from a KV tree structure is not supported.
At every interval set by the `--configwatchtimeout` switch (`5s` by default), Checker tries to reread the config. If the config loads successfully, it is validated and compared with the current configuration.
If the config is valid and differs from the current configuration, it replaces the current configuration, and the scheduler and bots are restarted.
A config loaded from the file system is also automatically monitored for updates.
The configuration is loaded according to the template (json):
{
"defaults": {},
"actors": {},
"alerts": {},
"projects": {},
"consul_catalog": {}
}
Secret parameters (passwords, tokens) can be stored in HashiCorp Vault. At the moment, loading secrets is supported for Telegram bots, JWT authorization, SQL database passwords, and HTTP checks.
Format: `vault:secret/path/to/token:field`. The value of the field `field` at the path `secret/path/to/token` will be used as the secret.
Secrets retrieved from the Vault are cached for 5 minutes to reduce the load on the Vault.
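To make the reference format concrete, here is a small Python sketch that splits a `vault:<path>:<field>` string into its path and field. The names `VaultRef` and `parse_vault_ref` are hypothetical, and the rule of splitting on the last colon is an assumption based on the examples in this document, not a description of Checker's parser.

```python
from typing import NamedTuple, Optional


class VaultRef(NamedTuple):
    path: str   # e.g. "secret/path/to/token"
    field: str  # e.g. "field"


def parse_vault_ref(value: str) -> Optional[VaultRef]:
    """Return the Vault path and field, or None for a plain (non-Vault) value."""
    prefix = "vault:"
    if not value.startswith(prefix):
        return None
    # Assumption: the field name is everything after the last colon.
    path, sep, field = value[len(prefix):].rpartition(":")
    if not sep or not path or not field:
        raise ValueError(f"malformed vault reference: {value!r}")
    return VaultRef(path, field)
```

A plain value such as a literal password returns `None`, so a caller can pass every secret-bearing parameter through the same function.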
The `parameters` subblock of the `defaults` block describes the default check parameters. They are applied to the settings of projects unless they are overridden in the `parameters` block of a specific project.
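The override rule can be illustrated with a short Python sketch: project parameters win, and anything the project leaves out falls back to the defaults. The function name `effective_parameters` is hypothetical, and a flat key-by-key merge is an assumption; Checker may merge differently.

```python
def effective_parameters(defaults: dict, project: dict) -> dict:
    """Merge default parameters with a project's own parameters."""
    # Start from the defaults, then let the project's values override them.
    merged = dict(defaults.get("parameters", {}))
    merged.update(project.get("parameters", {}))
    return merged
```

For example, a project that sets only `check_period` keeps the default `mode`.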
The `http_port` parameter in the `defaults` block contains the default port for the HTTP server.
If the `PORT` environment variable is set, the port number from it is used instead.
check_period: 600s The frequency of testing and running alerts (in seconds).
// TODO check the feature
report_period: reportdisabled check report submission period
// TODO check the features
min_health: the minimum number of passing checks within a healthcheck that keeps the project out of critical status
allow_fails: the number of failed checks that are tolerated before an alert is sent to the critical channel
mode: notification mode. In loud mode alerts are sent to chats; in quiet mode they are only written to stdout.
noncrit_alert: the name of the notification method for non-critical alerts
crit_alert: the name of the notification method for critical alerts
command_channel: the name of the notification method used to receive bot commands (by default, the noncritical_channel parameter is used)
// TODO add certificate checking for all tls, not just https
ssl_expiration_period: how close to expiry an SSL certificate may get before it is reported during HTTP checks
mentions: whom to notify in alerts for this project. This lets chat participants keep the chat muted while a specific person is still notified about specific problems.
bots_enabled: whether to allow running the Telegram bot
An "actor" is an action that must be performed when the check status changes (actor_up/actor_down).
Actors (actions) are described in the `actors` block.
// TODO
Three types of notifications are supported: Telegram, Slack / Mattermost, and log.
The `alerts` block should contain sub-blocks with settings specific to each notification method:
// Common parameters
name: The name of the notification method
type: The type of notification method (log, telegram, slack or mattermost)
// telegram parameters
bot_token: token
noncritical_channel: Channel for non-critical notifications
critical_channel: Channel for critical alerts
// slack / mattermost parameters
mattermost_webhook_url: webhook url. Used for all types of alerts and ChatOps.
If there is no `alerts` block, all alerts are sent only to the log.
The `severity: critical` parameter may be set on any check to explicitly send its alerts to the critical alerts channel.
By sending messages to the bot, you can manage alerts and the checking mode of projects. The following commands are supported:
/qa as a regular chat message - completely disables all notifications (equivalent to quiet mode in the defaults block)
/la as a regular chat message - turns on all notifications (equivalent to loud mode in the defaults block)
Commands for managing alerts for a specific item:
The `/qp`, `/lp <project_name>` and `/qu`, `/lu <UUID>` commands control alerts for projects and for specific checks.
They can be sent as a regular chat message or as a reply to a specific alert.
When replying to an alert, the project name or check UUID is extracted from that alert.
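The command set above can be sketched as a small parser in Python. This is illustrative only: `BotCommand` and `parse_command` are hypothetical names, reading the `q`/`l` prefixes as quiet/loud for the project and check commands is an assumption by analogy with `/qa` and `/la`, and `replied_target` stands for whatever name or UUID was extracted from the alert being replied to.

```python
from typing import NamedTuple, Optional


class BotCommand(NamedTuple):
    mode: str               # "quiet" or "loud"
    scope: str              # "all", "project" or "check"
    target: Optional[str]   # project name or check UUID; None for scope "all"


_COMMANDS = {
    "/qa": ("quiet", "all"),
    "/la": ("loud", "all"),
    "/qp": ("quiet", "project"),
    "/lp": ("loud", "project"),
    "/qu": ("quiet", "check"),
    "/lu": ("loud", "check"),
}


def parse_command(text: str, replied_target: Optional[str] = None) -> Optional[BotCommand]:
    """Parse a chat message; return None if it is not a complete command."""
    parts = text.split()
    if not parts or parts[0] not in _COMMANDS:
        return None
    mode, scope = _COMMANDS[parts[0]]
    if scope == "all":
        return BotCommand(mode, scope, None)
    # An explicit argument wins; otherwise fall back to the replied-to alert.
    target = parts[1] if len(parts) > 1 else replied_target
    if target is None:
        return None
    return BotCommand(mode, scope, target)
```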
The `healthchecks` block must contain blocks describing check sets and, optionally, a `parameters` block.
These settings override the project-level and root-level settings.
Each set of checks has a name in the `name` field and a description of its checks in the `checks` block.
Checks of different types are supported (mandatory parameters are marked with a `*` sign).
- HTTP
- ICMP ping
- TCP port
- GetFile
- Database queries execution
- Database field age
- Database replication
- Redis (pub/sub)
*type: "http"
*url: URL to check (GET method)
code: a set of possible HTTP codes for a successful response (a list of integers, for example `[200,420]`; by default only 200)
answer: Text to search in the HTTP Body of the response
answer_present: check whether the text is present (by default "present") or not ("absent")
headers: An array of HTTP headers added to HTTP request:
{
"User-Agent": "custom_user_agent"
}
timeout: time to wait for a response
auth: block containing credentials if http basic authentication is required.
"auth": {
"user": "username",
"password": "S3cr3t!"
}
skip_check_ssl: do not check the validity of the server SSL certificate
stop_follow_redirects: do not follow HTTP redirects
cookies: an array of http.Cookie objects (you can pass any parameters from https://golang.org/src/net/http/cookie.go)
"cookies": [
{
"name": "test_cookie",
"value": "12345"
}
]
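How the `code`, `answer`, and `answer_present` parameters combine can be sketched in Python as follows. This is an assumption about the evaluation order (status code first, then body text), not Checker's actual code, and `http_check_passes` is a hypothetical name.

```python
from typing import Iterable, Optional


def http_check_passes(
    status: int,
    body: str,
    code: Iterable[int] = (200,),
    answer: Optional[str] = None,
    answer_present: str = "present",
) -> bool:
    """Decide whether an HTTP response satisfies the check parameters."""
    # The status code must be one of the accepted codes (only 200 by default).
    if status not in set(code):
        return False
    # Without an `answer` parameter the status code alone decides.
    if answer is None:
        return True
    found = answer in body
    if answer_present == "present":
        return found
    if answer_present == "absent":
        return not found
    raise ValueError(f"unknown answer_present value: {answer_present!r}")
```

With `answer_present` set to "absent", the check passes only when the text does not occur in the body, which is useful for catching error banners on otherwise successful pages.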
*type: "icmp"
*host: hostname or IP address to check
*timeout: time to wait for a response (compared to the average RTT for all attempts)
*count: number of requests sent
Checks that the port is open and responds within the timeout.
*type: "tcp"
*host: hostname or IP address to check
*port: TCP port number
*timeout: time to wait for a response
attempts: number of attempts to open the port (default 3)
Downloads the file and checks its MD5 hash.
Each file is downloaded to the local file system and deleted after verification. Take into account possible size restrictions of the underlying file system.
*type: "getfile"
*host: url from where to download the file
*hash: md5 hash to compare file to
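The hash comparison itself can be sketched in Python with the standard `hashlib` module. The function names are hypothetical, and the download step is left out; an in-memory stream stands in for the downloaded file. Reading in chunks keeps memory use flat for large files.

```python
import hashlib
import io
from typing import BinaryIO


def md5_of_stream(stream: BinaryIO, chunk_size: int = 64 * 1024) -> str:
    """Return the hex MD5 digest of a binary stream, read in chunks."""
    digest = hashlib.md5()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def hash_matches(stream: BinaryIO, expected_md5: str) -> bool:
    """Compare the stream's MD5 with the configured `hash` value."""
    return md5_of_stream(stream) == expected_md5.strip().lower()


# Usage with an in-memory stand-in for a downloaded file.
downloaded = io.BytesIO(b"hello")
print(hash_matches(downloaded, "5d41402abc4b2a76b9719d911017c592"))
```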
Checking execution of database queries (MySQL, PostgreSQL, Clickhouse)
*type: check type - mysql_query, pgsql_query, clickhouse_query
*host: database server address
port: port to connect (if omitted, default ports are used)
timeout: connection and request execution timeout (connection time and request time are checked separately)
*sql_query_config: contains query parameters
**dbname: database name
**username: username
**password: password
query: the query to execute. If omitted, `select 1` is executed and the response is not validated.
response: the expected value that the result returned by the database is compared against.
_one_ field is expected in the response. If omitted, only the success of the query is checked.
{
"type": "mysql_query",
"host": "192.168.132.101",
"port": 3306,
"timeout": "1s",
"sql_query_config": {
"username": "username",
"dbname": "dbname",
"password": "vault:secret/cluster/userA/pass:value",
"query": "select reg_date from users order by id asc limit 1;",
"response": "1278938100"
}
}
Checking the age of a record in the database (MySQL, PostgreSQL, Clickhouse). This check expects the query to return one field containing either an integer in Unix time format or a timestamp with time zone.
*type: check type - clickhouse_query_unixtime, mysql_query_unixtime, pgsql_query_unixtime, pgsql_query_timestamp
*host: database server address
port: port to connect (if omitted, default ports are used)
timeout: timeout for connection and request execution
*sql_query_config: contains query parameters
*dbname: database name
*username: username
*password: password
query: the query to execute. if omitted, `select 1` is executed and the response is not validated
difference: maximum difference from the current time. if omitted, no check is performed.
{
"type": "clickhouse_query_unixtime",
"host": "192.168.126.50",
"port": 9000,
"sql_query_config": {
"username": "username",
"dbname": "dbname",
"password": "she1Haiphae5",
"query": "select max (serverTime) from forex.quotes1sec",
"difference": "15m"
},
"timeout": "5s"
}
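The freshness test behind the `difference` parameter can be sketched in Python. Both function names are hypothetical; the duration parser handles only the simple `ms`/`s`/`m`/`h` forms seen in this document (such as "15m" or "5s"), which is an assumption about the accepted syntax.

```python
import re

_DURATION_PART = re.compile(r"(\d+)(ms|s|m|h)")
_SECONDS = {"ms": 0.001, "s": 1, "m": 60, "h": 3600}


def parse_duration(text: str) -> float:
    """Convert strings like '15m', '5s' or '1h30m' to seconds."""
    total, pos = 0.0, 0
    for match in _DURATION_PART.finditer(text):
        if match.start() != pos:
            raise ValueError(f"invalid duration: {text!r}")
        total += int(match.group(1)) * _SECONDS[match.group(2)]
        pos = match.end()
    if pos == 0 or pos != len(text):
        raise ValueError(f"invalid duration: {text!r}")
    return total


def record_is_fresh(record_unixtime: float, now_unixtime: float, difference: str) -> bool:
    """True if the record is no older than the allowed difference."""
    return now_unixtime - record_unixtime <= parse_duration(difference)
```

In a real check, `record_unixtime` would be the single value returned by the configured query and `now_unixtime` the current time.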
Checking that database replication is working (MySQL, PostgreSQL).
Checking algorithm: a record with a random `id` and `test_value` is inserted into the table on the leading server.
Values are selected in the range 1-5 for `id` and 1-9999 for `test_value`.
If the insert succeeds, Checker tries to read the value with the corresponding `id` from all the servers in the `serverlist` field.
If the result on a server matches `test_value`, replication on that server is considered working.
Configuration is similar to the query check, but with `tablename` and `serverlist` parameters instead of the query/response parameters.
`tablename` contains the name of the table into which the test record is inserted ("repl_test" by default). The `serverlist` block contains a list of servers to check.
For best results, include all servers in the cluster (including the leading one) in the list.
*type: check type - mysql_replication, pgsql_replication
Configuration examples:
{
"type": "pgsql_replication",
"host": "master.pgsql.service.staging.consul",
"port": 5432,
"sql_repl_config": {
"username": "username",
"dbname": "dbname",
"password": "ieb6aj2Queet",
"tablename": "repl_test",
"serverlist": [
"pgsql-main-0.node.staging.consul",
"pgsql-main-1.node.staging.consul",
"pgsql-main-2.node.staging.consul"
],
"lag": "5s"
}
}
- name: pgsql-main
  parameters:
    check_period: 60s
  checks:
    - type: pgsql_replication_status
      host: pgsql-master.db.local
      port: 5000
      sql_repl_config:
        dbname: checker
        username: checker
        password: vault:secret/local-pgsql/checker:pass
        lag: 3s
        analytic_replicas:
          - sd-156726
      severity: critical
The table with the following DDL should be created:
CREATE TABLE repl_test (
id int primary key,
test_value int,
timestamp timestamp with time zone default current_timestamp
);
In PostgreSQL 10+, Checker requires a special permission to analyze replication details without the superuser role:
GRANT pg_monitor TO checker;
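The insert-then-read algorithm used by the `mysql_replication` and `pgsql_replication` checks can be sketched in Python with in-memory stand-ins for the servers. `FakeServer` and `check_replication` are hypothetical names, no real database is involved, and overwriting an existing row with the same `id` is an assumption about how repeated runs behave.

```python
import random
from typing import Dict, Optional


class FakeServer:
    """In-memory stand-in for a database server holding the test table."""

    def __init__(self, rows: Optional[Dict[int, int]] = None):
        # Servers that share one dict behave like a replicated pair.
        self.rows = {} if rows is None else rows

    def write(self, row_id: int, test_value: int) -> None:
        self.rows[row_id] = test_value

    def read(self, row_id: int) -> Optional[int]:
        return self.rows.get(row_id)


def check_replication(
    primary: FakeServer,
    serverlist: Dict[str, FakeServer],
    rng: Optional[random.Random] = None,
) -> Dict[str, bool]:
    """Write a random test row on the primary and verify it on every server."""
    rng = rng or random.Random()
    row_id = rng.randint(1, 5)          # id range from the description above
    test_value = rng.randint(1, 9999)   # test_value range from the description
    primary.write(row_id, test_value)
    return {
        name: server.read(row_id) == test_value
        for name, server in serverlist.items()
    }
```

A real check would also need to allow some time for the row to propagate before reading it back; the sketch skips that.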
Checking database replication internals (PostgreSQL).
Checker uses the database's internal replication counters to detect replication lag. The allowed lag is set in the `lag` parameter of the `sql_repl_config` block.
After a successful subscription, Checker waits for any message (of a type other than Subscription/Pong) in each of the configured channels. When calculating the timeout for this kind of check, you must take into account:
- the time to connect to the server being checked
- the time to complete the subscription and wait for confirmation in the Subscription message
- the time to receive the data message
* type: check type - redis_pubsub
* host: server address
port: port to connect (if omitted, default ports are used)
timeout: timeout for connection and request execution
* pubsub_config: contains request parameters
* channels: the list of channel names to subscribe to
password: password
{
"type": "redis_pubsub",
"host": "master.redis.service.staging.consul",
"pubsub_config": {
"channels": [
"ticks_EURUSD",
"ticks_USBRUB"
]
},
"timeout": "5s"
}
If an active check is undesirable or impossible for some reason, a passive check lets you track the check status instead.
{
"name": "passive check of service A",
"type": "passive",
"timeout": "5m"
}
To refresh a passive check, send a GET request to the endpoint `http://checker/check/ping/<check uuid>`.
A list of all UUIDs can be obtained with a GET request to the endpoint `http://checker/listChecks`, or with the CLI command `checker list`.
To get the list via the web API, JWT authorization is required, [see](#web-api):
curl -H "Authorization: <token>" http://checker/listChecks
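A monitored service could build these two requests with Python's standard library, as in the sketch below. The function names are hypothetical and `http://checker` is the placeholder host used in this document. The requests are only constructed here; passing one to `urllib.request.urlopen` would send it.

```python
from urllib.parse import quote
from urllib.request import Request


def ping_request(base_url: str, check_uuid: str) -> Request:
    """Build the GET request that refreshes a passive check."""
    url = f"{base_url.rstrip('/')}/check/ping/{quote(check_uuid, safe='')}"
    return Request(url, method="GET")


def list_checks_request(base_url: str, token: str) -> Request:
    """Build the authorized GET request that lists all checks."""
    return Request(
        f"{base_url.rstrip('/')}/listChecks",
        headers={"Authorization": token},
        method="GET",
    )
```

A cron job or the service itself would send the ping more often than the configured `timeout` of the passive check.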
// TODO describe consul_catalog
Metrics in Prometheus format are published at the `/metrics` endpoint.
The `sched_*` metrics reflect the work of the internal scheduling cycle.
The `alerts_by_event_type` metric provides statistics on alerts broken down by event type.
The `events_by_*` metrics provide statistics on events broken down by project and check.
The `check_duration` metric provides statistics on the execution time of checks.
Some web endpoints require JWT authorization. A JWT token is generated using the CLI command `checker gentoken`.
The token is generated using the encryption key specified in the `defaults.token_encryption_key` configuration parameter or in an ENV variable (ENV has higher priority).
HashiCorp Vault is also supported.
The test token for the example config in docs/examples/google.yaml is:
eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJqdGkiOiJPaTNvb3hpZTRhaWtlaW1vb3pvOEVnYWk2YWl6OXBvaCIsImF1ZCI6ImFkbWluIn0.MhkG4ox_-OeVSrn9yexLjpMJoYLAhiROySByiUnq2Nk
/check/ping/<check-uuid>
- update passive check status
/check/status/<check-uuid>
- request the check status
/check/fire/<check-uuid>
- execute the check and return result
/listChecks
- returns all defined checks (requires auth).
Project: google
Healthcheck: tcp_test
Name:
UUID: 271099c2-fd93-5d39-9d58-de0a733921bb (mode loud)
Healthcheck: http checks
Name:
UUID: 654f00b3-b182-5cc7-bc8b-c61626a78314 (mode loud)
/alert
- webhook to fire alerts from other sources. Method POST, accepts json payload:
{"project":"my_cool_project", "text":"critical testalert", "severity":"info"}
/healthcheck
- Checker's own healthcheck URL. Returns code 200 and the text 'Ok!' if it works as expected.
/metrics
- Prometheus format metrics. Keep the security implications in mind: this endpoint is served without JWT authorization, because Prometheus does not allow custom headers in scrape configs.
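As an illustration of the `/alert` webhook listed above, the following Python sketch builds the POST request with the JSON payload shown in the example. The function name `alert_request` is hypothetical, `http://checker` is the placeholder host used in this document, and the `Content-Type` header is an assumption; the source only states that the endpoint accepts a JSON payload.

```python
import json
from urllib.request import Request


def alert_request(base_url: str, project: str, text: str, severity: str) -> Request:
    """Build the POST request that fires an alert from an external source."""
    payload = {"project": project, "text": text, "severity": severity}
    return Request(
        f"{base_url.rstrip('/')}/alert",
        data=json.dumps(payload).encode("utf-8"),
        # Assumption: the server expects a JSON content type.
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` would send the alert.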