diff --git a/README.md b/README.md
index 79822ec..6a999e9 100644
--- a/README.md
+++ b/README.md
@@ -26,19 +26,18 @@
 # loghost-boshrelease
 
 This is a BOSH release to gather, store and analyze syslog events forwarded by bosh VMs. It
-currently use [RSyslog](rsyslog) which is pre-installed by the stemcell.
+currently uses [RSyslog], which is pre-installed by the stemcell.
 
 Only Linux stemcells are supported at the moment.
 
-
 ## Introduction
 
-Usually, platform logs are sent to ELK stacks which store and index events on-the-fly. Finally
-users can build fancy Kibana dashboards by extracting metrics from elasticsearch queries.
+Usually, platform logs are sent to ELK stacks which store and index events on-the-fly.
+Users can then build fancy Kibana dashboards by extracting metrics from Elasticsearch queries.
 
-With the development of micro-services architectures, the number of emitted logs recently exploded
-making these ELKs very hardware and therefore money consuming. Even more, theses stacks are often
-built with heavy redundancy and high availability even when most of emitted events are not critical.
+With the development of micro-services architectures, the number of emitted logs has recently exploded,
+making these ELK stacks hardware-intensive and, therefore, expensive.
+Moreover, these stacks are often built with heavy redundancy and high availability even when most of the emitted events are not critical.
 
-The idea here is having a much more lightweight architecture, providing only the most essential
+The idea here is to have a much more lightweight architecture, providing only the most essential
 features of log processing:
@@ -46,10 +45,9 @@ features of log processing:
 - hardware-efficient generation of metrics
 - redundancy and availability matching the actual criticality of the logs
 
-
-This is achieved by using both good old technologies such as [RSyslog][rsyslog] and modern
-tools like [prometheus]. The bridge between logs and metrics is provided by a brilliant tool
-[grok_exporter].
+This is achieved by using both good old technologies such as [RSyslog] and modern
+tools like [Prometheus].
+The bridge between logs and metrics is provided by a brilliant tool, [grok_exporter].
 
 ## Components
 
@@ -68,39 +66,32 @@ generated by the [syslog-release] generally used to forward VM log events to a g
 
-Received logs are stored on persistent disk in root directory `/var/vcap/store/loghost`
-where `{path}` depends on parsed [Structured Data ID][structured-data] fields of the event.
+Received logs are stored on persistent disk under the root directory `/var/vcap/store/loghost`,
+at a `{path}` that depends on parsed [Structured Data ID][structured-data] fields of the event.
 
-
 Assuming logs are forwarded by [syslog-release], the parsed fields are:
 
-- `$.director`: the configured name of bosh director
-- `$.deployment`: the name of deployment from which event was sent
-- `$.group`: the name of the instance from which event was sent
+- `$.director`: the configured name of the bosh director
+- `$.deployment`: the name of the deployment from which the event was sent
+- `$.group`: the name of the instance from which the event was sent
 
-
 Finally, logs are stored under `/var/vcap/store/loghost/{$.director}/{$.deployment}/{$.group}.log`
 
-
 #### Rotation
 
-The job also configure local `logrotate` in order to rotate and compress logs every hours. Rotated
-logs are stored in the same directories with the `-%Y%m%d%H.gz` suffix.
+The job also configures local `logrotate` in order to rotate and compress logs every hour.
+Rotated logs are stored in the same directories with the `-%Y%m%d%H.gz` suffix.
 
-The number of kept rotations can be configured `loghost_concentrator.logrotate.max-hours` property
-with a default value of `360` (ie: 15 days)
-
+The number of kept rotations can be configured via the `loghost_concentrator.logrotate.max-hours` property,
+with a default value of `360` (i.e. 15 days).
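+
+For illustration, the rotation depth could be tuned from a deployment manifest along these
+lines (the instance group layout and the release name `loghost` are assumptions, not taken
+from this README; only the property path comes from the job spec):
+
+```yml
+instance_groups:
+- name: loghost
+  jobs:
+  - name: loghost_concentrator
+    release: loghost             # assumed release name
+    properties:
+      loghost_concentrator:
+        logrotate:
+          max-hours: 72          # keep 3 days of hourly rotations instead of the default 360
+```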
 
 #### Forwarding and clustering
 
-The job also provides the possibility to re-forward received syslog event under specified
-conditions. This can useful for:
-- Clusterize multiple concentrators in order the create a kind of backup across independent BOSH
-  directors
-- Forward business or security critical events to external log handling platform
+The job also provides the possibility to re-forward received syslog events under specified
+conditions; this can be useful for:
+- clustering multiple concentrators in order to create a kind of backup across independent BOSH directors
+- forwarding business or security critical events to an external log handling platform
 
 ![clustering]
 
-
-Forwarding is configured from the `loghost_concentrator.syslog.forward` property by defined
-target objects as follow:
-
+Forwarding is configured from the `loghost_concentrator.syslog.forward` property by defining
+target objects as follows:
 
 ```yml
 <name>:
@@ -144,8 +135,7 @@ jobs:
         transport: tcp
 ```
 
-
-### Dns
+### DNS
 
 Assuming that your deployment uses [bosh-dns], the job `loghost_dns` can be used to define new
 aliases.
@@ -172,13 +162,13 @@ jobs:
 
 ### Exporter
 
 The `loghost_exporter` job installs and configures the [grok_exporter]. This brilliant program
-processes log files and computes [prometheus] metrics according to parse rules given in
+processes log files and computes [Prometheus] metrics according to parse rules given in
 [grok] format.
 
 Parsing rules are defined by the `loghost_exporter.metrics` key with the exact same syntax defined
-by the [grok_exporter].
+by the [grok_exporter-metrics].
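+
+For illustration, a single counter metric defined under that key could look like the sketch
+below (the metric name, match pattern and label are invented for the example; the `type`,
+`name`, `help`, `match` and `labels` fields follow the grok_exporter metrics syntax):
+
+```yml
+properties:
+  loghost_exporter:
+    metrics:
+    - type: counter
+      name: ssh_failed_logins_total                # hypothetical metric name
+      help: Count of failed SSH logins.
+      match: 'Failed password for %{USERNAME:user}'
+      labels:
+        user: '{{.user}}'                          # label value from the captured grok field
+```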
 
-In addition, `loghost_exporter.directors` and `loghost_exporter.deployments` keys must by configured
-to give the list of logs files that the exported should watch.
+In addition, the `loghost_exporter.directors` and `loghost_exporter.deployments` keys must be configured
+to give the list of log files that the exporter should watch.
 
 > **Note**: A limitation in the grok_exporter implementation forces watched directories to pre-exist
@@ -186,10 +176,9 @@ to give the list of logs files that the exported should watch.
 > job creates required directories in its `pre-start` script.
 
 In addition to user defined metrics, the exporter provides
-[builtin metrics][grok-builtin-metrics]
-
-Ops-files provided in the release also provies define metrics, as describe in [usage section](#usage)
+[builtin metrics][grok-builtin-metrics].
+
+Ops-files provided in the release also provide metrics, as described in the [usage section](#usage).
 
 ### Alerts
 
@@ -209,17 +198,17 @@ the following alerts:
-- `SecurityTooManyDiegoSshSuccess`: triggers when `ssh_proxy` component running on (`scheduler`
-  instance) reports too many SSH authentications to containers
+- `SecurityTooManyDiegoSshSuccess`: triggers when the `ssh_proxy` component (running on the
+  `scheduler` instance) reports too many SSH authentications to containers
 
-Alerts thresholds and evaluation time can be configured from job's spec.
+Alert thresholds and evaluation time can be configured from the job's spec.
 
 ### Dashboards
 
-The job `loghost_dashboards` adds [grafana] dashboards for your [prometheus-boshrelease]
+The job `loghost_dashboards` adds [Grafana] dashboards for your [prometheus-boshrelease]
 deployment.
 
 - a global overview giving the system status, number of processed logs per rules, deployments
   and instances
-- a security dashboard overview giving informations on authentications when
+- a security dashboard overview giving information on authentications when
   `loghost_dashboards.security.enabled` key is enabled.
 
@@ -239,8 +228,7 @@ It will add the instance `loghost` with basic features enabled:
 
 ### Step 2: Forward all logs to loghost instance
 
-The simplest way to forward all logs at once is to create a `runtime-config.yml` using the
-[syslog-release]
+The simplest way to forward all logs at once is to create a `runtime-config.yml` using the [syslog-release].
 
 With file `runtime-syslog-forward.yml`:
@@ -268,10 +256,8 @@ releases:
 
 Upload to bosh director:
 `bosh update-runtime-config --name syslog-forward runtime-syslog-forward.yml`
 
-
 ### Step 3: Add alerts and dashboard to prometheus
 
-
 Add the following ops-files to your prometheus deployment:
 
 - `manifests/operations/prometheus/loghost-enable.yml`
@@ -281,8 +267,7 @@ It will:
 
 - define scrape config based on `bosh_exporter` discovery
 - define new alerts
-- add dashboards to grafana
-
+- add dashboards to Grafana
 
 ## Reference
 
@@ -296,7 +281,7 @@ It will:
 | loghost-exporter-enable.yml | add `loghost_exporter` job which spawns `grok_exporter` with a default set of metrics |
 | loghost-exporter-enable-security.yml | add security metrics to `loghost_exporter` job, grok rules for `uaa` and `audispd` |
 | prometheus/loghost-enable.yml | add discovery scraping of `grok_exporter`, default alerts and dashboards |
-| prometheus/loghost-enable-security.yml | add security alerts and dashboards  |
+| prometheus/loghost-enable-security.yml | add security alerts and dashboards |
 
 ### Metrics
 
@@ -324,8 +309,8 @@ With dimension values:
   from where the log was originally emitted
 - `source`: the `exe` field of type=`USER.*` message of `audispd`
 - `ip`: the remote address from which the authentication was attempted
-- `clientid`: the `clientid` used to authenticate client on `UAA`
-- `username`: the `username` used to authenticate user on `UAA`
+- `clientid`: the `clientid` used to authenticate a client on `UAA`
+- `username`: the `username` used to authenticate a user on `UAA`
 
-> **(*) Tech note**: Because metrics dimensions values are created over time depending on encountered
-> logs, we cannot rely on `rate` or `increase` prometheus function to compute the number of failures
+> **(*) Tech note**: Because metric dimension values are created over time depending on encountered
+> logs, we cannot rely on the `rate` or `increase` Prometheus functions to compute the number of failures
@@ -337,24 +322,22 @@ With dimension values:
 > sum(<metric> offset 5m or <metric>{} * 0) by (<labels>)
 > ```
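+
+For illustration, an alert built on this kind of expression might look like the sketch below
+(the alert name, metric name and threshold are hypothetical; only the offset-or-zero pattern
+comes from the note above):
+
+```yml
+- alert: ExampleTooManyAuthFailures
+  expr: |
+    # current sum minus the sum 5 minutes ago, falling back to 0 for
+    # series that did not exist 5 minutes ago
+    sum(security_auth_failure_total) by (director)
+      - sum(security_auth_failure_total offset 5m or security_auth_failure_total * 0) by (director)
+      > 10
+  for: 5m
+```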
 
-
-
-[rsyslog]: http://www.rsyslog.com/
+[RSyslog]: http://www.rsyslog.com/
 [RFC5424]: https://tools.ietf.org/html/rfc5424
 [syslog-release]: https://github.com/cloudfoundry/syslog-release#format
-[grafana]: https://grafana.com/
+[Grafana]: https://grafana.com/
 [prometheus-boshrelease]: https://github.com/bosh-prometheus/prometheus-boshrelease
 [grok-builtin-metrics]: https://github.com/fstab/grok_exporter/blob/master/BUILTIN.md
 [structured-data]: https://tools.ietf.org/html/rfc5424#section-6.3.2
 [clustering]: ./doc/clustering.png
 [rainerscript]: https://www.rsyslog.com/doc/v8-stable/rainerscript/index.html
-[bosh-dns]: https://www.rsyslog.com/doc/v8-stable/rainerscript/index.html
+[bosh-dns]: https://github.com/cloudfoundry/bosh-dns-release
 [bosh-dns-job]: https://github.com/cloudfoundry/bosh-dns-release/blob/master/jobs/bosh-dns/spec
 [grok_exporter]: https://github.com/fstab/grok_exporter
-[prometheus]: https://prometheus.io/
+[Prometheus]: https://prometheus.io/
 [grok]: https://www.elastic.co/guide/en/logstash/current/plugins-filters-grok.html
-[grok_exporter]: https://github.com/fstab/grok_exporter/blob/master/CONFIG.md#metrics-section
+[grok_exporter-metrics]: https://github.com/fstab/grok_exporter/blob/master/CONFIG.md#metrics-section
diff --git a/jobs/loghost_alerts/templates/loghost.alerts.yml b/jobs/loghost_alerts/templates/loghost.alerts.yml
index 445cd21..5e065f3 100644
--- a/jobs/loghost_alerts/templates/loghost.alerts.yml
+++ b/jobs/loghost_alerts/templates/loghost.alerts.yml
@@ -3,7 +3,7 @@ groups:
   rules:
   - alert: LoghostNoLogReceived
     expr: |
-      sum(increase(loghost_total[15m])) by (director) == 0
+      sum(increase(loghost_total[15m])) by (director, instance) == 0
     for: <%= p('loghost_alerts.nologs.evaluation_time') %>
     labels:
       service: loghost
@@ -24,7 +24,7 @@ groups:
       - contact Cloud Foundry administrator team
   - alert: LoghostNotEnoughSources
     expr: |
-      count(sum(loghost_total) by (director)) != <%= p('loghost_alerts.nologs.directors').length() %>
+      count(sum(loghost_total) by (director, instance)) != <%= p('loghost_alerts.nologs.directors').length() %>
     for: <%= p('loghost_alerts.nologs.evaluation_time') %>
     labels:
      service: loghost
@@ -180,7 +180,7 @@ groups:
    annotations:
      summary: "Number of Diego SSH authentication {{ $value }} is higher than <%= p('loghost_alerts.security.diego_ssh.success.threshold') %> in the last 5 minutes"
      description: |
-        Too many Diego SSH authentication were detected with user {{ $labels.user }} in the last 5 minutes
+        Too many Diego SSH authentications were detected with user {{ $labels.username }} in the last 5 minutes
        Number of authentication `{{ $value }}` is higher than configured threshold of `<%= p('loghost_alerts.security.diego_ssh.success.threshold') %>`
        Details: