[monitoring] Rewrite CPU usage rule to improve accuracy (#159351)
Fixes #116128

# Summary

This PR changes how the CPU Usage Rule calculates the usage percentage for containerized clusters.

Based on the comment [here](#116128 (comment)), my understanding of the issue is that because we were using a `date_histogram` to grab the values, we could sometimes run into issues with how `date_histogram` rounds the time range and aligns it towards the start rather than the end, causing the last bucket to be incomplete. This is aggravated by the fact that we make the fixed duration of the histogram the size of the lookback window.

I took a slightly different path for the rewrite: rather than using the derivative, I look at the usage across the whole range using a simple delta. This has a glaring flaw in that it cannot account for the limits changing within the lookback window (going higher/lower, or being set/unset), which we will have to address in #160905. The changes in this PR should improve the other cases, and the rule now makes it clear when the limits have changed by firing alerts.

#160897 outlines follow-up work to align how the CPU usage is presented in other places in the UI.
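To make the delta approach concrete, below is a hedged sketch (not the PR's actual implementation) of what a single search for the counter extremes could look like: `min`/`max` aggregations over the whole lookback window replace the `date_histogram` + `derivative` pipeline, so there are no buckets to misalign. The field names (`node_stats.os.cgroup.*`) follow the Elasticsearch node-stats cgroup metrics, but treat them, the `type` filter, and the lookback value as assumptions.

```typescript
// Sketch only: grab the earliest/latest counter values for the lookback
// window in one request, avoiding date_histogram bucket alignment issues.
// Field names are assumptions based on Elasticsearch node stats cgroup
// metrics, not taken from the PR diff.
const lookback = '5m'; // hypothetical lookback window

const searchBody = {
  size: 0,
  query: {
    bool: {
      filter: [
        { term: { type: 'node_stats' } },
        { range: { timestamp: { gte: `now-${lookback}` } } },
      ],
    },
  },
  aggs: {
    min_usage: { min: { field: 'node_stats.os.cgroup.cpuacct.usage_nanos' } },
    max_usage: { max: { field: 'node_stats.os.cgroup.cpuacct.usage_nanos' } },
    min_periods: {
      min: { field: 'node_stats.os.cgroup.cpu.stat.number_of_elapsed_periods' },
    },
    max_periods: {
      max: { field: 'node_stats.os.cgroup.cpu.stat.number_of_elapsed_periods' },
    },
  },
};
```

The usage delta is then `max_usage - min_usage` and the period delta is `max_periods - min_periods`, which plug into the CFS formula explained in the section below.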
# Screenshots

**Above threshold:**
<img width="1331" alt="above-threshold" src="https://github.com/elastic/kibana/assets/2564140/4dc4dc2a-a858-4022-8407-8179ec3115df">

**Failed to compute usage:**
<img width="1324" alt="failed-to-compute" src="https://github.com/elastic/kibana/assets/2564140/88cb3794-6466-4881-acea-002a4f81c34e">

**Limits changed:**
<img width="2082" alt="limits-changed" src="https://github.com/elastic/kibana/assets/2564140/d0526421-9362-4695-ab00-af69aa9838c9">

**Limits missing:**
<img width="1743" alt="missing-resource-limits" src="https://github.com/elastic/kibana/assets/2564140/82626968-8b18-453d-9cf8-8a6776a6a46e">

**Unexpected limits:**
<img width="1637" alt="unexpected-resource-limits" src="https://github.com/elastic/kibana/assets/2564140/721deb15-d75b-4915-8f77-b18d0b33da7d">

# CPU usage for the Completely Fair Scheduler (CFS) for Control Groups (cgroup)

CPU usage for containers is calculated with this formula:

`execution_time / (time_quota_per_schedule_period * number_of_periods)`

Execution time is a counter of how many cycles the container was allowed to execute by the scheduler; the quota is the limit of how many cycles are allowed per period. The number of periods is derived from the length of the period, which can also be changed; the default is 0.1 seconds. At the end of each period, the available cycles are refilled to `time_quota_per_schedule_period`.

With a longer period, you're likely to be throttled more often, since you'll have to wait longer for a refresh; once you've used your allowance for that period, you're blocked. With a shorter period, you're refilled more often, so your total available usage is higher. Both scenarios affect your percentage CPU usage, but the number of elapsed periods is a proxy for both of these cases. If you wanted to know about throttling, as opposed to just CPU usage, you might want a separate rule for that stat.
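The formula above can be sketched in a few lines of TypeScript. This is a minimal illustration with hypothetical field names and sample values, not the rule's actual code; units follow the text (usage counter in nanoseconds, CFS quota in microseconds):

```typescript
// Minimal sketch of the containerized CPU usage calculation described above.
// Field names and sample values are hypothetical.

interface CgroupSample {
  usageNanos: number;  // cumulative execution time counter (nanoseconds)
  periods: number;     // cumulative count of elapsed scheduler periods
  quotaMicros: number; // CFS quota per period (microseconds)
}

function cpuUsagePercent(first: CgroupSample, last: CgroupSample): number | null {
  // The delta is only meaningful if the quota stayed constant over the window.
  if (first.quotaMicros !== last.quotaMicros) return null;

  const deltaUsageNanos = last.usageNanos - first.usageNanos;
  const deltaPeriods = last.periods - first.periods;
  if (deltaPeriods <= 0) return null;

  // execution_time / (time_quota_per_schedule_period * number_of_periods)
  const quotaNanos = first.quotaMicros * 1000; // microseconds -> nanoseconds
  return (deltaUsageNanos / (quotaNanos * deltaPeriods)) * 100;
}

// Example: a 1-CPU limit (quota of 100000us per 0.1s period), 600 periods
// elapsed in a minute, 30s of execution time consumed.
const usage = cpuUsagePercent(
  { usageNanos: 0, periods: 0, quotaMicros: 100000 },
  { usageNanos: 30000000000, periods: 600, quotaMicros: 100000 },
);
// usage === 50
```

Returning `null` when the quota differs between the two samples mirrors the "limits changed" case described in the summary, where the delta cannot be trusted.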
In short, 100% CPU usage means you're being throttled to some degree. The number of periods is a safe proxy for the details of period length, since the period length only affects the rate at which quota is refreshed.

These fields are counters, so for any given time range we take the biggest (latest) value and subtract the lowest (earliest) value to get the delta, then plug those deltas into the formula above to get the factor (multiplied by 100 to make it a percentage). The code also does some unit conversion, because the quota is in microseconds while the usage is in nanoseconds.

# How to test

There are 3 main states to test:

1. No limit set, but Kibana configured to use container stats.
2. Limit changed during the lookback period (to/from a real value, to/from no limit).
3. Limit set and CPU usage crossing the threshold and then falling back down to recovery.

**Note: Please also test the non-container use case for this rule to ensure that didn't get broken during this refactor.**

**1. Start Elasticsearch in a container without setting the CPU limits:**

```
docker network create elastic
docker run --name es01 --net elastic -p 9201:9200 -e xpack.license.self_generated.type=trial -it docker.elastic.co/elasticsearch/elasticsearch:master-SNAPSHOT
```

(We're using `master-SNAPSHOT` to include a recent fix to reporting for cgroup v2.)

Make note of the generated password for the `elastic` user.

**2. Start another Elasticsearch instance to act as the monitoring cluster.**

**3. Configure Kibana to connect to the monitoring cluster and start it.**

**4. Configure Metricbeat to collect metrics from the Docker cluster and ship them to the monitoring cluster, then start it.**

Execute the command below next to the Metricbeat binary to grab the CA certificate from the Elasticsearch cluster:

```
docker cp es01:/usr/share/elasticsearch/config/certs/http_ca.crt .
```

Use the `elastic` password and the CA certificate to configure the `elasticsearch` module:

```
- module: elasticsearch
  xpack.enabled: true
  period: 10s
  hosts:
    - "https://localhost:9201"
  username: "elastic"
  password: "PASSWORD"
  ssl.certificate_authorities: "PATH_TO_CERT/http_ca.crt"
```

**5. Configure an alert in Kibana with a chosen threshold.**

OBSERVE: An alert fires to inform you that there looks to be a misconfiguration, together with the current value of the fallback metric (a warning if the fallback metric is below the threshold, a danger alert if it is above).

**6. Set a limit.**

First stop ES using `docker stop es01`, then set the limit using `docker update --cpus=1 es01` and start it again using `docker start es01`. After a brief delay you should see the alert change to a warning that the limits changed during the alert lookback period and that the CPU usage could not be confidently calculated. Wait for the change event to pass out of the lookback window.

**7. Generate load on the monitored cluster.**

[Slingshot](https://github.com/elastic/slingshot) is an option. After you clone it, you need to update the `package.json` to match [this change](https://github.com/elastic/slingshot/blob/8bfa8351deb0d89859548ee5241e34d0920927e5/package.json#L45-L46) before running `npm install`.

Then you can modify the value for `elasticsearch` in the `configs/hosts.json` file like this:

```
"elasticsearch": {
  "node": "https://localhost:9201",
  "auth": {
    "username": "elastic",
    "password": "PASSWORD"
  },
  "ssl": {
    "ca": "PATH_TO_CERT/http_ca.crt",
    "rejectUnauthorized": false
  }
}
```

Then you can start one or more instances of Slingshot like this:

`npx ts-node bin/slingshot load --config configs/hosts.json`

**8. Observe the alert firing in the logs.**

Assuming you're using a connector for server log output, you should see a message like the one below once the threshold is breached:

```
[2023-06-13T13:05:50.036+02:00][INFO ][plugins.actions.server-log] Server log: CPU usage alert is firing for node e76ce10526e2 in cluster: docker-cluster. [View node](/app/monitoring#/elasticsearch/nodes/OyDWTz1PS-aEwjqcPN2vNQ?_g=(cluster_uuid:kasJK8VyTG6xNZ2PFPAtYg))
```

The alert should also be visible in the Stack Monitoring UI overview page. At this point you can stop Slingshot and confirm that the alert recovers once CPU usage falls back below the threshold.

**9. Stop the load and confirm that the rule recovers.**

# A second opinion

I made a little dashboard to replicate what the graph in Stack Monitoring and the rule **_should_** see: [cpu_usage_dashboard.ndjson.zip](https://github.com/elastic/kibana/files/11728315/cpu_usage_dashboard.ndjson.zip)

If you want to play with the data, I've collected an `es_archive` which you can load like this:

`node scripts/es_archiver load PATH_TO_ARCHIVE/containerized_cpu_load --es-url http://elastic:changeme@localhost:9200 --kibana-url http://elastic:changeme@localhost:5601/__UNSAFE_bypassBasePath`

[containerized_cpu_load.zip](https://github.com/elastic/kibana/files/11754646/containerized_cpu_load.zip)

These are the timestamps to view the data:

- Start: Jun 13, 2023 @ 11:40:00.000
- End: Jun 13, 2023 @ 12:40:00.000
- CPU average: 52.76%

---------

Co-authored-by: kibanamachine <[email protected]>