[Stack Monitoring] CPU usage rule should handle usage limit changes #160905
Pinging @elastic/infra-monitoring-ui (Team:Infra Monitoring UI)
Thinking about it, my vote would be for option 2. Especially if we can push a majority of the work into Elasticsearch, that path has benefits since it's only one query, and the scaling issue can be addressed by having multiple instances of the rule with varying filters, leaving the overall tracing flow simpler.
Fixes #116128

# Summary

This PR changes how the CPU Usage Rule calculates the usage percentage for containerized clusters.

Based on the comment [here](#116128 (comment)), my understanding of the issue was that because we were using a `date_histogram` to grab the values, we could sometimes run into issues with how `date_histogram` rounds the time range and aligns it towards the start rather than the end, causing the last bucket to be incomplete. This is aggravated by the fact that we make the fixed duration of the histogram the size of the lookback window.

I took a slightly different path for the rewrite: rather than using the derivative, I just look at the usage across the whole range using a simple delta. This has a glaring flaw in that it cannot account for the limits changing within the lookback window (going higher/lower or set/unset), which we will have to try to address in #160905. The changes in this PR should make the situation better in the other cases, and it makes it clear when the limits have changed by firing alerts. #160897 outlines follow-up work to align how the CPU usage is presented in other places in the UI.

# Screenshots

**Above threshold:**
<img width="1331" alt="above-threshold" src="https://github.com/elastic/kibana/assets/2564140/4dc4dc2a-a858-4022-8407-8179ec3115df">

**Failed to compute usage:**
<img width="1324" alt="failed-to-compute" src="https://github.com/elastic/kibana/assets/2564140/88cb3794-6466-4881-acea-002a4f81c34e">

**Limits changed:**
<img width="2082" alt="limits-changed" src="https://github.com/elastic/kibana/assets/2564140/d0526421-9362-4695-ab00-af69aa9838c9">

**Limits missing:**
<img width="1743" alt="missing-resource-limits" src="https://github.com/elastic/kibana/assets/2564140/82626968-8b18-453d-9cf8-8a6776a6a46e">

**Unexpected limits:**
<img width="1637" alt="unexpected-resource-limits" src="https://github.com/elastic/kibana/assets/2564140/721deb15-d75b-4915-8f77-b18d0b33da7d">

# CPU usage for the Completely Fair Scheduler (CFS) for Control Groups (cgroup)

CPU usage for containers is calculated with this formula:

`execution_time / (time_quota_per_schedule_period * number_of_periods)`

Execution time is a counter of how many cycles the container was allowed to execute by the scheduler; the quota is the limit of how many cycles are allowed per period. The number of periods is derived from the length of the period, which can also be changed; the default is 0.1 seconds. At the end of each period, the available cycles are refilled to `time_quota_per_schedule_period`.

With a longer period, you're likely to be throttled more often since you'll have to wait longer for a refresh, so once you've used your allowance for that period you're blocked. With a shorter period you're refilled more often, so your total available usage is higher. Both scenarios affect your percentage CPU usage, but the number of elapsed periods is a proxy for both of these cases. If you wanted to know about throttling rather than only CPU usage, you might want a separate rule for that stat. In short, 100% CPU usage means you're being throttled to some degree. The number of periods is a safe proxy for the details of period length, as the period length only affects the rate at which quota is refreshed.

These fields are counters, so for any given time range we grab the biggest value (the latest) and subtract the lowest value (the earliest) to get the delta, then plug those delta values into the formula above to get the factor (and multiply by 100 to make it a percentage). The code also does some unit conversion because the quota is in microseconds while the usage is in nanoseconds.
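For reference, here is a minimal TypeScript sketch of the delta calculation described above. It is illustrative only, not the rule's actual code; the counter names and the example numbers are assumptions based on the cgroup fields exposed in node stats.

```typescript
// Minimal sketch of the delta-based containerized CPU usage calculation.
// Names and shapes are illustrative, not the rule's actual implementation.

interface CgroupCounters {
  usageNanos: number; // e.g. cpuacct.usage_nanos (counter, nanoseconds)
  elapsedPeriods: number; // e.g. cpu.stat.number_of_elapsed_periods (counter)
}

interface LookbackWindow {
  earliest: CgroupCounters; // smallest (oldest) counter values in the window
  latest: CgroupCounters; // largest (newest) counter values in the window
  quotaMicros: number; // e.g. cpu.cfs_quota_micros; -1 means "no limit"
}

function containerCpuUsagePercent({ earliest, latest, quotaMicros }: LookbackWindow): number | undefined {
  if (quotaMicros <= 0) {
    // No limit set: this calculation is not meaningful, fall back to the
    // non-containerized CPU percentage instead.
    return undefined;
  }

  // The fields are counters, so the value for the window is max - min.
  const usageDeltaNanos = latest.usageNanos - earliest.usageNanos;
  const periodsDelta = latest.elapsedPeriods - earliest.elapsedPeriods;
  if (periodsDelta <= 0) {
    return undefined; // not enough samples in the window to form a delta
  }

  // Quota is in microseconds while usage is in nanoseconds, hence the * 1000.
  const allowedNanos = quotaMicros * 1000 * periodsDelta;
  return (usageDeltaNanos / allowedNanos) * 100;
}

// Example: a 1-CPU quota (100_000 µs per 0.1 s period), 600 elapsed periods
// (one minute) and ~31.66 s of execution time => roughly 52.8% usage.
console.log(
  containerCpuUsagePercent({
    earliest: { usageNanos: 0, elapsedPeriods: 0 },
    latest: { usageNanos: 31.66e9, elapsedPeriods: 600 },
    quotaMicros: 100_000,
  })
);
```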
# How to test

There are 3 main states to test:
- No limit set but Kibana configured to use container stats.
- Limit changed during the lookback period (to/from a real value, to/from no limit).
- Limit set and CPU usage crossing the threshold and then falling back down to recovery.

**Note: Please also test the non-container use case for this rule to ensure that didn't get broken during this refactor.**

**1. Start Elasticsearch in a container without setting the CPU limits:**
```
docker network create elastic
docker run --name es01 --net elastic -p 9201:9200 -e xpack.license.self_generated.type=trial -it docker.elastic.co/elasticsearch/elasticsearch:master-SNAPSHOT
```
(We're using `master-SNAPSHOT` to include a recent fix to reporting for cgroup v2.)

Make note of the generated password for the `elastic` user.

**2. Start another Elasticsearch instance to act as the monitoring cluster.**

**3. Configure Kibana to connect to the monitoring cluster and start it.**

**4. Configure Metricbeat to collect metrics from the Docker cluster and ship them to the monitoring cluster, then start it.**

Execute the command below next to the Metricbeat binary to grab the CA certificate from the Elasticsearch cluster:
```
docker cp es01:/usr/share/elasticsearch/config/certs/http_ca.crt .
```
Use the `elastic` password and the CA certificate to configure the `elasticsearch` module:
```
- module: elasticsearch
  xpack.enabled: true
  period: 10s
  hosts:
    - "https://localhost:9201"
  username: "elastic"
  password: "PASSWORD"
  ssl.certificate_authorities: "PATH_TO_CERT/http_ca.crt"
```

**5. Configure an alert in Kibana with a chosen threshold.**

OBSERVE: An alert fires to inform you that there looks to be a misconfiguration, together with the current value for the fallback metric (warning if the fallback metric is below the threshold, danger if it is above).

**6. Set a limit.**

First stop ES using `docker stop es01`, then set the limit using `docker update --cpus=1 es01` and start it again using `docker start es01`.

After a brief delay you should now see the alert change to a warning about the limits having changed during the alert lookback period, stating that the CPU usage could not be confidently calculated. Wait for the change event to pass out of the lookback window.

**7. Generate load on the monitored cluster.**

[Slingshot](https://github.com/elastic/slingshot) is an option. After you clone it, you need to update the `package.json` to match [this change](https://github.com/elastic/slingshot/blob/8bfa8351deb0d89859548ee5241e34d0920927e5/package.json#L45-L46) before running `npm install`.

Then you can modify the value for `elasticsearch` in the `configs/hosts.json` file like this:
```
"elasticsearch": {
  "node": "https://localhost:9201",
  "auth": {
    "username": "elastic",
    "password": "PASSWORD"
  },
  "ssl": {
    "ca": "PATH_TO_CERT/http_ca.crt",
    "rejectUnauthorized": false
  }
}
```
Then you can start one or more instances of Slingshot like this: `npx ts-node bin/slingshot load --config configs/hosts.json`

**8. Observe the alert firing in the logs.**

Assuming you're using a connector for server log output, you should see a message like the one below once the threshold is breached:
```
[2023-06-13T13:05:50.036+02:00][INFO ][plugins.actions.server-log] Server log: CPU usage alert is firing for node e76ce10526e2 in cluster: docker-cluster. [View node](/app/monitoring#/elasticsearch/nodes/OyDWTz1PS-aEwjqcPN2vNQ?_g=(cluster_uuid:kasJK8VyTG6xNZ2PFPAtYg))
```
The alert should also be visible in the Stack Monitoring UI overview page.

At this point you can stop Slingshot and confirm that the alert recovers once CPU usage goes back down below the threshold.

**9. Stop the load and confirm that the rule recovers.**

# A second opinion

I made a little dashboard to replicate what the graph in SM and the rule **_should_** see: [cpu_usage_dashboard.ndjson.zip](https://github.com/elastic/kibana/files/11728315/cpu_usage_dashboard.ndjson.zip)

If you want to play with the data, I've collected an `es_archive` which you can load like this:

`node scripts/es_archiver load PATH_TO_ARCHIVE/containerized_cpu_load --es-url http://elastic:changeme@localhost:9200 --kibana-url http://elastic:changeme@localhost:5601/__UNSAFE_bypassBasePath`

[containerized_cpu_load.zip](https://github.com/elastic/kibana/files/11754646/containerized_cpu_load.zip)

These are the timestamps to view the data:
- Start: Jun 13, 2023 @ 11:40:00.000
- End: Jun 13, 2023 @ 12:40:00.000
- CPU average: 52.76%

---------

Co-authored-by: kibanamachine <[email protected]>
@bck01215 Can you explain more about your setup? That error means that Kibana is configured with
We did not have containers. This error came from updating from 8.4 to 8.10. It seemed that deleting and recreating the rule fixed it.
Interesting, perhaps there is/was something stored in the rule state that would affect the flow. Anyway, I'm glad it was solved by re-creating the rule; don't hesitate to reach out again if any issues come up!
Getting the same alert as @bck01215 triggered after the last few upgrades. Currently running 8.10.2 on both Elasticsearch and Kibana.
In my case, removing and re-adding the rule did not resolve the issue. I'm not running any containers; however, I noticed that the systemd service mentions "CGroup". I've not touched any cgroup limits. It's installed "out-of-the-box" via apt on a fully patched Ubuntu 20.04 system. It might be a "false positive" depending on how the "containerization detection" is done. Operating System: Ubuntu 20.04.6 LTS
Let me know if you need any further information.
Hey @msafdal, as you correctly mentioned, it does depend on how the containerization detection is done. We noticed this from another report and we are updating the way this flow is detected in this PR. Regarding the limits, the default values are unset or infinity, which is equivalent to not having them set.
Hijacking the thread to ask what "Kibana is configured for non-containerized workloads" means. I'm running the stack on Docker Swarm and started receiving the same alert after 8.10.2. Couldn't find anything regarding "telling Kibana it's running in a container". What's my fix, given the alert is right and I haven't configured Kibana properly? EDIT: I set I'm guessing the rule is somewhat inconsistent even for truly containerized stacks now.
@k4z4n0v4 This is a miss on our part; we didn't consider the case where someone is running in a container/cgroup without limits on purpose. We have a fix coming out in the next patch for this, but in the meantime you could work around this by setting the limit on your containers to 100% of your available CPU.
This triggers on all our nodes since the update to 8.10.2. Our nodes are not containerized, and we have no limits configured afaik (although the alerts say we do).
@willemdh If the alert is reporting that you have limits specified then that is because that's what Elasticsearch is reporting,
Just upgraded my monitoring cluster to 8.10.2 and got the same alert for all of my 20 nodes. I do not use containers, I run on normal VMs, and I'm not sure what I should do to fix this. Added the following line into
But now the alert is the inverse for all my nodes. I'm using the So, it seems that there is no workaround for this; the solution is to disable the rule and wait for the fix on #167244
@leandrojmp Is it not possible to define the limit on your cgroup to 100% of your CPU (which is the same as not having the limit, but it'll make the rule happy)? Either way, it seems odd that you're getting both sides of the issue. Either you have the cgroup metrics being reported or not; I'm not sure what's going on there. If you hit
Hello @miltonhultgren,
I didn't make any changes to cgroups or apply any limits; I'm running the default rpm package distribution. I just installed the package, configured Elasticsearch, and started the service. This is how systemd works, it uses cgroups. This is the return of
So everything is default; this probably affects anyone that runs Elasticsearch using the
Yeah, if I do not set I upgraded just the monitoring cluster to 8.10.2; the production cluster and Metricbeat are still on 8.8.1. Not sure if this may have an impact or not, but an upgrade to 8.10.2 in the production cluster is planned for this week.
This happens for all nodes, and this is the
Got it, thanks for the insight @leandrojmp! This change had a bigger effect than we anticipated (the flag being named "container" is misleading) since it affects all cgroup runtimes. As you mentioned, this is the default for some setups, which we didn't expect (a miss on our part). Thanks for sharing the results of the stats endpoint, I see the issue now. When Kibana is configured for non-container (non-cgroup*) workloads it used to check if the metric values are Apologies again for all the noise this is causing!
So when 8.11 drops I would need to upgrade just the monitoring cluster to not get the alerts anymore, right? Because we upgrade our production cluster every quarter, and we will upgrade to 8.10.2 this week, the next upgrade will be just next quarter.
@tonyghiani Did we backport this to 8.10.X or only 8.11.X? Let's make sure this comes out with the next patch release for 8.10! @leandrojmp The alerting system only runs in your monitoring cluster's Kibana, so upgrading that will be enough!
@miltonhultgren Thanks for all the work on this.
@miltonhultgren apologies for the delay, I completely missed your mention here.
This #170740 should backport the fix to 8.10.
I closed the above PR since it won't be released with new patches for 8.10.x, so it'll be available starting from 8.11.0 |
Reverts #159351
Reverts #167244

Due to the many unexpected issues that these changes introduced, we've decided to revert these changes until we have better solutions for the problems we've learnt about.

Problems:
- Gaps in data cause alerts to fire (see next point)
- Normal CPU rescaling causes alerts to fire #160905
- Any error fires an alert (since there is no other way to inform the user about the problems faced by the rule executor)
- Many assumptions about cgroups only being for container users are wrong

To address some of these issues we also need more functionality in the alerting framework to be able to register secondary actions, so that we may trigger non-oncall workflows for when a rule faces issues with evaluating the stats.

Original issue #116128
# Backport

This will backport the following commits from `main` to `8.12`:
- [[monitoring] Revert CPU Usage rule changes (#172913)](#172913)

### Questions?
Please refer to the [Backport tool documentation](https://github.com/sqren/backport)

Co-authored-by: Milton Hultgren <[email protected]>
Following up on #159351
The CPU usage rule as it looks today is not able to accurately calculate the CPU usage in the case where a resource usage limit has changed within the rule's lookback window (either a limit has been added or removed, or the set limit was changed to be higher or lower).
The current rule simply alerts when it detects this change, but we would ideally extend the rule to be able to handle this case.
This means that the rule needs to be able to easily swap between the containerized and non-containerized calculation for the same node.
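For illustration, a small TypeScript sketch of what that swap could look like; the stat names and shapes below are assumptions, not the rule's actual types.

```typescript
// Illustrative sketch only: choose the calculation based on whether a cgroup
// CPU quota was in effect for the node during the window.

interface NodeCpuWindow {
  cpuPercent: number; // plain OS-level CPU percent (non-containerized path)
  cgroup?: {
    quotaMicros: number; // cfs quota; -1 or unset means "no limit"
    usageDeltaNanos: number; // delta of the usage counter over the window
    periodsDelta: number; // delta of the elapsed-periods counter over the window
  };
}

function cpuUsageForWindow(stats: NodeCpuWindow): number {
  const cgroup = stats.cgroup;
  if (!cgroup || cgroup.quotaMicros <= 0 || cgroup.periodsDelta <= 0) {
    // No effective limit (or no counter delta to work with):
    // report the plain CPU percentage.
    return stats.cpuPercent;
  }
  // Limit in effect: report usage relative to the allowed quota.
  const allowedNanos = cgroup.quotaMicros * 1000 * cgroup.periodsDelta;
  return (cgroup.usageDeltaNanos / allowedNanos) * 100;
}
```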
Handling the change is non-trivial but here are 3 options we can think of right now:
1. Split the lookback window into two or more spans when a change is detected
The rule already detects the change and could respond to this situation by determining the time ranges that apply for each setting (there could be many), making a follow-up query per time range, calculating the usage in each time range (using the appropriate calculation), and then taking the average of those. This could be costly in processing time within the rule if there are more than two spans.
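A rough sketch of option 1 under assumed shapes (the change detection and the per-span query are left abstract; none of this is existing rule code):

```typescript
// Option 1 sketch: split the lookback window at each detected limit change,
// query the usage per sub-window with the calculation that applies there,
// then average the results. All shapes here are assumptions.

interface Span {
  from: number; // epoch millis
  to: number; // epoch millis
  quotaMicros: number | null; // limit in effect during this span (null = no limit)
}

interface LimitChange {
  at: number; // epoch millis, assumed sorted ascending
  quotaMicros: number | null; // limit in effect after this change
}

function splitLookback(
  from: number,
  to: number,
  initialQuotaMicros: number | null,
  changes: LimitChange[]
): Span[] {
  const spans: Span[] = [];
  let start = from;
  let quota = initialQuotaMicros;
  for (const change of changes) {
    if (change.at >= to) break; // change happens after the window ends
    if (change.at > start) {
      spans.push({ from: start, to: change.at, quotaMicros: quota });
      start = change.at;
    }
    quota = change.quotaMicros; // a change at/before `start` only updates the limit
  }
  spans.push({ from: start, to, quotaMicros: quota });
  return spans;
}

// One follow-up query per span, each using the calculation appropriate for
// that span's limit; the rule then alerts on the average of the results.
async function averageUsage(
  spans: Span[],
  usageForSpan: (span: Span) => Promise<number>
): Promise<number> {
  const usages = await Promise.all(spans.map(usageForSpan));
  return usages.reduce((sum, usage) => sum + usage, 0) / usages.length;
}
```

Note that a plain average weights short and long spans equally; a duration-weighted average may be closer to the intended semantics.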
2. Use a date histogram to always get smaller time spans
This offers a few sub-options; for example, we could drop the exact buckets where the change happened, but that requires having enough buckets that dropping a few would not greatly affect the average (see the aggregation sketch below).
Then for each remaining bucket we apply the appropriate calculation and take the average of the buckets.
It's possible this could be done in part by Elasticsearch but most likely it will have to be done in Kibana.
This path exposes us to scalability risks by asking Elasticsearch to do more work, potentially hitting the bucket limit and timing out the rule execution due to more processing being done.
The current rule scales per cluster per node, which can partially be worked around by creating multiple instances of the rule where we filter for a specific cluster for example.
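As a rough illustration of option 2 (field names are assumed from the monitoring node stats mapping, and the interval is arbitrary), the aggregation could look something like the request body below, with the per-bucket evaluation happening in Kibana:

```typescript
// Rough sketch of option 2: a date_histogram with per-bucket counter min/max,
// so each bucket can be evaluated with the calculation that applies to it and
// buckets where the quota changed can be dropped. Field names are assumptions.

const optionTwoAggs = {
  nodes: {
    terms: { field: 'node_stats.node_id', size: 100 },
    aggs: {
      over_time: {
        date_histogram: { field: 'timestamp', fixed_interval: '30s' },
        aggs: {
          usage_min: { min: { field: 'node_stats.os.cgroup.cpuacct.usage_nanos' } },
          usage_max: { max: { field: 'node_stats.os.cgroup.cpuacct.usage_nanos' } },
          periods_min: { min: { field: 'node_stats.os.cgroup.cpu.stat.number_of_elapsed_periods' } },
          periods_max: { max: { field: 'node_stats.os.cgroup.cpu.stat.number_of_elapsed_periods' } },
          quota_min: { min: { field: 'node_stats.os.cgroup.cpu.cfs_quota_micros' } },
          quota_max: { max: { field: 'node_stats.os.cgroup.cpu.cfs_quota_micros' } },
        },
      },
    },
  },
};

// In Kibana: for each bucket, skip it if quota_min !== quota_max (the limit
// changed inside the bucket), otherwise compute the bucket's usage from the
// counter deltas and average the per-bucket results.
```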
3. The long shot: Use Elasticsearch transforms to create data that is easy to alert on
Underlying the problems the rule faces is a data format that is not easy to alert on.
We could try to leverage a Transform to change the data into something that is easier to say yes/no for.
The transform would do the work outlined in option 2 (roughly) and put the result into a document which the rule can consume, leaving the rule quick to execute since the hard work is amortized by ES.
This is somewhat uncharted territory: we don't know if a transform can keep up in speed so that the rule doesn't lag, it introduces more complexity in the setup, and there is currently no way to install transforms as part of the alerting framework, so the SM plugin would have to own setting up and cleaning up such a transform and making sure the right permissions are available.
Further, there are some doubts about the scalability of Transforms as well, especially for non-aggregated data (see the sketch below for what such a transform could look like).
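Purely as a sketch of what option 3 might involve (none of this exists today; the destination index, intervals, index pattern, and field names are all assumptions), a continuous pivot transform could pre-bucket the counters per node so the rule only has to read small summary documents:

```typescript
// Hypothetical transform body for option 3. Everything here is an assumption.
const cpuUsageSummaryTransform = {
  source: { index: ['.monitoring-es-*'] },
  dest: { index: 'monitoring-cpu-usage-summary' },
  frequency: '1m',
  sync: { time: { field: 'timestamp', delay: '60s' } },
  pivot: {
    group_by: {
      node_id: { terms: { field: 'node_stats.node_id' } },
      bucket: { date_histogram: { field: 'timestamp', fixed_interval: '1m' } },
    },
    aggregations: {
      // Per-bucket counter extremes; the usage percentage (and whether the
      // quota changed inside the bucket) can then be derived cheaply by the
      // rule from these few values.
      usage_min: { min: { field: 'node_stats.os.cgroup.cpuacct.usage_nanos' } },
      usage_max: { max: { field: 'node_stats.os.cgroup.cpuacct.usage_nanos' } },
      periods_min: { min: { field: 'node_stats.os.cgroup.cpu.stat.number_of_elapsed_periods' } },
      periods_max: { max: { field: 'node_stats.os.cgroup.cpu.stat.number_of_elapsed_periods' } },
      quota_min: { min: { field: 'node_stats.os.cgroup.cpu.cfs_quota_micros' } },
      quota_max: { max: { field: 'node_stats.os.cgroup.cpu.cfs_quota_micros' } },
    },
  },
};
```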
AC