
[Stack Monitoring] CPU Usage duration for alerting is not correctly used in the elastic query #116128

Open
zazabe opened this issue Oct 25, 2021 · 7 comments · Fixed by #159351
Labels: bug (Fixes for quality problems that affect the customer experience) · Feature:Stack Monitoring · Team:Monitoring (Stack Monitoring team)

Comments

zazabe commented Oct 25, 2021

Kibana version: 7.12.0 / 7.15.1

Elasticsearch version: 7.12.0 / 7.15.1

Server OS version: Elastic Cloud on AWS

Original install method (e.g. download page, yum, from source, etc.): Elastic Cloud on AWS

Describe the bug:

We customized the CPU Usage rule's threshold and period to 50% over 1 hour, but the alert is still triggered when CPU usage is above 50% for less than 1 hour.

Expected behavior:

The CPU Usage alert should trigger only if the CPU usage is above 50% on average for 1 hour.

In depth:

I executed the query produced by `fetch_cpu_usage_node_stats` manually: whether I use a period of 5 minutes or 1 hour, the result is still 58% CPU usage.

The query should be redesigned, because the CPU alert does not correctly calculate the CPU usage for different periods based on derivative values when `monitoring.ui.container.elasticsearch.enabled = true`.

The root cause is the `date_histogram` aggregation. For example:

  • a CPU alert triggering if the usage is >50% for 1 hour is executed at 2021-10-26 08:00:10.000
  • there are 2 items in the histogram: 2021-10-26 07:00:00.000 and 2021-10-26 08:00:00.000
  • only the last item is used by the alert algorithm, so it covers only 08:00:00 to 08:00:10
    In this case, if the CPU usage is above the threshold for just a few seconds around 08:00, the alert will trigger.

Another problem with this query: it does not return derivative values if the time range is aligned to a rounded second, probably because there is then only one item in the `date_histogram` aggregation.
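
For illustration, a simplified sketch of the aggregation shape involved (not the exact query built by `fetch_cpu_usage_node_stats`; the field name is one of the cgroup counters from the monitoring documents): with the `fixed_interval` equal to the 1 hour lookback, the buckets are aligned to the top of the hour, so the last bucket may cover only a few seconds, and if the range collapses into a single bucket the `derivative` has nothing to compare against.

```
"histo": {
  "date_histogram": {
    "field": "timestamp",
    "fixed_interval": "1h"
  },
  "aggs": {
    "max_cpuacct_usage_nanos": {
      "max": { "field": "node_stats.os.cgroup.cpuacct.usage_nanos" }
    },
    "delta_cpuacct_usage_nanos": {
      "derivative": { "buckets_path": "max_cpuacct_usage_nanos" }
    }
  }
}
```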

zazabe added the bug label Oct 25, 2021
botelastic bot added the needs-team label Oct 25, 2021
zazabe (Author) commented Oct 26, 2021

@chrisronline fyi

Tested with Kibana 7.15.1 / Elasticsearch 7.15.1, same problem.

chrisronline added the Team:Infra Monitoring UI label and removed the needs-team label Oct 27, 2021
elasticmachine (Contributor) commented

Pinging @elastic/infra-monitoring-ui (Team:Infra Monitoring UI)

chrisronline (Contributor) commented

Tagging the proper team here; I'm no longer working on this.

weltenwort added the Feature:Metrics UI label Oct 28, 2021
simianhacker (Member) commented

@zazabe Thanks for filing this issue and detailing the derivative issues. Are you running in a containerized environment?

zazabe (Author) commented Nov 1, 2021

Are you running in a containerized environment?

Yes, we are using Elastic Cloud on AWS.


BTW, we worked around the problem with a watcher; it might be useful as a reference for a fix:

CPU usage watcher
{
  "trigger": {
    "schedule": {
      "interval": "5m"
    }
  },
  "input": {
    "search": {
      "request": {
        "search_type": "query_then_fetch",
        "indices": [
          ".monitoring-es-*"
        ],
        "rest_total_hits_as_int": true,
        "body": {
          "aggs": {
            "nodes": {
              "aggs": {
                "cpu_usage": {
                  "avg_bucket": {
                    "buckets_path": "histo>container_cpu_usage",
                    "gap_policy": "skip"
                  }
                },
                "histo": {
                  "aggs": {
                    "avg_cfs_quota_micros": {
                      "avg": {
                        "field": "node_stats.os.cgroup.cpu.cfs_quota_micros"
                      }
                    },
                    "container_cpu_usage": {
                      "bucket_script": {
                        "buckets_path": {
                          "avg_cfs_quota_micros": "avg_cfs_quota_micros.value",
                          "delta_cpuacct_usage_nanos": "delta_cpuacct_usage_nanos.normalized_value",
                          "delta_number_of_elapsed_periods": "delta_number_of_elapsed_periods.normalized_value"
                        },
                        "script": "(params.delta_cpuacct_usage_nanos / ( params.delta_number_of_elapsed_periods * params.avg_cfs_quota_micros * 1000))"
                      }
                    },
                    "delta_cpuacct_usage_nanos": {
                      "derivative": {
                        "buckets_path": "max_cpuacct_usage_nanos",
                        "gap_policy": "skip",
                        "unit": "1s"
                      }
                    },
                    "delta_number_of_elapsed_periods": {
                      "derivative": {
                        "buckets_path": "max_number_of_elapsed_periods",
                        "gap_policy": "skip",
                        "unit": "1s"
                      }
                    },
                    "max_cpuacct_usage_nanos": {
                      "max": {
                        "field": "node_stats.os.cgroup.cpuacct.usage_nanos"
                      }
                    },
                    "max_number_of_elapsed_periods": {
                      "max": {
                        "field": "node_stats.os.cgroup.cpu.stat.number_of_elapsed_periods"
                      }
                    }
                  },
                  "date_histogram": {
                    "field": "timestamp",
                    "fixed_interval": "1m"
                  }
                }
              },
              "terms": {
                "field": "source_node.name"
              }
            }
          },
          "query": {
            "bool": {
              "filter": [
                {
                  "range": {
                    "timestamp": {
                      "gte": "{{ctx.trigger.scheduled_time}}||-{{ctx.metadata.period}}"
                    }
                  }
                },
                {
                  "term": {
                    "type": "node_stats"
                  }
                }
              ]
            }
          },
          "size": 0
        }
      }
    }
  },
  "condition": {
    "script": {
      "source": "return ctx.payload.aggregations.nodes.buckets.stream().anyMatch(it -> it.cpu_usage.value > ctx.metadata.cpu_usage_max);",
      "lang": "painless"
    }
  },
  "actions": {
    "slack_1": {
      "slack": {
        "message": {
          "from": "kibana-production",
          "to": [
            "#kibana-production"
          ],
          "text": """CPU usage above {{ctx.payload.cpu_usage_max_percent}}% during the last {{ctx.metadata.period}} for node(s): 
{{#ctx.payload.nodes}}- {{node_name}}: {{cpu_usage_percent}}%
{{/ctx.payload.nodes}}"""
        }
      }
    }
  },
  "metadata": {
    "period": "60m",
    "cpu_usage_max": 0.5
  },
  "transform": {
    "script": {
      "source": "['cpu_usage_max_percent': ctx.metadata.cpu_usage_max * 100, 'nodes': ctx.payload.aggregations.nodes.buckets.stream().filter(it -> it.cpu_usage.value > ctx.metadata.cpu_usage_max).map(it -> ['node_name':it.key,'cpu_usage_percent':Math.round(it.cpu_usage.value * 100)]).collect(Collectors.toList())];",
      "lang": "painless"
    }
  }
}

simianhacker (Member) commented May 4, 2022

One problem with this rule is that the date_histogram interval is the same size as the time range of the query. This gives inconsistent results due to how the date_histogram bucketing works and when the query runs: sometimes you get enough buckets for the derivatives and other times you don't. The other issue is that the date_histogram will create buckets at the end of the time range that go past the available data, so the last bucket is not a complete representation of the whole minute.

A better way to write this would be to query for 2x the interval and then use 2 filter aggregations with a range filter for the previous bucket and the current bucket; then calculate the derivative manually. For example, if the user asks for the last 5 minutes, query the last 10 minutes.
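
As a rough sketch of that idea (not the actual rule query; the field names are reused from the watcher above), a 5-minute interval would query the last 10 minutes and split them into a previous and a current bucket, with the delta then computed from the two `max` values outside of Elasticsearch:

```
"nodes": {
  "terms": { "field": "source_node.name" },
  "aggs": {
    "previous": {
      "filter": { "range": { "timestamp": { "gte": "now-10m", "lt": "now-5m" } } },
      "aggs": {
        "max_usage_nanos": { "max": { "field": "node_stats.os.cgroup.cpuacct.usage_nanos" } },
        "max_periods": { "max": { "field": "node_stats.os.cgroup.cpu.stat.number_of_elapsed_periods" } }
      }
    },
    "current": {
      "filter": { "range": { "timestamp": { "gte": "now-5m" } } },
      "aggs": {
        "max_usage_nanos": { "max": { "field": "node_stats.os.cgroup.cpuacct.usage_nanos" } },
        "max_periods": { "max": { "field": "node_stats.os.cgroup.cpu.stat.number_of_elapsed_periods" } }
      }
    }
  }
}
```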

BTW... this only applies when `monitoring.ui.container.elasticsearch.enabled: true` is set.

klacabane changed the title from "[Monitoring] CPU Usage duration for alerting is not correctly used in the elastic query" to "[Stack Monitoring] CPU Usage duration for alerting is not correctly used in the elastic query" Jul 25, 2022
miltonhultgren self-assigned this Jun 5, 2023
miltonhultgren added a commit to miltonhultgren/kibana that referenced this issue Jun 8, 2023
miltonhultgren added a commit that referenced this issue Jul 6, 2023
Fixes #116128

# Summary

This PR changes how the CPU Usage Rule calculates the usage percentage
for containerized clusters.

Based on the comment
[here](#116128 (comment)),
my understanding of the issue was that because we were using a
`date_histogram` to grab the values, we could sometimes run into issues
with how `date_histogram` rounds the time range and aligns it towards
the start rather than the end, causing the last bucket to be incomplete.
This is aggravated by the fact that we make the fixed interval of the
histogram the size of the lookback window.

I took a slightly different path for the rewrite: rather than using the
derivative, I just look at the usage across the whole range using a
simple delta.

This has a glaring flaw in that it cannot account for the limits
changing within the lookback window (going higher/lower or set/unset),
which we will have to try to address in
#160905. The changes in this PR
should make the situation better in the other cases, and they make it
clear when the limits have changed by firing alerts.
#160897 outlines follow-up work
to align how the CPU usage is presented in other places in the UI.

# Screenshots

**Above threshold:**
<img width="1331" alt="above-threshold"
src="https://github.com/elastic/kibana/assets/2564140/4dc4dc2a-a858-4022-8407-8179ec3115df">

**Failed to compute usage:**
<img width="1324" alt="failed-to-compute"
src="https://github.com/elastic/kibana/assets/2564140/88cb3794-6466-4881-acea-002a4f81c34e">

**Limits changed:**
<img width="2082" alt="limits-changed"
src="https://github.com/elastic/kibana/assets/2564140/d0526421-9362-4695-ab00-af69aa9838c9">

**Limits missing:**
<img width="1743" alt="missing-resource-limits"
src="https://github.com/elastic/kibana/assets/2564140/82626968-8b18-453d-9cf8-8a6776a6a46e">

**Unexpected limits:**
<img width="1637" alt="unexpected-resource-limits"
src="https://github.com/elastic/kibana/assets/2564140/721deb15-d75b-4915-8f77-b18d0b33da7d">

# CPU usage for the Completely Fair Scheduler (CFS) for Control Groups (cgroup)

CPU usage for containers is calculated with this formula:
`execution_time / (time_quota_per_schedule_period * number_of_periods)`

Execution time is a counter of how many cycles the scheduler allowed the
container to execute; the quota is the limit of how many cycles are
allowed per period.

The number of periods is derived from the length of the period, which can
also be changed; the default is 0.1 seconds.
At the end of each period, the available cycles are refilled to
`time_quota_per_schedule_period`. With a longer period, you're likely to
be throttled more often since you'll have to wait longer for a refill,
so once you've used your allowance for that period you're blocked. With
a shorter period you're refilled more often, so your total
available usage is higher.
Both scenarios affect your percentage CPU usage, but the
number of elapsed periods is a proxy for both of these cases. If you
wanted to know about throttling rather than only CPU usage, you
might want a separate rule for that stat. In short, 100% CPU usage means
you're being throttled to some degree. The number of periods is a safe
proxy for the details of period length, as the period length only
affects the rate at which quota is refreshed.

These fields are counters, so for any given time range we need to grab
the biggest value (the latest) and subtract from it the lowest value
(the earliest) to get the delta; then we plug those delta values into
the formula above to get the factor (and multiply by 100 to make it a
percentage). The code also does some unit conversion because the quota is
in microseconds while the usage is in nanoseconds.
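
A minimal sketch of what that delta approach could look like as an aggregation (not the exact query in the rule; the field names are reused from the watcher earlier in this issue). The `* 1000` converts the quota from microseconds to nanoseconds to match `usage_nanos`, and the `* 100` turns the factor into a percentage:

```
"nodes": {
  "terms": { "field": "source_node.name" },
  "aggs": {
    "min_usage_nanos": { "min": { "field": "node_stats.os.cgroup.cpuacct.usage_nanos" } },
    "max_usage_nanos": { "max": { "field": "node_stats.os.cgroup.cpuacct.usage_nanos" } },
    "min_periods": { "min": { "field": "node_stats.os.cgroup.cpu.stat.number_of_elapsed_periods" } },
    "max_periods": { "max": { "field": "node_stats.os.cgroup.cpu.stat.number_of_elapsed_periods" } },
    "avg_quota_micros": { "avg": { "field": "node_stats.os.cgroup.cpu.cfs_quota_micros" } },
    "cpu_usage_percent": {
      "bucket_script": {
        "buckets_path": {
          "usage_max": "max_usage_nanos",
          "usage_min": "min_usage_nanos",
          "periods_max": "max_periods",
          "periods_min": "min_periods",
          "quota": "avg_quota_micros"
        },
        "script": "(params.usage_max - params.usage_min) / ((params.periods_max - params.periods_min) * params.quota * 1000) * 100"
      }
    }
  }
}
```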

# How to test

There are 3 main states to test:
1. No limit set but Kibana configured to use container stats.
2. Limit changed during the lookback period (to/from a real value, to/from no limit).
3. Limit set and CPU usage crossing the threshold and then falling back down to recovery.

**Note: Please also test the non-container use case for this rule to
ensure that didn't get broken during this refactor**

**1. Start Elasticsearch in a container without setting the CPU
limits:**
```
docker network create elastic
docker run --name es01 --net elastic -p 9201:9200 -e xpack.license.self_generated.type=trial -it docker.elastic.co/elasticsearch/elasticsearch:master-SNAPSHOT
```

(We're using `master-SNAPSHOT` to include a recent fix to reporting for
cgroup v2)

Make note of the generated password for the `elastic` user.

**2. Start another Elasticsearch instance to act as the monitoring
cluster**

**3. Configure Kibana to connect to the monitoring cluster and start
it**

**4. Configure Metricbeat to collect metrics from the Docker cluster and
ship them to the monitoring cluster, then start it**

Execute the below command next to the Metricbeat binary to grab the CA
certificate from the Elasticsearch cluster.

```
docker cp es01:/usr/share/elasticsearch/config/certs/http_ca.crt .
```

Use the `elastic` password and the CA certificate to configure the
`elasticsearch` module:
```
  - module: elasticsearch
    xpack.enabled: true
    period: 10s
    hosts:
      - "https://localhost:9201"
    username: "elastic"
    password: "PASSWORD"
    ssl.certificate_authorities: "PATH_TO_CERT/http_ca.crt"
```

**5. Configure an alert in Kibana with a chosen threshold**

OBSERVE: An alert fires to inform you that there looks to be a
misconfiguration, together with the current value for the
fallback metric (warning if the fallback metric is below the threshold,
danger if it is above).

**6. Set limit**
First stop ES using `docker stop es01`, then set the limit using `docker
update --cpus=1 es01` and start it again using `docker start es01`.
After a brief delay you should now see the alert change to a warning
about the limits having changed during the alert lookback period and
stating that the CPU usage could not be confidently calculated.
Wait for the change event to pass out of the lookback window.

**7. Generate load on the monitored cluster**

[Slingshot](https://github.com/elastic/slingshot) is an option. After
you clone it, you need to update the `package.json` to match [this
change](https://github.com/elastic/slingshot/blob/8bfa8351deb0d89859548ee5241e34d0920927e5/package.json#L45-L46)
before running `npm install`.

Then you can modify the value for `elasticsearch` in the
`configs/hosts.json` file like this:
```
"elasticsearch": {
    "node": "https://localhost:9201",
    "auth": {
      "username": "elastic",
      "password": "PASSWORD"
    },
    "ssl": {
      "ca": "PATH_TO_CERT/http_ca.crt",
      "rejectUnauthorized": false
    }
  }
```

Then you can start one or more instances of Slingshot like this:
`npx ts-node bin/slingshot load --config configs/hosts.json`

**8. Observe the alert firing in the logs**
Assuming you're using a connector for server log output, you should see
a message like below once the threshold is breached:
```
`[2023-06-13T13:05:50.036+02:00][INFO ][plugins.actions.server-log] Server log: CPU usage alert is firing for node e76ce10526e2 in cluster: docker-cluster. [View node](/app/monitoring#/elasticsearch/nodes/OyDWTz1PS-aEwjqcPN2vNQ?_g=(cluster_uuid:kasJK8VyTG6xNZ2PFPAtYg))`
```

The alert should also be visible in the Stack Monitoring UI overview
page.

At this point you can stop Slingshot and confirm that the alert recovers
once CPU usage goes back down below the threshold.

**9. Stop the load and confirm that the rule recovers.**

# A second opinion

I made a little dashboard to replicate what the graph in SM and the rule
**_should_** see:

[cpu_usage_dashboard.ndjson.zip](https://github.com/elastic/kibana/files/11728315/cpu_usage_dashboard.ndjson.zip)

If you want to play with the data, I've collected an `es_archive` which
you can load like this:
`node scripts/es_archiver load PATH_TO_ARCHIVE/containerized_cpu_load
--es-url http://elastic:changeme@localhost:9200 --kibana-url
http://elastic:changeme@localhost:5601/__UNSAFE_bypassBasePath`

[containerized_cpu_load.zip](https://github.com/elastic/kibana/files/11754646/containerized_cpu_load.zip)

These are the timestamps to view the data:
Start: Jun 13, 2023 @ 11:40:00.000
End:   Jun 13, 2023 @ 12:40:00.000
CPU average: 52.76%

---------

Co-authored-by: kibanamachine <[email protected]>
miltonhultgren (Contributor) commented

We've reverted the changes introduced to attempt to fix this issue.

miltonhultgren reopened this Dec 8, 2023
miltonhultgren removed their assignment Dec 8, 2023
miltonhultgren added a commit that referenced this issue Dec 8, 2023
Reverts #159351
Reverts #167244

Due to the many unexpected issues that these changes introduced, we've
decided to revert them until we have better solutions for the
problems we've learnt about.

Problems:
- Gaps in data cause alerts to fire (see next point)
- Normal CPU rescaling causes alerts to fire
#160905
- Any error fires an alert (since there is no other way to inform the
user about the problems faced by the rule executor)
- Many assumptions about cgroups only being for container users are
wrong

To address some of these issues we also need more functionality in the
alerting framework to be able to register secondary actions so that we
may trigger non-oncall workflows for when a rule faces issues with
evaluating the stats.

Original issue #116128
kibanamachine pushed a commit to kibanamachine/kibana that referenced this issue Dec 8, 2023
(cherry picked from commit 55bc6d5)
kibanamachine referenced this issue Dec 8, 2023
# Backport

This will backport the following commits from `main` to `8.12`:
- [monitoring] Revert CPU Usage rule changes (#172913)


### Questions ?
Please refer to the [Backport tool
documentation](https://github.com/sqren/backport)


Co-authored-by: Milton Hultgren <[email protected]>
smith added the Team:Monitoring label and removed the Team:Infra Monitoring UI label Jan 26, 2024