
[Stack Monitoring] CPU usage rule should handle usage limit changes #160905

Open
miltonhultgren opened this issue Jun 29, 2023 · 22 comments

@miltonhultgren
Contributor

Following up on #159351

The CPU usage rule as it looks today is not able to accurately calculate the CPU usage when a resource usage limit has changed within the rule's look back window (a limit has been added or removed, or the existing limit was changed to a higher or lower value).
The current rule simply alerts when it detects such a change, but ideally we would extend the rule to handle this case.
This means that the rule needs to be able to easily swap between the containerized and non-containerized calculation for the same node.

Handling the change is non-trivial but here are 3 options we can think of right now:

1. Split the look back window into two or more spans when a change is detected
The rule already detects the change and could respond to this situation by making follow-up queries, one per time range in which a given setting applied (there could be many), calculate the usage in each time range using the appropriate calculation, and then take the average of those. This could be costly in processing time within the rule if there are more than two spans. A rough sketch follows below.
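
A minimal sketch of what option 1 could look like, assuming the spans have already been derived from the detected change events; `LimitSpan` and `fetchUsageForSpan` are illustrative names, not existing rule code:

```ts
// Hypothetical sketch of option 1: compute usage per constant-limit span, then average.
interface LimitSpan {
  start: number;          // epoch ms
  end: number;            // epoch ms
  containerized: boolean; // whether a CPU quota was in effect during this span
}

// Each span is resolved with the calculation that matches its limit setting.
type UsageFetcher = (span: LimitSpan) => Promise<number>; // usage in percent

async function averageUsageAcrossSpans(
  spans: LimitSpan[],
  fetchUsageForSpan: UsageFetcher
): Promise<number> {
  // One follow-up query per span; this is the part that gets costly when the
  // limit changed more than once inside the look back window.
  const usages = await Promise.all(spans.map(fetchUsageForSpan));
  return usages.reduce((sum, usage) => sum + usage, 0) / usages.length;
}
```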

2. Use a date histogram to always get smaller time spans
This offers a few sub-options; we could, for example, drop the exact buckets where the change happened, but that requires having enough buckets that dropping a few would not greatly affect the average.
For each remaining bucket we then apply the appropriate calculation and take the average of the buckets.
It's possible this could be done in part by Elasticsearch, but most likely it will have to be done in Kibana.
This path exposes us to scalability risks by asking Elasticsearch to do more work, potentially hitting the bucket limit and timing out the rule execution due to the extra processing.
The current rule scales per cluster per node, which can partially be worked around by creating multiple instances of the rule that each filter for a specific cluster, for example. A rough query sketch follows below.
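
A rough sketch of the query shape this option implies, with many small fixed buckets inside the look back window; the interval and the cgroup field paths are assumptions based on the legacy `.monitoring-es-*` documents:

```ts
// Illustrative query body for option 2: per-bucket counter snapshots that Kibana
// (or Elasticsearch, if more of the work is pushed down) can turn into per-bucket deltas.
const optionTwoQuery = {
  size: 0,
  query: { range: { timestamp: { gte: 'now-5m' } } }, // the rule's look back window
  aggs: {
    nodes: {
      terms: { field: 'node_stats.node_id' },
      aggs: {
        over_time: {
          // Many small buckets instead of one bucket the size of the window,
          // so buckets containing a limit change can be dropped or handled separately.
          date_histogram: { field: 'timestamp', fixed_interval: '30s' },
          aggs: {
            usage_nanos: { max: { field: 'node_stats.os.cgroup.cpuacct.usage_nanos' } },
            periods: { max: { field: 'node_stats.os.cgroup.cpu.stat.number_of_elapsed_periods' } },
            quota_micros: { max: { field: 'node_stats.os.cgroup.cpu.cfs_quota_micros' } },
          },
        },
      },
    },
  },
};
```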

3. The long shot: Use Elasticsearch transforms to create data that is easy to alert on
Underlying the problems the rule faces is a data format that is not easy to alert on.
We could try to leverage a transform to change the data into something that is easier to make a yes/no decision on.
The transform would do (roughly) the work outlined in option 2 and put the result into a document which the rule can consume, leaving the rule quick to execute since the hard work is amortized by Elasticsearch.
This is somewhat uncharted territory: we don't know if a transform can keep up so that the rule doesn't lag, it introduces more complexity in the setup, and there is currently no way to install transforms as part of the alerting framework. The Stack Monitoring plugin would therefore have to own setting up and cleaning up such a transform and making sure the right permissions are available.
Further, there are some doubts about the scalability of transforms as well, especially for non-aggregated data. A hypothetical sketch of such a transform follows below.
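
A hypothetical continuous transform for this option (the destination index name and field paths are assumptions, reusing the aggregations from the option 2 sketch):

```ts
// Sketch of a continuous pivot transform that would materialize small per-node,
// per-bucket summary documents for the rule to query cheaply on each execution.
const cpuUsageSummaryTransform = {
  source: { index: ['.monitoring-es-*'] },
  dest: { index: 'monitoring-cpu-usage-summary' }, // hypothetical destination index
  sync: { time: { field: 'timestamp', delay: '60s' } }, // how far behind real time it may lag
  frequency: '1m',
  pivot: {
    group_by: {
      node_id: { terms: { field: 'node_stats.node_id' } },
      bucket: { date_histogram: { field: 'timestamp', fixed_interval: '30s' } },
    },
    aggregations: {
      // Same max-of-counter aggregations as in the option 2 sketch above.
      usage_nanos: { max: { field: 'node_stats.os.cgroup.cpuacct.usage_nanos' } },
      periods: { max: { field: 'node_stats.os.cgroup.cpu.stat.number_of_elapsed_periods' } },
      quota_micros: { max: { field: 'node_stats.os.cgroup.cpu.cfs_quota_micros' } },
    },
  },
};
```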

AC

  • The CPU usage rule accurately reports CPU usage even if the resource limits have changed during the configured look back window
@miltonhultgren miltonhultgren added Team:Infra Monitoring UI - DEPRECATED DEPRECATED - Label for the Infra Monitoring UI team. Use Team:obs-ux-infra_services Feature:Stack Monitoring labels Jun 29, 2023
@elasticmachine
Contributor

Pinging @elastic/infra-monitoring-ui (Team:Infra Monitoring UI)

@miltonhultgren
Contributor Author

Thinking about it, my vote would be for option 2.

Especially if we can push a majority of the work into Elasticsearch, that path has benefits since it's only one query, and the scaling issue can be addressed by having multiple instances of the rule with varying filters, leaving the overall tracing flow simpler.

miltonhultgren added a commit that referenced this issue Jul 6, 2023
Fixes #116128

# Summary

This PR changes how the CPU Usage Rule calculates the usage percentage
for containerized clusters.

Based on the comment
[here](#116128 (comment)),
my understanding of the issue was that because we were using a
`date_histogram` to grab the values, we could sometimes run into issues
with how `date_histogram` rounds the time range and aligns it towards
the start rather than the end, causing the last bucket to be incomplete.
This is aggravated by the fact that we make the fixed duration of the
histogram the size of the lookback window.

I took a slightly different path for the rewrite, rather than using the
derivative I just look at the usage across the whole range using a
simple delta.

This has a glaring flaw in that it cannot account for the limits
changing within the lookback window (going higher/lower or set/unset),
which we will have to try to address in
#160905. The changes in this PR
should make the situation better in the other cases and it makes clear
when the limits have changed by firing alerts.
#160897 outlines follow up work
to align how the CPU usage is presented in other places in the UI.

# Screenshots

**Above threshold:**
<img width="1331" alt="above-threshold"
src="https://github.com/elastic/kibana/assets/2564140/4dc4dc2a-a858-4022-8407-8179ec3115df">

**Failed to compute usage:**
<img width="1324" alt="failed-to-compute"
src="https://github.com/elastic/kibana/assets/2564140/88cb3794-6466-4881-acea-002a4f81c34e">

**Limits changed:**
<img width="2082" alt="limits-changed"
src="https://github.com/elastic/kibana/assets/2564140/d0526421-9362-4695-ab00-af69aa9838c9">

**Limits missing:**
<img width="1743" alt="missing-resource-limits"
src="https://github.com/elastic/kibana/assets/2564140/82626968-8b18-453d-9cf8-8a6776a6a46e">

**Unexpected limits:**
<img width="1637" alt="unexpected-resource-limits"
src="https://github.com/elastic/kibana/assets/2564140/721deb15-d75b-4915-8f77-b18d0b33da7d">

# CPU usage for the Completely Fair Scheduler (CFS) for Control Groups (cgroup)

CPU usage for containers is calculated with this formula:
`execution_time / (time_quota_per_schedule_period * number_of_periods)`

Execution time is a counter of how many cycles the container was allowed
to execute by the scheduler; the quota is the limit of how many cycles
are allowed per period.

The number of periods is derived from the length of the period, which can
also be changed, the default being 0.1 seconds.
At the end of each period, the available cycles are refilled to
`time_quota_per_schedule_period`. With a longer period, you're likely to
be throttled more often since you'll have to wait longer for a refresh,
so once you've used your allowance for that period you're blocked. With
a shorter period you're getting refilled more often, so your total
available usage is higher.
Both scenarios affect your percentage CPU usage, but the number of
elapsed periods is a proxy for both of these cases. If you wanted to know
about throttling rather than only CPU usage, you might want a separate
rule for that stat. In short, 100% CPU usage means you're being throttled
to some degree. The number of periods is a safe proxy for the details of
period length, as the period length only affects the rate at which quota
is refreshed.

These fields are counters, so for any given time range we need to grab
the biggest value (the latest) and subtract from it the lowest value
(the earliest) to get the delta, then plug those delta values into
the formula above to get the factor (then multiply by 100 to make it a
percentage). The code also does some unit conversion because the quota is
in microseconds while the usage is in nanoseconds.
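
To make the arithmetic concrete, here is a minimal sketch of that delta-based calculation (the function name and input shape are illustrative, not the rule's actual code):

```ts
// Sketch of the containerized CPU usage calculation from cgroup counter deltas.
// Inputs are the earliest and latest counter samples inside the lookback window.
interface CgroupCounters {
  usageNanos: number;     // cpuacct usage counter (execution time, nanoseconds)
  elapsedPeriods: number; // number_of_elapsed_periods counter
}

function containerCpuUsagePercent(
  earliest: CgroupCounters,
  latest: CgroupCounters,
  quotaMicros: number // cfs_quota_micros (per-period quota, microseconds)
): number {
  const usageDeltaNanos = latest.usageNanos - earliest.usageNanos;
  const periodsDelta = latest.elapsedPeriods - earliest.elapsedPeriods;
  // Convert the quota from microseconds to nanoseconds so the units match.
  const allowedNanos = quotaMicros * 1_000 * periodsDelta;
  if (allowedNanos <= 0) return NaN; // no quota set, or no periods elapsed in the window
  return (usageDeltaNanos / allowedNanos) * 100;
}
```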

# How to test

There are 3 main states to test:
- No limit set but Kibana configured to use container stats.
- Limit changed during the lookback period (to/from a real value, to/from no limit).
- Limit set and CPU usage crossing the threshold and then falling back down to recovery.

**Note: Please also test the non-container use case for this rule to
ensure that didn't get broken during this refactor**

**1. Start Elasticsearch in a container without setting the CPU
limits:**
```
docker network create elastic
docker run --name es01 --net elastic -p 9201:9200 -e xpack.license.self_generated.type=trial -it docker.elastic.co/elasticsearch/elasticsearch:master-SNAPSHOT
```

(We're using `master-SNAPSHOT` to include a recent fix to reporting for
cgroup v2)

Make note of the generated password for the `elastic` user.

**2. Start another Elasticsearch instance to act as the monitoring
cluster**

**3. Configure Kibana to connect to the monitoring cluster and start
it**

**4. Configure Metricbeat to collect metrics from the Docker cluster and
ship them to the monitoring cluster, then start it**

Execute the below command next to the Metricbeat binary to grab the CA
certificate from the Elasticsearch cluster.

```
docker cp es01:/usr/share/elasticsearch/config/certs/http_ca.crt .
```

Use the `elastic` password and the CA certificate to configure the
`elasticsearch` module:
```
  - module: elasticsearch
    xpack.enabled: true
    period: 10s
    hosts:
      - "https://localhost:9201"
    username: "elastic"
    password: "PASSWORD"
    ssl.certificate_authorities: "PATH_TO_CERT/http_ca.crt"
```

**5. Configure an alert in Kibana with a chosen threshold**

OBSERVE: An alert fires to inform you that there looks to be a
misconfiguration, together with the current value of the fallback
metric (warning if the fallback metric is below the threshold, danger if
it is above).

**6. Set limit**
First stop ES using `docker stop es01`, then set the limit using `docker
update --cpus=1 es01`, and start it again using `docker start es01`.
After a brief delay you should see the alert change to a warning that
the limits changed during the alert lookback period and that the CPU
usage could not be confidently calculated.
Wait for the change event to pass out of the lookback window.

**7. Generate load on the monitored cluster**

[Slingshot](https://github.com/elastic/slingshot) is an option. After
you clone it, you need to update the `package.json` to match [this
change](https://github.com/elastic/slingshot/blob/8bfa8351deb0d89859548ee5241e34d0920927e5/package.json#L45-L46)
before running `npm install`.

Then you can modify the value for `elasticsearch` in the
`configs/hosts.json` file like this:
```
"elasticsearch": {
    "node": "https://localhost:9201",
    "auth": {
      "username": "elastic",
      "password": "PASSWORD"
    },
    "ssl": {
      "ca": "PATH_TO_CERT/http_ca.crt",
      "rejectUnauthorized": false
    }
  }
```

Then you can start one or more instances of Slingshot like this:
`npx ts-node bin/slingshot load --config configs/hosts.json`

**8. Observe the alert firing in the logs**
Assuming you're using a connector for server log output, you should see
a message like below once the threshold is breached:
```
`[2023-06-13T13:05:50.036+02:00][INFO ][plugins.actions.server-log] Server log: CPU usage alert is firing for node e76ce10526e2 in cluster: docker-cluster. [View node](/app/monitoring#/elasticsearch/nodes/OyDWTz1PS-aEwjqcPN2vNQ?_g=(cluster_uuid:kasJK8VyTG6xNZ2PFPAtYg))`
```

The alert should also be visible in the Stack Monitoring UI overview
page.

At this point you can stop Slingshot and confirm that the alert recovers
once CPU usage goes back down below the threshold.

**9. Stop the load and confirm that the rule recovers.**

# A second opinion

I made a little dashboard to replicate what the graph in SM and the rule
**_should_** see:

[cpu_usage_dashboard.ndjson.zip](https://github.com/elastic/kibana/files/11728315/cpu_usage_dashboard.ndjson.zip)

If you want to play with the data, I've collected an `es_archive` which
you can load like this:
`node scripts/es_archiver load PATH_TO_ARCHIVE/containerized_cpu_load
--es-url http://elastic:changeme@localhost:9200 --kibana-url
http://elastic:changeme@localhost:5601/__UNSAFE_bypassBasePath`

[containerized_cpu_load.zip](https://github.com/elastic/kibana/files/11754646/containerized_cpu_load.zip)

These are the timestamps to view the data:
Start: Jun 13, 2023 @ 11:40:00.000
End:   Jun 13, 2023 @ 12:40:00.000
CPU average: 52.76%

---------

Co-authored-by: kibanamachine <[email protected]>
@bck01215

bck01215 commented Sep 19, 2023

The CPU usage rule seems to break non-containerized environments.
We get this alert on all our nodes now.

@miltonhultgren
Contributor Author

miltonhultgren commented Sep 19, 2023

@bck01215 Can you explain more about your setup?
Are you using containers? Are you using cgroups without containers? Do you have limits set? Have you configured monitoring.ui.container.elasticsearch.enabled? Are you monitoring both containerized and non-containerized workloads with the same Kibana instance? Can you share the result of GET /_nodes/_local/stats from the node in question?

That error means that Kibana is configured with monitoring.ui.container.elasticsearch.enabled set to false (which is the default) but the nodes are reporting monitoring data that includes cgroup metrics for usage limits (which Elasticsearch only reports if it's running in a cgroup).
In that case, the basic CPU metric is likely wrong since it doesn't account for the cgroup limits, which is why Kibana should be configured with monitoring.ui.container.elasticsearch.enabled set to true instead (since that changes how the rule computes the usage).

@bck01215

We did not have containers. This error came from updating from 8.4 to 8.10. It seemed that deleting and recreating the rule fixed it.

@miltonhultgren
Contributor Author

Interesting, perhaps there is/was something stored in the rule state that would affect the flow.

Anyway, I'm glad it was solved by re-creating the rule, don't hesitate to reach out again if any issues come up!

@msafdal

msafdal commented Sep 28, 2023

Getting the same alert as @bck01215 triggered after the last few upgrades. Currently running 8.10.2 on both Elasticsearch and Kibana.

Kibana is configured for non-containerized workloads but node xxx has resource limits configured.


In my case removing and re-adding the rule did not resolve the issue.

I'm not running any containers, however I noticed that the systemd service mentions "CGroup". I've not touched any cgroup limits. It's installed "out of the box" via apt on a fully patched Ubuntu 20.04 system. This might be a false positive, depending on how the containerization detection is done.

Operating System: Ubuntu 20.04.6 LTS
Kernel: Linux 5.4.0-163-generic
Architecture: x86-64

# systemctl status elasticsearch.service
● elasticsearch.service - Elasticsearch
     Loaded: loaded (/usr/lib/systemd/system/elasticsearch.service; enabled; vendor preset: enabled)
    Drop-In: /etc/systemd/system/elasticsearch.service.d
             └─override.conf
     Active: active (running) since Thu 2023-09-28 10:45:02 CEST; 17min ago
       Docs: https://www.elastic.co
   Main PID: 2465 (java)
      Tasks: 159 (limit: 9425)
     Memory: 6.1G
     CGroup: /system.slice/elasticsearch.service
             ├─2465 /usr/share/elasticsearch/jdk/bin/java -Xms4m -Xmx64m -XX:+UseSerialGC -Dcli.name=server -Dcli.script=/usr/share/elasticsearch/bin/elasticsearch -Dcli.libs=lib/tools>
             ├─2539 /usr/share/elasticsearch/jdk/bin/java -Des.networkaddress.cache.ttl=60 -Des.networkaddress.cache.negative.ttl=10 -Djava.security.manager=allow -XX:+AlwaysPreTouch ->
             └─2563 /usr/share/elasticsearch/modules/x-pack-ml/platform/linux-x86_64/bin/controller
# cat /etc/systemd/system/elasticsearch.service.d/override.conf
[Service]
LimitMEMLOCK=infinity
# systemctl show elasticsearch.service

Type=notify
Restart=no
NotifyAccess=all
RestartUSec=100ms
TimeoutStartUSec=15min
TimeoutStopUSec=infinity
TimeoutAbortUSec=infinity
RuntimeMaxUSec=infinity
WatchdogUSec=0
WatchdogTimestampMonotonic=0
RootDirectoryStartOnly=no
RemainAfterExit=no
GuessMainPID=yes
SuccessExitStatus=143
MainPID=2465
ControlPID=0
FileDescriptorStoreMax=0
NFileDescriptorStore=0
StatusErrno=0
Result=success
ReloadResult=success
CleanResult=success
UID=111
GID=113
NRestarts=0
OOMPolicy=stop
ExecMainStartTimestamp=Thu 2023-09-28 10:44:12 CEST
ExecMainStartTimestampMonotonic=91909234
ExecMainExitTimestampMonotonic=0
ExecMainPID=2465
ExecMainCode=0
ExecMainStatus=0
ExecStart={ path=/usr/share/elasticsearch/bin/systemd-entrypoint ; argv[]=/usr/share/elasticsearch/bin/systemd-entrypoint -p ${PID_DIR}/elasticsearch.pid --quiet ; ignore_errors=no ; start_time=[n/a] ; stop_time=[n/a] ; pid=0 ; code=(null) ; status=0/0 }
ExecStartEx={ path=/usr/share/elasticsearch/bin/systemd-entrypoint ; argv[]=/usr/share/elasticsearch/bin/systemd-entrypoint -p ${PID_DIR}/elasticsearch.pid --quiet ; flags= ; start_time=[n/a] ; stop_time=[n/a] ; pid=0 ; code=(null) ; status=0/0 }
Slice=system.slice
ControlGroup=/system.slice/elasticsearch.service
MemoryCurrent=6604414976
CPUUsageNSec=[not set]
EffectiveCPUs=
EffectiveMemoryNodes=
TasksCurrent=165
IPIngressBytes=[no data]
IPIngressPackets=[no data]
IPEgressBytes=[no data]
IPEgressPackets=[no data]
IOReadBytes=18446744073709551615
IOReadOperations=18446744073709551615
IOWriteBytes=18446744073709551615
IOWriteOperations=18446744073709551615
Delegate=no
CPUAccounting=no
CPUWeight=[not set]
StartupCPUWeight=[not set]
CPUShares=[not set]
StartupCPUShares=[not set]
CPUQuotaPerSecUSec=infinity
CPUQuotaPeriodUSec=infinity
AllowedCPUs=
AllowedMemoryNodes=
IOAccounting=no
IOWeight=[not set]
StartupIOWeight=[not set]
BlockIOAccounting=no
BlockIOWeight=[not set]
StartupBlockIOWeight=[not set]
MemoryAccounting=yes
DefaultMemoryLow=0
DefaultMemoryMin=0
MemoryMin=0
MemoryLow=0
MemoryHigh=infinity
MemoryMax=infinity
MemorySwapMax=infinity
MemoryLimit=infinity
DevicePolicy=auto
TasksAccounting=yes
TasksMax=9425
IPAccounting=no
Environment=ES_HOME=/usr/share/elasticsearch ES_PATH_CONF=/etc/elasticsearch PID_DIR=/var/run/elasticsearch ES_SD_NOTIFY=true
EnvironmentFiles=/etc/default/elasticsearch (ignore_errors=yes)
UMask=0022
LimitCPU=infinity
LimitCPUSoft=infinity
LimitFSIZE=infinity
LimitFSIZESoft=infinity
LimitDATA=infinity
LimitDATASoft=infinity
LimitSTACK=infinity
LimitSTACKSoft=8388608
LimitCORE=infinity
LimitCORESoft=0
LimitRSS=infinity
LimitRSSSoft=infinity
LimitNOFILE=65535
LimitNOFILESoft=65535
LimitAS=infinity
LimitASSoft=infinity
LimitNPROC=4096
LimitNPROCSoft=4096
LimitMEMLOCK=infinity
LimitMEMLOCKSoft=infinity
LimitLOCKS=infinity
LimitLOCKSSoft=infinity
LimitSIGPENDING=31419
LimitSIGPENDINGSoft=31419
LimitMSGQUEUE=819200
LimitMSGQUEUESoft=819200
LimitNICE=0
LimitNICESoft=0
LimitRTPRIO=0
LimitRTPRIOSoft=0
LimitRTTIME=infinity
LimitRTTIMESoft=infinity
WorkingDirectory=/usr/share/elasticsearch
OOMScoreAdjust=0
Nice=0
IOSchedulingClass=0
IOSchedulingPriority=0
CPUSchedulingPolicy=0
CPUSchedulingPriority=0
CPUAffinity=
CPUAffinityFromNUMA=no
NUMAPolicy=n/a
NUMAMask=
TimerSlackNSec=50000
CPUSchedulingResetOnFork=no
NonBlocking=no
StandardInput=null
StandardInputData=
StandardOutput=journal
StandardError=inherit
TTYReset=no
TTYVHangup=no
TTYVTDisallocate=no
SyslogPriority=30
SyslogLevelPrefix=yes
SyslogLevel=6
SyslogFacility=3
LogLevelMax=-1
LogRateLimitIntervalUSec=0
LogRateLimitBurst=0
SecureBits=0
CapabilityBoundingSet=cap_chown cap_dac_override cap_dac_read_search cap_fowner cap_fsetid cap_kill cap_setgid cap_setuid cap_setpcap cap_linux_immutable cap_net_bind_service cap_net_broadcast cap_net_admin cap_net_raw cap_ipc_lock cap_ipc_owner cap_sys_module cap_sys_rawio cap_sys_chroot cap_sys_ptrace cap_sys_pacct cap_sys_admin cap_sys_boot cap_sys_nice cap_sys_resource cap_sys_time cap_sys_tty_config cap_mknod cap_lease cap_audit_write cap_audit_control cap_setfcap cap_mac_override cap_mac_admin cap_syslog cap_wake_alarm cap_block_suspend cap_audit_read
AmbientCapabilities=
User=elasticsearch
Group=elasticsearch
DynamicUser=no
RemoveIPC=no
MountFlags=
PrivateTmp=yes
PrivateDevices=no
ProtectKernelTunables=no
ProtectKernelModules=no
ProtectKernelLogs=no
ProtectControlGroups=no
PrivateNetwork=no
PrivateUsers=no
PrivateMounts=no
ProtectHome=no
ProtectSystem=no
SameProcessGroup=no
UtmpMode=init
IgnoreSIGPIPE=yes
NoNewPrivileges=no
SystemCallErrorNumber=0
LockPersonality=no
RuntimeDirectoryPreserve=no
RuntimeDirectoryMode=0755
RuntimeDirectory=elasticsearch
StateDirectoryMode=0755
CacheDirectoryMode=0755
LogsDirectoryMode=0755
ConfigurationDirectoryMode=0755
TimeoutCleanUSec=infinity
MemoryDenyWriteExecute=no
RestrictRealtime=no
RestrictSUIDSGID=no
RestrictNamespaces=no
MountAPIVFS=no
KeyringMode=private
ProtectHostname=no
KillMode=process
KillSignal=15
RestartKillSignal=15
FinalKillSignal=9
SendSIGKILL=no
SendSIGHUP=no
WatchdogSignal=6
Id=elasticsearch.service
Names=elasticsearch.service
Requires=sysinit.target system.slice -.mount
Wants=network-online.target
WantedBy=multi-user.target
Conflicts=shutdown.target
Before=multi-user.target shutdown.target
After=systemd-journald.socket -.mount system.slice basic.target sysinit.target network-online.target systemd-tmpfiles-setup.service
RequiresMountsFor=/tmp /var/tmp /run/elasticsearch /usr/share/elasticsearch
Documentation=https://www.elastic.co
Description=Elasticsearch
LoadState=loaded
ActiveState=active
SubState=running
FragmentPath=/usr/lib/systemd/system/elasticsearch.service
DropInPaths=/etc/systemd/system/elasticsearch.service.d/override.conf
UnitFileState=enabled
UnitFilePreset=enabled
StateChangeTimestamp=Thu 2023-09-28 10:45:02 CEST
StateChangeTimestampMonotonic=141699634
InactiveExitTimestamp=Thu 2023-09-28 10:44:12 CEST
InactiveExitTimestampMonotonic=91909577
ActiveEnterTimestamp=Thu 2023-09-28 10:45:02 CEST
ActiveEnterTimestampMonotonic=141699634
ActiveExitTimestampMonotonic=0
InactiveEnterTimestampMonotonic=0
CanStart=yes
CanStop=yes
CanReload=no
CanIsolate=no
CanClean=runtime
StopWhenUnneeded=no
RefuseManualStart=no
RefuseManualStop=no
AllowIsolate=no
DefaultDependencies=yes
OnFailureJobMode=replace
IgnoreOnIsolate=no
NeedDaemonReload=no
JobTimeoutUSec=infinity
JobRunningTimeoutUSec=infinity
JobTimeoutAction=none
ConditionResult=yes
AssertResult=yes
ConditionTimestamp=Thu 2023-09-28 10:44:12 CEST
ConditionTimestampMonotonic=91906857
AssertTimestamp=Thu 2023-09-28 10:44:12 CEST
AssertTimestampMonotonic=91906857
Transient=no
Perpetual=no
StartLimitIntervalUSec=10s
StartLimitBurst=5
StartLimitAction=none
FailureAction=none
SuccessAction=none
InvocationID=4f28f67c4fda4d70bd994ce417ef1f03
CollectMode=inactive
GET /

{
  "name": "redacted",
  "cluster_name": "redacted",
  "cluster_uuid": "MO0h-6amRzmUrahDwUyd4Q",
  "version": {
    "number": "8.10.2",
    "build_flavor": "default",
    "build_type": "deb",
    "build_hash": "6d20dd8ce62365be9b1aca96427de4622e970e9e",
    "build_date": "2023-09-19T08:16:24.564900370Z",
    "build_snapshot": false,
    "lucene_version": "9.7.0",
    "minimum_wire_compatibility_version": "7.17.0",
    "minimum_index_compatibility_version": "7.0.0"
  },
  "tagline": "You Know, for Search"
}
GET _nodes/stats/process


{
  "_nodes": {
    "total": 3,
    "successful": 3,
    "failed": 0
  },
  "cluster_name": "redacted",
  "nodes": {
    "LcJxYbj5QPiJoKGxGMVTRw": {
      "timestamp": 1695891961430,
      "name": "redacted",
      "transport_address": "10.35.6.16:9300",
      "host": "10.35.6.16",
      "ip": "10.35.6.16:9300",
      "roles": [
        "data",
        "ingest",
        "master",
        "ml",
        "remote_cluster_client",
        "transform"
      ],
      "attributes": {
        "ml.allocated_processors_double": "4.0",
        "ml.allocated_processors": "4",
        "ml.machine_memory": "8331182080",
        "transform.config_version": "10.0.0",
        "xpack.installed": "true",
        "ml.config_version": "10.0.0",
        "ml.max_jvm_size": "4294967296"
      },
      "process": {
        "timestamp": 1695891961096,
        "open_file_descriptors": 4900,
        "max_file_descriptors": 65535,
        "cpu": {
          "percent": 34,
          "total_in_millis": 586047080
        },
        "mem": {
          "total_virtual_in_bytes": 398558937088
        }
      }
    },
    "fpHmBFy1QLSoIbxE5DEBsQ": {
      "timestamp": 1695891961442,
      "name": "redacted",
      "transport_address": "10.35.6.17:9300",
      "host": "10.35.6.17",
      "ip": "10.35.6.17:9300",
      "roles": [
        "data",
        "ingest",
        "master",
        "ml",
        "remote_cluster_client",
        "transform"
      ],
      "attributes": {
        "ml.allocated_processors_double": "4.0",
        "ml.allocated_processors": "4",
        "ml.machine_memory": "8331186176",
        "transform.config_version": "10.0.0",
        "xpack.installed": "true",
        "ml.config_version": "10.0.0",
        "ml.max_jvm_size": "4294967296"
      },
      "process": {
        "timestamp": 1695891961110,
        "open_file_descriptors": 2776,
        "max_file_descriptors": 65535,
        "cpu": {
          "percent": 53,
          "total_in_millis": 5383620
        },
        "mem": {
          "total_virtual_in_bytes": 134904147968
        }
      }
    },
    "gZNoi6l8RrCyH4uD7BpYTg": {
      "timestamp": 1695891961420,
      "name": "redacted",
      "transport_address": "10.35.6.18:9300",
      "host": "10.35.6.18",
      "ip": "10.35.6.18:9300",
      "roles": [
        "data",
        "ingest",
        "master",
        "ml",
        "remote_cluster_client",
        "transform"
      ],
      "attributes": {
        "ml.allocated_processors_double": "4.0",
        "ml.allocated_processors": "4",
        "ml.machine_memory": "8331423744",
        "xpack.installed": "true",
        "transform.config_version": "10.0.0",
        "ml.config_version": "10.0.0",
        "ml.max_jvm_size": "4294967296"
      },
      "process": {
        "timestamp": 1695891960798,
        "open_file_descriptors": 4962,
        "max_file_descriptors": 65535,
        "cpu": {
          "percent": 57,
          "total_in_millis": 1993530
        },
        "mem": {
          "total_virtual_in_bytes": 377759141888
        }
      }
    }
  }
}

Let me know if you need any further information.

@tonyghiani
Contributor

Hey @msafdal, as you correctly mentioned, it does depend on how the containerization detection is done. We noticed this from another report and we are updating the way this is detected in this PR.

Regarding the limits, the default values are unset or infinity, which is equivalent to not having them set.
You can check each property's meaning and set it accordingly in your override file to set the limit that the control group should use.

@k4z4n0v4

k4z4n0v4 commented Sep 29, 2023

Hijacking the thread to ask what "Kibana is configured for non-containerized workloads" means. I'm running the stack on Docker Swarm and started receiving the same alert after 8.10.2. I couldn't find anything about telling Kibana it's running in a container. What's my fix, given that the alert is right and I haven't configured Kibana properly?

EDIT: I set monitoring.ui.container.elasticsearch.enabled: true in kibana.yml as per @miltonhultgren's comment, and after restarting Kibana's Docker container I now get the opposite alert.

I'm guessing the rule is somewhat inconsistent even for truly containerized stacks now.

@miltonhultgren
Contributor Author

@k4z4n0v4 This is a miss on our part: we didn't consider the case where someone is running in a container/cgroup without limits on purpose. We have a fix coming out in the next patch, but in the meantime you could work around this by setting the limit on your containers to 100% of your available CPU.

@willemdh

willemdh commented Oct 2, 2023

This triggers on all our nodes since the update to 8.10.2. Our nodes are not containerized, and we have no limits configured AFAIK (although the alerts say we do).

@miltonhultgren
Contributor Author

@willemdh If the alert is reporting that you have limits specified then that is because that's what Elasticsearch is reporting,
and in that case you should most likely configure Kibana to monitor a containerized workload (container or cgroup based) so that the CPU calculation is correct.

@leandrojmp

leandrojmp commented Oct 2, 2023

Just upgraded my monitoring cluster to 8.10.2 and got the same alert for all of my 20 nodes.

I do not use containers, I run on normal VMs, and I'm not sure what I should do to fix this.


Added the following line into kibana.yml

monitoring.ui.container.elasticsearch.enabled: true

But now the alert is the inverse for all my nodes.


I'm using the rpm package and systemd uses cgroups to run elasticsearch.

So, it seems that there is no workaround for this, the solution is to disable the rule and wait for the fix on #167244

@miltonhultgren
Contributor Author

miltonhultgren commented Oct 3, 2023

@leandrojmp Is it not possible to define the limit on your cgroup to 100% of your CPU (which is the same as not having the limit but it'll make the rule happy)?

Either way, it seems odd that you're getting both sides of the issue. Either you have the cgroup metrics being reported or not, I'm not sure what's going on there. If you hit /_nodes/_local/stats on the Elasticsearch node giving the alert, do you see the cgroup metrics filled in with a quota?

@leandrojmp

leandrojmp commented Oct 3, 2023

Hello @miltonhultgren,

Is it not possible to define the limit on your cgroup to 100% of your CPU (which is the same as not having the limit but it'll make the rule happy)?

I didn't make any changes to cgroups or applied any limits, I'm running the default rpm package distribution, I just installed the package, configured Elasticsearch and started the service.

This is how systemd works; it uses cgroups. This is the output of systemctl status elasticsearch on one of the nodes:

● elasticsearch.service - Elasticsearch
   Loaded: loaded (/usr/lib/systemd/system/elasticsearch.service; enabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/elasticsearch.service.d
           └─override.conf
   Active: active (running) since Tue 2023-06-20 23:42:20 UTC; 3 months 13 days ago
     Docs: https://www.elastic.co
 Main PID: 1226 (java)
    Tasks: 359 (limit: 408607)
   Memory: 58.1G
   CGroup: /system.slice/elasticsearch.service
           ├─1226 /usr/share/elasticsearch/jdk/bin/java -Xms4m -Xmx64m -XX:+UseSerialGC -Dcli.name=server -Dcli.script=/usr/share/elasticsearch/bin/elasticsearch -Dcli.libs=lib/tools/server-cli -Des.path.home=/usr/share/elasticsearch -Des.path.conf=/etc/elasticsearch -Des.distribution.type=rpm -cp /usr/share/elast>
           ├─3633 /usr/share/elasticsearch/jdk/bin/java -Des.networkaddress.cache.ttl=60 -Des.networkaddress.cache.negative.ttl=10 -Djava.security.manager=allow -XX:+AlwaysPreTouch -Xss1m -Djava.awt.headless=true -Dfile.encoding=UTF-8 -Djna.nosys=true -XX:-OmitStackTraceInFastThrow -Dio.netty.noUnsafe=true -Dio.ne>
           └─4464 /usr/share/elasticsearch/modules/x-pack-ml/platform/linux-x86_64/bin/controller

So everything is default; this probably affects anyone who runs Elasticsearch using the rpm or deb packages. I prefer not to change anything related to cgroups because I'm not familiar with how this works with systemd, and this is a production environment.

Either way, it seems odd that you're getting both sides of the issue. Either you have the cgroup metrics being reported or not, I'm not sure what's going on there.

Yeah, if I do not set monitoring.ui.container.elasticsearch.enabled on Kibana I get the alert about a non-containerized workload with resource limits, but if I enable it I get the alert about a containerized workload without resource limits. Either way I get alerts for all my nodes; should I open another issue to track this?

I upgraded just the monitoring cluster to 8.10.2; the production cluster and Metricbeat are still on 8.8.1. I'm not sure if this has an impact or not, but an upgrade to 8.10.2 in the production cluster is planned for this week.

If you hit /_nodes/_local/stats on the Elasticsearch node giving the alert, do you see the cgroup metrics filled in with a quota?

This happens for all nodes, and this is the cgroup part in the response for one of them:

        "cgroup": {
          "cpuacct": {
            "control_group": "/",
            "usage_nanos": 30412695915531812
          },
          "cpu": {
            "control_group": "/",
            "cfs_period_micros": 100000,
            "cfs_quota_micros": -1,
            "stat": {
              "number_of_elapsed_periods": 0,
              "number_of_times_throttled": 0,
              "time_throttled_nanos": 0
            }
          },
          "memory": {
            "control_group": "/system.slice/elasticsearch.service",
            "limit_in_bytes": "9223372036854771712",
            "usage_in_bytes": "62579138560"
          }
        }

@miltonhultgren
Contributor Author

Got it, thanks for the insight @leandrojmp! This change had a bigger effect than we anticipated (the flag being named container is misleading) since it affects all cgroup runtimes; like you mentioned, this is the default for some setups, which we didn't expect (a miss on our part).

Thanks for sharing the results of the stats endpoint, I see the issue now. When Kibana is configured for non-container (non-cgroup) workloads, it used to check whether the metric values are null (meaning not reported at all), while the container (cgroup) path checks whether they are not -1, so there isn't an exact overlap between the two cases.
The fix introduced by #167244 should address both of those cases.
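
As a simplified illustration of that mismatch (not the actual rule code): a default systemd setup reports a cgroup quota of -1, which the non-container path treats as "limits configured" while the container path treats as "no limits set":

```ts
// Simplified illustration of the two detection paths described above (not the real code).
type Quota = number | null | undefined; // cfs_quota_micros as reported by the node

// Non-container path: treats any reported value as "limits are configured",
// even -1, which actually means "no limit".
const nonContainerPathSeesLimits = (quota: Quota) => quota != null;

// Container path: -1 explicitly means "no limit set".
const containerPathSeesLimits = (quota: Quota) => quota != null && quota !== -1;

// A default systemd setup reports cfs_quota_micros: -1, so the first check warns about
// "resource limits configured" while the second warns about "no resource limits".
```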

Apologies again for all the noise this is causing!

@leandrojmp

@miltonhultgren

So when 8.11 drops I would need to upgrade just the monitoring cluster to not get the alerts anymore, right? Because we upgrade our production cluster every quarter: we will upgrade to 8.10.2 this week and the next upgrade will only be next quarter.

@miltonhultgren
Contributor Author

@tonyghiani Did we backport this to 8.10.X or only 8.11.X? Let's make sure this comes out with the next patch release for 8.10!

@leandrojmp The alerting system only runs in your monitoring cluster's Kibana so upgrading that will be enough!

@jacoor

jacoor commented Nov 7, 2023

@miltonhultgren Thanks for all the work on this.
Could you confirm if this has been backported to 8.10.X and if so, which exact version?
#167244 has only 8.11 label and backport: skip.

@tonyghiani
Contributor

@miltonhultgren apologies for the delay, I completely missed your mention here.
The PR was not backported into 8.10.x, I'll see if it's possible to bring the same changes to the latest 8.10.x patch.

@tonyghiani
Contributor

This #170740 should backport the fix to 8.10.

@tonyghiani
Contributor

I closed the above PR since it won't be released with new patches for 8.10.x, so the fix will be available starting from 8.11.0.

@smith smith added Team:Monitoring Stack Monitoring team and removed Team:Infra Monitoring UI - DEPRECATED DEPRECATED - Label for the Infra Monitoring UI team. Use Team:obs-ux-infra_services labels Nov 13, 2023
miltonhultgren added a commit that referenced this issue Dec 8, 2023
Reverts #159351
Reverts #167244

Due to the many unexpected issues that these changes introduced we've
decided to revert these changes until we have better solutions for the
problems we've learnt about.

Problems:
- Gaps in data cause alerts to fire (see next point)
- Normal CPU rescaling causes alerts to fire
#160905
- Any error fires an alert (since there is no other way to inform the
user about the problems faced by the rule executor)
- Many assumptions about cgroups only being for container users are
wrong

To address some of these issues we also need more functionality in the
alerting framework to be able to register secondary actions so that we
may trigger non-oncall workflows for when a rule faces issues with
evaluating the stats.

Original issue #116128
kibanamachine pushed a commit to kibanamachine/kibana that referenced this issue Dec 8, 2023
Reverts elastic#159351
Reverts elastic#167244

Due to the many unexpected issues that these changes introduced we've
decided to revert these changes until we have better solutions for the
problems we've learnt about.

Problems:
- Gaps in data cause alerts to fire (see next point)
- Normal CPU rescaling causes alerts to fire
elastic#160905
- Any error fires an alert (since there is no other way to inform the
user about the problems faced by the rule executor)
- Many assumptions about cgroups only being for container users are
wrong

To address some of these issues we also need more functionality in the
alerting framework to be able to register secondary actions so that we
may trigger non-oncall workflows for when a rule faces issues with
evaluating the stats.

Original issue elastic#116128

(cherry picked from commit 55bc6d5)
kibanamachine added a commit that referenced this issue Dec 8, 2023
# Backport

This will backport the following commits from `main` to `8.12`:
- [[monitoring] Revert CPU Usage rule changes
(#172913)](#172913)

<!--- Backport version: 8.9.7 -->

### Questions ?
Please refer to the [Backport tool
documentation](https://github.com/sqren/backport)


Co-authored-by: Milton Hultgren <[email protected]>