Add "For the last x" Option for Cluster Health Stack Monitoring Rule #145843

MakoWish · 2022-11-18T22:57:25Z

Description

We leverage the pre-built Stack Monitoring Rules, and one that annoyingly blows up our inboxes is the "Cluster Health" alert. Any time an index is rolled over by ILM, there is a brief period during which the cluster goes yellow until the replica is created. This triggers the rule to email us that the cluster health is yellow, and one minute later we receive another alert that it has recovered (the rule currently checks every minute).

Since the creation of replicas only takes a few seconds, it would be nice to be able to modify this Stack Monitoring rule to be similar to the Metric Threshold rules where we have a For the last x seconds/minutes/hours/days option. This would help to eliminate false-positive cluster health alerts.
EDIT: The rule should alert if it's been more than X time since there was a healthy status reported, not just report the current status being unhealthy.

Eric

DaveCTurner · 2022-11-21T11:28:09Z

Stack Monitoring is part of Kibana so I've moved this over to the Kibana repo.

elasticmachine · 2022-11-21T12:01:53Z

Pinging @elastic/infra-monitoring-ui (Team:Infra Monitoring UI)

elasticmachine · 2022-11-23T09:31:50Z

Pinging @elastic/actionable-observability (Team: Actionable Observability)

…#147565)

## Summary

This PR fixes #145843 by adding the ability to configure `duration` for the Stack Monitoring "Legacy Rules", along with a set of default rule parameters and custom validation, which can be configured per rule. There are three new attributes added to the LEGACY_RULE_DETAILS object:

- `defaults` – The default values for parameters (so we can set the default duration to `2m` for Cluster Health)
- `expressionConfig` – Configuration for turning on/off UI elements (like duration)
- `validate` – A custom validate function (so we can ensure `duration` is provided)

This will also allow us control over which of the legacy rules get duration and which ones we want to keep "as is". It also makes room for adding additional UI features in the future.

<img width="618" alt="image" src="https://user-images.githubusercontent.com/41702/207685736-d8dc3023-66d0-4e40-a564-830f290ec1e1.png">

### Checklist

Delete any items that are not applicable to this PR.

- [X] Any text added follows [EUI's writing guidelines](https://elastic.github.io/eui/#/guidelines/writing), uses sentence case text and includes [i18n support](https://github.com/elastic/kibana/blob/main/packages/kbn-i18n/README.md)
- [X] [Unit or functional tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html) were updated or added to match the most common scenarios

MakoWish · 2023-03-31T22:30:57Z

We just upgraded our DEV cluster to 8.7.0 yesterday, and I am not seeing this. Am I overlooking it somewhere? I can see the new "In the last" option, but I am not sure how that solves the issue. What I was hoping to see is a "for the last" type of option similar to a Metrics monitoring rule.

With a Metrics monitoring rule, you can essentially say, "Alert me if the CPU is over 90% for 5 minutes". With a rule like that, the CPU could spike over 90%, but as long as it calms down before that 5 minute mark, an alert action would not trigger. I was hoping to have that same functionality in the Cluster Health rule to be something along the lines of "If cluster health is not green for 5 minutes". That way, any time ILM rolls an index over, and there is that brief moment the replica is being created ("missing replica shard"), we don't get an alert for what is expected behavior.
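The "for the last" behavior described above can be sketched roughly as follows. This is only an illustration of the requested semantics, not how Kibana implements any rule: the function name `sustained_unhealthy`, the `(timestamp, status)` sample shape, and the `hold` parameter are all hypothetical.

```python
from datetime import datetime, timedelta


def sustained_unhealthy(samples, hold, now):
    """Return True only if the status has been non-green continuously
    for at least `hold`, judged from the most recent samples.

    `samples` is a list of (timestamp, status) tuples sorted oldest first.
    All names here are hypothetical and used for illustration only.
    """
    streak_start = None
    # Walk backwards from the newest sample to find where the current
    # unbroken run of non-green statuses began.
    for ts, status in reversed(samples):
        if status == "green":
            break
        streak_start = ts
    # No non-green streak at the tail means there is nothing to alert on.
    return streak_start is not None and now - streak_start >= hold


t0 = datetime(2023, 1, 1)
minute = timedelta(minutes=1)

# An ILM rollover blip: yellow for a single check, so no alert.
blip = [(t0, "green"), (t0 + minute, "yellow")]
print(sustained_unhealthy(blip, 5 * minute, t0 + minute))  # False
```

With this shape, a cluster that turns green again before the hold period elapses never triggers an action, which is the behavior being asked for.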

Eric

MakoWish · 2023-05-25T14:15:51Z

Checking back on this one.

MakoWish · 2023-06-13T23:06:49Z

Still hoping for more insight on this. We are now on 8.8.1 as of yesterday, and I still don't see this option. Please reopen this.

miltonhultgren · 2023-06-15T10:41:14Z

@MakoWish Do I understand correctly that you're looking to be able to express a rule like "alert me, if it's been more than 1 hour since the last healthy report"?
Meaning, we would query the data looking back a certain duration and if there are no healthy reports in this time range then we fire the alert?
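A minimal sketch of that look-back check, assuming health reports arrive as `(timestamp, status)` pairs. The function name `no_healthy_report_since` and the `lookback` parameter are hypothetical, and this does not reflect the actual Stack Monitoring query:

```python
from datetime import datetime, timedelta


def no_healthy_report_since(samples, lookback, now):
    """Return True when no green report falls inside [now - lookback, now].

    `samples` is an iterable of (timestamp, status) tuples in any order.
    Under this literal reading, a window with no reports at all also
    counts as "no healthy reports" and would fire; that is an assumption.
    """
    window_start = now - lookback
    return not any(
        status == "green" and window_start <= ts <= now
        for ts, status in samples
    )


t0 = datetime(2023, 1, 1)
minute = timedelta(minutes=1)

# Green two minutes ago, yellow now: a one-hour look-back still sees green.
reports = [(t0, "green"), (t0 + 2 * minute, "yellow")]
print(no_healthy_report_since(reports, timedelta(hours=1), t0 + 2 * minute))  # False
```

A short yellow period during a rollover would not fire under this check, because at least one green report still falls inside the window.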

That is indeed different from the changes @simianhacker did in the past for this, which perhaps got lost with adding in more flexibility for all of the rules.

We can re-open this issue but I cannot provide an estimate on when/if this work will be picked up.

MakoWish · 2023-06-15T15:35:51Z

Hi @miltonhultgren,

Basically, as the rule stands now, any time an index rolls over for ILM, the cluster momentarily goes yellow until the replica shards are created. This sends out an alert that the cluster is not healthy, but about a minute later we get an alert that the cluster has recovered. If we could implement an option similar to the Metric threshold rule, like "If the cluster health is not green for more than 5 minutes, trigger the rule", that would avoid the false positives created during ILM rollovers.

smith · 2023-06-27T19:52:23Z

Moving this to the backlog. It's still prioritized but we have a few support issues that are taking priority.

agjmills · 2023-09-25T14:30:30Z

👍 for this feature

stefnestor · 2023-11-02T19:45:47Z

Cross-linking similar ballpark Github: #113445

MakoWish · 2023-12-11T15:34:24Z

Any progress on this? I got 64 "Cluster health" alerts over the weekend, but they were all false positives for index rollovers as discussed in this issue. I would like to cut down on these false positives and only receive an alert if there is a legitimate health issue with the cluster. It has been more than a year now since opening this.

MakoWish · 2023-12-27T15:55:28Z

A quick update on this would be appreciated. We had 131 of these false-positive "Cluster health" alerts over the long Christmas weekend. If there truly were a cluster health issue, we would never know it, because this alert has been crying wolf for the 5+ years we've been using Elastic.

MakoWish · 2024-01-02T16:39:45Z

212 false-positive alerts this past weekend.

MakoWish · 2024-01-05T15:25:29Z

The same thing goes for the "Shard size" rule: when an index is being force-merged, there is a period of time during which the index is expected to be over the specified size threshold. This is again just a false positive that could be avoided with a "for the last 'x'" option on the rules. For the "Shard size" case, I would like to say, "Alert me if the shard size is too large for more than 15 minutes." For the "Cluster health" rule, five minutes would be fine. Again, you can do this with "Metric Threshold" rules, so why not with the stack monitoring rules?

vbohata · 2024-01-21T21:13:29Z

Agreed. We also have a large number of false positives, which once caused us to miss a real issue.

ddoorn · 2024-01-22T01:43:06Z

Also waiting for this to be fixed.

MakoWish · 2024-01-22T04:16:03Z

The most disappointing thing is that, even though we are platinum subscribers and I brought this up to our CSM, there is still silence from the Elastic side. 😡

stefnestor · 2024-01-23T20:31:16Z

Hi all; I just wanted to mention the sister-issue's publicly documented workaround here. (Your comments still stand, but AFAICT this issue is not a case of users being blocked with no workaround; it is more an annoyance, and a matter of the team fleshing out an older Rule structure with later learnings.)

agjmills · 2024-02-05T09:29:01Z

@smith it looks like this issue has been prioritised since 8.7.0, but subsequently removed from each planned release

Is there a plan to get this implemented in the next 3 months?

smith · 2024-03-08T02:18:00Z

@agjmills at this time we do not plan to prioritize this.

MakoWish added >enhancement labels Nov 18, 2022

DaveCTurner transferred this issue from elastic/elasticsearch Nov 21, 2022

botelastic bot added the needs-team Issues missing a team label label Nov 21, 2022

ppisljar added Team:Infra Monitoring UI - DEPRECATED DEPRECATED - Label for the Infra Monitoring UI team. Use Team:obs-ux-infra_services and removed needs-team Issues missing a team label labels Nov 21, 2022

miltonhultgren added the Feature:Stack Monitoring label Nov 21, 2022

emma-raffenne added the Team: Actionable Observability - DEPRECATED For Observability Alerting and SLOs use "Team:obs-ux-management", for AIops "Team:obs-knowledge" label Nov 23, 2022

smith removed the Team:Infra Monitoring UI - DEPRECATED DEPRECATED - Label for the Infra Monitoring UI team. Use Team:obs-ux-infra_services label Nov 28, 2022

emma-raffenne added the v8.7.0 label Nov 29, 2022

simianhacker mentioned this issue Dec 14, 2022

Adding duration configuration to Stack Monitoring Cluster Health rule #147565

Merged

2 tasks

simianhacker self-assigned this Dec 14, 2022

simianhacker closed this as completed in #147565 Dec 15, 2022

miltonhultgren reopened this Jun 15, 2023

miltonhultgren unassigned simianhacker Jun 15, 2023

maryam-saeidi removed the Team: Actionable Observability - DEPRECATED For Observability Alerting and SLOs use "Team:obs-ux-management", for AIops "Team:obs-knowledge" label Jun 21, 2023

botelastic bot added the needs-team Issues missing a team label label Jun 21, 2023

mbondyra added the Team:Infra Monitoring UI - DEPRECATED DEPRECATED - Label for the Infra Monitoring UI team. Use Team:obs-ux-infra_services label Jun 22, 2023

botelastic bot removed the needs-team Issues missing a team label label Jun 22, 2023

miltonhultgren mentioned this issue Jun 23, 2023

Health check alert triggers falsely on index rollover #148640

Closed

smith removed the needs:triage label Jun 27, 2023

smith added v8.10.0 and removed v8.7.0 labels Jun 27, 2023

miltonhultgren mentioned this issue Sep 25, 2023

Avoid alerting if downtime less than threshold #82925

Closed

stefnestor mentioned this issue Nov 2, 2023

Enhance "Cluster health" stack monitoring rule to allow user configuration for Yellow/Red/Both #113445

Open

smith added Team:Monitoring Stack Monitoring team and removed Team:Infra Monitoring UI - DEPRECATED DEPRECATED - Label for the Infra Monitoring UI team. Use Team:obs-ux-infra_services labels Nov 13, 2023

smith removed the v8.10.0 label Jan 23, 2024

sophiec20 added enhancement New value added to drive a business result and removed >enhancement labels Mar 27, 2024
