Add "For the last x" Option for Cluster Health Stack Monitoring Rule #145843

Open
MakoWish opened this issue Nov 18, 2022 · 21 comments · Fixed by #147565
Labels
enhancement New value added to drive a business result Feature:Stack Monitoring Team:Monitoring Stack Monitoring team

Comments

@MakoWish

MakoWish commented Nov 18, 2022

Description

We use the pre-built Stack Monitoring Rules, and one that annoyingly floods our inboxes is the "Cluster Health" alert. Any time an index is rolled over by ILM, there is a brief period during which the cluster goes yellow until the replica is created. This triggers the rule to email us that the cluster health is yellow, and one minute later we receive another alert that it has recovered (the rule currently checks every minute).

Since the creation of replicas only takes a few seconds, it would be nice to be able to modify this Stack Monitoring rule to work like the Metric Threshold rules, where we have a "For the last x seconds/minutes/hours/days" option. This would help eliminate false-positive cluster health alerts.
EDIT: The rule should alert if it's been more than X time since there was a healthy status reported, not just report the current status being unhealthy.

Eric

@DaveCTurner DaveCTurner transferred this issue from elastic/elasticsearch Nov 21, 2022
@DaveCTurner

Stack Monitoring is part of Kibana so I've moved this over to the Kibana repo.

@botelastic botelastic bot added the needs-team Issues missing a team label label Nov 21, 2022
@ppisljar ppisljar added Team:Infra Monitoring UI - DEPRECATED DEPRECATED - Label for the Infra Monitoring UI team. Use Team:obs-ux-infra_services and removed needs-team Issues missing a team label labels Nov 21, 2022
@elasticmachine
Contributor

Pinging @elastic/infra-monitoring-ui (Team:Infra Monitoring UI)

@emma-raffenne emma-raffenne added the Team: Actionable Observability - DEPRECATED For Observability Alerting and SLOs use "Team:obs-ux-management", for AIops "Team:obs-knowledge" label Nov 23, 2022
@elasticmachine
Contributor

Pinging @elastic/actionable-observability (Team: Actionable Observability)

@smith smith removed the Team:Infra Monitoring UI - DEPRECATED DEPRECATED - Label for the Infra Monitoring UI team. Use Team:obs-ux-infra_services label Nov 28, 2022
@simianhacker simianhacker self-assigned this Dec 14, 2022
simianhacker added a commit that referenced this issue Dec 15, 2022
…#147565)

## Summary

This PR fixes #145843 by adding the ability to configure `duration` for
the Stack Monitoring "Legacy Rules", along with a set of default rule
parameters and custom validation, which can be configured per rule.
There are three new attributes added to the `LEGACY_RULE_DETAILS` object:

- `defaults` – The default values for parameters (so we can set the
default duration to `2m` for Cluster Health)
- `expressionConfig` – Configuration for turning on/off UI elements
(like duration)
- `validate` – A custom validate function (so we can ensure `duration`
is provided)

This also gives us control over which of the legacy rules get
`duration` and which ones we want to keep "as is". It also makes room for
adding additional UI features in the future.

<img width="618" alt="image"
src="https://user-images.githubusercontent.com/41702/207685736-d8dc3023-66d0-4e40-a564-830f290ec1e1.png">

### Checklist

Delete any items that are not applicable to this PR.

- [X] Any text added follows [EUI's writing
guidelines](https://elastic.github.io/eui/#/guidelines/writing), uses
sentence case text and includes [i18n
support](https://github.com/elastic/kibana/blob/main/packages/kbn-i18n/README.md)
- [X] [Unit or functional
tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html)
were updated or added to match the most common scenarios
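
As a rough illustration of the three attributes named in the PR summary above, a per-rule details object could be shaped as in the sketch below. Everything here is hypothetical: the interface names, the `showDuration` flag, the rule keys, the duration pattern, and the `resolveParams` helper are assumptions made for the example and are not taken from the Kibana source. Only the attribute names `defaults`, `expressionConfig`, and `validate`, and the `2m` default for Cluster Health, come from the summary.

```typescript
// Hypothetical shapes for illustration only; not copied from the Kibana source.
interface RuleParams {
  duration?: string; // e.g. "2m"
}

interface ValidationResult {
  errors: { duration: string[] };
}

interface LegacyRuleDetail {
  label: string;
  defaults?: RuleParams; // default parameter values
  expressionConfig?: { showDuration: boolean }; // assumed UI toggle name
  validate?: (params: RuleParams) => ValidationResult; // custom validation
}

// Assumed duration format: an integer followed by s, m, h, or d.
function validateDuration(params: RuleParams): ValidationResult {
  const errors: { duration: string[] } = { duration: [] };
  if (!params.duration || !/^\d+[smhd]$/.test(params.duration)) {
    errors.duration.push("A valid duration is required.");
  }
  return { errors };
}

// Hypothetical rule keys; one rule opts in to duration, the other stays "as is".
const EXAMPLE_RULE_DETAILS: Record<string, LegacyRuleDetail> = {
  cluster_health: {
    label: "Cluster health",
    defaults: { duration: "2m" },
    expressionConfig: { showDuration: true },
    validate: validateDuration,
  },
  license_expiration: {
    label: "License expiration",
  },
};

// Merge per-rule defaults with whatever the user supplied.
function resolveParams(ruleId: string, userParams: RuleParams): RuleParams {
  const detail = EXAMPLE_RULE_DETAILS[ruleId];
  return { ...(detail?.defaults ?? {}), ...userParams };
}
```

With this shape, a rule without `defaults` or `expressionConfig` simply keeps its previous behavior, which matches the stated goal of choosing per rule whether `duration` applies.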
@MakoWish
Author

MakoWish commented Mar 31, 2023

We just upgraded our DEV cluster to 8.7.0 yesterday, and I am not seeing this. Am I overlooking it somewhere? I can see the new "In the last" option, but I am not sure how that solves the issue. What I was hoping to see is a "for the last" type of option similar to a Metrics monitoring rule.

With a Metrics monitoring rule, you can essentially say, "Alert me if the CPU is over 90% for 5 minutes". With a rule like that, the CPU could spike over 90%, but as long as it calms down before that 5 minute mark, an alert action would not trigger. I was hoping to have that same functionality in the Cluster Health rule to be something along the lines of "If cluster health is not green for 5 minutes". That way, any time ILM rolls an index over, and there is that brief moment the replica is being created ("missing replica shard"), we don't get an alert for what is expected behavior.

Eric
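
The difference between an "in the last" check and the requested "for the last" check can be sketched as two small predicates over a series of health samples. This is a minimal illustration, not the rule's actual implementation: the `HealthSample` type, the function names, and the choice not to fire when the window contains no data are all assumptions made for the example.

```typescript
type HealthStatus = "green" | "yellow" | "red";

interface HealthSample {
  timestampMs: number;
  status: HealthStatus;
}

const MINUTE_MS = 60 * 1000;

// Samples that fall inside the half-open window (now - window, now].
function samplesInWindow(samples: HealthSample[], nowMs: number, windowMs: number): HealthSample[] {
  return samples.filter((s) => s.timestampMs > nowMs - windowMs && s.timestampMs <= nowMs);
}

// "In the last X": fires if ANY sample in the window is not green.
function firesInTheLast(samples: HealthSample[], nowMs: number, windowMs: number): boolean {
  return samplesInWindow(samples, nowMs, windowMs).some((s) => s.status !== "green");
}

// "For the last X": fires only if there is data and NO sample in the window is green.
// Assumption: an empty window does not fire.
function firesForTheLast(samples: HealthSample[], nowMs: number, windowMs: number): boolean {
  const inWindow = samplesInWindow(samples, nowMs, windowMs);
  return inWindow.length > 0 && inWindow.every((s) => s.status !== "green");
}

// One sample per minute: a brief yellow blip, as during an ILM rollover.
const rolloverBlip: HealthSample[] = (["green", "green", "yellow", "green", "green"] as HealthStatus[]).map(
  (status, i) => ({ timestampMs: i * MINUTE_MS, status })
);

// One sample per minute: yellow for the whole period.
const sustainedYellow: HealthSample[] = [0, 1, 2, 3, 4, 5].map((i) => ({
  timestampMs: i * MINUTE_MS,
  status: "yellow" as HealthStatus,
}));
```

Evaluated at the yellow sample of `rolloverBlip` with a five-minute window, `firesInTheLast` returns true while `firesForTheLast` returns false, because green samples are still inside the window. For `sustainedYellow`, `firesForTheLast` returns true once the window holds only yellow samples. One limitation of this sketch is that it does not check whether the samples cover the full window, so a cluster that has only just started reporting could fire early.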

@MakoWish
Author

Checking back on this one.

@MakoWish
Author

Still hoping for more insight on this. We are now on 8.8.1 as of yesterday, and I still don't see this option. Please reopen this.

@miltonhultgren
Contributor

@MakoWish Do I understand correctly that you're looking to be able to express a rule like "alert me if it's been more than 1 hour since the last healthy report"?
Meaning, we would query the data looking back a certain duration, and if there are no healthy reports in that time range, we fire the alert?

That is indeed different from the changes @simianhacker made in the past for this, which perhaps got lost while adding more flexibility for all of the rules.

We can re-open this issue but I cannot provide an estimate on when/if this work will be picked up.
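
The "look back a certain duration and fire if there are no healthy reports" idea could be expressed as a single search request plus a decision function, roughly as below. This is only a sketch under stated assumptions: the field names `cluster_state.status` and `timestamp` are assumed defaults that may not match a given monitoring index, and `buildHealthQuery` and `shouldFire` are hypothetical helpers, not part of Kibana. The request body uses a `range` query with date math and a `filter` aggregation.

```typescript
interface HealthQuery {
  size: number;
  track_total_hits: boolean;
  query: { range: Record<string, { gte: string }> };
  aggs: { healthy: { filter: { term: Record<string, string> } } };
}

// Build a search body that counts all reports in the lookback window
// (via the total hit count) and the green ones (via a filter aggregation).
// The field names are assumptions; adjust them to the actual index mapping.
function buildHealthQuery(
  lookback: string, // Elasticsearch date-math unit, e.g. "5m" or "1h"
  statusField: string = "cluster_state.status",
  timestampField: string = "timestamp"
): HealthQuery {
  return {
    size: 0,
    track_total_hits: true,
    query: { range: { [timestampField]: { gte: `now-${lookback}` } } },
    aggs: { healthy: { filter: { term: { [statusField]: "green" } } } },
  };
}

// Fire only when reports exist in the window and none of them is green.
// Assumption: missing data is handled elsewhere and does not fire here.
function shouldFire(totalReports: number, healthyReports: number): boolean {
  return totalReports > 0 && healthyReports === 0;
}
```

The caller would read the total hit count and the `healthy` aggregation's document count from the search response and pass both to `shouldFire`. Requiring at least one report in the window keeps a gap in monitoring data from being reported as an unhealthy cluster.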

@MakoWish
Author

MakoWish commented Jun 15, 2023

Hi @miltonhultgren,

Basically, as the rule stands now, any time an index rolls over for ILM, the cluster momentarily goes yellow until the replica shards are created. This sends out an alert that the cluster is not healthy, but within a minute or so we get an alert that the cluster has recovered. If we could implement an option similar to the Metric threshold rule, such as "If the cluster health is not green for more than 5 minutes, trigger the rule", that would avoid the false positives created during ILM rollovers.

[Image: rule_for_last_x]

@maryam-saeidi maryam-saeidi removed the Team: Actionable Observability - DEPRECATED For Observability Alerting and SLOs use "Team:obs-ux-management", for AIops "Team:obs-knowledge" label Jun 21, 2023
@botelastic botelastic bot added the needs-team Issues missing a team label label Jun 21, 2023
@mbondyra mbondyra added the Team:Infra Monitoring UI - DEPRECATED DEPRECATED - Label for the Infra Monitoring UI team. Use Team:obs-ux-infra_services label Jun 22, 2023
@botelastic botelastic bot removed the needs-team Issues missing a team label label Jun 22, 2023
@smith smith added v8.10.0 and removed v8.7.0 labels Jun 27, 2023
@smith
Contributor

smith commented Jun 27, 2023

Moving this to the backlog. It's still prioritized but we have a few support issues that are taking priority.

@agjmills

👍 for this feature

@stefnestor
Contributor

Cross-linking a similar GitHub issue: #113445

@smith smith added Team:Monitoring Stack Monitoring team and removed Team:Infra Monitoring UI - DEPRECATED DEPRECATED - Label for the Infra Monitoring UI team. Use Team:obs-ux-infra_services labels Nov 13, 2023
@MakoWish
Author

Any progress on this? I got 64 "Cluster health" alerts over the weekend, but they were all false positives for index rollovers as discussed in this issue. I would like to cut down on these false positives and only receive an alert if there is a legitimate health issue with the cluster. It has been more than a year now since opening this.

@MakoWish
Author

A quick update on this would be appreciated. We had 131 of these false-positive "Cluster health" alerts over the long Christmas weekend. If there truly were a cluster health issue, we would never know it, because this alert has been crying wolf for the 5+ years we've been using Elastic.

@MakoWish
Author

MakoWish commented Jan 2, 2024

212 false-positive alerts this past weekend.

@MakoWish
Author

MakoWish commented Jan 5, 2024

The same thing goes for the "Shard size" rule: when an index is being force-merged, it is expected to exceed the specified size threshold for a period of time while the merge runs. This is again just a false positive that could be avoided with a "for the last 'x'" option on the rules. For the "Shard size" case, I would like to say, "Alert me if the shard size is too large for more than 15 minutes." For the "Cluster health" rule, five minutes would be fine. Again, you can do this with "Metric Threshold" rules, so why not with the stack monitoring rules?

@vbohata

vbohata commented Jan 21, 2024

Agreed. We also get a large number of false positives, which once caused us to miss a real issue.

@ddoorn

ddoorn commented Jan 22, 2024

Also waiting for this to be fixed.

@MakoWish
Author

The most disappointing thing is that, as a platinum subscriber, I even brought this up with our CSM, and there is still silence from the Elastic side. 😡

@stefnestor
Contributor

Hi all, I just wanted to mention the sister issue's publicly documented workaround here. (Your comments still stand, but AFAICT this issue is not a case of users being blocked with no workaround; it is more of an annoyance that requires the team to flesh out an older Rule structure with later learnings.)

@smith smith removed the v8.10.0 label Jan 23, 2024
@agjmills

agjmills commented Feb 5, 2024

@smith it looks like this issue has been prioritised since 8.7.0, but then removed from each planned release.

Is there a plan to get this implemented in the next 3 months?

@smith
Contributor

smith commented Mar 8, 2024

@agjmills at this time we do not plan to prioritize this.

@sophiec20 sophiec20 added enhancement New value added to drive a business result and removed >enhancement labels Mar 27, 2024