Add "For the last x" Option for Cluster Health Stack Monitoring Rule #145843
Comments
Stack Monitoring is part of Kibana so I've moved this over to the Kibana repo. |
Pinging @elastic/infra-monitoring-ui (Team:Infra Monitoring UI) |
Pinging @elastic/actionable-observability (Team: Actionable Observability) |
…#147565)

## Summary

This PR fixes #145843 by adding the ability to configure `duration` for the Stack Monitoring "Legacy Rules", along with a set of default rule parameters and custom validation, each of which can be configured per rule. Three new attributes are added to the `LEGACY_RULE_DETAILS` object:

- `defaults`: The default values for parameters (so we can set the default duration to `2m` for Cluster Health)
- `expressionConfig`: Configuration for turning UI elements (like duration) on or off
- `validate`: A custom validate function (so we can ensure `duration` is provided)

This also gives us control over which of the legacy rules get duration and which ones we want to keep "as is". It also makes room for adding additional UI features in the future.

<img width="618" alt="image" src="https://user-images.githubusercontent.com/41702/207685736-d8dc3023-66d0-4e40-a564-830f290ec1e1.png">

### Checklist

Delete any items that are not applicable to this PR.

- [X] Any text added follows [EUI's writing guidelines](https://elastic.github.io/eui/#/guidelines/writing), uses sentence case text and includes [i18n support](https://github.com/elastic/kibana/blob/main/packages/kbn-i18n/README.md)
- [X] [Unit or functional tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html) were updated or added to match the most common scenarios
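For readers who have not looked at the PR itself, here is a rough sketch of how the three attributes named in the summary above could fit together. This is not the actual Kibana code: the `LegacyRuleDetails` and `RuleParams` interfaces, the `showDuration` flag, the duration pattern, and the error messages are all assumptions made for illustration. Only the attribute names (`defaults`, `expressionConfig`, `validate`) and the `2m` default for Cluster Health come from the summary.

```typescript
// Hypothetical shapes; the real LEGACY_RULE_DETAILS typing in Kibana may differ.
interface RuleParams {
  duration?: string; // e.g. "2m", "1h"
}

interface LegacyRuleDetails {
  label: string;
  // Default parameter values applied when the rule is created.
  defaults?: RuleParams;
  // Switches for optional UI elements (field name is an assumption).
  expressionConfig?: { showDuration: boolean };
  // Returns a list of human-readable errors; empty means valid.
  validate?: (params: RuleParams) => string[];
}

// Assumed duration format: a positive integer followed by s, m, h, or d.
const DURATION_PATTERN = /^\d+[smhd]$/;

const clusterHealthDetails: LegacyRuleDetails = {
  label: "Cluster health",
  defaults: { duration: "2m" },
  expressionConfig: { showDuration: true },
  validate: (params) => {
    const errors: string[] = [];
    if (!params.duration) {
      errors.push("A duration is required.");
    } else if (!DURATION_PATTERN.test(params.duration)) {
      errors.push("Duration must look like 30s, 2m, 1h, or 1d.");
    }
    return errors;
  },
};
```

Keeping the defaults, UI switches, and validation on the per-rule details object is what lets each legacy rule opt in to duration independently, as the summary describes.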
We just upgraded our DEV cluster to 8.7.0 yesterday, and I am not seeing this. Am I overlooking it somewhere? I can see the new "In the last" option, but I am not sure how that solves the issue. What I was hoping to see is a "for the last" type of option similar to a Metrics monitoring rule. With a Metrics monitoring rule, you can essentially say, "Alert me if the CPU is over 90% for 5 minutes". With a rule like that, the CPU could spike over 90%, but as long as it calms down before that 5 minute mark, an alert action would not trigger. I was hoping to have that same functionality in the Cluster Health rule to be something along the lines of "If cluster health is not green for 5 minutes". That way, any time ILM rolls an index over, and there is that brief moment the replica is being created ("missing replica shard"), we don't get an alert for what is expected behavior. Eric |
Checking back on this one. |
Still hoping for more insight on this. We are now on 8.8.1 as of yesterday, and I still don't see this option. Please reopen this. |
@MakoWish Do I understand correctly that you're looking to be able to express a rule like "alert me, if it's been more than 1 hour since the last healthy report"? That is indeed different from the changes @simianhacker did in the past for this, which perhaps got lost with adding in more flexibility for all of the rules. We can re-open this issue but I cannot provide an estimate on when/if this work will be picked up. |
Hi @miltonhultgren, Basically, as the rule stands now, any time an index rolls over for ILM, the cluster momentarily goes yellow until the replica shards are created. This sends out an alert that the cluster is not healthy, but within just a minute or so later, we get an alert that the cluster has recovered. If we could implement an option similar to the Metric threshold rule like "If the cluster health is not green for more than 5 minutes, trigger the rule". That would avoid the false-positives created during ILM rollovers. |
Moving this to the backlog. It's still prioritized but we have a few support issues that are taking priority. |
👍 for this feature |
Cross-linking similar ballpark Github: #113445 |
Any progress on this? I got 64 "Cluster health" alerts over the weekend, but they were all false positives for index rollovers as discussed in this issue. I would like to cut down on these false positives and only receive an alert if there is a legitimate health issue with the cluster. It has been more than a year now since opening this. |
A quick update on this would be appreciated. We had 131 of these false-positive "Cluster health" alerts over the long Christmas weekend. If there truly were a cluster health issue, we would never know it, because this alert has been crying wolf for the 5+ years we've been using Elastic. |
212 false-positive alerts this past weekend. |
The same thing goes for the "Shard size" rule, because when an index is being force-merged, there is a period of time an index is expected to be over the specified size threshold while being merged. This is again just a false positive that can be ignored with a "for the last 'x'" option on the rules. For the "Shard size" case, I would like to say, "Alert me if the shard size is too large for more than 15 minutes." For the "Cluster health" rule, five minutes would be fine. Again, you can do this with "Metric Threshold" rules, so why not with the stack monitoring rules? |
Agreed. We also get a large number of false positives, which once caused us to miss a real issue. |
Also waiting for this to be fixed. |
The most disappointing thing is that as platinum subscribers, I even brought this up to our CSM, and still silence from the Elastic side. 😡 |
Hi all; I just wanted to mention the sister issue's publicly documented workaround here. (Your comments still stand, but AFAICT this issue is not a user-blocking problem with no workaround; it is an annoyance that requires the team to flesh out an older Rule structure with later learnings.) |
@smith it looks like this issue has been prioritised since 8.7.0, but subsequently removed from each planned release. Is there a plan to get this implemented in the next 3 months? |
@agjmills at this time we do not plan to prioritize this. |
Description
We leverage the pre-built Stack Monitoring Rules, and one that annoyingly blows up our inboxes is the "Cluster Health" alert. Any time an index is rolled-over by ILM, there is a brief period of time the cluster goes yellow until the replica is created. This triggers the rule to email us that the cluster health is yellow, and one minute later, we receive another alert that it has recovered (the rule currently checks every minute).
Since the creation of replicas only takes a few seconds, it would be nice to be able to modify this Stack Monitoring rule to be similar to the Metric Threshold rules, where we have a "For the last x seconds/minutes/hours/days" option. This would help to eliminate false-positive cluster health alerts.

EDIT: The rule should alert if it's been more than X time since there was a healthy status reported, not just report the current status being unhealthy.
Eric
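To make the requested behavior concrete, here is a minimal sketch of the "for the last x" check described above: alert only when more than X time has passed since the last healthy (green) report. It is not how the Stack Monitoring rule is implemented; the `HealthSample` type and the `unhealthyForTheLast` function are hypothetical names, and the sketch assumes the rule has access to recent status samples with timestamps.

```typescript
type Health = "green" | "yellow" | "red";

// Hypothetical sample shape: one cluster health report per rule check.
interface HealthSample {
  timestampMs: number;
  status: Health;
}

/**
 * Returns true when more than `forTheLastMs` milliseconds have passed since
 * the most recent "green" report. If no green report is known, the oldest
 * sample is used as the start of the unhealthy period.
 */
function unhealthyForTheLast(
  samples: HealthSample[],
  nowMs: number,
  forTheLastMs: number
): boolean {
  if (samples.length === 0) {
    return false; // nothing reported yet, so nothing to alert on
  }
  const greenTimes = samples
    .filter((s) => s.status === "green")
    .map((s) => s.timestampMs);
  const reference =
    greenTimes.length > 0
      ? Math.max(...greenTimes)
      : Math.min(...samples.map((s) => s.timestampMs));
  return nowMs - reference > forTheLastMs;
}

const MINUTE = 60_000;

// An ILM rollover: yellow for one check, then green again.
const rollover: HealthSample[] = [
  { timestampMs: 0, status: "green" },
  { timestampMs: 1 * MINUTE, status: "yellow" },
  { timestampMs: 2 * MINUTE, status: "green" },
];
const rolloverAlerts = unhealthyForTheLast(rollover, 2 * MINUTE, 5 * MINUTE); // false

// A sustained problem: yellow on every check for six minutes.
const sustained: HealthSample[] = [
  { timestampMs: 0, status: "green" },
  ...[1, 2, 3, 4, 5, 6].map((m) => ({
    timestampMs: m * MINUTE,
    status: "yellow" as Health,
  })),
];
const sustainedAlerts = unhealthyForTheLast(sustained, 6 * MINUTE, 5 * MINUTE); // true
```

With a five-minute window, the brief yellow state during a rollover never triggers an action, while a cluster that stays yellow past the window does.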