Enhance "Cluster health" stack monitoring rule to allow user configuration for Yellow/Red/Both #113445

ravikesarwani · 2021-09-29T16:32:08Z

As part of the "Cluster health" rule, allow users to configure whether they want to receive alerts for Yellow, Red, or Both yellow and red.
The default configuration value for the rule will stay as "Both yellow and red".

Combined with our changes in 7.15 that allow multiple rules of the same type, users can then configure different actions for Yellow (say, email) and Red (say, PagerDuty) if they want.

Currently the Cluster health rule fires when the cluster health status changes from green to yellow or red.
There is no way for users to configure the rule to alert only when the cluster status changes to "red".
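
As a rough illustration of the request, the option could behave like the filter below. This is only a sketch: `AlertOn` and `should_fire` are hypothetical names made up for this example, not existing Kibana settings or functions.

```python
from enum import Enum


class AlertOn(Enum):
    """Hypothetical rule parameter; not an existing Kibana setting."""
    YELLOW = "yellow"
    RED = "red"
    BOTH = "both"  # proposed default, matching the current behavior


def should_fire(cluster_status: str, alert_on: AlertOn = AlertOn.BOTH) -> bool:
    """Return True if the rule should fire for the given cluster status."""
    status = cluster_status.lower()
    if status == "green":
        return False
    if alert_on is AlertOn.BOTH:
        return status in ("yellow", "red")
    # YELLOW or RED: fire only on an exact match with the configured status
    return status == alert_on.value
```

With two rules of the same type, one could be set to `AlertOn.YELLOW` with an email action and the other to `AlertOn.RED` with a PagerDuty action.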

Yellow status can happen based on temporary processing in Elasticsearch.
Any action that creates a new index (rollover, shrink, mounting an index, close-and-reopen (through forcemerge w/codec change)) can cause the cluster to go briefly yellow.

Stretch goal
Besides adding the extra configuration (for Yellow, Red, or Both), we should look at the possibility of "look at last X minutes of data and alert only when we see all of them to be the same status" rather than just relying on the last document status.
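
One way to picture the stretch goal is a check over a window of status samples instead of the latest document only. The sketch below assumes the samples are already available as `(timestamp, status)` pairs; the function name `sustained_status` and that input shape are assumptions for illustration, not how the rule is implemented.

```python
from datetime import datetime, timedelta
from typing import Optional, Sequence, Tuple


def sustained_status(
    samples: Sequence[Tuple[datetime, str]],
    now: datetime,
    window: timedelta,
) -> Optional[str]:
    """Return the status shared by every sample in the window, else None."""
    recent = [status for ts, status in samples if now - window <= ts <= now]
    if not recent:
        return None  # no data in the window, so nothing to alert on
    first = recent[0]
    return first if all(status == first for status in recent) else None
```

A rule built this way would alert only when the returned value is "yellow" or "red", so a status that flips and recovers within the window would not fire.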


stefnestor · 2023-11-02T19:30:25Z

I'll also note for the public record that ILM searchable snapshots coming up on the frozen tier can briefly flip the cluster to status:red with no action required from dev on-calls. For example, Elasticsearch logs the following per index as it hits phase/action/step frozen/searchablesnapshots/mount-snapshot (or maybe wait-for-index-color; sorry, these happen so close together that I can't fully tell):

Cluster health status changed from [YELLOW] to [RED] (reason: [snapshot shard size updated]).
Cluster health status changed from [RED] to [GREEN] (reason: [shards started [[partial-restored-my_index-2023.10.31-000001][0]]]).

This would be addressed by the stretch goal in the description:

we should look at the possibility of "look at last X minutes of data and alert only when we see all of them to be the same status" rather than just relying on the last document status.

This partly overlaps with #145843.

VimCommando · 2023-11-02T20:02:04Z

On a related note, all the built-in rules should be using the _health_report API indicators and not the _cluster/health indicators.

The _health_report API understands when shards are unassigned due to expected cluster actions, like new indices or restarting nodes:
https://github.com/elastic/elasticsearch/blob/3636d3d6ac492dda2dc2400e104b69319b753daa/server/src/main/java/org/elasticsearch/cluster/routing/allocation/ShardsAvailabilityHealthIndicatorService.java#L408-L410

I don't know for sure if it tracks ILM transitions yet.
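
For illustration, a rule reading the health report could look at the `shards_availability` indicator rather than the overall cluster color. The sketch below only parses an already-fetched response body; the helper name and the trimmed sample payload are assumptions for this example, not output captured from a real cluster.

```python
from typing import Any, Dict


def shards_availability_status(report: Dict[str, Any]) -> str:
    """Read the shards_availability indicator status from a _health_report body."""
    indicators = report.get("indicators", {})
    return indicators.get("shards_availability", {}).get("status", "unknown")


# Illustrative, heavily trimmed payload (assumed shape, not real cluster output).
sample_report = {
    "status": "green",
    "indicators": {
        "shards_availability": {
            "status": "green",
            "symptom": "This cluster has all shards available.",
        }
    },
}

print(shards_availability_status(sample_report))
```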

stefnestor · 2023-11-02T20:13:45Z

Publicly documenting a workaround/alternative for lower stack versions via manual Rule setup.

The example was taken on Elastic Cloud against version v8.9.2 for Logs&Metrics data:

Create a Data View for .ds-.monitoring-es*
Create an EQL Rule that fires when the count is above 20 for the last 5mins for the Lucene filter cluster_state.status:red AND event.dataset:elasticsearch.cluster.stats. (Since Logs&Metrics polls every 10s, a 5min window holds 5mins/10s = 30 documents, and 20 is roughly 66% of that. Both the 66% threshold and the 5min window are arbitrary choices I made for this example.)
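
The threshold arithmetic above can be reproduced for other windows or polling intervals with a small helper. This is a sketch; `min_red_count` is a made-up name and the rounding choice (round up) is an assumption.

```python
import math


def min_red_count(window_minutes: int, poll_interval_seconds: int, fraction: float) -> int:
    """Number of red documents that make up `fraction` of the polls in the window."""
    polls_in_window = (window_minutes * 60) // poll_interval_seconds
    return math.ceil(polls_in_window * fraction)


# 5-minute window, 10-second polling, 66% of polls reporting red
print(min_red_count(5, 10, 0.66))
```

With the values from the example (5mins, 10s, 66%) this gives 30 polls in the window and a count threshold of 20.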

ravikesarwani added the Team:Monitoring, Team:Infra Monitoring UI - DEPRECATED, Feature:Stack Monitoring, and SM alerting improvements labels Sep 29, 2021

stefnestor mentioned this issue Jul 13, 2023

[Feature request] Separate red and yellow cluster health alerts or make it configurable #132392

Closed

stefnestor mentioned this issue Nov 2, 2023

Add "For the last x" Option for Cluster Health Stack Monitoring Rule #145843

Open

smith removed the Team:Infra Monitoring UI - DEPRECATED label Nov 13, 2023

bmorelli25 mentioned this issue Jan 26, 2024

Add Known Issue elastic/observability-docs#3576

Closed
