Enhance "Cluster health" stack monitoring rule to allow user configuration for Yellow/Red/Both #113445

Open
ravikesarwani opened this issue Sep 29, 2021 · 3 comments

Comments

@ravikesarwani
Contributor

ravikesarwani commented Sep 29, 2021

As part of the "Cluster health" rule, allow users to configure whether they want to receive alerts for Yellow, Red, or Both yellow and red.
The default configuration value for the rule will stay as "Both yellow and red".

Combined with our changes in 7.15 that allow multiple rules of the same type, users can now configure different actions for Yellow (say, email) and Red (say, PagerDuty) if they want.

Currently the Cluster health rule fires when the cluster health status changes from green to yellow OR red.
There is no way for users to configure the rule to alert only when the cluster status changes to "red".
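
A minimal sketch of the proposed option, not Kibana's implementation. The parameter name alert_on and its values "yellow", "red", and "both" are hypothetical and chosen only for illustration:

```python
from typing import Literal

# Hypothetical setting; not an actual Kibana rule parameter.
AlertOn = Literal["yellow", "red", "both"]


def should_alert(status: str, alert_on: AlertOn = "both") -> bool:
    """Return True when a cluster health status should fire under the chosen setting."""
    status = status.lower()
    if status == "green":
        # Green never alerts, whatever the setting.
        return False
    if alert_on == "both":
        # Default: keep today's behavior of alerting on yellow and red.
        return status in ("yellow", "red")
    # Otherwise alert only on the single configured color.
    return status == alert_on
```

With two rules of the same type, one could be set to "yellow" with an email action and the other to "red" with a paging action.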

Yellow status can result from temporary processing in Elasticsearch.
Any action that creates a new index (rollover, shrink, mounting an index, close-and-reopen (through forcemerge w/codec change)) can cause the cluster to go briefly yellow.

Stretch goal
Besides adding the extra configuration (for Yellow, Red, or Both), we should look at the possibility of "look at last X minutes of data and alert only when we see all of them to be the same status" rather than just relying on the last document status.
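
One way the stretch goal could work, as a rough sketch: collect the status of every monitoring document in the last X minutes and alert only if they all agree on a non-green status. The function name and the idea of passing the window in as a plain list are assumptions for illustration:

```python
from typing import Optional, Sequence


def sustained_status(window: Sequence[str]) -> Optional[str]:
    """Return the shared non-green status if every sample in the window agrees, else None."""
    if not window:
        # No data in the window: nothing to alert on.
        return None
    distinct = {status.lower() for status in window}
    if len(distinct) != 1:
        # The status changed inside the window, so treat it as a blip.
        return None
    (status,) = distinct
    return None if status == "green" else status
```

A window such as ["yellow", "red", "green"] returns None, so a brief red blip would not alert, while ["red", "red", "red"] returns "red".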

@stefnestor
Contributor

stefnestor commented Nov 2, 2023

I'll also note for the public record that ILM Searchable Snapshots coming up on the Frozen tier can blip the cluster to status:red with no action required from Dev on-calls. For example, Elasticsearch logs the following as each index hits phase/action/step frozen/searchablesnapshots/mount-snapshot (or maybe wait-for-index-color; sorry, these happen so close together that I can't fully tell):

Cluster health status changed from [YELLOW] to [RED] (reason: [snapshot shard size updated]).
Cluster health status changed from [RED] to [GREEN] (reason: [shards started [[partial-restored-my_index-2023.10.31-000001][0]]]).

This would be resolved by the stretch goal in the description:

we should look at the possibility of "look at last X minutes of data and alert only when we see all of them to be the same status" rather than just relying on the last document status.

Which kinda overlaps with #145843

@VimCommando

On a related note, all the built-in rules should be using the _health_report API indicators and not the _cluster/health indicators.

The _health_report API understands when shards are unassigned due to expected cluster actions, like creating new indices or restarting nodes:
https://github.com/elastic/elasticsearch/blob/3636d3d6ac492dda2dc2400e104b69319b753daa/server/src/main/java/org/elasticsearch/cluster/routing/allocation/ShardsAvailabilityHealthIndicatorService.java#L408-L410

I don't know for sure if it tracks ILM transitions yet.
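
A rough sketch of how a rule might read a _health_report response. It assumes the parsed JSON carries an "indicators" object keyed by indicator name, each entry with a "status" field; check this against the API docs for your version. The sample response below is hand-written for illustration, not real output:

```python
from typing import Any, Dict, Mapping


def non_green_indicators(report: Mapping[str, Any]) -> Dict[str, str]:
    """Map each non-green indicator name to its reported status."""
    indicators = report.get("indicators", {})
    return {
        name: body.get("status", "unknown")
        for name, body in indicators.items()
        if body.get("status") != "green"
    }


# Hand-written sample in the assumed response shape.
sample_report = {
    "status": "yellow",
    "indicators": {
        "shards_availability": {"status": "green"},
        "disk": {"status": "yellow"},
    },
}
unhealthy = non_green_indicators(sample_report)
```

A rule built this way could alert per indicator instead of on the single cluster-wide color.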

@stefnestor
Contributor

Publicly documenting a workaround/alternative for lower stack versions via manual Rule setup.

Example is taken on Elastic Cloud against version v8.9.2 for Logs&Metrics data:

  1. Create Data View for .ds-.monitoring-es*
  2. Create EQL Rule with condition is above count 20 for last 5mins, using Lucene filter cluster_state.status:red AND event.dataset:elasticsearch.cluster.stats. (Since Logs&Metrics polls every 10s, a 5min window holds 30 samples, and 20 is about 66% of those. Both the 66% threshold and the 5min window are arbitrary values I chose for the example.)
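
The threshold arithmetic in step 2 can be sketched as follows, for anyone picking a different window or percentage. The function name and the rounding choice are assumptions; the inputs are the example's 5 minutes, 10 seconds, and 66%:

```python
def count_threshold(window_seconds: int, poll_interval_seconds: int, fraction: float) -> int:
    """Return the document count matching a fraction of the samples in the window."""
    # Number of monitoring documents expected in the window.
    samples = window_seconds // poll_interval_seconds
    # Round to the nearest whole sample.
    return round(samples * fraction)


# 5 minutes at one document every 10 seconds is 30 samples; 66% of 30 rounds to 20.
threshold = count_threshold(window_seconds=300, poll_interval_seconds=10, fraction=0.66)
```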

@smith removed the Team:Infra Monitoring UI - DEPRECATED label (label description: "DEPRECATED - Label for the Infra Monitoring UI team. Use Team:obs-ux-infra_services") Nov 13, 2023