
Add new cluster level disk usage alert on a specific data tier #110138

Open · 3 tasks
ravikesarwani opened this issue Aug 25, 2021 · 8 comments

Comments

@ravikesarwani (Contributor) commented Aug 25, 2021

Currently we have a disk usage alert at the node level that fires when the disk on a specific node is running low (80% used). This by itself is insufficient, as Elasticsearch manages disk within a data tier by moving shards around automatically. Low disk on a specific node may not be an issue at all if other nodes in the same data tier have spare capacity and Elasticsearch can move shards to those nodes, requiring no user intervention.

This new OOTB rule tracks disk usage at the cluster level on a specific data tier (Hot/Warm/Cold/Frozen) and alerts when it reaches a certain level. We should create 4 separate rules (so users have the flexibility to manage them separately) for the Hot, Warm, Cold, and Frozen data tiers.

  • Hot, Warm, and Cold will alert by default when the combined disk usage across all the nodes for that tier exceeds 80%, on average, over the last 5 minutes, with a re-notify interval of 1 day (see the sketch after this list).
  • Frozen will alert by default when the combined disk usage across all the nodes for that tier exceeds 95%, on average, over the last 5 minutes, with a re-notify interval of 1 day.

  • As we deliver this new rule we also need to modify the existing node-based disk usage alert to fire (by default) when disk usage on a node exceeds 90%. This matches the high watermark configured in Elasticsearch, at which it attempts to relocate shards away from a node. The node-level rule supplements the cluster-level rule, and together they handle the different disk usage scenarios much more gracefully, alerting only when really needed.
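
To make the proposal concrete, here is a minimal sketch of the tier-level check, not the actual Kibana rule executor: it reads point-in-time node stats rather than the 5-minute average over monitoring data described above, and it assumes the @elastic/elasticsearch v8 JavaScript client, a local node URL, and that node stats responses include each node's roles.

```ts
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' });

// Proposed default thresholds from this issue (fraction of disk used).
const TIER_THRESHOLDS: Record<string, number> = {
  data_hot: 0.8,
  data_warm: 0.8,
  data_cold: 0.8,
  data_frozen: 0.95,
};

async function checkTierDiskUsage(): Promise<void> {
  const stats = await client.nodes.stats({ metric: 'fs' });

  // Sum total and available filesystem bytes per data-tier role.
  const byTier: Record<string, { total: number; available: number }> = {};
  for (const node of Object.values<any>(stats.nodes ?? {})) {
    for (const role of node.roles ?? []) {
      if (!(role in TIER_THRESHOLDS)) continue;
      const acc = (byTier[role] ??= { total: 0, available: 0 });
      acc.total += node.fs?.total?.total_in_bytes ?? 0;
      acc.available += node.fs?.total?.available_in_bytes ?? 0;
    }
  }

  // Flag any tier whose combined used fraction crosses its threshold.
  for (const [tier, { total, available }] of Object.entries(byTier)) {
    const used = total > 0 ? (total - available) / total : 0;
    if (used > TIER_THRESHOLDS[tier]) {
      console.warn(`${tier}: ${(used * 100).toFixed(1)}% of combined disk used`);
    }
  }
}

checkTierDiskUsage().catch(console.error);
```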

Docs: https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-cluster.html#disk-based-shard-allocation
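
For reference, a hedged example of pinning the node-level high watermark that the updated rule would mirror, reusing `client` from the sketch above. Per the docs linked above, 90% is also the Elasticsearch default, so this only makes the value explicit:

```ts
// Pin the high disk watermark that triggers shard relocation away from a node.
await client.cluster.putSettings({
  persistent: {
    'cluster.routing.allocation.disk.watermark.high': '90%',
  },
});
```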

Stretch goal:

  • Provide an optional way to configure an absolute value instead of a used-disk percentage for the alert.
@botelastic bot added the needs-team (issues missing a team) label Aug 25, 2021
@ravikesarwani (Contributor, Author) commented:

cc @DaveCTurner: let me know if this looks okay from the ES side, or if any adjustments should be made.

@ravikesarwani added the Feature:Stack Monitoring, Team:Infra Monitoring UI - DEPRECATED, and Team:Monitoring labels Aug 25, 2021
@elasticmachine (Contributor) commented:

Pinging @elastic/stack-monitoring (Team:Monitoring)

@elasticmachine (Contributor) commented:

Pinging @elastic/logs-metrics-ui (Team:logs-metrics-ui)

@botelastic bot removed the needs-team (issues missing a team) label Aug 25, 2021
@hendry-lim commented Aug 26, 2021

Would it also be possible to allow absolute values instead of only percentages? We are able to set absolute disk watermark values instead of percentages in Elasticsearch.
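
For context, watermarks can indeed be byte values in Elasticsearch: a byte value expresses minimum free space rather than a used percentage, and byte and percentage forms cannot be mixed across the watermark settings. A hedged example, assuming the same @elastic/elasticsearch v8 client and local node URL as the sketches in the description:

```ts
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' });

// Byte values mean "act when free space drops below this", unlike
// percentages, which mean "act when used space rises above this".
await client.cluster.putSettings({
  persistent: {
    'cluster.routing.allocation.disk.watermark.low': '100gb',
    'cluster.routing.allocation.disk.watermark.high': '50gb',
    'cluster.routing.allocation.disk.watermark.flood_stage': '10gb',
  },
});
```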

@DaveCTurner commented:
Sounds good to me, thanks @ravikesarwani 👍

@jasonrhodes (Member) commented:

Related: #105659

@jasonrhodes (Member) commented:

@ravikesarwani / @DaveCTurner can we put links to the ES docs that specify these values in the description here?

@smith removed the Team:Infra Monitoring UI - DEPRECATED label Nov 13, 2023