Add new cluster level disk usage alert on a specific data tier #110138
Comments
cc: @DaveCTurner Let me know if this looks okay from the ES side, or whether any adjustments should be made.
Pinging @elastic/stack-monitoring (Team:Monitoring)
Pinging @elastic/logs-metrics-ui (Team:logs-metrics-ui)
Would it also be possible to allow absolute values instead of only percentages? Elasticsearch lets us set the disk watermarks as absolute values instead of percentages.
Sounds good to me, thanks @ravikesarwani 👍
Related: #105659
@ravikesarwani / @DaveCTurner can we put links to the ES docs that specify these values in the description here?
Currently we have a disk usage alert at the node level that fires when disk usage on a specific node exceeds 80%. This by itself is insufficient, because Elasticsearch manages disk space within a data tier by moving shards around automatically. High disk usage on a specific node may not be an issue at all if other nodes in the same data tier have spare space and Elasticsearch can move shards to those nodes without any user intervention.
This new out-of-the-box (OOTB) rule tracks disk usage at the cluster level for a specific data tier (Hot/Warm/Cold/Frozen) and alerts when it reaches a certain level. We should create 4 separate rules, one each for the Hot, Warm, Cold and Frozen data tiers, so users have the flexibility to manage them separately.
Hot, Warm and Cold will alert by default when the combined disk usage across all the nodes in that tier exceeds 80%, on average, over the last 5 minutes, with a re-notify interval of 1 day.
Frozen will alert by default when the combined disk usage across all the nodes in that tier exceeds 95%, on average, over the last 5 minutes, with a re-notify interval of 1 day.
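To make the proposed defaults concrete, here is a rough TypeScript sketch of how the per-tier check could be evaluated. It is only an illustration under assumptions: the names `TIER_DEFAULTS`, `NodeDiskSample`, `tierUsagePercent` and `shouldAlert` are hypothetical and not part of Kibana's rule API, and "combined" is interpreted here as total used bytes divided by total capacity across the tier's nodes, averaged over the snapshots collected in the window.

```typescript
// Hypothetical sketch; none of these names come from Kibana or Elasticsearch.
type DataTier = "hot" | "warm" | "cold" | "frozen";

interface TierRuleDefaults {
  thresholdPercent: number; // alert when the windowed average exceeds this
  windowMinutes: number;    // look-back window for the average
  renotifyHours: number;    // re-notify interval
}

// Defaults as proposed in this issue.
const TIER_DEFAULTS: Record<DataTier, TierRuleDefaults> = {
  hot: { thresholdPercent: 80, windowMinutes: 5, renotifyHours: 24 },
  warm: { thresholdPercent: 80, windowMinutes: 5, renotifyHours: 24 },
  cold: { thresholdPercent: 80, windowMinutes: 5, renotifyHours: 24 },
  frozen: { thresholdPercent: 95, windowMinutes: 5, renotifyHours: 24 },
};

// One disk reading for one node (assumed shape).
interface NodeDiskSample {
  tier: DataTier;
  totalBytes: number;
  usedBytes: number;
}

// Combined usage for a tier in one snapshot: used bytes over total bytes
// summed across every node in that tier. Undefined if the tier has no nodes.
function tierUsagePercent(
  snapshot: NodeDiskSample[],
  tier: DataTier
): number | undefined {
  let total = 0;
  let used = 0;
  for (const node of snapshot) {
    if (node.tier === tier) {
      total += node.totalBytes;
      used += node.usedBytes;
    }
  }
  return total > 0 ? (used / total) * 100 : undefined;
}

// `snapshots` is assumed to hold the readings taken within the window.
// Alerts only when the average combined usage exceeds the tier threshold.
function shouldAlert(snapshots: NodeDiskSample[][], tier: DataTier): boolean {
  const values = snapshots
    .map((s) => tierUsagePercent(s, tier))
    .filter((v): v is number => v !== undefined);
  if (values.length === 0) return false;
  const average = values.reduce((a, b) => a + b, 0) / values.length;
  return average > TIER_DEFAULTS[tier].thresholdPercent;
}
```

With this interpretation, a hot tier with one node at 95% and another equally sized node at 40% averages 67.5% combined and does not alert, which is the case the node-level rule alone would flag unnecessarily.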
As we deliver this new rule we also need to modify our existing node-based disk usage alert to fire (by default) when disk usage on a node exceeds 90%. This matches the default high watermark in Elasticsearch, above which it attempts to relocate shards away from a node. The node-level rule supplements the cluster-level rule; together they handle different disk usage scenarios much more gracefully and alert only when really needed.
Docs: https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-cluster.html#disk-based-shard-allocation
Stretch goal: