
Add new cluster level disk usage alert on a specific data tier #110138

Open · 3 tasks
ravikesarwani opened this issue Aug 25, 2021 · 8 comments

Comments

@ravikesarwani (Contributor) commented Aug 25, 2021

Currently we have a disk usage alert at the node level that fires when the disk on a specific node is running low (80% used). This by itself is insufficient, as Elasticsearch manages disk within a data tier by moving shards around automatically. Low disk on a specific node may not be an issue at all if other nodes in the same data tier have spare capacity and Elasticsearch can move shards to those nodes, requiring no user intervention.

This new OOTB rule tracks disk usage at the cluster level on a specific data tier (Hot/Warm/Cold/Frozen) and alerts when it reaches a certain level. We should create 4 separate rules (so users have the flexibility to manage them separately) for the Hot, Warm, Cold, and Frozen data tiers.

  • Hot, Warm, and Cold will alert by default when the combined disk usage across all the nodes for that tier exceeds 80%, on average, over the last 5 minutes, with a re-notify interval of 1 day (see the sketch after this list).
  • Frozen will alert by default when the combined disk usage across all the nodes for that tier exceeds 95%, on average, over the last 5 minutes, with a re-notify interval of 1 day.

  • As we deliver this new rule we also need to modify the existing node-based disk usage alert to fire (by default) when disk usage on a node exceeds 90%. This matches the high watermark configured in Elasticsearch, at which it attempts to relocate shards away from a node. The node-level rule supplements the cluster-level rule, and together they handle the different disk usage scenarios much more gracefully, alerting only when really needed.
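
To make the proposal concrete, here is a minimal sketch of the tier-level check, not the actual Kibana rule executor: it reads point-in-time node stats rather than the 5-minute average over monitoring data described above, and it assumes the @elastic/elasticsearch v8 JavaScript client, a local node URL, and that node stats responses include each node's roles.

```ts
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' });

// Proposed default thresholds from this issue (fraction of disk used).
const TIER_THRESHOLDS: Record<string, number> = {
  data_hot: 0.8,
  data_warm: 0.8,
  data_cold: 0.8,
  data_frozen: 0.95,
};

async function checkTierDiskUsage(): Promise<void> {
  const stats = await client.nodes.stats({ metric: 'fs' });

  // Sum total and available filesystem bytes per data-tier role.
  const byTier: Record<string, { total: number; available: number }> = {};
  for (const node of Object.values<any>(stats.nodes ?? {})) {
    for (const role of node.roles ?? []) {
      if (!(role in TIER_THRESHOLDS)) continue;
      const acc = (byTier[role] ??= { total: 0, available: 0 });
      acc.total += node.fs?.total?.total_in_bytes ?? 0;
      acc.available += node.fs?.total?.available_in_bytes ?? 0;
    }
  }

  // Flag any tier whose combined used fraction crosses its threshold.
  for (const [tier, { total, available }] of Object.entries(byTier)) {
    const used = total > 0 ? (total - available) / total : 0;
    if (used > TIER_THRESHOLDS[tier]) {
      console.warn(`${tier}: ${(used * 100).toFixed(1)}% of combined disk used`);
    }
  }
}

checkTierDiskUsage().catch(console.error);
```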

Docs: https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-cluster.html#disk-based-shard-allocation
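
For reference, a hedged example of pinning the node-level high watermark that the updated rule would mirror, reusing `client` from the sketch above. Per the docs linked above, 90% is also the Elasticsearch default, so this only makes the value explicit:

```ts
// Pin the high disk watermark that triggers shard relocation away from a node.
await client.cluster.putSettings({
  persistent: {
    'cluster.routing.allocation.disk.watermark.high': '90%',
  },
});
```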

Stretch goal:

  • Provide an optional way to configure an absolute value instead of a used-disk percentage for the alert.
@botelastic bot added the needs-team (issues missing a team) label Aug 25, 2021
@ravikesarwani (Contributor, Author) commented:

cc @DaveCTurner: let me know if this looks okay from the ES side, or if any adjustments should be made.

@ravikesarwani added the Feature:Stack Monitoring, Team:Infra Monitoring UI - DEPRECATED, and Team:Monitoring labels Aug 25, 2021
@elasticmachine (Contributor) commented:

Pinging @elastic/stack-monitoring (Team:Monitoring)

@elasticmachine (Contributor) commented:

Pinging @elastic/logs-metrics-ui (Team:logs-metrics-ui)

@botelastic bot removed the needs-team (issues missing a team) label Aug 25, 2021
@hendry-lim commented Aug 26, 2021

Would it also be possible to allow absolute values instead of only percentages? We are able to set absolute disk watermark values instead of percentages in Elasticsearch.
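
For context, watermarks can indeed be byte values in Elasticsearch: a byte value expresses minimum free space rather than a used percentage, and byte and percentage forms cannot be mixed across the watermark settings. A hedged example, assuming the same @elastic/elasticsearch v8 client and local node URL as the sketches in the description:

```ts
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' });

// Byte values mean "act when free space drops below this", unlike
// percentages, which mean "act when used space rises above this".
await client.cluster.putSettings({
  persistent: {
    'cluster.routing.allocation.disk.watermark.low': '100gb',
    'cluster.routing.allocation.disk.watermark.high': '50gb',
    'cluster.routing.allocation.disk.watermark.flood_stage': '10gb',
  },
});
```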

@DaveCTurner commented:
Sounds good to me, thanks @ravikesarwani 👍

@jasonrhodes (Member) commented:

Related: #105659

@jasonrhodes (Member) commented:

@ravikesarwani / @DaveCTurner can we put links to the ES docs that specify these values in the description here?

@smith removed the Team:Infra Monitoring UI - DEPRECATED label Nov 13, 2023