[Stack Monitoring] Improve Missing Monitoring Data rule #126709
Pinging @elastic/infra-monitoring-ui (Team:Infra Monitoring UI)
I like the idea of 2 separate rules: one focused on the whole cluster and one focused on individual nodes.
For both of these rules I feel we will require some concept of alerting only when there was data before and that data is now missing for a little while. We should gracefully handle the scenario where nodes are taken out of the cluster, something that happens all the time in the field over the lifetime of an Elasticsearch cluster.
I would recommend we create a separate missing data rule for every entity in the system: Kibana, Metricbeat, Filebeat, APM Server, Nodes, and Clusters. As a customer, I would expect to be notified when any of these disappear from the cluster.

As for the rule evaluation, we should use an Elasticsearch query to push the missing entity detection down to Elasticsearch. The following example is for detecting nodes when they drop out of the cluster or stop reporting. The idea is to query Elasticsearch using a range filter that spans the last rule execution and the current rule execution. To determine whether a node has gone missing or is new/recovered, we create two buckets, one for the previous period and one for the current period. Once we have the document count for each period, we can compare the two counts and keep only the entities that reported data in the previous period but not the current one.

Along with the missing nodes, we also need to track the timestamp of the previous execution so we can use it to build the range query that covers both periods. For most of the monitoring data, looking at a 5 minute window for each period should be sufficient, which means we would actually query approximately 10 minutes of data, from the start of the last execution to the end of the current one. In a perfect world we could simply create 2 equal sized buckets, but unfortunately the Kibana Alerting system has some scheduling drift, which is why we need to use the timestamp of the last execution rather than assuming the schedule never drifts. In the example query below I'm just using a 10 minute time range with two equal 5 minute periods, but in the final implementation the period boundaries would come from the tracked timestamp of the last execution.
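Here is a minimal sketch of such a query, written as the request body you'd send to `_search`. The `.monitoring-es-*` field names (`cluster_uuid`, `node_stats.node_id`, `type`, `timestamp`) are the standard Stack Monitoring fields; the choice of aggregations is my assumption. Two sibling `filter` aggregations count documents per node for each 5 minute period, and a `bucket_selector` keeps only the nodes that have documents in the previous period but none in the current one:

```typescript
// Request body for a search against the Stack Monitoring indices
// (.monitoring-es-*). Adjust the term filter and group-by fields for
// other entity types (Kibana, Beats, APM Server, etc.).
const missingNodesQuery = {
  size: 0,
  query: {
    bool: {
      filter: [
        { term: { type: 'node_stats' } },
        { range: { timestamp: { gte: 'now-10m', lte: 'now' } } },
      ],
    },
  },
  aggs: {
    clusters: {
      terms: { field: 'cluster_uuid', size: 100 },
      aggs: {
        nodes: {
          terms: { field: 'node_stats.node_id', size: 1000 },
          aggs: {
            // Document counts for the two adjacent 5 minute periods.
            previous_period: {
              filter: { range: { timestamp: { gte: 'now-10m', lt: 'now-5m' } } },
            },
            current_period: {
              filter: { range: { timestamp: { gte: 'now-5m', lte: 'now' } } },
            },
            // Keep only node buckets that reported in the previous
            // period but not the current one.
            missing_detector: {
              bucket_selector: {
                buckets_path: {
                  previous: 'previous_period>_count',
                  current: 'current_period>_count',
                },
                script: 'params.previous > 0 && params.current == 0',
              },
            },
          },
        },
      },
    },
  },
};
```

Inverting the script (`params.previous == 0 && params.current > 0`) would surface new or recovered nodes the same way.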
This should simplify the Kibana code to just a few parts: building and running the query, tracking the timestamp of the last execution in the rule's state, and scheduling actions for the entities the query reports as missing.
This will also improve the performance of these rules because we only need to query approximately 10 minutes of data instead of looking back 24 hours on every run. It also eliminates the bug where, after 24 hours, missing nodes "recover" because they no longer show up in the query.
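To make those "few parts" concrete, here is a rough sketch of what the executor could look like. `RuleState`, `runMissingEntityQuery`, and `scheduleMissingDataAlert` are hypothetical names for illustration, not Kibana's actual alerting API:

```typescript
interface MissingEntity {
  clusterUuid: string;
  nodeId: string;
}

interface RuleState {
  lastExecutionTime?: string; // ISO timestamp persisted between runs
}

// Hypothetical helper: runs a query like the one above, with one
// 5 minute period ending at `previousEnd` and one ending at `now`,
// and returns the entities that reported in the first period but
// not the second.
declare function runMissingEntityQuery(
  previousEnd: string,
  now: string
): Promise<MissingEntity[]>;

// Hypothetical helper: schedules an alert action for a missing entity.
declare function scheduleMissingDataAlert(entity: MissingEntity): void;

async function executor(state: RuleState): Promise<RuleState> {
  const now = new Date().toISOString();
  // Anchor the previous period to the tracked timestamp of the last
  // execution so scheduling drift can't open a gap between runs; fall
  // back to 5 minutes ago on the very first run.
  const previousEnd =
    state.lastExecutionTime ??
    new Date(Date.now() - 5 * 60 * 1000).toISOString();

  const missing = await runMissingEntityQuery(previousEnd, now);
  for (const entity of missing) {
    scheduleMissingDataAlert(entity);
  }

  // Persist the current timestamp for the next execution's range query.
  return { lastExecutionTime: now };
}
```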
I'm wondering how these kinds of rules intersect with the planned Health and Topology APIs?
After investigating the slow performance of this rule when created with the default value of looking back 1 day, we found it has some shortcomings. The way this rule works is that we query for all data in the range of `now - lookback`. Per each cluster, per each node, we subtract the last document's timestamp from `now`, and if that value is greater than `duration` we fire an alert (sketched below). `duration` and `lookback` are configurable by the user, and when we create an OOTB rule of this type we set default values for both (the default `lookback` being the 1 day mentioned above). When it alerts, it specifies which node has the issue.

The problem with this approach is that once the time range has passed and the data no longer exists within the lookback window, the rule will no longer report missing data on a node. Some changes we could make:

- Remove the `lookback` option if we can track the groups
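To illustrate the shortcoming, here is a simplified sketch of the evaluation described above (not the actual Kibana source). Once a node's newest document ages past `lookback`, the query stops returning anything for that node, so the check can never fire for it again and the alert resolves:

```typescript
// Simplified sketch of the existing rule's per-node check, assuming the
// rule has already queried the newest document per node within
// [now - lookback, now].
function firesMissingDataAlert(
  lastDocTimestampMs: number | undefined, // undefined when no docs in the window
  nowMs: number,
  durationMs: number
): boolean {
  if (lastDocTimestampMs === undefined) {
    // The node's data has aged out of the lookback window entirely, so
    // it no longer appears in the query results and the rule treats it
    // as recovered -- the bug described above.
    return false;
  }
  // Alert when the gap since the node's last document exceeds `duration`.
  return nowMs - lastDocTimestampMs > durationMs;
}
```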