Schema limitation in prometheus alerts #30
I think this is something we will need to explore more: uniqueness. In playing with Grizzly, I came to have PrometheusRuleGroup as the thing that needs to be unique (using a combination of name and namespace). Within that, rules are just rules. But, I think Sam made a change to make rules into a map, and I suspect it is this that is catching you now. |
Yup, that is the source of the problem. Choosing the alert name as a key might not have been the best choice, given that Prometheus allows names to be non-unique. That said, I like having alerts in a map instead of a list, but it appears that the current unique identifier is not actually unique from Prometheus's perspective. |
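To make the collision concrete, here is a minimal sketch of why two same-named alerts can't coexist when alerts are keyed by name. The CUE below is illustrative only - it is not the actual polly schema, and the exprs are approximations, not the real mixin expressions:

```cue
// Illustrative sketch, not the actual polly schema: alerts keyed by name.
#Alert: {
    expr: string
    labels: severity: string
}

alerts: [string]: #Alert

// The warning variant claims the key...
alerts: NodeFilesystemSpaceFillingUp: {
    expr: "node_filesystem_avail_bytes / node_filesystem_size_bytes * 100 < 40" // approximated
    labels: severity: "warning"
}

// ...so the critical variant cannot be expressed under the same key: its
// conflicting expr and severity would fail unification instead of producing
// a second, distinct alert.
// alerts: NodeFilesystemSpaceFillingUp: {
//     expr: "node_filesystem_avail_bytes / node_filesystem_size_bytes * 100 < 20" // approximated
//     labels: severity: "critical"
// }
```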
I also have a recollection of hearing that recording rules are order sensitive. |
Super interesting! Pondering this, i think the desire to reuse names - for alerts at least - reduces to questions about the relationship we're trying to express between the same-named objects. Lemme lay out some base facts: in the node exporter case, there are two alerts with the name NodeFilesystemSpaceFillingUp, whose exprs differ only by a fixed threshold scalar. (Additionally, the severity label is hardcoded to warning in one and critical in the other.)

Clearly, there's something meritorious about defining these alerts in a way that expresses they are related. Taken together, they constitute a unified classifier ("normal," "warning," "critical") over the value space produced by their related linear predictions, which differ only by a fixed scalar. In other words, in any sane configuration of the parameters, the "critical" alert will never be firing if the "warning" alert is not also firing. This is a clear, strong relationship.

The problem with simply reusing a name to indicate this relationship is how little the name-sameness actually tells the user. When two alerts of the same name are firing, there is no general rule of inference the user can rely on to be certain such a relationship exists - all they can know for sure is that "the author thinks these alerts are related, somehow." And it's not actually trivial to decode this relationship - it took me a minute or so of copy/pasting the two exprs into an editor for careful visual comparison to verify that they are actually identical, excepting those two deviations.
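For reference, here's what that pair looks like in simplified form - the exprs are approximated rather than copied from the mixin, and reduced so that only the threshold scalar and the severity differ:

```cue
// Simplified approximation of the two same-named node_exporter alerts; the
// real mixin exprs are more involved, but the shape of the relationship is
// the same: identical exprs except for one scalar, plus a hardcoded severity.
alerts: [
    {
        alert: "NodeFilesystemSpaceFillingUp"
        expr:  "predict_linear(node_filesystem_avail_bytes[6h], 24*60*60) < node_filesystem_size_bytes * 0.40"
        labels: severity: "warning"
    },
    {
        alert: "NodeFilesystemSpaceFillingUp"
        expr:  "predict_linear(node_filesystem_avail_bytes[6h], 24*60*60) < node_filesystem_size_bytes * 0.20"
        labels: severity: "critical"
    },
]
```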
For such a well-defined relationship between alerts as exists in this case, i think we're best served with another logical abstraction that precisely captures it: basically, one alert declaration in the pop, but declared in a way that clearly expresses the intent for multiple alerts to be generated. And, to @paulfantom's point about the inhibition rule, having a clear, specified logical structure for this relationship would actually make it easy to generate those inhibition rules, too. Something like:

```cue
#UsageThreshold: {
    minThreshold: number
    horizonHours: number
    warningLabel: string
}
```

Then wrap these into a struct that's an argument to the alert, and introduce some more generic flattening logic. We could make the inclusion of such arguments part of the meta-structure that already wraps the alerts. Seems like an ideal thing for me to poke at during a community hack session - i'll do that tomorrow!
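In the meantime, a rough sketch of the kind of flattening i mean - every name below is hypothetical, and none of it exists in polly today:

```cue
// Hypothetical sketch: one alert declaration plus a set of thresholds,
// flattened into multiple same-named Prometheus alerts. All names invented.
#UsageThreshold: {
    minThreshold: number
    horizonHours: number
    warningLabel: string
}

thresholds: [...#UsageThreshold] & [
    {minThreshold: 40, horizonHours: 24, warningLabel: "warning"},
    {minThreshold: 20, horizonHours: 24, warningLabel: "critical"},
]

// Flattening: one generated alert per threshold, all sharing a name, with the
// relationship between them explicit in the source rather than implied.
alerts: [for t in thresholds {
    alert: "NodeFilesystemSpaceFillingUp"
    expr:  "predict_linear(node_filesystem_avail_bytes[6h], \(t.horizonHours)*3600) < node_filesystem_size_bytes * \(t.minThreshold/100)"
    labels: severity: t.warningLabel
}]
```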
Again, maps or lists

Coming back to the original question... Ultimately, i think the design choice about maps vs. lists comes back to what general inference rules the user can apply when multiple alerts exist with the same name. Are those relationships fuzzy (lists), or well-specified (maps or lists)? It's a 2x2 matrix of possibilities.

Because allowing fuzziness at all undermines the user's ability to make inferences, i think we should opt only for the well-specified approach, and keep the map structure as-is. My hunch is that anytime it's really, really clear that it makes sense to have multiple alerts having the same name, it's specifically because there's a well-specifiable relationship between those alerts, meriting a logical structure. If that turns out not to be the case, or it's just not worth the work to use these structures, then suffixed naming patterns are still always possible. |
It would certainly make sense that they're ordered, as one rule may depend on the output of a prior one. (right?) That's definitely something we'll need to tackle. I can imagine a straightforward sort rule, given a declaration that one rule must succeed another. |
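Purely as a sketch of what such a declaration could look like - the after field is invented and not part of any existing schema, and the rule names and exprs are illustrative:

```cue
// Hypothetical ordering declaration on recording rules: "after" names the
// rules whose output this rule depends on, so a generator can sort the group.
#RecordingRule: {
    record: string
    expr:   string
    after?: [...string]
}

rules: [
    #RecordingRule & {
        record: "instance:node_cpu:rate5m"
        expr:   "rate(node_cpu_seconds_total[5m])"
    },
    #RecordingRule & {
        record: "cluster:node_cpu:avg_rate5m"
        expr:   "avg(instance:node_cpu:rate5m)"
        after: ["instance:node_cpu:rate5m"]
    },
]
```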
Bear in mind that this way of expressing the relation between alerts wasn't chosen because we wanted to explicitly state that relation, but because we wanted simplicity in use. AFAIK, mixins were originally meant to be developed by SMEs and installed by users who may not have a lot of knowledge of the monitored application. IMHO, in such cases the expectation to know the relation between alerts is less relevant than ease of use, especially when you have good alert descriptions and runbooks. |
i think we actually may be in agreement here - will be easier to tell after i hack on this today and there's some code to look at :)
Of course. But today, it's not necessarily what alerts sharing the same name means. To that end...
I'm not suggesting that we're expecting inhibition rules to be generated - or at least, not any more than mixins are today. Case in point, there's no inhibition rule defined in the node_exporter mixin that i can see. That's defined over in kube-prometheus. Rather, i'm saying that creating a logical structure to formalize the multiple-severities relation between a set of alerts makes it possible to write a reliable general algorithm that generates corresponding inhibition rules. Such a general algorithm can exist without formalizing the relation, but then the correctness of any generated inhibition rules is based on an unverifiable hope that best practices were followed.
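For instance, if the severities were formalized as an ordered list, the generation could be as mechanical as this sketch - the field names follow Alertmanager's inhibit_rules config, but severityOrder and the generation logic are purely illustrative:

```cue
// Illustrative only: derive Alertmanager inhibition rules from a formalized
// severity ordering, so each severity silences the one below it for the same
// alert name.
severityOrder: ["warning", "critical"] // least to most severe

inhibit_rules: [for i, sev in severityOrder if i > 0 {
    source_match: severity: sev
    target_match: severity: severityOrder[i-1]
    equal: ["alertname", "namespace"]
}]
```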
This is the point where i think our imaginations of what i've described have diverged. I do have worries about the complexity/usability of what i've described above, but i think they fall on the side of the pop author and the consuming tool author, not the actual end user consuming the pop. Will have to see what i can come up with in today's hack session :) |
@talentedmrjones this topic may be of interest to you, wrt any helpers/patterns that you may or may not have written on top of the CloudFormation schema as part of your stax efforts |
@beorn7 this feels like an interesting example of that "Prometheus as compilation target" idea we talked about, many months ago |
OK, after having worked on these for an hour during today's community hack session and reflected on the experience, i think it's best to go with @paulfantom's original idea and just go back to using lists for rules and alerts. Two basic reasons:
Haven't decided if i'll tidy up the hack session into a PR for folks to have an educational look at, or not. |
Quick update based on discussion in the community call - we're going to stick very closely to what Prometheus actually defines for these config objects. That means going back to plain, ordered lists of rules and alerts, mirroring the Prometheus rule file format, rather than maps keyed by name.
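A minimal sketch of that shape (illustrative, not the final polly definitions):

```cue
// Sketch of a list-based schema mirroring Prometheus rule groups: rules are
// an ordered list, and names need not be unique. Not the final polly schema.
#Rule: {
    record?: string
    alert?:  string
    expr:    string
    "for"?:  string
    labels?: {[string]: string}
    annotations?: {[string]: string}
}

#RuleGroup: {
    name:      string
    interval?: string
    rules: [...#Rule]
}
```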
It remains plausible that we can add helpers to codify patterns like this alerts-differ-only-by-severity outcome, but we've gotta walk before we can run. Gonna use this issue as the TODO for converting the schema per the above, and add it to the milestone. |
I can confirm that the order of rules within a group matters, because later rules may depend on the output of earlier rules.

Having multiple rules funnel their results into different elements of one vector with one name (be it a metric name for recording rules or an alert name for alerting rules) is quite a common pattern in Prometheus, for better or worse. The use case that triggered this discussion is only one of them. In particular, there are legitimate use cases where the critical alert may fire while the warning alert doesn't (e.g. our (in)famous multi-window multi-burn-rate SLO alerts). But there are also completely different scenarios not related to severity. I would say the only meaning of multiple alerts with the same name is that they all alert about a similar thing (which is described by the name), and that the alert name is a good candidate to be part of the grouping key for notifications. |
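For illustration, the multi-window multi-burn-rate pattern produces same-named alerts where critical can fire while warning stays silent. The numbers follow the common SLO-alerting recipe and the recording-rule names are illustrative, not taken from any particular mixin:

```cue
// Illustrative multi-window multi-burn-rate alerts: a sharp, short-lived error
// spike trips the 1h/5m critical alert while the 3d/6h warning alert, which
// needs sustained burn over a much longer window, never fires.
alerts: [
    {
        alert: "ErrorBudgetBurn"
        expr:  "job:slo_errors_per_request:ratio_rate1h > (14.4*0.001) and job:slo_errors_per_request:ratio_rate5m > (14.4*0.001)"
        labels: severity: "critical"
    },
    {
        alert: "ErrorBudgetBurn"
        expr:  "job:slo_errors_per_request:ratio_rate3d > (1*0.001) and job:slo_errors_per_request:ratio_rate6h > (1*0.001)"
        labels: severity: "warning"
    },
]
```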
While playing around with converting the node_exporter mixin to pollypkg, I found a limitation of the current polly alerts schema - there cannot be duplicate alerts with different severities.

For example, the node_exporter mixin ships two alerts named NodeFilesystemSpaceFillingUp with different expressions and severities. It's useful to have them share a name when we consider how an inhibition rule can be written for such alerts (example in kube-prometheus). Am I missing something, or does the current schema make it mandatory to have unique alert names?