Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Schema limitation in prometheus alerts #30

Open
paulfantom opened this issue Aug 7, 2021 · 12 comments
Open

Schema limitation in prometheus alerts #30

paulfantom opened this issue Aug 7, 2021 · 12 comments

Comments

@paulfantom
Copy link
Contributor

paulfantom commented Aug 7, 2021

While playing around with converting node_exporter mixin to pollypkg I found a limitation of the current polly alerts schema - there cannot be a duplicate alert with different severity.

For example, node_exporter mixin is shipping 2 alerts named NodeFilesystemSpaceFillingUp with different expressions and severities. It's useful to have them share names when we consider how inhibition rule can be written for such alerts (example in kube-prometheus)

Am I missing something, or current schema makes it mandatory to have unique alert names?

@malcolmholmes
Copy link
Contributor

I think this is something we will need to explore more: uniqueness. In playing with Grizzly, I came to have PrometheusRuleGroup as the thing that needs to be unique (using a combination of name and namespace). Within that, rules are just rules.

But, I think Sam made a change to make rules into a map, and I suspect it is this that is catching you now.

@paulfantom
Copy link
Contributor Author

I think Sam made a change to make rules into a map, and I suspect it is this that is catching you now.

Yup, that is the source of the problem. Choosing an alert name as a key might not have been the best choice when Prometheus allows names not to be unique.

That said I like having alerts put into a map instead of a list but it appears that the current unique identifier is not unique from prometheus perspective.

@malcolmholmes
Copy link
Contributor

I also have a recollection of hearing that recording rules are order sensitive.

@sdboyer
Copy link
Member

sdboyer commented Aug 9, 2021

Super interesting! Pondering this, i think the desire to reuse names - for alerts at least - reduces to questions about the relationship we're trying to express between the same-named objects. Lemme lay out some base facts:

In the node exporter case, there are two alerts with the name NodeFilesystemSpaceFillingUp having expr that differ in two specific ways:

  • The critical one interpolates the fsSpaceFillingUpCriticalThreshold parameter to a minimum threshold, whereas the warning one uses the fsSpaceFillingUpWarningThreshold parameter
  • The critical one uses a projected-threshold-met value of 4 (hardcoded) hours, the warning of 24 (hardcoded) hours

(Additionally, the severity label is hardcoded to warning in the warning alert, but controlled by the nodeCriticalSeverity parameter in the critical alert.)

Clearly, there's something meritorious about defining these alerts in a way that expresses they are related. Taken together, they constitute a unified classifier ("normal," "warning," "critical") over the value space produced by their related linear predictions, which differ only by a fixed scalar. In other words, in any sane configuration of the parameters, the "critical" alert will never be firing if the "warning" alert is not also firing.

This is a clear, strong relationship. The problem with simply reusing a name to indicate this relationship is how little the name-sameness actually tells the user. When two alerts of the same name are firing, there is no general rule of inference the user can rely on to be certain such a relationship exists - all they can know for sure is that "the author thinks these alerts are related, somehow." And it's not actually trivial to decode this relationship - it took me a minute or so of copy/pasting the two exprs into an editor for careful visual comparison to verify that they are actually identical, excepting those two deviations.

For such a well-defined relationship between alerts as exists in this case, i think we're best served with another logical abstraction that precisely captures:

  • A set of named parameters
  • A list of the above sets
  • The pop author's intent that there should exist instances of the alert for each declared set

Basically, one alert declaration in the pop, but declared in a way that clearly expresses the intent for multiple alerts to be generated. And, to @paulfantom's point about the inhibition rule, having a clear, specified logical structure for this relationship would actually make it easy to generate those inhibition rules, too. Something like:

#UsageThreshold: {
  minThreshold: number
  horizonHours: number
  warningLabel: string
}

Then wrap these into a struct that's an argument to the alert, and introduce some more generic flattening logic.

We could make the inclusion of such arguments part of the meta-structure that already wraps the alerts. Seems like an ideal thing for me to poke at during a community hack session - i'll do that tomorrow!

Again, maps or lists

Coming back to the original question...

Ultimately, i think the design choice about maps vs. lists comes back to what general inference rules the user can apply when multiple alerts exist with the same name. Are those relationships fuzzy (lists), or well-specified (maps or lists)? It's a 2x2 matrix of possibilities:

Fuzzy ✅ Fuzzy 🚫
Specified ambiguous, defies inference what i think we want
Specified 🚫 (mixins today) (polly today)

Because allowing fuzziness at all undermines the user's ability to make inferences, i think we should opt only for the well-specified approach, and keep the map structure as-is. My hunch is that anytime it's really, really clear that it makes sense to have multiple alerts having the same name, it's specifically because there's a well-specifiable relationship between those alerts, meriting a logical structure. If that turns out not to be the case, or it's just not worth the work to use these structures, then suffixed naming patterns are still always possible.

@sdboyer
Copy link
Member

sdboyer commented Aug 9, 2021

I also have a recollection of hearing that recording rules are order sensitive.

It would certainly make sense that they're ordered, as one rule may depend on the output of a prior one. (right?)

That's definitely something we'll need to tackle. I can imagine a straightforward sort rule, given a declaration that one rule must succeed another.

@paulfantom
Copy link
Contributor Author

having a clear, specified logical structure for this relationship would actually make it easy to generate those inhibition rules, too.

NodeFilesystemSpaceFillingUp is not a single case, it isn't even a single case in node_exporter mixin. The concept of having different severity alerts named the same to create an easily manageable relation between them is quite commonly used. Because of that, I can create one simple inhibition rule which will work with all those cases. If we start to expect additional inhibition rules to be generated to correctly use pollypkg, then IMHO it is another blocker for wider adoption of polly.

Bear in mind that this concept of specifying the relation between alerts wasn't chosen because we wanted to explicitly state the relation between alerts but because we wanted to have simplicity in use. AFAIK originally mixins were meant to be developed by SMEs and installed by users which may not have a lot of knowledge of the monitored application. IMHO In such cases expectation to know the relation between alerts is less relevant than the ease of use. Especially when you have good alert descriptions and runbooks.

@sdboyer
Copy link
Member

sdboyer commented Aug 10, 2021

i think we actually may be in agreement here - will be easier to tell after i hack on this today and there's some code to look at :)

The concept of having different severity alerts named the same to create an easily manageable relation between them is quite commonly used.

Of course. But today, it's not necessarily what alerts sharing the same name means. To that end...

If we start to expect additional inhibition rules to be generated to correctly use pollypkg, then IMHO it is another blocker for wider adoption of polly.

I'm not suggesting that we're expecting inhibition rules to be generated - or at least, not any more than mixins are today. Case in point, there's no inhibition rule defined in the node_exporter mixin that i can see. That's defined over in kube-prometheus.

Rather, i'm saying that creating a logical structure to formalize the multiple-severities relation between a set of alerts makes it possible to write a reliable general algorithm that generates corresponding inhibition rules. Such a general algorithm can exist without formalizing the relation, but then the correctness of any generated inhibition rules is based on an unverifiable hope that best practices were followed.

IMHO In such cases expectation to know the relation between alerts is less relevant than the ease of use.

This is the point where i think our imaginations of what i've described have diverged. I do have worries about the complexity/usability of what i've described above, but i think they fall on the side of the pop author and the consuming tool author, not the actual end user consuming the pop.

Will have to see what i can come up with in today's hack session :)

@sdboyer
Copy link
Member

sdboyer commented Aug 10, 2021

@talentedmrjones this topic may be of interest to you, wrt any helpers/patterns that you may or may not have written on top of Cloudformation schema as part of your stax efforts

@sdboyer
Copy link
Member

sdboyer commented Aug 10, 2021

@beorn7 this feels like an interesting example of that "Prometheus as compilation target" idea we talked about, many months ago

@sdboyer
Copy link
Member

sdboyer commented Aug 10, 2021

OK, after having worked on these for an hour during today's community hack session and reflected on the experience, i think it's best to go with @paulfantom's original idea and just go back to using lists for rules and alerts. Two basic reasons:

  • Ultimately, the main reason for using maps is to make the individual rules and alerts more addressable. That's putting the cart before the horse, because it's not yet clear exactly how we're going to allow external values to be mixed in to pops. Until that's clear, it's hard to really justify the complexity cost of deviating from exactly what Prometheus has already laid out as a structure to be followed.
  • Using lists doesn't preclude the introduction of alert-exploder-helpers like this in the future.

Haven't decided if i'll tidy up the hack session into a PR for folks to have an educational look at, or not.

@sdboyer
Copy link
Member

sdboyer commented Aug 11, 2021

Quick update based on discussion in community call - we're going to stick very closely to what Prometheus actually defines for these config objects. That means:

  1. Collapsing the rule and alert types into a single Rule type
  2. Creating a RuleGroup type
  3. Going back to having lists

It remains plausible that we can add helpers to codify patterns like this alerts-differ-only-by-severity outcome, but we've gotta walk before we can run.

Gonna use this issue as the TODO for converting the schema per the above, and add it to the milestone.

@sdboyer sdboyer added this to the Minimally runnable milestone Aug 11, 2021
@beorn7
Copy link

beorn7 commented Aug 12, 2021

I can confirm that order of rules within a group matters because later rules may depend on the output of earlier rules.

Having multiple rules funneling their results into different elements of one vector with one name (may it be a metric name for recording rules or an alert name for alert rules) is a quite common pattern in Prometheus, for better or worse. The use case that triggered this discussion is only one of them. In particular, there are legitimate use cases where the critical alert may fire while the warning alert doesn't (e.g. our (in)famous multi-window multi-burn-rate SLO alerts). But then there are also completely different scenarios not related to severity. I would say the only meaning of multiple alerts with the same name is that they all alert about a similar thing (which is described by the name) and that the alert name is a good candidate to be part of the grouping key for notifications.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants