Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[APM] Change rule creation for anomaly detection detectors (latency, throughput, and failed transaction rate) #126580

Closed
formgeist opened this issue Mar 1, 2022 · 14 comments · Fixed by #171901
Assignees
Labels
apm:alerting apm:ml Integration between APM and ML enhancement New value added to drive a business result Team:APM All issues that need APM UI Team support

Comments

@formgeist
Copy link
Contributor

formgeist commented Mar 1, 2022

Summary

The current experience around creating anomaly detection rules for alerts is targeted only for the latency detector, but we have recently added new detectors for the anomaly detection jobs that include throughput and failed transaction rate.

We should decide whether we should include a single anomaly detection rule that will contain all three options as conditions or have separate rule types for each detector.

CleanShot 2022-03-01 at 13 46 43@2x

Solution

Convert the existing latency anomaly rule to a global anomaly rule for all or a single detector(s)

  • Remove/replace the existing latency anomaly rule with a global anomaly rule
  • The new anomaly rule takes a condition of all or a single detector, so the user doesn't have to choose individual rule types for the various anomalies that might occur.
  • By enabling the user to selected a single detector we keep the existing functionality of being able to select e.g. latency as the detector and determining how to alert on any severity level.
  • By enabling the user to choose actions on all detector anomalies with the same severity level enables them to have fewer rules that serve as a catch-all
  • Add detector ("metric") option to the conditions of the rule and replace wording around latency anomaly.
  • Update the default rule name to "Anomaly alert | $service.name"

CleanShot 2022-03-02 at 11 03 18@2x

CleanShot 2022-03-02 at 11 03 49@2x

CleanShot 2022-03-02 at 11 44 24@2x

@formgeist formgeist added Team:APM All issues that need APM UI Team support enhancement New value added to drive a business result labels Mar 1, 2022
@elasticmachine
Copy link
Contributor

Pinging @elastic/apm-ui (Team:apm)

@alex-fedotyev
Copy link

@formgeist - could you please create a quick mock how alert creation would look like in both of these flows, i.e. a single anomaly detection rule that will contain all three options as conditions or have separate rule types for each detector?

@formgeist
Copy link
Contributor Author

@alex-fedotyev I've updated the issue description with the two options. I'm heavily leaning towards Solution B because it means the most flexibility in its use and being able to set a global rule for all anomalies that are detected.

@dgieselaar are there any limitations or challenges with either approach?

cc @sqren @chrisdistasio

@sorenlouv
Copy link
Member

sorenlouv commented Mar 2, 2022

Solution B sound good to me. Thanks for getting around to this so quickly @formgeist !

@chrisdistasio
Copy link

@vinaychandrasekhar do you have any thoughts on this?

@vinaychandrasekhar
Copy link

@chrisdistasio thanks for including me in the discussion.
@formgeist couple of questions -

  • Would it make sense to also combine the multiple threshold rules under a single "Create threshold rule" with a second level menu for failed transaction rule, error count rule? Otherwise, it appears confusing (to me) that anomaly rule creation shows no notion of signal or metric (i.e., latency, transaction rate, errors) where as the threshold ones do
  • Other o11y apps where the alerts and rules drop down is shown seem to follow a pattern more similar to solution A above. Would like to check if there's a top-down pattern consistency already considered across o11y apps?

@chrisdistasio
Copy link

+1 @vinaychandrasekhar -- we should use this as an opty to drive consistency in behavior/pattern across the o11y apps.

@dgieselaar
Copy link
Member

@formgeist agree w/ @sqren that B sounds like our best option. It'd be great if we can get that in for 8.2, as it fixes the bug (or rather makes that behaviour explicit) where it fires for all detector types.

@formgeist
Copy link
Contributor Author

Would it make sense to also combine the multiple threshold rules under a single "Create threshold rule" with a second level menu for failed transaction rule, error count rule? Otherwise, it appears confusing (to me) that anomaly rule creation shows no notion of signal or metric (i.e., latency, transaction rate, errors) where as the threshold ones do

We originally intended the rule grouping to reflect the fact that Latency had anomaly and threshold rules types, while the others (throughput and failed transaction rate) only supported threshold based rules. Now that this is no longer the issue, I agree a re-organization makes sense.

Other o11y apps where the alerts and rules drop down is shown seem to follow a pattern more similar to solution A above. Would like to check if there's a top-down pattern consistency already considered across o11y apps?

Agreed, there's an opportunity to review this across Observability too. I've created a separate issue for this.

I would like us to focus on supporting the new anomaly detectors in the upcoming release, so let's narrow the scope down to change the anomaly rule creation. I'll create the necessary ticket(s) so we can include this in our plans.

cc @dannycroft

@formgeist
Copy link
Contributor Author

Created a related issue to change the structure of rules in the Alerts and rules option for APM #126757

@MiriamAparicio
Copy link
Contributor

Hi, I'm going to start working on this. For me to understand and to confirm as the issue is quite all
We want to add the 'metric' detector to the creation of the rule, keeping as default ALL, and display suggestions as what we have for the type with the 3 different connectors (latency, throughput, or failed transaction rate)

image

Do we change also this copy?
Alert when either the latency, throughput, or failed transaction rate of a service is anomalous. Learn more

cc @formgeist @boriskirov @sqren

@boriskirov
Copy link
Contributor

Yes, that sounds good, adding a detector for the different available metrics and selecting the all by default.

@sorenlouv
Copy link
Member

We want to add the 'metric' detector to the creation of the rule, keeping as default ALL, and display suggestions as what we have for the type with the 3 different connectors (latency, throughput, or failed transaction rate)

@MiriamAparicio I suggest making it a checkbox (multiselect). Let's avoid the "All" option and instead default to having all three options pre-selected.

@MiriamAparicio
Copy link
Contributor

@sqren I already have a PR up, and what you suggested would be very different to what we have for all other rules

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
apm:alerting apm:ml Integration between APM and ML enhancement New value added to drive a business result Team:APM All issues that need APM UI Team support
Projects
None yet
Development

Successfully merging a pull request may close this issue.

9 participants