[APM] Active alert never resolves when origin rule is deleted #112354

Open
cyrille-leclerc opened this issue Sep 15, 2021 · 16 comments
Labels
bug Fixes for quality problems that affect the customer experience Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) Theme: rac label obsolete

Comments

@cyrille-leclerc
Contributor

cyrille-leclerc commented Sep 15, 2021

Kibana version

7.15 BC3 on ESS

APM Server version (if applicable)

No response

Elasticsearch version (if applicable)

No response

Steps to Reproduce

  • Create an APM Transaction Latency rule: "SERVICE frontend TYPE request ENVIRONMENT production WHEN avg IS ABOVE 1500ms FOR THE LAST 1 minute"
  • Create a violation of its threshold
  • Verify in the alerts screen that the alert has fired and currently has the status active
  • Delete the rule that has caused the alert
  • Wait for several minutes
  • Refresh the alerts screen: the alert is still visible and its status is still active

https://www.youtube.com/watch?v=mJmhfnpwnMw&feature=youtu.be

Expected Behavior

The alert should still be visible in the alerts screen and its status should have changed to "resolved".

Actual Behavior

The alert is still visible in the alerts screen but its status is "active" instead of "resolved".

@cyrille-leclerc cyrille-leclerc added Team:APM All issues that need APM UI Team support Theme: rac label obsolete labels Sep 15, 2021
@elasticmachine
Contributor

Pinging @elastic/apm-ui (Team:apm)

@cyrille-leclerc
Contributor Author

@jasonrhodes I'm wondering if the simplest solution for this odd behaviour could be, for the moment, to cascade delete the alerts that have been created by a rule that is being deleted.
A more sophisticated solution would be to cascade resolve the active alerts of a rule that is being deleted.
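
To make the two options concrete, here is a minimal in-memory sketch. It is not Kibana code: `AlertRecord`, `CascadeMode`, and `cascadeOnRuleDelete` are hypothetical names used only for illustration.

```typescript
// Hypothetical in-memory model; none of these names are Kibana APIs.
type AlertStatus = "active" | "resolved";

interface AlertRecord {
  id: string;
  ruleId: string;
  status: AlertStatus;
}

type CascadeMode = "delete" | "resolve";

// Returns the alerts that remain after the rule `ruleId` has been deleted.
function cascadeOnRuleDelete(
  alerts: AlertRecord[],
  ruleId: string,
  mode: CascadeMode
): AlertRecord[] {
  if (mode === "delete") {
    // Simplest option: drop every alert the deleted rule created.
    return alerts.filter((a) => a.ruleId !== ruleId);
  }
  // More sophisticated option: keep the alerts but resolve the active ones.
  return alerts.map((a): AlertRecord =>
    a.ruleId === ruleId && a.status === "active"
      ? { ...a, status: "resolved" }
      : a
  );
}
```

With "resolve", alerts of other rules are left untouched and the deleted rule's alerts stay visible with a final status.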

@sorenlouv sorenlouv added [zube]: Inbox and removed Team:APM All issues that need APM UI Team support labels Sep 20, 2021
@botelastic botelastic bot added the needs-team Issues missing a team label label Sep 20, 2021
@sorenlouv sorenlouv removed [zube]: Inbox needs-team Issues missing a team label labels Sep 20, 2021
@botelastic botelastic bot added the needs-team Issues missing a team label label Sep 20, 2021
@jasonrhodes
Member

@elastic/actionable-observability we need to confirm if this is a bug and figure out what's happening if it is, or close this with an explanation if not

@jasonrhodes jasonrhodes added the Team: Actionable Observability - DEPRECATED For Observability Alerting and SLOs use "Team:obs-ux-management", for AIops "Team:obs-knowledge" label Nov 11, 2021
@botelastic botelastic bot removed the needs-team Issues missing a team label label Nov 11, 2021
@ersin-erdal ersin-erdal self-assigned this Nov 23, 2021
@ersin-erdal
Contributor

ersin-erdal commented Nov 23, 2021

I reproduced the bug (if it's a bug).
Yes, the alert stays Active. Another bug is the "View rule details" link: since the rule is deleted, it navigates to an empty page, although a "not found" toast message is shown.

The solution could be to update all the alerts created by a rule when that rule is deleted (status and rule_id, so we can also hide those "View rule details" links),
that is, to cascade resolve the active alerts of the rule, as Cyrille said.
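
A rough sketch of how such a bulk update could be expressed as an Elasticsearch `_update_by_query` request body. The field names `kibana.alert.status` and `kibana.alert.rule.uuid`, the assumption that they are stored as flat dotted keys in `_source`, and the helper name `buildResolveAlertsBody` are all assumptions to verify against the real alerts-as-data mapping. The function only builds the body; it does not send anything.

```typescript
// Assumed field names; check them against the actual alerts-as-data mapping.
const STATUS_FIELD = "kibana.alert.status";
const RULE_ID_FIELD = "kibana.alert.rule.uuid";

interface UpdateByQueryBody {
  query: { bool: { filter: Array<{ term: Record<string, string> }> } };
  script: { lang: "painless"; source: string; params: Record<string, string> };
}

// Builds (but does not send) a body that targets the active alerts of one rule,
// sets their status, and removes the reference to the deleted rule.
function buildResolveAlertsBody(ruleId: string, newStatus: string): UpdateByQueryBody {
  return {
    query: {
      bool: {
        filter: [
          { term: { [RULE_ID_FIELD]: ruleId } },
          { term: { [STATUS_FIELD]: "active" } },
        ],
      },
    },
    script: {
      lang: "painless",
      // Assumes flat dotted keys in _source; nested documents would need a different script.
      source:
        "ctx._source[params.statusField] = params.newStatus; " +
        "ctx._source.remove(params.ruleIdField);",
      params: { statusField: STATUS_FIELD, ruleIdField: RULE_ID_FIELD, newStatus },
    },
  };
}
```

Dropping the rule reference is what would let the UI hide the "View rule details" link, at the cost of losing the link between the alert and its origin rule.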

If this is OK with everybody, I can implement the solution.

@ersin-erdal ersin-erdal removed their assignment Dec 15, 2021
@vinaychandrasekhar

Hi @ersin-erdal thank you for your question and suggestions. Not showing the "view rule details" on some alerts but showing that link on others might get confusing for users. Options are -

  • Option 1: show the link but greyed out and unclickable, with a message next to it that the rule no longer exists and may have been deleted
  • Option 2: show the link, and improve the message when the user lands on the rule details page

Preference would be option 1, but if that takes more effort we could start with option 2 and track option 1 for the future.
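
Option 1 could be modelled roughly as below. This is only a sketch: `RuleLinkState` and `ruleDetailsLink` are hypothetical names, and how the UI learns that the rule no longer exists is left out.

```typescript
// Hypothetical view model for the "View rule details" link.
interface RuleLinkState {
  label: string;
  disabled: boolean;
  hint?: string;
}

// Option 1: keep the link visible, but grey it out when the rule is gone.
function ruleDetailsLink(ruleExists: boolean): RuleLinkState {
  const label = "View rule details";
  if (ruleExists) {
    return { label, disabled: false };
  }
  return {
    label,
    disabled: true,
    hint: "This rule no longer exists and may have been deleted.",
  };
}
```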

cc @hbharding

@simianhacker
Member

@elastic/kibana-alerting-services Is there a lifecycle event that happens when an alert is deleted? The current functionality in relies on wrapping the solutions executor, when the solution's rule runs our lifecycle executor runs as well setting the alert state (in RAC indices).

@simianhacker simianhacker self-assigned this Jan 25, 2022
@simianhacker simianhacker added the Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) label Jan 31, 2022
@elasticmachine
Contributor

Pinging @elastic/response-ops (Team:ResponseOps)

@simianhacker simianhacker added bug Fixes for quality problems that affect the customer experience and removed Team: Actionable Observability - DEPRECATED For Observability Alerting and SLOs use "Team:obs-ux-management", for AIops "Team:obs-knowledge" v8.1.0 labels Jan 31, 2022
@simianhacker simianhacker removed their assignment Jan 31, 2022
@kobelb kobelb added the needs-team Issues missing a team label label Jan 31, 2022
@botelastic botelastic bot removed the needs-team Issues missing a team label label Jan 31, 2022
@mikecote
Contributor

mikecote commented Feb 1, 2022

@elastic/kibana-alerting-services Is there a lifecycle event that happens when an alert is deleted? The current functionality in relies on wrapping the solutions executor, when the solution's rule runs our lifecycle executor runs as well setting the alert state (in RAC indices).

We don't have any at this time but it shouldn't be hard to implement. Would you be open to working on a PR in the platform code to support this? cc @XavierM
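
For discussion, a minimal sketch of what a delete lifecycle hook next to a wrapped executor might look like. Everything here is hypothetical (`RuleRegistry`, `withLifecycle`, the `onDelete` callback); the real alerting framework's types differ and no such hook exists at this time.

```typescript
// Hypothetical registry; not the alerting framework's real API.
type Executor = (ruleId: string) => void;
type DeleteHook = (ruleId: string) => void;

class RuleRegistry {
  private rules = new Map<string, { executor: Executor; onDelete?: DeleteHook }>();

  register(ruleId: string, executor: Executor, onDelete?: DeleteHook): void {
    this.rules.set(ruleId, { executor, onDelete });
  }

  run(ruleId: string): void {
    this.rules.get(ruleId)?.executor(ruleId);
  }

  // Invokes the delete hook (if any) before the rule is removed.
  delete(ruleId: string): boolean {
    const rule = this.rules.get(ruleId);
    if (!rule) return false;
    rule.onDelete?.(ruleId);
    this.rules.delete(ruleId);
    return true;
  }
}

// Wraps a solution executor so that alert state is tracked on every run,
// and supplies a matching hook that resolves the state on deletion.
// For brevity, each run simply marks the rule's alert as active.
function withLifecycle(
  solutionExecutor: Executor,
  alertState: Map<string, string>
): { executor: Executor; onDelete: DeleteHook } {
  return {
    executor: (ruleId) => {
      solutionExecutor(ruleId);
      alertState.set(ruleId, "active");
    },
    onDelete: (ruleId) => {
      if (alertState.get(ruleId) === "active") alertState.set(ruleId, "resolved");
    },
  };
}
```

The point of the sketch is that the wrapper which already sets alert state on each run is also the natural owner of the cleanup on delete.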

@gmmorris
Contributor

@elastic/kibana-alerting-services Is there a lifecycle event that happens when an alert is deleted? The current functionality in relies on wrapping the solutions executor, when the solution's rule runs our lifecycle executor runs as well setting the alert state (in RAC indices).

We don't have any at this time but it shouldn't be hard to implement. Would you be open to working on a PR in the platform code to support this? cc @XavierM

Just to make sure I understand this issue correctly - when @simianhacker says "alert" here, he means rule, right?
As in, when the rule is deleted, you want to resolve any open alerts in alerts-as-data?

(cc @kobelb )

@kobelb
Contributor

kobelb commented Apr 13, 2022

Do we really want to be automatically resolving all alerts when an alerting rule is deleted? I could see some users wanting to delete the alerts entirely, or manually resolve them.

@simianhacker
Member

@gmmorris Yes... When the rule is deleted, we need a way to clean up the "Alerts as Data" alerts.

@pmuellr
Member

pmuellr commented May 5, 2022

We have a similar situation when you disable a rule - what do we do with active alerts at the alerting framework level? Previously we did nothing, so they remained active, and then later caused problems when the rule was enabled and those alert ids never got used again - they'd remain active forever. At least I think that was the scenario.

We ended up marking them resolved, but not running the resolved action for them, IIRC.

We don't currently have a problem when rules are deleted, since everything hangs off the rules, there's nothing to "clean up". Once we start doing space-wide event log searches not based on the rule ids, but on RBAC fields instead, presumably we'll start seeing this effect. We'd need to join on the rules to determine if the rule is deleted or not.

Doesn't help with resolving this, just thought I'd mention the similar situation.

I think one of the things that came from that was thinking we needed some new "action group" or maybe some additional context in the "resolved" action group, to indicate WHY we "resolved" it. Because the alert is no longer active? Or not enabled? Or was deleted?
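
One way to picture the "why was it resolved" context is a reason carried on the alert. This is a sketch with hypothetical names (`RecoveryReason`, `TrackedAlert`, `resolveAlert`), not an existing framework type.

```typescript
// Hypothetical reasons for leaving the "active" state.
type RecoveryReason = "condition_cleared" | "rule_disabled" | "rule_deleted";

interface TrackedAlert {
  id: string;
  status: "active" | "resolved";
  recoveryReason?: RecoveryReason;
}

// Marks an alert resolved and records why. The resolved action only runs
// when the condition actually cleared, mirroring the disable behaviour
// described above (marked resolved, resolved action not run).
function resolveAlert(
  alert: TrackedAlert,
  reason: RecoveryReason,
  runResolvedAction: (a: TrackedAlert) => void
): TrackedAlert {
  const resolved: TrackedAlert = { ...alert, status: "resolved", recoveryReason: reason };
  if (reason === "condition_cleared") {
    runResolvedAction(resolved);
  }
  return resolved;
}
```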

@pmuellr
Member

pmuellr commented May 5, 2022

Do we really want to be automatically resolving all alerts when an alerting rule is deleted? I could see some users wanting to delete the alerts entirely, or manually resolve them.

Ya, I'd guess some users would be surprised to find alert data deleted when a rule is deleted. I'd think you'd want some setting indicating what you want done: delete or not delete; and then probably some UX to show me "alerts from deleted rules", where I could get rid of them with a bulk delete.

And I'm curious if this is a space concern, or a visibility issue. Rather than delete the alerts, could we update them with a "rule was deleted" flag or such, and then use that with filtering?
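
A small sketch of the flag-and-filter idea, assuming a hypothetical boolean `ruleDeleted` on each alert document (the names `AlertDoc`, `flagAlertsOfDeletedRule`, and `visibleAlerts` are illustrative only):

```typescript
// Hypothetical alert shape with a "rule was deleted" marker.
interface AlertDoc {
  id: string;
  ruleId: string;
  ruleDeleted?: boolean;
}

// Instead of deleting alerts, mark those whose rule has been deleted.
function flagAlertsOfDeletedRule(alerts: AlertDoc[], ruleId: string): AlertDoc[] {
  return alerts.map((a): AlertDoc =>
    a.ruleId === ruleId ? { ...a, ruleDeleted: true } : a
  );
}

// Default view hides flagged alerts; an "alerts from deleted rules" view
// could pass includeDeletedRules = true and offer a bulk delete.
function visibleAlerts(alerts: AlertDoc[], includeDeletedRules = false): AlertDoc[] {
  return includeDeletedRules ? alerts : alerts.filter((a) => !a.ruleDeleted);
}
```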

Figured I'd link this too, as it might be appropriate in the big scheme of things: Implement soft delete of rules #89564

@mikecote
Contributor

mikecote commented May 5, 2022

Cross-referencing a comment I made here for potential circuit breaker on rules => #124870 (comment).

If we come up with a status for alerts that are no longer tracked (unconfirmed recovered), we could re-use such status when deleting or disabling an alerting rule.

@gmmorris
Contributor

Do we really want to be automatically resolving all alerts when an alerting rule is deleted? I could see some users wanting to delete the alerts entirely, or manually resolve them.

Might this map to the whole "lifecycle vs. Persistent" alerts?
Lifecycle alerts auto-recover/delete, but Persistent ones... persist.

@kobelb
Contributor

kobelb commented May 10, 2022

Might this map to the whole "lifecycle vs. Persistent" alerts?
Lifecycle alerts auto-recover/delete, but Persistent ones... persist.

This is largely a product question; however, I have an opinion, like always.

When a lifecycle alerting rule is deleted, doing something with the active alerts makes sense to me because they won't be resolved by the system in the future. I don't think we should automatically delete them, as this would break the references from any cases and this information might be helpful for the case. Changing their status from active to something makes sense to me. We can mark them as "resolved", but we risk the user interpreting the resolved status as the system no longer detecting the condition that triggered them, which is not the case.

When a persistent alert is deleted, I think users will need the choice whether to leave the persistent alerts behind or whether to delete them. It's possible that the alerts are still entirely valid and should go through their normal workflow to be closed, or it's possible that the alerts are invalid and should be purged.
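
The distinction drawn above could be summarised as a small decision function. This is only a sketch: the names are hypothetical, and the status value "untracked" is a placeholder for whatever non-"resolved" status is eventually chosen.

```typescript
type AlertKind = "lifecycle" | "persistent";
type PersistentChoice = "leave" | "delete";

type DeletionOutcome =
  | { action: "set_status"; status: string }
  | { action: "leave" }
  | { action: "delete" };

// Decides what happens to a rule's alerts when the rule is deleted.
function onRuleDeletedPolicy(
  kind: AlertKind,
  choice: PersistentChoice = "leave"
): DeletionOutcome {
  if (kind === "lifecycle") {
    // Not deleted (cases may reference the alert) and not "resolved"
    // (which could be misread as the condition having cleared).
    // "untracked" is a placeholder status name, not an existing one.
    return { action: "set_status", status: "untracked" };
  }
  // Persistent alerts: the user chooses whether to keep or purge them.
  return choice === "delete" ? { action: "delete" } : { action: "leave" };
}
```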
