
Research how to add a circuit breaker for max number of active alerts per rule #124870

Closed
mikecote opened this issue Feb 7, 2022 · 9 comments
Labels: Feature:Alerting/RulesFramework, research, Team:ResponseOps

Comments

@mikecote
Contributor

mikecote commented Feb 7, 2022

We should explore adding a circuit breaker on the number of active alerts per rule so the system doesn't end up working with large data structures in memory. Some thoughts below.

@mikecote mikecote added the blocked, Team:ResponseOps, and Feature:Alerting/RulesFramework labels Feb 7, 2022
@elasticmachine
Contributor

Pinging @elastic/response-ops (Team:ResponseOps)

@mikecote
Contributor Author

I think we should explore this option similarly to how we did #124871 and implement the functionality with a high limit as a starting point.

Some thoughts on the functionality:

  • We should pass this circuit breaker to the rule type runners so they can do their best to respect it.
  • We should create an issue asking all rule type authors to implement code that respects the circuit breaker.
  • The framework must know when the circuit breaker is applied in the rule type runner.
  • Recovery events should not be sent until the rule run where the circuit breaker stops applying.
  • We should not track more than the circuit breaker's worth of alerts in memory (think about what happens with alert API usage).
  • The circuit breaker should be configurable in kibana.yml while we come up with a default value (lower than 100,000, maybe 10,000).
  • The circuit breaker should also be configurable by rule type and overridable in kibana.yml, similar to timeouts by rule type (see the config sketch after this list).
  • We should create the alert documents and send the notifications for alerts captured prior to the circuit breaker applying (similar to actions).
  • What should the last run status be: warning or error? Check with the security solution team.
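As a rough illustration of the per-rule-type configuration idea above, here is a minimal TypeScript sketch; the config shape, function name, and rule type ids are hypothetical examples, not the actual Kibana settings or alerting framework API:

```ts
// Hypothetical shape of the configuration discussed above: a global default
// (settable via kibana.yml) plus optional per-rule-type overrides.
interface AlertCircuitBreakerConfig {
  // Global default while a final value is decided (e.g. 10,000).
  defaultMaxActiveAlerts: number;
  // Per-rule-type overrides, keyed by rule type id.
  perRuleType?: Record<string, number>;
}

// Resolve the effective limit for a rule type: the override wins when present.
function resolveMaxActiveAlerts(
  config: AlertCircuitBreakerConfig,
  ruleTypeId: string
): number {
  return config.perRuleType?.[ruleTypeId] ?? config.defaultMaxActiveAlerts;
}

// Example usage (rule type ids are just examples).
const config: AlertCircuitBreakerConfig = {
  defaultMaxActiveAlerts: 10_000,
  perRuleType: { '.es-query': 1_000 },
};
console.log(resolveMaxActiveAlerts(config, '.es-query')); // 1000
console.log(resolveMaxActiveAlerts(config, '.index-threshold')); // 10000
```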

@mikecote mikecote added research and removed blocked labels Apr 14, 2022
@mikecote mikecote changed the title Add circuit breaker for max number of alerts a rule can generate Research how to add a circuit breaker for max number of alerts a rule can generate Apr 14, 2022
@mikecote mikecote changed the title Research how to add a circuit breaker for max number of alerts a rule can generate Research how to add a circuit breaker for max number of active alerts per rule Apr 28, 2022
@mikecote
Contributor Author

mikecote commented Apr 28, 2022

I've learned with time that it might be better to have a circuit breaker on the maximum number of active alerts rather than on the maximum number of alerts a run can generate.

For example, if a rule continuously hits the circuit breaker for the next 10 runs but generates different alerts each time, the system would have to keep track of them all to know when to recover them, leaving our state with an unbounded number of alerts to track (in this case, potentially up to 10x the circuit breaker). I think we should see if we can do something with the max number of active alerts instead. The main reason is that we want the framework to work with small data structures, since everything is stored within the task state and in memory, and it takes CPU time to calculate new, ongoing, and recovered alerts.

This may change some of my thoughts above; some further thoughts below:

  • Even though the limit applies to the maximum number of active alerts, we should also apply it as the maximum number of alerts a rule run can generate.
    • For example, with a circuit breaker of 1,000 and a rule having 400 active alerts, the next run could generate up to 1,000 new alerts and the system would recover the previous 400 since they are no longer reported, netting up to 1,000 tracked alerts after the run.
  • When a rule run generates more alerts than the active alert limit, we need to apply some math to determine how many new alerts we can track (limit - previous active alerts = how many new alerts we can track); see the sketch after this list.
    • For example, with a circuit breaker of 1,000 and a rule having 400 active alerts, the next run could generate 1,001 new alerts; the system applies the circuit breaker at the 1,001st call but should only track the first 600 new alerts so it nets 1,000 alerts to track (400 existing alerts + 600 new alerts = 1,000 limit).
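To make the arithmetic in the examples above concrete, here is a tiny sketch; the function and variable names are made up for illustration:

```ts
// How many newly reported alerts can still be tracked this run, given the
// circuit breaker limit and the alerts that are already active.
function newAlertCapacity(maxActiveAlerts: number, previouslyActive: number): number {
  return Math.max(0, maxActiveAlerts - previouslyActive);
}

// Numbers from the example above: limit 1,000, 400 already active, 1,001 reported.
const maxActiveAlerts = 1_000;
const previouslyActive = 400;
const reportedThisRun = 1_001;

// The breaker trips because more alerts were reported than the per-run limit.
const breakerTripped = reportedThisRun > maxActiveAlerts; // true

// Only the first 600 reported alerts are tracked (400 existing + 600 new = 1,000).
const tracked = Math.min(reportedThisRun, newAlertCapacity(maxActiveAlerts, previouslyActive)); // 600
```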

@gmmorris
Contributor

gmmorris commented May 4, 2022

I've been contemplating the approach we could take here and have some thoughts.

The way I see it, we need to choose between two approaches: one that prioritises correctness of recovery and another that prioritises freshness of data.
I'm not sure which is the right one to prioritise by default, but from an observability perspective, my intuition tells me that freshness is more important to me than avoiding premature recovery.

Allow me to explain:

In the image below you can see the two approaches:
[Image: Two approaches to Alert Circuit Breakers in rules]

Drop Alerts the Moment a Circuit Breaker is Tripped
The first assumes that a rule run that trips the circuit breaker should throw away the entire cycle, as we can't trust it. I can see why this might make sense to avoid telling the user that an alert has recovered when we can't be sure that it has, but it has an obvious downside: as a user, I now have no idea what the state of my system is. Assuming we communicate the circuit breaker tripping as a warning ⚠️ in the UI, I know that something is wrong, but I don't know what. I can still see the alerts from the last valid rule run, but I have no idea whether they are still violating the condition or not.

Process Alerts until the Moment a Circuit Breaker is Tripped
The second treats any alerts that were detected prior to the circuit breaker being tripped as valid, and simply refuses to process any more after that. We can still throw and stop additional detections or processing in the rule, but we treat the alerts reported before the circuit is tripped as if nothing bad happened.

The downside to this is that if the rule hasn't re-reported an existing alert before the breaker trips, that alert will auto-recover.
The plus side is that any active alert is, demonstrably, an accurate depiction of the world as we know it. This enables me, the user, to continue to address the problem causing the alert storm, using reliable detection data, and hopefully I can bring the situation back under control and below the circuit breaker threshold.
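A rough sketch of this second approach, using invented names rather than the real rule executor API, could look like this:

```ts
interface RunResult {
  active: Set<string>;    // alert ids treated as active after this run
  recovered: Set<string>; // previously active ids that were not re-reported
  breakerTripped: boolean;
}

// Process reported alert ids in order until the circuit breaker trips, then
// refuse to process any more. Previously active alerts that were not
// re-reported before the trip auto-recover.
function processUntilTripped(
  previouslyActive: Set<string>,
  reported: Iterable<string>,
  maxActiveAlerts: number
): RunResult {
  const active = new Set<string>();
  let breakerTripped = false;

  for (const id of reported) {
    if (active.size >= maxActiveAlerts) {
      breakerTripped = true;
      break; // stop additional detections/processing here
    }
    active.add(id);
  }

  const recovered = new Set([...previouslyActive].filter((id) => !active.has(id)));
  return { active, recovered, breakerTripped };
}
```

The trade-off shows up in the return value: everything in `active` was genuinely reported this run, while anything in `recovered` may simply have been cut off by the breaker.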

--

Looking at these two side by side, it seems obvious to me that we want to go with the latter approach, but this is just one person's opinion. Please explore this further with input from some folks outside E&C.

@mikecote
Contributor Author

mikecote commented May 4, 2022

my intuition tells me that freshness is more important to me than avoiding premature recovery.

Some further thoughts:

In the past, we've had to do an RCA into why the system prematurely recovered an alert during a severe outage. With that in mind, I'm thinking yes, we may want to report on the freshness of data, but I'm not sure about auto-recovering alerts that aren't confirmed recovered. We could also drop the recovery events for those alerts, or make the system work with an unbounded number of alerts to recover 🤔

Since the system will need to start tracking alerts for flapping purposes, I'm thinking we'll also need to put a limit on that as well, and the freshness-of-data requirement could force that algorithm to be unbounded too. This may be another case where we'd need to support an unbounded number of alerts to apply flapping algorithms to 🤔
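For illustration only, a bound on the flapping-tracking state could be as simple as evicting the oldest tracked entries once a cap is hit; everything here (names, shape of the tracked entries) is hypothetical:

```ts
// Keep at most `maxTrackedForFlapping` recovered alerts around for the
// flapping calculation, evicting the oldest entries once the cap is reached.
function capFlappingHistory<T extends { recoveredAt: number }>(
  tracked: T[],
  maxTrackedForFlapping: number
): T[] {
  if (tracked.length <= maxTrackedForFlapping) return tracked;
  return [...tracked]
    .sort((a, b) => b.recoveredAt - a.recoveredAt) // newest first
    .slice(0, maxTrackedForFlapping);
}
```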

@gmmorris
Contributor

gmmorris commented May 5, 2022

In the past, we've had to do an RCA into why the system prematurely recovered an alert during a severe outage. With that in mind, I'm thinking yes, we may want to report on the freshness of data, but I'm not sure about auto-recovering alerts that aren't confirmed recovered. We could also drop the recovery events for those alerts, or make the system work with an unbounded number of alerts to recover 🤔

That's a fair point, thanks for reminding me of that case.
What crosses my mind, though, is that in that case the rule suggested the problem went away.
If we make it clear that the state is unreliable due to a circuit breaker being tripped, it would address that problem.

The question is - how do we make this clear?

@mikecote mikecote moved this from Todo to Blocked / On hold in AppEx: ResponseOps - Execution & Connectors May 5, 2022
@mikecote
Contributor Author

mikecote commented May 5, 2022

I'm temporarily adding a blocked label, as this is the next issue to be picked up and I want to summarize my conversation with leadership before that happens.

@mikecote
Contributor Author

mikecote commented May 5, 2022

Some notes after talking with some of the leads:

  • Freshness and severity are the priority when applying the circuit breaker.
  • We should work with rule type authors to report alerts in order of importance so that if the circuit breaker does apply, the most important alerts make the cut.
  • For recovering alerts, we will instead come up with a term but basically "drop" them without calling actions.
  • For throttling, the functionality will stop working when an alert is "dropped" and reset the next time the alert is recovered.
  • The circuit breaker applies to the max number of alerts that can be reported during a rule run and will drop existing alerts to keep tracking the newly reported alerts. This is why it's important to report the most important alerts first (see the sketch below).

There were conversations about what to do with alerts when deleting or disabling a rule (#112354). With this new "dropped" functionality, we could leverage it there too.
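As a minimal sketch of that last bullet, assuming made-up names and ignoring flapping/throttling details, the drop behavior might look something like this:

```ts
interface CircuitBreakerOutcome {
  tracked: string[]; // alerts kept after the run, most important first
  dropped: string[]; // existing alerts removed without calling actions
}

// Newly reported alerts are kept in the order they were reported (i.e. by
// importance) up to the per-run limit; existing active alerts that no longer
// fit are "dropped" rather than recovered, so no actions fire for them.
function applyCircuitBreaker(
  reportedByImportance: string[],
  existingActive: string[],
  maxAlertsPerRun: number
): CircuitBreakerOutcome {
  const tracked = reportedByImportance.slice(0, maxAlertsPerRun);
  const trackedSet = new Set(tracked);
  const dropped = existingActive.filter((id) => !trackedSet.has(id));
  return { tracked, dropped };
}
```

Because `reportedByImportance` is already ordered by importance, truncating it at the limit keeps the alerts that matter most, which is the reason for the ordering requirement above.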

@ymao1 ymao1 closed this as completed Aug 8, 2022
Repository owner moved this from In Progress to Done in AppEx: ResponseOps - Execution & Connectors Aug 8, 2022