
Research how to add a circuit breaker for max number of active alerts per rule #124870

Closed
mikecote opened this issue Feb 7, 2022 · 9 comments
Labels: Feature:Alerting/RulesFramework, research, Team:ResponseOps

Comments

@mikecote
Contributor

mikecote commented Feb 7, 2022

We should explore adding a circuit breaker on the number of active alerts per rule so the system doesn't end up working with large data structures in memory. Some thoughts below.

@mikecote mikecote added the blocked, Team:ResponseOps, and Feature:Alerting/RulesFramework labels Feb 7, 2022
@elasticmachine
Contributor

Pinging @elastic/response-ops (Team:ResponseOps)

@mikecote
Contributor Author

I think we should explore this option similarly to how we did #124871 and implement the functionality with a high limit as a starting point.

Some thoughts on the functionality:

  • We should pass this circuit breaker to the rule type runners so they can do their best to respect it.
  • We should create an issue asking all rule type authors to implement code that respects the circuit breaker.
  • The framework must know when the circuit breaker is applied in the rule type runner.
  • Recovery events should not be sent until the rule run where the circuit breaker stops applying.
  • We should not track more than the circuit breaker's worth of alerts in memory (think about what happens with alert API usage).
  • The circuit breaker should be configurable in kibana.yml while we come up with a default value (lower than 100,000, maybe 10,000).
  • The circuit breaker should also be configurable by rule type and overridable in kibana.yml, similar to timeouts by rule type (see the config sketch after this list).
  • We should create the alert documents and send the notifications for alerts captured prior to the circuit breaker applying (similar to actions).
  • What should the last run status be: warning or error? Check with the security solution team.
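As a rough illustration of the per-rule-type configuration idea above, here is a minimal TypeScript sketch; the config shape, function name, and rule type ids are hypothetical examples, not the actual Kibana settings or alerting framework API:

```ts
// Hypothetical shape of the configuration discussed above: a global default
// (settable via kibana.yml) plus optional per-rule-type overrides.
interface AlertCircuitBreakerConfig {
  // Global default while a final value is decided (e.g. 10,000).
  defaultMaxActiveAlerts: number;
  // Per-rule-type overrides, keyed by rule type id.
  perRuleType?: Record<string, number>;
}

// Resolve the effective limit for a rule type: the override wins when present.
function resolveMaxActiveAlerts(
  config: AlertCircuitBreakerConfig,
  ruleTypeId: string
): number {
  return config.perRuleType?.[ruleTypeId] ?? config.defaultMaxActiveAlerts;
}

// Example usage (rule type ids are just examples).
const config: AlertCircuitBreakerConfig = {
  defaultMaxActiveAlerts: 10_000,
  perRuleType: { '.es-query': 1_000 },
};
console.log(resolveMaxActiveAlerts(config, '.es-query')); // 1000
console.log(resolveMaxActiveAlerts(config, '.index-threshold')); // 10000
```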

@mikecote mikecote added research and removed blocked labels Apr 14, 2022
@mikecote mikecote changed the title Add circuit breaker for max number of alerts a rule can generate Research how to add a circuit breaker for max number of alerts a rule can generate Apr 14, 2022
@mikecote mikecote changed the title Research how to add a circuit breaker for max number of alerts a rule can generate Research how to add a circuit breaker for max number of active alerts per rule Apr 28, 2022
@mikecote
Contributor Author

mikecote commented Apr 28, 2022

I've learned with time that it might be better to have a circuit breaker on the maximum number of active alerts rather than on the maximum number of alerts a run can generate.

For example, if a rule continuously hits the circuit breaker for the next 10 runs but generates different alerts each time, the system would have to keep track of them all to know when to recover them, leaving our state with an unbounded number of alerts to track (in this case, potentially up to 10x the circuit breaker). I think we should see if we can do something with the max number of active alerts instead. The main reason is that we want the framework to work with small data structures, since everything is stored within the task state and in memory, and it takes CPU time to calculate new, ongoing, and recovered alerts.

This may change some of my thoughts above; some further thoughts below:

  • Even though the limit applies to the maximum number of active alerts, we should also apply it as the maximum number of alerts a rule run can generate.
    • For example, with a circuit breaker of 1,000 and a rule having 400 active alerts, the next run could generate up to 1,000 new alerts and the system would recover the previous 400 since they are no longer reported, netting up to 1,000 tracked alerts after the run.
  • When a rule run generates more alerts than the active alert limit, we need to apply some math to determine how many new alerts we can track (limit - previous active alerts = how many new alerts we can track); see the sketch after this list.
    • For example, with a circuit breaker of 1,000 and a rule having 400 active alerts, the next run could generate 1,001 new alerts; the system applies the circuit breaker at the 1,001st call but should only track the first 600 new alerts so it nets 1,000 alerts to track (400 existing alerts + 600 new alerts = 1,000 limit).
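To make the arithmetic in the examples above concrete, here is a tiny sketch; the function and variable names are made up for illustration:

```ts
// How many newly reported alerts can still be tracked this run, given the
// circuit breaker limit and the alerts that are already active.
function newAlertCapacity(maxActiveAlerts: number, previouslyActive: number): number {
  return Math.max(0, maxActiveAlerts - previouslyActive);
}

// Numbers from the example above: limit 1,000, 400 already active, 1,001 reported.
const maxActiveAlerts = 1_000;
const previouslyActive = 400;
const reportedThisRun = 1_001;

// The breaker trips because more alerts were reported than the per-run limit.
const breakerTripped = reportedThisRun > maxActiveAlerts; // true

// Only the first 600 reported alerts are tracked (400 existing + 600 new = 1,000).
const tracked = Math.min(reportedThisRun, newAlertCapacity(maxActiveAlerts, previouslyActive)); // 600
```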

@gmmorris
Contributor

gmmorris commented May 4, 2022

I've been contemplating the approach we could take here and have some thoughts.

The way I see it, we need to choose between two approaches: one that prioritises correctness of recovery and another that prioritises freshness of data.
I'm not sure which is the right one to prioritise by default, but from an observability perspective, my intuition tells me that freshness is more important to me than avoiding premature recovery.

Allow me to explain:

In the image below you can see the two approaches:
[Image: Two approaches to Alert Circuit Breakers in rules]

Drop Alerts the Moment a Circuit Breaker is Tripped
The first assumes that a rule run that trips the circuit breaker should throw away the entire cycle, as we can't trust it. I can see why this might make sense to avoid telling the user that an alert has recovered when we can't be sure that it has, but it has an obvious downside: as a user, I now have no idea what the state of my system is. Assuming we communicate the circuit breaker tripping as a warning ⚠️ in the UI, I know that something is wrong, but I don't know what. I can still see the alerts from the last valid rule run, but I have no idea whether they are still violating the condition or not.

Process Alerts until the Moment a Circuit Breaker is Tripped
The second treats any alerts that were detected prior to the circuit breaker being tripped as valid, and simply refuses to process any more after that. We can still throw and stop additional detections or processing in the rule, but we treat the alerts reported before the circuit is tripped as if nothing bad happened.

The downside to this is that if the rule hasn't re-reported an existing alert before the breaker trips, that alert will auto-recover.
The plus side is that any active alert is, demonstrably, an accurate depiction of the world as we know it. This enables me, the user, to continue to address the problem causing the alert storm, using reliable detection data, and hopefully I can bring the situation back under control and below the circuit breaker threshold.
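A rough sketch of this second approach, using invented names rather than the real rule executor API, could look like this:

```ts
interface RunResult {
  active: Set<string>;    // alert ids treated as active after this run
  recovered: Set<string>; // previously active ids that were not re-reported
  breakerTripped: boolean;
}

// Process reported alert ids in order until the circuit breaker trips, then
// refuse to process any more. Previously active alerts that were not
// re-reported before the trip auto-recover.
function processUntilTripped(
  previouslyActive: Set<string>,
  reported: Iterable<string>,
  maxActiveAlerts: number
): RunResult {
  const active = new Set<string>();
  let breakerTripped = false;

  for (const id of reported) {
    if (active.size >= maxActiveAlerts) {
      breakerTripped = true;
      break; // stop additional detections/processing here
    }
    active.add(id);
  }

  const recovered = new Set([...previouslyActive].filter((id) => !active.has(id)));
  return { active, recovered, breakerTripped };
}
```

The trade-off shows up in the return value: everything in `active` was genuinely reported this run, while anything in `recovered` may simply have been cut off by the breaker.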

--

Looking at these two side by side, it seems obvious to me that we want to go with the latter approach, but this is just one person's opinion. Please explore this further with input from some folks outside E&C.

@mikecote
Contributor Author

mikecote commented May 4, 2022

my intuition tells me that freshness is more important to me than avoiding premature recovery.

Some further thoughts:

In the past, we've had to do an RCA into why the system prematurely recovered an alert during a severe outage. With that in mind, I'm thinking yes, we may want to report on the freshness of data, but I'm not sure about auto-recovering alerts that aren't confirmed recovered. We could also drop the recovery events for those alerts, or make the system work with an unbounded number of alerts to recover 🤔

Since the system will need to start tracking alerts for flapping purposes, I'm thinking we'll also need to put a limit on that as well, and the freshness-of-data requirement could force that algorithm to be unbounded too. This may be another case where we'd need to support an unbounded number of alerts to apply flapping algorithms to 🤔
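For illustration only, a bound on the flapping-tracking state could be as simple as evicting the oldest tracked entries once a cap is hit; everything here (names, shape of the tracked entries) is hypothetical:

```ts
// Keep at most `maxTrackedForFlapping` recovered alerts around for the
// flapping calculation, evicting the oldest entries once the cap is reached.
function capFlappingHistory<T extends { recoveredAt: number }>(
  tracked: T[],
  maxTrackedForFlapping: number
): T[] {
  if (tracked.length <= maxTrackedForFlapping) return tracked;
  return [...tracked]
    .sort((a, b) => b.recoveredAt - a.recoveredAt) // newest first
    .slice(0, maxTrackedForFlapping);
}
```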

@gmmorris
Contributor

gmmorris commented May 5, 2022

In the past, we've had to do an RCA into why the system prematurely recovered an alert during a severe outage. With that in mind, I'm thinking yes, we may want to report on the freshness of data, but I'm not sure about auto-recovering alerts that aren't confirmed recovered. We could also drop the recovery events for those alerts, or make the system work with an unbounded number of alerts to recover 🤔

That's a fair point, thanks for reminding me of that case.
What crosses my mind, though, is that in that case the rule suggested the problem went away.
If we make it clear that the state is unreliable due to a circuit breaker being tripped, it would address that problem.

The question is - how do we make this clear?

@mikecote mikecote moved this from Todo to Blocked / On hold in AppEx: ResponseOps - Execution & Connectors May 5, 2022
@mikecote
Contributor Author

mikecote commented May 5, 2022

I'm temporarily adding a blocked label, as this is the next issue to be picked up and I want to summarize my conversation with leadership before that happens.

@mikecote
Contributor Author

mikecote commented May 5, 2022

Some notes after talking with some of the leads:

  • Freshness and severity are the priority when applying the circuit breaker.
  • We should work with rule type authors to report alerts in order of importance so that if the circuit breaker does apply, the most important alerts make the cut.
  • For recovering alerts, we will instead come up with a term but basically "drop" them without calling actions.
  • For throttling, the functionality will stop working when an alert is "dropped" and reset the next time the alert is recovered.
  • The circuit breaker applies to the max number of alerts that can be reported during a rule run and will drop existing alerts to keep tracking the newly reported alerts. This is why it's important to report the most important alerts first (see the sketch below).

There were conversations about what to do with alerts when deleting or disabling a rule (#112354). With this new "dropped" functionality, we could leverage it there too.
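As a minimal sketch of that last bullet, assuming made-up names and ignoring flapping/throttling details, the drop behavior might look something like this:

```ts
interface CircuitBreakerOutcome {
  tracked: string[]; // alerts kept after the run, most important first
  dropped: string[]; // existing alerts removed without calling actions
}

// Newly reported alerts are kept in the order they were reported (i.e. by
// importance) up to the per-run limit; existing active alerts that no longer
// fit are "dropped" rather than recovered, so no actions fire for them.
function applyCircuitBreaker(
  reportedByImportance: string[],
  existingActive: string[],
  maxAlertsPerRun: number
): CircuitBreakerOutcome {
  const tracked = reportedByImportance.slice(0, maxAlertsPerRun);
  const trackedSet = new Set(tracked);
  const dropped = existingActive.filter((id) => !trackedSet.has(id));
  return { tracked, dropped };
}
```

Because `reportedByImportance` is already ordered by importance, truncating it at the limit keeps the alerts that matter most, which is the reason for the ordering requirement above.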

@ymao1 ymao1 closed this as completed Aug 8, 2022
Repository owner moved this from In Progress to Done in AppEx: ResponseOps - Execution & Connectors Aug 8, 2022