Research how to add a circuit breaker for max number of active alerts per rule #124870
Comments
Pinging @elastic/response-ops (Team:ResponseOps)
I think we should explore this option similar to how we did #124871 and implement the functionality with a high number as a starting point. Some thoughts on the functionality
I've learned over time that it may be better to put the circuit breaker on the maximum number of active alerts rather than on the maximum number of alerts a run can generate. For example, if a rule hits the circuit breaker on each of the next 10 runs but generates different alerts each time, the system would have to keep track of all of them to know when to recover them, leaving our state with an unbounded number of alerts to track (in this case potentially up to 10x the circuit breaker). I think we should see if we can do something with the max number of active alerts instead. The main reason is that we want the framework to work with small data structures, since everything is stored in the task state and in memory, and calculating new, ongoing and recovered alerts takes CPU time. This may change some of my thoughts above. Some further thoughts below:
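To make the 10x example above concrete, here is a minimal sketch comparing the two kinds of limit. It is not Kibana code: `trackWithPerRunLimit`, `trackWithActiveLimit` and `simulate` are hypothetical names, and the model assumes that a run truncated by a per-run limit cannot confirm recovery of previously tracked alerts, so it keeps them.

```typescript
// Hypothetical model of tracked alert state; none of these names exist in Kibana.
type AlertId = string;

// Per-run limit: accept at most `maxPerRun` reported alerts.
// Assumption: when the run is truncated, previously tracked alerts cannot be
// confirmed as recovered, so they stay in the tracked state.
function trackWithPerRunLimit(
  tracked: Set<AlertId>,
  reported: AlertId[],
  maxPerRun: number
): Set<AlertId> {
  const hitLimit = reported.length > maxPerRun;
  const next = new Set<AlertId>();
  if (hitLimit) {
    tracked.forEach((id) => next.add(id));
  }
  reported.slice(0, maxPerRun).forEach((id) => next.add(id));
  return next;
}

// Active-alert limit: the tracked state never holds more than `maxActive` alerts.
// Alerts that were already tracked are kept first so ongoing alerts stay stable.
function trackWithActiveLimit(
  tracked: Set<AlertId>,
  reported: AlertId[],
  maxActive: number
): Set<AlertId> {
  const next = new Set<AlertId>();
  reported.forEach((id) => {
    if (tracked.has(id) && next.size < maxActive) next.add(id);
  });
  reported.forEach((id) => {
    if (next.size < maxActive) next.add(id);
  });
  return next;
}

// Every run reports `alertsPerRun` alerts that no earlier run reported.
function simulate(
  runs: number,
  alertsPerRun: number,
  limit: number
): { perRunLimitTracked: number; activeLimitTracked: number } {
  let perRun = new Set<AlertId>();
  let active = new Set<AlertId>();
  for (let run = 0; run < runs; run++) {
    const reported: AlertId[] = [];
    for (let i = 0; i < alertsPerRun; i++) {
      reported.push(`run${run}-alert${i}`);
    }
    perRun = trackWithPerRunLimit(perRun, reported, limit);
    active = trackWithActiveLimit(active, reported, limit);
  }
  return { perRunLimitTracked: perRun.size, activeLimitTracked: active.size };
}
```

With 10 runs, 15 different alerts per run and a limit of 10, the per-run model ends up tracking 100 alerts while the active-alert model stays at 10. The sketch deliberately leaves open what happens to previously tracked alerts that fall outside the active limit.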
Some further thoughts: In the past we've had to do an RCA on why the system prematurely recovered an alert during a severe outage. With that in mind, I'm thinking yes, we may want to report on the freshness of data, but I'm not sure about auto-recovering alerts that aren't confirmed recovered. We could also drop the recovery events for those alerts, or make the system work with an unbounded number of alerts to recover 🤔 Since I think the system will need to start tracking alerts for flapping purposes, we'll need to put a limit on that as well, and the freshness-of-data requirement could force that algorithm to be unbounded too. This may be another case where we'll need to support an unbounded number of alerts to apply flapping algorithms to 🤔
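One way to picture the "not confirmed recovered" concern is a classification step that refuses to mark an alert as recovered when the run was cut short by the limit. This is only a sketch of one option, not a decision from this thread: `classifyAlerts`, the `unconfirmed` bucket and the `hitLimit` flag are hypothetical names.

```typescript
// Hypothetical sketch; names and shapes are illustrative, not Kibana APIs.
interface ClassifiedAlerts {
  newAlerts: string[];   // reported now, not active before
  ongoing: string[];     // reported now and active before
  recovered: string[];   // active before, absent from a complete run
  unconfirmed: string[]; // active before, absent from a run that hit the limit
}

function classifyAlerts(
  previouslyActive: string[],
  reported: string[],
  hitLimit: boolean
): ClassifiedAlerts {
  const previous = new Set<string>(previouslyActive);
  const current = new Set<string>(reported);
  const result: ClassifiedAlerts = {
    newAlerts: [],
    ongoing: [],
    recovered: [],
    unconfirmed: [],
  };

  current.forEach((id) => {
    if (previous.has(id)) result.ongoing.push(id);
    else result.newAlerts.push(id);
  });

  previous.forEach((id) => {
    if (current.has(id)) return;
    // A truncated run says nothing about alerts it did not report,
    // so absence is only treated as recovery when the run was complete.
    if (hitLimit) result.unconfirmed.push(id);
    else result.recovered.push(id);
  });

  return result;
}
```

What to do with the `unconfirmed` bucket (keep tracking, drop the recovery event, or report stale data) is exactly the open question in the comment above.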
That's a fair point, thanks for reminding me of that case. The question is: how do we make this clear?
I'm temporarily adding a blocked label, as this is the next issue in line and I want to summarize my conversation with leadership before it gets picked up.
Some notes after talking with some of the leads:
There were conversations about what to do with alerts when deleting or disabling a rule (#112354). With this new "dropped" functionality, we could leverage it there too.
RFC approved. Created implementation issues:
We should explore adding a circuit breaker on the number of active alerts per rule so the system doesn't end up working with large data structures in memory. Some thoughts below: