Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Skip scheduling actions for the alerts without scheduledActions #195948

Merged

Conversation

ersin-erdal
Copy link
Contributor

@ersin-erdal ersin-erdal commented Oct 11, 2024

Resolves: #190258

As a result of #190258, we have found out that the odd behaviour happens when an existing alert is pushed above the max alerts limit by a new alert.

Scenario:

  1. The rule type detects 4 alerts (alert-1, alert-2, alert-3, alert-4),
    But reports only the first 3 as the max alerts limit is 3.

  2. alert-2 becomes recovered, therefore the rule type reports 3 active (alert-1, alert-3, alert-4), 1 recovered (alert-2) alert.

  3. Alerts alert-1, alert-3, alert-4 are saved in the task state.

  4. alert-2 becomes active again (the others are still active)

  5. Rule type reports 3 active alerts (alert-1, alert-2, alert-3)

  6. As a result, the action scheduler tries to schedule actions for alert-1, alert-3, alert-4 as they are the existing alerts.
    But, since the rule type didn't report the alert-4 it has no scheduled actions, therefore the action scheduler assumes it is recovered and tries to schedule a recovery action.

This PR changes the actionScheduler to handle active and recovered alerts separately.
With this change, no action would be scheduled for an alert from previous run (exists in the task state) and isn't reported by the ruleTypeExecutor due to max-alerts-limit but it would be kept in the task state.

@ersin-erdal ersin-erdal added release_note:fix Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) backport:prev-minor Backport to (8.x) the previous minor version (i.e. one version back from main) labels Oct 11, 2024
@ersin-erdal ersin-erdal self-assigned this Oct 11, 2024
@ersin-erdal ersin-erdal marked this pull request as ready for review October 11, 2024 15:28
@ersin-erdal ersin-erdal requested a review from a team as a code owner October 11, 2024 15:28
ersin-erdal and others added 10 commits October 11, 2024 17:29
…rsin-erdal/kibana into 190259-alert-without-scheduled-actions
@elasticmachine
Copy link
Contributor

elasticmachine commented Oct 13, 2024

💔 Build Failed

Failed CI Steps

Test Failures

  • [job] [logs] FTR Configs #22 / serverless search UI With custom role should have limited navigation menu
  • [job] [logs] FTR Configs #22 / serverless search UI With custom role should have limited navigation menu

Metrics [docs]

✅ unchanged

History

cc @ersin-erdal

@pmuellr
Copy link
Member

pmuellr commented Oct 14, 2024

I'm thinking this may resolve at least some of the alert doc conflicts we are seeing: #190376 - not sure about all of them though.

I'll add a note in that issue that we should check after this is released, to see if the conflicts reduce or are gone for the release that shipped them.

Copy link
Member

@pmuellr pmuellr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

code LGTM

Tested the way I repro'd the behavior in the referenced issue. No longer seeing any of the odd behavior I saw with that.

): Record<string, LegacyAlert<State, Context, ActionGroupIds | RecoveryActionGroupId>>;
type: 'new' | 'active' | 'activeCurrent'
): Record<string, LegacyAlert<State, Context, ActionGroupIds>> | {};
getProcessedAlerts(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting! Overloaded definition of getProcessed() alerts, typed to return different things based on the input!

WIll have to remember this, wonder if it could help with some of our other complex type things ...

@@ -56,7 +57,8 @@ export const getSummarizedAlerts = async <
* yet (the update call uses refresh: false). So we need to rely on the in
* memory alerts to do this.
*/
const newAlertsInMemory = Object.values(alertsClient.getProcessedAlerts('new') || {}) || [];
const newAlertsInMemory: Array<Alert<State, Context, ActionGroupIds>> =
Object.values(alertsClient.getProcessedAlerts('new') || {}) || [];
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The || {}) || [] looks ... busy, and maybe unneeded? I tried removing it, and it seems to compile.

Then since all the rest is typed, you can remove the type declaration, and eventually the import for Alert:

  const newAlertsInMemory = Object.values(alertsClient.getProcessedAlerts('new'));

Seemed to type check ok!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess these were for the tests as well, I will try to remove those fallback.

type: 'new' | 'active' | 'activeCurrent' | 'recovered' | 'recoveredCurrent'
): Record<string, LegacyAlert<State, Context, ActionGroupIds | RecoveryActionGroupId>>;
type: 'new' | 'active' | 'activeCurrent'
): Record<string, LegacyAlert<State, Context, ActionGroupIds>> | {};
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Surprised the {} is needed here. {} should conform to Record<string, **anything**> I think, so seems like overkill. Tried removing them and running type_check and it failed. Weird.

One of the type check failures was in this method:

public getProcessedAlerts(
type: 'new' | 'active' | 'activeCurrent' | 'recovered' | 'recoveredCurrent'
) {
if (Object.hasOwn(this.processedAlerts, type)) {
return this.processedAlerts[type];
}
return {};
}

Perhaps this is just overly complex TS needed to handle overridden methods or something. Doesn't seem like it should cause any problems, just ... weird.

Copy link
Contributor Author

@ersin-erdal ersin-erdal Oct 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is because getProcessedAlerts returns {} if it cannot find the given type in the processedAlerts.
line 241 above.
I was also expecting it to conform with Record but.... :)

public async run(
alerts: Record<string, Alert<State, Context, ActionGroupIds | RecoveryActionGroupId>>
): Promise<RunResult> {
public async run({
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The typing seems weird here, as I would expect in the tests for this, we could just not pass either of these properties (or neither!) and we'd get the defaults. But you can't. I think we'd need a ? after the property names to make that would. Would cut down on a few lines in the test file (above) ... :-)

I guess I'm a little surprised TS or lint didn't complain about "the default value will never be used because the property is not optional", or something.

Copy link
Contributor Author

@ersin-erdal ersin-erdal Oct 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't make them optional because we always pass the active and recovered alerts. getProcessAlerts returns {} incase of no alerts.

The default values (activeCurrentAlerts = {}) are for the unit tests actually. Some tests does not pass any alerts at all and Object.values(activeCurrentAlerts) breaks the execution as activeCurrentAlerts is undefined.

I can remove the default values and use Object.values(activeCurrentAlerts || {}) as well

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think maybe I'm more confused - we have tests passing undefined for some of these properties - are they bypassing the typing somehow? Because it looks to me like they are required.

Seems like we should clean this up, but we can do that in a separate PR - can you create an issue?

const results = await scheduler.getActionsToSchedule({ alerts, throttledSummaryActions });
const results = await scheduler.getActionsToSchedule({
activeCurrentAlerts: alerts,
recoveredCurrentAlerts: {},
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems odd that none of the tests in this file seem to use non-empty recovered alerts, do we need some more tests here? In a follow-on PR?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am adding it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It uses the passed alerts just to calculate the number of alerts. Looks like adding it to a test is enough

const results = await scheduler.getActionsToSchedule({ alerts });
const results = await scheduler.getActionsToSchedule({
activeCurrentAlerts: alerts,
recoveredCurrentAlerts: {},
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another file not using any recovered alerts ...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

system action scheduler doesn't use the alerts, so i removed all of them.

return false;
}

private isValidActionGroup(actionGroup: ActionGroupIds | RecoveryActionGroupId) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice cleanup to extract these to functions!

…rsin-erdal/kibana into 190259-alert-without-scheduled-actions
@elasticmachine
Copy link
Contributor

elasticmachine commented Oct 15, 2024

💔 Build Failed

Failed CI Steps

Test Failures

  • [job] [logs] Jest Tests #11 / Action Scheduler does not schedule actions for the for-each type alerts that are filtered out
  • [job] [logs] Jest Tests #11 / Action Scheduler does not schedule actions for the for-each type alerts that are filtered out
  • [job] [logs] Jest Tests #11 / Action Scheduler does not schedule actions for the summarized alerts that are filtered out (for each alert)
  • [job] [logs] Jest Tests #11 / Action Scheduler does not schedule actions for the summarized alerts that are filtered out (for each alert)
  • [job] [logs] Jest Tests #11 / Action Scheduler does not schedule actions for the summarized alerts that are filtered out (summary of alerts onThrottleInterval)
  • [job] [logs] Jest Tests #11 / Action Scheduler does not schedule actions for the summarized alerts that are filtered out (summary of alerts onThrottleInterval)
  • [job] [logs] Jest Tests #11 / Action Scheduler does not schedule summary actions when there is an active maintenance window
  • [job] [logs] Jest Tests #11 / Action Scheduler does not schedule summary actions when there is an active maintenance window
  • [job] [logs] Jest Tests #11 / Action Scheduler rule url populates the rule.url with start and stop time when available
  • [job] [logs] Jest Tests #11 / Action Scheduler rule url populates the rule.url with start and stop time when available
  • [job] [logs] Jest Tests #11 / Action Scheduler skips summary actions (per rule run) when there is no alerts
  • [job] [logs] Jest Tests #11 / Action Scheduler skips summary actions (per rule run) when there is no alerts
  • [job] [logs] Jest Tests #11 / Action Scheduler System actions does not execute if the connector adapter is not configured
  • [job] [logs] Jest Tests #11 / Action Scheduler System actions does not execute if the connector adapter is not configured
  • [job] [logs] Jest Tests #11 / Action Scheduler System actions triggers system actions with summarization per rule run
  • [job] [logs] Jest Tests #11 / Action Scheduler System actions triggers system actions with summarization per rule run
  • [job] [logs] Jest Tests #11 / Action Scheduler triggers summary actions (custom interval)
  • [job] [logs] Jest Tests #11 / Action Scheduler triggers summary actions (custom interval)
  • [job] [logs] Jest Tests #11 / Action Scheduler triggers summary actions (per rule run)
  • [job] [logs] Jest Tests #11 / Action Scheduler triggers summary actions (per rule run)

Metrics [docs]

✅ unchanged

History

cc @ersin-erdal

@ersin-erdal ersin-erdal merged commit dd25bf8 into elastic:main Oct 15, 2024
37 checks passed
@ersin-erdal ersin-erdal deleted the 190259-alert-without-scheduled-actions branch October 15, 2024 23:40
@kibanamachine
Copy link
Contributor

Starting backport for target branches: 8.x

https://github.com/elastic/kibana/actions/runs/11356081317

kibanamachine pushed a commit to kibanamachine/kibana that referenced this pull request Oct 15, 2024
…tic#195948)

Resolves: elastic#190258

As a result of elastic#190258, we have found out that the odd behaviour happens
when an existing alert is pushed above the max alerts limit by a new
alert.

Scenario:

1. The rule type detects 4 alerts (`alert-1`, `alert-2`, `alert-3`,
`alert-4`),
But reports only the first 3 as the max alerts limit is 3.

2. `alert-2` becomes recovered, therefore the rule type reports 3 active
(`alert-1`, `alert-3`, `alert-4`), 1 recovered (`alert-2`) alert.

3. Alerts `alert-1`,  `alert-3`, `alert-4` are saved in the task state.

4. `alert-2` becomes active again (the others are still active)

5. Rule type reports 3 active alerts (`alert-1`,  `alert-2`, `alert-3`)

6. As a result, the action scheduler tries to schedule actions for
`alert-1`, `alert-3`, `alert-4` as they are the existing alerts.
But, since the rule type didn't report the `alert-4` it has no scheduled
actions, therefore the action scheduler assumes it is recovered and
tries to schedule a recovery action.

This PR changes the actionScheduler to handle active and recovered
alerts separately.
With this change, no action would be scheduled for an alert from
previous run (exists in the task state) and isn't reported by the
ruleTypeExecutor due to max-alerts-limit but it would be kept in the
task state.

(cherry picked from commit dd25bf8)
@kibanamachine
Copy link
Contributor

💚 All backports created successfully

Status Branch Result
8.x

Note: Successful backport PRs will be merged automatically after passing CI.

Questions ?

Please refer to the Backport tool documentation

kibanamachine added a commit that referenced this pull request Oct 16, 2024
…#195948) (#196458)

# Backport

This will backport the following commits from `main` to `8.x`:
- [Skip scheduling actions for the alerts without scheduledActions
(#195948)](#195948)

<!--- Backport version: 9.4.3 -->

### Questions ?
Please refer to the [Backport tool
documentation](https://github.com/sqren/backport)

<!--BACKPORT [{"author":{"name":"Ersin
Erdal","email":"[email protected]"},"sourceCommit":{"committedDate":"2024-10-15T23:40:19Z","message":"Skip
scheduling actions for the alerts without scheduledActions
(#195948)\n\nResolves: #190258\r\n\r\nAs a result of #190258, we have
found out that the odd behaviour happens\r\nwhen an existing alert is
pushed above the max alerts limit by a
new\r\nalert.\r\n\r\nScenario:\r\n\r\n1. The rule type detects 4 alerts
(`alert-1`, `alert-2`, `alert-3`,\r\n`alert-4`),\r\nBut reports only the
first 3 as the max alerts limit is 3.\r\n\r\n2. `alert-2` becomes
recovered, therefore the rule type reports 3 active\r\n(`alert-1`,
`alert-3`, `alert-4`), 1 recovered (`alert-2`) alert.\r\n\r\n3. Alerts
`alert-1`, `alert-3`, `alert-4` are saved in the task state.\r\n\r\n4.
`alert-2` becomes active again (the others are still active)\r\n\r\n5.
Rule type reports 3 active alerts (`alert-1`, `alert-2`,
`alert-3`)\r\n\r\n6. As a result, the action scheduler tries to schedule
actions for\r\n`alert-1`, `alert-3`, `alert-4` as they are the existing
alerts.\r\nBut, since the rule type didn't report the `alert-4` it has
no scheduled\r\nactions, therefore the action scheduler assumes it is
recovered and\r\ntries to schedule a recovery action.\r\n\r\nThis PR
changes the actionScheduler to handle active and recovered\r\nalerts
separately.\r\nWith this change, no action would be scheduled for an
alert from\r\nprevious run (exists in the task state) and isn't reported
by the\r\nruleTypeExecutor due to max-alerts-limit but it would be kept
in the\r\ntask
state.","sha":"dd25bf8807c3ff3982d455f070b6e6c65233662d","branchLabelMapping":{"^v9.0.0$":"main","^v8.16.0$":"8.x","^v(\\d+).(\\d+).\\d+$":"$1.$2"}},"sourcePullRequest":{"labels":["release_note:fix","Team:ResponseOps","v9.0.0","backport:prev-minor"],"title":"Skip
scheduling actions for the alerts without
scheduledActions","number":195948,"url":"https://github.com/elastic/kibana/pull/195948","mergeCommit":{"message":"Skip
scheduling actions for the alerts without scheduledActions
(#195948)\n\nResolves: #190258\r\n\r\nAs a result of #190258, we have
found out that the odd behaviour happens\r\nwhen an existing alert is
pushed above the max alerts limit by a
new\r\nalert.\r\n\r\nScenario:\r\n\r\n1. The rule type detects 4 alerts
(`alert-1`, `alert-2`, `alert-3`,\r\n`alert-4`),\r\nBut reports only the
first 3 as the max alerts limit is 3.\r\n\r\n2. `alert-2` becomes
recovered, therefore the rule type reports 3 active\r\n(`alert-1`,
`alert-3`, `alert-4`), 1 recovered (`alert-2`) alert.\r\n\r\n3. Alerts
`alert-1`, `alert-3`, `alert-4` are saved in the task state.\r\n\r\n4.
`alert-2` becomes active again (the others are still active)\r\n\r\n5.
Rule type reports 3 active alerts (`alert-1`, `alert-2`,
`alert-3`)\r\n\r\n6. As a result, the action scheduler tries to schedule
actions for\r\n`alert-1`, `alert-3`, `alert-4` as they are the existing
alerts.\r\nBut, since the rule type didn't report the `alert-4` it has
no scheduled\r\nactions, therefore the action scheduler assumes it is
recovered and\r\ntries to schedule a recovery action.\r\n\r\nThis PR
changes the actionScheduler to handle active and recovered\r\nalerts
separately.\r\nWith this change, no action would be scheduled for an
alert from\r\nprevious run (exists in the task state) and isn't reported
by the\r\nruleTypeExecutor due to max-alerts-limit but it would be kept
in the\r\ntask
state.","sha":"dd25bf8807c3ff3982d455f070b6e6c65233662d"}},"sourceBranch":"main","suggestedTargetBranches":[],"targetPullRequestStates":[{"branch":"main","label":"v9.0.0","branchLabelMappingKey":"^v9.0.0$","isSourceBranch":true,"state":"MERGED","url":"https://github.com/elastic/kibana/pull/195948","number":195948,"mergeCommit":{"message":"Skip
scheduling actions for the alerts without scheduledActions
(#195948)\n\nResolves: #190258\r\n\r\nAs a result of #190258, we have
found out that the odd behaviour happens\r\nwhen an existing alert is
pushed above the max alerts limit by a
new\r\nalert.\r\n\r\nScenario:\r\n\r\n1. The rule type detects 4 alerts
(`alert-1`, `alert-2`, `alert-3`,\r\n`alert-4`),\r\nBut reports only the
first 3 as the max alerts limit is 3.\r\n\r\n2. `alert-2` becomes
recovered, therefore the rule type reports 3 active\r\n(`alert-1`,
`alert-3`, `alert-4`), 1 recovered (`alert-2`) alert.\r\n\r\n3. Alerts
`alert-1`, `alert-3`, `alert-4` are saved in the task state.\r\n\r\n4.
`alert-2` becomes active again (the others are still active)\r\n\r\n5.
Rule type reports 3 active alerts (`alert-1`, `alert-2`,
`alert-3`)\r\n\r\n6. As a result, the action scheduler tries to schedule
actions for\r\n`alert-1`, `alert-3`, `alert-4` as they are the existing
alerts.\r\nBut, since the rule type didn't report the `alert-4` it has
no scheduled\r\nactions, therefore the action scheduler assumes it is
recovered and\r\ntries to schedule a recovery action.\r\n\r\nThis PR
changes the actionScheduler to handle active and recovered\r\nalerts
separately.\r\nWith this change, no action would be scheduled for an
alert from\r\nprevious run (exists in the task state) and isn't reported
by the\r\nruleTypeExecutor due to max-alerts-limit but it would be kept
in the\r\ntask
state.","sha":"dd25bf8807c3ff3982d455f070b6e6c65233662d"}}]}]
BACKPORT-->

Co-authored-by: Ersin Erdal <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport:prev-minor Backport to (8.x) the previous minor version (i.e. one version back from main) release_note:fix Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) v8.16.0 v9.0.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[ResponseOps][alerting] Investigate odd behaviors when max alerts limit reached
4 participants