Skip scheduling actions for the alerts without scheduledActions #195948

ersin-erdal · 2024-10-11T15:22:04Z

Resolves: #190258

As a result of #190258, we have found out that the odd behaviour happens when an existing alert is pushed above the max alerts limit by a new alert.

Scenario:

1. The rule type detects 4 alerts (alert-1, alert-2, alert-3, alert-4), but reports only the first 3 because the max alerts limit is 3.
2. alert-2 recovers, so the rule type reports 3 active alerts (alert-1, alert-3, alert-4) and 1 recovered alert (alert-2).
3. Alerts alert-1, alert-3, alert-4 are saved in the task state.
4. alert-2 becomes active again (the others are still active).
5. The rule type reports 3 active alerts (alert-1, alert-2, alert-3).
6. As a result, the action scheduler tries to schedule actions for alert-1, alert-3, alert-4, as they are the existing alerts. But since the rule type didn't report alert-4, it has no scheduled actions, so the action scheduler assumes it is recovered and tries to schedule a recovery action.

This PR changes the actionScheduler to handle active and recovered alerts separately.
With this change, no action is scheduled for an alert that exists in the task state from a previous run but isn't reported by the ruleTypeExecutor because of the max alerts limit. The alert is still kept in the task state.
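
For illustration only, here is a minimal sketch of the idea, not the actual actionScheduler code. The types `SketchAlert` and `SketchAction`, the `scheduledActionGroup` field, and the function body are all hypothetical simplifications. The sketch also assumes that an alert carried over from the task state shows up among the active alerts without a scheduled action group.

```typescript
// Hypothetical, simplified shapes for illustration; not the Kibana types.
interface SketchAlert {
  id: string;
  // Assumption: set only when the rule type executor reported the alert in this run.
  scheduledActionGroup?: string;
}

interface SketchAction {
  alertId: string;
  actionGroup: string;
}

function getActionsToSchedule(
  activeCurrentAlerts: Record<string, SketchAlert>,
  recoveredCurrentAlerts: Record<string, SketchAlert>
): SketchAction[] {
  const actions: SketchAction[] = [];

  for (const alert of Object.values(activeCurrentAlerts)) {
    // An alert kept from the task state but not reported in this run has no
    // scheduled action group: skip it instead of treating it as recovered.
    if (alert.scheduledActionGroup === undefined) continue;
    actions.push({ alertId: alert.id, actionGroup: alert.scheduledActionGroup });
  }

  // Recovery actions are scheduled only for alerts that are explicitly recovered.
  for (const alert of Object.values(recoveredCurrentAlerts)) {
    actions.push({ alertId: alert.id, actionGroup: 'recovered' });
  }

  return actions;
}

// Last step of the scenario above: alert-4 is known from the task state
// but was not reported because of the max alerts limit.
const scheduled = getActionsToSchedule(
  {
    'alert-1': { id: 'alert-1', scheduledActionGroup: 'default' },
    'alert-2': { id: 'alert-2', scheduledActionGroup: 'default' },
    'alert-3': { id: 'alert-3', scheduledActionGroup: 'default' },
    'alert-4': { id: 'alert-4' },
  },
  {}
);
console.log(scheduled.map((action) => action.alertId));
```

In this sketch alert-4 produces neither a regular action nor a recovery action, which is the behaviour the description asks for.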

elasticmachine · 2024-10-13T01:31:30Z

💔 Build Failed

Buildkite Build
Commit: 6322237

Failed CI Steps

Test Failures

[job] [logs] FTR Configs #22 / serverless search UI With custom role should have limited navigation menu

Metrics [docs]

✅ unchanged

History

cc @ersin-erdal

pmuellr · 2024-10-14T17:32:14Z

I'm thinking this may resolve at least some of the alert doc conflicts we are seeing: #190376 - not sure about all of them though.

I'll add a note in that issue that we should check after this is released, to see if the conflicts reduce or are gone for the release that shipped them.

pmuellr

code LGTM

Tested the way I repro'd the behavior in the referenced issue. No longer seeing any of the odd behavior I saw with that.

pmuellr · 2024-10-14T16:42:26Z

x-pack/plugins/alerting/server/alerts_client/types.ts

-  ): Record<string, LegacyAlert<State, Context, ActionGroupIds | RecoveryActionGroupId>>;
+    type: 'new' | 'active' | 'activeCurrent'
+  ): Record<string, LegacyAlert<State, Context, ActionGroupIds>> | {};
+  getProcessedAlerts(


Interesting! An overloaded definition of getProcessedAlerts(), typed to return different things based on the input!

Will have to remember this; I wonder if it could help with some of our other complex type things ...
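
As a standalone illustration of the overload pattern mentioned here, the sketch below uses hypothetical, simplified types (`ActiveAlert`, `RecoveredAlert`, `SketchAlertsClient`) in place of the real LegacyAlert generics. It only shows how TypeScript picks a return type from the literal passed in.

```typescript
// Hypothetical, simplified types; the real LegacyAlert generics are omitted.
interface ActiveAlert {
  id: string;
  actionGroup: 'default';
}
interface RecoveredAlert {
  id: string;
  actionGroup: 'recovered';
}

type ActiveKind = 'new' | 'active' | 'activeCurrent';
type RecoveredKind = 'recovered' | 'recoveredCurrent';

interface ProcessedAlerts {
  new: Record<string, ActiveAlert>;
  active: Record<string, ActiveAlert>;
  activeCurrent: Record<string, ActiveAlert>;
  recovered: Record<string, RecoveredAlert>;
  recoveredCurrent: Record<string, RecoveredAlert>;
}

class SketchAlertsClient {
  private readonly processedAlerts: ProcessedAlerts = {
    new: {},
    active: {},
    activeCurrent: { 'alert-1': { id: 'alert-1', actionGroup: 'default' } },
    recovered: {},
    recoveredCurrent: { 'alert-2': { id: 'alert-2', actionGroup: 'recovered' } },
  };

  // Overload signatures: the return type depends on the literal passed in.
  getProcessedAlerts(type: ActiveKind): Record<string, ActiveAlert>;
  getProcessedAlerts(type: RecoveredKind): Record<string, RecoveredAlert>;
  // Implementation signature: callers only see the two overloads above.
  getProcessedAlerts(
    type: ActiveKind | RecoveredKind
  ): Record<string, ActiveAlert | RecoveredAlert> {
    return this.processedAlerts[type];
  }
}

const client = new SketchAlertsClient();
// Typed as Record<string, ActiveAlert>.
const active = client.getProcessedAlerts('activeCurrent');
// Typed as Record<string, RecoveredAlert>.
const recovered = client.getProcessedAlerts('recoveredCurrent');
console.log(Object.keys(active), Object.keys(recovered));
```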

pmuellr · 2024-10-14T16:57:16Z

x-pack/plugins/alerting/server/task_runner/action_scheduler/lib/get_summarized_alerts.ts

@@ -56,7 +57,8 @@ export const getSummarizedAlerts = async <
   * yet (the update call uses refresh: false). So we need to rely on the in
   * memory alerts to do this.
   */
-  const newAlertsInMemory = Object.values(alertsClient.getProcessedAlerts('new') || {}) || [];
+  const newAlertsInMemory: Array<Alert<State, Context, ActionGroupIds>> =
+    Object.values(alertsClient.getProcessedAlerts('new') || {}) || [];


The || {}) || [] looks ... busy, and maybe unneeded? I tried removing it, and it seems to compile.

Then since all the rest is typed, you can remove the type declaration, and eventually the import for Alert:

const newAlertsInMemory = Object.values(alertsClient.getProcessedAlerts('new'));

Seemed to type check ok!

I guess these were for the tests as well. I will try to remove those fallbacks.
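
A small sketch of why the two fallbacks behave differently. The `NewAlerts` type and both sample values are hypothetical stand-ins for whatever getProcessedAlerts('new') returns; the undefined case is an assumption about a loosely configured test mock, not something confirmed in this thread.

```typescript
// Hypothetical stand-ins for the values getProcessedAlerts('new') might return.
type NewAlerts = Record<string, { id: string }>;

const noAlerts: NewAlerts = {};
// Assumption: an unconfigured test mock could return undefined.
const fromLooseMock = undefined as NewAlerts | undefined;

// Object.values always returns an array, and every array (even an empty one)
// is truthy, so the trailing `|| []` can never take effect.
const withOuterFallback = Object.values(noAlerts) || [];

// The inner `|| {}` only matters when the call can return undefined or null.
const withInnerFallback = Object.values(fromLooseMock || {});

console.log(withOuterFallback.length, withInnerFallback.length);
```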

pmuellr · 2024-10-14T17:16:08Z

x-pack/plugins/alerting/server/alerts_client/types.ts

-    type: 'new' | 'active' | 'activeCurrent' | 'recovered' | 'recoveredCurrent'
-  ): Record<string, LegacyAlert<State, Context, ActionGroupIds | RecoveryActionGroupId>>;
+    type: 'new' | 'active' | 'activeCurrent'
+  ): Record<string, LegacyAlert<State, Context, ActionGroupIds>> | {};


Surprised the {} is needed here. {} should conform to Record<string, **anything**> I think, so seems like overkill. Tried removing them and running type_check and it failed. Weird.

One of the type check failures was in this method:

kibana/x-pack/plugins/alerting/server/alerts_client/legacy_alerts_client.ts

Lines 234 to 242 in 424ffba

  public getProcessedAlerts(
    type: 'new' | 'active' | 'activeCurrent' | 'recovered' | 'recoveredCurrent'
  ) {
    if (Object.hasOwn(this.processedAlerts, type)) {
      return this.processedAlerts[type];
    }

    return {};
  }

Perhaps this is just overly complex TS needed to handle overridden methods or something. Doesn't seem like it should cause any problems, just ... weird.

It is because getProcessedAlerts returns {} if it cannot find the given type in processedAlerts (line 241 above).
I was also expecting it to conform with Record but.... :)

pmuellr · 2024-10-14T18:15:35Z

x-pack/plugins/alerting/server/task_runner/action_scheduler/action_scheduler.ts

-  public async run(
-    alerts: Record<string, Alert<State, Context, ActionGroupIds | RecoveryActionGroupId>>
-  ): Promise<RunResult> {
+  public async run({


The typing seems weird here, as I would expect that in the tests for this, we could pass just one of these properties (or neither!) and we'd get the defaults. But you can't. I think we'd need a ? after the property names to make that work. It would cut down on a few lines in the test file (above) ... :-)

I guess I'm a little surprised TS or lint didn't complain about "the default value will never be used because the property is not optional", or something.

I didn't make them optional because we always pass the active and recovered alerts. getProcessedAlerts returns {} in case of no alerts.

The default values (activeCurrentAlerts = {}) are actually for the unit tests. Some tests do not pass any alerts at all, and Object.values(activeCurrentAlerts) breaks the execution because activeCurrentAlerts is undefined.

I can remove the default values and use Object.values(activeCurrentAlerts || {}) as well.

I think maybe I'm more confused - we have tests passing undefined for some of these properties - are they bypassing the typing somehow? Because it looks to me like they are required.

Seems like we should clean this up, but we can do that in a separate PR - can you create an issue?
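
To make the difference discussed in this thread concrete, here is a hedged sketch with hypothetical function names (`runRequired`, `runOptional`) and a simplified `AlertMap` type. It is not the actual `run` signature. The cast at the end is only one possible way a test could bypass the typing; whether the existing tests do this is not confirmed here.

```typescript
type AlertMap = Record<string, { id: string }>;

interface RequiredParams {
  activeCurrentAlerts: AlertMap;
  recoveredCurrentAlerts: AlertMap;
}

// Required properties: the defaults only apply when a property is undefined,
// which the type forbids, so well-typed callers never reach them.
function runRequired({
  activeCurrentAlerts = {},
  recoveredCurrentAlerts = {},
}: RequiredParams): number {
  return Object.values(activeCurrentAlerts).length + Object.values(recoveredCurrentAlerts).length;
}

// Optional properties (`?`): callers may omit either property, or both.
function runOptional({
  activeCurrentAlerts = {},
  recoveredCurrentAlerts = {},
}: Partial<RequiredParams> = {}): number {
  return Object.values(activeCurrentAlerts).length + Object.values(recoveredCurrentAlerts).length;
}

const none = runOptional();
const onlyActive = runOptional({ activeCurrentAlerts: { 'alert-1': { id: 'alert-1' } } });

// A cast bypasses the required properties, so the defaults are used at runtime.
const viaCast = runRequired({} as RequiredParams);

console.log(none, onlyActive, viaCast);
```

TypeScript accepts a default on a required destructured property without a warning, which matches the surprise expressed above.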

pmuellr · 2024-10-14T18:46:51Z

...ins/alerting/server/task_runner/action_scheduler/schedulers/summary_action_scheduler.test.ts

-      const results = await scheduler.getActionsToSchedule({ alerts, throttledSummaryActions });
+      const results = await scheduler.getActionsToSchedule({
+        activeCurrentAlerts: alerts,
+        recoveredCurrentAlerts: {},


Seems odd that none of the tests in this file seem to use non-empty recovered alerts, do we need some more tests here? In a follow-on PR?

I am adding it.

It uses the passed alerts just to calculate the number of alerts. Looks like adding it to one test is enough.

pmuellr · 2024-10-14T18:48:49Z

...gins/alerting/server/task_runner/action_scheduler/schedulers/system_action_scheduler.test.ts

-      const results = await scheduler.getActionsToSchedule({ alerts });
+      const results = await scheduler.getActionsToSchedule({
+        activeCurrentAlerts: alerts,
+        recoveredCurrentAlerts: {},


Another file not using any recovered alerts ...

The system action scheduler doesn't use the alerts, so I removed all of them.

pmuellr · 2024-10-14T19:10:17Z

...lugins/alerting/server/task_runner/action_scheduler/schedulers/per_alert_action_scheduler.ts

+    return false;
+  }
+
+  private isValidActionGroup(actionGroup: ActionGroupIds | RecoveryActionGroupId) {


nice cleanup to extract these to functions!

elasticmachine · 2024-10-15T20:48:26Z

💔 Build Failed

Buildkite Build
Commit: 5c1f93b

Failed CI Steps

Test Failures

[job] [logs] Jest Tests #11 / Action Scheduler does not schedule actions for the for-each type alerts that are filtered out
[job] [logs] Jest Tests #11 / Action Scheduler does not schedule actions for the summarized alerts that are filtered out (for each alert)
[job] [logs] Jest Tests #11 / Action Scheduler does not schedule actions for the summarized alerts that are filtered out (summary of alerts onThrottleInterval)
[job] [logs] Jest Tests #11 / Action Scheduler does not schedule summary actions when there is an active maintenance window
[job] [logs] Jest Tests #11 / Action Scheduler rule url populates the rule.url with start and stop time when available
[job] [logs] Jest Tests #11 / Action Scheduler skips summary actions (per rule run) when there is no alerts
[job] [logs] Jest Tests #11 / Action Scheduler System actions does not execute if the connector adapter is not configured
[job] [logs] Jest Tests #11 / Action Scheduler System actions triggers system actions with summarization per rule run
[job] [logs] Jest Tests #11 / Action Scheduler triggers summary actions (custom interval)
[job] [logs] Jest Tests #11 / Action Scheduler triggers summary actions (per rule run)

Metrics [docs]

✅ unchanged

History

💛 Build #242101 was flaky e95c3d0
💔 Build #241871 failed 6322237
💔 Build #241867 failed ef2e5d8
💔 Build #241853 failed 8647e71
💔 Build #241850 failed 1cc6438

cc @ersin-erdal

kibanamachine · 2024-10-15T23:40:36Z

Starting backport for target branches: 8.x

https://github.com/elastic/kibana/actions/runs/11356081317

kibanamachine · 2024-10-15T23:45:02Z

💚 All backports created successfully

Status	Branch	Result
✅	8.x

Note: Successful backport PRs will be merged automatically after passing CI.

Questions ?

Please refer to the Backport tool documentation

Skip scheduling actions for the alerts without scheduledActions

8b9bf1c

ersin-erdal added release_note:fix Team:ResponseOps backport:prev-minor labels Oct 11, 2024

ersin-erdal self-assigned this Oct 11, 2024

ersin-erdal mentioned this pull request Oct 11, 2024

Skip action scheduling for the alerts mistakenly reported due to max alerts limit #193995

Closed

fix comment in the test

78f9558

ersin-erdal marked this pull request as ready for review October 11, 2024 15:28

ersin-erdal requested a review from a team as a code owner October 11, 2024 15:28

ersin-erdal and others added 10 commits October 11, 2024 17:29

Merge branch 'main' into 190259-alert-without-scheduled-actions

83c541f

Merge branch 'main' into 190259-alert-without-scheduled-actions

535a4b8

Merge branch 'main' into 190259-alert-without-scheduled-actions

03bd58a

getMaintenanceWindows

64755dd

Merge branch '190259-alert-without-scheduled-actions' of github.com:ersin-erdal/kibana into 190259-alert-without-scheduled-actions

8de638f

revert

1cc6438

Remove duplicated RuleAction

2a3a723

Remove duplicated RuleAction

8647e71

Fix type checks

ef2e5d8

default alerts

6322237

Merge branch 'main' into 190259-alert-without-scheduled-actions

e95c3d0

pmuellr mentioned this pull request Oct 14, 2024

[ResponseOps][Alerting] alert updates not recognizing document conflicts #190376

Open

pmuellr approved these changes Oct 14, 2024

ersin-erdal added 2 commits October 15, 2024 21:16

Add unit test for recovered alert in summarized alert scheduler

ecd9e68

Merge branch '190259-alert-without-scheduled-actions' of github.com:ersin-erdal/kibana into 190259-alert-without-scheduled-actions

5c1f93b

fix unit test

57adea1

ersin-erdal merged commit dd25bf8 into elastic:main Oct 15, 2024
37 checks passed

ersin-erdal deleted the 190259-alert-without-scheduled-actions branch October 15, 2024 23:40

kibanamachine added the v9.0.0 label Oct 15, 2024

kibanamachine mentioned this pull request Oct 15, 2024

[8.x] Skip scheduling actions for the alerts without scheduledActions (#195948) #196458

Merged

kibanamachine added the v8.16.0 label Oct 16, 2024
