[Response Ops][Alerting] Backfill Rule Runs #177622

ymao1 · 2024-02-22T16:45:11Z

This is the feature branch that contains the following commits. Each individual PR contains a summary and verification instructions.

github-actions · 2024-02-22T16:45:25Z

A documentation preview will be available soon.

🔨 Buildkite builds
📚 HTML diff
📙 Preview page

Request a new doc build by commenting

- Rebuild this PR: `run docs-build`
- Rebuild this PR and all Elastic docs: `run docs-build rebuild`

`run docs-build` is much faster than `run docs-build rebuild`. A rebuild should only be needed in rare situations.

If your PR continues to fail for an unknown reason, the doc build pipeline may be broken. Elastic employees can check the pipeline status here.

…branch) (#176185)

Towards #174355

Note that this merges into a feature branch

## Summary

Adds API for scheduling backfill jobs. Other APIs such as `get`, `find` and `delete` will be added in follow-on PRs.

This PR introduces 2 concepts:

- `ad hoc run` - This is an execution of a rule over a specific time range. I kept this terminology generic so that in the future, it could be used to support other custom rule executions (like preview rule runs). The parameters used for the `ad hoc run` are specified in a new encrypted saved object type (`ad_hoc_run_params`). This SO is encrypted because it stores the API key to use (copied from the rule).
- `backfill job` - This is a specific type of `ad hoc run` that schedules a rule run for a historical time range to cover a gap in execution.

### Schedule Backfill API

* Only allows scheduling for persistent (not lifecycle) rule types - this is currently all detection rules
* Only allows scheduling for currently enabled rules
* Limits the max number of backfill jobs that can be scheduled at one time (currently limited to 10)
* Checks that the user has the appropriate RBAC permissions for the alerting rule types they are scheduling backfills for. This only requires `READ` permission for the rule type, which is the same permission required to invoke the `runSoon` API
* Once all permissions and prerequisites have been validated, the API creates an `ad_hoc_run_params` saved object that is stored in the `.kibana_alerting_cases` index
* The task runner that runs the rule using the parameters in `ad_hoc_run_params` will be added in a follow-on PR.

**Sample Request**

```
POST /internal/alerting/rules/backfill/_schedule
[
  {
    "rule_id": "abc",
    "start": "2023-12-30T12:00:00.000Z",
    "end": "2024-01-01T12:00:00.000Z"
  }
]
```

This would create an `ad_hoc_run_params` saved object that looks like

```
{
  "apiKeyId": <apiKeyId>,
  "apiKeyToUse": <apiKey>, // this is copied from the decrypted rule and then re-encrypted
  "createdAt": "2024-01-30T00:00:00.000Z",
  "duration": "12h", // uses the same schedule interval as the rule
  "enabled": false,
  "end": "2024-01-01T12:00:00.000Z",
  "rule": { // copied from the rule
    "name": "my rule name",
    "tags": ["foo"],
    "alertTypeId": "myType",
    "params": {},
    "apiKeyOwner": "user",
    "apiKeyCreatedByUser": false,
    "consumer": "myApp",
    "enabled": true,
    "schedule": { "interval": "12h" },
    "createdBy": "user",
    "updatedBy": "user",
    "createdAt": "2019-02-12T21:01:22.479Z",
    "updatedAt": "2019-02-12T21:01:22.479Z",
    "revision": 0
  },
  "spaceId": "default",
  "start": "2023-12-30T12:00:00.000Z",
  "status": "pending",
  "schedule": [
    { "interval": "12h", "runAt": "2023-12-31T00:00:00.000Z", "status": "pending" },
    { "interval": "12h", "runAt": "2023-12-31T12:00:00.000Z", "status": "pending" },
    { "interval": "12h", "runAt": "2024-01-01T00:00:00.000Z", "status": "pending" },
    { "interval": "12h", "runAt": "2024-01-01T12:00:00.000Z", "status": "pending" }
  ]
}
```
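The `schedule` array in the sample saved object follows a simple pattern: one entry per rule interval, with the first `runAt` one interval after `start` and the last no later than `end`. The sketch below reproduces that pattern. It is only an illustration inferred from the sample; `parseIntervalMs` and `buildSchedule` are hypothetical names, not the functions used in the PR, and only the `s`, `m`, `h`, `d` interval units are assumed.

```typescript
// Hypothetical sketch: derives schedule entries the way the sample object suggests.
// Names and interval parsing are assumptions, not the actual Kibana implementation.
type EntryStatus = "pending" | "success" | "timeout" | "error";

interface ScheduleEntry {
  interval: string;
  runAt: string;
  status: EntryStatus;
}

const UNIT_MS: Record<string, number> = { s: 1000, m: 60000, h: 3600000, d: 86400000 };

// Converts an interval such as "12h" into milliseconds.
function parseIntervalMs(interval: string): number {
  const match = /^(\d+)([smhd])$/.exec(interval);
  if (!match) throw new Error(`Invalid interval: ${interval}`);
  const ms = Number(match[1]) * UNIT_MS[match[2]];
  if (ms <= 0) throw new Error(`Interval must be positive: ${interval}`);
  return ms;
}

// Builds one pending entry per interval, from start + interval up to and including end.
function buildSchedule(start: string, end: string, interval: string): ScheduleEntry[] {
  const stepMs = parseIntervalMs(interval);
  const endMs = Date.parse(end);
  const entries: ScheduleEntry[] = [];
  for (let runAt = Date.parse(start) + stepMs; runAt <= endMs; runAt += stepMs) {
    entries.push({ interval, runAt: new Date(runAt).toISOString(), status: "pending" });
  }
  return entries;
}
```

Calling `buildSchedule("2023-12-30T12:00:00.000Z", "2024-01-01T12:00:00.000Z", "12h")` yields the four `runAt` values shown in the sample saved object.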


…e branch) (#177640)

Resolves #174358

Note that this merges into a feature branch

## Summary

Adds task runner for ad hoc runs of rules over an arbitrary time interval. This PR:

1. Updates the `BackfillClient` to create a task (`type:ad_hoc_run-backfill`) for each backfill job scheduled via the [schedule API](#176185)
2. Creates an `AdHocTaskRunner` to run these tasks. This task runner does the following:
   a. Loads and decrypts the `AdHocRun` saved object
   b. Determines which schedule entry from the saved object to run by looking for the next `PENDING` entry
   c. Uses the `runAt` time in the schedule as the `startedAt` time passed into the executors and to determine the time range returned by the `getTimeRange` function.
   d. Creates an `execute-backfill` event log doc that contains the ID of the `AdHocRun` saved object and the `runAt` and interval for the ad hoc run
   e. Updates the schedule entry in the `AdHocRun` saved object to reflect the outcome of the execution, either `success`, `timeout` or `error`.
3. Alerts created via the ad hoc task runner use the following timestamps:
   - `@timestamp` - this is set to the backfill `runAt` time
   - `kibana.alert.start` - this is set to the backfill `runAt` time
   - `kibana.alert.rule.execution.timestamp` - this is a new field that reflects the actual time the backfill was run. For real-time rule runs, this timestamp should match the `@timestamp` timestamp.
4. When all schedule entries for a backfill have been run, either successfully or unsuccessfully, the `ad_hoc_run-backfill` task will be deleted and the `AdHocRun` saved object will be deleted. We can use the event log entries to get the status of each backfill run, similar to what we do with action runs. We do this cleanup so that completed tasks and saved objects are not left behind. In the future, if we introduce ways to retry specific runs, we can change the logic to leave behind the task and/or SO where it makes sense.

## To Verify

- Index some documents with timestamps in the past.
- Create a detection rule that queries over the index and grab the ID of the rule. I typically choose a long schedule interval like `12h` so that the actual rule only runs once during testing.
- Use the schedule backfill API to schedule a backfill. Pick a long backfill time range so it doesn't run too fast

```
POST /internal/alerting/rules/backfill/_schedule
[
  {
    "rule_id": <ruleID>,
    "start": "2023-12-30T12:00:00.000Z",
    "end": "2024-01-01T12:00:00.000Z"
  }
]
```

- Verify that a task was created for this backfill. The task should use the same ID as the backfill
- Query for the `AdHocRun` SO in Dev Tools. You should be able to see the schedule entries changing status as the backfill executions are fulfilled.
- Verify that the appropriate event log docs are written for the backfills. There should be one `execute-backfill` event written for each schedule entry
- Verify that alerts are found and the timestamp fields for these alerts are populated correctly.
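The schedule bookkeeping described in steps 2b, 2e and 4 can be sketched as three small pure functions: pick the next pending entry, record its outcome, and decide whether the backfill is finished and can be cleaned up. This is a simplified illustration under assumed names (`nextPendingIndex`, `recordOutcome`, `isBackfillFinished`); the real `AdHocTaskRunner` also handles decryption, event logging and task management, none of which is modelled here.

```typescript
// Hypothetical sketch of the schedule bookkeeping; not the actual AdHocTaskRunner code.
type EntryStatus = "pending" | "success" | "timeout" | "error";

interface ScheduleEntry {
  interval: string;
  runAt: string;
  status: EntryStatus;
}

// Step 2b: index of the next entry still waiting to run, or -1 if none remain.
function nextPendingIndex(schedule: readonly ScheduleEntry[]): number {
  return schedule.findIndex((entry) => entry.status === "pending");
}

// Step 2e: returns a copy of the schedule with the outcome recorded on one entry.
function recordOutcome(
  schedule: readonly ScheduleEntry[],
  index: number,
  outcome: Exclude<EntryStatus, "pending">
): ScheduleEntry[] {
  return schedule.map((entry, i) => (i === index ? { ...entry, status: outcome } : entry));
}

// Step 4: the backfill is finished once no entry is pending, whatever the outcomes were.
function isBackfillFinished(schedule: readonly ScheduleEntry[]): boolean {
  return schedule.every((entry) => entry.status !== "pending");
}
```

A runner built on these helpers would loop until `isBackfillFinished` returns `true`, then delete the task and the saved object as described in step 4.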


…feature branch) (#179975)

Resolves #174355

## Summary

Adds `get`, `find`, `delete` APIs for backfills.

### GET

`GET kbn:/internal/alerting/rules/backfill/{backfillId}`

Returns backfill information by ID

### FIND

`POST kbn:/internal/alerting/rules/backfill/_find`

**Query parameters** (at least one must be specified; using multiple implies `AND`ing the conditions)

- `ruleIds` - Returns any scheduled backfills for the given rule IDs
- `start` - Returns any scheduled backfills where the start of the backfill time range `>= start`
- `end` - Returns any scheduled backfills where the end of the backfill time range `<= end`

### DELETE

`DELETE kbn:/internal/alerting/rules/backfill/{backfillId}`

Deletes the specific backfill along with the associated task. Note that the task is created in [this PR](#177640) so there will need to be some adjustments to the functional test after that PR is merged to the feature branch.

---------

Co-authored-by: kibanamachine <[email protected]>
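The FIND filter semantics (at least one parameter required, multiple parameters combined with `AND`) can be illustrated with an in-memory filter. This is a sketch only: `BackfillSummary`, `FindBackfillParams` and `findBackfills` are hypothetical names, and the real API queries saved objects rather than an array.

```typescript
// Hypothetical in-memory model of the FIND semantics; not the actual Kibana implementation.
interface BackfillSummary {
  id: string;
  ruleId: string;
  start: string; // ISO timestamp for the start of the backfill time range
  end: string; // ISO timestamp for the end of the backfill time range
}

interface FindBackfillParams {
  ruleIds?: string[];
  start?: string;
  end?: string;
}

function findBackfills(
  backfills: readonly BackfillSummary[],
  params: FindBackfillParams
): BackfillSummary[] {
  const { ruleIds, start, end } = params;
  if (ruleIds === undefined && start === undefined && end === undefined) {
    throw new Error("At least one of ruleIds, start or end must be specified");
  }
  // Every supplied condition must hold (AND); omitted conditions are ignored.
  return backfills.filter(
    (b) =>
      (ruleIds === undefined || ruleIds.indexOf(b.ruleId) !== -1) &&
      (start === undefined || Date.parse(b.start) >= Date.parse(start)) &&
      (end === undefined || Date.parse(b.end) <= Date.parse(end))
  );
}
```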


elasticmachine · 2024-04-23T01:57:43Z

Pinging @elastic/response-ops (Team:ResponseOps)

jeramysoucy

Kibana Security changes LGTM! Just one comment regarding the new ESO AAD before I approve.

jeramysoucy · 2024-04-23T07:16:49Z

x-pack/plugins/alerting/server/saved_objects/index.ts

+export const AdHocRunAttributesIncludedInAAD = [
+  'enabled',
+  'start',
+  'duration',
+  'createdAt',
+  'rule',
+  'spaceId',
+];


Just FYI - once these attributes are included in AAD, they cannot be removed due to the limitations of zero-downtime upgrades in serverless (at least for the foreseeable future). I just wanted to double check that each one was considered preferred/necessary to include in AAD before approving the PR.

ymao1 · 2024-04-23

Thanks for flagging @jeramysoucy! I discussed it with the team and we're going to start with the minimum fields that make sense at this time. Updated in 23cda6d

If we want to include additional fields in the future, what is the process for that?

jeramysoucy · 2024-04-23

@ymao1 Unfortunately, adding an existing attribute to AAD, or removing one from it, is not supported at this time. I wish I had a better answer than that. New attributes can be added to AAD, but only before they are utilized/populated.

The addition will require 2 serverless release stages:

- Release 1: Add the field to `attributesToIncludeInAAD` in the ESO type registration. Do not yet use/populate the new field.
- Release 2: Begin using the new field. Implement model version change to backfill data as needed.

Adding new nested fields of attributes that are already included in AAD is handled automatically.

We have a backlog task to document developer guidelines for AAD, as it is not straightforward. Additionally, our team has a long-term goal of supporting any AAD change in serverless, but this work is not on our roadmap at this time.

For now, the best approach is to be conservative with which attributes are included in AAD. You can always tag the Kibana Security team for advice or input when making ESO changes. It is our intention that down the road we can make ESO management much easier for developers.
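The constraints described in this thread can be expressed as a small check that a proposed AAD change must pass: nothing already in AAD may be removed, and a new attribute may only be added while it is still unused. The sketch below is purely illustrative; `checkAadChange` is a hypothetical helper and not part of the encrypted saved objects plugin, and the attribute names in the usage note are examples only.

```typescript
// Hypothetical helper that models the AAD rules from the review thread.
// It is not a Kibana API; it only encodes the constraints stated above.
interface AadChangeCheck {
  allowed: boolean;
  reasons: string[];
}

function checkAadChange(
  currentAad: readonly string[],
  proposedAad: readonly string[],
  populatedAttributes: ReadonlySet<string>
): AadChangeCheck {
  const current = new Set(currentAad);
  const proposed = new Set(proposedAad);
  const reasons: string[] = [];

  // Attributes already in AAD cannot be removed.
  current.forEach((attr) => {
    if (!proposed.has(attr)) reasons.push(`"${attr}" is already in AAD and cannot be removed`);
  });

  // New attributes may only be added before they are used/populated.
  proposed.forEach((attr) => {
    if (!current.has(attr) && populatedAttributes.has(attr)) {
      reasons.push(`"${attr}" is already populated and cannot be added to AAD`);
    }
  });

  return { allowed: reasons.length === 0, reasons };
}
```

For example, with a current list of `["rule", "spaceId"]`, proposing `["rule", "spaceId", "newField"]` passes only while `newField` has never been populated, which corresponds to the "Release 1" stage described above.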

elasticmachine · 2024-04-23T07:25:11Z

Pinging @elastic/obs-ux-management-team (Team:obs-ux-management)

jloleysens

Core and SO changes LGTM


kibana-ci · 2024-04-25T13:35:02Z

💚 Build Succeeded

Buildkite Build
Commit: d9afab8

Metrics [docs]

Public APIs missing comments

Total count of every public API that lacks a comment. Target amount is 0. Run `node scripts/build_api_docs --plugin [yourplugin] --stats comments` for more detailed information.

| id | before | after | diff |
| --- | --- | --- | --- |
| `@kbn/core-saved-objects-base-server-internal` | 182 | 183 | +1 |
| `@kbn/rule-data-utils` | 121 | 122 | +1 |
| `alerting` | 826 | 828 | +2 |
| `ruleRegistry` | 245 | 247 | +2 |
| total | | | +6 |

Async chunks

Total size of all lazy-loaded chunks that will be downloaded as the user navigates the app

| id | before | after | diff |
| --- | --- | --- | --- |
| `apm` | 3.2MB | 3.2MB | +41.0B |
| `cases` | 475.8KB | 475.9KB | +41.0B |
| `infra` | 1.5MB | 1.5MB | +41.0B |
| `observability` | 287.3KB | 287.3KB | +41.0B |
| `securitySolution` | 17.3MB | 17.3MB | +227.0B |
| `slo` | 722.9KB | 722.9KB | +41.0B |
| `triggersActionsUi` | 1.6MB | 1.6MB | +41.0B |
| total | | | +473.0B |

Public APIs missing exports

Total count of every type that is part of your API that should be exported but is not. This will cause broken links in the API documentation system. Target amount is 0. Run `node scripts/build_api_docs --plugin [yourplugin] --stats exports` for more detailed information.

| id | before | after | diff |
| --- | --- | --- | --- |
| `alerting` | 54 | 55 | +1 |

Page load bundle

Size of the bundles that are downloaded on every page load. Target size is below 100kb

| id | before | after | diff |
| --- | --- | --- | --- |
| `apm` | 34.5KB | 34.6KB | +63.0B |
| `cases` | 153.0KB | 153.1KB | +63.0B |
| `infra` | 102.4KB | 102.4KB | +63.0B |
| `observability` | 150.7KB | 150.7KB | +63.0B |
| `slo` | 22.0KB | 22.1KB | +63.0B |
| `triggersActionsUi` | 120.7KB | 120.7KB | +63.0B |
| total | | | +378.0B |

Unknown metric groups

API count

| id | before | after | diff |
| --- | --- | --- | --- |
| `@kbn/core-saved-objects-base-server-internal` | 225 | 226 | +1 |
| `@kbn/rule-data-utils` | 124 | 125 | +1 |
| `alerting` | 858 | 860 | +2 |
| `ruleRegistry` | 274 | 276 | +2 |
| total | | | +6 |

ESLint disabled in files

| id | before | after | diff |
| --- | --- | --- | --- |
| `alerting` | 2 | 4 | +2 |

ESLint disabled line counts

| id | before | after | diff |
| --- | --- | --- | --- |
| `alerting` | 92 | 93 | +1 |

Total ESLint disabled count

| id | before | after | diff |
| --- | --- | --- | --- |
| `alerting` | 94 | 97 | +3 |

History

💚 Build #205290 succeeded 23cda6d
💚 Build #205134 succeeded 6fdfee8
💔 Build #204733 failed 8174622
💚 Build #204604 succeeded 0dcc993
💔 Build #204497 failed 4d5950c
💔 Build #204469 failed bcf80b4

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

cc @ymao1

marshallmain

This is going to be awesome!

kdelemme

code change LGTM

ymao1 changed the title ~~[Response Ops][Alerting] Schedule backfill API (merging into feature …~~ [Response Ops][Alerting] Backfill Rule Runs Feb 22, 2024

ymao1 self-assigned this Feb 22, 2024

ymao1 added 2 commits March 8, 2024 12:04

ymao1 force-pushed the alerting/backfill-rule-runs branch from 843c731 to 33d57e6 Compare March 8, 2024 17:06

ymao1 and others added 18 commits March 11, 2024 08:44

Merge branch 'main' of github.com:elastic/kibana into alerting/backfill-rule-runs

f4dac04

Merging in main

bfe78c1

Merge branch 'main' of github.com:elastic/kibana into alerting/backfill-rule-runs

9ba76b3

Merge branch 'main' of github.com:elastic/kibana into alerting/backfill-rule-runs

2d3757e

Fixing bad merge

bc5ccde

Merge branch 'main' of github.com:elastic/kibana into alerting/backfill-rule-runs

a50d4fb

Merge branch 'main' of github.com:elastic/kibana into alerting/backfill-rule-runs

d4fa376

Fixing new jest integration test

d403de0

Merging in main

66fdd71

Merge branch 'main' of github.com:elastic/kibana into alerting/backfill-rule-runs

992232b

Merge branch 'main' of github.com:elastic/kibana into alerting/backfill-rule-runs

2458ce0

Merge branch 'alerting/backfill-rule-runs' of github.com:elastic/kibana into alerting/backfill-rule-runs

3f17b8b

Merge branch 'main' of github.com:elastic/kibana into alerting/backfill-rule-runs

60f6c3f

Merging in main

6ef5d63

Merging in main

333d3cc

Merge branch 'alerting/backfill-rule-runs' of github.com:elastic/kibana into alerting/backfill-rule-runs

b1b2c3e

ymao1 mentioned this pull request Apr 17, 2024

[Response Ops][Alerting] Add serial/synchronized mode to backfills #181072

Open

ymao1 and others added 5 commits April 17, 2024 15:06

Merge branch 'main' of github.com:elastic/kibana into alerting/backfill-rule-runs

bcf80b4

[CI] Auto-commit changed files from 'node scripts/eslint --no-cache --fix'

0572829

Fixing type after merge

239eb65

Merge branch 'main' of github.com:elastic/kibana into alerting/backfill-rule-runs

4d5950c

Merge branch 'main' of github.com:elastic/kibana into alerting/backfill-rule-runs

0dcc993

ymao1 added the release_note:skip Skip the PR/issue when compiling release notes label Apr 23, 2024

ymao1 marked this pull request as ready for review April 23, 2024 01:57

ymao1 requested review from a team as code owners April 23, 2024 01:57

ymao1 requested a review from dhurley14 April 23, 2024 01:57

jeramysoucy reviewed Apr 23, 2024

botelastic bot added the Team:obs-ux-management Observability Management User Experience Team label Apr 23, 2024

jloleysens approved these changes Apr 23, 2024

Reducing number of AAD fields to minimum

23cda6d

jeramysoucy approved these changes Apr 23, 2024

pmuellr self-requested a review April 23, 2024 14:23

dhurley14 requested a review from nkhristinin April 23, 2024 14:54

ymao1 requested a review from marshallmain April 23, 2024 15:00

ymao1 added ci:cloud-deploy Create or update a Cloud deployment and removed ci:cloud-deploy Create or update a Cloud deployment labels Apr 23, 2024

Merge branch 'main' of github.com:elastic/kibana into alerting/backfill-rule-runs

d9afab8

marshallmain approved these changes Apr 25, 2024

kdelemme approved these changes Apr 25, 2024

mikecote approved these changes Apr 25, 2024

ymao1 merged commit ee1552f into main Apr 25, 2024
36 checks passed

ymao1 deleted the alerting/backfill-rule-runs branch April 25, 2024 19:36

kibanamachine added the backport:skip This commit does not require backporting label Apr 25, 2024

maryam-saeidi mentioned this pull request May 23, 2024

[8.14] [Alert table] Fix kibana.alert.rule.execution.timstamp timezone and format (#183905) #183969

Closed
