[Response Ops][Alerting] Backfill Rule Runs #177622

Merged: 31 commits merged on Apr 25, 2024

@ymao1 ymao1 commented Feb 22, 2024

This is the feature branch that contains the following commits. Each individual PR contains a summary and verification instructions.

@ymao1 ymao1 changed the title [Response Ops][Alerting] Schedule backfill API (merging into feature … [Response Ops][Alerting] Backfill Rule Runs Feb 22, 2024
@ymao1 ymao1 self-assigned this Feb 22, 2024
ymao1 added 2 commits March 8, 2024 12:04
…branch) (#176185)

Towards #174355

Note that this merges into a feature branch

## Summary

Adds API for scheduling backfill jobs. Other APIs such as `get`, `find`
and `delete` will be added in follow-on PRs.

This PR introduces two concepts:
- `ad hoc run` - An execution of a rule over a specific time
range. I kept this terminology generic so that in the future, it could
be used to support other custom rule executions (like preview rule
runs). The parameters used for the `ad hoc run` are specified in a new
encrypted saved object type (`ad_hoc_run_params`). This SO is encrypted
because it stores the API key to use (copied from the rule).
- `backfill job` - A specific type of `ad hoc run` that
schedules a rule run for a historical time range to cover a gap in
execution.

### Schedule Backfill API

* Only allows scheduling for persistent (not lifecycle) rule types -
this is currently all detection rules
* Only allows scheduling for currently enabled rules
* Limits the max number of backfill jobs that can be scheduled at one
time (currently limited to 10)
* Checks that user has the appropriate RBAC permissions for the alerting
rule types they are scheduling backfills for. This only requires `READ`
permission for the rule type, which follows the same permission required
to invoke the `runSoon` API
* Once all permissions and pre-requisites have been validated, the API
creates an `ad_hoc_run_params` saved object that is stored in the
`.kibana_alerting_cases` index
* The task runner that runs the rule using the parameters in
`ad_hoc_run_params` will be added in a follow-on PR.
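
The pre-checks in the list above could be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `BackfillRequest`, `RuleSummary`, `validateBackfillRequests`, and the `isPersistentRuleType` flag are hypothetical names, and the sketch assumes the limit of 10 applies per request. Only the rules themselves (enabled rules only, persistent rule types only, at most 10 backfills) come from the description. RBAC checks are left out.

```typescript
// Hypothetical shapes, for illustration only.
interface BackfillRequest {
  rule_id: string;
  start: string; // ISO 8601 timestamp
  end: string; // ISO 8601 timestamp
}

interface RuleSummary {
  id: string;
  enabled: boolean;
  // Assumed flag: true for persistent (non-lifecycle) rule types.
  isPersistentRuleType: boolean;
}

// Limit stated in the PR description; assumed here to apply per request.
const MAX_BACKFILLS_PER_REQUEST = 10;

// Returns a list of human-readable validation errors (empty when valid).
function validateBackfillRequests(
  requests: BackfillRequest[],
  rules: Map<string, RuleSummary>
): string[] {
  const errors: string[] = [];
  if (requests.length > MAX_BACKFILLS_PER_REQUEST) {
    errors.push(`Cannot schedule more than ${MAX_BACKFILLS_PER_REQUEST} backfills at once`);
  }
  for (const req of requests) {
    const rule = rules.get(req.rule_id);
    if (!rule) {
      errors.push(`Rule ${req.rule_id} not found`);
      continue;
    }
    if (!rule.enabled) {
      errors.push(`Rule ${req.rule_id} is disabled`);
    }
    if (!rule.isPersistentRuleType) {
      errors.push(`Rule ${req.rule_id} is not a persistent rule type`);
    }
    // Date.parse returns NaN for invalid input, which also fails this check.
    if (!(Date.parse(req.start) < Date.parse(req.end))) {
      errors.push(`Backfill for rule ${req.rule_id} must have start before end`);
    }
  }
  return errors;
}
```

Collecting errors instead of throwing on the first one is a design choice of this sketch; it lets a caller report every rejected rule in a bulk request at once.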

**Sample Request**
```
POST /internal/alerting/rules/backfill/_schedule
[
  {
    "rule_id": "abc",
    "start": "2023-12-30T12:00:00.000Z",
    "end": "2024-01-01T12:00:00.000Z"
  }
]
```

This would create an `ad_hoc_run_params` saved object that looks like

```
{
  "apiKeyId": "<apiKeyId>",
  "apiKeyToUse": "<apiKey>", // copied from the decrypted rule and then re-encrypted
  "createdAt": "2024-01-30T00:00:00.000Z",
  "duration": "12h", // uses the same schedule interval as the rule
  "enabled": false,
  "end": "2024-01-01T12:00:00.000Z",
  "rule": { // copied from the rule
    "name": "my rule name",
    "tags": ["foo"],
    "alertTypeId": "myType",
    "params": {},
    "apiKeyOwner": "user",
    "apiKeyCreatedByUser": false,
    "consumer": "myApp",
    "enabled": true,
    "schedule": {
      "interval": "12h"
    },
    "createdBy": "user",
    "updatedBy": "user",
    "createdAt": "2019-02-12T21:01:22.479Z",
    "updatedAt": "2019-02-12T21:01:22.479Z",
    "revision": 0
  },
  "spaceId": "default",
  "start": "2023-12-30T12:00:00.000Z",
  "status": "pending",
  "schedule": [
    { "interval": "12h", "runAt": "2023-12-31T00:00:00.000Z", "status": "pending" },
    { "interval": "12h", "runAt": "2023-12-31T12:00:00.000Z", "status": "pending" },
    { "interval": "12h", "runAt": "2024-01-01T00:00:00.000Z", "status": "pending" },
    { "interval": "12h", "runAt": "2024-01-01T12:00:00.000Z", "status": "pending" }
  ]
}
```
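
The `schedule` array in the sample can be reproduced by stepping from `start` to `end` in increments of the rule interval, with each `runAt` marking the end of one interval. The sketch below is inferred from the sample only: `parseIntervalMs` and `buildSchedule` are hypothetical helpers, not functions from this PR, and the behavior when the range is not a whole multiple of the interval is an assumption (the remainder is simply dropped here).

```typescript
interface ScheduleEntry {
  interval: string;
  runAt: string;
  status: 'pending';
}

// Parses intervals such as "30s", "5m", "12h" or "1d" into milliseconds.
function parseIntervalMs(interval: string): number {
  const match = /^(\d+)([smhd])$/.exec(interval);
  if (!match) {
    throw new Error(`Invalid interval: ${interval}`);
  }
  const unitMs: Record<string, number> = { s: 1000, m: 60000, h: 3600000, d: 86400000 };
  const ms = Number(match[1]) * unitMs[match[2]];
  if (ms <= 0) {
    throw new Error(`Interval must be positive: ${interval}`);
  }
  return ms;
}

// Builds one pending entry per interval; runAt is the end of each interval.
function buildSchedule(start: string, end: string, interval: string): ScheduleEntry[] {
  const stepMs = parseIntervalMs(interval);
  const endMs = Date.parse(end);
  const entries: ScheduleEntry[] = [];
  for (let runAt = Date.parse(start) + stepMs; runAt <= endMs; runAt += stepMs) {
    entries.push({ interval, runAt: new Date(runAt).toISOString(), status: 'pending' });
  }
  return entries;
}
```

With the sample request (`start` of `2023-12-30T12:00:00.000Z`, `end` of `2024-01-01T12:00:00.000Z`, interval `12h`), this yields the four `runAt` values shown in the saved object.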
@ymao1 ymao1 force-pushed the alerting/backfill-rule-runs branch from 843c731 to 33d57e6 Compare March 8, 2024 17:06
ymao1 and others added 18 commits March 11, 2024 08:44
…e branch) (#177640)

Resolves #174358

Note that this merges into a feature branch

## Summary

Adds task runner for ad hoc runs of rules over an arbitrary time
interval. This PR:
1. Updates the `BackfillClient` to create a task
(`type:ad_hoc_run-backfill`) for each backfill job scheduled via the
[schedule API](#176185)
2. Creates an `AdHocTaskRunner` to run these tasks. This task runner
does the following:
   a. Loads and decrypts the `AdHocRun` saved object.
   b. Determines which schedule entry from the saved object to run by
   looking for the next `PENDING` entry.
   c. Uses the `runAt` time in the schedule as the `startedAt` time passed
   into the executors and to determine the time range returned by the
   `getTimeRange` function.
   d. Creates an `execute-backfill` event log doc that contains the ID of
   the `AdHocRun` saved object and the `runAt` and interval for the ad hoc
   run.
   e. Updates the schedule entry in the `AdHocRun` saved object to reflect
   the outcome of the execution, either `success`, `timeout` or `error`.
3. Alerts created via the ad hoc task runner use the following
timestamps:
   - `@timestamp` - set to the backfill `runAt` time
   - `kibana.alert.start` - set to the backfill `runAt` time
   - `kibana.alert.rule.execution.timestamp` - a new field that
   reflects the actual time the backfill was run. For real-time rule runs,
   this timestamp should match `@timestamp`.
4. When all schedule entries for a backfill have been run, either
successfully or unsuccessfully, the `ad_hoc_run-backfill` task and the
`AdHocRun` saved object will be deleted so that finished backfills do
not leave tasks and saved objects behind. We can use the
event log entries to get the status of each backfill run, similar to
what we do with action runs. In the future, if we introduce ways to retry
specific runs, we can change the logic to leave behind the task and/or
SO where it makes sense.
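
The per-task flow described above (pick the next pending entry, run it with `runAt` as `startedAt`, record the outcome, clean up once nothing is pending) could be sketched like this. It is an in-memory illustration under stated assumptions, not the `AdHocTaskRunner` itself: `runNextEntry`, `alertTimestamps`, the `Executor` signature, and the lowercase `pending` status value are hypothetical, and saved object loading, decryption, and event logging are omitted.

```typescript
type EntryStatus = 'pending' | 'success' | 'timeout' | 'error';

interface ScheduleEntry {
  interval: string;
  runAt: string;
  status: EntryStatus;
}

interface AdHocRun {
  id: string;
  schedule: ScheduleEntry[];
}

// Hypothetical executor: receives startedAt and reports the outcome.
type Executor = (startedAt: Date) => Exclude<EntryStatus, 'pending'>;

// Runs the next pending entry (if any) and records its outcome.
// Returns true when no pending entries remain, which is the point at
// which the description says the task and saved object are deleted.
function runNextEntry(run: AdHocRun, execute: Executor): boolean {
  const next = run.schedule.find((entry) => entry.status === 'pending');
  if (next) {
    try {
      // The schedule's runAt is used as the startedAt time for the run.
      next.status = execute(new Date(next.runAt));
    } catch {
      next.status = 'error';
    }
  }
  return run.schedule.every((entry) => entry.status !== 'pending');
}

// Timestamp fields for an alert created by a backfill run, as listed above.
function alertTimestamps(runAt: string, actualRunTime: Date): Record<string, string> {
  return {
    '@timestamp': runAt,
    'kibana.alert.start': runAt,
    'kibana.alert.rule.execution.timestamp': actualRunTime.toISOString(),
  };
}
```

Treating a thrown error as an `error` outcome keeps the sketch moving to the next entry on the following run; how the real task runner handles failures is not covered by the description beyond the three outcome values.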

## To Verify
- Index some documents with timestamps in the past.
- Create a detection rule that queries over the index and grab the ID of
the rule. I typically choose a long schedule interval like `12h` so that
the actual rule only runs once during testing.
- Use the schedule backfill API to schedule a backfill. Pick a long
backfill time range so it doesn't run too fast

```
POST /internal/alerting/rules/backfill/_schedule
[
  {
    "rule_id": "<ruleID>",
    "start": "2023-12-30T12:00:00.000Z",
    "end": "2024-01-01T12:00:00.000Z"
  }
]
```

- Verify that a task was created for this backfill. The task should use
the same ID as the backfill
- Query for the `AdHocRun` SO in Dev Tools. You should be able to see
the schedule entries changing status as the backfill executions are
fulfilled.
- Verify that the appropriate event log docs are written for the
backfills. There should be one `execute-backfill` event written for each
schedule entry
- Verify that alerts are found and the timestamp fields for these alerts
are populated correctly.
…feature branch) (#179975)

Resolves #174355

## Summary

Adds `get`, `find`, `delete` APIs for backfills. 

### GET
`GET kbn:/internal/alerting/rules/backfill/{backfillId}`

Returns backfill information by ID

### FIND
`POST kbn:/internal/alerting/rules/backfill/_find`

**Query parameters** (at least one must be specified; using multiple
implies `AND`ing the conditions)
- `ruleIds` - Returns any scheduled backfills for the given rule IDs
- `start` - Returns any scheduled backfills where the start of the
backfill time range `>= start`
- `end` - Returns any scheduled backfills where the end of the backfill
time range `<= end`
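
The `AND` semantics of the find parameters can be illustrated with an in-memory filter. This is a sketch only: `Backfill`, `FindParams`, and `findBackfills` are hypothetical names, and the real API queries saved objects rather than filtering an array.

```typescript
// Hypothetical shapes, for illustration only.
interface Backfill {
  id: string;
  ruleId: string;
  start: string; // ISO 8601 timestamp
  end: string; // ISO 8601 timestamp
}

interface FindParams {
  ruleIds?: string[];
  start?: string;
  end?: string;
}

// Returns the backfills matching every supplied condition.
function findBackfills(backfills: Backfill[], params: FindParams): Backfill[] {
  const { ruleIds, start, end } = params;
  if (ruleIds === undefined && start === undefined && end === undefined) {
    throw new Error('At least one of ruleIds, start or end must be specified');
  }
  return backfills.filter(
    (b) =>
      // Omitted parameters do not restrict the result.
      (ruleIds === undefined || ruleIds.indexOf(b.ruleId) !== -1) &&
      (start === undefined || Date.parse(b.start) >= Date.parse(start)) &&
      (end === undefined || Date.parse(b.end) <= Date.parse(end))
  );
}
```

For example, passing both `ruleIds` and `start` returns only backfills that belong to one of the given rules and whose time range starts at or after `start`.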

### DELETE
`DELETE kbn:/internal/alerting/rules/backfill/{backfillId}`

Deletes the specified backfill along with the associated task. Note that
the task is created in [this
PR](#177640), so the functional test will need
some adjustments after that PR is merged to
the feature branch.

---------

Co-authored-by: kibanamachine <[email protected]>
@ymao1 ymao1 added the release_note:skip Skip the PR/issue when compiling release notes label Apr 23, 2024
@ymao1 ymao1 marked this pull request as ready for review April 23, 2024 01:57
@ymao1 ymao1 requested review from a team as code owners April 23, 2024 01:57
@ymao1 ymao1 requested a review from dhurley14 April 23, 2024 01:57
@elasticmachine

Pinging @elastic/response-ops (Team:ResponseOps)

@jeramysoucy jeramysoucy left a comment

Kibana Security changes LGTM! Just one comment regarding the new ESO AAD before I approve.

Comment on lines 92 to 99
export const AdHocRunAttributesIncludedInAAD = [
  'enabled',
  'start',
  'duration',
  'createdAt',
  'rule',
  'spaceId',
];

Just FYI - once these attributes are included in AAD, they cannot be removed due to the limitations of zero-downtime upgrades in serverless (at least for the foreseeable future). I just wanted to double check that each one was considered preferred/necessary to include in AAD before approving the PR.

Contributor Author

Thanks for flagging @jeramysoucy! I discussed it with the team and we're going to start with the minimum fields that make sense at this time. Updated in 23cda6d

If we want to include additional fields in the future, what is the process for that?

@ymao1 Unfortunately, adding or removing an existing attribute to/from AAD is not supported at this time. I wish I had a better answer than that. New attributes can be added to AAD, but only before they are utilized/populated:

The addition will require 2 serverless release stages.
Release 1: Add the field to attributesToIncludeInAAD in the ESO type registration. Do not yet use/populate the new field.
Release 2: Begin using the new field. Implement model version change to backfill data as needed.

Adding new nested fields of attributes that are already included in AAD is handled automatically.

We have a backlog task to document developer guidelines for AAD, as it is not straightforward. Additionally, our team has a long-term goal of supporting any AAD change in serverless, but this work is not on our roadmap at this time.

For now, the best approach is to be conservative with which attributes are included in AAD. You can always tag the Kibana Security team for advice or input when making ESO changes. It is our intention that down the road we can make ESO management much easier for developers.

@botelastic botelastic bot added the Team:obs-ux-management Observability Management User Experience Team label Apr 23, 2024
@elasticmachine

Pinging @elastic/obs-ux-management-team (Team:obs-ux-management)

@jloleysens jloleysens left a comment

Core and SO changes LGTM

@pmuellr pmuellr self-requested a review April 23, 2024 14:23
@dhurley14 dhurley14 requested a review from nkhristinin April 23, 2024 14:54
@ymao1 ymao1 requested a review from marshallmain April 23, 2024 15:00
@ymao1 ymao1 added ci:cloud-deploy Create or update a Cloud deployment and removed ci:cloud-deploy Create or update a Cloud deployment labels Apr 23, 2024
@kibana-ci

💚 Build Succeeded

Metrics [docs]

Public APIs missing comments

Total count of every public API that lacks a comment. Target amount is 0. Run node scripts/build_api_docs --plugin [yourplugin] --stats comments for more detailed information.

| id | before | after | diff |
| --- | --- | --- | --- |
| @kbn/core-saved-objects-base-server-internal | 182 | 183 | +1 |
| @kbn/rule-data-utils | 121 | 122 | +1 |
| alerting | 826 | 828 | +2 |
| ruleRegistry | 245 | 247 | +2 |
| total | | | +6 |

Async chunks

Total size of all lazy-loaded chunks that will be downloaded as the user navigates the app

| id | before | after | diff |
| --- | --- | --- | --- |
| apm | 3.2MB | 3.2MB | +41.0B |
| cases | 475.8KB | 475.9KB | +41.0B |
| infra | 1.5MB | 1.5MB | +41.0B |
| observability | 287.3KB | 287.3KB | +41.0B |
| securitySolution | 17.3MB | 17.3MB | +227.0B |
| slo | 722.9KB | 722.9KB | +41.0B |
| triggersActionsUi | 1.6MB | 1.6MB | +41.0B |
| total | | | +473.0B |

Public APIs missing exports

Total count of every type that is part of your API that should be exported but is not. This will cause broken links in the API documentation system. Target amount is 0. Run node scripts/build_api_docs --plugin [yourplugin] --stats exports for more detailed information.

| id | before | after | diff |
| --- | --- | --- | --- |
| alerting | 54 | 55 | +1 |

Page load bundle

Size of the bundles that are downloaded on every page load. Target size is below 100kb

| id | before | after | diff |
| --- | --- | --- | --- |
| apm | 34.5KB | 34.6KB | +63.0B |
| cases | 153.0KB | 153.1KB | +63.0B |
| infra | 102.4KB | 102.4KB | +63.0B |
| observability | 150.7KB | 150.7KB | +63.0B |
| slo | 22.0KB | 22.1KB | +63.0B |
| triggersActionsUi | 120.7KB | 120.7KB | +63.0B |
| total | | | +378.0B |

Unknown metric groups

API count

| id | before | after | diff |
| --- | --- | --- | --- |
| @kbn/core-saved-objects-base-server-internal | 225 | 226 | +1 |
| @kbn/rule-data-utils | 124 | 125 | +1 |
| alerting | 858 | 860 | +2 |
| ruleRegistry | 274 | 276 | +2 |
| total | | | +6 |

ESLint disabled in files

| id | before | after | diff |
| --- | --- | --- | --- |
| alerting | 2 | 4 | +2 |

ESLint disabled line counts

| id | before | after | diff |
| --- | --- | --- | --- |
| alerting | 92 | 93 | +1 |

Total ESLint disabled count

| id | before | after | diff |
| --- | --- | --- | --- |
| alerting | 94 | 97 | +3 |

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

cc @ymao1

@marshallmain marshallmain left a comment

This is going to be awesome!

@kdelemme kdelemme left a comment

code change LGTM

@ymao1 ymao1 merged commit ee1552f into main Apr 25, 2024
36 checks passed
@ymao1 ymao1 deleted the alerting/backfill-rule-runs branch April 25, 2024 19:36
@kibanamachine kibanamachine added the backport:skip This commit does not require backporting label Apr 25, 2024
Labels
backport:skip This commit does not require backporting Feature:Alerting release_note:skip Skip the PR/issue when compiling release notes Team:obs-ux-management Observability Management User Experience Team Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) v8.15.0
9 participants