[Response Ops][Alerting] Backfill Rule Runs #177622
…branch) (#176185)

Towards #174355

Note that this merges into a feature branch

## Summary

Adds API for scheduling backfill jobs. Other APIs such as `get`, `find` and `delete` will be added in follow-on PRs.

This PR introduces 2 concepts:

- `ad hoc run` - This is an execution of a rule over a specific time range. I kept this terminology generic so that in the future, it could be used to support other custom rule executions (like preview rule runs). The parameters used for the `ad hoc run` are specified in a new encrypted saved object type (`ad_hoc_run_params`). This SO is encrypted because it stores the API key to use (copied from the rule).
- `backfill job` - This is a specific type of `ad hoc run` that schedules a rule run for a historical time range to cover a gap in execution.

### Schedule Backfill API

* Only allows scheduling for persistent (not lifecycle) rule types - this is currently all detection rules
* Only allows scheduling for currently enabled rules
* Limits the max number of backfill jobs that can be scheduled at one time (currently limited to 10)
* Checks that the user has the appropriate RBAC permissions for the alerting rule types they are scheduling backfills for. This only requires `READ` permission for the rule type, which is the same permission required to invoke the `runSoon` API
* Once all permissions and prerequisites have been validated, the API creates an `ad_hoc_run_params` saved object that is stored in the `.kibana_alerting_cases` index
* Task runner to run the rule using the parameters in `ad_hoc_run_params` will be added in a follow-on PR.

**Sample Request**

```
POST /internal/alerting/rules/backfill/_schedule
[
  {
    "rule_id": "abc",
    "start": "2023-12-30T12:00:00.000Z",
    "end": "2024-01-01T12:00:00.000Z"
  }
]
```

This would create an `ad_hoc_run_params` saved object that looks like

```
{
  "apiKeyId": <apiKeyId>,
  "apiKeyToUse": <apiKey>, // this is copied from the decrypted rule and then re-encrypted
  "createdAt": "2024-01-30T00:00:00.000Z",
  "duration": "12h", // uses the same schedule interval as the rule
  "enabled": false,
  "end": "2024-01-01T12:00:00.000Z",
  "rule": { // copied from the rule
    "name": "my rule name",
    "tags": ["foo"],
    "alertTypeId": "myType",
    "params": {},
    "apiKeyOwner": "user",
    "apiKeyCreatedByUser": false,
    "consumer": "myApp",
    "enabled": true,
    "schedule": {
      "interval": "12h"
    },
    "createdBy": "user",
    "updatedBy": "user",
    "createdAt": "2019-02-12T21:01:22.479Z",
    "updatedAt": "2019-02-12T21:01:22.479Z",
    "revision": 0
  },
  "spaceId": "default",
  "start": "2023-12-30T12:00:00.000Z",
  "status": "pending",
  "schedule": [
    { "interval": "12h", "runAt": "2023-12-31T00:00:00.000Z", "status": "pending" },
    { "interval": "12h", "runAt": "2023-12-31T12:00:00.000Z", "status": "pending" },
    { "interval": "12h", "runAt": "2024-01-01T00:00:00.000Z", "status": "pending" },
    { "interval": "12h", "runAt": "2024-01-01T12:00:00.000Z", "status": "pending" }
  ]
}
```
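The `schedule` array in the sample saved object above appears to follow from stepping forward from `start` in increments of the rule interval until `end` is reached. The following is a minimal sketch of that arithmetic, not the code from this PR: the names `parseIntervalMs` and `buildBackfillSchedule` are hypothetical, only `s`/`m`/`h`/`d` interval units are handled, and the treatment of a range that is not a whole multiple of the interval (here, the trailing partial step is dropped) is an assumption.

```typescript
interface ScheduleEntry {
  interval: string;
  runAt: string;
  status: 'pending';
}

// Milliseconds per supported unit (assumed subset of interval units).
const UNIT_MS: Record<string, number> = { s: 1000, m: 60000, h: 3600000, d: 86400000 };

// Hypothetical helper: converts an interval such as "12h" to milliseconds.
function parseIntervalMs(interval: string): number {
  const match = /^([1-9]\d*)([smhd])$/.exec(interval);
  if (!match) {
    throw new Error(`Invalid interval: ${interval}`);
  }
  return Number(match[1]) * UNIT_MS[match[2]];
}

// Hypothetical helper: one pending entry per interval step, with the first
// runAt at start + interval and the last runAt no later than end.
function buildBackfillSchedule(start: string, end: string, interval: string): ScheduleEntry[] {
  const stepMs = parseIntervalMs(interval);
  const startMs = Date.parse(start);
  const endMs = Date.parse(end);
  if (Number.isNaN(startMs) || Number.isNaN(endMs) || endMs <= startMs) {
    throw new Error('start and end must be valid dates with start before end');
  }
  const entries: ScheduleEntry[] = [];
  for (let t = startMs + stepMs; t <= endMs; t += stepMs) {
    entries.push({ interval, runAt: new Date(t).toISOString(), status: 'pending' });
  }
  return entries;
}
```

Called with the sample request's `start` and `end` and a `12h` interval, this yields the same four `runAt` values shown in the sample saved object.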
…e branch) (#177640)

Resolves #174358

Note that this merges into a feature branch

## Summary

Adds task runner for ad hoc runs of rules over an arbitrary time interval. This PR:

1. Updates the `BackfillClient` to create a task (`type:ad_hoc_run-backfill`) for each backfill job scheduled via the [schedule API](#176185)
2. Creates an `AdHocTaskRunner` to run these tasks. This task runner does the following:
   a. Loads and decrypts the `AdHocRun` saved object
   b. Determines which schedule entry from the saved object to run by looking for the next `PENDING` entry
   c. Uses the `runAt` time in the schedule as the `startedAt` time passed into the executors and to determine the time range returned by the `getTimeRange` function.
   d. Creates an `execute-backfill` event log doc that contains the ID of the `AdHocRun` saved object and the `runAt` and interval for the ad hoc run
   e. Updates the schedule entry in the `AdHocRun` saved object to reflect the outcome of the execution, either `success`, `timeout` or `error`.
3. Alerts created via the ad hoc task runner use the following timestamps:
   - `@timestamp` - this is set to the backfill `runAt` time
   - `kibana.alert.start` - this is set to the backfill `runAt` time
   - `kibana.alert.rule.execution.timestamp` - this is a new field that reflects the actual time the backfill was run. For real-time rule runs, this timestamp should match the `@timestamp` timestamp.
4. When all schedule entries for a backfill have been run, either successfully or unsuccessfully, the `ad_hoc_run-backfill` task will be deleted and the `AdHocRun` saved object will be deleted. We can use the event log entries to get the status of each backfill run, similar to what we do with action runs. We do this cleanup in order to clean up the tasks and saved objects. In the future, if we introduce ways to retry specific runs, we can change the logic to leave behind the task and/or SO where it makes sense.

## To Verify

- Index some documents with timestamps in the past.
- Create a detection rule that queries over the index and grab the ID of the rule. I typically choose a long schedule interval like `12h` so that the actual rule only runs once during testing.
- Use the schedule backfill API to schedule a backfill. Pick a long backfill time range so it doesn't run too fast

  ```
  POST /internal/alerting/rules/backfill/_schedule
  [
    {
      "rule_id": <ruleID>,
      "start": "2023-12-30T12:00:00.000Z",
      "end": "2024-01-01T12:00:00.000Z"
    }
  ]
  ```

- Verify that a task was created for this backfill. The task should use the same ID as the backfill
- Query for the `AdHocRun` SO in Dev Tools. You should be able to see the schedule entries changing status as the backfill executions are fulfilled.
- Verify that the appropriate event log docs are written for the backfills. There should be one `execute-backfill` event written for each schedule entry
- Verify that alerts are found and the timestamp fields for these alerts are populated correctly.
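The schedule bookkeeping described in the summary above (pick the next pending entry, record the outcome, detect when the backfill is finished, and stamp alerts) can be sketched roughly as follows. This is an illustration under assumptions, not the `AdHocTaskRunner` code: all function and type names are hypothetical, and "next pending" is assumed to mean the first pending entry in array order.

```typescript
type EntryStatus = 'pending' | 'success' | 'timeout' | 'error';

interface AdHocScheduleEntry {
  interval: string;
  runAt: string;
  status: EntryStatus;
}

// Step 2b: index of the next entry to run, or -1 when nothing is pending.
function nextPendingIndex(schedule: AdHocScheduleEntry[]): number {
  return schedule.findIndex((entry) => entry.status === 'pending');
}

// Step 2e: return a copy of the schedule with one entry's outcome recorded.
function recordOutcome(
  schedule: AdHocScheduleEntry[],
  index: number,
  outcome: Exclude<EntryStatus, 'pending'>
): AdHocScheduleEntry[] {
  return schedule.map((entry, i) => (i === index ? { ...entry, status: outcome } : entry));
}

// Step 4: the task and saved object can be cleaned up once no entry is pending.
function isBackfillFinished(schedule: AdHocScheduleEntry[]): boolean {
  return schedule.every((entry) => entry.status !== 'pending');
}

// Step 3: alert timestamps use the backfill runAt, while the new execution
// timestamp field records when the backfill actually ran.
function backfillAlertTimestamps(runAt: string, executedAt: Date): Record<string, string> {
  return {
    '@timestamp': runAt,
    'kibana.alert.start': runAt,
    'kibana.alert.rule.execution.timestamp': executedAt.toISOString(),
  };
}
```

Returning a new array from `recordOutcome` rather than mutating in place is a choice made for the sketch so that the previous state stays available for comparison; the real runner persists the update to the saved object.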
…na into alerting/backfill-rule-runs
…feature branch) (#179975)

Resolves #174355

## Summary

Adds `get`, `find`, `delete` APIs for backfills.

### GET

`GET kbn:/internal/alerting/rules/backfill/{backfillId}`

Returns backfill information by ID

### FIND

`POST kbn:/internal/alerting/rules/backfill/_find`

**Query parameters** (at least one must be specified; using multiple implies `AND`ing the conditions)

- `ruleIds` - Returns any scheduled backfills for the given rule IDs
- `start` - Returns any scheduled backfills where the start of the backfill time range `>= start`
- `end` - Returns any scheduled backfills where the end of the backfill time range `<= end`

### DELETE

`DELETE kbn:/internal/alerting/rules/backfill/{backfillId}`

Deletes the specific backfill along with the associated task. Note that the task is created in [this PR](#177640) so there will need to be some adjustments to the functional test after that PR is merged to the feature branch.

---------

Co-authored-by: kibanamachine <[email protected]>
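The FIND semantics above (at least one parameter required, multiple parameters `AND`ed) can be illustrated with an in-memory predicate. This is only a sketch of the matching logic: the real API presumably translates these conditions into a saved object query, and the `BackfillSummary` and `FindBackfillQuery` shapes, the function name, and the representation of `ruleIds` as an array are all assumptions.

```typescript
interface BackfillSummary {
  id: string;
  ruleId: string;
  start: string;
  end: string;
}

interface FindBackfillQuery {
  ruleIds?: string[];
  start?: string;
  end?: string;
}

// Hypothetical predicate: true when the backfill satisfies every condition
// that is present in the query.
function matchesFindQuery(backfill: BackfillSummary, query: FindBackfillQuery): boolean {
  if (query.ruleIds === undefined && query.start === undefined && query.end === undefined) {
    throw new Error('At least one of ruleIds, start or end must be specified');
  }
  if (query.ruleIds !== undefined && query.ruleIds.indexOf(backfill.ruleId) === -1) {
    return false;
  }
  // Backfill range must start at or after the requested start.
  if (query.start !== undefined && Date.parse(backfill.start) < Date.parse(query.start)) {
    return false;
  }
  // Backfill range must end at or before the requested end.
  if (query.end !== undefined && Date.parse(backfill.end) > Date.parse(query.end)) {
    return false;
  }
  return true;
}
```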
…na into alerting/backfill-rule-runs
Pinging @elastic/response-ops (Team:ResponseOps)
Kibana Security changes LGTM! Just one comment regarding the new ESO AAD before I approve.
export const AdHocRunAttributesIncludedInAAD = [
  'enabled',
  'start',
  'duration',
  'createdAt',
  'rule',
  'spaceId',
];
Just FYI - once these attributes are included in AAD, they cannot be removed due to the limitations of zero-downtime upgrades in serverless (at least for the foreseeable future). I just wanted to double check that each one was considered preferred/necessary to include in AAD before approving the PR.
Thanks for flagging @jeramysoucy! I discussed it with the team and we're going to start with the minimum fields that make sense at this time. Updated in 23cda6d
If we want to include additional fields in the future, what is the process for that?
@ymao1 Unfortunately, adding or removing an existing attribute to/from AAD is not supported at this time. I wish I had a better answer than that. New attributes can be added to AAD, but only before they are utilized/populated:
The addition will require 2 serverless release stages.
Release 1: Add the field to attributesToIncludeInAAD in the ESO type registration. Do not yet use/populate the new field.
Release 2: Begin using the new field. Implement model version change to backfill data as needed.
Adding new nested fields of attributes that are already included in AAD is handled automatically.
We have a backlog task to document developer guidelines for AAD, as it is not straightforward. Additionally, our team has a long-term goal of supporting any AAD change in serverless, but this work is not on our roadmap at this time.
For now, the best approach is to be conservative with which attributes are included in AAD. You can always tag the Kibana Security team for advice or input when making ESO changes. It is our intention that down the road we can make ESO management much easier for developers.
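As a rough illustration of the two-release process described in this comment, the sketch below models an ESO type registration and a guard for adding an attribute to AAD. It does not import from Kibana: `EsoTypeRegistration` is a local stand-in for the registration shape, `hypotheticalNewField` and `canAddToAAD` are invented for the example, and the attribute lists are illustrative rather than the ones this PR ended up with.

```typescript
// Local stand-in for the shape of an encrypted saved object type
// registration; this is an assumption, not Kibana's exported type.
interface EsoTypeRegistration {
  type: string;
  attributesToEncrypt: Set<string>;
  attributesToIncludeInAAD: Set<string>;
}

// Release 1: declare the new attribute in AAD but do not populate it yet.
const release1Registration: EsoTypeRegistration = {
  type: 'ad_hoc_run_params',
  attributesToEncrypt: new Set(['apiKeyToUse']),
  attributesToIncludeInAAD: new Set(['rule', 'spaceId', 'hypotheticalNewField']),
};

// Hypothetical guard: an attribute is only safe to add to AAD while no
// existing document has a value for it.
function canAddToAAD(attribute: string, existingDocs: Array<Record<string, unknown>>): boolean {
  return existingDocs.every((doc) => doc[attribute] === undefined);
}

// Release 2 (not shown): begin writing hypotheticalNewField and add a model
// version change to backfill existing documents as needed.
```

The guard reflects the constraint stated above: an attribute that is already populated on stored documents cannot be moved into AAD afterwards.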
Pinging @elastic/obs-ux-management-team (Team:obs-ux-management)
Core and SO changes LGTM
💚 Build Succeeded
This is going to be awesome!
code change LGTM
This is the feature branch that contains the following commits. Each individual PR contains a summary and verification instructions.