[Alerting] Storing custom searchable rule execution data inside the rule #112193

banderror · 2021-09-14T23:50:29Z

Parent ticket: #101013
Related to: #91265, #106347

Why

Security Solution needs to migrate its rule execution statuses and metrics from our sidecar SO to event_log (see the parent ticket). Given the rule execution events are in event_log, we will need to be able to execute relatively complex queries to event_log in order to fetch data for the Rule Monitoring table in Security.

Examples of such queries with aggregations:

- types: see [Security Solution] Extend event_log plugin with functionality required for Rule Execution Log #106347
- example 1: query rule data (rule IDs known beforehand)
- example 2: sort rules by their execution status (rule IDs unknown)

We found that the aggregation queries mentioned above are not performant.

What

We could simplify many of the mentioned queries and improve performance by duplicating information from the last execution and storing it inside the rule itself (or close to it). Additionally, this would allow us to postpone the implementation of a custom RBAC for event-log-via-alerting (see #106347) until after 7.16.

What fields we’d like to store:

| Field name | Field type | Example values | Comment |
|---|---|---|---|
| `status` | keyword | `going to run`, `succeeded`, `partial failure`, `failed` | Custom, solution-specific rule execution status. The values mentioned are currently used in Security Solution |
| `status_order` | short | `0`, `10`, `20`, `30` | A field for sorting rules by their `status` (custom order instead of the alphabetical order) |
| `metrics.indexing_duration_sum_ms` | long | `1200` | Total time that the rule spent for indexing alerts during the last execution |
| `metrics.search_duration_sum_ms` | long | `1200` | Total time that the rule spent for searching source documents during the last execution |
| `metrics.execution_gap_duration_s` | long | `42000` | Execution gap (see the definition below) |
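
To make the role of `status_order` concrete, here is a minimal TypeScript sketch of the proposed fields. Only the field names and example values come from the table; the type names, the helper functions, and the pairing of each status with a specific order value are assumptions made for illustration.

```typescript
// Hypothetical type names; only the field names and example values come from the table.
type RuleExecutionStatus = 'going to run' | 'succeeded' | 'partial failure' | 'failed';

// Assumption: the example orders 0/10/20/30 pair with the statuses in the order listed.
const STATUS_ORDER: Record<RuleExecutionStatus, number> = {
  'going to run': 0,
  succeeded: 10,
  'partial failure': 20,
  failed: 30,
};

interface RuleExecutionSummary {
  status: RuleExecutionStatus;
  status_order: number;
  metrics: {
    indexing_duration_sum_ms?: number;
    search_duration_sum_ms?: number;
    execution_gap_duration_s?: number;
  };
}

// Derives status_order from status so the two fields cannot drift apart.
function buildSummary(
  status: RuleExecutionStatus,
  metrics: RuleExecutionSummary['metrics'] = {}
): RuleExecutionSummary {
  return { status, status_order: STATUS_ORDER[status], metrics };
}

// Sorts by the custom order rather than alphabetically by status.
function sortByStatus(items: RuleExecutionSummary[]): RuleExecutionSummary[] {
  return [...items].sort((a, b) => a.status_order - b.status_order);
}
```

Sorting alphabetically on `status` would put `failed` before `going to run`; sorting on `status_order` keeps the intended progression.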

The execution gap metric that we use in Security Solution is defined as follows:

gapDuration = calculatedLookback - ruleDefinitionLookback

ruleDefinitionLookback = "additional lookback time" interval specified in the rule parameters

calculatedLookback = now - previousStartedAt - interval
now = Date.now()
previousStartedAt = when did task manager last start this rule?
interval = the rule runs every "interval" value, e.g. every 10 minutes; specified in the rule parameters
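
The definition above can be sketched in TypeScript as follows. This is an illustration under stated assumptions, not the Security Solution implementation: the function and parameter names are hypothetical, all inputs are taken in milliseconds, and the conversion to seconds is only there because the proposed field is `execution_gap_duration_s`.

```typescript
// Hypothetical input shape; every value is in milliseconds.
interface GapInput {
  nowMs: number; // Date.now() at the time of the calculation
  previousStartedAtMs: number; // when task manager last started this rule
  intervalMs: number; // the rule's "interval" parameter
  ruleDefinitionLookbackMs: number; // the rule's "additional lookback time"
}

// gapDuration = calculatedLookback - ruleDefinitionLookback
// calculatedLookback = now - previousStartedAt - interval
function calculateGapDurationMs(input: GapInput): number {
  const calculatedLookbackMs = input.nowMs - input.previousStartedAtMs - input.intervalMs;
  return calculatedLookbackMs - input.ruleDefinitionLookbackMs;
}

// Assumption: the stored metric is the same value expressed in whole seconds.
function toGapDurationSeconds(gapMs: number): number {
  return Math.round(gapMs / 1000);
}
```

For example, a rule with a 10 minute interval and 1 minute of additional lookback that last started 25 minutes ago has a calculated lookback of 15 minutes and therefore a gap of 14 minutes. The formula as written can yield a negative number when the rule ran on time; whether to clamp that to zero is not stated here.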

This looks very similar to the existing rule state, with the following differences:

- We need all the fields in it to be searchable, filterable, and sortable. The current rule state, if I remember correctly, is a serialized JSON string stored in the Task Manager index. It's not possible to search, filter or sort over its fields.
- We need to be able to add/remove fields in the future according to our needs in Security, for example add more solution-specific metrics (like [Security Solution] Include total indicator count when writing Indicator Match Rule execution logs #111903). It should be easy to add new fields or remove existing ones if they're not used by the solution anymore.

How

Some ideas discussed between @pmuellr, @spong, @xcrzx and @banderror:

- Some of the fields could probably be made common for all the rules on the Framework level. E.g. the status field could become a "solution status" or "custom status". Maybe we could lift some of the metrics up to the Framework level as well.
- Solution-specific fields like Indicator Match rule type metrics (like [Security Solution] Include total indicator count when writing Indicator Match Rule execution logs #111903) do not really belong to the Framework and should probably be stored in a separate object within the rule instance.
- This object would need to support searching, filtering and sorting over all fields of different types. And it should be easily extensible. So ideally we’d like to define mappings for this object from the Security Solution side.


elasticmachine · 2021-09-14T23:50:31Z

Pinging @elastic/kibana-alerting-services (Team:Alerting Services)

elasticmachine · 2021-09-14T23:52:50Z

Pinging @elastic/security-detections-response (Team:Detections and Resp)

banderror · 2021-09-15T00:03:40Z

@elastic/kibana-alerting-services we need some 👀 from your side on this one and we'd like to start discussing options for how to implement this. In Security we have capacity for working on the implementation and ideally we need it in 7.16 and as soon as possible, so we could finalize the whole #101013 in 7.16 as well 😎

pmuellr · 2021-09-15T13:48:39Z

We had mentioned in talking about this issue that we track the "schedule delay" of rule/connector execution (now - scheduledDate) for the event log, and that this would be nice to add to the execution status.

Also note that the POC for this issue adds the rule execution duration to the rule execution status in the SO - #111805

Some notes on suggested fields:

- status - this seems specific to the rule type or perhaps solution, wondering how confusing / error-prone that will be for searches; or is there an effort in RAC to standardize this?
- status_order - same as status, perhaps worse, if you sort different rules with different types of order in one list
- metrics.indexing_duration_sum_ms - almost a yes! but what rules do indexing :-); would eventually be applicable to the index action though, so ya - my understanding is this would be for indexing by the rule registry
- metrics.search_duration_sum_ms - yes!
- metrics.execution_gap_duration_s - this is similar to, but different than our "schedule delay", which we should be able to add easily. Otherwise it seems like it's very specific to security rules. At the same time, you can imagine this being applicable to many rules, as a general concept that alerting dealt with - but of course we don't today. Is this another topic we should standardize in RAC?

Options on extendability; below, some of the options would create a new "generic" field in execution_status to store the extended fields in, and I'll refer to that field as extra:

1. add each new field to a new rule- or solution-specific object field in the saved object execution_status as normal fields; eg execution_status.index_threshold.someValueHere
2. add extra as type:object enabled:false
3. add extra as type:flattened
4. add extra as type:nested where the inner docs are {key: keyword, nval: float, kval: keyword, kval.text: text}
5. let each plugin extend the rule SO, the same way SO's add their properties to attributes, so add whatever fields they want
6. what else?

Option 1 provides the best story from a lot of aspects, but does bloat the mappings, whereas the other options don't, as much. We'd want to define how the migrations for these work; I'm thinking a migration would never add values for these, but if a field was removed in some release, we'd clearly have to delete it in a migration. That means rule or solution authors would have to update the migration in the alerting plugin, which is not great. It could also be a bad experience if a user built things on a removed field, like dashboard visualizations.

For option 2, to search/sort, runtime fields would need to be provided, that would presumably be accessing the source, so is likely the "slowest" story for queries.

Option 3 doesn't allow numeric access, just keyword, so it would be problematic for numeric fields, which I assume will be pretty important. Presumably the numeric fields could be accessed via runtime fields, not sure if the source is required.

For option 4, the way we'd store values is obviously clumsy, but we're essentially modelling Map<string, number|string> here so I think that's what we'd have to do. Nested fields aren't available within most Kibana visualizations, so presumably runtime fields would be needed to access the data for that purpose. I think there may be some nested support in KQL though (for Discover usage, or other searching needs).

For option 5, I believe we've looked at doing this, perhaps as part of the "searchable rule params" issue we worked on a while back (will need to look for the issue) - but the TL;DR is we settled on using the flattened type for this (which doesn't seem appropriate here with numeric values).
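
To compare options 2 to 4 side by side, here is a rough TypeScript sketch of what the `extra` mapping could look like in each case, plus the encoding that option 4 implies for a `Map<string, number|string>`. The mapping bodies use standard Elasticsearch mapping and nested-sort syntax; the constant names, the helper functions, and the bare `extra` path (in a real saved object it would sit under the rule's attribute path) are assumptions for illustration.

```typescript
// Option 2: stored but not indexed; searching or sorting needs runtime fields.
const extraAsDisabledObject = { type: 'object', enabled: false } as const;

// Option 3: every leaf value is indexed as a keyword, so numeric sorting is lexicographic.
const extraAsFlattened = { type: 'flattened' } as const;

// Option 4: one nested doc per key, with separate numeric and keyword value fields.
const extraAsNested = {
  type: 'nested',
  properties: {
    key: { type: 'keyword' },
    nval: { type: 'float' },
    kval: { type: 'keyword', fields: { text: { type: 'text' } } },
  },
} as const;

interface NestedEntry {
  key: string;
  nval?: number;
  kval?: string;
}

// Encodes a plain key/value object into the nested docs used by option 4.
function toNestedEntries(extra: Record<string, number | string>): NestedEntry[] {
  return Object.entries(extra).map(([key, value]): NestedEntry =>
    typeof value === 'number' ? { key, nval: value } : { key, kval: value }
  );
}

// Builds a sort clause over one numeric key of the nested field.
function nestedSortFor(key: string, order: 'asc' | 'desc' = 'asc') {
  return {
    'extra.nval': {
      order,
      nested: { path: 'extra', filter: { term: { 'extra.key': key } } },
    },
  };
}
```

The sort clause shows where the clumsiness of option 4 ends up: every query has to name the nested path and filter on `key` before it can reach a value.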

mikecote · 2021-09-15T19:53:56Z

> For option 2, to search/sort, runtime fields would need to be provided, that would presumably be accessing the source, so is likely the "slowest" story for queries.

Side note, I don't believe free text searching can be done on runtime fields, but I may be wrong.

pmuellr · 2021-09-16T12:48:53Z

> I don't believe free text searching can be done on runtime fields, but I may be wrong.

I believe you are correct, but I'm guessing this isn't a concern. It will contain output from the execution, so perhaps some error messages would be text search candidates, but guessing not being able to text search through them is survivable.

pmuellr · 2021-09-16T12:49:33Z

One point that came when Mike and I chatted about this was whether you can use runtime fields in SO searches. Dunno!

pmuellr · 2021-09-16T13:39:34Z

In a chat with Core, it appears there isn't a way to send runtime field definitions with SO find() calls, but in theory we could add runtime field definitions in the SO mappings itself. Will need to do some experiments.

I hadn't considered that, but it makes sense to "burn" runtime fields into the mappings for this, as presumably they wouldn't change during a particular Kibana release.
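
As a rough illustration of what "burning" a runtime field into the mappings might look like, here is a TypeScript sketch that builds a `runtime` section for the index mappings. The `runtime` section with `type` and `script.source` is standard Elasticsearch mapping syntax; the helper name, the runtime field name, the `_source` path, and the Painless script are hypothetical and untested against a real `.kibana` index. The helper assumes every path segment is a plain identifier.

```typescript
interface LongRuntimeField {
  type: 'long';
  script: { source: string };
}

// Hypothetical helper: emits a long read from _source at the given path.
// Reading from _source is what an `enabled: false` object would require,
// since its values are neither indexed nor available as doc values.
function longRuntimeFieldFromSource(path: string[]): LongRuntimeField {
  const access = path.map((segment) => `?.${segment}`).join('');
  const source = [
    `def v = params._source${access};`,
    `if (v != null) { emit(((Number) v).longValue()); }`,
  ].join(' ');
  return { type: 'long', script: { source } };
}

// Assumption: a made-up field name and a made-up location for the value.
const mappingsWithRuntimeField = {
  runtime: {
    rule_search_duration_ms: longRuntimeFieldFromSource([
      'alert',
      'executionStatus',
      'extra',
      'search_duration_sum_ms',
    ]),
  },
};
```

Because the script lives in the mappings, changing it later means changing the mappings of an existing index, which is part of why this would need its own research.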

mikecote · 2021-09-16T14:01:59Z

> I hadn't considered that, but it makes sense to "burn" runtime fields into the mappings for this, as presumably they wouldn't change during a particular Kibana release.

If they are part of the mapping, would the runtime fields get "calculated" anytime a search is happening by Kibana? Including different SO types?

pmuellr · 2021-09-21T13:45:34Z

> If they are part of the mapping, would the runtime fields get "calculated" anytime a search is happening by Kibana? Including different SO types?

Good question we should ask ES folks.

Thinking about this some more, it feels like this could end up being another form of "migration" that we need to deal with over time, and one that's different from our current migrations, so ... will add some amount of additional complexity, just thinking about how to keep such fields "stable" over time. Probably easier than our existing migrations, since we don't migrate data, but we can also never change existing mappings for old indices.

Feeling like depending on runtime fields should probably have a small research spike associated with it, if we wanted to go that route.

pmuellr · 2021-09-21T13:48:31Z

Mentioned in a discussion: we should add the number of alerts generated from a rule execution to the execution status. And the event log as well? We can also provide the number of new and recovered alerts. And by "alerts", I mean "instanceIds", and it would be the number that are "active", not the number that are actually going to schedule actions to be fired.

I think the number of actions that are scheduled to be fired would be another interesting number to track.

pmuellr · 2021-09-21T13:55:45Z

Thought I'd point out the place where the execution status is written to the rule SO:

kibana/x-pack/plugins/alerting/server/task_runner/task_runner.ts

Lines 603 to 618 in b1d6779

    
    const client = this.context.internalSavedObjectsRepository;
    const attributes = {
      executionStatus: alertExecutionStatusToRaw(executionStatus),
    };

    try {
      await partiallyUpdateAlert(client, alertId, attributes, {
        ignore404: true,
        namespace,
        refresh: false,
      });
    } catch (err) {
      this.logger.error(
        `error updating alert execution status for ${this.alertType.id}:${alertId} ${err.message}`
      );
    }

The executionStatus code itself is here: https://github.com/elastic/kibana/blob/master/x-pack/plugins/alerting/server/lib/alert_execution_status.ts

jasonrhodes · 2021-09-21T19:36:12Z

Hey all, I'll be honest with you, I'm not sure we have the headspace on our side to really fully understand what we're looking at here or how we can give useful feedback. I think the overall idea makes sense to me and if/when we come across the need to use this, I'm confident we'll be able to adapt things so that it works for us.

Do you have specific questions or areas that you think would be good for observability to address for this initial implementation?

spong · 2021-09-21T22:49:52Z

> Hey all, I'll be honest with you, I'm not sure we have the headspace on our side to really fully understand what we're looking at here or how we can give useful feedback. I think the overall idea makes sense to me and if/when we come across the need to use this, I'm confident we'll be able to adapt things so that it works for us.
>
> Do you have specific questions or areas that you think would be good for observability to address for this initial implementation?

Understandable, and definitely agree about being able to adapt as needed, so I think we're all good there 👍. This effort is in support of the Rule Management/Monitoring redesign (https://github.com/elastic/stack-design-team/issues/68), and I think we're mostly just looking for sign-off from Observability on these fields in support of that effort since you all will be adopting the Rule Management here at some point. So more of a verification that the proposed UI/UX meets your needs and that we don't need to add any additional fields as part of this initial effort (can always add more later :).

banderror · 2021-09-22T12:49:35Z

Let me try to summarize what we have at this point after the discussions.
@pmuellr @mikecote @spong @jasonrhodes please let me know if you have any concerns.

TL;DR:

- Let's focus on the fields in the PR description and proceed with Option 1 (see below). This way we'd have a chance to meet our goals in Security in 7.16. We in Security are ready to start implementing it.
- After 7.16 we should have more information about the Rule Management / Monitoring UX in Security and Observability, and we will have time to add more metrics and come up with an extensible solution.

Fields

- See the updated table in the PR description. All the fields there are required by Security and we need to focus on implementing them in the Alerting Framework in 7.16, if possible. We have at most 1-2 weeks for that, because on top of that we will need to build the execution logging in Security and release it in 7.16.
- We got a sign-off from Observability on these fields (@jasonrhodes please correct me if I've misunderstood 🙂)
- We requested a final confirmation on the need for showing Execution Gap metric in the UI from the Observability and Security PM and UX.
- We will need to be able to add more execution-related fields in the future (common or solution-specific), but figuring out all the fields is out of scope of this ticket 🙂

Implementation

We discussed the options suggested by Patrick:

1. Add the fields to the rule saved object itself as normal fields of execution_status object.
   - Preferable option. The fields are generic enough. Seems to be the simplest option to implement.
2. Add extra as type:object enabled:false and use runtime fields.
   - Extensible, but requires runtime fields to make it searchable/filterable/sortable (research is needed to confirm that). Will have perf issues (testing is needed). Runtime fields are not supported by Saved Objects.
3. Add extra as type:flattened
   - Similar to the previous one. Flattened field type doesn't support numerical fields properly. Runtime fields would be needed.
4. Add extra as type:nested where the inner docs are {key: keyword, nval: float, kval: keyword, kval.text: text}
   - Extensible. Nested field type. Complexity would be on the solution side. Hard to use arbitrary types for field values.
5. Let each plugin extend the rule SO, the same way SO's add their properties to attributes, so add whatever fields they want
   - Extensible. The most flexible way. Complexity would be encapsulated in the Framework. We need essentially the same thing (sortability, searchability, etc) for the rule params object.

pmuellr · 2021-09-27T17:48:47Z

One of the options discussed here was the thought of using runtime fields to allow rule-specific fields to be extracted from the rule saved object. Turns out Kibana doesn't support passing runtime field mappings into SO search requests, so I've opened an issue to request that - add support for runtime fields in Saved Object find() requests #113152.

pmuellr · 2021-09-28T20:49:24Z

Mike had proposed an option 6: wait for elasticsearch "join" capability in search, at which point we could do a query to join rules against some other SO containing the rule-specific data. May be a long wait :-). But honestly, that would be the cleanest solution. And we'd need to wait for SO's to support "join" as well ...

Options 2 and 3 require runtime field support in the SO client find() API, which doesn't exist today - see issue #113152

For Option 4, I'll have to double-check on nested fields, but I think you can use them in KQL, but can't access them as fields for use in Lens. Which might be fine, because most users aren't going to need to build Lens graphs over .kibana (well, I have :-)). I believe there is also a performance impact, as these are implemented as separate documents associated with the primary document (or something like that).

For option 5, we'd likely need Kibana core to help out with this. Rather than invent some new thing on top of SO's specific to alerting rules, it would probably be better to extend SO's somehow to provide this capability.

banderror · 2023-01-27T17:07:34Z

Has been implemented as part of multiple other tickets: #135127, #147759, #130966

banderror added the Team:ResponseOps and Feature:Alerting/RulesFramework labels Sep 14, 2021

banderror added the Team:Detection Rule Management and Team:Detections and Resp labels Sep 14, 2021

banderror mentioned this issue Sep 14, 2021

[RAC][Security Solution][Detections] Rule Execution Log - technical implementation #101013

Closed

11 tasks

ymao1 mentioned this issue Sep 15, 2021

Leveraging event log data to provide better insights of the alerting framework #111452

Closed

gmmorris assigned pmuellr Sep 22, 2021

spong mentioned this issue Sep 24, 2021

[RAC][Alerting][Security Solution] Adds Rule Execution UUID #113058

Merged

2 tasks

pmuellr mentioned this issue Sep 27, 2021

[alerting] adds additional properties to the rule execution status #113203

Closed

9 tasks

rudolf mentioned this issue Sep 28, 2021

[saved objects] add support for runtime fields in Saved Object find() requests #113152

Open

pmuellr added the discuss label Oct 14, 2021

pmuellr removed their assignment Oct 14, 2021

spong mentioned this issue Oct 21, 2021

[Security Solution][Detections] Reading last 5 failures from Event Log v1 - raw implementation #115574

Merged

8 tasks

banderror mentioned this issue Nov 11, 2021

[Security Solution] Rule Execution Log - technical debt #118324

Open

19 tasks

XavierM added this to AppEx: ResponseOps - Rules & Alerts Management Jan 6, 2022

banderror mentioned this issue Jan 20, 2022

[Security Solution][Detections] Migrate from ruleStatusSavedObjectType to Alerting Event Log for Rule Monitoring #91265

Closed

kobelb added the needs-team label Jan 31, 2022

botelastic bot removed the needs-team label Jan 31, 2022

banderror mentioned this issue Feb 15, 2022

[RAM] [META] Make rule params searchable #123982

Closed

1 task

banderror mentioned this issue Mar 24, 2022

[DOCS] Execution summary is missing in the find rules API response elastic/security-docs#1759

Closed

banderror mentioned this issue Apr 26, 2022

[Security Solution] Pass rule execution statuses and metrics to Alerting Framework #130966

Closed

4 tasks

banderror added 8.3 candidate and removed discuss labels Apr 27, 2022

banderror added 8.4 candidate and removed 8.3 candidate labels Jun 10, 2022

banderror mentioned this issue Jun 27, 2022

[ResponseOps] API for reading/writing rule execution summary #135127

Closed

banderror assigned xcrzx Jun 27, 2022

banderror mentioned this issue Aug 16, 2022

[Security Solution] Get rid of "Advanced sorting" switch for the Rules table #138907

Closed

3 tasks

banderror added 8.5 candidate and removed 8.4 candidate labels Aug 16, 2022

banderror added 8.6 candidate and removed 8.5 candidate labels Oct 5, 2022

banderror added 8.7 candidate and removed 8.6 candidate labels Nov 24, 2022

banderror assigned jpdjere and unassigned xcrzx Dec 19, 2022

banderror assigned maximpn Dec 29, 2022

banderror closed this as completed Jan 27, 2023

github-project-automation bot moved this to Done in AppEx: ResponseOps - Rules & Alerts Management Jan 27, 2023

banderror added the v8.7.0 label Jan 27, 2023
