Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Alerting] Storing custom searchable rule execution data inside the rule #112193

Closed
Tracked by #118324 ...
banderror opened this issue Sep 14, 2021 · 19 comments
Closed
Tracked by #118324 ...
Assignees
Labels
8.7 candidate Feature:Alerting/RulesFramework Issues related to the Alerting Rules Framework Team:Detection Rule Management Security Detection Rule Management Team Team:Detections and Resp Security Detection Response Team Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) v8.7.0

Comments

@banderror
Copy link
Contributor

banderror commented Sep 14, 2021

Parent ticket: #101013
Related to: #91265, #106347

Why

Security Solution needs to migrate its rule execution statuses and metrics from our sidecar SO to event_log (see the parent ticket). Given the rule execution events are in event_log, we will need to be able to execute relatively complex queries to event_log in order to fetch data for the Rule Monitoring table in Security.

Examples of such queries with aggregations:

We identified that queries with aggregations mentioned above are not performant.

What

We could simplify many of the mentioned queries and improve performance by duplicating information from the last execution and storing it inside the rule itself (or close to it). Additionally, this would allow us to postpone the implementation of a custom RBAC for event-log-via-alerting (see #106347) until after 7.16.

What fields we’d like to store:

Field name Field type Example values Comment
status keyword going to run, succeeded, partial failure, failed Custom, solution-specific rule execution status. The values mentioned are currently used in Security Solution
status_order short 0, 10, 20, 30 A field for sorting rules by their status (custom order instead of the alphabetical order)
metrics.indexing_duration_sum_ms long 1200 Total time that the rule spent for indexing alerts during the last execution
metrics.search_duration_sum_ms long 1200 Total time that the rule spent for searching source documents during the last execution
metrics.execution_gap_duration_s long 42000 Execution gap (see the definition below)

Execution gap metric that we use in Security Solution is defined the following way:

gapDuration = calculatedLookback - ruleDefinitionLookback

ruleDefinitionLookback = "additional lookback time" interval specified in the rule parameters

calculatedLookback = now - previousStartedAt - interval
now = Date.now()
previousStartedAt = when did task manager last start this rule?
interval = the rule runs every "interval" value, e.g. every 10 minutes; specified in the rule parameters

This looks very much similar to the existing rule state, with the difference that:

  • We need all the fields in it to be searchable, filterable, sortable. The current rule state, if I remember correctly, is a serialized JSON string stored in the Task Manager index. It's not possible to search, filter or sort over its fields.
  • We need to be able to add/remove fields in the future according to our needs in Security, for example add more solution-specific metrics (like [Security Solution] Include total indicator count when writing Indicator Match Rule execution logs #111903). It should be easy to add new fields or remove existing ones if they're not used by the solution anymore.

How

Some ideas discussed between @pmuellr, @spong, @xcrzx and @banderror:

  • Some of the fields could probably be made common for all the rules on the Framework level. E.g. the status field could become a "solution status" or "custom status". Maybe we could lift some of the metrics up to the Framework level as well.
  • Solution-specific fields like Indicator Match rule type metrics (like [Security Solution] Include total indicator count when writing Indicator Match Rule execution logs #111903) do not really belong to the Framework and should probably be stored in a separate object within the rule instance.
  • This object would need to support searching, filtering and sorting over all fields of different types. And it should be easily extensible. So ideally we’d like to define mappings for this object from the Security Solution side.
@banderror banderror added Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) Feature:Alerting/RulesFramework Issues related to the Alerting Rules Framework labels Sep 14, 2021
@elasticmachine
Copy link
Contributor

Pinging @elastic/kibana-alerting-services (Team:Alerting Services)

@banderror banderror added Team:Detection Rule Management Security Detection Rule Management Team Team:Detections and Resp Security Detection Response Team labels Sep 14, 2021
@elasticmachine
Copy link
Contributor

Pinging @elastic/security-detections-response (Team:Detections and Resp)

@banderror
Copy link
Contributor Author

@elastic/kibana-alerting-services we need some 👀 from your side on this one and we'd like to start discussing options for how to implement this. In Security we have capacity for working on the implementation and ideally we need it in 7.16 and as soon as possible, so we could finalize the whole #101013 in 7.16 as well 😎

@pmuellr
Copy link
Member

pmuellr commented Sep 15, 2021

We had mentioned in talking about this issue that we track the "schedule delay" of rule/connector execution (now - scheduledDate) for the event log, and that this would be nice to add to the execution status.

Also note that the POC for this issue adds the rule execution duration to the rule execution status in the SO - #111805

Some notes on suggested fields:

  • status - this seems specific to the rule type or perhaps solution, wondering how confusing / error-prone that will be for searches; or is there an effort in RAC to standardize this?
  • status_order - same as status, perhaps worse, if you sort different rules with different types of order in one list
  • metrics.indexing_duration_sum_ms - almost a yes! but what rules do indexing :-); would eventually be applicable to the index action though, so ya - my understanding is this would be for indexing by the rule registry
  • metrics.search_duration_sum_ms - yes!
  • metrics.execution_gap_duration_s - this is similar to, but different than our "schedule delay", which we should be able to add easily. Otherwise it seems like it's very specific to security rules. At the same time, you can imagine this being applicable to many rules, as a general concept that alerting dealt with - but of course we don't today. Is this another topic we should standardize in RAC?

Options on extendability; below, some of the options would create a new "generic" field in execution_status to store the extended fields in, and I'll refer to that field as extra:

  1. add each new field to a new rule- or solution-specific object field in the saved object execution_status as normal fields; eg execution_status.index_threshold.someValueHere
  2. add extra as type:object enabled:false
  3. add extra as type:flattened
  4. add extra as type:nested where the inner docs are {key: keyword, nval: float, kval: keyword kval.text: text}
  5. let each plugin extend the rule SO, the same way SO's add their properties to attributes, so add whatever fields they want
  6. what else?

Option 1 provides the best story from a lot of aspects, but does bloat the mappings, whereas the other options don't, as much. We'd want to define how the migrations for these work; I'm thinking a migration would never add values for these, but if a field was removed in some release, we'd clearly have to delete it in a migration. Meaning rule solution authors having to update the migration in the alerting plugin, which is not great. And could be a bad experience if a user built things on it, like dashboard visualizations.

For option 2, to search/sort, runtime fields would need to be provided, that would presumably be accessing the source, so is likely the "slowest" story for queries.

For option 3, doesn't allow numeric access, just keyword, so would be problematic for numeric fields, which I assume will be pretty important. Presumably the numeric fields could be accessed via runtime fields, not sure if the source is required.

For option 4, the way we'd store values is obviously clumsy, but we're essentially modelling Map<string, number|string> here so I think that's what we'd have to do. Nested fields aren't available within most Kibana visualizations, so presumably runtime fields would be needed to access the data for that purpose. I think there may be some nested support in KQL though (for Discover usage, or other searching needs).

For option 5, I believe we've looked at doing this, perhaps as part of the "searchable rule params" issue we worked on a while back (will need to look for the issue) - but the TL;DR is we settled on using the flattened type for this (which doesn't seem appropriate here with numeric values).

@mikecote
Copy link
Contributor

For option 2, to search/sort, runtime fields would need to be provided, that would presumably be accessing the source, so is likely the "slowest" story for querie

Side note, I don't believe free text searching can be done on runtime fields, but I may be wrong.

@pmuellr
Copy link
Member

pmuellr commented Sep 16, 2021

I don't believe free text searching can be done on runtime fields, but I may be wrong.

I believe you are correct, but I'm guessing this isn't a concern. It will contain output from the execution, so perhaps some error messages would be text search candidates, but guessing not being able to text search through them is survivable.

@pmuellr
Copy link
Member

pmuellr commented Sep 16, 2021

One point that came when Mike and I chatted about this was whether you can use runtime fields in SO searches. Dunno!

@pmuellr
Copy link
Member

pmuellr commented Sep 16, 2021

In a chat with Core, it appears there isn't a way to send runtime field definitions with SO find() calls, but in theory we could add runtime field definitions in the SO mappings itself. Will need to do some experiments.

I hadn't considered that, but it makes sense to "burn" runtime fields into the mappings for this, as presumably they wouldn't change during a particular Kibana release.

@mikecote
Copy link
Contributor

I hadn't considered that, but it makes sense to "burn" runtime fields into the mappings for this, as presumably they wouldn't change during a particular Kibana release.

If they are part of the mapping, would the runtime fields get "calculated" anytime a search is happening by Kibana? Including different SO types?

@pmuellr
Copy link
Member

pmuellr commented Sep 21, 2021

If they are part of the mapping, would the runtime fields get "calculated" anytime a search is happening by Kibana? Including different SO types?

Good question we should ask ES folks.

Thinking about this some more, it feels like this could end up being another form of "migration" that we need to deal with over time, and one that's different from our current migrations, so ... will add some amount of additional complexity, just thinking about how to keep such fields "stable" over time. Probably easier than our existing migrations, since we don't migrate data, but we can also never change existing mappings for old indices.

Feeling like depending on runtime fields should probably have a small research spike associated with it, if we wanted to go that route.

@pmuellr
Copy link
Member

pmuellr commented Sep 21, 2021

Mentioned is a discussion: we should add the number of alerts generated from a rule execution to the execution status. And the event log as well? We can also provide the number of new and recovered alerts. And by "alerts", I mean "instanceIds", and it would be the number that are "active", not the number that are actually going to schedule actions to be fired.

I think the number of actions that are scheduled to be fired would be another interesting number to track.

@pmuellr
Copy link
Member

pmuellr commented Sep 21, 2021

Thought I'd point out the place where the execution status is written to the rule SO:

const client = this.context.internalSavedObjectsRepository;
const attributes = {
executionStatus: alertExecutionStatusToRaw(executionStatus),
};
try {
await partiallyUpdateAlert(client, alertId, attributes, {
ignore404: true,
namespace,
refresh: false,
});
} catch (err) {
this.logger.error(
`error updating alert execution status for ${this.alertType.id}:${alertId} ${err.message}`
);
}

The executionStatus code itself is here: https://github.com/elastic/kibana/blob/master/x-pack/plugins/alerting/server/lib/alert_execution_status.ts

@jasonrhodes
Copy link
Member

Hey all, I'll be honest with you, I'm not sure we have the headspace on our side to really fully understand what we're looking at here or how we can give useful feedback. I think the overall idea makes sense to me and if/when we come across the need to use this, I'm confident we'll be able to adapt things so that it works for us.

Do you have specific questions or areas that you think would be good for observability to address for this initial implementation?

@spong
Copy link
Member

spong commented Sep 21, 2021

Hey all, I'll be honest with you, I'm not sure we have the headspace on our side to really fully understand what we're looking at here or how we can give useful feedback. I think the overall idea makes sense to me and if/when we come across the need to use this, I'm confident we'll be able to adapt things so that it works for us.

Do you have specific questions or areas that you think would be good for observability to address for this initial implementation?

Understandable, and definitely agree about being able to adapt as needed, so I think we're all good there 👍. This effort is in support of the Rule Management/Monitoring redesign (https://github.com/elastic/stack-design-team/issues/68), and I think we're mostly just looking for sign-off from Observability on these fields in support of that effort since you all will be adopting the Rule Management here at some point. So more of a verification that the proposed UI/UX meets your needs and that we don't need to add any additional fields as part of this initial effort (can always add more later :).

@banderror
Copy link
Contributor Author

Let me try to summarize what we have at this point after the discussions.
@pmuellr @mikecote @spong @jasonrhodes please let me know if you have any concerns.

TL;DR:

  • Let's focus on the fields in the PR description and proceed with the Option 1 (see below). This way we'd have a chance to meet our goals in Security in 7.16. We in Security are ready to start implementing it.
  • After 7.16 will should have more information about the Rule Management / Monitoring UX in Security and Observability, and we will have time to add more metrics and come up with an extensible solution.

Fields

  • See the updated table in the PR description. All the fields there are required by Security and we need to focus on implementing them in the Alerting Framework in 7.16, if possible. We have at most 1-2 weeks for that, because on top of that we will need to build the execution logging in Security and release it in 7.16.
  • We got a sign-off from Observability on these fields (@jasonrhodes please correct me if I've misunderstood 🙂)
  • We requested a final confirmation on the need for showing Execution Gap metric in the UI from the Observability and Security PM and UX.
  • We will need to be able to add more execution-related fields in the future (common or solution-specific), but figuring out all the fields is out of scope of this ticket 🙂

Implementation

We discussed the options suggested by Patrick:

  1. Add the fields to the rule saved object itself as normal fields of execution_status object.
    • Preferable option. The fields are generic enough. Seems to be the simplest option to implement.
  2. Add extra as type:object enabled:false and use runtime fields.
    • Extensible, but requires runtime fields to make it searchable/filterable/sortable (research is needed to confirm that). Will have perf issues (testing is needed). Runtime fields are not supported by Saved Objects.
  3. Add extra as type:flattened
    • Similar to the previous one. Flattened field type doesn't support numerical fields properly. Runtime fields would be needed.
  4. Add extra as type:nested where the inner docs are {key: keyword, nval: float, kval: keyword kval.text: text}
    • Extensible. Nested field type. Complexity would be on the solution side. Hard to use arbitrary types for field values.
  5. Let each plugin extend the rule SO, the same way SO's add their properties to attributes, so add whatever fields they want
    • Extensible. The most flexible way. Complexity would be encapsulated in the Framework. We need essentially the same thing (sortability, searchability, etc) for the rule params object.

@pmuellr
Copy link
Member

pmuellr commented Sep 27, 2021

One of the options discussed here was the thought of using runtime fields to allow rule-specific fields to be extracted from the rule saved object. Turns out Kibana doesn't support passing runtime field mappings into SO search requests, so I've opened an issue to request that - add support for runtime fields in Saved Object find() requests #113152.

@pmuellr
Copy link
Member

pmuellr commented Sep 28, 2021

Mike had proposed an option 6: wait for elasticsearch "join" capability in search, at which point we could do a query to join rules against some other SO containing the rule-specific data. May be a long wait :-). But honestly, that would be the cleanest solution. And we'd need to wait for SO's to support "join" as well ...

Options 2 and 3 require SO support runtime fields in the SO client find() API, which isn't currently supported - see issue #113152

For Option 4, I'll have to double-check on nested fields, but I think you can use them in KQL, but can't access them as fields for use in Lens. Which might be fine, because most users aren't going to need to build Lens graphs over .kibana (well, I have :-)). I believe there is also a performance impact, as these are implemented as separate documents associated with the primary document (or something like that).

For option 5, we'd likely need Kibana core to help out with this. Rather than invent some new thing on top of SO's specific to alerting rules, it would probably be better to extend SO's somehow to provide this capability.

@pmuellr pmuellr removed their assignment Oct 14, 2021
@kobelb kobelb added the needs-team Issues missing a team label label Jan 31, 2022
@botelastic botelastic bot removed the needs-team Issues missing a team label label Jan 31, 2022
@banderror banderror assigned jpdjere and unassigned xcrzx Dec 19, 2022
@banderror
Copy link
Contributor Author

Has been implemented as part of multiple other tickets: #135127, #147759, #130966

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
8.7 candidate Feature:Alerting/RulesFramework Issues related to the Alerting Rules Framework Team:Detection Rule Management Security Detection Rule Management Team Team:Detections and Resp Security Detection Response Team Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) v8.7.0
Projects
No open projects
Development

No branches or pull requests

10 participants