
# [Security Solution][Detection Engine] Reduce CPU usage when creating alerts from source documents #192287

Merged · 17 commits · Sep 18, 2024

Conversation

@marshallmain (Contributor) commented Sep 6, 2024

## Summary

Parent issue: https://github.com/elastic/security-team/issues/10106

tl;dr: This PR reduces the CPU time used by Kibana to transform source documents into alerts by ~2x for relatively small docs and up to 20x in some pathological cases.

| Test Case | Time on `main` | Time on PR branch | Speedup |
| ---- | ---- | ---- | ---- |
| 8k fields | 500ms | 126ms | 4x |
| 1k fields | 400ms | 26ms | 16x |
| 100 fields | 30ms | 12ms | 2.5x |
| 10k fields nested | 16s | 800ms | 20x |
| 1k fields nested | 200ms | 80ms | 2.5x |

Detection rules generally aim to keep computationally expensive tasks in Elasticsearch since Kibana's NodeJS architecture is not well suited to CPU bound logic. However, we are (for most rule types) currently unable to avoid loading full documents from source indices in ES into Kibana, transforming them into alerts, and sending those transformed documents back to Elasticsearch for indexing into the alerts index. When the source documents contain many fields, this flow is expensive and contributes to event loop blockages.

This PR refactors and optimizes the alert creation process to simplify the code and reduce the CPU burden. Optimizations include:

  • Mutating the source document object into an alert instead of copying it. The elasticsearch-js client already creates an object in memory for the entire source document in the query response, so we can simply use that instead of making another copy.
  • Using `Object.keys` to iterate over all properties of the object, fetching each property individually, instead of `Object.entries`.
  • Traversing the document only once - previously we iterated over at least the top-level keys 4 times, in `filterSource`, `stripNonEcsFields`, `additionalAlertFields`, and `buildFieldsKeyAsArrayMap`. Now `traverseDoc` handles 3 of those cases together in one pass, and we avoid needing `buildFieldsKeyAsArrayMap` by fetching values on demand with any mix of dot and nested notation.
    • `buildFieldsKeyAsArrayMap` turns a key in dot notation into an array that represents the nested notation path to the object, accounting for any intermediate keys that might have dots in them. However, building this map for an entire large object is expensive, and I found that it's significantly faster to just check for potential intermediate dot-notation keys when fetching instead of precomputing the map.
  • Happy path optimization: for `mergeMissingFieldsWithSource` (the default merge option), typically the vast majority of fields will already be in `_source`, so we can skip the rest of the logic for merging those fields from `fields` into `_source`. Now we check whether a field key is already present in `_source` as the very first operation when merging fields into `_source` and skip to the next field if it is.
    • `filterFieldEntry` - `isIgnored` splits ignored fields into a map of ordinary field names and an array of regular expressions so we can more efficiently check whether a field key is on the ignore list. The ignore list includes ~80 field names from the legacy signal mappings, plus any user-added ignore fields. This function is also called much less often now that we only call `filterFieldEntry` after checking that the field key is not already in `_source`.
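The ignore-list split can be sketched roughly as follows. This is a minimal illustration, not the actual Kibana implementation: `buildIgnoreMatcher` and the slash-delimited convention for marking regex entries are assumptions made for the example.

```typescript
// Hypothetical sketch: split an ignore list into exact names (cheap Set
// lookups) and regex patterns (only scanned when the exact lookup misses).
// The "/pattern/" convention for regex entries is an assumption here.
const buildIgnoreMatcher = (ignoreFields: string[]) => {
  const exactNames = new Set<string>();
  const patterns: RegExp[] = [];
  for (const field of ignoreFields) {
    if (field.length > 1 && field.startsWith('/') && field.endsWith('/')) {
      patterns.push(new RegExp(field.slice(1, -1)));
    } else {
      exactNames.add(field);
    }
  }
  // O(1) Set lookup for the common case; regexes are only consulted on a miss.
  return (fieldKey: string): boolean =>
    exactNames.has(fieldKey) || patterns.some((regex) => regex.test(fieldKey));
};

const isIgnored = buildIgnoreMatcher(['signal.rule.id', '/^_[a-z]+/']);
```

With ~80 legacy field names on the list, precompiling the Set once per rule execution keeps the per-field check cheap even when `filterFieldEntry` is called for many fields.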

Supporting changes (necessary, but not optimizations on their own):

  • `robustGet`, `robustSet`, `robustPathIsValid` - "robust" functions in the sense that they can handle any mix of dot and nested notation. Whereas lodash `get` would not find `my-value` in `{ a: { 'b.c': 'my-value' }}` given the key `a.b.c`, `robustGet` will find it by popping prefixes from the key and checking each combination until it finds the value or verifies that there is no valid path matching the key.
  • Empty arrays are now treated as "no value" when creating alerts in order to make the logic more consistent. All empty arrays are silently (not reported in "removed fields") removed from alert documents. This resolves an inconsistency where an empty array in an incompatible field (e.g. `agent.name.text: []`) would be reported as a removed empty array, whereas an array with values in an incompatible field (e.g. `agent.name.text: ['agent1', 'agent2']`) would report two individual values removed from `agent.name.text` but not report the array as removed, even though the whole field (array included) is gone.
    • The new behavior:
      • `agent.name.text: []` -> completely removed without reporting the array as removed, instead of removed with the array reported as removed
      • `agent.name.text: ['agent1', 'agent2']` -> same as before: completely removed, reporting `agent1` and `agent2` removed, not reporting the array as removed
    • See tests in `strip_non_ecs_fields.test.ts` for examples
  • `event.*` fields are copied from source docs to alerts with the original notation instead of transforming to dot notation. This is more in line with how we preserve the notation for the rest of the fields from `_source`. The exception is still `event.kind`, which we have always populated as `event.kind: 'signal'` for alerts, and we use dot notation for that field.
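As a rough sketch of that mixed-notation lookup (an illustration of the idea under the semantics described above - the names, `SourceRecord` type, and longest-prefix-first order are assumptions, not the actual `robustGet` from this PR):

```typescript
// Hypothetical sketch: resolve `key` under any mix of dot and nested notation
// by trying each prefix of the key as a literal property name.
type SourceRecord = Record<string, unknown>;

const isRecord = (value: unknown): value is SourceRecord =>
  typeof value === 'object' && value !== null && !Array.isArray(value);

const robustGet = ({ key, document }: { key: string; document: SourceRecord }): unknown => {
  const segments = key.split('.');
  for (let i = segments.length; i > 0; i--) {
    // e.g. for key 'a.b.c', try 'a.b.c', then 'a.b', then 'a' as literal keys
    const prefix = segments.slice(0, i).join('.');
    const value = document[prefix];
    if (i === segments.length) {
      if (value !== undefined) return value;
    } else if (isRecord(value)) {
      // Descend into the sub-object and resolve the remaining key segments
      const rest = robustGet({ key: segments.slice(i).join('.'), document: value });
      if (rest !== undefined) return rest;
    }
  }
  return undefined; // no valid path matches the key
};
```

Under these assumptions, `robustGet({ key: 'a.b.c', document: { a: { 'b.c': 'my-value' } } })` resolves to `'my-value'`, which lodash `get` would miss.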

## Test Setup and Results

To index test documents, add the following code in `security_solution/scripts/quickstart/scratchpad.ts`. Modify the index name and `numFields` as needed for different test variations. Run the script with `node x-pack/plugins/security_solution/scripts/quickstart/run.js --kibana http://localhost:5601/<YOUR KIBANA BASE PATH>`

```ts
const index = 'test-data-8k-fields';
const timestamp = new Date('2024-08-21T20:36:37.114Z');
const response = await esClient.indices.create({
  index,
  mappings: addPropertiesToMapping(
    getEcsMapping(),
    generateLargeMappingProperties({ size: 10000 })
  ),
  settings: getSettings({ maxFields: 20000 }),
});

const docs = range(10).map((idx) => {
  const doc = buildLargeDocument({
    numFields: 8000,
    fieldSize: 5,
  });
  addTimestampToDoc({
    timestamp: new Date(timestamp.getTime() + idx),
    doc,
  });
  addFieldToDoc({
    fieldName: 'host.name',
    fieldValue: 'test host',
    doc,
  });
  return doc;
});

await esClient.bulk({
  index,
  operations: docs.flatMap((doc) => [{ index: {} }, doc]),
});
```

To index nested test documents, replace the `buildLargeDocument` call in the code above with the function call below. `levels` controls the depth of nesting, so the total number of fields will be `fieldsPerObject ^ levels` (10,000 as shown below).

```ts
buildLargeNestedDocument({ fieldsPerObject: 10, levels: 4, fieldSize: 5 });
addTimestampToDoc({
  timestamp: new Date(timestamp.getTime() + idx),
  doc,
});
```

To measure rule performance, run Kibana with `node --inspect scripts/kibana --dev` instead of the usual `yarn start`. Go to `chrome://inspect/#devices` in your browser, open Dev tools and start a performance profile capture, then run a rule preview with the following code in `scratchpad.ts` instead of the document indexing code above:

```ts
const previewPromises = range(1).map(
  (idx) => () =>
    detectionsClient.rulePreview({
      body: {
        ...getBasicRuleMetadata(),
        type: 'query',
        timeframeEnd: '2024-08-21T20:37:37.114Z',
        invocationCount: 1,
        from: 'now-6m',
        interval: '5m',
        index: [index],
        query: '*',
      },
    })
);

const results = (await concurrentlyExec(previewPromises, 50)).map(
  (result) => result.data.logs
);
console.log(JSON.stringify(results));
```

To test multiple simultaneous rule executions, change `range(1)` to `range(number of simultaneous rule executions)`; the test results below all use a single rule execution. I ran the test multiple times consecutively to verify that results were relatively consistent; times were always within about 10% of the values reported here.

### 8k fields

`main` - 500ms to build 10 alerts, 1300ms rule duration as reported by rule preview

`optimize-alert-creation` - 126ms to build 10 alerts, 1055ms rule duration as reported by rule preview

### 1k fields

`main` - 400ms to build 10 alerts, 850ms total rule duration as reported by rule preview
Yellow blocks at the bottom of the profiler graph are GC activity - a very inefficient GC pattern, with lots of separate collections

`optimize-alert-creation` - 26ms to build 10 alerts, 578ms total rule duration as reported by rule preview

### 100 fields

`main` - 30ms to build 10 alerts

`optimize-alert-creation` - 12ms to build 10 alerts

### 10k fields, 4 levels of nesting (10 fields per object on each level)

`main` - **16s to build 10 alerts** (!!)

`optimize-alert-creation` - 800ms to build 10 alerts (20x speedup)

### 1k fields, 3 levels of nesting (10 fields per object on each level)

`main` - 200ms to build 10 alerts

`optimize-alert-creation` - 80ms to build 10 alerts

@marshallmain marshallmain marked this pull request as ready for review September 11, 2024 18:49
@marshallmain marshallmain requested review from a team as code owners September 11, 2024 18:49
@marshallmain marshallmain requested a review from rylnd September 11, 2024 18:49
@marshallmain marshallmain added Team:Detection Engine Security Solution Detection Engine Area v8.16.0 labels Sep 11, 2024
@elasticmachine (Contributor): Pinging @elastic/security-detection-engine (Team:Detection Engine)

@marshallmain marshallmain added the Team: SecuritySolution Security Solutions Team working on SIEM, Endpoint, Timeline, Resolver, etc. label Sep 11, 2024
@elasticmachine (Contributor): Pinging @elastic/security-solution (Team: SecuritySolution)

@vitaliidm (Contributor) left a comment:

Impressive work, @marshallmain

A few small comments from me.

if (isArray(document) && document.length > 0) {
document.slice().forEach((value) => {
traverseAndDeleteInObj(value, documentKey, parent, parentPath);
export const traverseDoc = <T extends SourceFieldRecord>(document: T) => {
vitaliidm: should it be renamed to address the fact that it does not simply traverse the object but also mutates it?

const {
result: validatedSource,
removed: removedSourceFields,
fieldsToAdd,
vitaliidm: why do we need to return `fieldsToAdd` and then set the fields a few lines below?

    fieldsToAdd.forEach(({ key, value }) => {
      validatedSource[key] = value;
    });

can't it be done within `traverseDoc`? Then `validatedSource` would already contain those fields

marshallmain (author) replied:

We could, but to insert the new fields in an arbitrary location traverseDoc would need to also keep a reference to the top level object as it descends through the levels recursively. I felt like that would make the recursive part harder to follow since right now the only "global context" the recursive function needs is the path - the mutations on each sub-object are independent from any parents.

vitaliidm replied:

it can be implemented within

```ts
export const traverseAndMutateDoc = <T extends SourceFieldRecord>(document: T) => {
  return internalTraverseAndMutateDoc({
    document,
    path: [],
    topLevel: true,
    removed: [],
    fieldsToAdd: [],
  });
};
```

something like this - encapsulating the logic within `traverseAndMutateDoc`, with `internalTraverseAndMutateDoc` staying the same:

```ts
export const traverseAndMutateDoc = <T extends SourceFieldRecord>(document: T) => {
  const traversedDoc = internalTraverseAndMutateDoc({
    document,
    path: [],
    topLevel: true,
    removed: [],
    fieldsToAdd: [],
  });

  traversedDoc.fieldsToAdd.forEach(({ key, value }) => {
    traversedDoc.result[key] = value;
  });

  return traversedDoc;
};
```

marshallmain (author) replied:

Ah I see, yeah I like that better. Made the change 👍


if (robustGet({ key: EVENT_KIND, document: validatedSource }) != null) {
robustSet({ key: EVENT_KIND, document: validatedSource, valueToSet: undefined });
}
vitaliidm: Can you add a comment here to explain why we set `event` fields to `undefined`?

* is computationally expensive so we only want to traverse it once, therefore a few distinct cases are handled in this function:
* 1. Fields that we must explicitly remove, like `kibana` and `signal` fields, are removed from the document.
* 2. Fields that are incompatible with ECS are removed.
* 3. All `event.*` fields are collected so we can copy them to `kibana.alert.original_event` after traversing the document.
vitaliidm: `event` fields are returned in `fieldsToAdd`? It would be great to highlight that in the function description

'event.agent_id_status': 'verified',
'event.ingested': '2022-03-23T16:50:28.994Z',
'event.dataset': 'elastic_agent.filebeat',
event: {
vitaliidm: could this be considered a breaking change? If some workflows rely on accessing dot-notation properties from the object, they might not work well with nested objects

marshallmain (author) replied:

Potentially yes, but I think of it more as a "possibly breaking bug fix". For all other fields copied from the source docs, we keep the dot or nested formatting from the source doc. event fields should have followed this pattern too tbh but didn't. I think we can defend this kind of change if needed by claiming that both dot and nested are valid notations in Elasticsearch and the choice is an implementation detail, not part of the contract with consumers.

vitaliidm replied: Probably - we would then need to ensure this part is highlighted in the release notes

@@ -6,15 +6,20 @@
*/
vitaliidm: I think it would be great to rename this file too, so it reflects the new name


@kibana-ci (Collaborator):

💛 Build succeeded, but was flaky

Failed CI Steps

Test Failures

  • [job] [logs] Jest Tests #4 / GlobalDataTagsTable should add new tags

Metrics [docs]: ✅ unchanged

To update your PR or re-run it, just comment with `@elasticmachine merge upstream`.

@marshallmain marshallmain merged commit a0421ab into elastic:main Sep 18, 2024
40 checks passed
@marshallmain marshallmain deleted the optimize-alert-creation branch September 18, 2024 12:36
kibanamachine pushed a commit to kibanamachine/kibana that referenced this pull request Sep 18, 2024
…alerts from source documents (elastic#192287)

Co-authored-by: Vitalii Dmyterko <[email protected]>
(cherry picked from commit a0421ab)
@kibanamachine (Contributor):

💚 All backports created successfully

Backport branch: `8.x`

Note: Successful backport PRs will be merged automatically after passing CI.

Questions? Please refer to the Backport tool documentation.

kibanamachine added a commit that referenced this pull request Sep 18, 2024
…ating alerts from source documents (#192287) (#193287)

# Backport

This will backport the following commits from `main` to `8.x`:
- [Security Solution][Detection Engine] Reduce CPU usage when creating alerts from source documents (#192287)

<!--- Backport version: 9.4.3 -->

### Questions ?
Please refer to the [Backport tool
documentation](https://github.com/sqren/backport)

`strip_non_ecs_fields.test.ts` for examples\r\n- `event.*` fields are
copied from source docs to alerts with the\r\noriginal notation instead
of transforming to dot notation. This is more\r\nin line with how we
preserve the notation for the rest of the fields\r\nfrom `source`. The
exception is still `event.kind`, which we have always\r\npopulated as
`event.kind: 'signal'` for alerts and we use dot notation\r\nfor that
field.\r\n\r\n## Test Setup and Results\r\n\r\nTo index test documents,
add the following code
in\r\n`security_solution/scripts/quickstart/scratchpad.ts`. Modify the
index\r\nname and `numFields` as needed for different test variations.
Run the\r\nscript with
`node\r\nx-pack/plugins/security_solution/scripts/quickstart/run.js
--kibana\r\nhttp://localhost:5601/<YOUR KIBANA BASE
PATH>`\r\n```\r\nconst index = 'test-data-8k-fields';\r\n const
timestamp = new Date('2024-08-21T20:36:37.114Z');\r\n const response =
await esClient.indices.create({\r\n index,\r\n mappings:
addPropertiesToMapping(\r\n getEcsMapping(),\r\n
generateLargeMappingProperties({ size: 10000 })\r\n ),\r\n settings:
getSettings({ maxFields: 20000 }),\r\n });\r\n\r\n const docs =
range(10).map((idx) => {\r\n const doc = buildLargeDocument({\r\n
numFields: 8000,\r\n fieldSize: 5,\r\n });\r\n addTimestampToDoc({\r\n
timestamp: new Date(timestamp.getTime() + idx),\r\n doc,\r\n });\r\n
addFieldToDoc({\r\n fieldName: 'host.name',\r\n fieldValue: 'test
host',\r\n doc,\r\n });\r\n return doc;\r\n });\r\n\r\n await
esClient.bulk({\r\n index,\r\n operations: docs.flatMap((doc) => [{
index: {} }, doc]),\r\n });\r\n```\r\n\r\nTo index nested test
documents, replace the `buildLargeDocument` call in\r\nthe code above
with the function call below. `levels` controls the depth\r\nof nesting,
so the total number of fields will be `fieldsPerObject ^\r\nlevels`
(10,000 as shown below).\r\n```\r\nbuildLargeNestedDocument({
fieldsPerObject: 10, levels: 4, fieldSize: 5 });\r\n
addTimestampToDoc({\r\n timestamp: new Date(timestamp.getTime() +
idx),\r\n doc,\r\n });\r\n```\r\n\r\nTo measure rule performance, run
Kibana with `node --inspect\r\nscripts/kibana --dev` instead of the
usual `yarn start`. Go to\r\n`chrome://inspect/#devices` in your
browser, open Dev tools and start a\r\nperformance profile capture, then
run a rule preview with the following\r\ncode in `scratchpad.ts` instead
of the document indexing code above:\r\n```\r\nconst previewPromises =
range(1).map(\r\n (idx) => () =>\r\n detectionsClient.rulePreview({\r\n
body: {\r\n ...getBasicRuleMetadata(),\r\n type: 'query',\r\n
timeframeEnd: '2024-08-21T20:37:37.114Z',\r\n invocationCount: 1,\r\n
from: 'now-6m',\r\n interval: '5m',\r\n index: [index],\r\n query:
'*',\r\n },\r\n })\r\n );\r\n\r\n const results = (await
concurrentlyExec(previewPromises, 50)).map(\r\n (result) =>
result.data.logs\r\n );\r\n
console.log(JSON.stringify(results));\r\n```\r\nTo test multiple
simultaneous rule executions, change `range(1)` to\r\n`range(number of
simultaneous rule executions)`, but test results below\r\nall use a
single rule execution. I ran the test multiple times\r\nconsecutively to
verify that results were relatively consistent, times\r\nwere always
within about 10% of the values reported here.\r\n\r\n### 8k
fields\r\n`main` - 500ms to build 10 alerts, 1300ms rule duration as
reported by\r\nrule preview\r\n<img width=\"1558\" alt=\"Screenshot
2024-09-11 at 9 20
57 AM\"\r\nsrc=\"https://github.com/user-attachments/assets/2a72e0d3-74cc-41c4-b51d-80e4ca058978\">\r\n\r\n`optimize-alert-creation`
- 126ms to build 10 alerts, 1055ms rule\r\nduration as reported by rule
preview\r\n<img width=\"1550\" alt=\"Screenshot 2024-09-11 at 2 33
24 PM\"\r\nsrc=\"https://github.com/user-attachments/assets/c1217a47-872b-4291-a56e-ab48aead21a6\">\r\n\r\n\r\n###
1k fields\r\n`main` - 400ms to build 10 alerts, 850ms total rule
duration as reported\r\nby rule preview\r\n<img width=\"1552\"
alt=\"Screenshot 2024-09-11 at 12 20
45 PM\"\r\nsrc=\"https://github.com/user-attachments/assets/470c1b05-672b-4e0c-9c77-2e3a82e9e6d6\">\r\nYellow
blocks at the bottom of the graph are GC activity - seems to be
a\r\nvery inefficient GC activity pattern, lots of separate
collections\r\n\r\n`optimize-alert-creation` - 26ms to build 10 alerts,
578ms total rule\r\nduration as reported by rule preview\r\n<img
width=\"1550\" alt=\"Screenshot 2024-09-11 at 2 28
57 PM\"\r\nsrc=\"https://github.com/user-attachments/assets/76523362-b3dd-40cb-92e6-1dcaeabffef1\">\r\n\r\n###
100 fields\r\n`main` - 30ms to build 10 alerts\r\n<img width=\"1554\"
alt=\"Screenshot 2024-09-11 at 12 24
00 PM\"\r\nsrc=\"https://github.com/user-attachments/assets/dde0a8ce-11bd-44e9-b441-c0b2cdf2fe1b\">\r\n\r\n`optimize-alert-creation`
- 12ms to build 10 alerts\r\n<img width=\"1551\" alt=\"Screenshot
2024-09-11 at 2 27
26 PM\"\r\nsrc=\"https://github.com/user-attachments/assets/abbb561a-9222-458a-932c-cd233ce4163d\">\r\n\r\n###
10k fields, 4 levels of nesting (10 fields per object on each
level)\r\n\r\n`main` - **16s to build 10 alerts** (!!)\r\n<img
width=\"1554\" alt=\"Screenshot 2024-09-11 at 12 24
00 PM\"\r\nsrc=\"https://github.com/user-attachments/assets/eee1acbe-9d82-46ce-85dc-b106d33b599e\">\r\n\r\n`optimize-alert-creation`
- 800ms to build 10 alerts (20x speedup)\r\n<img width=\"1552\"
alt=\"Screenshot 2024-09-11 at 1 01
12 PM\"\r\nsrc=\"https://github.com/user-attachments/assets/f8a3e655-8628-4c37-bb62-9ad246c14393\">\r\n\r\n###
1k fields, 3 levels of nesting (10 fields per object on each
level)\r\n`main` - 200ms to build 10 alerts\r\n<img width=\"1550\"
alt=\"Screenshot 2024-09-11 at 12 42
55 PM\"\r\nsrc=\"https://github.com/user-attachments/assets/2530b71d-39be-4575-83e8-0de3977dc71c\">\r\n\r\n`optimize-alert-creation`
- 80ms to build 10 alerts\r\n<img width=\"1553\" alt=\"Screenshot
2024-09-11 at 2 24
15 PM\"\r\nsrc=\"https://github.com/user-attachments/assets/1c4ffd36-97d5-49ac-ba48-0db05287455a\">\r\n\r\n---------\r\n\r\nCo-authored-by:
Vitalii Dmyterko
<[email protected]>","sha":"a0421ab81fc41ae13d2269f81c885865e93e85dd","branchLabelMapping":{"^v9.0.0$":"main","^v8.16.0$":"8.x","^v(\\d+).(\\d+).\\d+$":"$1.$2"}},"sourcePullRequest":{"labels":["release_note:skip","v9.0.0","Team:
BACKPORT-->

Co-authored-by: Marshall Main <[email protected]>
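The `Object.keys` optimization in the summary comes down to avoiding the per-property `[key, value]` pair arrays that `Object.entries` materializes. A minimal illustration of the pattern change (not Kibana code; the function names are made up for the example):

```typescript
// Object.entries allocates a fresh two-element array for every property;
// Object.keys plus direct property access avoids those allocations,
// which matters when documents have thousands of fields.
function sumLengthsEntries(doc: Record<string, string>): number {
  let total = 0;
  for (const [, value] of Object.entries(doc)) {
    total += value.length; // each iteration touches a temporary [key, value] array
  }
  return total;
}

function sumLengthsKeys(doc: Record<string, string>): number {
  let total = 0;
  for (const key of Object.keys(doc)) {
    total += doc[key].length; // fetch each property individually, no pair arrays
  }
  return total;
}

const doc = { 'host.name': 'web-1', 'agent.name': 'agent1' };
// Both produce the same result; the keys version allocates less per field.
console.log(sumLengthsEntries(doc) === sumLengthsKeys(doc)); // true
```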
Labels
release_note:skip Skip the PR/issue when compiling release notes Team:Detection Engine Security Solution Detection Engine Area Team: SecuritySolution Security Solutions Team working on SIEM, Endpoint, Timeline, Resolver, etc. v8.16.0 v9.0.0
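The `isIgnored` split described in the PR summary - exact field names in a constant-time lookup structure plus an array of regular expressions - can be sketched like this. The `/.../` wrapping convention and the helper name are illustrative assumptions, not the actual implementation:

```typescript
// Hypothetical sketch: split an ignore list into exact names (O(1) Set lookups)
// and wildcard patterns (regexes checked only after the Set misses).
function buildIgnoreChecker(ignoreFields: string[]): (key: string) => boolean {
  const exact = new Set<string>();
  const patterns: RegExp[] = [];
  for (const field of ignoreFields) {
    if (field.length > 2 && field.startsWith('/') && field.endsWith('/')) {
      // Treat "/.../" entries as regular expressions (illustrative convention)
      patterns.push(new RegExp(field.slice(1, -1)));
    } else {
      exact.add(field);
    }
  }
  return (key: string): boolean =>
    exact.has(key) || patterns.some((pattern) => pattern.test(key));
}

const isIgnored = buildIgnoreChecker(['signal.rule.id', '/^_/']);
console.log(isIgnored('signal.rule.id')); // true - exact name hit
console.log(isIgnored('_ignored')); // true - matches /^_/
console.log(isIgnored('host.name')); // false - passes the filter
```

With ~80 legacy signal field names on the list, the Set lookup makes the common miss case cheap, and the regexes only run for keys that are not exact matches.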