Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[8.x] [Security Solution][Detection Engine] Reduce CPU usage when creating alerts from source documents (#192287) #193287

Merged
merged 1 commit into from
Sep 18, 2024

Conversation

kibanamachine
Copy link
Contributor

Backport

This will backport the following commits from main to 8.x:

Questions ?

Please refer to the Backport tool documentation

…alerts from source documents (elastic#192287)

## Summary

Parent issue: elastic/security-team#10106

tl;dr: This PR reduces the CPU time used by Kibana to transform source
documents into alerts by ~2x for relatively small docs and up to 20x in
some pathological cases.

| Test Case | Time on `main` | Time on PR branch | Speedup |
| ---- | ---- | ---- | ---- |
| 8k fields | 500ms | 126ms | 4x |
| 1k fields | 400ms | 26ms | 16x |
| 100 fields | 30ms | 12ms | 2.5x |
| 10k fields nested | 16s | 800ms | 20x |
| 1k fields nested | 200ms | 80ms | 2.5x |

Detection rules generally aim to keep computationally expensive tasks in
Elasticsearch since Kibana's NodeJS architecture is not well suited to
CPU bound logic. However, we are (for most rule types) currently unable
to avoid loading full documents from source indices in ES into Kibana,
transforming them into alerts, and sending those transformed documents
back to Elasticsearch for indexing into the alerts index. When the
source documents contain many fields, this flow is expensive and
contributes to event loop blockages.

This PR refactors and optimizes the alert creation process to simplify
the code and reduce the CPU burden. Optimizations include:
- Mutating the source document object into an alert instead of copying
it. The elasticsearch-js client already creates an object in memory for
the entire source document in the query response, so we can simply use
that instead of making another copy.
- Use `Object.keys` to iterate over all properties of the object,
fetching each property individually, instead of `Object.entries`.
- Traverse the document only once - previously we iterated over at least
the top level keys 4 times, in `filterSource`, `stripNonEcsFields`,
`additionalAlertFields`, and `buildFieldsKeyAsArrayMap`. Now
`traverseDoc` handles 3 of those cases together in one pass, and we
avoid needing `buildFieldsKeyAsArrayMap` by fetching values on demand
with any mix of dot and nested notation.
- `buildFieldsKeyAsArrayMap` turns a key in dot notation into an array
the represents the nested notation path to the object, accounting for
any intermediate keys that might have dots in them. However, building
this map for an entire large object is expensive, and I found that it's
significantly faster for us to just check for potential intermediate
dot-notation keys when fetching instead of precomputing the map.
- Happy path optimization: for `mergeMissingFieldsWithSource` (the
default merge option) typically the vast majority of fields will already
be in `_source` and we can skip the rest of the logic for merging that
field from `fields` into `_source`. Now we'll check if a field key is
already present in `_source` as the very first operation when merging
fields into `_source` and skip to the next field if it is.
- `filterFieldEntry` - `isIgnored` splits ignored fields into a map of
ordinary field names and an array of regular expressions so we can more
efficiently check if a field key is on the ignore list. The ignore list
includes ~80 field names from the legacy signal mappings, plus any
user-added ignore fields. This function is also called much less often
now that we only call `filterFieldEntry` **after** checking to ensure
that the field key is not already in `_source`.

Supporting changes (necessary, but not optimizations on their own):
- `robustGet`, `robustSet`, `robustPathIsValid` - "robust" functions in
the sense that they can handle any mix of dot and nested notation.
Whereas lodash `get` would not find `my-value` in `{ a: { 'b.c':
'my-value' }}` given the key `a.b.c`, `robustGet` will find it by
popping prefixes from the key and checking each combination until it
find the value or verifies that there is no valid path matching the key.
- Empty arrays are now treated as "no value" when creating alerts in
order to make the logic more consistent. All empty arrays are silently
(not reported in "removed fields") removed from alert documents. This
resolves an inconsistency where an empty array in an incompatible field
(e.g. `agent.name.text: []`) would be reported as a removed empty array,
whereas an array with values in an incompatible field (e.g.
`agent.name.text: ['agent1', 'agent2']`) would report two individual
values removed from `agent.name.text` but not report the array as
removed, even though the whole field (array included) is gone.
  - The new behavior would be:
- `agent.name.text: []` -> completely removed without reporting the
array removed, instead of removed with reporting the array removed
- `agent.name.text: ['agent1', 'agent2']` -> same as before: completely
removed, reporting `agent1` and `agent2` removed, not reporting the
array removed
  - See tests in `strip_non_ecs_fields.test.ts` for examples
- `event.*` fields are copied from source docs to alerts with the
original notation instead of transforming to dot notation. This is more
in line with how we preserve the notation for the rest of the fields
from `source`. The exception is still `event.kind`, which we have always
populated as `event.kind: 'signal'` for alerts and we use dot notation
for that field.

## Test Setup and Results

To index test documents, add the following code in
`security_solution/scripts/quickstart/scratchpad.ts`. Modify the index
name and `numFields` as needed for different test variations. Run the
script with `node
x-pack/plugins/security_solution/scripts/quickstart/run.js --kibana
http://localhost:5601/<YOUR KIBANA BASE PATH>`
```
const index = 'test-data-8k-fields';
      const timestamp = new Date('2024-08-21T20:36:37.114Z');
      const response = await esClient.indices.create({
        index,
        mappings: addPropertiesToMapping(
          getEcsMapping(),
          generateLargeMappingProperties({ size: 10000 })
        ),
        settings: getSettings({ maxFields: 20000 }),
      });

      const docs = range(10).map((idx) => {
        const doc = buildLargeDocument({
          numFields: 8000,
          fieldSize: 5,
        });
        addTimestampToDoc({
          timestamp: new Date(timestamp.getTime() + idx),
          doc,
        });
        addFieldToDoc({
          fieldName: 'host.name',
          fieldValue: 'test host',
          doc,
        });
        return doc;
      });

      await esClient.bulk({
        index,
        operations: docs.flatMap((doc) => [{ index: {} }, doc]),
      });
```

To index nested test documents, replace the `buildLargeDocument` call in
the code above with the function call below. `levels` controls the depth
of nesting, so the total number of fields will be `fieldsPerObject ^
levels` (10,000 as shown below).
```
buildLargeNestedDocument({ fieldsPerObject: 10, levels: 4, fieldSize: 5 });
        addTimestampToDoc({
          timestamp: new Date(timestamp.getTime() + idx),
          doc,
        });
```

To measure rule performance, run Kibana with `node --inspect
scripts/kibana --dev` instead of the usual `yarn start`. Go to
`chrome://inspect/#devices` in your browser, open Dev tools and start a
performance profile capture, then run a rule preview with the following
code in `scratchpad.ts` instead of the document indexing code above:
```
const previewPromises = range(1).map(
        (idx) => () =>
          detectionsClient.rulePreview({
            body: {
              ...getBasicRuleMetadata(),
              type: 'query',
              timeframeEnd: '2024-08-21T20:37:37.114Z',
              invocationCount: 1,
              from: 'now-6m',
              interval: '5m',
              index: [index],
              query: '*',
            },
          })
      );

      const results = (await concurrentlyExec(previewPromises, 50)).map(
        (result) => result.data.logs
      );
      console.log(JSON.stringify(results));
```
To test multiple simultaneous rule executions, change `range(1)` to
`range(number of simultaneous rule executions)`, but test results below
all use a single rule execution. I ran the test multiple times
consecutively to verify that results were relatively consistent, times
were always within about 10% of the values reported here.

### 8k fields
`main` - 500ms to build 10 alerts, 1300ms rule duration as reported by
rule preview
<img width="1558" alt="Screenshot 2024-09-11 at 9 20 57 AM"
src="https://github.com/user-attachments/assets/2a72e0d3-74cc-41c4-b51d-80e4ca058978">

`optimize-alert-creation` - 126ms to build 10 alerts, 1055ms rule
duration as reported by rule preview
<img width="1550" alt="Screenshot 2024-09-11 at 2 33 24 PM"
src="https://github.com/user-attachments/assets/c1217a47-872b-4291-a56e-ab48aead21a6">

### 1k fields
`main` - 400ms to build 10 alerts, 850ms total rule duration as reported
by rule preview
<img width="1552" alt="Screenshot 2024-09-11 at 12 20 45 PM"
src="https://github.com/user-attachments/assets/470c1b05-672b-4e0c-9c77-2e3a82e9e6d6">
Yellow blocks at the bottom of the graph are GC activity - seems to be a
very inefficient GC activity pattern, lots of separate collections

`optimize-alert-creation` - 26ms to build 10 alerts, 578ms total rule
duration as reported by rule preview
<img width="1550" alt="Screenshot 2024-09-11 at 2 28 57 PM"
src="https://github.com/user-attachments/assets/76523362-b3dd-40cb-92e6-1dcaeabffef1">

### 100 fields
`main` - 30ms to build 10 alerts
<img width="1554" alt="Screenshot 2024-09-11 at 12 24 00 PM"
src="https://github.com/user-attachments/assets/dde0a8ce-11bd-44e9-b441-c0b2cdf2fe1b">

`optimize-alert-creation` - 12ms to build 10 alerts
<img width="1551" alt="Screenshot 2024-09-11 at 2 27 26 PM"
src="https://github.com/user-attachments/assets/abbb561a-9222-458a-932c-cd233ce4163d">

### 10k fields, 4 levels of nesting (10 fields per object on each level)

`main` - **16s to build 10 alerts** (!!)
<img width="1554" alt="Screenshot 2024-09-11 at 12 24 00 PM"
src="https://github.com/user-attachments/assets/eee1acbe-9d82-46ce-85dc-b106d33b599e">

`optimize-alert-creation` - 800ms to build 10 alerts (20x speedup)
<img width="1552" alt="Screenshot 2024-09-11 at 1 01 12 PM"
src="https://github.com/user-attachments/assets/f8a3e655-8628-4c37-bb62-9ad246c14393">

### 1k fields, 3 levels of nesting (10 fields per object on each level)
`main` - 200ms to build 10 alerts
<img width="1550" alt="Screenshot 2024-09-11 at 12 42 55 PM"
src="https://github.com/user-attachments/assets/2530b71d-39be-4575-83e8-0de3977dc71c">

`optimize-alert-creation` - 80ms to build 10 alerts
<img width="1553" alt="Screenshot 2024-09-11 at 2 24 15 PM"
src="https://github.com/user-attachments/assets/1c4ffd36-97d5-49ac-ba48-0db05287455a">

---------

Co-authored-by: Vitalii Dmyterko <[email protected]>
(cherry picked from commit a0421ab)
@kibana-ci
Copy link
Collaborator

💛 Build succeeded, but was flaky

Failed CI Steps

Metrics [docs]

✅ unchanged

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

cc @marshallmain

@kibanamachine kibanamachine merged commit e3ae68f into elastic:8.x Sep 18, 2024
36 of 37 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants