Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
[Security Solution][Detection Engine] Reduce CPU usage when creating …
…alerts from source documents (#192287) ## Summary Parent issue: elastic/security-team#10106 tl;dr: This PR reduces the CPU time used by Kibana to transform source documents into alerts by ~2x for relatively small docs and up to 20x in some pathological cases. | Test Case | Time on `main` | Time on PR branch | Speedup | | ---- | ---- | ---- | ---- | | 8k fields | 500ms | 126ms | 4x | | 1k fields | 400ms | 26ms | 16x | | 100 fields | 30ms | 12ms | 2.5x | | 10k fields nested | 16s | 800ms | 20x | | 1k fields nested | 200ms | 80ms | 2.5x | Detection rules generally aim to keep computationally expensive tasks in Elasticsearch since Kibana's NodeJS architecture is not well suited to CPU bound logic. However, we are (for most rule types) currently unable to avoid loading full documents from source indices in ES into Kibana, transforming them into alerts, and sending those transformed documents back to Elasticsearch for indexing into the alerts index. When the source documents contain many fields, this flow is expensive and contributes to event loop blockages. This PR refactors and optimizes the alert creation process to simplify the code and reduce the CPU burden. Optimizations include: - Mutating the source document object into an alert instead of copying it. The elasticsearch-js client already creates an object in memory for the entire source document in the query response, so we can simply use that instead of making another copy. - Use `Object.keys` to iterate over all properties of the object, fetching each property individually, instead of `Object.entries`. - Traverse the document only once - previously we iterated over at least the top level keys 4 times, in `filterSource`, `stripNonEcsFields`, `additionalAlertFields`, and `buildFieldsKeyAsArrayMap`. Now `traverseDoc` handles 3 of those cases together in one pass, and we avoid needing `buildFieldsKeyAsArrayMap` by fetching values on demand with any mix of dot and nested notation. - `buildFieldsKeyAsArrayMap` turns a key in dot notation into an array the represents the nested notation path to the object, accounting for any intermediate keys that might have dots in them. However, building this map for an entire large object is expensive, and I found that it's significantly faster for us to just check for potential intermediate dot-notation keys when fetching instead of precomputing the map. - Happy path optimization: for `mergeMissingFieldsWithSource` (the default merge option) typically the vast majority of fields will already be in `_source` and we can skip the rest of the logic for merging that field from `fields` into `_source`. Now we'll check if a field key is already present in `_source` as the very first operation when merging fields into `_source` and skip to the next field if it is. - `filterFieldEntry` - `isIgnored` splits ignored fields into a map of ordinary field names and an array of regular expressions so we can more efficiently check if a field key is on the ignore list. The ignore list includes ~80 field names from the legacy signal mappings, plus any user-added ignore fields. This function is also called much less often now that we only call `filterFieldEntry` **after** checking to ensure that the field key is not already in `_source`. Supporting changes (necessary, but not optimizations on their own): - `robustGet`, `robustSet`, `robustPathIsValid` - "robust" functions in the sense that they can handle any mix of dot and nested notation. Whereas lodash `get` would not find `my-value` in `{ a: { 'b.c': 'my-value' }}` given the key `a.b.c`, `robustGet` will find it by popping prefixes from the key and checking each combination until it find the value or verifies that there is no valid path matching the key. - Empty arrays are now treated as "no value" when creating alerts in order to make the logic more consistent. All empty arrays are silently (not reported in "removed fields") removed from alert documents. This resolves an inconsistency where an empty array in an incompatible field (e.g. `agent.name.text: []`) would be reported as a removed empty array, whereas an array with values in an incompatible field (e.g. `agent.name.text: ['agent1', 'agent2']`) would report two individual values removed from `agent.name.text` but not report the array as removed, even though the whole field (array included) is gone. - The new behavior would be: - `agent.name.text: []` -> completely removed without reporting the array removed, instead of removed with reporting the array removed - `agent.name.text: ['agent1', 'agent2']` -> same as before: completely removed, reporting `agent1` and `agent2` removed, not reporting the array removed - See tests in `strip_non_ecs_fields.test.ts` for examples - `event.*` fields are copied from source docs to alerts with the original notation instead of transforming to dot notation. This is more in line with how we preserve the notation for the rest of the fields from `source`. The exception is still `event.kind`, which we have always populated as `event.kind: 'signal'` for alerts and we use dot notation for that field. ## Test Setup and Results To index test documents, add the following code in `security_solution/scripts/quickstart/scratchpad.ts`. Modify the index name and `numFields` as needed for different test variations. Run the script with `node x-pack/plugins/security_solution/scripts/quickstart/run.js --kibana http://localhost:5601/<YOUR KIBANA BASE PATH>` ``` const index = 'test-data-8k-fields'; const timestamp = new Date('2024-08-21T20:36:37.114Z'); const response = await esClient.indices.create({ index, mappings: addPropertiesToMapping( getEcsMapping(), generateLargeMappingProperties({ size: 10000 }) ), settings: getSettings({ maxFields: 20000 }), }); const docs = range(10).map((idx) => { const doc = buildLargeDocument({ numFields: 8000, fieldSize: 5, }); addTimestampToDoc({ timestamp: new Date(timestamp.getTime() + idx), doc, }); addFieldToDoc({ fieldName: 'host.name', fieldValue: 'test host', doc, }); return doc; }); await esClient.bulk({ index, operations: docs.flatMap((doc) => [{ index: {} }, doc]), }); ``` To index nested test documents, replace the `buildLargeDocument` call in the code above with the function call below. `levels` controls the depth of nesting, so the total number of fields will be `fieldsPerObject ^ levels` (10,000 as shown below). ``` buildLargeNestedDocument({ fieldsPerObject: 10, levels: 4, fieldSize: 5 }); addTimestampToDoc({ timestamp: new Date(timestamp.getTime() + idx), doc, }); ``` To measure rule performance, run Kibana with `node --inspect scripts/kibana --dev` instead of the usual `yarn start`. Go to `chrome://inspect/#devices` in your browser, open Dev tools and start a performance profile capture, then run a rule preview with the following code in `scratchpad.ts` instead of the document indexing code above: ``` const previewPromises = range(1).map( (idx) => () => detectionsClient.rulePreview({ body: { ...getBasicRuleMetadata(), type: 'query', timeframeEnd: '2024-08-21T20:37:37.114Z', invocationCount: 1, from: 'now-6m', interval: '5m', index: [index], query: '*', }, }) ); const results = (await concurrentlyExec(previewPromises, 50)).map( (result) => result.data.logs ); console.log(JSON.stringify(results)); ``` To test multiple simultaneous rule executions, change `range(1)` to `range(number of simultaneous rule executions)`, but test results below all use a single rule execution. I ran the test multiple times consecutively to verify that results were relatively consistent, times were always within about 10% of the values reported here. ### 8k fields `main` - 500ms to build 10 alerts, 1300ms rule duration as reported by rule preview <img width="1558" alt="Screenshot 2024-09-11 at 9 20 57 AM" src="https://github.com/user-attachments/assets/2a72e0d3-74cc-41c4-b51d-80e4ca058978"> `optimize-alert-creation` - 126ms to build 10 alerts, 1055ms rule duration as reported by rule preview <img width="1550" alt="Screenshot 2024-09-11 at 2 33 24 PM" src="https://github.com/user-attachments/assets/c1217a47-872b-4291-a56e-ab48aead21a6"> ### 1k fields `main` - 400ms to build 10 alerts, 850ms total rule duration as reported by rule preview <img width="1552" alt="Screenshot 2024-09-11 at 12 20 45 PM" src="https://github.com/user-attachments/assets/470c1b05-672b-4e0c-9c77-2e3a82e9e6d6"> Yellow blocks at the bottom of the graph are GC activity - seems to be a very inefficient GC activity pattern, lots of separate collections `optimize-alert-creation` - 26ms to build 10 alerts, 578ms total rule duration as reported by rule preview <img width="1550" alt="Screenshot 2024-09-11 at 2 28 57 PM" src="https://github.com/user-attachments/assets/76523362-b3dd-40cb-92e6-1dcaeabffef1"> ### 100 fields `main` - 30ms to build 10 alerts <img width="1554" alt="Screenshot 2024-09-11 at 12 24 00 PM" src="https://github.com/user-attachments/assets/dde0a8ce-11bd-44e9-b441-c0b2cdf2fe1b"> `optimize-alert-creation` - 12ms to build 10 alerts <img width="1551" alt="Screenshot 2024-09-11 at 2 27 26 PM" src="https://github.com/user-attachments/assets/abbb561a-9222-458a-932c-cd233ce4163d"> ### 10k fields, 4 levels of nesting (10 fields per object on each level) `main` - **16s to build 10 alerts** (!!) <img width="1554" alt="Screenshot 2024-09-11 at 12 24 00 PM" src="https://github.com/user-attachments/assets/eee1acbe-9d82-46ce-85dc-b106d33b599e"> `optimize-alert-creation` - 800ms to build 10 alerts (20x speedup) <img width="1552" alt="Screenshot 2024-09-11 at 1 01 12 PM" src="https://github.com/user-attachments/assets/f8a3e655-8628-4c37-bb62-9ad246c14393"> ### 1k fields, 3 levels of nesting (10 fields per object on each level) `main` - 200ms to build 10 alerts <img width="1550" alt="Screenshot 2024-09-11 at 12 42 55 PM" src="https://github.com/user-attachments/assets/2530b71d-39be-4575-83e8-0de3977dc71c"> `optimize-alert-creation` - 80ms to build 10 alerts <img width="1553" alt="Screenshot 2024-09-11 at 2 24 15 PM" src="https://github.com/user-attachments/assets/1c4ffd36-97d5-49ac-ba48-0db05287455a"> --------- Co-authored-by: Vitalii Dmyterko <[email protected]> (cherry picked from commit a0421ab)
- Loading branch information