[Security Solution][Detection Engine] Reduce CPU usage when creating alerts from source documents #192287
Conversation
… detected correctly, ensure original event.kind gets unset, variety of test fixes
Pinging @elastic/security-detection-engine (Team:Detection Engine)
Pinging @elastic/security-solution (Team: SecuritySolution)
Impressive work, @marshallmain
A few small comments from me
if (isArray(document) && document.length > 0) {
  document.slice().forEach((value) => {
    traverseAndDeleteInObj(value, documentKey, parent, parentPath);
export const traverseDoc = <T extends SourceFieldRecord>(document: T) => {
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should it be renamed to address the fact that it does not simply traverse the object but mutates it as well?
const {
  result: validatedSource,
  removed: removedSourceFields,
  fieldsToAdd,
Why do we need to return `fieldsToAdd` and then set them a few lines below?

fieldsToAdd.forEach(({ key, value }) => {
  validatedSource[key] = value;
});

Can't it be done within `traverseDoc`? So `validatedSource` would already contain those fields.
We could, but to insert the new fields in an arbitrary location `traverseDoc` would need to also keep a reference to the top-level object as it descends through the levels recursively. I felt like that would make the recursive part harder to follow, since right now the only "global context" the recursive function needs is the path - the mutations on each sub-object are independent from any parents.
It can be implemented within

export const traverseAndMutateDoc = <T extends SourceFieldRecord>(document: T) => {
  return internalTraverseAndMutateDoc({
    document,
    path: [],
    topLevel: true,
    removed: [],
    fieldsToAdd: [],
  });
};

something like this, encapsulating this logic within `traverseAndMutateDoc` (and `internalTraverseAndMutateDoc` would be the same):

export const traverseAndMutateDoc = <T extends SourceFieldRecord>(document: T) => {
  const traversedDoc = internalTraverseAndMutateDoc({
    document,
    path: [],
    topLevel: true,
    removed: [],
    fieldsToAdd: [],
  });
  traversedDoc.fieldsToAdd.forEach(({ key, value }) => {
    traversedDoc.result[key] = value;
  });
  return traversedDoc;
};
Ah I see, yeah I like that better. Made the change 👍
if (robustGet({ key: EVENT_KIND, document: validatedSource }) != null) {
  robustSet({ key: EVENT_KIND, document: validatedSource, valueToSet: undefined });
}
Can you add a comment here to explain why we set `event` fields to `undefined`?
* is computationally expensive so we only want to traverse it once, therefore a few distinct cases are handled in this function:
* 1. Fields that we must explicitly remove, like `kibana` and `signal` fields, are removed from the document.
* 2. Fields that are incompatible with ECS are removed.
* 3. All `event.*` fields are collected so we can copy them to `kibana.alert.original_event` after traversing the document.
`event` fields are returned in `fieldsToAdd`? Would be great to highlight it in the function description.
'event.agent_id_status': 'verified',
'event.ingested': '2022-03-23T16:50:28.994Z',
'event.dataset': 'elastic_agent.filebeat',
event: {
Could it be considered a breaking change? If some of the workflows rely on accessing dot-notation properties from the object, they might not work well with nested objects.
Potentially yes, but I think of it more as a "possibly breaking bug fix". For all other fields copied from the source docs, we keep the dot or nested formatting from the source doc. `event` fields should have followed this pattern too, tbh, but didn't. I think we can defend this kind of change if needed by claiming that both dot and nested are valid notations in Elasticsearch and the choice is an implementation detail, not part of the contract with consumers.
Probably; we would then need to ensure this part is highlighted in the release notes.
@@ -6,15 +6,20 @@
 */
I think it would be great to rename this file too, so it would reflect the new name.
💛 Build succeeded, but was flaky
To update your PR or re-run it, just comment with:
…alerts from source documents (elastic#192287)

(cherry picked from commit a0421ab)
💚 All backports created successfully
Note: Successful backport PRs will be merged automatically after passing CI. Questions? Please refer to the Backport tool documentation.
…ating alerts from source documents (#192287) (#193287)

# Backport

This will backport the following commits from `main` to `8.x`:
- [[Security Solution][Detection Engine] Reduce CPU usage when creating alerts from source documents (#192287)](#192287)

### Questions?
Please refer to the [Backport tool documentation](https://github.com/sqren/backport)

Co-authored-by: Marshall Main <[email protected]>
Summary
Parent issue: https://github.com/elastic/security-team/issues/10106
tl;dr: This PR reduces the CPU time used by Kibana to transform source documents into alerts by ~2x for relatively small docs and up to 20x in some pathological cases.
| Test Case | Time on `main` | Time on PR branch | Speedup |
| ---- | ---- | ---- | ---- |
| 8k fields | 500ms | 126ms | 4x |
| 1k fields | 400ms | 26ms | 16x |
| 100 fields | 30ms | 12ms | 2.5x |
| 10k fields nested | 16s | 800ms | 20x |
| 1k fields nested | 200ms | 80ms | 2.5x |
Detection rules generally aim to keep computationally expensive tasks in Elasticsearch since Kibana's NodeJS architecture is not well suited to CPU bound logic. However, we are (for most rule types) currently unable to avoid loading full documents from source indices in ES into Kibana, transforming them into alerts, and sending those transformed documents back to Elasticsearch for indexing into the alerts index. When the source documents contain many fields, this flow is expensive and contributes to event loop blockages.
This PR refactors and optimizes the alert creation process to simplify the code and reduce the CPU burden. Optimizations include:

- Mutating the source document object into an alert instead of copying it. The elasticsearch-js client already creates an object in memory for the entire source document in the query response, so we can simply use that instead of making another copy.
- Use `Object.keys` to iterate over all properties of the object, fetching each property individually, instead of `Object.entries`.
- Traverse the document only once - previously we iterated over at least the top level keys 4 times, in `filterSource`, `stripNonEcsFields`, `additionalAlertFields`, and `buildFieldsKeyAsArrayMap`. Now `traverseDoc` handles 3 of those cases together in one pass, and we avoid needing `buildFieldsKeyAsArrayMap` by fetching values on demand with any mix of dot and nested notation.
- `buildFieldsKeyAsArrayMap` turns a key in dot notation into an array that represents the nested notation path to the object, accounting for any intermediate keys that might have dots in them. However, building this map for an entire large object is expensive, and I found that it's significantly faster for us to just check for potential intermediate dot-notation keys when fetching instead of precomputing the map.
- Happy path optimization: for `mergeMissingFieldsWithSource` (the default merge option), typically the vast majority of fields will already be in `_source` and we can skip the rest of the logic for merging that field from `fields` into `_source`. Now we'll check if a field key is already present in `_source` as the very first operation when merging fields into `_source` and skip to the next field if it is (see the sketch after this list).
- `filterFieldEntry` - `isIgnored` splits ignored fields into a map of ordinary field names and an array of regular expressions so we can more efficiently check if a field key is on the ignore list. The ignore list includes ~80 field names from the legacy signal mappings, plus any user-added ignore fields. This function is also called much less often now that we only call `filterFieldEntry` **after** checking to ensure that the field key is not already in `_source`.
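A minimal sketch of that happy-path check, for illustration only: `mergeMissingFieldsSketch` and its `isIgnored` callback are hypothetical stand-ins, plain keys are used, and the real implementation relies on `robustGet`/`robustSet` and `filterFieldEntry` to handle mixed dot and nested notation.

```
// Minimal sketch (not the actual Kibana code): merge values from `fields` into
// `_source`, skipping any key that is already present before running the more
// expensive ignore-list logic.
const mergeMissingFieldsSketch = (
  source: Record<string, unknown>,
  fields: Record<string, unknown[]>,
  isIgnored: (key: string) => boolean
): Record<string, unknown> => {
  for (const key of Object.keys(fields)) {
    // Happy path: most field keys already exist in `_source`, so this cheap
    // check lets us skip straight to the next field.
    if (source[key] !== undefined) {
      continue;
    }
    if (isIgnored(key)) {
      continue;
    }
    source[key] = fields[key];
  }
  return source;
};
```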
Supporting changes (necessary, but not optimizations on their own):

- `robustGet`, `robustSet`, `robustPathIsValid` - "robust" functions in the sense that they can handle any mix of dot and nested notation. Whereas lodash `get` would not find `my-value` in `{ a: { 'b.c': 'my-value' }}` given the key `a.b.c`, `robustGet` will find it by popping prefixes from the key and checking each combination until it finds the value or verifies that there is no valid path matching the key (see the sketch after this list).
- Empty arrays are now treated as "no value" when creating alerts in order to make the logic more consistent. All empty arrays are silently (not reported in "removed fields") removed from alert documents. This resolves an inconsistency where an empty array in an incompatible field (e.g. `agent.name.text: []`) would be reported as a removed empty array, whereas an array with values in an incompatible field (e.g. `agent.name.text: ['agent1', 'agent2']`) would report two individual values removed from `agent.name.text` but not report the array as removed, even though the whole field (array included) is gone.
  - The new behavior would be:
    - `agent.name.text: []` -> completely removed without reporting the array removed, instead of removed with reporting the array removed
    - `agent.name.text: ['agent1', 'agent2']` -> same as before: completely removed, reporting `agent1` and `agent2` removed, not reporting the array removed
  - See tests in `strip_non_ecs_fields.test.ts` for examples
- `event.*` fields are copied from source docs to alerts with the original notation instead of transforming to dot notation. This is more in line with how we preserve the notation for the rest of the fields from `source`. The exception is still `event.kind`, which we have always populated as `event.kind: 'signal'` for alerts, and we use dot notation for that field.
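A minimal sketch of the `robustGet` lookup idea, for illustration only (`robustGetSketch` is a hypothetical simplification, not the PR's helper):

```
// Given a dot-notation key, try every split point so that intermediate keys
// containing literal dots are still found, e.g. fetching 'a.b.c' from
// { a: { 'b.c': 'my-value' } } returns 'my-value'.
const robustGetSketch = ({
  key,
  document,
}: {
  key: string;
  document: Record<string, unknown>;
}): unknown => {
  // Direct hit: the whole key exists as a literal property.
  if (key in document) {
    return document[key];
  }
  const parts = key.split('.');
  // Pop prefixes: try 'a', then 'a.b', ... as the key at this level and
  // recurse into the matching object with the remaining suffix.
  for (let i = 1; i < parts.length; i++) {
    const value = document[parts.slice(0, i).join('.')];
    if (value != null && typeof value === 'object' && !Array.isArray(value)) {
      const result = robustGetSketch({
        key: parts.slice(i).join('.'),
        document: value as Record<string, unknown>,
      });
      if (result !== undefined) {
        return result;
      }
    }
  }
  return undefined;
};
```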
Test Setup and Results

To index test documents, add the following code in `security_solution/scripts/quickstart/scratchpad.ts`. Modify the index name and `numFields` as needed for different test variations. Run the script with `node x-pack/plugins/security_solution/scripts/quickstart/run.js --kibana http://localhost:5601/<YOUR KIBANA BASE PATH>`
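```
const index = 'test-data-8k-fields';
const timestamp = new Date('2024-08-21T20:36:37.114Z');
const response = await esClient.indices.create({
  index,
  mappings: addPropertiesToMapping(
    getEcsMapping(),
    generateLargeMappingProperties({ size: 10000 })
  ),
  settings: getSettings({ maxFields: 20000 }),
});

const docs = range(10).map((idx) => {
  const doc = buildLargeDocument({
    numFields: 8000,
    fieldSize: 5,
  });
  addTimestampToDoc({
    timestamp: new Date(timestamp.getTime() + idx),
    doc,
  });
  addFieldToDoc({
    fieldName: 'host.name',
    fieldValue: 'test host',
    doc,
  });
  return doc;
});

await esClient.bulk({
  index,
  operations: docs.flatMap((doc) => [{ index: {} }, doc]),
});
```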
To index nested test documents, replace the `buildLargeDocument` call in the code above with the function call below. `levels` controls the depth of nesting, so the total number of fields will be `fieldsPerObject ^ levels` (10,000 as shown below).
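```
buildLargeNestedDocument({ fieldsPerObject: 10, levels: 4, fieldSize: 5 });
addTimestampToDoc({
  timestamp: new Date(timestamp.getTime() + idx),
  doc,
});
```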
To measure rule performance, run Kibana with `node --inspect scripts/kibana --dev` instead of the usual `yarn start`. Go to `chrome://inspect/#devices` in your browser, open Dev tools and start a performance profile capture, then run a rule preview with the following code in `scratchpad.ts` instead of the document indexing code above:
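```
const previewPromises = range(1).map(
  (idx) => () =>
    detectionsClient.rulePreview({
      body: {
        ...getBasicRuleMetadata(),
        type: 'query',
        timeframeEnd: '2024-08-21T20:37:37.114Z',
        invocationCount: 1,
        from: 'now-6m',
        interval: '5m',
        index: [index],
        query: '*',
      },
    })
);

const results = (await concurrentlyExec(previewPromises, 50)).map(
  (result) => result.data.logs
);
console.log(JSON.stringify(results));
```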
To test multiple simultaneous rule executions, change `range(1)` to `range(number of simultaneous rule executions)`, but test results below all use a single rule execution. I ran the test multiple times consecutively to verify that results were relatively consistent; times were always within about 10% of the values reported here.

8k fields

`main` - 500ms to build 10 alerts, 1300ms rule duration as reported by rule preview

`optimize-alert-creation` - 126ms to build 10 alerts, 1055ms rule duration as reported by rule preview

1k fields

`main` - 400ms to build 10 alerts, 850ms total rule duration as reported by rule preview. Yellow blocks at the bottom of the graph are GC activity - seems to be a very inefficient GC activity pattern, lots of separate collections.

`optimize-alert-creation` - 26ms to build 10 alerts, 578ms total rule duration as reported by rule preview

100 fields

`main` - 30ms to build 10 alerts

`optimize-alert-creation` - 12ms to build 10 alerts

10k fields, 4 levels of nesting (10 fields per object on each level)

`main` - 16s to build 10 alerts (!!)

`optimize-alert-creation` - 800ms to build 10 alerts (20x speedup)

1k fields, 3 levels of nesting (10 fields per object on each level)

`main` - 200ms to build 10 alerts

`optimize-alert-creation` - 80ms to build 10 alerts