Skip to content

Commit

Permalink
[Security Solution][Detection Engine] Reduce CPU usage when creating …
Browse files Browse the repository at this point in the history
…alerts from source documents (#192287)

## Summary

Parent issue: elastic/security-team#10106

tl;dr: This PR reduces the CPU time used by Kibana to transform source
documents into alerts by ~2x for relatively small docs and up to 20x in
some pathological cases.

| Test Case | Time on `main` | Time on PR branch | Speedup |
| ---- | ---- | ---- | ---- |
| 8k fields | 500ms | 126ms | 4x |
| 1k fields | 400ms | 26ms | 16x |
| 100 fields | 30ms | 12ms | 2.5x |
| 10k fields nested | 16s | 800ms | 20x |
| 1k fields nested | 200ms | 80ms | 2.5x |

Detection rules generally aim to keep computationally expensive tasks in
Elasticsearch since Kibana's NodeJS architecture is not well suited to
CPU bound logic. However, we are (for most rule types) currently unable
to avoid loading full documents from source indices in ES into Kibana,
transforming them into alerts, and sending those transformed documents
back to Elasticsearch for indexing into the alerts index. When the
source documents contain many fields, this flow is expensive and
contributes to event loop blockages.

This PR refactors and optimizes the alert creation process to simplify
the code and reduce the CPU burden. Optimizations include:
- Mutating the source document object into an alert instead of copying
it. The elasticsearch-js client already creates an object in memory for
the entire source document in the query response, so we can simply use
that instead of making another copy.
- Use `Object.keys` to iterate over all properties of the object,
fetching each property individually, instead of `Object.entries`.
- Traverse the document only once - previously we iterated over at least
the top level keys 4 times, in `filterSource`, `stripNonEcsFields`,
`additionalAlertFields`, and `buildFieldsKeyAsArrayMap`. Now
`traverseDoc` handles 3 of those cases together in one pass, and we
avoid needing `buildFieldsKeyAsArrayMap` by fetching values on demand
with any mix of dot and nested notation.
- `buildFieldsKeyAsArrayMap` turns a key in dot notation into an array
the represents the nested notation path to the object, accounting for
any intermediate keys that might have dots in them. However, building
this map for an entire large object is expensive, and I found that it's
significantly faster for us to just check for potential intermediate
dot-notation keys when fetching instead of precomputing the map.
- Happy path optimization: for `mergeMissingFieldsWithSource` (the
default merge option) typically the vast majority of fields will already
be in `_source` and we can skip the rest of the logic for merging that
field from `fields` into `_source`. Now we'll check if a field key is
already present in `_source` as the very first operation when merging
fields into `_source` and skip to the next field if it is.
- `filterFieldEntry` - `isIgnored` splits ignored fields into a map of
ordinary field names and an array of regular expressions so we can more
efficiently check if a field key is on the ignore list. The ignore list
includes ~80 field names from the legacy signal mappings, plus any
user-added ignore fields. This function is also called much less often
now that we only call `filterFieldEntry` **after** checking to ensure
that the field key is not already in `_source`.

Supporting changes (necessary, but not optimizations on their own):
- `robustGet`, `robustSet`, `robustPathIsValid` - "robust" functions in
the sense that they can handle any mix of dot and nested notation.
Whereas lodash `get` would not find `my-value` in `{ a: { 'b.c':
'my-value' }}` given the key `a.b.c`, `robustGet` will find it by
popping prefixes from the key and checking each combination until it
find the value or verifies that there is no valid path matching the key.
- Empty arrays are now treated as "no value" when creating alerts in
order to make the logic more consistent. All empty arrays are silently
(not reported in "removed fields") removed from alert documents. This
resolves an inconsistency where an empty array in an incompatible field
(e.g. `agent.name.text: []`) would be reported as a removed empty array,
whereas an array with values in an incompatible field (e.g.
`agent.name.text: ['agent1', 'agent2']`) would report two individual
values removed from `agent.name.text` but not report the array as
removed, even though the whole field (array included) is gone.
  - The new behavior would be:
- `agent.name.text: []` -> completely removed without reporting the
array removed, instead of removed with reporting the array removed
- `agent.name.text: ['agent1', 'agent2']` -> same as before: completely
removed, reporting `agent1` and `agent2` removed, not reporting the
array removed
  - See tests in `strip_non_ecs_fields.test.ts` for examples
- `event.*` fields are copied from source docs to alerts with the
original notation instead of transforming to dot notation. This is more
in line with how we preserve the notation for the rest of the fields
from `source`. The exception is still `event.kind`, which we have always
populated as `event.kind: 'signal'` for alerts and we use dot notation
for that field.

## Test Setup and Results

To index test documents, add the following code in
`security_solution/scripts/quickstart/scratchpad.ts`. Modify the index
name and `numFields` as needed for different test variations. Run the
script with `node
x-pack/plugins/security_solution/scripts/quickstart/run.js --kibana
http://localhost:5601/<YOUR KIBANA BASE PATH>`
```
const index = 'test-data-8k-fields';
      const timestamp = new Date('2024-08-21T20:36:37.114Z');
      const response = await esClient.indices.create({
        index,
        mappings: addPropertiesToMapping(
          getEcsMapping(),
          generateLargeMappingProperties({ size: 10000 })
        ),
        settings: getSettings({ maxFields: 20000 }),
      });

      const docs = range(10).map((idx) => {
        const doc = buildLargeDocument({
          numFields: 8000,
          fieldSize: 5,
        });
        addTimestampToDoc({
          timestamp: new Date(timestamp.getTime() + idx),
          doc,
        });
        addFieldToDoc({
          fieldName: 'host.name',
          fieldValue: 'test host',
          doc,
        });
        return doc;
      });

      await esClient.bulk({
        index,
        operations: docs.flatMap((doc) => [{ index: {} }, doc]),
      });
```

To index nested test documents, replace the `buildLargeDocument` call in
the code above with the function call below. `levels` controls the depth
of nesting, so the total number of fields will be `fieldsPerObject ^
levels` (10,000 as shown below).
```
buildLargeNestedDocument({ fieldsPerObject: 10, levels: 4, fieldSize: 5 });
        addTimestampToDoc({
          timestamp: new Date(timestamp.getTime() + idx),
          doc,
        });
```

To measure rule performance, run Kibana with `node --inspect
scripts/kibana --dev` instead of the usual `yarn start`. Go to
`chrome://inspect/#devices` in your browser, open Dev tools and start a
performance profile capture, then run a rule preview with the following
code in `scratchpad.ts` instead of the document indexing code above:
```
const previewPromises = range(1).map(
        (idx) => () =>
          detectionsClient.rulePreview({
            body: {
              ...getBasicRuleMetadata(),
              type: 'query',
              timeframeEnd: '2024-08-21T20:37:37.114Z',
              invocationCount: 1,
              from: 'now-6m',
              interval: '5m',
              index: [index],
              query: '*',
            },
          })
      );

      const results = (await concurrentlyExec(previewPromises, 50)).map(
        (result) => result.data.logs
      );
      console.log(JSON.stringify(results));
```
To test multiple simultaneous rule executions, change `range(1)` to
`range(number of simultaneous rule executions)`, but test results below
all use a single rule execution. I ran the test multiple times
consecutively to verify that results were relatively consistent, times
were always within about 10% of the values reported here.

### 8k fields
`main` - 500ms to build 10 alerts, 1300ms rule duration as reported by
rule preview
<img width="1558" alt="Screenshot 2024-09-11 at 9 20 57 AM"
src="https://github.com/user-attachments/assets/2a72e0d3-74cc-41c4-b51d-80e4ca058978">

`optimize-alert-creation` - 126ms to build 10 alerts, 1055ms rule
duration as reported by rule preview
<img width="1550" alt="Screenshot 2024-09-11 at 2 33 24 PM"
src="https://github.com/user-attachments/assets/c1217a47-872b-4291-a56e-ab48aead21a6">

### 1k fields
`main` - 400ms to build 10 alerts, 850ms total rule duration as reported
by rule preview
<img width="1552" alt="Screenshot 2024-09-11 at 12 20 45 PM"
src="https://github.com/user-attachments/assets/470c1b05-672b-4e0c-9c77-2e3a82e9e6d6">
Yellow blocks at the bottom of the graph are GC activity - seems to be a
very inefficient GC activity pattern, lots of separate collections

`optimize-alert-creation` - 26ms to build 10 alerts, 578ms total rule
duration as reported by rule preview
<img width="1550" alt="Screenshot 2024-09-11 at 2 28 57 PM"
src="https://github.com/user-attachments/assets/76523362-b3dd-40cb-92e6-1dcaeabffef1">

### 100 fields
`main` - 30ms to build 10 alerts
<img width="1554" alt="Screenshot 2024-09-11 at 12 24 00 PM"
src="https://github.com/user-attachments/assets/dde0a8ce-11bd-44e9-b441-c0b2cdf2fe1b">

`optimize-alert-creation` - 12ms to build 10 alerts
<img width="1551" alt="Screenshot 2024-09-11 at 2 27 26 PM"
src="https://github.com/user-attachments/assets/abbb561a-9222-458a-932c-cd233ce4163d">

### 10k fields, 4 levels of nesting (10 fields per object on each level)

`main` - **16s to build 10 alerts** (!!)
<img width="1554" alt="Screenshot 2024-09-11 at 12 24 00 PM"
src="https://github.com/user-attachments/assets/eee1acbe-9d82-46ce-85dc-b106d33b599e">

`optimize-alert-creation` - 800ms to build 10 alerts (20x speedup)
<img width="1552" alt="Screenshot 2024-09-11 at 1 01 12 PM"
src="https://github.com/user-attachments/assets/f8a3e655-8628-4c37-bb62-9ad246c14393">

### 1k fields, 3 levels of nesting (10 fields per object on each level)
`main` - 200ms to build 10 alerts
<img width="1550" alt="Screenshot 2024-09-11 at 12 42 55 PM"
src="https://github.com/user-attachments/assets/2530b71d-39be-4575-83e8-0de3977dc71c">

`optimize-alert-creation` - 80ms to build 10 alerts
<img width="1553" alt="Screenshot 2024-09-11 at 2 24 15 PM"
src="https://github.com/user-attachments/assets/1c4ffd36-97d5-49ac-ba48-0db05287455a">

---------

Co-authored-by: Vitalii Dmyterko <[email protected]>
(cherry picked from commit a0421ab)
  • Loading branch information
marshallmain committed Sep 18, 2024
1 parent c8043c9 commit 4be1e88
Show file tree
Hide file tree
Showing 57 changed files with 3,095 additions and 1,761 deletions.
Original file line number Diff line number Diff line change
Expand Up @@ -29,6 +29,30 @@ export const buildLargeDocument = ({
return doc;
};

export const buildLargeNestedDocument = ({
fieldsPerObject,
levels,
fieldSize,
}: {
fieldsPerObject: number;
levels: number;
fieldSize: number;
}): Record<string, unknown> => {
if (levels === 1) {
return buildLargeDocument({ numFields: fieldsPerObject, fieldSize });
} else {
const doc: Record<string, unknown> = {};
range(fieldsPerObject).forEach((idx) => {
doc[`level_${levels}_field${idx}`] = buildLargeNestedDocument({
fieldsPerObject,
levels: levels - 1,
fieldSize,
});
});
return doc;
}
};

export const addTimestampToDoc = ({
timestamp = new Date(),
doc,
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -14,7 +14,7 @@ import type {
import { mappingFromFieldMap } from '@kbn/alerting-plugin/common';
import { ecsFieldMap } from '@kbn/alerts-as-data-utils';

export const getEcsMapping = () => mappingFromFieldMap(ecsFieldMap);
export const getEcsMapping = () => mappingFromFieldMap(ecsFieldMap, false);

export interface GenerateLargeMappingPropertiesProps {
size: number;
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -5,7 +5,7 @@
* 2.0.
*/

import { isEmpty } from 'lodash';
import { isEmpty, partition } from 'lodash';
import agent from 'elastic-apm-node';

import type * as estypes from '@elastic/elasticsearch/lib/api/typesWithBodyKey';
Expand Down Expand Up @@ -375,8 +375,17 @@ export const createSecurityRuleTypeWrapper: CreateSecurityRuleTypeWrapper =
);

const legacySignalFields: string[] = Object.keys(aadFieldConversion);
const [ignoreFieldsRegexes, ignoreFieldsStandard] = partition(
[...ignoreFields, ...legacySignalFields],
(field: string) => field.startsWith('/') && field.endsWith('/')
);
const ignoreFieldsObject: Record<string, boolean> = {};
ignoreFieldsStandard.forEach((field) => {
ignoreFieldsObject[field] = true;
});
const wrapHits = wrapHitsFactory({
ignoreFields: [...ignoreFields, ...legacySignalFields],
ignoreFields: ignoreFieldsObject,
ignoreFieldsRegexes,
mergeStrategy,
completeRule,
spaceId,
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -12,8 +12,8 @@ import { getAlertDetailsUrl } from '../../../../../common/utils/alert_detail_pat
import { DEFAULT_ALERTS_INDEX } from '../../../../../common/constants';
import type { ConfigType } from '../../../../config';
import type { Ancestor, SignalSource, SignalSourceHit } from '../types';
import { buildAlert, buildAncestors, generateAlertId } from '../factories/utils/build_alert';
import { buildBulkBody } from '../factories/utils/build_bulk_body';
import { buildAlertFields, buildAncestors, generateAlertId } from '../factories/utils/build_alert';
import { transformHitToAlert } from '../factories/utils/transform_hit_to_alert';
import type { EqlSequence } from '../../../../../common/detection_engine/types';
import { generateBuildingBlockIds } from '../factories/utils/generate_building_block_ids';
import type { BuildReasonMessage } from '../utils/reason_formatters';
Expand Down Expand Up @@ -59,20 +59,21 @@ export const buildAlertGroupFromSequence = (
let baseAlerts: BaseFieldsLatest[] = [];
try {
baseAlerts = sequence.events.map((event) =>
buildBulkBody(
transformHitToAlert({
spaceId,
completeRule,
event,
doc: event,
mergeStrategy,
[],
false,
ignoreFields: {},
ignoreFieldsRegexes: [],
applyOverrides: false,
buildReasonMessage,
indicesToQuery,
alertTimestampOverride,
ruleExecutionLogger,
'placeholder-alert-uuid', // This is overriden below
publicBaseUrl
)
alertUuid: 'placeholder-alert-uuid', // This is overriden below
publicBaseUrl,
})
);
} catch (error) {
ruleExecutionLogger.error(error);
Expand Down Expand Up @@ -153,16 +154,16 @@ export const buildAlertRoot = (
severity: completeRule.ruleParams.severity,
mergedDoc: mergedAlerts as SignalSourceHit,
});
const doc = buildAlert(
wrappedBuildingBlocks,
const doc = buildAlertFields({
docs: wrappedBuildingBlocks,
completeRule,
spaceId,
reason,
indicesToQuery,
'placeholder-uuid', // These will be overriden below
alertUuid: 'placeholder-uuid', // These will be overriden below
publicBaseUrl, // Not necessary now, but when the ID is created ahead of time this can be passed
alertTimestampOverride
);
alertTimestampOverride,
});
const alertId = generateAlertId(doc);
const alertUrl = getAlertDetailsUrl({
alertId,
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -16,7 +16,7 @@ import type { ConfigType } from '../../../../config';
import type { CompleteRule, EsqlRuleParams } from '../../rule_schema';
import { buildReasonMessageForNewTermsAlert } from '../utils/reason_formatters';
import type { IRuleExecutionLogForExecutors } from '../../rule_monitoring';
import { buildBulkBody } from '../factories/utils/build_bulk_body';
import { transformHitToAlert } from '../factories/utils/transform_hit_to_alert';
import type { SignalSource } from '../types';
import { generateAlertId } from './utils';

Expand Down Expand Up @@ -55,20 +55,21 @@ export const wrapEsqlAlerts = ({
index: i,
});

const baseAlert: BaseFieldsLatest = buildBulkBody(
const baseAlert: BaseFieldsLatest = transformHitToAlert({
spaceId,
completeRule,
event,
doc: event,
mergeStrategy,
[],
true,
buildReasonMessageForNewTermsAlert,
[],
ignoreFields: {},
ignoreFieldsRegexes: [],
applyOverrides: true,
buildReasonMessage: buildReasonMessageForNewTermsAlert,
indicesToQuery: [],
alertTimestampOverride,
ruleExecutionLogger,
id,
publicBaseUrl
);
alertUuid: id,
publicBaseUrl,
});

return {
_id: id,
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -19,7 +19,7 @@ import type { ConfigType } from '../../../../config';
import type { CompleteRule, EsqlRuleParams } from '../../rule_schema';
import { buildReasonMessageForNewTermsAlert } from '../utils/reason_formatters';
import type { IRuleExecutionLogForExecutors } from '../../rule_monitoring';
import { buildBulkBody } from '../factories/utils/build_bulk_body';
import { transformHitToAlert } from '../factories/utils/transform_hit_to_alert';
import type { SignalSource } from '../types';
import { getSuppressionAlertFields, getSuppressionTerms } from '../utils';
import { generateAlertId } from './utils';
Expand Down Expand Up @@ -73,20 +73,21 @@ export const wrapSuppressedEsqlAlerts = ({

const instanceId = objectHash([suppressionTerms, completeRule.alertId, spaceId]);

const baseAlert: BaseFieldsLatest = buildBulkBody(
const baseAlert: BaseFieldsLatest = transformHitToAlert({
spaceId,
completeRule,
event,
doc: event,
mergeStrategy,
[],
true,
buildReasonMessageForNewTermsAlert,
[],
ignoreFields: {},
ignoreFieldsRegexes: [],
applyOverrides: true,
buildReasonMessage: buildReasonMessageForNewTermsAlert,
indicesToQuery: [],
alertTimestampOverride,
ruleExecutionLogger,
id,
publicBaseUrl
);
alertUuid: id,
publicBaseUrl,
});

return {
_id: id,
Expand Down
Loading

0 comments on commit 4be1e88

Please sign in to comment.