Commit

[8.x] [Auto Import] Use larger number of samples on the backend (#196233) (#196386)

# Backport

This will backport the following commits from `main` to `8.x`:
- [[Auto Import] Use larger number of samples on the backend (#196233)](https://github.com/elastic/kibana/pull/196233)

<!--- Backport version: 9.4.3 -->

### Questions?
Please refer to the [Backport tool
documentation](https://github.com/sqren/backport)

## Release Notes

Automatic Import now analyses a larger number of samples to generate an integration.

## Summary

Closes https://github.com/elastic/security-team/issues/9844

**Added: Backend Sampling**

We pass 100 rows (these numeric values are adjustable) to the backend[^1].

[^1]: As before, deterministically selected on the frontend, see https://github.com//pull/191598

The Categorization chain now processes the samples in batches, performing a number of review cycles after the initial categorization (but not more than 5, tuned so that we stay under the 2-minute limit for a single API call).

To decide when to stop processing, we keep a list of _stable_ samples as follows:

1. The list is initially empty.
2. For each review we select a random subset of 40 samples, preferring the not-yet-stable samples.
3. After each review – when the LLM potentially gives us new processors or changes the old ones – we compare the new pipeline results with the old pipeline results.
4. Reviewed samples whose categorization did not change are added to the stable list.
5. Any samples whose categorization changed are removed from the stable list.
6. If all samples are stable, we finish processing.

**Removed: User Notification**

Using 100 samples strikes a balance between the expected complexity and the time budget we work with. We might change it in the future, possibly dynamically, making the specific number of no importance to the user. Thus we remove the truncation notification.

**Unchanged:**

- No batching is done in the related chain: it seems to work as-is.

**Refactored:**

- We centralize the sizing constants in the `x-pack/plugins/integration_assistant/common/constants.ts` file.
- We remove the unused state key `formattedSamples` and combine `modelJSONInput` back into `modelInput`.

> [!NOTE]
> I had difficulty generating new graph diagrams, so they remain unchanged.
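The stable-list bookkeeping in steps 1–6 above can be sketched as follows. This is an illustrative sketch, not the actual Kibana implementation: the function name, the serialized-result representation, and the return shape are assumptions.

```typescript
// Sketch of the stable-sample stopping rule described in the PR summary.
// Names and data shapes are illustrative assumptions.
function updateStableSamples(
  stable: Set<number>, // indices considered stable so far
  reviewed: number[], // indices re-checked in this review cycle
  oldResults: string[], // serialized categorization before the review
  newResults: string[] // serialized categorization after the review
): { stable: Set<number>; finished: boolean } {
  const next = new Set(stable);
  for (const i of reviewed) {
    if (oldResults[i] === newResults[i]) {
      next.add(i); // unchanged categorization: sample becomes stable
    } else {
      next.delete(i); // changed categorization: sample is no longer stable
    }
  }
  // Processing finishes once every sample is stable.
  return { stable: next, finished: next.size === newResults.length };
}
```

Note that a sample can leave the stable list again, so the loop only terminates early once an entire review pass produces no categorization changes.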

Co-authored-by: Ilya Nikokoshev <[email protected]>
kibanamachine and ilyannn authored Oct 15, 2024
1 parent 51b8359 commit a4938bc
Showing 31 changed files with 534 additions and 190 deletions.
@@ -162,7 +162,6 @@ export const testPipelineInvalidEcs: { pipelineResults: object[]; errors: object
export const categorizationTestState = {
rawSamples: ['{"test1": "test1"}'],
samples: ['{ "test1": "test1" }'],
formattedSamples: '{"test1": "test1"}',
ecsTypes: 'testtypes',
ecsCategories: 'testcategories',
exAnswer: 'testanswer',
@@ -173,9 +172,8 @@ export const categorizationTestState = {
previousError: 'testprevious',
previousInvalidCategorization: 'testinvalid',
pipelineResults: [{ test: 'testresult' }],
finalized: false,
hasTriedOnce: false,
reviewed: false,
previousPipelineResults: [{ test: 'testresult' }],
lastReviewedSamples: [],
currentPipeline: { test: 'testpipeline' },
currentProcessors: [
{
@@ -193,6 +191,9 @@ export const categorizationTestState = {
initialPipeline: categorizationInitialPipeline,
results: { test: 'testresults' },
samplesFormat: { name: SamplesFormatName.Values.json },
stableSamples: [],
reviewCount: 0,
finalized: false,
};

export const categorizationMockProcessors = [
@@ -140,7 +140,6 @@ export const testPipelineValidResult: { pipelineResults: object[]; errors: objec
export const relatedTestState = {
rawSamples: ['{"test1": "test1"}'],
samples: ['{ "test1": "test1" }'],
formattedSamples: '{"test1": "test1"}',
ecs: 'testtypes',
exAnswer: 'testanswer',
packageName: 'testpackage',
8 changes: 8 additions & 0 deletions x-pack/plugins/integration_assistant/common/constants.ts
@@ -36,3 +36,11 @@ export enum GenerationErrorCode {
UNSUPPORTED_LOG_SAMPLES_FORMAT = 'unsupported-log-samples-format',
UNPARSEABLE_CSV_DATA = 'unparseable-csv-data',
}

// Size limits
export const FRONTEND_SAMPLE_ROWS = 100;
export const LOG_FORMAT_DETECTION_SAMPLE_ROWS = 5;
export const CATEGORIZATION_INITIAL_BATCH_SIZE = 60;
export const CATEROGIZATION_REVIEW_BATCH_SIZE = 40;
export const CATEGORIZATION_REVIEW_MAX_CYCLES = 5;
export const CATEGORIZATION_RECURSION_LIMIT = 50;
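Taken together, the sizing constants above imply a simple stopping condition for the categorization review loop, which can be sketched as below (illustrative only; the real control flow lives in the categorization graph nodes):

```typescript
// Review-cycle budget from constants.ts, tuned to stay under the
// 2-minute limit for a single API call.
const CATEGORIZATION_REVIEW_MAX_CYCLES = 5;

// Sketch: stop reviewing when every sample is stable, or when the
// cycle budget is exhausted.
function shouldStopReviewing(
  stableCount: number,
  totalSamples: number,
  reviewCount: number
): boolean {
  return stableCount === totalSamples || reviewCount >= CATEGORIZATION_REVIEW_MAX_CYCLES;
}
```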
2 changes: 2 additions & 0 deletions x-pack/plugins/integration_assistant/common/index.ts
@@ -21,6 +21,8 @@ export {
} from './api/analyze_logs/analyze_logs_route.gen';
export { CelInputRequestBody, CelInputResponse } from './api/cel/cel_input_route.gen';

export { partialShuffleArray } from './utils';

export type {
DataStream,
InputType,
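The newly exported `partialShuffleArray` helper lets the frontend take a bounded sample without shuffling the whole array. One possible shape, assuming a partial Fisher–Yates pass (a sketch only; the real helper may differ, e.g. by using a seeded generator so the frontend selection is deterministic):

```typescript
// Partial Fisher–Yates shuffle (sketch): after the call, positions
// [start, end) hold a uniform random sample of the whole array, while
// positions before `start` are left untouched.
function partialShuffleArray<T>(array: T[], start: number, end: number): void {
  for (let i = start; i < end; i++) {
    // Swap in a random element from the not-yet-fixed tail.
    const j = i + Math.floor(Math.random() * (array.length - i));
    [array[i], array[j]] = [array[j], array[i]];
  }
}
```

Stopping the pass at `end` instead of the array's end is what keeps the cost proportional to the sample size rather than the file size.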
@@ -11,7 +11,6 @@ import { TestProvider } from '../../../../../mocks/test_provider';
import { parseNDJSON, parseJSONArray, SampleLogsInput } from './sample_logs_input';
import { ActionsProvider } from '../../state';
import { mockActions } from '../../mocks/state';
import { mockServices } from '../../../../../services/mocks/services';

const wrapper: React.FC<React.PropsWithChildren<{}>> = ({ children }) => (
<TestProvider>
@@ -165,25 +164,6 @@ describe('SampleLogsInput', () => {
samplesFormat: { name: 'json', json_path: [] },
});
});

describe('when the file has too many rows', () => {
const tooLargeLogsSample = Array(6).fill(logsSampleRaw).join(','); // 12 entries
beforeEach(async () => {
await changeFile(input, new File([`[${tooLargeLogsSample}]`], 'test.json', { type }));
});

it('should truncate the logs sample', () => {
expect(mockActions.setIntegrationSettings).toBeCalledWith({
logSamples: tooLargeLogsSample.split(',').slice(0, 2),
samplesFormat: { name: 'json', json_path: [] },
});
});
it('should add a notification toast', () => {
expect(mockServices.notifications.toasts.addInfo).toBeCalledWith(
`The logs sample has been truncated to 10 rows.`
);
});
});
});

describe('when the file is a json array under a key', () => {
@@ -236,25 +216,6 @@ describe('SampleLogsInput', () => {
samplesFormat: { name: 'ndjson', multiline: false },
});
});

describe('when the file has too many rows', () => {
const tooLargeLogsSample = Array(6).fill(simpleNDJSON).join('\n'); // 12 entries
beforeEach(async () => {
await changeFile(input, new File([tooLargeLogsSample], 'test.json', { type }));
});

it('should truncate the logs sample', () => {
expect(mockActions.setIntegrationSettings).toBeCalledWith({
logSamples: tooLargeLogsSample.split('\n').slice(0, 2),
samplesFormat: { name: 'ndjson', multiline: false },
});
});
it('should add a notification toast', () => {
expect(mockServices.notifications.toasts.addInfo).toBeCalledWith(
`The logs sample has been truncated to 10 rows.`
);
});
});
});

describe('when the file is a an ndjson with a single record', () => {
@@ -8,14 +8,12 @@
import React, { useCallback, useState } from 'react';
import { EuiCallOut, EuiFilePicker, EuiFormRow, EuiSpacer, EuiText } from '@elastic/eui';
import { isPlainObject } from 'lodash/fp';
import { useKibana } from '@kbn/kibana-react-plugin/public';
import type { IntegrationSettings } from '../../types';
import * as i18n from './translations';
import { useActions } from '../../state';
import type { SamplesFormat } from '../../../../../../common';
import { partialShuffleArray } from './utils';

const MaxLogsSampleRows = 10;
import { partialShuffleArray } from '../../../../../../common';
import { FRONTEND_SAMPLE_ROWS } from '../../../../../../common/constants';

/**
* Parse the logs sample file content as newline-delimited JSON (NDJSON).
@@ -83,8 +81,8 @@ export const parseJSONArray = (
* @returns Whether the array was truncated.
*/
function trimShuffleLogsSample<T>(array: T[]): boolean {
const willTruncate = array.length > MaxLogsSampleRows;
const numElements = willTruncate ? MaxLogsSampleRows : array.length;
const willTruncate = array.length > FRONTEND_SAMPLE_ROWS;
const numElements = willTruncate ? FRONTEND_SAMPLE_ROWS : array.length;

partialShuffleArray(array, 1, numElements);

@@ -215,7 +213,6 @@ interface SampleLogsInputProps {
}

export const SampleLogsInput = React.memo<SampleLogsInputProps>(({ integrationSettings }) => {
const { notifications } = useKibana().services;
const { setIntegrationSettings } = useActions();
const [isParsing, setIsParsing] = useState(false);
const [sampleFileError, setSampleFileError] = useState<string>();
@@ -266,11 +263,7 @@ export const SampleLogsInput = React.memo<SampleLogsInputProps>(({ integrationSe
return;
}

const { samplesFormat, logSamples, isTruncated } = prepareResult;

if (isTruncated) {
notifications?.toasts.addInfo(i18n.LOGS_SAMPLE_TRUNCATED(MaxLogsSampleRows));
}
const { samplesFormat, logSamples } = prepareResult;

setIntegrationSettings({
...integrationSettings,
@@ -293,7 +286,7 @@ export const SampleLogsInput = React.memo<SampleLogsInputProps>(({ integrationSe

reader.readAsText(logsSampleFile);
},
[integrationSettings, setIntegrationSettings, notifications?.toasts, setIsParsing]
[integrationSettings, setIntegrationSettings, setIsParsing]
);
return (
<EuiFormRow
@@ -110,11 +110,6 @@ export const LOGS_SAMPLE_DESCRIPTION = i18n.translate(
defaultMessage: 'Drag and drop a file or Browse files.',
}
);
export const LOGS_SAMPLE_TRUNCATED = (maxRows: number) =>
i18n.translate('xpack.integrationAssistant.step.dataStream.logsSample.truncatedWarning', {
values: { maxRows },
defaultMessage: `The logs sample has been truncated to {maxRows} rows.`,
});
export const LOGS_SAMPLE_ERROR = {
CAN_NOT_READ: i18n.translate(
'xpack.integrationAssistant.step.dataStream.logsSample.errorCanNotRead',
@@ -11,6 +11,8 @@ import { combineProcessors } from '../../util/processors';
import { CATEGORIZATION_EXAMPLE_PROCESSORS } from './constants';
import { CATEGORIZATION_MAIN_PROMPT } from './prompts';
import type { CategorizationNodeParams } from './types';
import { selectResults } from './util';
import { CATEGORIZATION_INITIAL_BATCH_SIZE } from '../../../common/constants';

export async function handleCategorization({
state,
@@ -19,8 +21,15 @@
const categorizationMainPrompt = CATEGORIZATION_MAIN_PROMPT;
const outputParser = new JsonOutputParser();
const categorizationMainGraph = categorizationMainPrompt.pipe(model).pipe(outputParser);

const [pipelineResults, _] = selectResults(
state.pipelineResults,
CATEGORIZATION_INITIAL_BATCH_SIZE,
new Set(state.stableSamples)
);

const currentProcessors = (await categorizationMainGraph.invoke({
pipeline_results: JSON.stringify(state.pipelineResults, null, 2),
pipeline_results: JSON.stringify(pipelineResults, null, 2),
example_processors: CATEGORIZATION_EXAMPLE_PROCESSORS,
ex_answer: state?.exAnswer,
ecs_categories: state?.ecsCategories,
@@ -36,7 +45,7 @@
return {
currentPipeline,
currentProcessors,
hasTriedOnce: true,
lastReviewedSamples: [],
lastExecutedChain: 'categorization',
};
}
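The `selectResults` call above picks a bounded batch of pipeline results, preferring samples that are not yet stable. A minimal sketch of that selection follows; the PR says the real helper picks a random subset, while this version takes candidates in index order for clarity, so the name and exact behavior are assumptions:

```typescript
// Hypothetical sketch of batch selection that prefers not-yet-stable
// samples; the actual selectResults in Kibana may differ (e.g. randomized).
function selectResults<T>(
  results: T[],
  batchSize: number,
  stableSamples: Set<number>
): [T[], number[]] {
  const indices = results.map((_, i) => i);
  // Unstable samples first, then stable ones as filler up to batchSize.
  const preferred = [
    ...indices.filter((i) => !stableSamples.has(i)),
    ...indices.filter((i) => stableSamples.has(i)),
  ].slice(0, batchSize);
  return [preferred.map((i) => results[i]), preferred];
}
```

Returning the chosen indices alongside the values lets the caller later compare old and new pipeline results for exactly the reviewed samples.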
@@ -4,6 +4,7 @@
* 2.0; you may not use this file except in compliance with the Elastic License
* 2.0.
*/

export const ECS_CATEGORIES = {
api: 'Covers events from API calls, including those from OS and network protocols. Allowed event.type combinations: access, admin, allowed, change, creation, deletion, denied, end, info, start, user',
authentication:
Expand Down
@@ -39,7 +39,6 @@ export async function handleErrors({
return {
currentPipeline,
currentProcessors,
reviewed: false,
lastExecutedChain: 'error',
};
}
@@ -25,6 +25,7 @@ import { handleReview } from './review';
import { handleCategorization } from './categorization';
import { handleErrors } from './errors';
import { handleInvalidCategorization } from './invalid';
import { handleUpdateStableSamples } from './stable';
import { testPipeline, combineProcessors } from '../../util';
import {
ActionsClientChatOpenAI,
Expand All @@ -39,6 +40,7 @@ jest.mock('./errors');
jest.mock('./review');
jest.mock('./categorization');
jest.mock('./invalid');
jest.mock('./stable');

jest.mock('../../util/pipeline', () => ({
testPipeline: jest.fn(),
@@ -74,7 +76,8 @@ describe('runCategorizationGraph', () => {
return {
currentPipeline,
currentProcessors,
reviewed: false,
stableSamples: [],
reviewCount: 0,
finalized: false,
lastExecutedChain: 'categorization',
};
@@ -90,7 +93,8 @@
return {
currentPipeline,
currentProcessors,
reviewed: false,
stableSamples: [],
reviewCount: 0,
finalized: false,
lastExecutedChain: 'error',
};
@@ -106,7 +110,8 @@
return {
currentPipeline,
currentProcessors,
reviewed: false,
stableSamples: [],
reviewCount: 0,
finalized: false,
lastExecutedChain: 'invalidCategorization',
};
@@ -122,11 +127,29 @@
return {
currentProcessors,
currentPipeline,
reviewed: true,
stableSamples: [],
reviewCount: 0,
finalized: false,
lastExecutedChain: 'review',
};
});
// After the review it should route to modelOutput and finish.
(handleUpdateStableSamples as jest.Mock)
.mockResolvedValueOnce({
stableSamples: [],
finalized: false,
lastExecutedChain: 'handleUpdateStableSamples',
})
.mockResolvedValueOnce({
stableSamples: [],
finalized: false,
lastExecutedChain: 'handleUpdateStableSamples',
})
.mockResolvedValueOnce({
stableSamples: [0],
finalized: false,
lastExecutedChain: 'handleUpdateStableSamples',
});
});

it('Ensures that the graph compiles', async () => {
