[Auto Import] Use larger number of samples on the backend #196233
Conversation
Pinging @elastic/security-scalability (Team:Security-Scalability)
Tested locally. Looks good overall.
```
@@ -7,11 +7,11 @@
import React from 'react';
import { act, fireEvent, render, waitFor, type RenderResult } from '@testing-library/react';
import '@testing-library/jest-dom';
```
Why this?
```
  )
);

newStableSamples.sort();
```
Why sort? To persist order?
Honestly to make this readable when debugging.
But the numbers are also converted to strings, so the sort order comes out like 1,10,100,2,21,22,23,24,3,30 etc., which is not really readable in this case.
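For reference, a minimal TypeScript illustration of that behavior (the default `Array.prototype.sort` compares elements as strings; a numeric comparator restores numeric order):

```ts
const indices = [2, 10, 1, 30, 3, 100, 21];

// Default sort stringifies elements, giving lexicographic order.
console.log([...indices].sort());
// -> [1, 10, 100, 2, 21, 3, 30]

// A numeric comparator would have produced a readable order.
console.log([...indices].sort((a, b) => a - b));
// -> [1, 2, 3, 10, 21, 30, 100]
```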
Well it was better than nothing 😄 Anyway, agreed, I've removed it.
We probably need to add chunking to the related graph too. The run here seems to pick up all the pipeline_results to find one related field. We can reduce the number of tokens passed here.
This will only reduce the number of tokens if we do not need new back-and-forth cycles. In the case you linked we know there was a single related field, but if we implement the algorithm to always pass 20 samples and extrapolate the results to all 100 samples, there will likely be integrations where we miss the related fields in these 20 samples. Then the cost will be at least:

which would be much larger than the current costs. The current way also uses fewer tokens than ECS Mapping and Categorization. I do agree we can think about reducing the number of tokens, but I think a much better way is to include additional information when doing ECS Mapping. We can just ask during that mapping whether the field is likely to contain an IP, host, or user name and prune out all the other fields.
Sure. We can merge this PR now. Tested the flows locally and everything seems to be working fine.
We can experiment with the related graph in a different PR if you wish.
Starting backport for target branches: 8.x
💚 Build Succeeded
cc @ilyannn
…6233) (cherry picked from commit fc3ce54)
💚 All backports created successfully
Note: Successful backport PRs will be merged automatically after passing CI. Questions? Please refer to the Backport tool documentation
) (#196386)

# Backport

This will backport the following commits from `main` to `8.x`:
- [[Auto Import] Use larger number of samples on the backend (#196233)](#196233)

### Questions?
Please refer to the [Backport tool documentation](https://github.com/sqren/backport)

Co-authored-by: Ilya Nikokoshev <[email protected]>
Release Notes
Automatic Import now analyses a larger number of samples to generate an integration.
Summary
Closes https://github.com/elastic/security-team/issues/9844
Important
This PR also contains the functionality of #196228 and #196207; those should be merged before this one.
Added: Backend Sampling
We pass 100 rows (these numeric values are adjustable) to the backend [^1]
The Categorization chain now processes the samples in batches, performing a number of review cycles after the initial categorization (at most 5, tuned so that we stay under the 2-minute limit for a single API call).
To decide when to stop processing we keep a list of stable samples, as follows (a sketch of this loop follows the list):

1. The list is initially empty.
2. For each review we select a random subset of 40 samples, preferring to pick the not-stable samples.
3. After each review, when the LLM potentially gives us new processors or changes the old ones, we compare the new pipeline results with the old pipeline results.
4. Those reviewed samples that did not change their categorization are added to the stable list.
5. Any samples that have changed their categorization are removed from the stable list.
6. If all samples are stable, we finish processing.
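For illustration, a minimal TypeScript sketch of this loop. The type names, the helper signatures (`reviewBatch`, `categorize`), and the deterministic batch selection are assumptions for readability, not the actual Kibana code; the real implementation picks the review subset at random.

```ts
type Pipeline = object;

interface ReviewDeps {
  // Runs one LLM review over a batch of samples, returning the updated pipeline.
  reviewBatch(pipeline: Pipeline, batch: string[]): Promise<Pipeline>;
  // Runs the pipeline and returns a per-sample categorization fingerprint.
  categorize(pipeline: Pipeline, samples: string[]): Promise<Map<string, string>>;
}

const MAX_REVIEW_CYCLES = 5; // stays under the 2-minute limit for a single API call
const REVIEW_BATCH_SIZE = 40;

async function reviewUntilStable(
  pipeline: Pipeline,
  samples: string[],
  deps: ReviewDeps
): Promise<Pipeline> {
  const stable = new Set<string>(); // step 1: initially empty
  let previous = await deps.categorize(pipeline, samples);

  for (let cycle = 0; cycle < MAX_REVIEW_CYCLES; cycle++) {
    // Step 2: fill the batch, preferring not-yet-stable samples
    // (simplified here to a deterministic ordering instead of a random pick).
    const unstable = samples.filter((s) => !stable.has(s));
    const alreadyStable = samples.filter((s) => stable.has(s));
    const batch = [...unstable, ...alreadyStable].slice(0, REVIEW_BATCH_SIZE);

    pipeline = await deps.reviewBatch(pipeline, batch);

    // Step 3: compare new pipeline results with the old ones.
    const current = await deps.categorize(pipeline, samples);
    for (const sample of samples) {
      if (current.get(sample) === previous.get(sample)) {
        // Step 4: reviewed samples with unchanged categorization become stable.
        if (batch.includes(sample)) stable.add(sample);
      } else {
        // Step 5: any changed sample loses its stable status.
        stable.delete(sample);
      }
    }
    previous = current;

    // Step 6: stop once every sample is stable.
    if (stable.size === samples.length) break;
  }
  return pipeline;
}
```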
Removed: User Notification
Using 100 samples strikes a balance between the expected complexity and the time budget we work with. We might want to change this number in the future, possibly dynamically, making the specific value of no importance to the user. Thus we remove the truncation notification.
Unchanged:
- No batching is made in the related chain: it seems to work as-is.
Refactored:
- We centralize the sizing constants in the `x-pack/plugins/integration_assistant/common/constants.ts` file (an illustrative sketch follows).
- We remove the unused state key `formattedSamples` and combine `modelJSONInput` back into `modelInput`.
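For illustration only, the centralized file might export constants along these lines; the names here are hypothetical, while the values are the ones quoted in this PR description:

```ts
// x-pack/plugins/integration_assistant/common/constants.ts
// (illustrative sketch; the actual exported names may differ)
export const FRONTEND_SAMPLE_ROWS = 100; // rows passed from the frontend to the backend
export const CATEGORIZATION_REVIEW_BATCH_SIZE = 40; // samples per review cycle
export const CATEGORIZATION_REVIEW_MAX_CYCLES = 5; // keeps one API call under the 2-minute limit
```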
Note
I had difficulty generating new graph diagrams, so they remain unchanged.
Testing
Postgres
25 samples, 1 review cycle, 50s for categorization: ai_postgres_202410150832-1.0.0.zip
(generated ingest pipeline)
```yaml
---
description: Pipeline to process ai_postgres_202410150832 audit logs
processors:
  - set:
      tag: set_ecs_version
      field: ecs.version
      value: 8.11.0
  - set:
      tag: copy_original_message
      field: originalMessage
      copy_from: message
  - csv:
      tag: parse_csv
      field: message
      target_fields:
        - ai_postgres_202410150832.audit.timestamp
        - ai_postgres_202410150832.audit.database
        - ai_postgres_202410150832.audit.user
        - ai_postgres_202410150832.audit.process_id
        - ai_postgres_202410150832.audit.client_address
        - ai_postgres_202410150832.audit.session_id
        - ai_postgres_202410150832.audit.line_num
        - ai_postgres_202410150832.audit.command_tag
        - ai_postgres_202410150832.audit.session_start_time
        - ai_postgres_202410150832.audit.virtual_transaction_id
        - ai_postgres_202410150832.audit.transaction_id
        - ai_postgres_202410150832.audit.error_severity
        - ai_postgres_202410150832.audit.sql_state_code
        - ai_postgres_202410150832.audit.message
        - ai_postgres_202410150832.audit.column15
        - ai_postgres_202410150832.audit.column16
        - ai_postgres_202410150832.audit.column17
        - ai_postgres_202410150832.audit.column18
        - ai_postgres_202410150832.audit.column19
        - ai_postgres_202410150832.audit.column20
        - ai_postgres_202410150832.audit.column21
        - ai_postgres_202410150832.audit.application_name
        - ai_postgres_202410150832.audit.backend_type
        - ai_postgres_202410150832.audit.column24
      description: Parse CSV input
  - rename:
      ignore_missing: true
      if: ctx.event?.original == null
      tag: rename_message
      field: originalMessage
      target_field: event.original
  - remove:
      ignore_missing: true
      if: ctx.event?.original != null
      tag: remove_copied_message
      field: originalMessage
  - remove:
      ignore_missing: true
      tag: remove_message
      field: message
  - rename:
      ignore_missing: true
      field: ai_postgres_202410150832.audit.transaction_id
      target_field: transaction.id
  - convert:
      ignore_failure: true
      ignore_missing: true
      field: ai_postgres_202410150832.audit.process_id
      target_field: process.pid
      type: long
  - rename:
      ignore_missing: true
      field: ai_postgres_202410150832.audit.error_severity
      target_field: log.level
  - script:
      tag: script_convert_array_to_string
      description: Ensures the date processor does not receive an array value.
      lang: painless
      source: |
        if (ctx.ai_postgres_202410150832?.audit?.session_start_time != null &&
            ctx.ai_postgres_202410150832.audit.session_start_time instanceof ArrayList){
          ctx.ai_postgres_202410150832.audit.session_start_time = ctx.ai_postgres_202410150832.audit.session_start_time[0];
        }
  - date:
      if: ctx.ai_postgres_202410150832?.audit?.session_start_time != null
      tag: date_processor_ai_postgres_202410150832.audit.session_start_time
      field: ai_postgres_202410150832.audit.session_start_time
      target_field: event.start
      formats:
        - yyyy-MM-dd HH:mm:ss z
  - rename:
      ignore_missing: true
      field: ai_postgres_202410150832.audit.message
      target_field: message
  - script:
      tag: script_convert_array_to_string
      description: Ensures the date processor does not receive an array value.
      lang: painless
      source: |
        if (ctx.ai_postgres_202410150832?.audit?.timestamp != null &&
            ctx.ai_postgres_202410150832.audit.timestamp instanceof ArrayList){
          ctx.ai_postgres_202410150832.audit.timestamp = ctx.ai_postgres_202410150832.audit.timestamp[0];
        }
  - date:
      if: ctx.ai_postgres_202410150832?.audit?.timestamp != null
      tag: date_processor_ai_postgres_202410150832.audit.timestamp
      field: ai_postgres_202410150832.audit.timestamp
      target_field: '@timestamp'
      formats:
        - yyyy-MM-dd HH:mm:ss.SSS z
  - rename:
      ignore_missing: true
      field: ai_postgres_202410150832.audit.database
      target_field: destination.domain
  - rename:
      ignore_missing: true
      field: ai_postgres_202410150832.audit.client_address
      target_field: source.address
  - rename:
      ignore_missing: true
      field: ai_postgres_202410150832.audit.user
      target_field: user.name
  - script:
      tag: script_drop_null_empty_values
      description: Drops null/empty values recursively.
      lang: painless
      source: |
        boolean dropEmptyFields(Object object) {
          if (object == null || object == "") {
            return true;
          } else if (object instanceof Map) {
            ((Map) object).values().removeIf(value -> dropEmptyFields(value));
            return (((Map) object).size() == 0);
          } else if (object instanceof List) {
            ((List) object).removeIf(value -> dropEmptyFields(value));
            return (((List) object).length == 0);
          }
          return false;
        }
        dropEmptyFields(ctx);
  - geoip:
      ignore_missing: true
      tag: geoip_source_ip
      field: source.ip
      target_field: source.geo
  - geoip:
      ignore_missing: true
      tag: geoip_source_asn
      database_file: GeoLite2-ASN.mmdb
      field: source.ip
      target_field: source.as
      properties:
        - asn
        - organization_name
  - rename:
      ignore_missing: true
      tag: rename_source_as_asn
      field: source.as.asn
      target_field: source.as.number
  - rename:
      ignore_missing: true
      tag: rename_source_as_organization_name
      field: source.as.organization_name
      target_field: source.as.organization.name
  - geoip:
      ignore_missing: true
      tag: geoip_destination_ip
      field: destination.ip
      target_field: destination.geo
  - geoip:
      ignore_missing: true
      tag: geoip_destination_asn
      database_file: GeoLite2-ASN.mmdb
      field: destination.ip
      target_field: destination.as
      properties:
        - asn
        - organization_name
  - rename:
      ignore_missing: true
      tag: rename_destination_as_asn
      field: destination.as.asn
      target_field: destination.as.number
  - rename:
      ignore_missing: true
      tag: rename_destination_as_organization_name
      field: destination.as.organization_name
      target_field: destination.as.organization.name
  - append:
      if: ctx.ai_postgres_202410150832?.audit?.column24 == 'checkpointer'
      field: event.category
      value:
        - database
      allow_duplicates: false
  - append:
      if: ctx.ai_postgres_202410150832?.audit?.column24 == 'checkpointer'
      field: event.type
      value:
        - info
      allow_duplicates: false
  - append:
      if: ctx.ai_postgres_202410150832?.audit?.column24 == 'client backend'
      field: event.category
      value:
        - database
      allow_duplicates: false
  - append:
      if: ctx.ai_postgres_202410150832?.audit?.column24 == 'client backend'
      field: event.type
      value:
        - access
      allow_duplicates: false
  - append:
      if: ctx.ai_postgres_202410150832?.audit?.column24 == 'postmaster'
      field: event.category
      value:
        - database
      allow_duplicates: false
  - append:
      if: ctx.message?.contains('starting PostgreSQL')
      field: event.type
      value:
        - info
      allow_duplicates: false
  - append:
      if: >-
        ctx.ai_postgres_202410150832?.audit?.column24 == 'postmaster' &&
        !ctx.message?.contains('starting PostgreSQL')
      field: event.type
      value:
        - info
      allow_duplicates: false
  - append:
      if: ctx.ai_postgres_202410150832?.audit?.column24 == 'startup'
      field: event.category
      value:
        - database
      allow_duplicates: false
  - append:
      if: ctx.ai_postgres_202410150832?.audit?.column24 == 'startup'
      field: event.type
      value:
        - info
      allow_duplicates: false
  - append:
      if: ctx.ai_postgres_202410150832?.audit?.command_tag == 'authentication'
      field: event.category
      value:
        - database
      allow_duplicates: false
  - append:
      if: ctx.ai_postgres_202410150832?.audit?.command_tag == 'authentication'
      field: event.type
      value:
        - info
      allow_duplicates: false
  - append:
      if: ctx.message?.contains('connection received:')
      field: event.category
      value:
        - database
      allow_duplicates: false
  - append:
      if: ctx.message?.contains('connection received:')
      field: event.type
      value:
        - info
      allow_duplicates: false
  - append:
      if: ctx.message?.contains('disconnection:')
      field: event.category
      value:
        - database
      allow_duplicates: false
  - append:
      if: ctx.message?.contains('disconnection:')
      field: event.type
      value:
        - info
      allow_duplicates: false
  - append:
      if: >-
        ctx.message?.contains('parameter') &&
        ctx.message?.contains('changed to')
      field: event.category
      value:
        - configuration
      allow_duplicates: false
  - append:
      if: >-
        ctx.message?.contains('parameter') &&
        ctx.message?.contains('changed to')
      field: event.type
      value:
        - change
      allow_duplicates: false
  - append:
      if: ctx.ai_postgres_202410150832?.audit?.column24 == 'not initialized'
      field: event.type
      value:
        - info
      allow_duplicates: false
  - append:
      if: ctx.ai_postgres_202410150832?.audit?.command_tag == 'authentication'
      field: event.type
      value:
        - access
        - info
      allow_duplicates: false
  - append:
      if: ctx.source?.address != null
      field: related.ip
      value: '{{{source.address}}}'
      allow_duplicates: false
  - append:
      if: ctx.user?.name != null
      field: related.user
      value: '{{{user.name}}}'
      allow_duplicates: false
  - append:
      if: ctx.destination?.domain != null
      field: related.hosts
      value: '{{{destination.domain}}}'
      allow_duplicates: false
  - remove:
      ignore_missing: true
      tag: remove_fields
      field:
        - ai_postgres_202410150832.audit.process_id
  - remove:
      ignore_failure: true
      ignore_missing: true
      if: ctx?.tags == null || !(ctx.tags.contains("preserve_original_event"))
      tag: remove_original_event
      field: event.original
on_failure:
  - append:
      field: error.message
      value: >-
        Processor {{{_ingest.on_failure_processor_type}}} with tag
        {{{_ingest.on_failure_processor_tag}}} in pipeline
        {{{_ingest.on_failure_pipeline}}} failed with message:
        {{{_ingest.on_failure_message}}}
  - set:
      field: event.kind
      value: pipeline_error
```

(example event)
Teleport Audit Events
28 samples, 1 review cycle, 50s for categorization: ai_teleport_202410150835-1.0.0.zip
(generated ingest pipeline)
(example event)
PAN-OS Traffic
100 samples, 4 review cycles, 120s, then 100s for categorization: ai_panw_202410150813-1.0.0.zip
(generated ingest pipeline)
(example event)
Footnotes

[^1]: As before, deterministically selected on the frontend, see https://github.com/elastic/kibana/pull/191598