Commit
[8.x] [Auto Import] Use larger number of samples on the backend (#196233) (#196386)

# Backport

This will backport the following commits from `main` to `8.x`:
- [[Auto Import] Use larger number of samples on the backend (#196233)](#196233)

<!--- Backport version: 9.4.3 -->

### Questions ?
Please refer to the [Backport tool documentation](https://github.com/sqren/backport)

### Source commit

- Pull request: https://github.com/elastic/kibana/pull/196233 (merged into `main`, label `v9.0.0`)
- Commit: `fc3ce5475a73aad1abdbf857bc8787cd0f10aaed`, committed 2024-10-15T16:22:05Z
- Author: Ilya Nikokoshev
- Labels: `release_note:enhancement`, `enhancement`, `v9.0.0`, `backport:prev-minor`, `8.16 candidate`, `Team:Security-Scalability`, `Feature:AutomaticImport`
- Branch label mapping: `^v9.0.0$` → `main`, `^v8.16.0$` → `8.x`, `^v(\d+).(\d+).\d+$` → `$1.$2`

## Release Notes

Automatic Import now analyses a larger number of samples to generate an integration.

## Summary

Closes https://github.com/elastic/security-team/issues/9844

**Added: Backend Sampling**

We pass 100 rows (these numeric values are adjustable) to the backend [^1]

[^1]: As before, deterministically selected on the frontend, see https://github.com//pull/191598

The Categorization chain now processes the samples in batches. After the initial categorization it performs a number of review cycles (at most 5, tuned so that we stay under the 2 minute limit for a single API call).

To decide when to stop processing, we keep a list of _stable_ samples as follows:

1. The list is initially empty.
2. For each review we select a random subset of 40 samples, preferring samples that are not yet stable.
3. After each review (when the LLM potentially gives us new processors or changes the old ones) we compare the new pipeline results with the old pipeline results.
4. Reviewed samples that did not change their categorization are added to the stable list.
5. Any samples that have changed their categorization are removed from the stable list.
6. If all samples are stable, we finish processing.

**Removed: User Notification**

Using 100 samples provides a balance between the expected complexity and the time budget we work with. We might want to change this number in the future, possibly dynamically, which makes the specific value of no importance to the user. Thus we remove the truncation notification.

**Unchanged:**

- No batching is done in the related chain: it seems to work as-is.

**Refactored:**

- We centralize the sizing constants in the `x-pack/plugins/integration_assistant/common/constants.ts` file.
- We remove the unused state key `formattedSamples` and combine `modelJSONInput` back into `modelInput`.

> [!NOTE]
> I had difficulty generating new graph diagrams, so they remain unchanged.

Co-authored-by: Ilya Nikokoshev <[email protected]>
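
The stable-sample bookkeeping described in the summary (a review batch of 40, at most 5 review cycles, stop once every sample is stable) can be illustrated with a minimal TypeScript sketch. This is not the code from the pull request: every name here (`MAX_REVIEW_CYCLES`, `REVIEW_BATCH_SIZE`, `selectReviewBatch`, `updateStable`, `runReviews`) is hypothetical, categorizations are simplified to plain strings, and the LLM review is stood in for by a synchronous callback.

```typescript
// Hypothetical constants mirroring the numbers quoted in the commit message.
const MAX_REVIEW_CYCLES = 5;
const REVIEW_BATCH_SIZE = 40;

// Assumption: a sample's categorization is reduced to a comparable string.
type Categorization = string;

// Fisher-Yates shuffle that returns a new array.
function shuffle<T>(items: T[], rand: () => number): T[] {
  const out = [...items];
  for (let i = out.length - 1; i > 0; i--) {
    const j = Math.floor(rand() * (i + 1));
    [out[i], out[j]] = [out[j], out[i]];
  }
  return out;
}

// Step 2: pick up to `size` sample indices at random, preferring unstable ones.
function selectReviewBatch(
  total: number,
  stable: Set<number>,
  size: number,
  rand: () => number = Math.random
): number[] {
  const all = Array.from({ length: total }, (_, i) => i);
  const unstable = shuffle(all.filter((i) => !stable.has(i)), rand);
  const alreadyStable = shuffle(all.filter((i) => stable.has(i)), rand);
  return [...unstable, ...alreadyStable].slice(0, size);
}

// Steps 3-5: compare old and new results and update the stable set.
function updateStable(
  stable: Set<number>,
  reviewed: number[],
  before: Categorization[],
  after: Categorization[]
): Set<number> {
  const next = new Set(stable);
  // Reviewed samples whose categorization did not change become stable.
  for (const i of reviewed) {
    if (before[i] === after[i]) next.add(i);
  }
  // Any sample whose categorization changed loses its stable status.
  for (let i = 0; i < after.length; i++) {
    if (before[i] !== after[i]) next.delete(i);
  }
  return next;
}

// Steps 1 and 6: start with an empty list and stop when all samples are
// stable or the cycle budget is exhausted.
function runReviews(
  initial: Categorization[],
  review: (batch: number[], current: Categorization[]) => Categorization[]
): { results: Categorization[]; cycles: number; stable: Set<number> } {
  let current = initial;
  let stable = new Set<number>();
  let cycles = 0;
  while (cycles < MAX_REVIEW_CYCLES && stable.size < current.length) {
    const batch = selectReviewBatch(current.length, stable, REVIEW_BATCH_SIZE);
    const next = review(batch, current);
    stable = updateStable(stable, batch, current, next);
    current = next;
    cycles++;
  }
  return { results: current, cycles, stable };
}
```

With 100 samples and a review step that never changes any result, this sketch marks 40, then 80, then all 100 samples as stable, so it stops after three cycles, well inside the five-cycle budget.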