[8.11] [Security Solution] [Elastic AI Assistant] Hybrid (vector + terms) search for improved ES|QL query generation (#168995) (#169054)

# Backport

This will backport the following commits from `main` to `8.11`:
- [[Security Solution] [Elastic AI Assistant] Hybrid (vector + terms) search for improved ES|QL query generation (#168995)](#168995)

<!--- Backport version: 8.9.7 -->

### Questions ?
Please refer to the [Backport tool documentation](https://github.com/sqren/backport)

<!--BACKPORT [{"author":{"name":"Andrew Macri","email":"[email protected]"},"sourceCommit":{"committedDate":"2023-10-17T00:54:40Z","message":"[Security Solution] [Elastic AI Assistant] Hybrid (vector + terms) search for improved ES|QL query generation (#168995)\n\n## [Security Solution] [Elastic AI Assistant] Hybrid (vector + terms) search for improved ES|QL query generation\r\n\r\nThis PR implements hybrid (vector + terms) search to improve the quality of `ES|QL` queries generated by the Elastic AI Assistant.\r\n\r\nThe hybrid search combines (from a single request to Elasticsearch):\r\n\r\n- Vector search results from ELSER that vary depending on the query specified by the user\r\n- Terms search results that return a set of Knowledge Base (KB) documents marked as \"required\" for a topic\r\n\r\nThe hybrid search results, when provided as context to an LLM, improve the quality of generated `ES|QL` queries by combining `ES|QL` parser grammar and documentation specific to the question asked by a user with additional examples of valid `ES|QL` queries that aren't specific to the query.\r\n\r\n## Details\r\n\r\n### Indexing additional `metadata`\r\n\r\nThe `loadESQL` function in `x-pack/plugins/elastic_assistant/server/lib/langchain/content_loaders/esql_loader.ts` loads a directory containing 13 valid, and one invalid example of `ES|QL` queries:\r\n\r\n```typescript\r\n const rawExampleQueries = await exampleQueriesLoader.load();\r\n\r\n // Add additional metadata to the example queries that indicates they are required KB documents:\r\n const requiredExampleQueries 
= addRequiredKbResourceMetadata({\r\n docs: rawExampleQueries,\r\n kbResource: ESQL_RESOURCE,\r\n });\r\n```\r\n\r\nThe `addRequiredKbResourceMetadata` function adds two additional fields to the `metadata` property of the document:\r\n\r\n- `kbResource` - a `keyword` field that specifies the category of knowledge, e.g. `esql`\r\n- `required` - a `boolean` field that when `true`, indicates the document should be returned in all searches for the `kbResource`\r\n\r\nThe additional metadata fields are shown below in the following abridged sample document:\r\n\r\n```\r\n{\r\n \"_index\": \".kibana-elastic-ai-assistant-kb\",\r\n \"_id\": \"e297e2d9-fb0e-4638-b4be-af31d1b31b9f\",\r\n \"_version\": 1,\r\n \"_seq_no\": 129,\r\n \"_primary_term\": 1,\r\n \"found\": true,\r\n \"_source\": {\r\n \"metadata\": {\r\n \"source\": \"/Users/andrew.goldstein/Projects/forks/andrew-goldstein/kibana/x-pack/plugins/elastic_assistant/server/knowledge_base/esql/example_queries/esql_example_query_0001.asciidoc\",\r\n \"required\": true,\r\n \"kbResource\": \"esql\"\r\n },\r\n \"vector\": {\r\n \"tokens\": {\r\n \"serial\": 0.5612584,\r\n \"syntax\": 0.006727545,\r\n \"user\": 1.1184403,\r\n // ...additional tokens\r\n },\r\n \"model_id\": \".elser_model_2\"\r\n },\r\n \"text\": \"\"\"[[esql-example-queries]]\r\n\r\nThe following is an example ES|QL query:\r\n\r\n\\`\\`\\`\r\nFROM logs-*\r\n| WHERE NOT CIDR_MATCH(destination.ip, \"10.0.0.0/8\", \"172.16.0.0/12\", \"192.168.0.0/16\")\r\n| STATS destcount = COUNT(destination.ip) by user.name, host.name\r\n| ENRICH ldap_lookup_new ON user.name\r\n| WHERE group.name IS NOT NULL\r\n| EVAL follow_up = CASE(\r\n destcount >= 100, \"true\",\r\n \"false\")\r\n| SORT destcount desc\r\n| KEEP destcount, host.name, user.name, group.name, follow_up\r\n\\`\\`\\`\r\n\"\"\"\r\n }\r\n}\r\n```\r\n\r\n### Hybrid search\r\n\r\nThe `ElasticsearchStore.similaritySearch` function is invoked by LangChain's `VectorStoreRetriever.getRelevantDocuments` function when 
the `RetrievalQAChain` searches for documents.\r\n\r\nA single request to Elasticsearch performs a hybrid search that combines the vector and terms searches into a single request with an [msearch](https://www.elastic.co/guide/en/elasticsearch/reference/current/search-multi-search.html):\r\n\r\n```typescript\r\n // requiredDocs is an array of filters that can be used in a `bool` Elasticsearch DSL query to filter in/out required KB documents:\r\n const requiredDocs = getRequiredKbDocsTermsQueryDsl(this.kbResource);\r\n\r\n // The `k` parameter is typically provided by LangChain's `VectorStoreRetriever._getRelevantDocuments`, which calls this function:\r\n const vectorSearchQuerySize = k ?? FALLBACK_SIMILARITY_SEARCH_SIZE;\r\n\r\n // build a vector search query:\r\n const vectorSearchQuery = getVectorSearchQuery({\r\n filter,\r\n modelId: this.model,\r\n mustNotTerms: requiredDocs,\r\n query,\r\n });\r\n\r\n // build a (separate) terms search query:\r\n const termsSearchQuery = getTermsSearchQuery(requiredDocs);\r\n\r\n // combine the vector search query and the terms search queries into a single multi-search query:\r\n const mSearchQueryBody = getMsearchQueryBody({\r\n index: this.index,\r\n termsSearchQuery,\r\n termsSearchQuerySize: TERMS_QUERY_SIZE,\r\n vectorSearchQuery,\r\n vectorSearchQuerySize,\r\n });\r\n\r\n try {\r\n // execute both queries via a single multi-search request:\r\n const result = await this.esClient.msearch<MsearchResponse>(mSearchQueryBody);\r\n\r\n // flatten the results of the combined queries into a single array of hits:\r\n const results: FlattenedHit[] = result.responses.flatMap((response) =>\r\n // ...\r\n```\r\n\r\n## Desk testing\r\n\r\n1. Delete any previous instances of the Knowledge Base by executing the following query in Kibana's `Dev Tools`:\r\n\r\n```\r\n\r\nDELETE .kibana-elastic-ai-assistant-kb\r\n\r\n```\r\n\r\n2. In the Security Solution, open the Elastic AI Assistant\r\n\r\n3. 
In the assistant, click the `Settings` gear\r\n\r\n4. Click the `Knowledge Base` icon to view the KB settings\r\n\r\n5. Toggle the `Knowledge Base` setting `off` if it's already on\r\n\r\n6. Toggle the `Knowledge Base` setting `on` to load the KB documents\r\n\r\n7. Click the `Save` button to close settings\r\n\r\n8. Enter the following prompt, then press Enter:\r\n\r\n```\r\nGenerate an ES|QL query that will count the number of connections made to external IP addresses, broken down by user. If the count is greater than 100 for a specific user, add a new field called \"follow_up\" that contains a value of \"true\", otherwise, it should contain \"false\". The user names should also be enriched with their respective group names.\r\n```\r\n\r\n**Expected result**\r\n\r\nA response similar to the following is returned:\r\n\r\n```\r\nFROM logs-*\r\n| WHERE NOT CIDR_MATCH(destination.ip, \"10.0.0.0/8\", \"172.16.0.0/12\", \"192.168.0.0/16\")\r\n| STATS destcount = COUNT(destination.ip) BY user.name\r\n| ENRICH ldap_lookup ON user.name\r\n| EVAL follow_up = CASE(\r\n destcount >= 100, \"true\",\r\n \"false\")\r\n| SORT destcount DESC\r\n| KEEP destcount, user.name, group.name, follow_up\r\n```","sha":"d0e99258c68d57bc83788724814783ece176aa78","branchLabelMapping":{"^v8.12.0$":"main","^v(\\d+).(\\d+).\\d+$":"$1.$2"}},"sourcePullRequest":{"labels":["release_note:skip","Team: SecuritySolution","Feature:Elastic AI Assistant","v8.11.0","v8.12.0"],"number":168995,"url":"https://github.com/elastic/kibana/pull/168995","mergeCommit":{"message":"[Security Solution] [Elastic AI Assistant] Hybrid (vector + terms) search for improved ES|QL query generation (#168995)\n\n## [Security Solution] [Elastic AI Assistant] Hybrid (vector + terms) search for improved ES|QL query generation\r\n\r\nThis PR implements hybrid (vector + terms) search to improve the quality of `ES|QL` queries generated by the Elastic AI Assistant.\r\n\r\nThe hybrid search combines (from a single request to 
Elasticsearch):\r\n\r\n- Vector search results from ELSER that vary depending on the query specified by the user\r\n- Terms search results that return a set of Knowledge Base (KB) documents marked as \"required\" for a topic\r\n\r\nThe hybrid search results, when provided as context to an LLM, improve the quality of generated `ES|QL` queries by combining `ES|QL` parser grammar and documentation specific to the question asked by a user with additional examples of valid `ES|QL` queries that aren't specific to the query.\r\n\r\n## Details\r\n\r\n### Indexing additional `metadata`\r\n\r\nThe `loadESQL` function in `x-pack/plugins/elastic_assistant/server/lib/langchain/content_loaders/esql_loader.ts` loads a directory containing 13 valid, and one invalid example of `ES|QL` queries:\r\n\r\n```typescript\r\n const rawExampleQueries = await exampleQueriesLoader.load();\r\n\r\n // Add additional metadata to the example queries that indicates they are required KB documents:\r\n const requiredExampleQueries = addRequiredKbResourceMetadata({\r\n docs: rawExampleQueries,\r\n kbResource: ESQL_RESOURCE,\r\n });\r\n```\r\n\r\nThe `addRequiredKbResourceMetadata` function adds two additional fields to the `metadata` property of the document:\r\n\r\n- `kbResource` - a `keyword` field that specifies the category of knowledge, e.g. 
`esql`\r\n- `required` - a `boolean` field that when `true`, indicates the document should be returned in all searches for the `kbResource`\r\n\r\nThe additional metadata fields are shown below in the following abridged sample document:\r\n\r\n```\r\n{\r\n \"_index\": \".kibana-elastic-ai-assistant-kb\",\r\n \"_id\": \"e297e2d9-fb0e-4638-b4be-af31d1b31b9f\",\r\n \"_version\": 1,\r\n \"_seq_no\": 129,\r\n \"_primary_term\": 1,\r\n \"found\": true,\r\n \"_source\": {\r\n \"metadata\": {\r\n \"source\": \"/Users/andrew.goldstein/Projects/forks/andrew-goldstein/kibana/x-pack/plugins/elastic_assistant/server/knowledge_base/esql/example_queries/esql_example_query_0001.asciidoc\",\r\n \"required\": true,\r\n \"kbResource\": \"esql\"\r\n },\r\n \"vector\": {\r\n \"tokens\": {\r\n \"serial\": 0.5612584,\r\n \"syntax\": 0.006727545,\r\n \"user\": 1.1184403,\r\n // ...additional tokens\r\n },\r\n \"model_id\": \".elser_model_2\"\r\n },\r\n \"text\": \"\"\"[[esql-example-queries]]\r\n\r\nThe following is an example ES|QL query:\r\n\r\n\\`\\`\\`\r\nFROM logs-*\r\n| WHERE NOT CIDR_MATCH(destination.ip, \"10.0.0.0/8\", \"172.16.0.0/12\", \"192.168.0.0/16\")\r\n| STATS destcount = COUNT(destination.ip) by user.name, host.name\r\n| ENRICH ldap_lookup_new ON user.name\r\n| WHERE group.name IS NOT NULL\r\n| EVAL follow_up = CASE(\r\n destcount >= 100, \"true\",\r\n \"false\")\r\n| SORT destcount desc\r\n| KEEP destcount, host.name, user.name, group.name, follow_up\r\n\\`\\`\\`\r\n\"\"\"\r\n }\r\n}\r\n```\r\n\r\n### Hybrid search\r\n\r\nThe `ElasticsearchStore.similaritySearch` function is invoked by LangChain's `VectorStoreRetriever.getRelevantDocuments` function when the `RetrievalQAChain` searches for documents.\r\n\r\nA single request to Elasticsearch performs a hybrid search that combines the vector and terms searches into a single request with an [msearch](https://www.elastic.co/guide/en/elasticsearch/reference/current/search-multi-search.html):\r\n\r\n```typescript\r\n // 
requiredDocs is an array of filters that can be used in a `bool` Elasticsearch DSL query to filter in/out required KB documents:\r\n const requiredDocs = getRequiredKbDocsTermsQueryDsl(this.kbResource);\r\n\r\n // The `k` parameter is typically provided by LangChain's `VectorStoreRetriever._getRelevantDocuments`, which calls this function:\r\n const vectorSearchQuerySize = k ?? FALLBACK_SIMILARITY_SEARCH_SIZE;\r\n\r\n // build a vector search query:\r\n const vectorSearchQuery = getVectorSearchQuery({\r\n filter,\r\n modelId: this.model,\r\n mustNotTerms: requiredDocs,\r\n query,\r\n });\r\n\r\n // build a (separate) terms search query:\r\n const termsSearchQuery = getTermsSearchQuery(requiredDocs);\r\n\r\n // combine the vector search query and the terms search queries into a single multi-search query:\r\n const mSearchQueryBody = getMsearchQueryBody({\r\n index: this.index,\r\n termsSearchQuery,\r\n termsSearchQuerySize: TERMS_QUERY_SIZE,\r\n vectorSearchQuery,\r\n vectorSearchQuerySize,\r\n });\r\n\r\n try {\r\n // execute both queries via a single multi-search request:\r\n const result = await this.esClient.msearch<MsearchResponse>(mSearchQueryBody);\r\n\r\n // flatten the results of the combined queries into a single array of hits:\r\n const results: FlattenedHit[] = result.responses.flatMap((response) =>\r\n // ...\r\n```\r\n\r\n## Desk testing\r\n\r\n1. Delete any previous instances of the Knowledge Base by executing the following query in Kibana's `Dev Tools`:\r\n\r\n```\r\n\r\nDELETE .kibana-elastic-ai-assistant-kb\r\n\r\n```\r\n\r\n2. In the Security Solution, open the Elastic AI Assistant\r\n\r\n3. In the assistant, click the `Settings` gear\r\n\r\n4. Click the `Knowledge Base` icon to view the KB settings\r\n\r\n5. Toggle the `Knowledge Base` setting `off` if it's already on\r\n\r\n6. Toggle the `Knowledge Base` setting `on` to load the KB documents\r\n\r\n7. Click the `Save` button to close settings\r\n\r\n8. 
Enter the following prompt, then press Enter:\r\n\r\n```\r\nGenerate an ES|QL query that will count the number of connections made to external IP addresses, broken down by user. If the count is greater than 100 for a specific user, add a new field called \"follow_up\" that contains a value of \"true\", otherwise, it should contain \"false\". The user names should also be enriched with their respective group names.\r\n```\r\n\r\n**Expected result**\r\n\r\nA response similar to the following is returned:\r\n\r\n```\r\nFROM logs-*\r\n| WHERE NOT CIDR_MATCH(destination.ip, \"10.0.0.0/8\", \"172.16.0.0/12\", \"192.168.0.0/16\")\r\n| STATS destcount = COUNT(destination.ip) BY user.name\r\n| ENRICH ldap_lookup ON user.name\r\n| EVAL follow_up = CASE(\r\n destcount >= 100, \"true\",\r\n \"false\")\r\n| SORT destcount DESC\r\n| KEEP destcount, user.name, group.name, follow_up\r\n```","sha":"d0e99258c68d57bc83788724814783ece176aa78"}},"sourceBranch":"main","suggestedTargetBranches":["8.11"],"targetPullRequestStates":[{"branch":"8.11","label":"v8.11.0","labelRegex":"^v(\\d+).(\\d+).\\d+$","isSourceBranch":false,"state":"NOT_CREATED"},{"branch":"main","label":"v8.12.0","labelRegex":"^v8.12.0$","isSourceBranch":true,"state":"MERGED","url":"https://github.com/elastic/kibana/pull/168995","number":168995,"mergeCommit":{"message":"[Security Solution] [Elastic AI Assistant] Hybrid (vector + terms) search for improved ES|QL query generation (#168995)\n\n## [Security Solution] [Elastic AI Assistant] Hybrid (vector + terms) search for improved ES|QL query generation\r\n\r\nThis PR implements hybrid (vector + terms) search to improve the quality of `ES|QL` queries generated by the Elastic AI Assistant.\r\n\r\nThe hybrid search combines (from a single request to Elasticsearch):\r\n\r\n- Vector search results from ELSER that vary depending on the query specified by the user\r\n- Terms search results that return a set of Knowledge Base (KB) documents marked as \"required\" for a 
topic\r\n\r\nThe hybrid search results, when provided as context to an LLM, improve the quality of generated `ES|QL` queries by combining `ES|QL` parser grammar and documentation specific to the question asked by a user with additional examples of valid `ES|QL` queries that aren't specific to the query.\r\n\r\n## Details\r\n\r\n### Indexing additional `metadata`\r\n\r\nThe `loadESQL` function in `x-pack/plugins/elastic_assistant/server/lib/langchain/content_loaders/esql_loader.ts` loads a directory containing 13 valid, and one invalid example of `ES|QL` queries:\r\n\r\n```typescript\r\n const rawExampleQueries = await exampleQueriesLoader.load();\r\n\r\n // Add additional metadata to the example queries that indicates they are required KB documents:\r\n const requiredExampleQueries = addRequiredKbResourceMetadata({\r\n docs: rawExampleQueries,\r\n kbResource: ESQL_RESOURCE,\r\n });\r\n```\r\n\r\nThe `addRequiredKbResourceMetadata` function adds two additional fields to the `metadata` property of the document:\r\n\r\n- `kbResource` - a `keyword` field that specifies the category of knowledge, e.g. 
`esql`\r\n- `required` - a `boolean` field that when `true`, indicates the document should be returned in all searches for the `kbResource`\r\n\r\nThe additional metadata fields are shown below in the following abridged sample document:\r\n\r\n```\r\n{\r\n \"_index\": \".kibana-elastic-ai-assistant-kb\",\r\n \"_id\": \"e297e2d9-fb0e-4638-b4be-af31d1b31b9f\",\r\n \"_version\": 1,\r\n \"_seq_no\": 129,\r\n \"_primary_term\": 1,\r\n \"found\": true,\r\n \"_source\": {\r\n \"metadata\": {\r\n \"source\": \"/Users/andrew.goldstein/Projects/forks/andrew-goldstein/kibana/x-pack/plugins/elastic_assistant/server/knowledge_base/esql/example_queries/esql_example_query_0001.asciidoc\",\r\n \"required\": true,\r\n \"kbResource\": \"esql\"\r\n },\r\n \"vector\": {\r\n \"tokens\": {\r\n \"serial\": 0.5612584,\r\n \"syntax\": 0.006727545,\r\n \"user\": 1.1184403,\r\n // ...additional tokens\r\n },\r\n \"model_id\": \".elser_model_2\"\r\n },\r\n \"text\": \"\"\"[[esql-example-queries]]\r\n\r\nThe following is an example ES|QL query:\r\n\r\n\\`\\`\\`\r\nFROM logs-*\r\n| WHERE NOT CIDR_MATCH(destination.ip, \"10.0.0.0/8\", \"172.16.0.0/12\", \"192.168.0.0/16\")\r\n| STATS destcount = COUNT(destination.ip) by user.name, host.name\r\n| ENRICH ldap_lookup_new ON user.name\r\n| WHERE group.name IS NOT NULL\r\n| EVAL follow_up = CASE(\r\n destcount >= 100, \"true\",\r\n \"false\")\r\n| SORT destcount desc\r\n| KEEP destcount, host.name, user.name, group.name, follow_up\r\n\\`\\`\\`\r\n\"\"\"\r\n }\r\n}\r\n```\r\n\r\n### Hybrid search\r\n\r\nThe `ElasticsearchStore.similaritySearch` function is invoked by LangChain's `VectorStoreRetriever.getRelevantDocuments` function when the `RetrievalQAChain` searches for documents.\r\n\r\nA single request to Elasticsearch performs a hybrid search that combines the vector and terms searches into a single request with an [msearch](https://www.elastic.co/guide/en/elasticsearch/reference/current/search-multi-search.html):\r\n\r\n```typescript\r\n // 
requiredDocs is an array of filters that can be used in a `bool` Elasticsearch DSL query to filter in/out required KB documents:\r\n const requiredDocs = getRequiredKbDocsTermsQueryDsl(this.kbResource);\r\n\r\n // The `k` parameter is typically provided by LangChain's `VectorStoreRetriever._getRelevantDocuments`, which calls this function:\r\n const vectorSearchQuerySize = k ?? FALLBACK_SIMILARITY_SEARCH_SIZE;\r\n\r\n // build a vector search query:\r\n const vectorSearchQuery = getVectorSearchQuery({\r\n filter,\r\n modelId: this.model,\r\n mustNotTerms: requiredDocs,\r\n query,\r\n });\r\n\r\n // build a (separate) terms search query:\r\n const termsSearchQuery = getTermsSearchQuery(requiredDocs);\r\n\r\n // combine the vector search query and the terms search queries into a single multi-search query:\r\n const mSearchQueryBody = getMsearchQueryBody({\r\n index: this.index,\r\n termsSearchQuery,\r\n termsSearchQuerySize: TERMS_QUERY_SIZE,\r\n vectorSearchQuery,\r\n vectorSearchQuerySize,\r\n });\r\n\r\n try {\r\n // execute both queries via a single multi-search request:\r\n const result = await this.esClient.msearch<MsearchResponse>(mSearchQueryBody);\r\n\r\n // flatten the results of the combined queries into a single array of hits:\r\n const results: FlattenedHit[] = result.responses.flatMap((response) =>\r\n // ...\r\n```\r\n\r\n## Desk testing\r\n\r\n1. Delete any previous instances of the Knowledge Base by executing the following query in Kibana's `Dev Tools`:\r\n\r\n```\r\n\r\nDELETE .kibana-elastic-ai-assistant-kb\r\n\r\n```\r\n\r\n2. In the Security Solution, open the Elastic AI Assistant\r\n\r\n3. In the assistant, click the `Settings` gear\r\n\r\n4. Click the `Knowledge Base` icon to view the KB settings\r\n\r\n5. Toggle the `Knowledge Base` setting `off` if it's already on\r\n\r\n6. Toggle the `Knowledge Base` setting `on` to load the KB documents\r\n\r\n7. Click the `Save` button to close settings\r\n\r\n8. 
Enter the following prompt, then press Enter:\r\n\r\n```\r\nGenerate an ES|QL query that will count the number of connections made to external IP addresses, broken down by user. If the count is greater than 100 for a specific user, add a new field called \"follow_up\" that contains a value of \"true\", otherwise, it should contain \"false\". The user names should also be enriched with their respective group names.\r\n```\r\n\r\n**Expected result**\r\n\r\nA response similar to the following is returned:\r\n\r\n```\r\nFROM logs-*\r\n| WHERE NOT CIDR_MATCH(destination.ip, \"10.0.0.0/8\", \"172.16.0.0/12\", \"192.168.0.0/16\")\r\n| STATS destcount = COUNT(destination.ip) BY user.name\r\n| ENRICH ldap_lookup ON user.name\r\n| EVAL follow_up = CASE(\r\n destcount >= 100, \"true\",\r\n \"false\")\r\n| SORT destcount DESC\r\n| KEEP destcount, user.name, group.name, follow_up\r\n```","sha":"d0e99258c68d57bc83788724814783ece176aa78"}}]}] BACKPORT--> Co-authored-by: Andrew Macri <[email protected]>
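
The hybrid search described in the commit message can be illustrated with a minimal sketch. It is not the Kibana implementation: the function names and bodies, the `QueryDsl`, `Hit`, and `MsearchItem` types, the sample query text, and the sizes `4` and `10` are all assumptions made for illustration. Only the index name, the `metadata.kbResource` and `metadata.required` fields, the `vector.tokens` field, and the `.elser_model_2` model id come from the sample document above. The sketch wraps the required-document filters in a nested `bool` so that the vector search excludes only documents matching both filters.

```typescript
// Hypothetical sketch: the function bodies below are assumptions, not the Kibana source.
type QueryDsl = Record<string, unknown>;

interface Hit {
  _id: string;
  _source?: { text?: string; metadata?: Record<string, unknown> };
}

// An msearch response item is either a successful search (with hits) or an error entry.
type MsearchItem = { hits?: { hits: Hit[] } };

// Term filters that match KB documents marked as required for one resource.
function requiredDocsFilters(kbResource: string): QueryDsl[] {
  return [
    { term: { 'metadata.kbResource': kbResource } },
    { term: { 'metadata.required': true } },
  ];
}

// ELSER (text_expansion) query that skips documents matching ALL required-doc filters,
// so the same document is not returned by both searches.
function buildVectorQuery(modelId: string, query: string, required: QueryDsl[]): QueryDsl {
  return {
    bool: {
      must: [{ text_expansion: { 'vector.tokens': { model_id: modelId, model_text: query } } }],
      must_not: [{ bool: { filter: required } }],
    },
  };
}

// Terms query that returns only the required documents.
function buildTermsQuery(required: QueryDsl[]): QueryDsl {
  return { bool: { filter: required } };
}

// An msearch body alternates a header ({ index }) with a search body.
function buildMsearchBody(
  index: string,
  vectorQuery: QueryDsl,
  vectorSize: number,
  termsQuery: QueryDsl,
  termsSize: number
): QueryDsl[] {
  return [
    { index },
    { query: vectorQuery, size: vectorSize },
    { index },
    { query: termsQuery, size: termsSize },
  ];
}

// Merge the hits of every successful response into one array; error entries contribute nothing.
function flattenHits(responses: MsearchItem[]): Hit[] {
  return responses.flatMap((response) => response.hits?.hits ?? []);
}

// Usage: build the combined request body (the sizes 4 and 10 are placeholder values).
const required = requiredDocsFilters('esql');
const body = buildMsearchBody(
  '.kibana-elastic-ai-assistant-kb',
  buildVectorQuery('.elser_model_2', 'count connections by user', required),
  4,
  buildTermsQuery(required),
  10
);
console.log(JSON.stringify(body, null, 2));
```

Sending both searches in one multi-search body keeps the round trip to Elasticsearch at a single request, and flattening the two responses gives the retriever one list of documents to pass to the LLM as context.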