[Security Solution] [Elastic AI Assistant] Hybrid (vector + terms) search for improved ES|QL query generation (#168995)

## [Security Solution] [Elastic AI Assistant] Hybrid (vector + terms) search for improved ES|QL query generation

This PR implements hybrid (vector + terms) search to improve the quality of `ES|QL` queries generated by the Elastic AI Assistant.

The hybrid search combines (from a single request to Elasticsearch):

- Vector search results from ELSER that vary depending on the query specified by the user
- Terms search results that return a set of Knowledge Base (KB) documents marked as "required" for a topic

The hybrid search results, when provided as context to an LLM, improve the quality of generated `ES|QL` queries by combining `ES|QL` parser grammar and documentation specific to the question asked by a user with additional examples of valid `ES|QL` queries that aren't specific to the query.

## Details

### Indexing additional `metadata`

The `loadESQL` function in `x-pack/plugins/elastic_assistant/server/lib/langchain/content_loaders/esql_loader.ts` loads a directory containing 13 valid examples and one invalid example of `ES|QL` queries:

```typescript
const rawExampleQueries = await exampleQueriesLoader.load();

// Add additional metadata to the example queries that indicates they are required KB documents:
const requiredExampleQueries = addRequiredKbResourceMetadata({
  docs: rawExampleQueries,
  kbResource: ESQL_RESOURCE,
});
```

The `addRequiredKbResourceMetadata` function adds two additional fields to the `metadata` property of the document:

- `kbResource` - a `keyword` field that specifies the category of knowledge, e.g. `esql`
- `required` - a `boolean` field that, when `true`, indicates the document should be returned in all searches for the `kbResource`

The additional metadata fields are shown in the following abridged sample document:

```
{
  "_index": ".kibana-elastic-ai-assistant-kb",
  "_id": "e297e2d9-fb0e-4638-b4be-af31d1b31b9f",
  "_version": 1,
  "_seq_no": 129,
  "_primary_term": 1,
  "found": true,
  "_source": {
    "metadata": {
      "source": "/Users/andrew.goldstein/Projects/forks/andrew-goldstein/kibana/x-pack/plugins/elastic_assistant/server/knowledge_base/esql/example_queries/esql_example_query_0001.asciidoc",
      "required": true,
      "kbResource": "esql"
    },
    "vector": {
      "tokens": {
        "serial": 0.5612584,
        "syntax": 0.006727545,
        "user": 1.1184403,
        // ...additional tokens
      },
      "model_id": ".elser_model_2"
    },
    "text": """[[esql-example-queries]]

The following is an example ES|QL query:

\`\`\`
FROM logs-*
| WHERE NOT CIDR_MATCH(destination.ip, "10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")
| STATS destcount = COUNT(destination.ip) by user.name, host.name
| ENRICH ldap_lookup_new ON user.name
| WHERE group.name IS NOT NULL
| EVAL follow_up = CASE(
    destcount >= 100, "true",
    "false")
| SORT destcount desc
| KEEP destcount, host.name, user.name, group.name, follow_up
\`\`\`
"""
  }
}
```

### Hybrid search

The `ElasticsearchStore.similaritySearch` function is invoked by LangChain's `VectorStoreRetriever.getRelevantDocuments` function when the `RetrievalQAChain` searches for documents.
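To make the role of the `required` and `kbResource` metadata fields concrete, here is a minimal sketch of how filters for required KB documents and a two-part multi-search body might be assembled. Every helper name and type in it (`requiredKbDocsFilters`, `buildTermsQuery`, `buildVectorQuery`, `buildMsearchBody`, `MsearchArgs`) is hypothetical and is not the PR's actual implementation. The field paths (`metadata.kbResource`, `metadata.required`, `vector.tokens`) are taken from the sample document, and wrapping the required-document filters in a nested `bool` inside `must_not` is an assumption made here so that only documents matching both filters are excluded from the vector search.

```typescript
// Hypothetical helper names and shapes, for illustration only.
type TermFilter = { term: Record<string, string | boolean> };
type QueryDsl = Record<string, unknown>;

// Filters that match documents marked as required for a KB resource.
// Field paths follow the `metadata` object in the sample document.
function requiredKbDocsFilters(kbResource: string): TermFilter[] {
  return [
    { term: { 'metadata.kbResource': kbResource } },
    { term: { 'metadata.required': true } },
  ];
}

// Terms search: match only documents that satisfy every required filter.
function buildTermsQuery(required: TermFilter[]): QueryDsl {
  return { bool: { must: required } };
}

// Vector search: ELSER `text_expansion` on the user's query. Required
// documents are excluded (assumption: nested bool, so a document is
// excluded only when it matches all of the required filters).
function buildVectorQuery(modelId: string, query: string, required: TermFilter[]): QueryDsl {
  return {
    bool: {
      must: [{ text_expansion: { 'vector.tokens': { model_id: modelId, model_text: query } } }],
      must_not: [{ bool: { must: required } }],
    },
  };
}

interface MsearchArgs {
  index: string;
  kbResource: string;
  modelId: string;
  query: string;
  vectorSize: number;
  termsSize: number;
}

// A multi-search body alternates a header object with a search body,
// one pair per search.
function buildMsearchBody(args: MsearchArgs): QueryDsl[] {
  const required = requiredKbDocsFilters(args.kbResource);
  return [
    { index: args.index },
    { query: buildVectorQuery(args.modelId, args.query, required), size: args.vectorSize },
    { index: args.index },
    { query: buildTermsQuery(required), size: args.termsSize },
  ];
}

// Example usage with illustrative sizes (not the PR's actual constants).
const exampleBody = buildMsearchBody({
  index: '.kibana-elastic-ai-assistant-kb',
  kbResource: 'esql',
  modelId: '.elser_model_2',
  query: 'count connections to external IP addresses by user',
  vectorSize: 4,
  termsSize: 100,
});
console.log(JSON.stringify(exampleBody, null, 2));
```

Excluding the required documents from the vector search keeps them from appearing twice once the two result sets are flattened into one array.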
A single [msearch](https://www.elastic.co/guide/en/elasticsearch/reference/current/search-multi-search.html) request to Elasticsearch performs the hybrid search, combining the vector and terms searches:

```typescript
// requiredDocs is an array of filters that can be used in a `bool` Elasticsearch DSL query to filter in/out required KB documents:
const requiredDocs = getRequiredKbDocsTermsQueryDsl(this.kbResource);

// The `k` parameter is typically provided by LangChain's `VectorStoreRetriever._getRelevantDocuments`, which calls this function:
const vectorSearchQuerySize = k ?? FALLBACK_SIMILARITY_SEARCH_SIZE;

// build a vector search query:
const vectorSearchQuery = getVectorSearchQuery({
  filter,
  modelId: this.model,
  mustNotTerms: requiredDocs,
  query,
});

// build a (separate) terms search query:
const termsSearchQuery = getTermsSearchQuery(requiredDocs);

// combine the vector search query and the terms search queries into a single multi-search query:
const mSearchQueryBody = getMsearchQueryBody({
  index: this.index,
  termsSearchQuery,
  termsSearchQuerySize: TERMS_QUERY_SIZE,
  vectorSearchQuery,
  vectorSearchQuerySize,
});

try {
  // execute both queries via a single multi-search request:
  const result = await this.esClient.msearch<MsearchResponse>(mSearchQueryBody);

  // flatten the results of the combined queries into a single array of hits:
  const results: FlattenedHit[] = result.responses.flatMap((response) =>
  // ...
```

## Desk testing

1. Delete any previous instances of the Knowledge Base by executing the following query in Kibana's `Dev Tools`:

```
DELETE .kibana-elastic-ai-assistant-kb
```

2. In the Security Solution, open the Elastic AI Assistant
3. In the assistant, click the `Settings` gear
4. Click the `Knowledge Base` icon to view the KB settings
5. Toggle the `Knowledge Base` setting `off` if it's already on
6. Toggle the `Knowledge Base` setting `on` to load the KB documents
7. Click the `Save` button to close settings
8. Enter the following prompt, then press Enter:

```
Generate an ES|QL query that will count the number of connections made to external IP addresses, broken down by user. If the count is greater than 100 for a specific user, add a new field called "follow_up" that contains a value of "true", otherwise, it should contain "false". The user names should also be enriched with their respective group names.
```

**Expected result**

A response similar to the following is returned:

```
FROM logs-*
| WHERE NOT CIDR_MATCH(destination.ip, "10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")
| STATS destcount = COUNT(destination.ip) BY user.name
| ENRICH ldap_lookup ON user.name
| EVAL follow_up = CASE(
    destcount >= 100, "true",
    "false")
| SORT destcount DESC
| KEEP destcount, user.name, group.name, follow_up
```

(cherry picked from commit d0e9925)