[Security Solution] [Elastic AI Assistant] Hybrid (vector + terms) search for improved ES|QL query generation #168995
This PR implements hybrid (vector + terms) search to improve the quality of `ES|QL` queries generated by the Elastic AI Assistant.

The hybrid search combines, in a single request to Elasticsearch:

- Vector search results from ELSER that vary depending on the query specified by the user
- Terms search results that return a set of Knowledge Base (KB) documents marked as "required" for a topic

The hybrid search results, when provided as context to an LLM, improve the quality of generated `ES|QL` queries by combining `ES|QL` parser grammar and documentation specific to the question asked by a user with additional examples of valid `ES|QL` queries that aren't specific to the query.

## Details

### Indexing additional `metadata`

The `loadESQL` function in `x-pack/plugins/elastic_assistant/server/lib/langchain/content_loaders/esql_loader.ts` loads a directory containing 13 valid examples and one invalid example of `ES|QL` queries:

```typescript
const rawExampleQueries = await exampleQueriesLoader.load();

// Add additional metadata to the example queries that indicates they are required KB documents:
const requiredExampleQueries = addRequiredKbResourceMetadata({
  docs: rawExampleQueries,
  kbResource: ESQL_RESOURCE,
});
```

The `addRequiredKbResourceMetadata` function adds two fields to the `metadata` property of each document:

- `kbResource` - a `keyword` field that specifies the category of knowledge, e.g. `esql`
- `required` - a `boolean` field that, when `true`, indicates the document should be returned in all searches for the `kbResource`

The additional metadata fields are shown in the following abridged sample document:

```
{
  "_index": ".kibana-elastic-ai-assistant-kb",
  "_id": "e297e2d9-fb0e-4638-b4be-af31d1b31b9f",
  "_version": 1,
  "_seq_no": 129,
  "_primary_term": 1,
  "found": true,
  "_source": {
    "metadata": {
      "source": "/Users/andrew.goldstein/Projects/forks/andrew-goldstein/kibana/x-pack/plugins/elastic_assistant/server/knowledge_base/esql/example_queries/esql_example_query_0001.asciidoc",
      "required": true,
      "kbResource": "esql"
    },
    "vector": {
      "tokens": {
        "serial": 0.5612584,
        "syntax": 0.006727545,
        "user": 1.1184403,
        // ...additional tokens
      },
      "model_id": ".elser_model_2"
    },
    "text": """[[esql-example-queries]]

The following is an example ES|QL query:

\`\`\`
FROM logs-*
| WHERE NOT CIDR_MATCH(destination.ip, "10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")
| STATS destcount = COUNT(destination.ip) by user.name, host.name
| ENRICH ldap_lookup_new ON user.name
| WHERE group.name IS NOT NULL
| EVAL follow_up = CASE(
    destcount >= 100, "true",
    "false")
| SORT destcount desc
| KEEP destcount, host.name, user.name, group.name, follow_up
\`\`\`
"""
  }
}
```

### Hybrid search

The `ElasticsearchStore.similaritySearch` function is invoked by LangChain's `VectorStoreRetriever.getRelevantDocuments` function when the `RetrievalQAChain` searches for documents.
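To make the shape of a combined vector + terms request more concrete, here is a minimal, self-contained sketch. It is not the code from this PR: `requiredDocsFilters`, `buildMsearchBody`, `flattenHits`, and the parameter names are hypothetical, and the field names (`metadata.kbResource`, `metadata.required`, `vector.tokens`) are taken from the sample document above. The sketch assumes the vector search should exclude only documents that match both required-doc filters, so it wraps them in a nested `bool` inside `must_not`.

```typescript
// NOTE: every name in this sketch is hypothetical; it is not the PR's implementation.
type TermFilter = { term: Record<string, string | boolean> };

interface HybridSearchParams {
  index: string;
  kbResource: string;
  modelId: string;
  query: string;
  termsSize: number;
  vectorSize: number;
}

// Filters that match KB documents flagged as required for a resource.
const requiredDocsFilters = (kbResource: string): TermFilter[] => [
  { term: { 'metadata.kbResource': kbResource } },
  { term: { 'metadata.required': true } },
];

// An msearch body alternates header lines and search lines:
// header, vector search, header, terms search.
const buildMsearchBody = (
  p: HybridSearchParams
): { body: Array<Record<string, unknown>> } => {
  const required = requiredDocsFilters(p.kbResource);

  const vectorQuery = {
    bool: {
      // ELSER query against the sparse token field from the sample document
      must: [
        { text_expansion: { 'vector.tokens': { model_id: p.modelId, model_text: p.query } } },
      ],
      // Assumption: exclude only documents that match ALL required-doc filters,
      // so the same document is not returned by both searches.
      must_not: [{ bool: { filter: required } }],
    },
  };

  const termsQuery = { bool: { filter: required } };

  return {
    body: [
      { index: p.index },
      { query: vectorQuery, size: p.vectorSize },
      { index: p.index },
      { query: termsQuery, size: p.termsSize },
    ],
  };
};

// Minimal response shapes, just enough to show the flattening step.
interface Hit {
  _id: string;
}
interface MsearchResponseLike {
  responses: Array<{ hits?: { hits: Hit[] } }>;
}

// Merge the hits of every sub-search into one array, skipping failed responses.
const flattenHits = (r: MsearchResponseLike): Hit[] =>
  r.responses.reduce<Hit[]>((acc, res) => acc.concat(res.hits?.hits ?? []), []);
```

Keeping the two searches as separate entries of one multi-search body means the vector search size (LangChain's `k`) and the terms search size can be set independently, while still costing a single round trip to Elasticsearch.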
A single [msearch](https://www.elastic.co/guide/en/elasticsearch/reference/current/search-multi-search.html) request to Elasticsearch performs the hybrid search, combining the vector and terms searches:

```typescript
// requiredDocs is an array of filters that can be used in a `bool` Elasticsearch DSL query to filter in/out required KB documents:
const requiredDocs = getRequiredKbDocsTermsQueryDsl(this.kbResource);

// The `k` parameter is typically provided by LangChain's `VectorStoreRetriever._getRelevantDocuments`, which calls this function:
const vectorSearchQuerySize = k ?? FALLBACK_SIMILARITY_SEARCH_SIZE;

// build a vector search query:
const vectorSearchQuery = getVectorSearchQuery({
  filter,
  modelId: this.model,
  mustNotTerms: requiredDocs,
  query,
});

// build a (separate) terms search query:
const termsSearchQuery = getTermsSearchQuery(requiredDocs);

// combine the vector search query and the terms search queries into a single multi-search query:
const mSearchQueryBody = getMsearchQueryBody({
  index: this.index,
  termsSearchQuery,
  termsSearchQuerySize: TERMS_QUERY_SIZE,
  vectorSearchQuery,
  vectorSearchQuerySize,
});

try {
  // execute both queries via a single multi-search request:
  const result = await this.esClient.msearch<MsearchResponse>(mSearchQueryBody);

  // flatten the results of the combined queries into a single array of hits:
  const results: FlattenedHit[] = result.responses.flatMap((response) =>
    // ...
```

## Desk testing

1. Delete any previous instances of the Knowledge Base by executing the following query in Kibana's `Dev Tools`:

   ```
   DELETE .kibana-elastic-ai-assistant-kb
   ```

2. In the Security Solution, open the Elastic AI Assistant
3. In the assistant, click the `Settings` gear
4. Click the `Knowledge Base` icon to view the KB settings
5. Toggle the `Knowledge Base` setting `off` if it's already on
6. Toggle the `Knowledge Base` setting `on` to load the KB documents
7. Click the `Save` button to close settings
8. Enter the following prompt, then press Enter:

   ```
   Generate an ES|QL query that will count the number of connections made to external IP addresses, broken down by user. If the count is greater than 100 for a specific user, add a new field called "follow_up" that contains a value of "true", otherwise, it should contain "false". The user names should also be enriched with their respective group names.
   ```

**Expected result**

A response similar to the following is returned:

```
FROM logs-*
| WHERE NOT CIDR_MATCH(destination.ip, "10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")
| STATS destcount = COUNT(destination.ip) BY user.name
| ENRICH ldap_lookup ON user.name
| EVAL follow_up = CASE( destcount >= 100, "true", "false")
| SORT destcount DESC
| KEEP destcount, user.name, group.name, follow_up
```

### Reference: Annotated `verbose: true` output

The following output, annotated with `// comments`, was generated by setting `verbose: true` in the following code in `x-pack/plugins/elastic_assistant/server/lib/langchain/execute_custom_llm_chain/index.ts`:

```typescript
const executor = await initializeAgentExecutorWithOptions(tools, llm, {
  agentType: 'chat-conversational-react-description',
  memory,
  verbose: true, // <--
});
```

<details>
<summary>Annotated verbose output</summary>

```json
// The chain starts with just the input from the user: a system prompt, plus the user's input:
[chain/start] [1:chain:AgentExecutor] Entering Chain run with input: { "input": "You are a helpful, expert assistant who answers questions about Elastic Security. Do not answer questions unrelated to Elastic Security.\nIf you answer a question related to KQL, EQL, or ES|QL, it should be immediately usable within an Elastic Security timeline; please always format the output correctly with back ticks. Any answer provided for Query DSL should also be usable in a security timeline. 
This means you should only ever include the \"filter\" portion of the query.\nUse the following context to answer questions:\n\n\n\nGenerate an ES|QL query that will count the number of connections made to external IP addresses, broken down by user. If the count is greater than 100 for a specific user, add a new field called \"follow_up\" that contains a value of \"true\", otherwise, it should contain \"false\". The user names should also be enriched with their respective group names.", "chat_history": [] } // The input from the previous step is unchanged in this one: [chain/start] [1:chain:AgentExecutor > 2:chain:LLMChain] Entering Chain run with input: { "input": "You are a helpful, expert assistant who answers questions about Elastic Security. Do not answer questions unrelated to Elastic Security.\nIf you answer a question related to KQL, EQL, or ES|QL, it should be immediately usable within an Elastic Security timeline; please always format the output correctly with back ticks. Any answer provided for Query DSL should also be usable in a security timeline. This means you should only ever include the \"filter\" portion of the query.\nUse the following context to answer questions:\n\n\n\nGenerate an ES|QL query that will count the number of connections made to external IP addresses, broken down by user. If the count is greater than 100 for a specific user, add a new field called \"follow_up\" that contains a value of \"true\", otherwise, it should contain \"false\". 
The user names should also be enriched with their respective group names.", "chat_history": [], "agent_scratchpad": [], "stop": [ "Observation:" ] }
// The "prompts" array below contains content written by LangChain to inform the LLM about the available tools, including the ES|QL knowledge base, and "teach" it how to use them:
[llm/start] [1:chain:AgentExecutor > 2:chain:LLMChain > 3:llm:ActionsClientLlm] Entering LLM run with input: { "prompts": [ "[{\"lc\":1,\"type\":\"constructor\",\"id\":[\"langchain\",\"schema\",\"SystemMessage\"],\"kwargs\":{\"content\":\"Assistant is a large language model trained by OpenAI.\\n\\nAssistant is designed to be able to assist with a wide range of tasks, from answering simple questions to providing in-depth explanations and discussions on a wide range of topics. As a language model, Assistant is able to generate human-like text based on the input it receives, allowing it to engage in natural-sounding conversations and provide responses that are coherent and relevant to the topic at hand.\\n\\nAssistant is constantly learning and improving, and its capabilities are constantly evolving. It is able to process and understand large amounts of text, and can use this knowledge to provide accurate and informative responses to a wide range of questions. Additionally, Assistant is able to generate its own text based on the input it receives, allowing it to engage in discussions and provide explanations and descriptions on a wide range of topics.\\n\\nOverall, Assistant is a powerful system that can help with a wide range of tasks and provide valuable insights and information on a wide range of topics. Whether you need help with a specific question or just want to have a conversation about a particular topic, Assistant is here to assist. 
However, above all else, all responses must adhere to the format of RESPONSE FORMAT INSTRUCTIONS.\",\"additional_kwargs\":{}}},{\"lc\":1,\"type\":\"constructor\",\"id\":[\"langchain\",\"schema\",\"HumanMessage\"],\"kwargs\":{\"content\":\"TOOLS\\n------\\nAssistant can ask the user to use tools to look up information that may be helpful in answering the users original question. The tools the human can use are:\\n\\nesql-language-knowledge-base: Call this for knowledge on how to build an ESQL query, or answer questions about the ES|QL query language.\\n\\nRESPONSE FORMAT INSTRUCTIONS\\n----------------------------\\n\\nOutput a JSON markdown code snippet containing a valid JSON object in one of two formats:\\n\\n**Option 1:**\\nUse this if you want the human to use a tool.\\nMarkdown code snippet formatted in the following schema:\\n\\n```json\\n{\\n \\\"action\\\": string, // The action to take. Must be one of [esql-language-knowledge-base]\\n \\\"action_input\\\": string // The input to the action. May be a stringified object.\\n}\\n```\\n\\n**Option #2:**\\nUse this if you want to respond directly and conversationally to the human. Markdown code snippet formatted in the following schema:\\n\\n```json\\n{\\n \\\"action\\\": \\\"Final Answer\\\",\\n \\\"action_input\\\": string // You should put what you want to return to use here and make sure to use valid json newline characters.\\n}\\n```\\n\\nFor both options, remember to always include the surrounding markdown code snippet delimiters (begin with \\\"```json\\\" and end with \\\"```\\\")!\\n\\n\\nUSER'S INPUT\\n--------------------\\nHere is the user's input (remember to respond with a markdown code snippet of a json blob with a single action, and NOTHING else):\\n\\nYou are a helpful, expert assistant who answers questions about Elastic Security. 
Do not answer questions unrelated to Elastic Security.\\nIf you answer a question related to KQL, EQL, or ES|QL, it should be immediately usable within an Elastic Security timeline; please always format the output correctly with back ticks. Any answer provided for Query DSL should also be usable in a security timeline. This means you should only ever include the \\\"filter\\\" portion of the query.\\nUse the following context to answer questions:\\n\\n\\n\\nGenerate an ES|QL query that will count the number of connections made to external IP addresses, broken down by user. If the count is greater than 100 for a specific user, add a new field called \\\"follow_up\\\" that contains a value of \\\"true\\\", otherwise, it should contain \\\"false\\\". The user names should also be enriched with their respective group names.\",\"additional_kwargs\":{}}}]" ] } // The LLM then uses the prompt above, to generate a response (below), which is then passed to the Chain: [llm/end] [1:chain:AgentExecutor > 2:chain:LLMChain > 3:llm:ActionsClientLlm] [5.48s] Exiting LLM run with output: { "generations": [ [ { "text": "```json\n{\n \"action\": \"esql-language-knowledge-base\",\n \"action_input\": \"Generate an ES|QL query that will count the number of connections made to external IP addresses, broken down by user. If the count is greater than 100 for a specific user, add a new field called \\\"follow_up\\\" that contains a value of \\\"true\\\", otherwise, it should contain \\\"false\\\". The user names should also be enriched with their respective group names.\"\n}\n```" } ] ] } // It's worth noting that the LLM **ONLY** provided the actual question posed by the user. The LLM correctly omitted all the other instructions, including the system prompt, because the question asked by the user is the most relevant piece of information for the LLM to use to generate a response. 
[chain/end] [1:chain:AgentExecutor > 2:chain:LLMChain] [5.49s] Exiting Chain run with output: { "text": "```json\n{\n \"action\": \"esql-language-knowledge-base\",\n \"action_input\": \"Generate an ES|QL query that will count the number of connections made to external IP addresses, broken down by user. If the count is greater than 100 for a specific user, add a new field called \\\"follow_up\\\" that contains a value of \\\"true\\\", otherwise, it should contain \\\"false\\\". The user names should also be enriched with their respective group names.\"\n}\n```" } // In this step, the `AgentExecutor` takes the output from the previous step, and passes it to the `ChainTool`: [agent/action] [1:chain:AgentExecutor] Agent selected action: { "tool": "esql-language-knowledge-base", "toolInput": "Generate an ES|QL query that will count the number of connections made to external IP addresses, broken down by user. If the count is greater than 100 for a specific user, add a new field called \"follow_up\" that contains a value of \"true\", otherwise, it should contain \"false\". The user names should also be enriched with their respective group names.", "log": "```json\n{\n \"action\": \"esql-language-knowledge-base\",\n \"action_input\": \"Generate an ES|QL query that will count the number of connections made to external IP addresses, broken down by user. If the count is greater than 100 for a specific user, add a new field called \\\"follow_up\\\" that contains a value of \\\"true\\\", otherwise, it should contain \\\"false\\\". The user names should also be enriched with their respective group names.\"\n}\n```" } // The `ChainTool` then passes the input to the `RetrievalQAChain`: [tool/start] [1:chain:AgentExecutor > 4:tool:ChainTool] Entering Tool run with input: "Generate an ES|QL query that will count the number of connections made to external IP addresses, broken down by user. 
If the count is greater than 100 for a specific user, add a new field called "follow_up" that contains a value of "true", otherwise, it should contain "false". The user names should also be enriched with their respective group names."
// The `RetrievalQAChain` then passes the input to the `VectorStoreRetriever`:
[chain/start] [1:chain:AgentExecutor > 4:tool:ChainTool > 5:chain:RetrievalQAChain] Entering Chain run with input: { "query": "Generate an ES|QL query that will count the number of connections made to external IP addresses, broken down by user. If the count is greater than 100 for a specific user, add a new field called \"follow_up\" that contains a value of \"true\", otherwise, it should contain \"false\". The user names should also be enriched with their respective group names." }
// The `VectorStoreRetriever` then passes the input to the `ElasticsearchStore`, and calls the `similaritySearch` method, in this example with a `k` value of `4`, which means that the `ElasticsearchStore` will return the top 4 results:
[retriever/start] [1:chain:AgentExecutor > 4:tool:ChainTool > 5:chain:RetrievalQAChain > 6:retriever:VectorStoreRetriever] Entering Retriever run with input: { "query": "Generate an ES|QL query that will count the number of connections made to external IP addresses, broken down by user. If the count is greater than 100 for a specific user, add a new field called \"follow_up\" that contains a value of \"true\", otherwise, it should contain \"false\". The user names should also be enriched with their respective group names." }
// The `VectorStoreRetriever` returned 18 results. The first 4 results are from ELSER, because the LangChain `RetrievalQAChain` is configured to return 4 results. 
The other 14 results matched a terms query where "metadata.kbResource": "esql" AND "metadata.required": true: [retriever/end] [1:chain:AgentExecutor > 4:tool:ChainTool > 5:chain:RetrievalQAChain > 6:retriever:VectorStoreRetriever] [23ms] Exiting Retriever run with output: { "documents": [ { "pageContent": "[[esql]]\n= {esql}\n\n:esql-tests: {xes-repo-dir}/../../plugin/esql/qa\n:esql-specs: {esql-tests}/testFixtures/src/main/resources\n\n[partintro]\n--\n\npreview::[]\n\nThe {es} Query Language ({esql}) is a query language that enables the iterative\nexploration of data.\n\nAn {esql} query consists of a series of commands, separated by pipes. Each query\nstarts with a <<esql-source-commands,source command>>. A source command produces\na table, typically with data from {es}.\n\nimage::images/esql/source-command.svg[A source command producing a table from {es},align=\"center\"]\n\nA source command can be followed by one or more\n<<esql-processing-commands,processing commands>>. Processing commands change an\ninput table by adding, removing, or changing rows and columns.\n\nimage::images/esql/processing-command.svg[A processing command changing an input table,align=\"center\"]\n\nYou can chain processing commands, separated by a pipe character: `|`. 
Each\nprocessing command works on the output table of the previous command.\n\nimage::images/esql/chaining-processing-commands.svg[Processing commands can be chained,align=\"center\"]\n\nThe result of a query is the table produced by the final processing command.\n\n[discrete]\n[[esql-console]]\n=== Run an {esql} query\n\n[discrete]\n==== The {esql} API\n\nUse the `_query` endpoint to run an {esql} query:\n\n[source,console]\n----\nPOST /_query\n{\n \"query\": \"\"\"\n FROM library\n | EVAL year = DATE_TRUNC(1 YEARS, release_date)\n | STATS MAX(page_count) BY year\n | SORT year\n | LIMIT 5\n \"\"\"\n}\n----\n// TEST[setup:library]\n\nThe results come back in rows:\n\n[source,console-result]\n----\n{\n \"columns\": [\n { \"name\": \"MAX(page_count)\", \"type\": \"integer\"},\n { \"name\": \"year\" , \"type\": \"date\"}\n ],\n \"values\": [\n [268, \"1932-01-01T00:00:00.000Z\"],\n [224, \"1951-01-01T00:00:00.000Z\"],\n [227, \"1953-01-01T00:00:00.000Z\"],\n [335, \"1959-01-01T00:00:00.000Z\"],\n [604, \"1965-01-01T00:00:00.000Z\"]\n ]\n}\n----\n\nBy default, results are returned as JSON. To return results formatted as text,\nCSV, or TSV, use the `format` parameter:\n\n[source,console]\n----\nPOST /_query?format=txt\n{\n \"query\": \"\"\"\n FROM library\n | EVAL year = DATE_TRUNC(1 YEARS, release_date)\n | STATS MAX(page_count) BY year\n | SORT year\n | LIMIT 5\n \"\"\"\n}\n----\n// TEST[setup:library]\n\n[discrete]\n==== {kib}\n\n{esql} can be used in Discover to explore a data set, and in Lens to visualize it.\nFirst, enable the `enableTextBased` setting in *Advanced Settings*. 
Next, in\nDiscover or Lens, from the data view dropdown, select *{esql}*.\n\nNOTE: {esql} queries in Discover and Lens are subject to the time range selected\nwith the time filter.\n\n[discrete]\n[[esql-limitations]]\n=== Limitations\n\n{esql} currently supports the following <<mapping-types,field types>>:\n\n- `alias`\n- `boolean`\n- `date`\n- `double` (`float`, `half_float`, `scaled_float` are represented as `double`)\n- `ip`\n- `keyword` family including `keyword`, `constant_keyword`, and `wildcard`\n- `int` (`short` and `byte` are represented as `int`)\n- `long`\n- `null`\n- `text`\n- `unsigned_long`\n- `version`\n--\n\ninclude::esql-get-started.asciidoc[]\n\ninclude::esql-syntax.asciidoc[]\n\ninclude::esql-source-commands.asciidoc[]\n\ninclude::esql-processing-commands.asciidoc[]\n\ninclude::esql-functions.asciidoc[]\n\ninclude::aggregation-functions.asciidoc[]\n\ninclude::multivalued-fields.asciidoc[]\n\ninclude::task-management.asciidoc[]\n\n:esql-tests!:\n:esql-specs!:\n", "metadata": { "source": "/Users/andrew.goldstein/Projects/forks/andrew-goldstein/kibana/x-pack/plugins/elastic_assistant/server/knowledge_base/esql/docs/index.asciidoc" } }, { "pageContent": "[[esql-from]]\n=== `FROM`\n\nThe `FROM` source command returns a table with up to 10,000 documents from a\ndata stream, index, or alias. Each row in the resulting table represents a\ndocument. Each column corresponds to a field, and can be accessed by the name\nof that field.\n\n[source,esql]\n----\nFROM employees\n----\n\nYou can use <<api-date-math-index-names,date math>> to refer to indices, aliases\nand data streams. 
This can be useful for time series data, for example to access\ntoday's index:\n\n[source,esql]\n----\nFROM <logs-{now/d}>\n----\n\nUse comma-separated lists or wildcards to query multiple data streams, indices,\nor aliases:\n\n[source,esql]\n----\nFROM employees-00001,employees-*\n----\n", "metadata": { "source": "/Users/andrew.goldstein/Projects/forks/andrew-goldstein/kibana/x-pack/plugins/elastic_assistant/server/knowledge_base/esql/docs/source_commands/from.asciidoc" } }, { "pageContent": "[[esql-agg-count]]\n=== `COUNT`\nCounts field values.\n\n[source.merge.styled,esql]\n----\ninclude::{esql-specs}/stats.csv-spec[tag=count]\n----\n[%header.monospaced.styled,format=dsv,separator=|]\n|===\ninclude::{esql-specs}/stats.csv-spec[tag=count-result]\n|===\n\nCan take any field type as input and the result is always a `long` not matter\nthe input type.\n\nNOTE: There isn't yet a `COUNT(*)`. Please count a single valued field if you\n need a count of rows.\n", "metadata": { "source": "/Users/andrew.goldstein/Projects/forks/andrew-goldstein/kibana/x-pack/plugins/elastic_assistant/server/knowledge_base/esql/docs/aggregation_functions/count.asciidoc" } }, { "pageContent": "[[esql-agg-count-distinct]]\n=== `COUNT_DISTINCT`\nThe approximate number of distinct values.\n\n[source.merge.styled,esql]\n----\ninclude::{esql-specs}/stats_count_distinct.csv-spec[tag=count-distinct]\n----\n[%header.monospaced.styled,format=dsv,separator=|]\n|===\ninclude::{esql-specs}/stats_count_distinct.csv-spec[tag=count-distinct-result]\n|===\n\nCan take any field type as input and the result is always a `long` not matter\nthe input type.\n\n==== Counts are approximate\n\nComputing exact counts requires loading values into a set and returning its\nsize. 
This doesn't scale when working on high-cardinality sets and/or large\nvalues as the required memory usage and the need to communicate those\nper-shard sets between nodes would utilize too many resources of the cluster.\n\nThis `COUNT_DISTINCT` function is based on the\nhttps://static.googleusercontent.com/media/research.google.com/fr//pubs/archive/40671.pdf[HyperLogLog++]\nalgorithm, which counts based on the hashes of the values with some interesting\nproperties:\n\ninclude::../../aggregations/metrics/cardinality-aggregation.asciidoc[tag=explanation]\n\n==== Precision is configurable\n\nThe `COUNT_DISTINCT` function takes an optional second parameter to configure the\nprecision discussed previously.\n\n[source.merge.styled,esql]\n----\ninclude::{esql-specs}/stats_count_distinct.csv-spec[tag=count-distinct-precision]\n----\n[%header.monospaced.styled,format=dsv,separator=|]\n|===\ninclude::{esql-specs}/stats_count_distinct.csv-spec[tag=count-distinct-precision-result]\n|===\n", "metadata": { "source": "/Users/andrew.goldstein/Projects/forks/andrew-goldstein/kibana/x-pack/plugins/elastic_assistant/server/knowledge_base/esql/docs/aggregation_functions/count_distinct.asciidoc" } }, { "pageContent": "[[esql-example-queries]]\n\nThe following is an example ES|QL query:\n\n```\nFROM logs-*\n| WHERE NOT CIDR_MATCH(destination.ip, \"10.0.0.0/8\", \"172.16.0.0/12\", \"192.168.0.0/16\")\n| STATS destcount = COUNT(destination.ip) by user.name, host.name\n| ENRICH ldap_lookup_new ON user.name\n| WHERE group.name IS NOT NULL\n| EVAL follow_up = CASE(\n destcount >= 100, \"true\",\n \"false\")\n| SORT destcount desc\n| KEEP destcount, host.name, user.name, group.name, follow_up\n```\n", "metadata": { "source": "/Users/andrew.goldstein/Projects/forks/andrew-goldstein/kibana/x-pack/plugins/elastic_assistant/server/knowledge_base/esql/example_queries/esql_example_query_0001.asciidoc" } }, { "pageContent": "[[esql-example-queries]]\n\nThe following is an example ES|QL 
query:\n\n```\nfrom logs-*\n| grok dns.question.name \"%{DATA}\\\\.%{GREEDYDATA:dns.question.registered_domain:string}\"\n| stats unique_queries = count_distinct(dns.question.name) by dns.question.registered_domain, process.name\n| where unique_queries > 5\n| sort unique_queries desc\n```\n", "metadata": { "source": "/Users/andrew.goldstein/Projects/forks/andrew-goldstein/kibana/x-pack/plugins/elastic_assistant/server/knowledge_base/esql/example_queries/esql_example_query_0002.asciidoc" } }, { "pageContent": "[[esql-example-queries]]\n\nThe following is an example ES|QL query:\n\n```\nfrom logs-*\n| where event.code is not null\n| stats event_code_count = count(event.code) by event.code,host.name\n| enrich win_events on event.code with EVENT_DESCRIPTION\n| where EVENT_DESCRIPTION is not null and host.name is not null\n| rename EVENT_DESCRIPTION as event.description\n| sort event_code_count desc\n| keep event_code_count,event.code,host.name,event.description\n```\n", "metadata": { "source": "/Users/andrew.goldstein/Projects/forks/andrew-goldstein/kibana/x-pack/plugins/elastic_assistant/server/knowledge_base/esql/example_queries/esql_example_query_0003.asciidoc" } }, { "pageContent": "[[esql-example-queries]]\n\nThe following is an example ES|QL query:\n\n```\nfrom logs-*\n| where event.category == \"file\" and event.action == \"creation\"\n| stats filecount = count(file.name) by process.name,host.name\n| dissect process.name \"%{process}.%{extension}\"\n| eval proclength = length(process.name)\n| where proclength > 10\n| sort filecount,proclength desc\n| limit 10\n| keep host.name,process.name,filecount,process,extension,fullproc,proclength\n```\n", "metadata": { "source": "/Users/andrew.goldstein/Projects/forks/andrew-goldstein/kibana/x-pack/plugins/elastic_assistant/server/knowledge_base/esql/example_queries/esql_example_query_0004.asciidoc" } }, { "pageContent": "[[esql-example-queries]]\n\nThe following is an example ES|QL query:\n\n```\nfrom logs-*\n| where 
process.name == \"curl.exe\"\n| stats bytes = sum(destination.bytes) by destination.address\n| eval kb = bytes/1024\n| sort kb desc\n| limit 10\n| keep kb,destination.address\n```\n", "metadata": { "source": "/Users/andrew.goldstein/Projects/forks/andrew-goldstein/kibana/x-pack/plugins/elastic_assistant/server/knowledge_base/esql/example_queries/esql_example_query_0005.asciidoc" } }, { "pageContent": "[[esql-example-queries]]\n\nThe following is an example ES|QL query:\n\n```\nFROM metrics-apm*\n| WHERE metricset.name == \"transaction\" AND metricset.interval == \"1m\"\n| EVAL bucket = AUTO_BUCKET(transaction.duration.histogram, 50, <start-date>, <end-date>)\n| STATS avg_duration = AVG(transaction.duration.histogram) BY bucket\n```\n", "metadata": { "source": "/Users/andrew.goldstein/Projects/forks/andrew-goldstein/kibana/x-pack/plugins/elastic_assistant/server/knowledge_base/esql/example_queries/esql_example_query_0006.asciidoc" } }, { "pageContent": "[[esql-example-queries]]\n\nThe following is an example ES|QL query:\n\n```\nFROM packetbeat-*\n| STATS doc_count = COUNT(destination.domain) BY destination.domain\n| SORT doc_count DESC\n| LIMIT 10\n```\n", "metadata": { "source": "/Users/andrew.goldstein/Projects/forks/andrew-goldstein/kibana/x-pack/plugins/elastic_assistant/server/knowledge_base/esql/example_queries/esql_example_query_0007.asciidoc" } }, { "pageContent": "[[esql-example-queries]]\n\nThe following is an example ES|QL query:\n\n```\nFROM employees\n| EVAL hire_date_formatted = DATE_FORMAT(hire_date, \"MMMM yyyy\")\n| SORT hire_date\n| KEEP emp_no, hire_date_formatted\n| LIMIT 5\n```\n", "metadata": { "source": "/Users/andrew.goldstein/Projects/forks/andrew-goldstein/kibana/x-pack/plugins/elastic_assistant/server/knowledge_base/esql/example_queries/esql_example_query_0008.asciidoc" } }, { "pageContent": "[[esql-example-queries]]\n\nThe following is NOT an example of an ES|QL query:\n\n```\nPagination is not supported\n```\n", "metadata": { "source": 
"/Users/andrew.goldstein/Projects/forks/andrew-goldstein/kibana/x-pack/plugins/elastic_assistant/server/knowledge_base/esql/example_queries/esql_example_query_0009.asciidoc" } }, { "pageContent": "[[esql-example-queries]]\n\nThe following is an example ES|QL query:\n\n```\nFROM logs-*\n| WHERE @timestamp >= NOW() - 15 minutes\n| EVAL bucket = DATE_TRUNC(1 minute, @timestamp)\n| STATS avg_cpu = AVG(system.cpu.total.norm.pct) BY bucket, host.name\n| LIMIT 10\n```\n", "metadata": { "source": "/Users/andrew.goldstein/Projects/forks/andrew-goldstein/kibana/x-pack/plugins/elastic_assistant/server/knowledge_base/esql/example_queries/esql_example_query_0010.asciidoc" } }, { "pageContent": "[[esql-example-queries]]\n\nThe following is an example ES|QL query:\n\n```\nFROM traces-apm*\n| WHERE @timestamp >= NOW() - 24 hours\n| EVAL successful = CASE(event.outcome == \"success\", 1, 0),\n failed = CASE(event.outcome == \"failure\", 1, 0)\n| STATS success_rate = AVG(successful),\n avg_duration = AVG(transaction.duration),\n total_requests = COUNT(transaction.id) BY service.name\n```\n", "metadata": { "source": "/Users/andrew.goldstein/Projects/forks/andrew-goldstein/kibana/x-pack/plugins/elastic_assistant/server/knowledge_base/esql/example_queries/esql_example_query_0011.asciidoc" } }, { "pageContent": "[[esql-example-queries]]\n\nThe following is an example ES|QL query:\n\n```\nFROM metricbeat*\n| EVAL cpu_pct_normalized = (system.cpu.user.pct + system.cpu.system.pct) / system.cpu.cores\n| STATS AVG(cpu_pct_normalized) BY host.name\n```\n", "metadata": { "source": "/Users/andrew.goldstein/Projects/forks/andrew-goldstein/kibana/x-pack/plugins/elastic_assistant/server/knowledge_base/esql/example_queries/esql_example_query_0012.asciidoc" } }, { "pageContent": "[[esql-example-queries]]\n\nThe following is an example ES|QL query:\n\n```\nFROM postgres-logs\n| DISSECT message \"%{} duration: %{query_duration} ms\"\n| EVAL query_duration_num = TO_DOUBLE(query_duration)\n| STATS 
avg_duration = AVG(query_duration_num)\n```\n", "metadata": { "source": "/Users/andrew.goldstein/Projects/forks/andrew-goldstein/kibana/x-pack/plugins/elastic_assistant/server/knowledge_base/esql/example_queries/esql_example_query_0013.asciidoc" } }, { "pageContent": "[[esql-example-queries]]\n\nThe following is an example ES|QL query:\n\n```\nFROM nyc_taxis\n| WHERE DATE_EXTRACT(drop_off_time, \"hour\") >= 6 AND DATE_EXTRACT(drop_off_time, \"hour\") < 10\n| LIMIT 10\n```\n", "metadata": { "source": "/Users/andrew.goldstein/Projects/forks/andrew-goldstein/kibana/x-pack/plugins/elastic_assistant/server/knowledge_base/esql/example_queries/esql_example_query_0014.asciidoc" } } ] } // The search results are then transformed into documents: [chain/start] [1:chain:AgentExecutor > 4:tool:ChainTool > 5:chain:RetrievalQAChain > 7:chain:StuffDocumentsChain] Entering Chain run with input: { "question": "Generate an ES|QL query that will count the number of connections made to external IP addresses, broken down by user. If the count is greater than 100 for a specific user, add a new field called \"follow_up\" that contains a value of \"true\", otherwise, it should contain \"false\". The user names should also be enriched with their respective group names.", "input_documents": [ { "pageContent": "[[esql]]\n= {esql}\n\n:esql-tests: {xes-repo-dir}/../../plugin/esql/qa\n:esql-specs: {esql-tests}/testFixtures/src/main/resources\n\n[partintro]\n--\n\npreview::[]\n\nThe {es} Query Language ({esql}) is a query language that enables the iterative\nexploration of data.\n\nAn {esql} query consists of a series of commands, separated by pipes. Each query\nstarts with a <<esql-source-commands,source command>>. A source command produces\na table, typically with data from {es}.\n\nimage::images/esql/source-command.svg[A source command producing a table from {es},align=\"center\"]\n\nA source command can be followed by one or more\n<<esql-processing-commands,processing commands>>. 
Processing commands change an\ninput table by adding, removing, or changing rows and columns.\n\nimage::images/esql/processing-command.svg[A processing command changing an input table,align=\"center\"]\n\nYou can chain processing commands, separated by a pipe character: `|`. Each\nprocessing command works on the output table of the previous command.\n\nimage::images/esql/chaining-processing-commands.svg[Processing commands can be chained,align=\"center\"]\n\nThe result of a query is the table produced by the final processing command.\n\n[discrete]\n[[esql-console]]\n=== Run an {esql} query\n\n[discrete]\n==== The {esql} API\n\nUse the `_query` endpoint to run an {esql} query:\n\n[source,console]\n----\nPOST /_query\n{\n \"query\": \"\"\"\n FROM library\n | EVAL year = DATE_TRUNC(1 YEARS, release_date)\n | STATS MAX(page_count) BY year\n | SORT year\n | LIMIT 5\n \"\"\"\n}\n----\n// TEST[setup:library]\n\nThe results come back in rows:\n\n[source,console-result]\n----\n{\n \"columns\": [\n { \"name\": \"MAX(page_count)\", \"type\": \"integer\"},\n { \"name\": \"year\" , \"type\": \"date\"}\n ],\n \"values\": [\n [268, \"1932-01-01T00:00:00.000Z\"],\n [224, \"1951-01-01T00:00:00.000Z\"],\n [227, \"1953-01-01T00:00:00.000Z\"],\n [335, \"1959-01-01T00:00:00.000Z\"],\n [604, \"1965-01-01T00:00:00.000Z\"]\n ]\n}\n----\n\nBy default, results are returned as JSON. To return results formatted as text,\nCSV, or TSV, use the `format` parameter:\n\n[source,console]\n----\nPOST /_query?format=txt\n{\n \"query\": \"\"\"\n FROM library\n | EVAL year = DATE_TRUNC(1 YEARS, release_date)\n | STATS MAX(page_count) BY year\n | SORT year\n | LIMIT 5\n \"\"\"\n}\n----\n// TEST[setup:library]\n\n[discrete]\n==== {kib}\n\n{esql} can be used in Discover to explore a data set, and in Lens to visualize it.\nFirst, enable the `enableTextBased` setting in *Advanced Settings*. 
Next, in\nDiscover or Lens, from the data view dropdown, select *{esql}*.\n\nNOTE: {esql} queries in Discover and Lens are subject to the time range selected\nwith the time filter.\n\n[discrete]\n[[esql-limitations]]\n=== Limitations\n\n{esql} currently supports the following <<mapping-types,field types>>:\n\n- `alias`\n- `boolean`\n- `date`\n- `double` (`float`, `half_float`, `scaled_float` are represented as `double`)\n- `ip`\n- `keyword` family including `keyword`, `constant_keyword`, and `wildcard`\n- `int` (`short` and `byte` are represented as `int`)\n- `long`\n- `null`\n- `text`\n- `unsigned_long`\n- `version`\n--\n\ninclude::esql-get-started.asciidoc[]\n\ninclude::esql-syntax.asciidoc[]\n\ninclude::esql-source-commands.asciidoc[]\n\ninclude::esql-processing-commands.asciidoc[]\n\ninclude::esql-functions.asciidoc[]\n\ninclude::aggregation-functions.asciidoc[]\n\ninclude::multivalued-fields.asciidoc[]\n\ninclude::task-management.asciidoc[]\n\n:esql-tests!:\n:esql-specs!:\n", "metadata": { "source": "/Users/andrew.goldstein/Projects/forks/andrew-goldstein/kibana/x-pack/plugins/elastic_assistant/server/knowledge_base/esql/docs/index.asciidoc" } }, { "pageContent": "[[esql-from]]\n=== `FROM`\n\nThe `FROM` source command returns a table with up to 10,000 documents from a\ndata stream, index, or alias. Each row in the resulting table represents a\ndocument. Each column corresponds to a field, and can be accessed by the name\nof that field.\n\n[source,esql]\n----\nFROM employees\n----\n\nYou can use <<api-date-math-index-names,date math>> to refer to indices, aliases\nand data streams. 
This can be useful for time series data, for example to access\ntoday's index:\n\n[source,esql]\n----\nFROM <logs-{now/d}>\n----\n\nUse comma-separated lists or wildcards to query multiple data streams, indices,\nor aliases:\n\n[source,esql]\n----\nFROM employees-00001,employees-*\n----\n", "metadata": { "source": "/Users/andrew.goldstein/Projects/forks/andrew-goldstein/kibana/x-pack/plugins/elastic_assistant/server/knowledge_base/esql/docs/source_commands/from.asciidoc" } }, { "pageContent": "[[esql-agg-count]]\n=== `COUNT`\nCounts field values.\n\n[source.merge.styled,esql]\n----\ninclude::{esql-specs}/stats.csv-spec[tag=count]\n----\n[%header.monospaced.styled,format=dsv,separator=|]\n|===\ninclude::{esql-specs}/stats.csv-spec[tag=count-result]\n|===\n\nCan take any field type as input and the result is always a `long` not matter\nthe input type.\n\nNOTE: There isn't yet a `COUNT(*)`. Please count a single valued field if you\n need a count of rows.\n", "metadata": { "source": "/Users/andrew.goldstein/Projects/forks/andrew-goldstein/kibana/x-pack/plugins/elastic_assistant/server/knowledge_base/esql/docs/aggregation_functions/count.asciidoc" } }, { "pageContent": "[[esql-agg-count-distinct]]\n=== `COUNT_DISTINCT`\nThe approximate number of distinct values.\n\n[source.merge.styled,esql]\n----\ninclude::{esql-specs}/stats_count_distinct.csv-spec[tag=count-distinct]\n----\n[%header.monospaced.styled,format=dsv,separator=|]\n|===\ninclude::{esql-specs}/stats_count_distinct.csv-spec[tag=count-distinct-result]\n|===\n\nCan take any field type as input and the result is always a `long` not matter\nthe input type.\n\n==== Counts are approximate\n\nComputing exact counts requires loading values into a set and returning its\nsize. 
This doesn't scale when working on high-cardinality sets and/or large\nvalues as the required memory usage and the need to communicate those\nper-shard sets between nodes would utilize too many resources of the cluster.\n\nThis `COUNT_DISTINCT` function is based on the\nhttps://static.googleusercontent.com/media/research.google.com/fr//pubs/archive/40671.pdf[HyperLogLog++]\nalgorithm, which counts based on the hashes of the values with some interesting\nproperties:\n\ninclude::../../aggregations/metrics/cardinality-aggregation.asciidoc[tag=explanation]\n\n==== Precision is configurable\n\nThe `COUNT_DISTINCT` function takes an optional second parameter to configure the\nprecision discussed previously.\n\n[source.merge.styled,esql]\n----\ninclude::{esql-specs}/stats_count_distinct.csv-spec[tag=count-distinct-precision]\n----\n[%header.monospaced.styled,format=dsv,separator=|]\n|===\ninclude::{esql-specs}/stats_count_distinct.csv-spec[tag=count-distinct-precision-result]\n|===\n", "metadata": { "source": "/Users/andrew.goldstein/Projects/forks/andrew-goldstein/kibana/x-pack/plugins/elastic_assistant/server/knowledge_base/esql/docs/aggregation_functions/count_distinct.asciidoc" } }, { "pageContent": "[[esql-example-queries]]\n\nThe following is an example ES|QL query:\n\n```\nFROM logs-*\n| WHERE NOT CIDR_MATCH(destination.ip, \"10.0.0.0/8\", \"172.16.0.0/12\", \"192.168.0.0/16\")\n| STATS destcount = COUNT(destination.ip) by user.name, host.name\n| ENRICH ldap_lookup_new ON user.name\n| WHERE group.name IS NOT NULL\n| EVAL follow_up = CASE(\n destcount >= 100, \"true\",\n \"false\")\n| SORT destcount desc\n| KEEP destcount, host.name, user.name, group.name, follow_up\n```\n", "metadata": { "source": "/Users/andrew.goldstein/Projects/forks/andrew-goldstein/kibana/x-pack/plugins/elastic_assistant/server/knowledge_base/esql/example_queries/esql_example_query_0001.asciidoc" } }, { "pageContent": "[[esql-example-queries]]\n\nThe following is an example ES|QL 
query:\n\n```\nfrom logs-*\n| grok dns.question.name \"%{DATA}\\\\.%{GREEDYDATA:dns.question.registered_domain:string}\"\n| stats unique_queries = count_distinct(dns.question.name) by dns.question.registered_domain, process.name\n| where unique_queries > 5\n| sort unique_queries desc\n```\n", "metadata": { "source": "/Users/andrew.goldstein/Projects/forks/andrew-goldstein/kibana/x-pack/plugins/elastic_assistant/server/knowledge_base/esql/example_queries/esql_example_query_0002.asciidoc" } }, { "pageContent": "[[esql-example-queries]]\n\nThe following is an example ES|QL query:\n\n```\nfrom logs-*\n| where event.code is not null\n| stats event_code_count = count(event.code) by event.code,host.name\n| enrich win_events on event.code with EVENT_DESCRIPTION\n| where EVENT_DESCRIPTION is not null and host.name is not null\n| rename EVENT_DESCRIPTION as event.description\n| sort event_code_count desc\n| keep event_code_count,event.code,host.name,event.description\n```\n", "metadata": { "source": "/Users/andrew.goldstein/Projects/forks/andrew-goldstein/kibana/x-pack/plugins/elastic_assistant/server/knowledge_base/esql/example_queries/esql_example_query_0003.asciidoc" } }, { "pageContent": "[[esql-example-queries]]\n\nThe following is an example ES|QL query:\n\n```\nfrom logs-*\n| where event.category == \"file\" and event.action == \"creation\"\n| stats filecount = count(file.name) by process.name,host.name\n| dissect process.name \"%{process}.%{extension}\"\n| eval proclength = length(process.name)\n| where proclength > 10\n| sort filecount,proclength desc\n| limit 10\n| keep host.name,process.name,filecount,process,extension,fullproc,proclength\n```\n", "metadata": { "source": "/Users/andrew.goldstein/Projects/forks/andrew-goldstein/kibana/x-pack/plugins/elastic_assistant/server/knowledge_base/esql/example_queries/esql_example_query_0004.asciidoc" } }, { "pageContent": "[[esql-example-queries]]\n\nThe following is an example ES|QL query:\n\n```\nfrom logs-*\n| where 
process.name == \"curl.exe\"\n| stats bytes = sum(destination.bytes) by destination.address\n| eval kb = bytes/1024\n| sort kb desc\n| limit 10\n| keep kb,destination.address\n```\n", "metadata": { "source": "/Users/andrew.goldstein/Projects/forks/andrew-goldstein/kibana/x-pack/plugins/elastic_assistant/server/knowledge_base/esql/example_queries/esql_example_query_0005.asciidoc" } }, { "pageContent": "[[esql-example-queries]]\n\nThe following is an example ES|QL query:\n\n```\nFROM metrics-apm*\n| WHERE metricset.name == \"transaction\" AND metricset.interval == \"1m\"\n| EVAL bucket = AUTO_BUCKET(transaction.duration.histogram, 50, <start-date>, <end-date>)\n| STATS avg_duration = AVG(transaction.duration.histogram) BY bucket\n```\n", "metadata": { "source": "/Users/andrew.goldstein/Projects/forks/andrew-goldstein/kibana/x-pack/plugins/elastic_assistant/server/knowledge_base/esql/example_queries/esql_example_query_0006.asciidoc" } }, { "pageContent": "[[esql-example-queries]]\n\nThe following is an example ES|QL query:\n\n```\nFROM packetbeat-*\n| STATS doc_count = COUNT(destination.domain) BY destination.domain\n| SORT doc_count DESC\n| LIMIT 10\n```\n", "metadata": { "source": "/Users/andrew.goldstein/Projects/forks/andrew-goldstein/kibana/x-pack/plugins/elastic_assistant/server/knowledge_base/esql/example_queries/esql_example_query_0007.asciidoc" } }, { "pageContent": "[[esql-example-queries]]\n\nThe following is an example ES|QL query:\n\n```\nFROM employees\n| EVAL hire_date_formatted = DATE_FORMAT(hire_date, \"MMMM yyyy\")\n| SORT hire_date\n| KEEP emp_no, hire_date_formatted\n| LIMIT 5\n```\n", "metadata": { "source": "/Users/andrew.goldstein/Projects/forks/andrew-goldstein/kibana/x-pack/plugins/elastic_assistant/server/knowledge_base/esql/example_queries/esql_example_query_0008.asciidoc" } }, { "pageContent": "[[esql-example-queries]]\n\nThe following is NOT an example of an ES|QL query:\n\n```\nPagination is not supported\n```\n", "metadata": { "source": 
"/Users/andrew.goldstein/Projects/forks/andrew-goldstein/kibana/x-pack/plugins/elastic_assistant/server/knowledge_base/esql/example_queries/esql_example_query_0009.asciidoc" } }, { "pageContent": "[[esql-example-queries]]\n\nThe following is an example ES|QL query:\n\n```\nFROM logs-*\n| WHERE @timestamp >= NOW() - 15 minutes\n| EVAL bucket = DATE_TRUNC(1 minute, @timestamp)\n| STATS avg_cpu = AVG(system.cpu.total.norm.pct) BY bucket, host.name\n| LIMIT 10\n```\n", "metadata": { "source": "/Users/andrew.goldstein/Projects/forks/andrew-goldstein/kibana/x-pack/plugins/elastic_assistant/server/knowledge_base/esql/example_queries/esql_example_query_0010.asciidoc" } }, { "pageContent": "[[esql-example-queries]]\n\nThe following is an example ES|QL query:\n\n```\nFROM traces-apm*\n| WHERE @timestamp >= NOW() - 24 hours\n| EVAL successful = CASE(event.outcome == \"success\", 1, 0),\n failed = CASE(event.outcome == \"failure\", 1, 0)\n| STATS success_rate = AVG(successful),\n avg_duration = AVG(transaction.duration),\n total_requests = COUNT(transaction.id) BY service.name\n```\n", "metadata": { "source": "/Users/andrew.goldstein/Projects/forks/andrew-goldstein/kibana/x-pack/plugins/elastic_assistant/server/knowledge_base/esql/example_queries/esql_example_query_0011.asciidoc" } }, { "pageContent": "[[esql-example-queries]]\n\nThe following is an example ES|QL query:\n\n```\nFROM metricbeat*\n| EVAL cpu_pct_normalized = (system.cpu.user.pct + system.cpu.system.pct) / system.cpu.cores\n| STATS AVG(cpu_pct_normalized) BY host.name\n```\n", "metadata": { "source": "/Users/andrew.goldstein/Projects/forks/andrew-goldstein/kibana/x-pack/plugins/elastic_assistant/server/knowledge_base/esql/example_queries/esql_example_query_0012.asciidoc" } }, { "pageContent": "[[esql-example-queries]]\n\nThe following is an example ES|QL query:\n\n```\nFROM postgres-logs\n| DISSECT message \"%{} duration: %{query_duration} ms\"\n| EVAL query_duration_num = TO_DOUBLE(query_duration)\n| STATS 
avg_duration = AVG(query_duration_num)\n```\n", "metadata": { "source": "/Users/andrew.goldstein/Projects/forks/andrew-goldstein/kibana/x-pack/plugins/elastic_assistant/server/knowledge_base/esql/example_queries/esql_example_query_0013.asciidoc" } }, { "pageContent": "[[esql-example-queries]]\n\nThe following is an example ES|QL query:\n\n```\nFROM nyc_taxis\n| WHERE DATE_EXTRACT(drop_off_time, \"hour\") >= 6 AND DATE_EXTRACT(drop_off_time, \"hour\") < 10\n| LIMIT 10\n```\n", "metadata": { "source": "/Users/andrew.goldstein/Projects/forks/andrew-goldstein/kibana/x-pack/plugins/elastic_assistant/server/knowledge_base/esql/example_queries/esql_example_query_0014.asciidoc" } } ], "query": "Generate an ES|QL query that will count the number of connections made to external IP addresses, broken down by user. If the count is greater than 100 for a specific user, add a new field called \"follow_up\" that contains a value of \"true\", otherwise, it should contain \"false\". The user names should also be enriched with their respective group names." } // The `pageContent`, but not the `metadata`, is then passed back to the `LLMChain`: [chain/start] [1:chain:AgentExecutor > 4:tool:ChainTool > 5:chain:RetrievalQAChain > 7:chain:StuffDocumentsChain > 8:chain:LLMChain] Entering Chain run with input: { "question": "Generate an ES|QL query that will count the number of connections made to external IP addresses, broken down by user. If the count is greater than 100 for a specific user, add a new field called \"follow_up\" that contains a value of \"true\", otherwise, it should contain \"false\". The user names should also be enriched with their respective group names.", "query": "Generate an ES|QL query that will count the number of connections made to external IP addresses, broken down by user. If the count is greater than 100 for a specific user, add a new field called \"follow_up\" that contains a value of \"true\", otherwise, it should contain \"false\". 
The user names should also be enriched with their respective group names.", "context": "[[esql]]\n= {esql}\n\n:esql-tests: {xes-repo-dir}/../../plugin/esql/qa\n:esql-specs: {esql-tests}/testFixtures/src/main/resources\n\n[partintro]\n--\n\npreview::[]\n\nThe {es} Query Language ({esql}) is a query language that enables the iterative\nexploration of data.\n\nAn {esql} query consists of a series of commands, separated by pipes. Each query\nstarts with a <<esql-source-commands,source command>>. A source command produces\na table, typically with data from {es}.\n\nimage::images/esql/source-command.svg[A source command producing a table from {es},align=\"center\"]\n\nA source command can be followed by one or more\n<<esql-processing-commands,processing commands>>. Processing commands change an\ninput table by adding, removing, or changing rows and columns.\n\nimage::images/esql/processing-command.svg[A processing command changing an input table,align=\"center\"]\n\nYou can chain processing commands, separated by a pipe character: `|`. 
Each\nprocessing command works on the output table of the previous command.\n\nimage::images/esql/chaining-processing-commands.svg[Processing commands can be chained,align=\"center\"]\n\nThe result of a query is the table produced by the final processing command.\n\n[discrete]\n[[esql-console]]\n=== Run an {esql} query\n\n[discrete]\n==== The {esql} API\n\nUse the `_query` endpoint to run an {esql} query:\n\n[source,console]\n----\nPOST /_query\n{\n \"query\": \"\"\"\n FROM library\n | EVAL year = DATE_TRUNC(1 YEARS, release_date)\n | STATS MAX(page_count) BY year\n | SORT year\n | LIMIT 5\n \"\"\"\n}\n----\n// TEST[setup:library]\n\nThe results come back in rows:\n\n[source,console-result]\n----\n{\n \"columns\": [\n { \"name\": \"MAX(page_count)\", \"type\": \"integer\"},\n { \"name\": \"year\" , \"type\": \"date\"}\n ],\n \"values\": [\n [268, \"1932-01-01T00:00:00.000Z\"],\n [224, \"1951-01-01T00:00:00.000Z\"],\n [227, \"1953-01-01T00:00:00.000Z\"],\n [335, \"1959-01-01T00:00:00.000Z\"],\n [604, \"1965-01-01T00:00:00.000Z\"]\n ]\n}\n----\n\nBy default, results are returned as JSON. To return results formatted as text,\nCSV, or TSV, use the `format` parameter:\n\n[source,console]\n----\nPOST /_query?format=txt\n{\n \"query\": \"\"\"\n FROM library\n | EVAL year = DATE_TRUNC(1 YEARS, release_date)\n | STATS MAX(page_count) BY year\n | SORT year\n | LIMIT 5\n \"\"\"\n}\n----\n// TEST[setup:library]\n\n[discrete]\n==== {kib}\n\n{esql} can be used in Discover to explore a data set, and in Lens to visualize it.\nFirst, enable the `enableTextBased` setting in *Advanced Settings*. 
Next, in\nDiscover or Lens, from the data view dropdown, select *{esql}*.\n\nNOTE: {esql} queries in Discover and Lens are subject to the time range selected\nwith the time filter.\n\n[discrete]\n[[esql-limitations]]\n=== Limitations\n\n{esql} currently supports the following <<mapping-types,field types>>:\n\n- `alias`\n- `boolean`\n- `date`\n- `double` (`float`, `half_float`, `scaled_float` are represented as `double`)\n- `ip`\n- `keyword` family including `keyword`, `constant_keyword`, and `wildcard`\n- `int` (`short` and `byte` are represented as `int`)\n- `long`\n- `null`\n- `text`\n- `unsigned_long`\n- `version`\n--\n\ninclude::esql-get-started.asciidoc[]\n\ninclude::esql-syntax.asciidoc[]\n\ninclude::esql-source-commands.asciidoc[]\n\ninclude::esql-processing-commands.asciidoc[]\n\ninclude::esql-functions.asciidoc[]\n\ninclude::aggregation-functions.asciidoc[]\n\ninclude::multivalued-fields.asciidoc[]\n\ninclude::task-management.asciidoc[]\n\n:esql-tests!:\n:esql-specs!:\n\n\n[[esql-from]]\n=== `FROM`\n\nThe `FROM` source command returns a table with up to 10,000 documents from a\ndata stream, index, or alias. Each row in the resulting table represents a\ndocument. Each column corresponds to a field, and can be accessed by the name\nof that field.\n\n[source,esql]\n----\nFROM employees\n----\n\nYou can use <<api-date-math-index-names,date math>> to refer to indices, aliases\nand data streams. 
This can be useful for time series data, for example to access\ntoday's index:\n\n[source,esql]\n----\nFROM <logs-{now/d}>\n----\n\nUse comma-separated lists or wildcards to query multiple data streams, indices,\nor aliases:\n\n[source,esql]\n----\nFROM employees-00001,employees-*\n----\n\n\n[[esql-agg-count]]\n=== `COUNT`\nCounts field values.\n\n[source.merge.styled,esql]\n----\ninclude::{esql-specs}/stats.csv-spec[tag=count]\n----\n[%header.monospaced.styled,format=dsv,separator=|]\n|===\ninclude::{esql-specs}/stats.csv-spec[tag=count-result]\n|===\n\nCan take any field type as input and the result is always a `long` not matter\nthe input type.\n\nNOTE: There isn't yet a `COUNT(*)`. Please count a single valued field if you\n need a count of rows.\n\n\n[[esql-agg-count-distinct]]\n=== `COUNT_DISTINCT`\nThe approximate number of distinct values.\n\n[source.merge.styled,esql]\n----\ninclude::{esql-specs}/stats_count_distinct.csv-spec[tag=count-distinct]\n----\n[%header.monospaced.styled,format=dsv,separator=|]\n|===\ninclude::{esql-specs}/stats_count_distinct.csv-spec[tag=count-distinct-result]\n|===\n\nCan take any field type as input and the result is always a `long` not matter\nthe input type.\n\n==== Counts are approximate\n\nComputing exact counts requires loading values into a set and returning its\nsize. 
This doesn't scale when working on high-cardinality sets and/or large\nvalues as the required memory usage and the need to communicate those\nper-shard sets between nodes would utilize too many resources of the cluster.\n\nThis `COUNT_DISTINCT` function is based on the\nhttps://static.googleusercontent.com/media/research.google.com/fr//pubs/archive/40671.pdf[HyperLogLog++]\nalgorithm, which counts based on the hashes of the values with some interesting\nproperties:\n\ninclude::../../aggregations/metrics/cardinality-aggregation.asciidoc[tag=explanation]\n\n==== Precision is configurable\n\nThe `COUNT_DISTINCT` function takes an optional second parameter to configure the\nprecision discussed previously.\n\n[source.merge.styled,esql]\n----\ninclude::{esql-specs}/stats_count_distinct.csv-spec[tag=count-distinct-precision]\n----\n[%header.monospaced.styled,format=dsv,separator=|]\n|===\ninclude::{esql-specs}/stats_count_distinct.csv-spec[tag=count-distinct-precision-result]\n|===\n\n\n[[esql-example-queries]]\n\nThe following is an example ES|QL query:\n\n```\nFROM logs-*\n| WHERE NOT CIDR_MATCH(destination.ip, \"10.0.0.0/8\", \"172.16.0.0/12\", \"192.168.0.0/16\")\n| STATS destcount = COUNT(destination.ip) by user.name, host.name\n| ENRICH ldap_lookup_new ON user.name\n| WHERE group.name IS NOT NULL\n| EVAL follow_up = CASE(\n destcount >= 100, \"true\",\n \"false\")\n| SORT destcount desc\n| KEEP destcount, host.name, user.name, group.name, follow_up\n```\n\n\n[[esql-example-queries]]\n\nThe following is an example ES|QL query:\n\n```\nfrom logs-*\n| grok dns.question.name \"%{DATA}\\\\.%{GREEDYDATA:dns.question.registered_domain:string}\"\n| stats unique_queries = count_distinct(dns.question.name) by dns.question.registered_domain, process.name\n| where unique_queries > 5\n| sort unique_queries desc\n```\n\n\n[[esql-example-queries]]\n\nThe following is an example ES|QL query:\n\n```\nfrom logs-*\n| where event.code is not null\n| stats event_code_count = 
count(event.code) by event.code,host.name\n| enrich win_events on event.code with EVENT_DESCRIPTION\n| where EVENT_DESCRIPTION is not null and host.name is not null\n| rename EVENT_DESCRIPTION as event.description\n| sort event_code_count desc\n| keep event_code_count,event.code,host.name,event.description\n```\n\n\n[[esql-example-queries]]\n\nThe following is an example ES|QL query:\n\n```\nfrom logs-*\n| where event.category == \"file\" and event.action == \"creation\"\n| stats filecount = count(file.name) by process.name,host.name\n| dissect process.name \"%{process}.%{extension}\"\n| eval proclength = length(process.name)\n| where proclength > 10\n| sort filecount,proclength desc\n| limit 10\n| keep host.name,process.name,filecount,process,extension,fullproc,proclength\n```\n\n\n[[esql-example-queries]]\n\nThe following is an example ES|QL query:\n\n```\nfrom logs-*\n| where process.name == \"curl.exe\"\n| stats bytes = sum(destination.bytes) by destination.address\n| eval kb = bytes/1024\n| sort kb desc\n| limit 10\n| keep kb,destination.address\n```\n\n\n[[esql-example-queries]]\n\nThe following is an example ES|QL query:\n\n```\nFROM metrics-apm*\n| WHERE metricset.name == \"transaction\" AND metricset.interval == \"1m\"\n| EVAL bucket = AUTO_BUCKET(transaction.duration.histogram, 50, <start-date>, <end-date>)\n| STATS avg_duration = AVG(transaction.duration.histogram) BY bucket\n```\n\n\n[[esql-example-queries]]\n\nThe following is an example ES|QL query:\n\n```\nFROM packetbeat-*\n| STATS doc_count = COUNT(destination.domain) BY destination.domain\n| SORT doc_count DESC\n| LIMIT 10\n```\n\n\n[[esql-example-queries]]\n\nThe following is an example ES|QL query:\n\n```\nFROM employees\n| EVAL hire_date_formatted = DATE_FORMAT(hire_date, \"MMMM yyyy\")\n| SORT hire_date\n| KEEP emp_no, hire_date_formatted\n| LIMIT 5\n```\n\n\n[[esql-example-queries]]\n\nThe following is NOT an example of an ES|QL query:\n\n```\nPagination is not 
supported\n```\n\n\n[[esql-example-queries]]\n\nThe following is an example ES|QL query:\n\n```\nFROM logs-*\n| WHERE @timestamp >= NOW() - 15 minutes\n| EVAL bucket = DATE_TRUNC(1 minute, @timestamp)\n| STATS avg_cpu = AVG(system.cpu.total.norm.pct) BY bucket, host.name\n| LIMIT 10\n```\n\n\n[[esql-example-queries]]\n\nThe following is an example ES|QL query:\n\n```\nFROM traces-apm*\n| WHERE @timestamp >= NOW() - 24 hours\n| EVAL successful = CASE(event.outcome == \"success\", 1, 0),\n failed = CASE(event.outcome == \"failure\", 1, 0)\n| STATS success_rate = AVG(successful),\n avg_duration = AVG(transaction.duration),\n total_requests = COUNT(transaction.id) BY service.name\n```\n\n\n[[esql-example-queries]]\n\nThe following is an example ES|QL query:\n\n```\nFROM metricbeat*\n| EVAL cpu_pct_normalized = (system.cpu.user.pct + system.cpu.system.pct) / system.cpu.cores\n| STATS AVG(cpu_pct_normalized) BY host.name\n```\n\n\n[[esql-example-queries]]\n\nThe following is an example ES|QL query:\n\n```\nFROM postgres-logs\n| DISSECT message \"%{} duration: %{query_duration} ms\"\n| EVAL query_duration_num = TO_DOUBLE(query_duration)\n| STATS avg_duration = AVG(query_duration_num)\n```\n\n\n[[esql-example-queries]]\n\nThe following is an example ES|QL query:\n\n```\nFROM nyc_taxis\n| WHERE DATE_EXTRACT(drop_off_time, \"hour\") >= 6 AND DATE_EXTRACT(drop_off_time, \"hour\") < 10\n| LIMIT 10\n```\n" } // The `LLMChain` then generates a new prompt based on the `pageContent` and passes it to the `ActionsClientLlm`, so the LLM can produce the final answer: [llm/start] [1:chain:AgentExecutor > 4:tool:ChainTool > 5:chain:RetrievalQAChain > 7:chain:StuffDocumentsChain > 8:chain:LLMChain > 9:llm:ActionsClientLlm] Entering LLM run with input: { "prompts": [ "[{\"lc\":1,\"type\":\"constructor\",\"id\":[\"langchain\",\"schema\",\"SystemMessage\"],\"kwargs\":{\"content\":\"Use the following pieces of context to answer the users question. 
\\nIf you don't know the answer, just say that you don't know, don't try to make up an answer.\\n----------------\\n[[esql]]\\n= {esql}\\n\\n:esql-tests: {xes-repo-dir}/../../plugin/esql/qa\\n:esql-specs: {esql-tests}/testFixtures/src/main/resources\\n\\n[partintro]\\n--\\n\\npreview::[]\\n\\nThe {es} Query Language ({esql}) is a query language that enables the iterative\\nexploration of data.\\n\\nAn {esql} query consists of a series of commands, separated by pipes. Each query\\nstarts with a <<esql-source-commands,source command>>. A source command produces\\na table, typically with data from {es}.\\n\\nimage::images/esql/source-command.svg[A source command producing a table from {es},align=\\\"center\\\"]\\n\\nA source command can be followed by one or more\\n<<esql-processing-commands,processing commands>>. Processing commands change an\\ninput table by adding, removing, or changing rows and columns.\\n\\nimage::images/esql/processing-command.svg[A processing command changing an input table,align=\\\"center\\\"]\\n\\nYou can chain processing commands, separated by a pipe character: `|`. 
Each\\nprocessing command works on the output table of the previous command.\\n\\nimage::images/esql/chaining-processing-commands.svg[Processing commands can be chained,align=\\\"center\\\"]\\n\\nThe result of a query is the table produced by the final processing command.\\n\\n[discrete]\\n[[esql-console]]\\n=== Run an {esql} query\\n\\n[discrete]\\n==== The {esql} API\\n\\nUse the `_query` endpoint to run an {esql} query:\\n\\n[source,console]\\n----\\nPOST /_query\\n{\\n \\\"query\\\": \\\"\\\"\\\"\\n FROM library\\n | EVAL year = DATE_TRUNC(1 YEARS, release_date)\\n | STATS MAX(page_count) BY year\\n | SORT year\\n | LIMIT 5\\n \\\"\\\"\\\"\\n}\\n----\\n// TEST[setup:library]\\n\\nThe results come back in rows:\\n\\n[source,console-result]\\n----\\n{\\n \\\"columns\\\": [\\n { \\\"name\\\": \\\"MAX(page_count)\\\", \\\"type\\\": \\\"integer\\\"},\\n { \\\"name\\\": \\\"year\\\" , \\\"type\\\": \\\"date\\\"}\\n ],\\n \\\"values\\\": [\\n [268, \\\"1932-01-01T00:00:00.000Z\\\"],\\n [224, \\\"1951-01-01T00:00:00.000Z\\\"],\\n [227, \\\"1953-01-01T00:00:00.000Z\\\"],\\n [335, \\\"1959-01-01T00:00:00.000Z\\\"],\\n [604, \\\"1965-01-01T00:00:00.000Z\\\"]\\n ]\\n}\\n----\\n\\nBy default, results are returned as JSON. To return results formatted as text,\\nCSV, or TSV, use the `format` parameter:\\n\\n[source,console]\\n----\\nPOST /_query?format=txt\\n{\\n \\\"query\\\": \\\"\\\"\\\"\\n FROM library\\n | EVAL year = DATE_TRUNC(1 YEARS, release_date)\\n | STATS MAX(page_count) BY year\\n | SORT year\\n | LIMIT 5\\n \\\"\\\"\\\"\\n}\\n----\\n// TEST[setup:library]\\n\\n[discrete]\\n==== {kib}\\n\\n{esql} can be used in Discover to explore a data set, and in Lens to visualize it.\\nFirst, enable the `enableTextBased` setting in *Advanced Settings*. 
Next, in\\nDiscover or Lens, from the data view dropdown, select *{esql}*.\\n\\nNOTE: {esql} queries in Discover and Lens are subject to the time range selected\\nwith the time filter.\\n\\n[discrete]\\n[[esql-limitations]]\\n=== Limitations\\n\\n{esql} currently supports the following <<mapping-types,field types>>:\\n\\n- `alias`\\n- `boolean`\\n- `date`\\n- `double` (`float`, `half_f…
Pinging @elastic/security-solution (Team: SecuritySolution)
Documentation preview:
Reference: Annotated
@elasticmachine merge upstream
@elasticmachine merge upstream
From pair review -- just noting that these don't necessarily need to be asciidocs, so they can be any format that works best from an embedding standpoint.
  // Add additional metadata to the example queries that indicates they are required KB documents:
  const requiredExampleQueries = addRequiredKbResourceMetadata({
    docs: rawExampleQueries,
    kbResource: ESQL_RESOURCE,
  });
Side thought: perhaps another method would be to hoist these to a specific agent system message instead of retrieving each time? Not sure here, more discovery needed.... 🙂
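For anyone reading along: the PR description says `addRequiredKbResourceMetadata` adds `kbResource` and `required` to each document's `metadata`. Below is a minimal sketch of that idea, assuming a simplified document shape. `KbDocumentSketch` and every `...Sketch` name are hypothetical and are not the PR's actual code:

```typescript
// Simplified stand-in for a LangChain document
// (assumption: only these two properties matter for this illustration).
interface KbDocumentSketch {
  pageContent: string;
  metadata: Record<string, unknown>;
}

// Hypothetical re-implementation for illustration; the real helper lives in the PR.
// Returns new documents and leaves the input array untouched.
const addRequiredKbResourceMetadataSketch = ({
  docs,
  kbResource,
}: {
  docs: KbDocumentSketch[];
  kbResource: string;
}): KbDocumentSketch[] =>
  docs.map((doc) => ({
    ...doc,
    metadata: {
      ...doc.metadata,
      kbResource, // category of knowledge, e.g. `esql`
      required: true, // marks the document as required for that kbResource
    },
  }));

// Made-up example input (not one of the PR's example queries).
const rawExampleQueriesSketch: KbDocumentSketch[] = [
  {
    pageContent: 'FROM logs-* | LIMIT 10',
    metadata: { source: 'esql_example_query_0001.asciidoc' },
  },
];

const requiredExampleQueriesSketch = addRequiredKbResourceMetadataSketch({
  docs: rawExampleQueriesSketch,
  kbResource: 'esql',
});

console.log(JSON.stringify(requiredExampleQueriesSketch.map((doc) => doc.metadata)));
```

Copying the metadata rather than mutating it keeps the raw loader output reusable, which may matter if the same documents were ever loaded for more than one resource.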
    index: string,
    logger: Logger,
    model?: string,
    kbResource?: string | undefined
Discussed scoping `kbResource` to `esStore` instances during pair review. The abstraction here makes sense, and we will instantiate additional scoped stores as needed, but this could change in the future.
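To make the scoping idea concrete, here is a rough sketch of a store that is optionally scoped to one `kbResource`. The class, the helper, and the `metadata.kbResource` / `metadata.required` field paths are assumptions inferred from the sample document in the PR description, not the actual `ElasticsearchStore`:

```typescript
type TermClauseSketch = { term: Record<string, string | boolean> };

// Hypothetical helper: an unscoped store has no required documents to look up.
const getRequiredKbDocsTermsSketch = (kbResource?: string): TermClauseSketch[] =>
  kbResource == null
    ? []
    : [
        { term: { 'metadata.kbResource': kbResource } },
        { term: { 'metadata.required': true } },
      ];

// Hypothetical store: one instance per (index, kbResource) pair.
class ScopedStoreSketch {
  constructor(
    readonly index: string,
    readonly kbResource?: string
  ) {}

  // Term clauses that identify the required KB documents for this store's scope.
  requiredDocTerms(): TermClauseSketch[] {
    return getRequiredKbDocsTermsSketch(this.kbResource);
  }
}

const esqlStoreSketch = new ScopedStoreSketch('.kibana-elastic-ai-assistant-kb', 'esql');
const unscopedStoreSketch = new ScopedStoreSketch('.kibana-elastic-ai-assistant-kb');

console.log(esqlStoreSketch.requiredDocTerms().length, unscopedStoreSketch.requiredDocTerms().length);
```

With this shape, adding a second topic would mean constructing another instance rather than threading a resource argument through every search call.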
   * Fun facts:
   * - This function is called by LangChain's `VectorStoreRetriever._getRelevantDocuments`
   * - The `k` parameter is typically determined by LangChain's `VectorStoreRetriever._getRelevantDocuments`, and has been observed to default to `4` in the wild (see langchain/dist/vectorstores/base.ts)
🙏 Love FunFacts™, thank you! 🙂
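A small aside on the `k ?? FALLBACK_SIMILARITY_SEARCH_SIZE` pattern from the PR description: the nullish coalescing operator only falls back when `k` is `null` or `undefined`, so an explicit `0` is passed through unchanged. The sketch below uses a made-up fallback value, since the real constant's value is not shown in this PR conversation:

```typescript
// Hypothetical value for illustration only; the PR's actual constant is not shown here.
const FALLBACK_SIMILARITY_SEARCH_SIZE_SKETCH = 10;

// Resolves the size of the vector search, preferring the caller-provided `k`.
const resolveVectorSearchSizeSketch = (k?: number | null): number =>
  k ?? FALLBACK_SIMILARITY_SEARCH_SIZE_SKETCH;

console.log(
  resolveVectorSearchSizeSketch(4),
  resolveVectorSearchSizeSketch(undefined),
  resolveVectorSearchSizeSketch(0)
);
```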
            model_id: modelId,
            model_text: query,
          },
        } as unknown as QueryDslTextExpansionQuery,
Would be nice if we didn't have to do this, but I haven't found a way around this yet either 😔
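For context, a rough sketch of the kind of query this snippet belongs to: a `bool` query whose `must` clause is a `text_expansion` query against the ELSER tokens, and whose `must_not` clause keeps the required KB documents out of the vector results. The types are local stand-ins (so no cast is needed here), the `vector.tokens` field path is taken from the sample document in the PR description, and wrapping the required-document terms in a nested `bool.filter` is this sketch's own choice, not necessarily what the PR does:

```typescript
type DslClauseSketch = Record<string, unknown>;

interface VectorSearchQuerySketch {
  bool: {
    must: DslClauseSketch[];
    must_not: DslClauseSketch[];
    filter?: DslClauseSketch[];
  };
}

// Hypothetical builder, for illustration only.
const getVectorSearchQuerySketch = ({
  filter,
  modelId,
  requiredDocTerms,
  query,
}: {
  filter?: DslClauseSketch;
  modelId: string;
  requiredDocTerms: DslClauseSketch[];
  query: string;
}): VectorSearchQuerySketch => ({
  bool: {
    must: [
      {
        text_expansion: {
          'vector.tokens': { model_id: modelId, model_text: query },
        },
      },
    ],
    // Exclude only documents that match ALL required-document terms.
    // An empty nested filter would match every document, so skip it in that case.
    must_not: requiredDocTerms.length > 0 ? [{ bool: { filter: requiredDocTerms } }] : [],
    ...(filter != null ? { filter: [filter] } : {}),
  },
});

const vectorSearchQuerySketch = getVectorSearchQuerySketch({
  modelId: '.elser_model_2',
  requiredDocTerms: [
    { term: { 'metadata.kbResource': 'esql' } },
    { term: { 'metadata.required': true } },
  ],
  query: 'count connections to external IP addresses by user',
});

console.log(JSON.stringify(vectorSearchQuerySketch, null, 2));
```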
@@ -58,6 +59,7 @@ export const postActionsConnectorExecuteRoute = (
        logger,
        request,
        elserId,
        kbResource: ESQL_RESOURCE,
Note: if we were to fetch `kbResource` via `getKbResource(request)`, it would be `undefined` here, as the other routes use URL params, not query params. Whoever needs to grab it from the client here will find that out as well, though 🙂
Checked out, tested locally, and pair reviewed code -- thank you @andrew-goldstein for introducing our first hybrid search implementation and improving the quality of our ESQL query generation. LGTM! 👍
@elasticmachine merge upstream
💛 Build succeeded, but was flaky
…arch for improved ES|QL query generation (elastic#168995)

## [Security Solution] [Elastic AI Assistant] Hybrid (vector + terms) search for improved ES|QL query generation

This PR implements hybrid (vector + terms) search to improve the quality of `ES|QL` queries generated by the Elastic AI Assistant.

The hybrid search combines (from a single request to Elasticsearch):

- Vector search results from ELSER that vary depending on the query specified by the user
- Terms search results that return a set of Knowledge Base (KB) documents marked as "required" for a topic

The hybrid search results, when provided as context to an LLM, improve the quality of generated `ES|QL` queries by combining `ES|QL` parser grammar and documentation specific to the question asked by a user with additional examples of valid `ES|QL` queries that aren't specific to the query.

## Details

### Indexing additional `metadata`

The `loadESQL` function in `x-pack/plugins/elastic_assistant/server/lib/langchain/content_loaders/esql_loader.ts` loads a directory containing 13 valid, and one invalid example of `ES|QL` queries:

```typescript
  const rawExampleQueries = await exampleQueriesLoader.load();

  // Add additional metadata to the example queries that indicates they are required KB documents:
  const requiredExampleQueries = addRequiredKbResourceMetadata({
    docs: rawExampleQueries,
    kbResource: ESQL_RESOURCE,
  });
```

The `addRequiredKbResourceMetadata` function adds two additional fields to the `metadata` property of the document:

- `kbResource` - a `keyword` field that specifies the category of knowledge, e.g. `esql`
- `required` - a `boolean` field that when `true`, indicates the document should be returned in all searches for the `kbResource`

The additional metadata fields are shown below in the following abridged sample document:

```
{
  "_index": ".kibana-elastic-ai-assistant-kb",
  "_id": "e297e2d9-fb0e-4638-b4be-af31d1b31b9f",
  "_version": 1,
  "_seq_no": 129,
  "_primary_term": 1,
  "found": true,
  "_source": {
    "metadata": {
      "source": "/Users/andrew.goldstein/Projects/forks/andrew-goldstein/kibana/x-pack/plugins/elastic_assistant/server/knowledge_base/esql/example_queries/esql_example_query_0001.asciidoc",
      "required": true,
      "kbResource": "esql"
    },
    "vector": {
      "tokens": {
        "serial": 0.5612584,
        "syntax": 0.006727545,
        "user": 1.1184403,
        // ...additional tokens
      },
      "model_id": ".elser_model_2"
    },
    "text": """[[esql-example-queries]]

The following is an example ES|QL query:

\`\`\`
FROM logs-*
| WHERE NOT CIDR_MATCH(destination.ip, "10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")
| STATS destcount = COUNT(destination.ip) by user.name, host.name
| ENRICH ldap_lookup_new ON user.name
| WHERE group.name IS NOT NULL
| EVAL follow_up = CASE(
    destcount >= 100, "true",
    "false")
| SORT destcount desc
| KEEP destcount, host.name, user.name, group.name, follow_up
\`\`\`
"""
  }
}
```

### Hybrid search

The `ElasticsearchStore.similaritySearch` function is invoked by LangChain's `VectorStoreRetriever.getRelevantDocuments` function when the `RetrievalQAChain` searches for documents.

A single request to Elasticsearch performs a hybrid search that combines the vector and terms searches into a single request with an [msearch](https://www.elastic.co/guide/en/elasticsearch/reference/current/search-multi-search.html):

```typescript
    // requiredDocs is an array of filters that can be used in a `bool` Elasticsearch DSL query to filter in/out required KB documents:
    const requiredDocs = getRequiredKbDocsTermsQueryDsl(this.kbResource);

    // The `k` parameter is typically provided by LangChain's `VectorStoreRetriever._getRelevantDocuments`, which calls this function:
    const vectorSearchQuerySize = k ?? FALLBACK_SIMILARITY_SEARCH_SIZE;

    // build a vector search query:
    const vectorSearchQuery = getVectorSearchQuery({
      filter,
      modelId: this.model,
      mustNotTerms: requiredDocs,
      query,
    });

    // build a (separate) terms search query:
    const termsSearchQuery = getTermsSearchQuery(requiredDocs);

    // combine the vector search query and the terms search queries into a single multi-search query:
    const mSearchQueryBody = getMsearchQueryBody({
      index: this.index,
      termsSearchQuery,
      termsSearchQuerySize: TERMS_QUERY_SIZE,
      vectorSearchQuery,
      vectorSearchQuerySize,
    });

    try {
      // execute both queries via a single multi-search request:
      const result = await this.esClient.msearch<MsearchResponse>(mSearchQueryBody);

      // flatten the results of the combined queries into a single array of hits:
      const results: FlattenedHit[] = result.responses.flatMap((response) =>
        // ...
```

## Desk testing

1. Delete any previous instances of the Knowledge Base by executing the following query in Kibana's `Dev Tools`:

```
DELETE .kibana-elastic-ai-assistant-kb
```

2. In the Security Solution, open the Elastic AI Assistant

3. In the assistant, click the `Settings` gear

4. Click the `Knowledge Base` icon to view the KB settings

5. Toggle the `Knowledge Base` setting `off` if it's already on

6. Toggle the `Knowledge Base` setting `on` to load the KB documents

7. Click the `Save` button to close settings

8. Enter the following prompt, then press Enter:

```
Generate an ES|QL query that will count the number of connections made to external IP addresses, broken down by user. If the count is greater than 100 for a specific user, add a new field called "follow_up" that contains a value of "true", otherwise, it should contain "false". The user names should also be enriched with their respective group names.
```

**Expected result**

A response similar to the following is returned:

```
FROM logs-*
| WHERE NOT CIDR_MATCH(destination.ip, "10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")
| STATS destcount = COUNT(destination.ip) BY user.name
| ENRICH ldap_lookup ON user.name
| EVAL follow_up = CASE(
    destcount >= 100, "true",
    "false")
| SORT destcount DESC
| KEEP destcount, user.name, group.name, follow_up
```

(cherry picked from commit d0e9925)
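The multi-search step in the description can be sketched roughly as follows. The Elasticsearch multi-search body alternates a header object (naming the index) with a search body, and each entry in `responses` either carries `hits` or an `error`. Every `...Sketch` name and the sample response data are hypothetical; this does not call Elasticsearch and is not the PR's `getMsearchQueryBody`:

```typescript
type MsearchDslSketch = Record<string, unknown>;

// Hypothetical builder: two searches (vector, then terms) against the same index.
const getMsearchQueryBodySketch = ({
  index,
  termsSearchQuery,
  termsSearchQuerySize,
  vectorSearchQuery,
  vectorSearchQuerySize,
}: {
  index: string;
  termsSearchQuery: MsearchDslSketch;
  termsSearchQuerySize: number;
  vectorSearchQuery: MsearchDslSketch;
  vectorSearchQuerySize: number;
}): { body: MsearchDslSketch[] } => ({
  body: [
    { index }, // header for the vector search
    { query: vectorSearchQuery, size: vectorSearchQuerySize },
    { index }, // header for the terms search
    { query: termsSearchQuery, size: termsSearchQuerySize },
  ],
});

interface HitSketch {
  _id: string;
}

type MsearchResponseItemSketch = { hits: { hits: HitSketch[] } } | { error: unknown };

// Flattens the per-search hits into one array, skipping searches that failed.
const flattenHitsSketch = (responses: MsearchResponseItemSketch[]): HitSketch[] =>
  responses.reduce<HitSketch[]>(
    (acc, response) => ('hits' in response ? acc.concat(response.hits.hits) : acc),
    []
  );

const msearchBodySketch = getMsearchQueryBodySketch({
  index: '.kibana-elastic-ai-assistant-kb',
  termsSearchQuery: { bool: { must: [{ term: { 'metadata.required': true } }] } },
  termsSearchQuerySize: 100, // made-up size for illustration
  vectorSearchQuery: { match_all: {} }, // placeholder for the real vector query
  vectorSearchQuerySize: 4,
});

// Made-up responses, in the same order as the searches in the body.
const sampleResponsesSketch: MsearchResponseItemSketch[] = [
  { hits: { hits: [{ _id: 'vector-1' }] } },
  { error: { type: 'example_error' } },
  { hits: { hits: [{ _id: 'required-1' }, { _id: 'required-2' }] } },
];

console.log(msearchBodySketch.body.length, flattenHitsSketch(sampleResponsesSketch).map((hit) => hit._id));
```

Sending both searches in one request keeps the required examples and the query-specific results in a single round trip, at the cost of having to handle per-search errors when flattening.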
💚 All backports created successfully
Note: Successful backport PRs will be merged automatically after passing CI. Questions? Please refer to the Backport tool documentation.
…rms) search for improved ES|QL query generation (#168995) (#169054)

# Backport

This will backport the following commits from `main` to `8.11`:

- [[Security Solution] [Elastic AI Assistant] Hybrid (vector + terms) search for improved ES|QL query generation (#168995)](#168995)

### Questions ?

Please refer to the [Backport tool documentation](https://github.com/sqren/backport)

Co-authored-by: Andrew Macri <[email protected]>
…arch for improved ES|QL query generation (elastic#168995)

## [Security Solution] [Elastic AI Assistant] Hybrid (vector + terms) search for improved ES|QL query generation

This PR implements hybrid (vector + terms) search to improve the quality of `ES|QL` queries generated by the Elastic AI Assistant.

The hybrid search combines (from a single request to Elasticsearch):

- Vector search results from ELSER that vary depending on the query specified by the user
- Terms search results that return a set of Knowledge Base (KB) documents marked as "required" for a topic

The hybrid search results, when provided as context to an LLM, improve the quality of generated `ES|QL` queries by combining `ES|QL` parser grammar and documentation specific to the question asked by a user with additional examples of valid `ES|QL` queries that aren't specific to the query.

## Details

### Indexing additional `metadata`

The `loadESQL` function in `x-pack/plugins/elastic_assistant/server/lib/langchain/content_loaders/esql_loader.ts` loads a directory containing 13 valid examples and one invalid example of `ES|QL` queries:

```typescript
const rawExampleQueries = await exampleQueriesLoader.load();

// Add additional metadata to the example queries that indicates they are required KB documents:
const requiredExampleQueries = addRequiredKbResourceMetadata({
  docs: rawExampleQueries,
  kbResource: ESQL_RESOURCE,
});
```

The `addRequiredKbResourceMetadata` function adds two additional fields to the `metadata` property of the document:

- `kbResource` - a `keyword` field that specifies the category of knowledge, e.g. `esql`
- `required` - a `boolean` field that, when `true`, indicates the document should be returned in all searches for the `kbResource`

The additional metadata fields are shown in the following abridged sample document:

```
{
  "_index": ".kibana-elastic-ai-assistant-kb",
  "_id": "e297e2d9-fb0e-4638-b4be-af31d1b31b9f",
  "_version": 1,
  "_seq_no": 129,
  "_primary_term": 1,
  "found": true,
  "_source": {
    "metadata": {
      "source": "/Users/andrew.goldstein/Projects/forks/andrew-goldstein/kibana/x-pack/plugins/elastic_assistant/server/knowledge_base/esql/example_queries/esql_example_query_0001.asciidoc",
      "required": true,
      "kbResource": "esql"
    },
    "vector": {
      "tokens": {
        "serial": 0.5612584,
        "syntax": 0.006727545,
        "user": 1.1184403,
        // ...additional tokens
      },
      "model_id": ".elser_model_2"
    },
    "text": """[[esql-example-queries]]

The following is an example ES|QL query:

\`\`\`
FROM logs-*
| WHERE NOT CIDR_MATCH(destination.ip, "10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")
| STATS destcount = COUNT(destination.ip) by user.name, host.name
| ENRICH ldap_lookup_new ON user.name
| WHERE group.name IS NOT NULL
| EVAL follow_up = CASE(
    destcount >= 100, "true",
    "false")
| SORT destcount desc
| KEEP destcount, host.name, user.name, group.name, follow_up
\`\`\`
"""
  }
}
```

### Hybrid search

The `ElasticsearchStore.similaritySearch` function is invoked by LangChain's `VectorStoreRetriever.getRelevantDocuments` function when the `RetrievalQAChain` searches for documents.
A single [msearch](https://www.elastic.co/guide/en/elasticsearch/reference/current/search-multi-search.html) request to Elasticsearch performs the hybrid search by combining the vector and terms searches:

```typescript
// requiredDocs is an array of filters that can be used in a `bool` Elasticsearch DSL query to filter in/out required KB documents:
const requiredDocs = getRequiredKbDocsTermsQueryDsl(this.kbResource);

// The `k` parameter is typically provided by LangChain's `VectorStoreRetriever._getRelevantDocuments`, which calls this function:
const vectorSearchQuerySize = k ?? FALLBACK_SIMILARITY_SEARCH_SIZE;

// build a vector search query:
const vectorSearchQuery = getVectorSearchQuery({
  filter,
  modelId: this.model,
  mustNotTerms: requiredDocs,
  query,
});

// build a (separate) terms search query:
const termsSearchQuery = getTermsSearchQuery(requiredDocs);

// combine the vector search query and the terms search queries into a single multi-search query:
const mSearchQueryBody = getMsearchQueryBody({
  index: this.index,
  termsSearchQuery,
  termsSearchQuerySize: TERMS_QUERY_SIZE,
  vectorSearchQuery,
  vectorSearchQuerySize,
});

try {
  // execute both queries via a single multi-search request:
  const result = await this.esClient.msearch<MsearchResponse>(mSearchQueryBody);

  // flatten the results of the combined queries into a single array of hits:
  const results: FlattenedHit[] = result.responses.flatMap((response) =>
    // ...
```

## Desk testing

1. Delete any previous instances of the Knowledge Base by executing the following query in Kibana's `Dev Tools`:

   ```
   DELETE .kibana-elastic-ai-assistant-kb
   ```

2. In the Security Solution, open the Elastic AI Assistant
3. In the assistant, click the `Settings` gear
4. Click the `Knowledge Base` icon to view the KB settings
5. Toggle the `Knowledge Base` setting `off` if it's already on
6. Toggle the `Knowledge Base` setting `on` to load the KB documents
7. Click the `Save` button to close settings
8. Enter the following prompt, then press Enter:

   ```
   Generate an ES|QL query that will count the number of connections made to external IP addresses, broken down by user. If the count is greater than 100 for a specific user, add a new field called "follow_up" that contains a value of "true", otherwise, it should contain "false". The user names should also be enriched with their respective group names.
   ```

**Expected result**

A response similar to the following is returned:

```
FROM logs-*
| WHERE NOT CIDR_MATCH(destination.ip, "10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")
| STATS destcount = COUNT(destination.ip) BY user.name
| ENRICH ldap_lookup ON user.name
| EVAL follow_up = CASE(
    destcount >= 100, "true",
    "false")
| SORT destcount DESC
| KEEP destcount, user.name, group.name, follow_up
```
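For readers who want to see how the pieces described under "Indexing additional `metadata`" and "Hybrid search" could fit together, here is a minimal, self-contained sketch. It is not the code from this PR: the helper names are borrowed from the excerpts above, but their bodies, the `KbDoc` and `QueryDsl` types, the `metadata.kbResource` / `metadata.required` field paths, the `vector.tokens` field in the `text_expansion` query, and the size values are all assumptions made for illustration. One deliberate choice in the sketch: the required-document filters are wrapped in a nested `bool` inside `must_not`, so the vector search excludes only documents that match both filters.

```typescript
// Hypothetical document shape; the real Kibana/LangChain types are not shown in the PR description.
interface KbDoc {
  pageContent: string;
  metadata: Record<string, unknown>;
}

type QueryDsl = Record<string, unknown>;

// Assumed behaviour: tag every document as a required KB document for `kbResource`.
function addRequiredKbResourceMetadata(args: { docs: KbDoc[]; kbResource: string }): KbDoc[] {
  return args.docs.map((doc) => ({
    ...doc,
    metadata: { ...doc.metadata, kbResource: args.kbResource, required: true },
  }));
}

// Assumed filter shape: two term filters that together identify required docs for a resource.
function getRequiredKbDocsTermsQueryDsl(kbResource: string): QueryDsl[] {
  return [
    { term: { 'metadata.kbResource': kbResource } },
    { term: { 'metadata.required': true } },
  ];
}

// Vector (ELSER) query that skips required docs, since the terms query returns those anyway.
function getVectorSearchQuery(args: {
  modelId: string;
  mustNotTerms: QueryDsl[];
  query: string;
}): QueryDsl {
  return {
    bool: {
      must: [
        { text_expansion: { 'vector.tokens': { model_id: args.modelId, model_text: args.query } } },
      ],
      // Nested bool: exclude a doc only when it matches ALL of the required-doc filters.
      must_not: [{ bool: { filter: args.mustNotTerms } }],
    },
  };
}

// Terms query that returns every required doc, regardless of the user's question.
function getTermsSearchQuery(requiredDocs: QueryDsl[]): QueryDsl {
  return { bool: { filter: requiredDocs } };
}

// An msearch body alternates header lines ({ index }) with search bodies.
function getMsearchQueryBody(args: {
  index: string;
  termsSearchQuery: QueryDsl;
  termsSearchQuerySize: number;
  vectorSearchQuery: QueryDsl;
  vectorSearchQuerySize: number;
}): QueryDsl[] {
  return [
    { index: args.index },
    { query: args.vectorSearchQuery, size: args.vectorSearchQuerySize },
    { index: args.index },
    { query: args.termsSearchQuery, size: args.termsSearchQuerySize },
  ];
}

// Hypothetical response shape, reduced to the fields the sketch needs.
interface Hit {
  _id: string;
  _source: { text: string };
}
interface MsearchLikeResponse {
  responses: Array<{ hits: { hits: Hit[] } }>;
}

// Flatten the per-search hit arrays into one list (vector hits first, then required docs).
function flattenHits(result: MsearchLikeResponse): Hit[] {
  return result.responses.reduce<Hit[]>((acc, response) => acc.concat(response.hits.hits), []);
}

// Usage: build the combined request body for an example question (sizes are made-up values).
const exampleRequiredDocs = getRequiredKbDocsTermsQueryDsl('esql');
const exampleBody = getMsearchQueryBody({
  index: '.kibana-elastic-ai-assistant-kb',
  termsSearchQuery: getTermsSearchQuery(exampleRequiredDocs),
  termsSearchQuerySize: 100,
  vectorSearchQuery: getVectorSearchQuery({
    modelId: '.elser_model_2',
    mustNotTerms: exampleRequiredDocs,
    query: 'count connections to external IP addresses by user',
  }),
  vectorSearchQuerySize: 4,
});
console.log(JSON.stringify(exampleBody, null, 2));
```

Because the vector query excludes the required documents and the terms query returns only those documents, the two result sets do not overlap, so the flattened hits can be passed to the LLM as context without de-duplication.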