[8.11] [Security Solution] [Elastic AI Assistant] Hybrid (vector + terms) search for improved ES|QL query generation (#168995) (#169054)

# Backport

This will backport the following commits from `main` to `8.11`:
- [[Security Solution] [Elastic AI Assistant] Hybrid (vector + terms)
search for improved ES|QL query generation
(#168995)](#168995)

<!--- Backport version: 8.9.7 -->

### Questions ?
Please refer to the [Backport tool
documentation](https://github.com/sqren/backport)

Backport metadata:

- Source commit: `d0e99258c68d57bc83788724814783ece176aa78` on `main`, committed 2023-10-17T00:54:40Z by Andrew Macri
- Source pull request: [#168995](https://github.com/elastic/kibana/pull/168995), labels `release_note:skip`, `Team: SecuritySolution`, `Feature:Elastic AI Assistant`, `v8.11.0`, `v8.12.0`
- Target branches: `8.11` (label `v8.11.0`, state `NOT_CREATED`); `main` (label `v8.12.0`, state `MERGED`)

Source commit message:

## [Security Solution] [Elastic AI Assistant] Hybrid (vector + terms) search for improved ES|QL query generation

This PR implements hybrid (vector + terms) search to improve the quality of `ES|QL` queries generated by the Elastic AI Assistant.

The hybrid search combines (from a single request to Elasticsearch):

- Vector search results from ELSER that vary depending on the query specified by the user
- Terms search results that return a set of Knowledge Base (KB) documents marked as "required" for a topic

The hybrid search results, when provided as context to an LLM, improve the quality of generated `ES|QL` queries by combining `ES|QL` parser grammar and documentation specific to the question asked by a user with additional examples of valid `ES|QL` queries that aren't specific to the query.

## Details

### Indexing additional `metadata`

The `loadESQL` function in `x-pack/plugins/elastic_assistant/server/lib/langchain/content_loaders/esql_loader.ts` loads a directory containing 13 valid examples and one invalid example of `ES|QL` queries:

```typescript
  const rawExampleQueries = await exampleQueriesLoader.load();

  // Add additional metadata to the example queries that indicates they are required KB documents:
  const requiredExampleQueries = addRequiredKbResourceMetadata({
    docs: rawExampleQueries,
    kbResource: ESQL_RESOURCE,
  });
```

The `addRequiredKbResourceMetadata` function adds two additional fields to the `metadata` property of the document:

- `kbResource` - a `keyword` field that specifies the category of knowledge, e.g. `esql`
- `required` - a `boolean` field that, when `true`, indicates the document should be returned in all searches for the `kbResource`

The additional metadata fields appear in the following abridged sample document:

```
{
  "_index": ".kibana-elastic-ai-assistant-kb",
  "_id": "e297e2d9-fb0e-4638-b4be-af31d1b31b9f",
  "_version": 1,
  "_seq_no": 129,
  "_primary_term": 1,
  "found": true,
  "_source": {
    "metadata": {
      "source": "/Users/andrew.goldstein/Projects/forks/andrew-goldstein/kibana/x-pack/plugins/elastic_assistant/server/knowledge_base/esql/example_queries/esql_example_query_0001.asciidoc",
      "required": true,
      "kbResource": "esql"
    },
    "vector": {
      "tokens": {
        "serial": 0.5612584,
        "syntax": 0.006727545,
        "user": 1.1184403,
        // ...additional tokens
      },
      "model_id": ".elser_model_2"
    },
    "text": """[[esql-example-queries]]

The following is an example ES|QL query:

\`\`\`
FROM logs-*
| WHERE NOT CIDR_MATCH(destination.ip, "10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")
| STATS destcount = COUNT(destination.ip) by user.name, host.name
| ENRICH ldap_lookup_new ON user.name
| WHERE group.name IS NOT NULL
| EVAL follow_up = CASE(
    destcount >= 100, "true",
    "false")
| SORT destcount desc
| KEEP destcount, host.name, user.name, group.name, follow_up
\`\`\`
"""
  }
}
```

### Hybrid search

The `ElasticsearchStore.similaritySearch` function is invoked by LangChain's `VectorStoreRetriever.getRelevantDocuments` function when the `RetrievalQAChain` searches for documents.

A single [msearch](https://www.elastic.co/guide/en/elasticsearch/reference/current/search-multi-search.html) request to Elasticsearch performs the hybrid search by combining the vector and terms searches:

```typescript
  // requiredDocs is an array of filters that can be used in a `bool` Elasticsearch DSL query to filter in/out required KB documents:
  const requiredDocs = getRequiredKbDocsTermsQueryDsl(this.kbResource);

  // The `k` parameter is typically provided by LangChain's `VectorStoreRetriever._getRelevantDocuments`, which calls this function:
  const vectorSearchQuerySize = k ?? FALLBACK_SIMILARITY_SEARCH_SIZE;

  // build a vector search query:
  const vectorSearchQuery = getVectorSearchQuery({
    filter,
    modelId: this.model,
    mustNotTerms: requiredDocs,
    query,
  });

  // build a (separate) terms search query:
  const termsSearchQuery = getTermsSearchQuery(requiredDocs);

  // combine the vector search query and the terms search queries into a single multi-search query:
  const mSearchQueryBody = getMsearchQueryBody({
    index: this.index,
    termsSearchQuery,
    termsSearchQuerySize: TERMS_QUERY_SIZE,
    vectorSearchQuery,
    vectorSearchQuerySize,
  });

  try {
    // execute both queries via a single multi-search request:
    const result = await this.esClient.msearch<MsearchResponse>(mSearchQueryBody);

    // flatten the results of the combined queries into a single array of hits:
    const results: FlattenedHit[] = result.responses.flatMap((response) =>
      // ...
```

## Desk testing

1. Delete any previous instances of the Knowledge Base by executing the following query in Kibana's `Dev Tools`:

```
DELETE .kibana-elastic-ai-assistant-kb
```

2. In the Security Solution, open the Elastic AI Assistant

3. In the assistant, click the `Settings` gear

4. Click the `Knowledge Base` icon to view the KB settings

5. Toggle the `Knowledge Base` setting `off` if it's already on

6. Toggle the `Knowledge Base` setting `on` to load the KB documents

7. Click the `Save` button to close settings

8. Enter the following prompt, then press Enter:

```
Generate an ES|QL query that will count the number of connections made to external IP addresses, broken down by user. If the count is greater than 100 for a specific user, add a new field called "follow_up" that contains a value of "true", otherwise, it should contain "false". The user names should also be enriched with their respective group names.
```

**Expected result**

A response similar to the following is returned:

```
FROM logs-*
| WHERE NOT CIDR_MATCH(destination.ip, "10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")
| STATS destcount = COUNT(destination.ip) BY user.name
| ENRICH ldap_lookup ON user.name
| EVAL follow_up = CASE(
    destcount >= 100, "true",
    "false")
| SORT destcount DESC
| KEEP destcount, user.name, group.name, follow_up
```
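The last step of the hybrid search, flattening the responses of both queries into one array of hits, is elided in the commit message. The following is only a sketch of how such a step could work, not the code from this PR: `KbHit`, `MsearchResponseItem`, the fields of `FlattenedHit`, and the `flattenHits` helper are all assumptions made for illustration.

```typescript
// Simplified shapes (assumptions): the PR's `FlattenedHit` and the client's
// `MsearchResponse` types are richer than what is modeled here.
interface KbHit {
  _id: string;
  _source?: { text?: string; metadata?: Record<string, unknown> };
}

type MsearchResponseItem =
  | { hits: { hits: KbHit[] } } // a successful search
  | { error: unknown; status: number }; // a failed search

interface FlattenedHit {
  id: string;
  text: string;
  metadata: Record<string, unknown>;
}

// Hypothetical helper: merge the hits of every successful response, in
// response order, into one flat array. Failed responses contribute no hits.
function flattenHits(responses: MsearchResponseItem[]): FlattenedHit[] {
  return responses.flatMap((response) =>
    'hits' in response
      ? response.hits.hits.map(
          (hit): FlattenedHit => ({
            id: hit._id,
            text: hit._source?.text ?? '',
            metadata: hit._source?.metadata ?? {},
          })
        )
      : []
  );
}

const flattened = flattenHits([
  // first response: the vector search
  { hits: { hits: [{ _id: 'vector-1', _source: { text: 'AVG docs', metadata: {} } }] } },
  // second response: the required KB documents (terms) search
  {
    hits: {
      hits: [
        {
          _id: 'required-1',
          _source: { text: 'example query', metadata: { kbResource: 'esql', required: true } },
        },
      ],
    },
  },
  // an error item, which the sketch skips
  { error: { type: 'illustrative_error' }, status: 500 },
]);

console.log(flattened.map((hit) => hit.id));
```

In this sketch the vector hit comes first and the required document second, because `flatMap` preserves the order of the responses; the error item contributes nothing.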

Co-authored-by: Andrew Macri <[email protected]>
kibanamachine and andrew-goldstein authored Oct 17, 2023
1 parent 5047d71 commit 74ae54d
Showing 49 changed files with 1,673 additions and 114 deletions.
@@ -0,0 +1,74 @@
/*
 * Copyright Elasticsearch B.V. and/or licensed to Elasticsearch B.V. under one
 * or more contributor license agreements. Licensed under the Elastic License
 * 2.0; you may not use this file except in compliance with the Elastic License
 * 2.0.
 */

import { Document } from 'langchain/document';

/**
 * Mock LangChain `Document`s from `knowledge_base/esql/docs`, loaded from a LangChain `DirectoryLoader`
 */
export const mockEsqlDocsFromDirectoryLoader: Document[] = [
  {
    pageContent:
      '[[esql-agg-avg]]\n=== `AVG`\nThe average of a numeric field.\n\n[source.merge.styled,esql]\n----\ninclude::{esql-specs}/stats.csv-spec[tag=avg]\n----\n[%header.monospaced.styled,format=dsv,separator=|]\n|===\ninclude::{esql-specs}/stats.csv-spec[tag=avg-result]\n|===\n\nThe result is always a `double` not matter the input type.\n',
    metadata: {
      source:
        '/Users/andrew.goldstein/Projects/forks/andrew-goldstein/kibana/x-pack/plugins/elastic_assistant/server/knowledge_base/esql/docs/aggregation_functions/avg.asciidoc',
    },
  },
];

/**
 * Mock LangChain `Document`s from `knowledge_base/esql/language_definition`, loaded from a LangChain `DirectoryLoader`
 */
export const mockEsqlLanguageDocsFromDirectoryLoader: Document[] = [
  {
    pageContent:
      "lexer grammar EsqlBaseLexer;\n\nDISSECT : 'dissect' -> pushMode(EXPRESSION);\nDROP : 'drop' -> pushMode(SOURCE_IDENTIFIERS);\nENRICH : 'enrich' -> pushMode(SOURCE_IDENTIFIERS);\nEVAL : 'eval' -> pushMode(EXPRESSION);\nEXPLAIN : 'explain' -> pushMode(EXPLAIN_MODE);\nFROM : 'from' -> pushMode(SOURCE_IDENTIFIERS);\nGROK : 'grok' -> pushMode(EXPRESSION);\nINLINESTATS : 'inlinestats' -> pushMode(EXPRESSION);\nKEEP : 'keep' -> pushMode(SOURCE_IDENTIFIERS);\nLIMIT : 'limit' -> pushMode(EXPRESSION);\nMV_EXPAND : 'mv_expand' -> pushMode(SOURCE_IDENTIFIERS);\nPROJECT : 'project' -> pushMode(SOURCE_IDENTIFIERS);\nRENAME : 'rename' -> pushMode(SOURCE_IDENTIFIERS);\nROW : 'row' -> pushMode(EXPRESSION);\nSHOW : 'show' -> pushMode(EXPRESSION);\nSORT : 'sort' -> pushMode(EXPRESSION);\nSTATS : 'stats' -> pushMode(EXPRESSION);\nWHERE : 'where' -> pushMode(EXPRESSION);\nUNKNOWN_CMD : ~[ \\r\\n\\t[\\]/]+ -> pushMode(EXPRESSION);\n\nLINE_COMMENT\n : '//' ~[\\r\\n]* '\\r'? '\\n'? -> channel(HIDDEN)\n ;\n\nMULTILINE_COMMENT\n : '/*' (MULTILINE_COMMENT|.)*? '*/' -> channel(HIDDEN)\n ;\n\nWS\n : [ \\r\\n\\t]+ -> channel(HIDDEN)\n ;\n\n\nmode EXPLAIN_MODE;\nEXPLAIN_OPENING_BRACKET : '[' -> type(OPENING_BRACKET), pushMode(DEFAULT_MODE);\nEXPLAIN_PIPE : '|' -> type(PIPE), popMode;\nEXPLAIN_WS : WS -> channel(HIDDEN);\nEXPLAIN_LINE_COMMENT : LINE_COMMENT -> channel(HIDDEN);\nEXPLAIN_MULTILINE_COMMENT : MULTILINE_COMMENT -> channel(HIDDEN);\n\nmode EXPRESSION;\n\nPIPE : '|' -> popMode;\n\nfragment DIGIT\n : [0-9]\n ;\n\nfragment LETTER\n : [A-Za-z]\n ;\n\nfragment ESCAPE_SEQUENCE\n : '\\\\' [tnr\"\\\\]\n ;\n\nfragment UNESCAPED_CHARS\n : ~[\\r\\n\"\\\\]\n ;\n\nfragment EXPONENT\n : [Ee] [+-]? DIGIT+\n ;\n\nSTRING\n : '\"' (ESCAPE_SEQUENCE | UNESCAPED_CHARS)* '\"'\n | '\"\"\"' (~[\\r\\n])*? '\"\"\"' '\"'? '\"'?\n ;\n\nINTEGER_LITERAL\n : DIGIT+\n ;\n\nDECIMAL_LITERAL\n : DIGIT+ DOT DIGIT*\n | DOT DIGIT+\n | DIGIT+ (DOT DIGIT*)? EXPONENT\n | DOT DIGIT+ EXPONENT\n ;\n\nBY : 'by';\n\nAND : 'and';\nASC : 'asc';\nASSIGN : '=';\nCOMMA : ',';\nDESC : 'desc';\nDOT : '.';\nFALSE : 'false';\nFIRST : 'first';\nLAST : 'last';\nLP : '(';\nIN: 'in';\nIS: 'is';\nLIKE: 'like';\nNOT : 'not';\nNULL : 'null';\nNULLS : 'nulls';\nOR : 'or';\nPARAM: '?';\nRLIKE: 'rlike';\nRP : ')';\nTRUE : 'true';\nINFO : 'info';\nFUNCTIONS : 'functions';\n\nEQ : '==';\nNEQ : '!=';\nLT : '<';\nLTE : '<=';\nGT : '>';\nGTE : '>=';\n\nPLUS : '+';\nMINUS : '-';\nASTERISK : '*';\nSLASH : '/';\nPERCENT : '%';\n\n// Brackets are funny. We can happen upon a CLOSING_BRACKET in two ways - one\n// way is to start in an explain command which then shifts us to expression\n// mode. Thus, the two popModes on CLOSING_BRACKET. The other way could as\n// the start of a multivalued field constant. To line up with the double pop\n// the explain mode needs, we double push when we see that.\nOPENING_BRACKET : '[' -> pushMode(EXPRESSION), pushMode(EXPRESSION);\nCLOSING_BRACKET : ']' -> popMode, popMode;\n\n\nUNQUOTED_IDENTIFIER\n : LETTER (LETTER | DIGIT | '_')*\n // only allow @ at beginning of identifier to keep the option to allow @ as infix operator in the future\n // also, single `_` and `@` characters are not valid identifiers\n | ('_' | '@') (LETTER | DIGIT | '_')+\n ;\n\nQUOTED_IDENTIFIER\n : '`' ( ~'`' | '``' )* '`'\n ;\n\nEXPR_LINE_COMMENT\n : LINE_COMMENT -> channel(HIDDEN)\n ;\n\nEXPR_MULTILINE_COMMENT\n : MULTILINE_COMMENT -> channel(HIDDEN)\n ;\n\nEXPR_WS\n : WS -> channel(HIDDEN)\n ;\n\n\n\nmode SOURCE_IDENTIFIERS;\n\nSRC_PIPE : '|' -> type(PIPE), popMode;\nSRC_OPENING_BRACKET : '[' -> type(OPENING_BRACKET), pushMode(SOURCE_IDENTIFIERS), pushMode(SOURCE_IDENTIFIERS);\nSRC_CLOSING_BRACKET : ']' -> popMode, popMode, type(CLOSING_BRACKET);\nSRC_COMMA : ',' -> type(COMMA);\nSRC_ASSIGN : '=' -> type(ASSIGN);\nAS : 'as';\nMETADATA: 'metadata';\nON : 'on';\nWITH : 'with';\n\nSRC_UNQUOTED_IDENTIFIER\n : SRC_UNQUOTED_IDENTIFIER_PART+\n ;\n\nfragment SRC_UNQUOTED_IDENTIFIER_PART\n : ~[=`|,[\\]/ \\t\\r\\n]+\n | '/' ~[*/] // allow single / but not followed by another / or * which would start a comment\n ;\n\nSRC_QUOTED_IDENTIFIER\n : QUOTED_IDENTIFIER\n ;\n\nSRC_LINE_COMMENT\n : LINE_COMMENT -> channel(HIDDEN)\n ;\n\nSRC_MULTILINE_COMMENT\n : MULTILINE_COMMENT -> channel(HIDDEN)\n ;\n\nSRC_WS\n : WS -> channel(HIDDEN)\n ;\n",
    metadata: {
      source:
        '/Users/andrew.goldstein/Projects/forks/andrew-goldstein/kibana/x-pack/plugins/elastic_assistant/server/knowledge_base/esql/language_definition/esql_base_lexer.g4',
    },
  },
  {
    pageContent:
      "DISSECT=1\nDROP=2\nENRICH=3\nEVAL=4\nEXPLAIN=5\nFROM=6\nGROK=7\nINLINESTATS=8\nKEEP=9\nLIMIT=10\nMV_EXPAND=11\nPROJECT=12\nRENAME=13\nROW=14\nSHOW=15\nSORT=16\nSTATS=17\nWHERE=18\nUNKNOWN_CMD=19\nLINE_COMMENT=20\nMULTILINE_COMMENT=21\nWS=22\nEXPLAIN_WS=23\nEXPLAIN_LINE_COMMENT=24\nEXPLAIN_MULTILINE_COMMENT=25\nPIPE=26\nSTRING=27\nINTEGER_LITERAL=28\nDECIMAL_LITERAL=29\nBY=30\nAND=31\nASC=32\nASSIGN=33\nCOMMA=34\nDESC=35\nDOT=36\nFALSE=37\nFIRST=38\nLAST=39\nLP=40\nIN=41\nIS=42\nLIKE=43\nNOT=44\nNULL=45\nNULLS=46\nOR=47\nPARAM=48\nRLIKE=49\nRP=50\nTRUE=51\nINFO=52\nFUNCTIONS=53\nEQ=54\nNEQ=55\nLT=56\nLTE=57\nGT=58\nGTE=59\nPLUS=60\nMINUS=61\nASTERISK=62\nSLASH=63\nPERCENT=64\nOPENING_BRACKET=65\nCLOSING_BRACKET=66\nUNQUOTED_IDENTIFIER=67\nQUOTED_IDENTIFIER=68\nEXPR_LINE_COMMENT=69\nEXPR_MULTILINE_COMMENT=70\nEXPR_WS=71\nAS=72\nMETADATA=73\nON=74\nWITH=75\nSRC_UNQUOTED_IDENTIFIER=76\nSRC_QUOTED_IDENTIFIER=77\nSRC_LINE_COMMENT=78\nSRC_MULTILINE_COMMENT=79\nSRC_WS=80\nEXPLAIN_PIPE=81\n'dissect'=1\n'drop'=2\n'enrich'=3\n'eval'=4\n'explain'=5\n'from'=6\n'grok'=7\n'inlinestats'=8\n'keep'=9\n'limit'=10\n'mv_expand'=11\n'project'=12\n'rename'=13\n'row'=14\n'show'=15\n'sort'=16\n'stats'=17\n'where'=18\n'by'=30\n'and'=31\n'asc'=32\n'desc'=35\n'.'=36\n'false'=37\n'first'=38\n'last'=39\n'('=40\n'in'=41\n'is'=42\n'like'=43\n'not'=44\n'null'=45\n'nulls'=46\n'or'=47\n'?'=48\n'rlike'=49\n')'=50\n'true'=51\n'info'=52\n'functions'=53\n'=='=54\n'!='=55\n'<'=56\n'<='=57\n'>'=58\n'>='=59\n'+'=60\n'-'=61\n'*'=62\n'/'=63\n'%'=64\n']'=66\n'as'=72\n'metadata'=73\n'on'=74\n'with'=75\n",
    metadata: {
      source:
        '/Users/andrew.goldstein/Projects/forks/andrew-goldstein/kibana/x-pack/plugins/elastic_assistant/server/knowledge_base/esql/language_definition/esql_base_lexer.tokens',
    },
  },
];

/**
 * Mock LangChain `Document`s from `knowledge_base/esql/example_queries`, loaded from a LangChain `DirectoryLoader`
 */
export const mockExampleQueryDocsFromDirectoryLoader: Document[] = [
  {
    pageContent:
      '[[esql-example-queries]]\n\nThe following is an example an ES|QL query:\n\n```\nFROM logs-*\n| WHERE NOT CIDR_MATCH(destination.ip, "10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")\n| STATS destcount = COUNT(destination.ip) by user.name, host.name\n| ENRICH ldap_lookup_new ON user.name\n| WHERE group.name IS NOT NULL\n| EVAL follow_up = CASE(\n destcount >= 100, "true",\n "false")\n| SORT destcount desc\n| KEEP destcount, host.name, user.name, group.name, follow_up\n```\n',
    metadata: {
      source:
        '/Users/andrew.goldstein/Projects/forks/andrew-goldstein/kibana/x-pack/plugins/elastic_assistant/server/knowledge_base/esql/example_queries/esql_example_query_0001.asciidoc',
    },
  },
  {
    pageContent:
      '[[esql-example-queries]]\n\nThe following is an example an ES|QL query:\n\n```\nfrom logs-*\n| grok dns.question.name "%{DATA}\\\\.%{GREEDYDATA:dns.question.registered_domain:string}"\n| stats unique_queries = count_distinct(dns.question.name) by dns.question.registered_domain, process.name\n| where unique_queries > 5\n| sort unique_queries desc\n```\n',
    metadata: {
      source:
        '/Users/andrew.goldstein/Projects/forks/andrew-goldstein/kibana/x-pack/plugins/elastic_assistant/server/knowledge_base/esql/example_queries/esql_example_query_0002.asciidoc',
    },
  },
  {
    pageContent:
      '[[esql-example-queries]]\n\nThe following is an example an ES|QL query:\n\n```\nfrom logs-*\n| where event.code is not null\n| stats event_code_count = count(event.code) by event.code,host.name\n| enrich win_events on event.code with EVENT_DESCRIPTION\n| where EVENT_DESCRIPTION is not null and host.name is not null\n| rename EVENT_DESCRIPTION as event.description\n| sort event_code_count desc\n| keep event_code_count,event.code,host.name,event.description\n```\n',
    metadata: {
      source:
        '/Users/andrew.goldstein/Projects/forks/andrew-goldstein/kibana/x-pack/plugins/elastic_assistant/server/knowledge_base/esql/example_queries/esql_example_query_0003.asciidoc',
    },
  },
];
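The mock documents above carry only a `source` in their `metadata`. As a rough illustration of how a helper like `addRequiredKbResourceMetadata` could mark such documents as required, here is a minimal sketch. It is not the implementation from this commit: the local `KbDocument` interface stands in for LangChain's `Document`, and the sample content and shortened `source` path are placeholders.

```typescript
// Minimal stand-in for LangChain's `Document` (assumption: only these two
// fields matter for this illustration).
interface KbDocument {
  pageContent: string;
  metadata: Record<string, unknown>;
}

// Hypothetical re-implementation; the real helper in the commit may differ.
function addRequiredKbResourceMetadata({
  docs,
  kbResource,
}: {
  docs: KbDocument[];
  kbResource: string;
}): KbDocument[] {
  return docs.map((doc) => ({
    ...doc,
    // Keep existing metadata (e.g. `source`) and add the two new fields.
    metadata: { ...doc.metadata, kbResource, required: true },
  }));
}

// Placeholder input; the content and path are illustrative only.
const rawExampleQueries: KbDocument[] = [
  {
    pageContent: 'FROM logs-* | LIMIT 10',
    metadata: { source: 'example_queries/placeholder.asciidoc' },
  },
];

const requiredExampleQueries = addRequiredKbResourceMetadata({
  docs: rawExampleQueries,
  kbResource: 'esql',
});

console.log(JSON.stringify(requiredExampleQueries.map((doc) => doc.metadata)));
```

The sketch copies each document instead of mutating it, so the raw loader output stays untouched; whether the commit's helper does the same is not shown in this excerpt.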
75 changes: 75 additions & 0 deletions x-pack/plugins/elastic_assistant/server/__mocks__/msearch_query.ts
@@ -0,0 +1,75 @@
/*
* Copyright Elasticsearch B.V. and/or licensed to Elasticsearch B.V. under one
* or more contributor license agreements. Licensed under the Elastic License
* 2.0; you may not use this file except in compliance with the Elastic License
* 2.0.
*/

import type { QueryDslTextExpansionQuery } from '@elastic/elasticsearch/lib/api/types';

import type { MsearchQueryBody } from '../lib/langchain/elasticsearch_store/helpers/get_msearch_query_body';

/**
* This mock Elasticsearch msearch request body contains two queries:
* - The first query is a similarity (vector) search
* - The second query is a required KB document (terms) search
*/
export const mSearchQueryBody: MsearchQueryBody = {
body: [
{
index: '.kibana-elastic-ai-assistant-kb',
},
{
query: {
bool: {
must_not: [
{
term: {
'metadata.kbResource': 'esql',
},
},
{
term: {
'metadata.required': true,
},
},
],
must: [
{
text_expansion: {
'vector.tokens': {
model_id: '.elser_model_2',
model_text:
'Generate an ESQL query that will count the number of connections made to external IP addresses, broken down by user. If the count is greater than 100 for a specific user, add a new field called "follow_up" that contains a value of "true", otherwise, it should contain "false". The user names should also be enriched with their respective group names.',
},
} as unknown as QueryDslTextExpansionQuery,
},
],
},
},
size: 1,
},
{
index: '.kibana-elastic-ai-assistant-kb',
},
{
query: {
bool: {
must: [
{
term: {
'metadata.kbResource': 'esql',
},
},
{
term: {
'metadata.required': true,
},
},
],
},
},
size: 1,
},
],
};
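The mock above follows the msearch convention of alternating a header object (the target index) with a search body. A minimal sketch of assembling such a body from a list of searches might look like the following; the `toMsearchBody` helper and the local types are hypothetical names introduced for illustration, not the plugin's actual `get_msearch_query_body` helper, and the query text is abbreviated.

```typescript
// Hypothetical local types; the plugin uses types from the Elasticsearch client.
interface SearchHeader {
  index: string;
}

interface SearchBody {
  query: Record<string, unknown>;
  size: number;
}

type MsearchItem = SearchHeader | SearchBody;

// Hypothetical helper: emits one { index } header before each search body,
// which is the pairing the msearch request body expects.
function toMsearchBody(index: string, searches: SearchBody[]): MsearchItem[] {
  const items: MsearchItem[] = [];
  for (const search of searches) {
    items.push({ index }, search);
  }
  return items;
}

// The vector search excludes the required `esql` KB docs...
const vectorSearch: SearchBody = {
  query: {
    bool: {
      must_not: [
        { term: { 'metadata.kbResource': 'esql' } },
        { term: { 'metadata.required': true } },
      ],
      must: [
        {
          text_expansion: {
            'vector.tokens': {
              model_id: '.elser_model_2',
              model_text: 'Generate an ES|QL query (abbreviated for this sketch)',
            },
          },
        },
      ],
    },
  },
  size: 1,
};

// ...and the terms search returns them.
const termsSearch: SearchBody = {
  query: {
    bool: {
      must: [
        { term: { 'metadata.kbResource': 'esql' } },
        { term: { 'metadata.required': true } },
      ],
    },
  },
  size: 1,
};

const msearchBody = toMsearchBody('.kibana-elastic-ai-assistant-kb', [vectorSearch, termsSearch]);
console.log(JSON.stringify(msearchBody, null, 2));
```

Keeping both searches in one request body is what lets the hybrid results come back from a single round trip to Elasticsearch.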
101 changes: 101 additions & 0 deletions x-pack/plugins/elastic_assistant/server/__mocks__/msearch_response.ts
@@ -0,0 +1,101 @@
/*
* Copyright Elasticsearch B.V. and/or licensed to Elasticsearch B.V. under one
* or more contributor license agreements. Licensed under the Elastic License
* 2.0; you may not use this file except in compliance with the Elastic License
* 2.0.
*/

import type { MsearchResponse } from '@elastic/elasticsearch/lib/api/types';

/**
* This mock response from an Elasticsearch msearch contains two hits: the
* first hit is from a similarity (vector) search, and the second hit is from a
* required KB document (terms) search.
*/
export const mockMsearchResponse: MsearchResponse = {
took: 142,
responses: [
{
took: 142,
timed_out: false,
_shards: {
total: 1,
successful: 1,
skipped: 0,
failed: 0,
},
hits: {
total: {
value: 129,
relation: 'eq',
},
max_score: 21.658352,
hits: [
{
_index: '.kibana-elastic-ai-assistant-kb',
_id: 'fa1c8ba1-25c9-4404-9736-09b7eb7124f8',
_score: 21.658352,
_ignored: ['text.keyword'],
_source: {
metadata: {
source:
'/Users/andrew.goldstein/Projects/forks/andrew-goldstein/kibana/x-pack/plugins/elastic_assistant/server/knowledge_base/esql/docs/source_commands/from.asciidoc',
},
vector: {
tokens: {
wild: 1.2001507,
// truncated for mock
},
model_id: '.elser_model_2',
},
text: "[[esql-from]]\n=== `FROM`\n\nThe `FROM` source command returns a table with up to 10,000 documents from a\ndata stream, index, or alias. Each row in the resulting table represents a\ndocument. Each column corresponds to a field, and can be accessed by the name\nof that field.\n\n[source,esql]\n----\nFROM employees\n----\n\nYou can use <<api-date-math-index-names,date math>> to refer to indices, aliases\nand data streams. This can be useful for time series data, for example to access\ntoday's index:\n\n[source,esql]\n----\nFROM <logs-{now/d}>\n----\n\nUse comma-separated lists or wildcards to query multiple data streams, indices,\nor aliases:\n\n[source,esql]\n----\nFROM employees-00001,employees-*\n----\n",
},
},
],
},
status: 200,
},
{
took: 3,
timed_out: false,
_shards: {
total: 1,
successful: 1,
skipped: 0,
failed: 0,
},
hits: {
total: {
value: 14,
relation: 'eq',
},
max_score: 0.034783483,
hits: [
{
_index: '.kibana-elastic-ai-assistant-kb',
_id: '280d4882-0f64-4471-a268-669a3f8c958f',
_score: 0.034783483,
_ignored: ['text.keyword'],
_source: {
metadata: {
source:
'/Users/andrew.goldstein/Projects/forks/andrew-goldstein/kibana/x-pack/plugins/elastic_assistant/server/knowledge_base/esql/example_queries/esql_example_query_0001.asciidoc',
required: true,
kbResource: 'esql',
},
vector: {
tokens: {
user: 1.1084619,
// truncated for mock
},
model_id: '.elser_model_2',
},
text: '[[esql-example-queries]]\n\nThe following is an example an ES|QL query:\n\n```\nFROM logs-*\n| WHERE NOT CIDR_MATCH(destination.ip, "10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")\n| STATS destcount = COUNT(destination.ip) by user.name, host.name\n| ENRICH ldap_lookup_new ON user.name\n| WHERE group.name IS NOT NULL\n| EVAL follow_up = CASE(\n destcount >= 100, "true",\n "false")\n| SORT destcount desc\n| KEEP destcount, host.name, user.name, group.name, follow_up\n```\n',
},
},
],
},
status: 200,
},
],
};
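A response shaped like the mock above holds one entry in `responses` per search, each with its own `hits`. As a rough sketch, assuming the caller wants a flat list of `{ pageContent, metadata }` documents (the shape used by the example query mocks), the hits could be flattened as shown below. The `flattenMsearchHits` function, the local types, and the abbreviated sample data are all hypothetical and only illustrate the idea.

```typescript
// Hypothetical local types, simplified from the shape of the mock response.
interface KbSource {
  text: string;
  metadata: Record<string, unknown>;
}

interface MsearchItemResponse {
  status: number;
  // A failed search in an msearch response carries an error instead of hits.
  hits?: { hits: Array<{ _id: string; _source?: KbSource }> };
}

interface KbDocument {
  pageContent: string;
  metadata: Record<string, unknown>;
}

// Hypothetical helper: collects hits from every search, in request order,
// skipping responses without hits and hits without a _source.
function flattenMsearchHits(responses: MsearchItemResponse[]): KbDocument[] {
  const documents: KbDocument[] = [];
  for (const response of responses) {
    for (const hit of response.hits?.hits ?? []) {
      if (hit._source !== undefined) {
        documents.push({ pageContent: hit._source.text, metadata: hit._source.metadata });
      }
    }
  }
  return documents;
}

// Abbreviated, made-up sample data for this sketch.
const sampleResponses: MsearchItemResponse[] = [
  {
    status: 200,
    hits: {
      hits: [{ _id: 'vector-hit', _source: { text: 'FROM docs', metadata: { source: 'from.asciidoc' } } }],
    },
  },
  { status: 404 }, // a failed search contributes no documents
  {
    status: 200,
    hits: {
      hits: [
        {
          _id: 'required-hit',
          _source: { text: 'example query', metadata: { required: true, kbResource: 'esql' } },
        },
      ],
    },
  },
];

const flattenedDocs = flattenMsearchHits(sampleResponses);
console.log(flattenedDocs.map((doc) => doc.pageContent));
```

Preserving request order means the vector search results come first and the required KB documents follow, which matches the order of the two hits in the mock.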
28 changes: 28 additions & 0 deletions x-pack/plugins/elastic_assistant/server/__mocks__/query_text.ts
@@ -0,0 +1,28 @@
/*
* Copyright Elasticsearch B.V. and/or licensed to Elasticsearch B.V. under one
* or more contributor license agreements. Licensed under the Elastic License
* 2.0; you may not use this file except in compliance with the Elastic License
* 2.0.
*/

/**
* This mock query text is an example of a prompt that might be passed to
* the `ElasticsearchStore`'s `similaritySearch` function, as the `query`
* parameter.
*
* In the real world, an LLM extracted the `mockQueryText` from the
* following prompt, which includes a system prompt:
*
* ```
* You are a helpful, expert assistant who answers questions about Elastic Security. Do not answer questions unrelated to Elastic Security.
* If you answer a question related to KQL, EQL, or ES|QL, it should be immediately usable within an Elastic Security timeline; please always format the output correctly with back ticks. Any answer provided for Query DSL should also be usable in a security timeline. This means you should only ever include the "filter" portion of the query.
*
* Use the following context to answer questions:
*
* Generate an ES|QL query that will count the number of connections made to external IP addresses, broken down by user. If the count is greater than 100 for a specific user, add a new field called "follow_up" that contains a value of "true", otherwise, it should contain "false". The user names should also be enriched with their respective group names.
* ```
*
* In the example above, the LLM omitted the system prompt, such that only `mockQueryText` is passed to the `similaritySearch` function.
*/
export const mockQueryText =
'Generate an ES|QL query that will count the number of connections made to external IP addresses, broken down by user. If the count is greater than 100 for a specific user, add a new field called follow_up that contains a value of true, otherwise, it should contain false. The user names should also be enriched with their respective group names.';
28 changes: 28 additions & 0 deletions x-pack/plugins/elastic_assistant/server/__mocks__/terms.ts
Original file line number Diff line number Diff line change
@@ -0,0 +1,28 @@
/*
* Copyright Elasticsearch B.V. and/or licensed to Elasticsearch B.V. under one
* or more contributor license agreements. Licensed under the Elastic License
* 2.0; you may not use this file except in compliance with the Elastic License
* 2.0.
*/

import type { Field, FieldValue, QueryDslTermQuery } from '@elastic/elasticsearch/lib/api/types';

/**
* These (mock) terms may be used in multiple queries.
*
* For example, they may be used in a vector search to exclude the required `esql` KB docs.
*
* They may also be used in a terms search to find all of the required `esql` KB docs.
*/
export const mockTerms: Array<Partial<Record<Field, QueryDslTermQuery | FieldValue>>> = [
{
term: {
'metadata.kbResource': 'esql',
},
},
{
term: {
'metadata.required': true,
},
},
];
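To illustrate the two uses described in the comment above, the sketch below feeds the same terms into a `must_not` clause for the vector search and a `must` clause for the terms search. The function names `toVectorSearchQuery` and `toTermsSearchQuery` and the `TermClause` type are assumptions made for this example; they are not taken from the plugin.

```typescript
// Hypothetical local type standing in for the Elasticsearch client's term query type.
type TermClause = { term: Record<string, string | number | boolean> };

const requiredEsqlTerms: TermClause[] = [
  { term: { 'metadata.kbResource': 'esql' } },
  { term: { 'metadata.required': true } },
];

// Hypothetical helper: a vector (text_expansion) search that excludes docs matching the terms.
function toVectorSearchQuery(terms: TermClause[], modelId: string, queryText: string) {
  return {
    bool: {
      must_not: terms,
      must: [
        {
          text_expansion: {
            'vector.tokens': { model_id: modelId, model_text: queryText },
          },
        },
      ],
    },
  };
}

// Hypothetical helper: a terms search that only returns docs matching all of the terms.
function toTermsSearchQuery(terms: TermClause[]) {
  return { bool: { must: terms } };
}

const vectorQuery = toVectorSearchQuery(
  requiredEsqlTerms,
  '.elser_model_2',
  'Generate an ES|QL query (abbreviated for this sketch)'
);
const termsQuery = toTermsSearchQuery(requiredEsqlTerms);

console.log(JSON.stringify({ vectorQuery, termsQuery }, null, 2));
```

Sharing one list of terms keeps the two searches complementary: whatever the terms search is asked to return, the vector search is asked to leave out.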
@@ -0,0 +1,28 @@
/*
* Copyright Elasticsearch B.V. and/or licensed to Elasticsearch B.V. under one
* or more contributor license agreements. Licensed under the Elastic License
* 2.0; you may not use this file except in compliance with the Elastic License
* 2.0.
*/

import type { QueryDslQueryContainer } from '@elastic/elasticsearch/lib/api/types';

/**
* This Elasticsearch query DSL is a terms search for required `esql` KB docs
*/
export const mockTermsSearchQuery: QueryDslQueryContainer = {
bool: {
must: [
{
term: {
'metadata.kbResource': 'esql',
},
},
{
term: {
'metadata.required': true,
},
},
],
},
};

0 comments on commit 74ae54d
