[8.11] [Security Solution] [Elastic AI Assistant] Hybrid (vector + terms) search for improved ES|QL query generation (#168995) #169054

Merged
merged 1 commit into from
Oct 17, 2023


kibanamachine
Contributor

Backport

This will backport the following commits from main to 8.11:

Questions?

Please refer to the Backport tool documentation

…arch for improved ES|QL query generation (elastic#168995)

## [Security Solution] [Elastic AI Assistant] Hybrid (vector + terms) search for improved ES|QL query generation

This PR implements hybrid (vector + terms) search to improve the quality of `ES|QL` queries generated by the Elastic AI Assistant.

The hybrid search combines the following in a single request to Elasticsearch:

- Vector search results from ELSER that vary depending on the query specified by the user
- Terms search results that return a set of Knowledge Base (KB) documents marked as "required" for a topic

The hybrid search results, when provided as context to an LLM, improve the quality of generated `ES|QL` queries. They combine `ES|QL` parser grammar and documentation specific to the question asked by the user with additional examples of valid `ES|QL` queries that aren't specific to that question.

## Details

### Indexing additional `metadata`

The `loadESQL` function in `x-pack/plugins/elastic_assistant/server/lib/langchain/content_loaders/esql_loader.ts` loads a directory containing 13 valid examples and one invalid example of `ES|QL` queries:

```typescript
    const rawExampleQueries = await exampleQueriesLoader.load();

    // Add additional metadata to the example queries that indicates they are required KB documents:
    const requiredExampleQueries = addRequiredKbResourceMetadata({
      docs: rawExampleQueries,
      kbResource: ESQL_RESOURCE,
    });
```

The `addRequiredKbResourceMetadata` function adds two additional fields to the `metadata` property of the document:

- `kbResource` - a `keyword` field that specifies the category of knowledge, e.g. `esql`
- `required` - a `boolean` field that, when `true`, indicates the document should be returned in all searches for the `kbResource`
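
The body of `addRequiredKbResourceMetadata` is not shown in this description. As a minimal sketch only, assuming each document exposes `pageContent` and `metadata` properties (the `KbDocument` interface below is a hypothetical stand-in, not LangChain's actual type), the helper could merge the two fields into each document's existing metadata like this:

```typescript
// Hypothetical stand-in for the document type; assumes only these two properties matter here.
interface KbDocument {
  pageContent: string;
  metadata: Record<string, unknown>;
}

// Sketch of the helper described above. The real implementation in the PR may differ.
function addRequiredKbResourceMetadata({
  docs,
  kbResource,
}: {
  docs: KbDocument[];
  kbResource: string;
}): KbDocument[] {
  // Return new documents rather than mutating the inputs, preserving any existing metadata.
  return docs.map((doc) => ({
    ...doc,
    metadata: {
      ...doc.metadata,
      kbResource,
      required: true,
    },
  }));
}

// Usage: the existing `source` entry is kept, and the two new fields are added.
const exampleDocs = addRequiredKbResourceMetadata({
  docs: [
    {
      pageContent: 'FROM logs-* | LIMIT 10',
      metadata: { source: 'esql_example_query_0001.asciidoc' },
    },
  ],
  kbResource: 'esql',
});
console.log(exampleDocs[0].metadata);
```

Returning copies keeps the loader's raw documents untouched, which makes it easier to index the same source files with and without the `required` flag.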

The additional metadata fields are shown in the following abridged sample document:

```
{
  "_index": ".kibana-elastic-ai-assistant-kb",
  "_id": "e297e2d9-fb0e-4638-b4be-af31d1b31b9f",
  "_version": 1,
  "_seq_no": 129,
  "_primary_term": 1,
  "found": true,
  "_source": {
    "metadata": {
      "source": "/Users/andrew.goldstein/Projects/forks/andrew-goldstein/kibana/x-pack/plugins/elastic_assistant/server/knowledge_base/esql/example_queries/esql_example_query_0001.asciidoc",
      "required": true,
      "kbResource": "esql"
    },
    "vector": {
      "tokens": {
        "serial": 0.5612584,
        "syntax": 0.006727545,
        "user": 1.1184403,
        // ...additional tokens
      },
      "model_id": ".elser_model_2"
    },
    "text": """[[esql-example-queries]]

The following is an example ES|QL query:

\`\`\`
FROM logs-*
| WHERE NOT CIDR_MATCH(destination.ip, "10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")
| STATS destcount = COUNT(destination.ip) by user.name, host.name
| ENRICH ldap_lookup_new ON user.name
| WHERE group.name IS NOT NULL
| EVAL follow_up = CASE(
    destcount >= 100, "true",
     "false")
| SORT destcount desc
| KEEP destcount, host.name, user.name, group.name, follow_up
\`\`\`
"""
  }
}
```

### Hybrid search

The `ElasticsearchStore.similaritySearch` function is invoked by LangChain's `VectorStoreRetriever.getRelevantDocuments` function when the `RetrievalQAChain` searches for documents.

A single [msearch](https://www.elastic.co/guide/en/elasticsearch/reference/current/search-multi-search.html) request to Elasticsearch performs the hybrid search, combining the vector and terms searches:

```typescript
    // requiredDocs is an array of filters that can be used in a `bool` Elasticsearch DSL query to filter in/out required KB documents:
    const requiredDocs = getRequiredKbDocsTermsQueryDsl(this.kbResource);

    // The `k` parameter is typically provided by LangChain's `VectorStoreRetriever._getRelevantDocuments`, which calls this function:
    const vectorSearchQuerySize = k ?? FALLBACK_SIMILARITY_SEARCH_SIZE;

    // build a vector search query:
    const vectorSearchQuery = getVectorSearchQuery({
      filter,
      modelId: this.model,
      mustNotTerms: requiredDocs,
      query,
    });

    // build a (separate) terms search query:
    const termsSearchQuery = getTermsSearchQuery(requiredDocs);

    // combine the vector search query and the terms search queries into a single multi-search query:
    const mSearchQueryBody = getMsearchQueryBody({
      index: this.index,
      termsSearchQuery,
      termsSearchQuerySize: TERMS_QUERY_SIZE,
      vectorSearchQuery,
      vectorSearchQuerySize,
    });

    try {
      // execute both queries via a single multi-search request:
      const result = await this.esClient.msearch<MsearchResponse>(mSearchQueryBody);

      // flatten the results of the combined queries into a single array of hits:
      const results: FlattenedHit[] = result.responses.flatMap((response) =>
      // ...
```
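
The helper functions called in the snippet above are not shown in this description. The following is a rough sketch of what they might build, not the PR's actual code. It assumes the field names from the sample document (`metadata.kbResource`, `metadata.required`, `vector.tokens`), an ELSER `text_expansion` query for the vector search, and the alternating header/body layout of an msearch request. The function bodies, the `QueryClause`, `Hit`, and `MsearchResponseItem` types, the `flattenHits` helper, and the sizes used in the usage example are all hypothetical.

```typescript
type QueryClause = Record<string, unknown>;

// Term filters identifying the required KB documents for a resource.
const getRequiredKbDocsTermsQueryDsl = (kbResource: string): QueryClause[] => [
  { term: { 'metadata.kbResource': kbResource } },
  { term: { 'metadata.required': true } },
];

// Vector (ELSER) search that excludes documents matching ALL of the required-doc terms,
// so required documents are only returned by the terms search.
const getVectorSearchQuery = ({
  filter,
  modelId,
  mustNotTerms,
  query,
}: {
  filter?: QueryClause[];
  modelId: string;
  mustNotTerms: QueryClause[];
  query: string;
}): QueryClause => ({
  bool: {
    must: [{ text_expansion: { 'vector.tokens': { model_id: modelId, model_text: query } } }],
    must_not: [{ bool: { filter: mustNotTerms } }],
    ...(filter ? { filter } : {}),
  },
});

// Terms search that returns only the required documents.
const getTermsSearchQuery = (mustTerms: QueryClause[]): QueryClause => ({
  bool: { must: mustTerms },
});

// An msearch body alternates a header line ({ index }) with a search body.
const getMsearchQueryBody = ({
  index,
  termsSearchQuery,
  termsSearchQuerySize,
  vectorSearchQuery,
  vectorSearchQuerySize,
}: {
  index: string;
  termsSearchQuery: QueryClause;
  termsSearchQuerySize: number;
  vectorSearchQuery: QueryClause;
  vectorSearchQuerySize: number;
}) => ({
  body: [
    { index },
    { query: vectorSearchQuery, size: vectorSearchQuerySize },
    { index },
    { query: termsSearchQuery, size: termsSearchQuerySize },
  ],
});

// Hypothetical minimal response shapes, used only to illustrate the flattening step.
interface Hit {
  _id: string;
}
interface MsearchResponseItem {
  hits?: { hits: Hit[] };
}

// Merge the hits of every sub-search into one array; failed sub-searches contribute nothing.
const flattenHits = (responses: MsearchResponseItem[]): Hit[] =>
  responses.flatMap((response) => response.hits?.hits ?? []);

// Usage with illustrative values (the sizes here are assumptions, not the PR's constants):
const sketchRequiredDocs = getRequiredKbDocsTermsQueryDsl('esql');
const sketchBody = getMsearchQueryBody({
  index: '.kibana-elastic-ai-assistant-kb',
  termsSearchQuery: getTermsSearchQuery(sketchRequiredDocs),
  termsSearchQuerySize: 100,
  vectorSearchQuery: getVectorSearchQuery({
    modelId: '.elser_model_2',
    mustNotTerms: sketchRequiredDocs,
    query: 'count connections to external IP addresses by user',
  }),
  vectorSearchQuerySize: 4,
});
console.log(JSON.stringify(sketchBody, null, 2));
```

Excluding the required documents from the vector search keeps the two result sets disjoint, so the flattened array needs no de-duplication before it is handed to the LLM as context.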

## Desk testing

1. Delete any previous instances of the Knowledge Base by executing the following query in Kibana's `Dev Tools`:

```
DELETE .kibana-elastic-ai-assistant-kb
```

2. In the Security Solution, open the Elastic AI Assistant

3. In the assistant, click the `Settings` gear

4. Click the `Knowledge Base` icon to view the KB settings

5. Toggle the `Knowledge Base` setting `off` if it's already on

6. Toggle the `Knowledge Base` setting `on` to load the KB documents

7. Click the `Save` button to close settings

8. Enter the following prompt, then press Enter:

```
Generate an ES|QL query that will count the number of connections made to external IP addresses, broken down by user. If the count is greater than 100 for a specific user, add a new field called "follow_up" that contains a value of "true", otherwise, it should contain "false". The user names should also be enriched with their respective group names.
```

**Expected result**

A response similar to the following is returned:

```
FROM logs-*
| WHERE NOT CIDR_MATCH(destination.ip, "10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")
| STATS destcount = COUNT(destination.ip) BY user.name
| ENRICH ldap_lookup ON user.name
| EVAL follow_up = CASE(
    destcount >= 100, "true",
    "false")
| SORT destcount DESC
| KEEP destcount, user.name, group.name, follow_up
```

(cherry picked from commit d0e9925)
@github-actions
Contributor

Documentation preview:

@kibana-ci
Collaborator

kibana-ci commented Oct 17, 2023

💔 Build Failed

Failed CI Steps

Metrics [docs]

Async chunks

Total size of all lazy-loaded chunks that will be downloaded as the user navigates the app

| id | before | after | diff |
| --- | --- | --- | --- |
| `securitySolution` | 13.0MB | 13.0MB | +8.0B |

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

cc @andrew-goldstein

@kibanamachine kibanamachine merged commit 74ae54d into elastic:8.11 Oct 17, 2023
4 checks passed