# [Obs AI Assistant] Improve LLM evaluation framework (#204574)
Closes #203122

## Summary

### Problem
The Obs AI Assistant LLM evaluation framework cannot currently be run successfully from the
`main` branch, and several scenarios are missing.

Problems identified:
- Unable to run the evaluation with a local Elasticsearch instance
- Alerts and APM results are skipped entirely when reporting the final
result on the terminal (due to consistent failures in the tests)
- State contamination between runs makes the script throw errors when it is
run multiple times.
- Authentication issues when calling `/internal` APIs

### Solution
As part of spacetime, I worked on fixing the current issues in the LLM
evaluation framework and on improving and extending it.

#### Fixes
| Problem | Root cause | Fixed? |
|---------|------------|--------|
| Unable to run with a local Elasticsearch instance | Service URLs were not picking up the correct auth because of the format specified in `kibana.dev.yml` | ✅ |
| Alerts and APM results skipped in the final report | Most (if not all) tests in the alerts and APM suites were failing, so no final results were reported | ✅ (all test scenarios fixed) |
| State contamination between runs | Some `after` hooks were not running successfully because of an error in the `callKibana` method | ✅ |
| Authentication issues when calling `/internal` APIs | The required headers were not present in the request (see the sketch below) | ✅ |

#### Enhancements / Improvements

| What was added | How it enhances the framework |
|----------------|-------------------------------|
| New KB retrieval test in the KB scenario | More scenarios covered |
| New scenario for the `retrieve_elastic_doc` function | Covers newly added functions that were previously missing |
| Correct scope applied per scenario (see the sketch after this table) | The scope determines the wording of the system message. Certain scenarios need to be scoped to observability (e.g. `alerts`) to produce the best result. Previously all scenarios used the scope `all`, which is not ideal and doesn't align with the actual functionality of the AI Assistant |
| Guard rails to avoid throwing unnecessary errors on the console (e.g. not creating a data view if it already exists) | Makes it easier to navigate the results printed on the terminal |
| Improved README | Easier to configure and use the framework, with all available options documented |
| Improved logging | Easier to navigate the terminal output |
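The scope change is easiest to see from a scenario's point of view. The sketch below uses the object-based `complete()` signature (`CompleteFunctionParams`) introduced in this PR; the import path, prompts, and evaluation criteria are illustrative assumptions rather than actual scenario code.

```ts
import type { ChatClient } from './kibana_client'; // path assumed for illustration

// Sketch of a scenario pinning its scope instead of relying on the old 'all' default.
export async function runAlertScenario(chatClient: ChatClient) {
  const result = await chatClient.complete({
    messages: 'Do I have any active alerts for my services?', // illustrative prompt
    scope: 'observability', // alert scenarios produce better answers with this scope
  });

  // A follow-up turn can reuse the conversation and keep the same scope;
  // the old positional arguments ([messages], [conversationId, messages, options], ...)
  // are replaced by this single params object.
  await chatClient.complete({
    messages: 'Summarize the most critical one.',
    conversationId: result.conversationId,
    scope: 'observability',
  });

  // Evaluation criteria are illustrative.
  return chatClient.evaluate(result, ['Lists the currently active alerts for the user services']);
}
```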

### Checklist

- [x] The PR description includes the appropriate Release Notes section,
and the correct `release_note:*` label is applied per the
[guidelines](https://www.elastic.co/guide/en/kibana/master/contributing.html#kibana-release-notes-process)

---------

Co-authored-by: kibanamachine <[email protected]>
viduni94 and kibanamachine authored Dec 31, 2024
1 parent 6a25db9 commit 38310a5
Showing 10 changed files with 422 additions and 196 deletions.
@@ -2,7 +2,7 @@

## Overview

This tool is developed for our team working on the Elastic Observability platform, specifically focusing on evaluating the Observability AI Assistant. It simplifies scripting and evaluating various scenarios with the Large Language Model (LLM) integration.
This tool is developed for our team working on the Elastic Observability platform, specifically focusing on evaluating the Observability AI Assistant. It simplifies scripting and evaluating various scenarios with Large Language Model (LLM) integrations.

## Setup requirements

@@ -12,26 +12,40 @@ This tool is developed for our team working on the Elastic Observability platform

## Running evaluations

Run the tool using:

`$ node x-pack/solutions/observability/plugins/observability_solution/observability_ai_assistant_app/scripts/evaluation/index.js`

This will evaluate all existing scenarios, and write the evaluation results to the terminal.

### Configuration

#### Kibana and Elasticsearch

By default, the tool will look for a Kibana instance running locally (at `http://localhost:5601`, which is the default address for running Kibana in development mode). It will also attempt to read the Kibana config file for the Elasticsearch address & credentials. If you want to override these settings, use `--kibana` and `--es`. Only basic auth is supported, e.g. `--kibana http://username:password@localhost:5601`. If you want to use a specific space, use `--spaceId`
#### To run the evaluation using a local Elasticsearch and Kibana instance:

#### Connector
- Run Elasticsearch locally: `yarn es snapshot --license trial`
- Start Kibana (Default address for Kibana in dev mode: `http://localhost:5601`)
- Run this command to start evaluating:
`$ node x-pack/solutions/observability/plugins/observability_ai_assistant_app/scripts/evaluation/index.js`

Use `--connectorId` to specify a `.gen-ai` or `.bedrock` connector to use. If none are given, it will prompt you to select a connector based on the ones that are available. If only a single supported connector is found, it will be used without prompting.

#### Persisting conversations

By default, completed conversations are not persisted. If you do want to persist them, for instance for reviewing purposes, set the `--persist` flag to store them. This will also generate a clickable link in the output of the evaluation that takes you to the conversation.

If you want to clear conversations on startup, use the `--clear` flag. This only works when `--persist` is enabled. If `--spaceId` is set, only conversations for the current space will be cleared.
This will evaluate all existing scenarios, and write the evaluation results to the terminal.

When storing conversations, the name of the scenario is used as a title. Set the `--autoTitle` flag to have the LLM generate a title for you.
#### To run the evaluation using a hosted deployment:
- Add the credentials of Elasticsearch to `kibana.dev.yml` as follows:
```
elasticsearch.hosts: https://<hosted-url>:<port>
elasticsearch.username: <username>
elasticsearch.password: <password>
elasticsearch.ssl.verificationMode: none
elasticsearch.ignoreVersionMismatch: true
```
- Start Kibana
- Run this command to start evaluating: `node x-pack/solutions/observability/plugins/observability_ai_assistant_app/scripts/evaluation/index.js --kibana http://<username>:<password>@localhost:5601`

By default, the script will use the Elasticsearch credentials specified in `kibana.dev.yml`. If you want to override them, use the `--es` flag when running the evaluation script:
E.g.: `node x-pack/solutions/observability/plugins/observability_ai_assistant_app/scripts/evaluation/index.js --kibana http://<username>:<password>@localhost:5601 --es https://<username>:<password>@<hosted-url>:<port>`

The `--kibana` and `--es` flags override the default credentials. Only basic auth is supported.

## Other (optional) configuration flags
- `--connectorId` - Specify a generative AI connector to use. If none are given, it will prompt you to select a connector based on the ones that are available. If only a single supported connector is found, it will be used without prompting.
- `--evaluateWith`: The connector ID to evaluate with. Leave empty to use the same connector, use "other" to get a selection menu.
- `--spaceId` - Specify the space ID if you want to use a specific space.
- `--persist` - By default, completed conversations are not persisted. If you want to persist them, for instance for reviewing purposes, include this flag when running the evaluation script. This will also generate a clickable link in the output of the evaluation that takes you to the conversation in Kibana.
- `--clear` - If you want to clear conversations on startup, include this flag when running the evaluation script. This only works when `--persist` is enabled. If `--spaceId` is set, only conversations for the current space will be cleared.
- `--autoTitle`: When storing conversations, the name of the scenario is used as a title. Set this flag to have the LLM generate a title for you. This only works when `--persist` is enabled.
- `--files`: A file or list of files containing the scenarios to evaluate. Defaults to all.
- `--grep`: A string or regex to filter scenarios by.
@@ -37,6 +37,8 @@ function runEvaluations() {
kibana: argv.kibana,
});

log.info(`Elasticsearch URL: ${serviceUrls.esUrl}`);

const kibanaClient = new KibanaClient(log, serviceUrls.kibanaUrl, argv.spaceId);
const esClient = new Client({
node: serviceUrls.esUrl,
@@ -100,7 +102,7 @@ function runEvaluations() {
evaluationConnectorId: evaluationConnector.id!,
persist: argv.persist,
suite: mocha.suite,
scopes: ['all'],
scopes: ['observability'],
});

const header: string[][] = [
@@ -26,7 +26,7 @@ import { Message, MessageRole } from '@kbn/observability-ai-assistant-plugin/common';
import { streamIntoObservable } from '@kbn/observability-ai-assistant-plugin/server';
import { ToolingLog } from '@kbn/tooling-log';
import axios, { AxiosInstance, AxiosResponse, isAxiosError } from 'axios';
import { isArray, omit, pick, remove } from 'lodash';
import { omit, pick, remove } from 'lodash';
import pRetry from 'p-retry';
import {
concatMap,
@@ -59,13 +59,14 @@ interface Options {
screenContexts?: ObservabilityAIAssistantScreenContext[];
}

type CompleteFunction = (
...args:
| [StringOrMessageList]
| [StringOrMessageList, Options]
| [string | undefined, StringOrMessageList]
| [string | undefined, StringOrMessageList, Options]
) => Promise<{
interface CompleteFunctionParams {
messages: StringOrMessageList;
conversationId?: string;
options?: Options;
scope?: AssistantScope;
}

type CompleteFunction = (params: CompleteFunctionParams) => Promise<{
conversationId?: string;
messages: InnerMessage[];
errors: ChatCompletionErrorEvent[];
@@ -74,7 +75,6 @@ type CompleteFunction = (
export interface ChatClient {
chat: (message: StringOrMessageList) => Promise<InnerMessage>;
complete: CompleteFunction;

evaluate: (
{}: { conversationId?: string; messages: InnerMessage[]; errors: ChatCompletionErrorEvent[] },
criteria: string[]
@@ -124,10 +124,10 @@ export class KibanaClient {
return this.axios<T>({
method,
url,
data: data || {},
...(method.toLowerCase() === 'delete' && !data ? {} : { data: data || {} }),
headers: {
'kbn-xsrf': 'true',
'x-elastic-internal-origin': 'foo',
'x-elastic-internal-origin': 'Kibana',
},
}).catch((error) => {
if (isAxiosError(error)) {
@@ -148,7 +148,7 @@ export class KibanaClient {
}

async installKnowledgeBase() {
this.log.debug('Checking to see whether knowledge base is installed');
this.log.info('Checking whether the knowledge base is installed');

const {
data: { ready },
@@ -157,7 +157,7 @@ export class KibanaClient {
});

if (ready) {
this.log.info('Knowledge base is installed');
this.log.success('Knowledge base is already installed');
return;
}

@@ -176,15 +176,15 @@ export class KibanaClient {
{ retries: 10 }
);

this.log.info('Knowledge base installed');
this.log.success('Knowledge base installed');
}

async createSpaceIfNeeded() {
if (!this.spaceId) {
return;
}

this.log.debug(`Checking if space ${this.spaceId} exists`);
this.log.info(`Checking if space ${this.spaceId} exists`);

const spaceExistsResponse = await this.callKibana<{
id?: string;
@@ -204,7 +204,7 @@ export class KibanaClient {
});

if (spaceExistsResponse.data.id) {
this.log.debug(`Space id ${this.spaceId} found`);
this.log.success(`Space id ${this.spaceId} found`);
return;
}

@@ -223,14 +223,26 @@ export class KibanaClient {
);

if (spaceCreatedResponse.status === 200) {
this.log.info(`Created space ${this.spaceId}`);
this.log.success(`Created space ${this.spaceId}`);
} else {
throw new Error(
`Error creating space: ${spaceCreatedResponse.status} - ${spaceCreatedResponse.data}`
);
}
}

getMessages(message: string | Array<Message['message']>): Array<Message['message']> {
if (typeof message === 'string') {
return [
{
content: message,
role: MessageRole.User,
},
];
}
return message;
}

createChatClient({
connectorId,
evaluationConnectorId,
@@ -244,22 +256,11 @@ export class KibanaClient {
suite?: Mocha.Suite;
scopes: AssistantScope[];
}): ChatClient {
function getMessages(message: string | Array<Message['message']>): Array<Message['message']> {
if (typeof message === 'string') {
return [
{
content: message,
role: MessageRole.User,
},
];
}
return message;
}

const that = this;

let currentTitle: string = '';
let firstSuiteName: string = '';
let currentScopes = scopes;

if (suite) {
suite.beforeEach(function () {
@@ -362,23 +363,27 @@ export class KibanaClient {
that.log.info('Chat', name);

const chat$ = defer(() => {
that.log.debug(`Calling chat API`);
that.log.info('Calling the /chat API');
const params: ObservabilityAIAssistantAPIClientRequestParamsOf<'POST /internal/observability_ai_assistant/chat'>['params']['body'] =
{
name,
messages,
connectorId: connectorIdOverride || connectorId,
functions: functions.map((fn) => pick(fn, 'name', 'description', 'parameters')),
functionCall,
scopes,
scopes: currentScopes,
};

return that.axios.post(
that.getUrl({
pathname: '/internal/observability_ai_assistant/chat',
}),
params,
{ responseType: 'stream', timeout: NaN }
{
responseType: 'stream',
timeout: NaN,
headers: { 'x-elastic-internal-origin': 'Kibana' },
}
);
}).pipe(
switchMap((response) => streamIntoObservable(response.data)),
@@ -400,54 +405,33 @@ export class KibanaClient {
return {
chat: async (message) => {
const messages = [
...getMessages(message).map((msg) => ({
...this.getMessages(message).map((msg) => ({
message: msg,
'@timestamp': new Date().toISOString(),
})),
];
return chat('chat', { messages, functions: [] });
},
complete: async (...args) => {
that.log.info(`Complete`);
let messagesArg: StringOrMessageList | undefined;
let conversationId: string | undefined;
let options: Options = {};

function isMessageList(arg: any): arg is StringOrMessageList {
return isArray(arg) || typeof arg === 'string';
}
complete: async ({
messages: messagesArg,
conversationId,
options = {},
scope: newScope,
}: CompleteFunctionParams) => {
that.log.info('Calling complete');

// | [StringOrMessageList]
// | [StringOrMessageList, Options]
// | [string, StringOrMessageList]
// | [string, StringOrMessageList, Options]
if (args.length === 1) {
messagesArg = args[0];
} else if (args.length === 2 && !isMessageList(args[1])) {
messagesArg = args[0];
options = args[1];
} else if (
args.length === 2 &&
(typeof args[0] === 'string' || typeof args[0] === 'undefined') &&
isMessageList(args[1])
) {
conversationId = args[0];
messagesArg = args[1];
} else if (args.length === 3) {
conversationId = args[0];
messagesArg = args[1];
options = args[2];
}
// set scope
currentScopes = [newScope || 'observability'];

const messages = [
...getMessages(messagesArg!).map((msg) => ({
...this.getMessages(messagesArg!).map((msg) => ({
message: msg,
'@timestamp': new Date().toISOString(),
})),
];

const stream$ = defer(() => {
that.log.debug(`Calling /chat/complete API`);
that.log.info(`Calling /chat/complete API`);
return from(
that.axios.post(
that.getUrl({
Expand All @@ -460,9 +444,13 @@ export class KibanaClient {
connectorId,
persist,
title: currentTitle,
scopes,
scopes: currentScopes,
},
{ responseType: 'stream', timeout: NaN }
{
responseType: 'stream',
timeout: NaN,
headers: { 'x-elastic-internal-origin': 'Kibana' },
}
)
);
}).pipe(
@@ -615,7 +603,7 @@ export class KibanaClient {
})
.concat({
score: errors.length === 0 ? 1 : 0,
criterion: 'The conversation encountered errors',
criterion: 'The conversation did not encounter any errors',
reasoning: errors.length
? `The following errors occurred: ${errors.map((error) => error.error.message)}`
: 'No errors occurred',