[Obs AI Assistant] Evaluation framework #173010
Merged: dgieselaar merged 18 commits into elastic:main from dgieselaar:obs-ai-assistant-evaluation-framework on Dec 13, 2023
Changes from 14 commits
Commits
597a22f [Obs AI Assistant] Abort controller when component unmounts (dgieselaar)
90aad9b [Obs AI Assistant] Evaluation framework (dgieselaar)
1a18f39 Merge branch 'main' of github.com:elastic/kibana into obs-ai-assistan… (dgieselaar)
6c9a3fc README.md (dgieselaar)
2c6ad29 Add --grep option (dgieselaar)
5fbd4a3 Fix types (dgieselaar)
d908e35 [CI] Auto-commit changed files from 'node scripts/lint_ts_projects --… (kibanamachine)
76d8734 Add --spaceId option (dgieselaar)
c6859ca Merge branch 'obs-ai-assistant-evaluation-framework' of github.com:dg… (dgieselaar)
97fe1ac Replace glob with fast-glob to prevent type errors (dgieselaar)
a1edd53 [CI] Auto-commit changed files from 'node scripts/eslint --no-cache -… (kibanamachine)
decc8b7 Newlines (dgieselaar)
afa5235 Merge branch 'obs-ai-assistant-evaluation-framework' of github.com:dg… (dgieselaar)
beaf320 Mock logger.debug() in tests (dgieselaar)
385da0a Lockfile changes (dgieselaar)
159704e Merge branch 'main' into obs-ai-assistant-evaluation-framework (kibanamachine)
c32b8be linting errors (dgieselaar)
047a404 Merge branch 'obs-ai-assistant-evaluation-framework' of github.com:dg… (dgieselaar)
37 changes: 37 additions & 0 deletions
x-pack/plugins/observability_ai_assistant/scripts/evaluation/README.md
@@ -0,0 +1,37 @@
# Observability AI Assistant Evaluation Framework

## Overview

This tool is developed for our team working on the Elastic Observability platform, specifically focusing on evaluating the Observability AI Assistant. It simplifies scripting and evaluating various scenarios with the Large Language Model (LLM) integration.

## Setup requirements

- An Elasticsearch instance
- A Kibana instance
- At least one .gen-ai connector set up

## Running evaluations

Run the tool using:

`$ node x-pack/plugins/observability_ai_assistant/scripts/evaluation/index.js`

This will evaluate all existing scenarios, and write the evaluation results to the terminal.

### Configuration

#### Kibana and Elasticsearch

By default, the tool will look for a Kibana instance running locally (at `http://localhost:5601`, which is the default address for running Kibana in development mode). It will also attempt to read the Kibana config file for the Elasticsearch address and credentials. If you want to override these settings, use `--kibana` and `--es`. Only basic auth is supported, e.g. `--kibana http://username:password@localhost:5601`. If you want to use a specific space, use `--spaceId`.

#### Connector

Use `--connectorId` to specify a `.gen-ai` connector to use. If none is given, it will prompt you to select a connector based on the ones that are available. If only a single `.gen-ai` connector is found, it will be used without prompting.

#### Persisting conversations

By default, completed conversations are not persisted. If you do want to persist them, for instance for review, set the `--persist` flag to store them. This will also generate a clickable link in the output of the evaluation that takes you to the conversation.

If you want to clear conversations on startup, use the `--clear` flag. This only works when `--persist` is enabled. If `--spaceId` is set, only conversations for the current space will be cleared.

When storing conversations, the name of the scenario is used as a title. Set the `--autoTitle` flag to have the LLM generate a title for you.
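As a rough illustration of the basic-auth URL convention the README describes (`http://username:password@localhost:5601`), the sketch below splits such a URL into a credential-free base URL and an optional username/password pair. `splitBasicAuth` and `ServiceTarget` are hypothetical names used only for this example; the PR's own handling lives in `get_service_urls` and `kibana_client`, which are not shown in this diff.

```typescript
// Hypothetical helper, not part of the PR: separates basic-auth credentials
// from a service URL such as http://username:password@localhost:5601.
interface ServiceTarget {
  baseUrl: string;
  auth?: { username: string; password: string };
}

function splitBasicAuth(input: string): ServiceTarget {
  const url = new URL(input);

  // The URL API keeps credentials percent-encoded, so decode them.
  const username = decodeURIComponent(url.username);
  const password = decodeURIComponent(url.password);

  // Clear the credentials so they do not end up in logs or generated links.
  url.username = '';
  url.password = '';

  // Serializing a URL with an empty path yields a trailing slash; trim it.
  const baseUrl = url.toString().replace(/\/$/, '');

  return username ? { baseUrl, auth: { username, password } } : { baseUrl };
}

const target = splitBasicAuth('http://elastic:changeme@localhost:5601');
console.log(target.baseUrl, target.auth?.username);
```

Keeping the credential-free form separate is useful for anything that gets printed, such as the conversation link that `--persist` adds to the output.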
78 changes: 78 additions & 0 deletions
x-pack/plugins/observability_ai_assistant/scripts/evaluation/cli.ts
@@ -0,0 +1,78 @@
/*
 * Copyright Elasticsearch B.V. and/or licensed to Elasticsearch B.V. under one
 * or more contributor license agreements. Licensed under the Elastic License
 * 2.0; you may not use this file except in compliance with the Elastic License
 * 2.0.
 */
import { format, parse } from 'url';
import { Argv } from 'yargs';
import { readKibanaConfig } from './read_kibana_config';

export function options(y: Argv) {
  const config = readKibanaConfig();

  return y
    .option('files', {
      string: true as const,
      array: true,
      describe: 'A file or list of files containing the scenarios to evaluate. Defaults to all',
    })
    .option('grep', {
      string: true,
      array: false,
      describe: 'A string or regex to filter scenarios by',
    })
    .option('kibana', {
      describe: 'Where Kibana is running',
      string: true,
      default: process.env.KIBANA_HOST || 'http://localhost:5601',
    })
    .option('spaceId', {
      describe:
        'The space to use. If space is set, conversations will only be cleared for that spaceId',
      string: true,
      array: false,
    })
    .option('elasticsearch', {
      alias: 'es',
      describe: 'Where Elasticsearch is running',
      string: true,
      default: format({
        ...parse(config['elasticsearch.hosts']),
        auth: `${config['elasticsearch.username']}:${config['elasticsearch.password']}`,
      }),
    })
    .option('connectorId', {
      describe: 'The ID of the connector',
      string: true,
    })
    .option('persist', {
      describe:
        'Whether the conversations should be stored. Adding this will generate a link at which the conversation can be opened.',
      boolean: true,
      default: false,
    })
    .option('clear', {
      describe: 'Clear conversations on startup',
      boolean: true,
      default: false,
    })
    .option('autoTitle', {
      describe: 'Whether to generate titles for new conversations',
      boolean: true,
      default: false,
    })
    .option('logLevel', {
      describe: 'Log level',
      default: 'info',
    })
    .check((argv) => {
      if (!argv.persist && argv.clear) {
        throw new Error('clear cannot be true if persist is false');
      }
      if (!argv.persist && argv.autoTitle) {
        throw new Error('autoTitle cannot be true if persist is false');
      }
      return true;
    });
}
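The `.check()` callback above encodes two constraints: `--clear` and `--autoTitle` are only meaningful together with `--persist`. A minimal sketch of the same rules as a standalone function, which can be exercised without yargs, might look like the following. `EvaluationFlags` and `findFlagConflicts` are hypothetical names introduced for this example, and unlike the PR code (which throws on the first violation) the sketch collects every violation.

```typescript
// Hypothetical, yargs-free restatement of the flag constraints in cli.ts.
interface EvaluationFlags {
  persist: boolean;
  clear: boolean;
  autoTitle: boolean;
}

// Returns every violated constraint; an empty array means the flags are valid.
function findFlagConflicts(flags: EvaluationFlags): string[] {
  const conflicts: string[] = [];

  if (!flags.persist && flags.clear) {
    conflicts.push('clear cannot be true if persist is false');
  }
  if (!flags.persist && flags.autoTitle) {
    conflicts.push('autoTitle cannot be true if persist is false');
  }

  return conflicts;
}

console.log(findFlagConflicts({ persist: false, clear: true, autoTitle: false }));
```

Collecting all conflicts is a design choice for the sketch only: it lets a caller report both problems at once rather than one per run.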
202 changes: 202 additions & 0 deletions
x-pack/plugins/observability_ai_assistant/scripts/evaluation/evaluation.ts
@@ -0,0 +1,202 @@
/*
 * Copyright Elasticsearch B.V. and/or licensed to Elasticsearch B.V. under one
 * or more contributor license agreements. Licensed under the Elastic License
 * 2.0; you may not use this file except in compliance with the Elastic License
 * 2.0.
 */

import yargs from 'yargs';
import { run } from '@kbn/dev-cli-runner';
import { Client } from '@elastic/elasticsearch';
import inquirer from 'inquirer';
import * as fastGlob from 'fast-glob';
import Path from 'path';
import chalk from 'chalk';
import * as table from 'table';
import { castArray, omit, sortBy } from 'lodash';
import { TableUserConfig } from 'table';
import { format, parse } from 'url';
import { options } from './cli';
import { getServiceUrls } from './get_service_urls';
import { KibanaClient } from './kibana_client';
import { EvaluationFunction } from './types';
import { MessageRole } from '../../common';

function runEvaluations() {
  yargs(process.argv.slice(2))
    .command('*', 'Run AI Assistant evaluations', options, (argv) => {
      run(
        async ({ log }) => {
          const serviceUrls = await getServiceUrls({
            log,
            elasticsearch: argv.elasticsearch,
            kibana: argv.kibana,
          });

          const kibanaClient = new KibanaClient(serviceUrls.kibanaUrl, argv.spaceId);
          const esClient = new Client({
            node: serviceUrls.esUrl,
          });

          const connectors = await kibanaClient.getConnectors();

          if (!connectors.length) {
            throw new Error('No connectors found');
          }

          let connector = connectors.find((item) => item.id === argv.connectorId);

          if (!connector && argv.connectorId) {
            log.warning(`Could not find connector ${argv.connectorId}`);
          }

          if (!connector && connectors.length === 1) {
            connector = connectors[0];
            log.debug('Using the only connector found');
          } else if (!connector) {
            const connectorChoice = await inquirer.prompt({
              type: 'list',
              name: 'connector',
              message: 'Select a connector',
              choices: connectors.map((item) => item.name),
            });

            connector = connectors.find((item) => item.name === connectorChoice.connector)!;
          }

          log.info(`Using connector ${connector.id}`);

          const scenarios =
            (argv.files !== undefined &&
              castArray(argv.files).map((file) => Path.join(process.cwd(), file))) ||
            fastGlob.sync(Path.join(__dirname, './scenarios/**/*.ts'));

          if (!scenarios.length) {
            throw new Error('No scenarios to run');
          }

          if (argv.clear) {
            log.info('Clearing conversations');
            await esClient.deleteByQuery({
              index: '.kibana-observability-ai-assistant-conversations',
              query: {
                ...(argv.spaceId ? { term: { namespace: argv.spaceId } } : { match_all: {} }),
              },
              refresh: true,
            });
          }

          let evaluationFunctions: Array<{
            name: string;
            fileName: string;
            fn: EvaluationFunction;
          }> = [];

          for (const fileName of scenarios) {
            log.info(`Running scenario ${fileName}`);
            const mod = await import(fileName);
            Object.keys(mod).forEach((key) => {
              evaluationFunctions.push({ name: key, fileName, fn: mod[key] });
            });
          }

          if (argv.grep) {
            const lc = argv.grep.toLowerCase();
            evaluationFunctions = evaluationFunctions.filter((fn) =>
              fn.name.toLowerCase().includes(lc)
            );
          }

          const header: string[][] = [
            [chalk.bold('Criterion'), chalk.bold('Result'), chalk.bold('Reasoning')],
          ];

          const tableConfig: TableUserConfig = {
            singleLine: false,
            border: {
              topBody: `─`,
              topJoin: `┬`,
              topLeft: `┌`,
              topRight: `┐`,

              bottomBody: `─`,
              bottomJoin: `┴`,
              bottomLeft: `└`,
              bottomRight: `┘`,

              bodyLeft: `│`,
              bodyRight: `│`,
              bodyJoin: `│`,

              joinBody: `─`,
              joinLeft: `├`,
              joinRight: `┤`,
              joinJoin: `┼`,
            },
            spanningCells: [
              { row: 0, col: 0, colSpan: 3 },
              { row: 1, col: 0, colSpan: 3 },
            ],
            columns: [
              { wrapWord: true, width: 60 },
              { wrapWord: true },
              { wrapWord: true, width: 60 },
            ],
          };

          const sortedEvaluationFunctions = sortBy(evaluationFunctions, 'fileName', 'name');

          for (const { name, fn } of sortedEvaluationFunctions) {
            log.debug(`Executing ${name}`);
            const result = await fn({
              esClient,
              kibanaClient,
              chatClient: kibanaClient.createChatClient({
                connectorId: connector.id!,
                persist: argv.persist,
                title: argv.autoTitle ? undefined : name,
              }),
            });
            log.debug(`Result:`, JSON.stringify(result));
            const output: string[][] = [
              [
                result.messages.find((message) => message.role === MessageRole.User)!.content!,
                '',
                '',
              ],
              result.conversationId
                ? [
                    `${format(omit(parse(serviceUrls.kibanaUrl), 'auth'))}/${
                      argv.spaceId ? `s/${argv.spaceId}/` : ''
                    }app/observabilityAIAssistant/conversations/${result.conversationId}`,
                    '',
                    '',
                  ]
                : ['', '', ''],
              ...header,
            ];

            result.scores.forEach((score) => {
              output.push([
                score.criterion,
                score.score === 0 ? chalk.redBright('failed') : chalk.greenBright('passed'),
                score.reasoning,
              ]);
            });
            log.write(table.table(output, tableConfig));
          }
        },
        {
          log: {
            defaultLevel: argv.logLevel as any,
          },
          flags: {
            allowUnexpected: true,
          },
        }
      );
    })
    .parse();
}

runEvaluations();
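The connector rules stated in the README (use `--connectorId` when it matches, use the only connector when there is exactly one, otherwise prompt) can be sketched as a pure function that is easy to test in isolation. `Connector`, `Selection`, and `selectConnector` are hypothetical names for this example and do not appear in the PR; instead of calling `inquirer`, the sketch returns the list of choices a prompt would show.

```typescript
// Hypothetical, side-effect-free restatement of the connector selection rules.
interface Connector {
  id: string;
  name: string;
}

type Selection =
  | { kind: 'selected'; connector: Connector; reason: 'by-id' | 'only-one' }
  | { kind: 'prompt'; choices: string[] };

function selectConnector(connectors: Connector[], requestedId?: string): Selection {
  if (connectors.length === 0) {
    throw new Error('No connectors found');
  }

  // 1. An explicitly requested connector wins when it exists.
  const byId = connectors.find((item) => item.id === requestedId);
  if (byId) {
    return { kind: 'selected', connector: byId, reason: 'by-id' };
  }

  // 2. With a single connector there is nothing to choose from.
  if (connectors.length === 1) {
    return { kind: 'selected', connector: connectors[0], reason: 'only-one' };
  }

  // 3. Otherwise the caller has to ask the user.
  return { kind: 'prompt', choices: connectors.map((item) => item.name) };
}

const available: Connector[] = [
  { id: 'a', name: 'Azure OpenAI' },
  { id: 'b', name: 'OpenAI' },
];
console.log(selectConnector(available, 'b'));
```

Separating the decision from the interactive prompt keeps the branch order explicit: a connector found by ID should never fall through to the prompt.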
I'm a little bit hesitant to sort the tests within the file. Why not keep them in their order of declaration?
I haven't seen any other test framework sort test cases.
We use exports, so no order is guaranteed. Ideally we would have something similar to `describe` and `it`, where the order of the statements decides the order of execution, but we need to figure out how we can use something like Mocha for this.
Right, then it makes a lot of sense to sort them until then
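For reference, the ordering discussed in this thread (by file name, then by exported name) can be reproduced without lodash. The sketch below is only an illustration under that assumption; `NamedScenario`, `compareStrings`, and `orderScenarios` are hypothetical names, and the PR itself uses `sortBy(evaluationFunctions, 'fileName', 'name')`.

```typescript
// Hypothetical stand-in for the entries collected from scenario modules.
interface NamedScenario {
  fileName: string;
  name: string;
}

// Plain code-unit comparison, so the result does not depend on the locale.
function compareStrings(a: string, b: string): number {
  return a < b ? -1 : a > b ? 1 : 0;
}

// Returns a new array ordered by fileName, then by name; the input is not mutated.
function orderScenarios<T extends NamedScenario>(items: T[]): T[] {
  return [...items].sort(
    (a, b) => compareStrings(a.fileName, b.fileName) || compareStrings(a.name, b.name)
  );
}

const ordered = orderScenarios([
  { fileName: 'b.ts', name: 'x' },
  { fileName: 'a.ts', name: 'z' },
  { fileName: 'a.ts', name: 'y' },
]);
console.log(ordered.map((item) => `${item.fileName}:${item.name}`));
```

Whatever order the module loader reports exports in, sorting on both keys makes the run order reproducible between invocations, which is the property the thread settles on until a `describe`/`it`-style registration exists.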