[Inference] Inference plugin + chatComplete API (elastic#188280)
This PR introduces an Inference plugin.

## Goals

- Provide a single place for all interactions with large language models and other generative AI adjacent tasks.
- Abstract away differences between LLM providers like OpenAI, Bedrock, and Gemini.
- Host commonly used LLM-based tasks like generating ES|QL from natural language and knowledge base recall.
- Allow us to move gradually to the `_inference` endpoint without disrupting engineers.

## Architecture and examples

![CleanShot 2024-07-14 at 14 45 27@2x](https://github.com/user-attachments/assets/e65a3e47-bce1-4dcf-bbed-4f8ac12a104f)

## Terminology

The following concepts are referenced throughout this POC:

- **chat completion**: the process in which the LLM generates the next message in the conversation. This is sometimes referred to as inference, text completion, text generation, or content generation.
- **tasks**: higher-level tasks that, based on their input, use the LLM in conjunction with other services like Elasticsearch to achieve a result. The example in this POC is natural language to ES|QL.
- **tools**: a set of tools that the LLM can choose to use when generating the next message. In essence, they allow the consumer of the API to define a schema for structured output instead of plain text, and have the LLM select the most appropriate one.
- **tool call**: when the LLM has chosen a tool (schema) to use for its output, and returns a document that matches the schema, this is referred to as a tool call.

## Usage examples

```ts
class MyPlugin {
  setup(coreSetup, pluginsSetup) {
    const router = coreSetup.http.createRouter();

    router.post(
      {
        path: '/internal/my_plugin/do_something',
        validate: {
          body: schema.object({
            connectorId: schema.string(),
          }),
        },
      },
      async (context, request, response) => {
        const [coreStart, pluginsStart] = await coreSetup.getStartServices();

        const inferenceClient = pluginsSetup.inference.getClient({ request });

        const chatComplete$ = inferenceClient.chatComplete({
          connectorId: request.body.connectorId,
          system: `Here is my system message`,
          messages: [
            {
              role: MessageRole.User,
              content: 'Do something',
            },
          ],
        });

        const message = await lastValueFrom(
          chatComplete$.pipe(withoutTokenCountEvents(), withoutChunkEvents())
        );

        return response.ok({
          body: {
            message,
          },
        });
      }
    );
  }
}
```

## Implementation

The bulk of the work here is implementing a `chatComplete` API. Here's what it does:

- Formats the request for the specific LLM that is being called (all have different API specifications).
- Executes the specified connector with the formatted request.
- Creates and returns an Observable, and starts reading from the stream.
- Every event in the stream is normalized to a format that is close to (but not exactly the same as) OpenAI's format, and emitted as a value from the Observable.
- When the stream ends, the individual events (chunks) are concatenated into a single message.
- If the LLM has called any tools, the tool call is validated according to its schema.
- After emitting the message, the Observable completes.

There's also a thin wrapper around this API, called the `output` API. It simplifies a few things:

- It doesn't require a conversation (list of messages); a simple `input` string suffices.
- You can define a schema for the output of the LLM.
- It drops the token count events that are emitted.
- It simplifies the event format (update & complete).

### Observable event streams

These APIs, both on the client and the server, return Observables that emit events.
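For illustration, here's a rough sketch of the kind of events such an Observable emits and how a consumer might reduce the stream to the final message. The event names and shapes below are assumptions made for this example rather than the plugin's actual exports; in practice, the helpers used in the usage example above (`withoutTokenCountEvents()`, `withoutChunkEvents()`) serve the same purpose.

```ts
import { filter, lastValueFrom, type Observable } from 'rxjs';

// Hypothetical event shapes for illustration only -- the plugin's real type
// names and fields may differ.
interface ChunkEvent {
  type: 'chatCompletionChunk';
  content: string;
}

interface TokenCountEvent {
  type: 'chatCompletionTokenCount';
  tokens: { prompt: number; completion: number; total: number };
}

interface MessageCompleteEvent {
  type: 'chatCompletionMessage';
  content: string;
  toolCalls: Array<{ function: { name: string; arguments: Record<string, unknown> } }>;
}

type ChatCompletionEvent = ChunkEvent | TokenCountEvent | MessageCompleteEvent;

// Drop intermediate chunks and token counts, and resolve with the final,
// concatenated message once the Observable completes.
async function getFinalMessage(chatComplete$: Observable<ChatCompletionEvent>) {
  return lastValueFrom(
    chatComplete$.pipe(
      filter((event): event is MessageCompleteEvent => event.type === 'chatCompletionMessage')
    )
  );
}
```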
When converting the Observable into a stream, the following things happen:

- Errors are caught and serialized as events sent over the stream (after an error, the stream ends).
- The response stream outputs data as [server-sent events](https://developer.mozilla.org/en-US/docs/Web/API/Server-sent_events/Using_server-sent_events).
- The client that reads the stream parses the event source as an Observable; if it encounters a serialized error, it deserializes it and throws an error in the Observable.

### Errors

All known errors are instances of, not extensions of, the `InferenceTaskError` base class, which has a `code`, a `message`, and `meta` information about the error. This allows us to serialize and deserialize errors over the wire without a complicated factory pattern.

### Tools

Tools are defined as a record, with a `description` and optionally a `schema`. The reason it's a record is type safety: it allows us to have fully typed tool calls (e.g. when the name of the tool being called is `x`, its arguments are typed as the schema of `x`).

## Notes for reviewers

- I've only added one reference implementation for a connector adapter, which is OpenAI. Adding more would create noise in the PR, but I can add them as well. Bedrock would need simulated function calling, which I would also expect to be handled by this plugin.
- Similarly, the natural language to ES|QL task just creates dummy steps, as moving the entire implementation would mean thousands of additional LOC, due to it needing the documentation, for instance.
- Observables over promises/iterators: Observables are a well-defined and widely adopted solution for async programming. Promises are not suitable for streamed/chunked responses because there are no intermediate values. Async iterators are not widely adopted among Kibana engineers.
- JSON Schema over Zod: I've tried using Zod, because I like its ergonomics over plain JSON Schema, but we need to convert it to JSON Schema at some point, which is a lossy conversion, creating a risk of using features that we cannot convert to JSON Schema. Additionally, tools for converting Zod to and [from JSON Schema are not always suitable](https://github.com/StefanTerdell/json-schema-to-zod#use-at-runtime). I've implemented my own JSON Schema to type definition, as [json-schema-to-ts](https://github.com/ThomasAribart/json-schema-to-ts) is very slow.
- There's no option for raw input or output. There could be, but it would defeat the purpose of the normalization that the `chatComplete` API handles. At that point it might be better to use the connector directly.
- That also means that for LangChain, something would be needed to convert the Observable into an async iterator that returns OpenAI-compatible output. This is doable, although it would be nice if we could just use the output from the OpenAI API in that case.
- I have not made room for any vendor-specific parameters in the `chatComplete` API. We might need them, but hopefully not.
- I think type safety is critical here, so there is some TypeScript voodoo in some places to make that happen.
- `system` is not a message in the conversation, but a separate property. Given the semantics of a system message (there can only be one, and only at the beginning of the conversation), I think it's easier to make it a top-level property than a message type.

---------

Co-authored-by: kibanamachine <[email protected]>
Co-authored-by: Elastic Machine <[email protected]>