diff --git a/app/_data/docs_nav_gateway_3.8.x.yml b/app/_data/docs_nav_gateway_3.8.x.yml index 4a236fd54a80..230e12edca3d 100644 --- a/app/_data/docs_nav_gateway_3.8.x.yml +++ b/app/_data/docs_nav_gateway_3.8.x.yml @@ -408,6 +408,10 @@ items: - text: Expose and graph AI Metrics url: /ai-gateway/metrics/ generate: false + - text: AI Gateway Load Balancing + url: /hub/kong-inc/ai-proxy-advanced/#load-balancing + generate: false + absolute_url: true - text: AI Gateway plugins url: /hub/?category=ai generate: false diff --git a/app/_data/docs_nav_gateway_3.9.x.yml b/app/_data/docs_nav_gateway_3.9.x.yml index 083e938d26d1..5a1721fabfad 100644 --- a/app/_data/docs_nav_gateway_3.9.x.yml +++ b/app/_data/docs_nav_gateway_3.9.x.yml @@ -392,12 +392,26 @@ items: url: /hub/kong-inc/ai-proxy/how-to/llm-provider-integration-guides/llama2/ generate: false absolute_url: true + - text: AI Platform Integration Guides + items: + - text: Gemini + url: /hub/kong-inc/ai-proxy/how-to/machine-learning-platform-integration-guides/gemini/ + generate: false + absolute_url: true + - text: Amazon Bedrock + url: /hub/kong-inc/ai-proxy/how-to/machine-learning-platform-integration-guides/bedrock/ + generate: false + absolute_url: true - text: AI Gateway Analytics url: /ai-gateway/ai-analytics/ generate: false - text: Expose and graph AI Metrics url: /ai-gateway/metrics/ generate: false + - text: AI Gateway Load Balancing + url: /hub/kong-inc/ai-proxy-advanced/#load-balancing + generate: false + absolute_url: true - text: AI Gateway plugins url: /hub/?category=ai generate: false diff --git a/app/_hub/kong-inc/ai-proxy-advanced/how-to/_load-balancing.md b/app/_hub/kong-inc/ai-proxy-advanced/how-to/_load-balancing.md new file mode 100644 index 000000000000..0ebb30ab9e36 --- /dev/null +++ b/app/_hub/kong-inc/ai-proxy-advanced/how-to/_load-balancing.md @@ -0,0 +1,282 @@ +--- +title: Load Balancing +nav_title: Configure Load Balancing with AI Proxy Advanced +minimum_version: 3.8.x +--- + +The AI Proxy Advanced plugin offers different load-balancing algorithms to define how to distribute requests to different AI models. This guide provides a configuration example for each algorithm. + +## Semantic routing + +Semantic routing enables distribution of requests based on the similarity between the prompt and the description of each model. This allows Kong to automatically select the model that is best suited for the given domain or use case. + +To set up load balancing with the AI Proxy Advanced plugin, you need to configure the following parameters: +* [`config.embeddings`](/hub/kong-inc/ai-proxy-advanced/configuration/#config-embeddings) to define the model to use to match the model description and the prompts. +* [`config.vectordb`](/hub/kong-inc/ai-proxy-advanced/configuration/#config-vectordb) to define the vector database parameters. Only Redis is supported, so you need a Redis instance running in your environment. +* [`config.targets[].description`](/hub/kong-inc/ai-proxy-advanced/configuration/#config-targets-description) to define the description to be matched with the prompts. + +For example, the following configuration uses two OpenAI models: one for questions related to Kong, and another for questions related to Microsoft. 
+ +```yaml +_format_version: "3.0" +services: +- name: openai-chat-service + url: https://httpbin.konghq.com/ + routes: + - name: openai-chat-route + paths: + - /chat +plugins: +- name: ai-proxy-advanced + config: + embeddings: + auth: + header_name: Authorization + header_value: Bearer + model: + name: text-embedding-3-small + provider: openai + vectordb: + dimensions: 1024 + distance_metric: cosine + strategy: redis + threshold: 0.7 + redis: + host: redis-stack-server + port: 6379 + balancer: + algorithm: semantic + targets: + - model: + name: gpt-4 + provider: openai + options: + max_tokens: 512 + temperature: 1.0 + route_type: llm/v1/chat + auth: + header_name: Authorization + header_value: Bearer + description: "What is Kong?" + - model: + name: gpt-4o-mini + provider: openai + options: + max_tokens: 512 + temperature: 1.0 + route_type: llm/v1/chat + auth: + header_name: Authorization + header_value: Bearer + description: "What is Microsoft?" +``` + +You can validate this configuration by sending requests and checking the `X-Kong-LLM-Model` response header to see which model was used. + +In the response to the following request, the `X-Kong-LLM-Model` header value is `openai/gpt-4`. + +```bash +curl --request POST \ + --url http://localhost:8000/chat \ + --header 'Content-Type: application/json' \ + --header 'User-Agent: insomnia/10.0.0' \ + --data '{ + "messages": [ + { + "role": "system", + "content": "You are an IT specialist" + }, + { + "role": "user", + "content": "Who founded Kong?" + } + ] +}' +``` + +## Weighted round-robin + +The round-robin algorithm distributes requests to the different models on a rotation. By default, all models have the same weight and receive the same percentage of requests. However, this can be configured with the [`config.targets[].weight`](/hub/kong-inc/ai-proxy-advanced/configuration/#config-targets-weight) parameter. + +If you have three models and want to assign 70% of requests to the first one, 25% of requests to the second one, and 5% of requests to the third one, you can use the following configuration: + +```yaml +_format_version: "3.0" +services: +- name: openai-chat-service + url: https://httpbin.konghq.com/ + routes: + - name: openai-chat-route + paths: + - /chat +plugins: +- name: ai-proxy-advanced + config: + balancer: + algorithm: round-robin + targets: + - model: + name: gpt-4 + provider: openai + options: + max_tokens: 512 + temperature: 1.0 + route_type: llm/v1/chat + auth: + header_name: Authorization + header_value: Bearer + weight: 70 + - model: + name: gpt-4o-mini + provider: openai + options: + max_tokens: 512 + temperature: 1.0 + route_type: llm/v1/chat + auth: + header_name: Authorization + header_value: Bearer + weight: 25 + - model: + name: gpt-3 + provider: openai + options: + max_tokens: 512 + temperature: 1.0 + route_type: llm/v1/chat + auth: + header_name: Authorization + header_value: Bearer + weight: 5 +``` + +## Consistent-hashing + +The consistent-hashing algorithm uses a request header to consistently route requests to the same AI model based on the header value. By default, the header is `X-Kong-LLM-Request-ID`, but it can be customized with the [`config.balancer.hash_on_header`](/hub/kong-inc/ai-proxy-advanced/configuration/#config-balancer-hash_on_header) parameter. 
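+
+You can verify this behavior from a client by sending repeated requests that carry the same value in the hashing header and checking which model answers each one. The following is a minimal sketch, assuming the `/chat` route used throughout this guide and the custom `X-Hashing-Header` set in the configuration example that follows; the `X-Kong-LLM-Model` response header (the same header used to validate semantic routing above) shows which model served the request:
+
+```bash
+# Requests that carry the same X-Hashing-Header value are routed to the
+# same target model. Inspect the X-Kong-LLM-Model response header to confirm.
+# "user-123" is an arbitrary example value (for example, a user or session ID).
+curl --include --request POST \
+  --url http://localhost:8000/chat \
+  --header 'Content-Type: application/json' \
+  --header 'X-Hashing-Header: user-123' \
+  --data '{
+  "messages": [
+    {
+      "role": "user",
+      "content": "Who founded Kong?"
+    }
+  ]
+}'
+```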
+ +For example: +```yaml +_format_version: "3.0" +services: +- name: openai-chat-service + url: https://httpbin.konghq.com/ + routes: + - name: openai-chat-route + paths: + - /chat +plugins: +- name: ai-proxy-advanced + config: + balancer: + algorithm: consistent-hashing + hash_on_header: X-Hashing-Header + targets: + - model: + name: gpt-4 + provider: openai + options: + max_tokens: 512 + temperature: 1.0 + route_type: llm/v1/chat + auth: + header_name: Authorization + header_value: Bearer + - model: + name: gpt-4o-mini + provider: openai + options: + max_tokens: 512 + temperature: 1.0 + route_type: llm/v1/chat + auth: + header_name: Authorization + header_value: Bearer +``` + +## Lowest-latency + +The lowest-latency algorithm distributes requests to the model with the lowest response time. By default, the latency is calculated based on the time the model takes to generate each token (`tpot`). You can change the value of the [`config.balancer.latency_strategy`](/hub/kong-inc/ai-proxy-advanced/configuration/#config-balancer-latency_strategy) to `e2e` to use the end-to-end response time. + +For example: +```yaml +_format_version: "3.0" +services: +- name: openai-chat-service + url: https://httpbin.konghq.com/ + routes: + - name: openai-chat-route + paths: + - /chat +plugins: +- name: ai-proxy-advanced + config: + balancer: + algorithm: lowest-latency + latency_strategy: e2e + targets: + - model: + name: gpt-4 + provider: openai + options: + max_tokens: 512 + temperature: 1.0 + route_type: llm/v1/chat + auth: + header_name: Authorization + header_value: Bearer + - model: + name: gpt-4o-mini + provider: openai + options: + max_tokens: 512 + temperature: 1.0 + route_type: llm/v1/chat + auth: + header_name: Authorization + header_value: Bearer +``` + +## Lowest-usage + +The lowest-usage algorithm distributes requests to the model with the lowest usage volume. By default, the usage is calculated based on the total number of tokens in the prompt and in the response. However, you can customize this using the [`config.balancer.tokens_count_strategy`](/hub/kong-inc/ai-proxy-advanced/configuration/#config-balancer-tokens_count_strategy) parameter. You can use: +* `prompt-tokens` to count only the tokens in the prompt +* `completion-tokens` to count only the tokens in the response + +For example: +```yaml +_format_version: "3.0" +services: +- name: openai-chat-service + url: https://httpbin.konghq.com/ + routes: + - name: openai-chat-route + paths: + - /chat +plugins: +- name: ai-proxy-advanced + config: + balancer: + algorithm: lowest-usage + tokens_count_strategy: prompt-tokens + targets: + - model: + name: gpt-4 + provider: openai + options: + max_tokens: 512 + temperature: 1.0 + route_type: llm/v1/chat + auth: + header_name: Authorization + header_value: Bearer + - model: + name: gpt-4o-mini + provider: openai + options: + max_tokens: 512 + temperature: 1.0 + route_type: llm/v1/chat + auth: + header_name: Authorization + header_value: Bearer +``` \ No newline at end of file diff --git a/app/_hub/kong-inc/ai-proxy-advanced/overview/_index.md b/app/_hub/kong-inc/ai-proxy-advanced/overview/_index.md index 340d9898e72b..e477e67d5439 100644 --- a/app/_hub/kong-inc/ai-proxy-advanced/overview/_index.md +++ b/app/_hub/kong-inc/ai-proxy-advanced/overview/_index.md @@ -5,7 +5,7 @@ nav_title: Overview The AI Proxy Advanced plugin lets you transform and proxy requests to multiple AI providers and models at the same time. This lets you set up load balancing between targets. 
-The plugin accepts requests in one of a few defined and standardised formats, translates them to the configured target format, and then transforms the response back into a standard format. +The plugin accepts requests in one of a few defined and standardized formats, translates them to the configured target format, and then transforms the response back into a standard format. The following table describes which providers and requests the AI Proxy Advanced plugin supports: @@ -47,16 +47,23 @@ This plugin currently only supports REST-based full text responses. This plugin supports several load-balancing algorithms, similar to those used for Kong upstreams, allowing efficient distribution of requests across different AI models. The supported algorithms include: * **Lowest-usage**: The lowest-usage algorithm in AI Proxy Advanced is based on the volume of usage for each model. It balances the load by distributing requests to models with the lowest usage, measured by factors such as prompt token counts, response token counts, or other resource metrics. +* **Lowest-latency**: The lowest-latency algorithm is based on the response time for each model. It distributes requests to models with the lowest response time. +* **Semantic**: The semantic algorithm distributes requests to different models based on the similarity between the prompt in the request and the description provided in the model configuration. This allows Kong to automatically select the model that is best suited for the given domain or use case. This feature enhances the flexibility and efficiency of model selection, especially when dealing with a diverse range of AI providers and models. * [Round-robin (weighted)](/gateway/latest/how-kong-works/load-balancing/#round-robin) * [Consistent-hashing (sticky-session on given header value)](/gateway/latest/how-kong-works/load-balancing/#consistent-hashing) -Additionally, semantic routing works similarly to load-balancing algorithms like lowest-usage or least-connections, but instead of volume or connection metrics, it uses the similarity score between the incoming prompt and the descriptions of each model. This allows Kong to automatically choose the model best suited for handling the request, based on performance in similar contexts. -## Semantic routing +## Retry and fallback -The AI Proxy Advanced plugin supports semantic routing, which enables distribution of requests based on the similarity between the prompt and the description of each model. This allows Kong to automatically select the model that is best suited for the given domain or use case. +The load balancer has customizable retries and timeouts for requests, and can redirect a request to a different model in case of failure. This allows you to have a fallback in case one of your targets is unavailable. -By analyzing the content of the request, the plugin can match it to the most appropriate model that is known to perform better in similar contexts. This feature enhances the flexibility and efficiency of model selection, especially when dealing with a diverse range of AI providers and models. +This plugin does not support fallback over targets with different formats. 
You can use different providers as long as the formats are compatible. For example, load balancers with these combinations of targets are supported:
+* Different OpenAI models
+* OpenAI models and Mistral models with the OpenAI format
+* Mistral models with the OLLAMA format and Llama models with the OLLAMA format
+
+{:.note}
+> Some errors, such as client errors, result in a failure and don't fail over to another target.
 
 ## Request and response formats