From 583112ccd772aeb2b6df2fa69ed457c6263709f4 Mon Sep 17 00:00:00 2001
From: Diana <75819066+cloudjumpercat@users.noreply.github.com>
Date: Fri, 23 Aug 2024 15:05:08 -0500
Subject: [PATCH] Revise and add some conditional rendering

Signed-off-by: Diana <75819066+cloudjumpercat@users.noreply.github.com>
---
 .../kong-inc/prometheus/overview/_index.md | 33 +++----
 .../production/logging/ai-analytics.md     | 94 ++++++++++++++-----
 .../production/monitoring/ai-metrics.md    | 37 ++++----
 3 files changed, 106 insertions(+), 58 deletions(-)

diff --git a/app/_hub/kong-inc/prometheus/overview/_index.md b/app/_hub/kong-inc/prometheus/overview/_index.md
index 448f0ada7f68..2c47531a0b3f 100644
--- a/app/_hub/kong-inc/prometheus/overview/_index.md
+++ b/app/_hub/kong-inc/prometheus/overview/_index.md
@@ -52,13 +52,16 @@ license signature. Those metrics are only exported on {{site.base_gateway}}.
 timers, in Running or Pending state.

 {% if_version gte:3.0.x %}
+### Metrics disabled by default
 Following metrics are disabled by default as it may create high cardinality of metrics and may cause performance issues:

+#### Status code metrics
 When `status_code_metrics` is set to true:
 - **Status codes**: HTTP status codes returned by upstream services.
   These are available per service, across all services, and per route per consumer.

+#### Latency metrics
 When `latency_metrics` is set to true:
 - **Latencies Histograms**: Latency (in ms), as measured at Kong:
   - **Request**: Total time taken by Kong and upstream services to serve
@@ -67,10 +70,12 @@ When `latency_metrics` is set to true:
     plugins.
   - **Upstream**: Time taken by the upstream service to respond to requests.

+#### Bandwidth metrics
 When `bandwidth_metrics` is set to true:
 - **Bandwidth**: Total Bandwidth (egress/ingress) flowing through Kong.
   This metric is available per service and as a sum across all services.

+#### Upstream health metrics
 When `upstream_health_metrics` is set to true:
 - **Target Health**: The healthiness status (`healthchecks_off`, `healthy`, `unhealthy`, or `dns_error`) of targets belonging to a given upstream as well as their subsystem (`http` or `stream`).
@@ -79,30 +84,22 @@ When `upstream_health_metrics` is set to true:
 {% endif_version %}

 {% if_version gte:3.8.x %}
+#### AI LLM metrics
+All the following AI LLM metrics are available per provider, model, cache, database name (if cached), embeddings provider (if cached), embeddings model (if cached), and workspace.
+
 When `ai_llm_metrics` is set to true:
 - **AI Requests**: AI request sent to LLM providers.
-  These are available per provider, model, cache, database name (if cached),
-  embeddings provider (if cached), embeddings model (if cached), and workspace.
-- **AI Cost:**: AI Cost charged by LLM providers.
-  These are available per provider, model, cache, database name (if cached),
-  embeddings provider (if cached), embeddings model (if cached), and workspace.
-- **AI Tokens** AI Tokens counted by LLM providers.
-  These are available per provider, model, cache, database name (if cached),
-  embeddings provider (if cached), embeddings model (if cached), token type, and workspace.
-- **AI LLM Latency** Time taken to return a response by LLM providers.
-  These are available per provider, model, cache, database name (if cached),
-  embeddings provider (if cached), embeddings model (if cached), and workspace.
-- **AI Cache Fetch Latency** Time taken to return a response from the cache.
-  These are available per provider, model, cache, database name (if cached),
-  embeddings provider (if cached), embeddings model (if cached), and workspace.
-- **AI Cache Embeddings Latency** Time taken to generate embedding during the cache.
-  These are available per provider, model, cache, database name (if cached),
-  embeddings provider (if cached), embeddings model (if cached), and workspace.
+- **AI Cost**: AI costs charged by LLM providers.
+- **AI Tokens**: AI tokens counted by LLM providers.
+  These are also available per token type in addition to the options listed previously.
+- **AI LLM Latency**: Time taken by LLM providers to return a response.
+- **AI Cache Fetch Latency**: Time taken to return a response from the cache.
+- **AI Cache Embeddings Latency**: Time taken to generate embeddings during cache operations.

 For more details, see [AI Metrics](/gateway/{{ page.release }}/production/monitoring/ai-metrics/).
 {% endif_version %}
-
+### Metrics output example
 Here is an example of output you could expect from the `/metrics` endpoint:

 ```bash
diff --git a/app/_src/gateway/production/logging/ai-analytics.md b/app/_src/gateway/production/logging/ai-analytics.md
index 6547b5ac5594..deeef13f1df9 100644
--- a/app/_src/gateway/production/logging/ai-analytics.md
+++ b/app/_src/gateway/production/logging/ai-analytics.md
@@ -11,6 +11,44 @@ Each AI plugin returns a set of tokens.

 All log entries include the following attributes:

+{% if_version lte:3.7.x %}
+```json
+"ai": {
+  "payload": { "request": "[$optional_payload_request_]" },
+  "[$plugin_name_1]": {
+    "payload": { "response": "[$optional_payload_response]" },
+    "usage": {
+      "prompt_token": 28,
+      "total_tokens": 48,
+      "completion_token": 20,
+      "cost": 0.0038
+    },
+    "meta": {
+      "request_model": "command",
+      "provider_name": "cohere",
+      "response_model": "command",
+      "plugin_id": "546c3856-24b3-469a-bd6c-f6083babd2cd"
+    }
+  },
+  "[$plugin_name_2]": {
+    "payload": { "response": "[$optional_payload_response]" },
+    "usage": {
+      "prompt_token": 89,
+      "total_tokens": 145,
+      "completion_token": 56,
+      "cost": 0.0012
+    },
+    "meta": {
+      "request_model": "gpt-35-turbo",
+      "provider_name": "azure",
+      "response_model": "gpt-35-turbo",
+      "plugin_id": "5df193be-47a3-4f1b-8c37-37e31af0568b"
+    }
+  }
+}
+```
+{% endif_version %}
+{% if_version gte:3.8.x %}
 ```json
 "ai": {
   "payload": { "request": "[$optional_payload_request_]" },
@@ -50,36 +88,45 @@ All log entries include the following attributes:
     }
   }
 ```
+{% endif_version %}

 ### Log Details

 Each log entry includes the following details:

-Property | Description
---------|-------------
-`ai.payload.request` | The request payload.
-`ai.[$plugin_name].payload.response` |The response payload.
-`ai.[$plugin_name].usage.prompt_token` | Number of tokens used for prompting.
-`ai.[$plugin_name].usage.completion_token` | Number of tokens used for completion.
-`ai.[$plugin_name].usage.total_tokens` | Total number of tokens used.
-`ai.[$plugin_name].usage.cost` | The total cost of the request (input and output cost).
-`ai.[$plugin_name].usage.time_per_token` | The average time to generate an output token in ms.
-`ai.[$plugin_name].meta.request_model` | Model used for the AI request.
-`ai.[$plugin_name].meta.provider_name` | Name of the AI service provider.
-`ai.[$plugin_name].meta.response_model` | Model used for the AI response.
-`ai.[$plugin_name].meta.plugin_id` | Unique identifier of the plugin.
-`ai.[$plugin_name].meta.llm_latency` | The time in ms it took the llm provider to generate the full response.
-`ai.[$plugin_name].cache.cache_status` | The cache status could be Hit, Miss. Bypass or Refresh.
-`ai.[$plugin_name].cache.fetch_latency` | The time in ms it took the return a cache response.
-`ai.[$plugin_name].cache.embeddings_provider` | For semantic caching, the provider used to generate the embeddings.
-`ai.[$plugin_name].cache.embeddings_model` | For semantic caching, the model used to generate the embeddings.
-`ai.[$plugin_name].cache.embeddings_latency` | For semantic caching, the time taken to generate the embeddings.
+| Property | Description |
+| --------- | ------------- |
+| `ai.payload.request` | The request payload. |
+| `ai.[$plugin_name].payload.response` | The response payload. |
+| `ai.[$plugin_name].usage.prompt_token` | Number of tokens used for prompting. |
+| `ai.[$plugin_name].usage.completion_token` | Number of tokens used for completion. |
+| `ai.[$plugin_name].usage.total_tokens` | Total number of tokens used. |
+| `ai.[$plugin_name].usage.cost` | The total cost of the request (input and output cost). |
+{% if_version gte:3.8.x %}
+| `ai.[$plugin_name].usage.time_per_token` | The average time to generate an output token, in milliseconds. |
+{% endif_version %}
+| `ai.[$plugin_name].meta.request_model` | Model used for the AI request. |
+| `ai.[$plugin_name].meta.provider_name` | Name of the AI service provider. |
+| `ai.[$plugin_name].meta.response_model` | Model used for the AI response. |
+| `ai.[$plugin_name].meta.plugin_id` | Unique identifier of the plugin. |
+{% if_version gte:3.8.x %}
+| `ai.[$plugin_name].meta.llm_latency` | The time, in milliseconds, it took the LLM provider to generate the full response. |
+| `ai.[$plugin_name].cache.cache_status` | The cache status. This can be Hit, Miss, Bypass, or Refresh. |
+| `ai.[$plugin_name].cache.fetch_latency` | The time, in milliseconds, it took to return a cache response. |
+| `ai.[$plugin_name].cache.embeddings_provider` | For semantic caching, the provider used to generate the embeddings. |
+| `ai.[$plugin_name].cache.embeddings_model` | For semantic caching, the model used to generate the embeddings. |
+| `ai.[$plugin_name].cache.embeddings_latency` | For semantic caching, the time taken to generate the embeddings. |
+{% endif_version %}

-### Caches Logging
+{% if_version gte:3.8.x %}
+### Caches logging

-If using the cache semantic plugins, logging will be provided with some additional details about caching:
+If you're using the [AI Semantic Cache plugin](/hub/kong-inc/), logging will include some additional details about caching:

 ```json
 "ai": {
@@ -132,6 +179,7 @@ If using the cache semantic plugins, logging will be provided with some addition

 {:.note}
 > **Note:**
-> When returning a cache response `time_per_token` and `llm_latency` will be omitted.
-> The cache response can be returned either as a semantic cache or an exact cache. If returned as a semantic cache, it will include additional details such as the embeddings provider, embeddings model, and embeddings latency.
+> When returning a cache response, `time_per_token` and `llm_latency` are omitted.
+> The cache response can be returned either as a semantic cache or an exact cache. If it's returned as a semantic cache, it will include additional details such as the embeddings provider, embeddings model, and embeddings latency.
+{% endif_version %}
diff --git a/app/_src/gateway/production/monitoring/ai-metrics.md b/app/_src/gateway/production/monitoring/ai-metrics.md
index 471a9be3a923..83a906cc997d 100644
--- a/app/_src/gateway/production/monitoring/ai-metrics.md
+++ b/app/_src/gateway/production/monitoring/ai-metrics.md
@@ -37,24 +37,27 @@ dashboard](https://grafana.com/grafana/dashboards/21162-kong-cx-ai/):

 ## Available metrics

+{% if_version lte:3.7.x %}
+- **AI Requests**: AI requests sent to LLM providers.
+  These are available per provider, model, cache, database name (if cached), and workspace.
+- **AI Cost**: AI costs charged by LLM providers.
+  These are available per provider, model, cache, database name (if cached), and workspace.
+- **AI Tokens**: AI tokens counted by LLM providers.
+  These are available per provider, model, cache, database name (if cached), token type, and workspace.
+{% endif_version %}
+
+{% if_version gte:3.8.x %}
+All the following AI LLM metrics are available per provider, model, cache, database name (if cached), embeddings provider (if cached), embeddings model (if cached), and workspace.
+
+When `ai_llm_metrics` is set to true:
 - **AI Requests**: AI request sent to LLM providers.
-  These are available per provider, model, cache, database name (if cached),
-  embeddings provider (if cached), embeddings model (if cached), and workspace.
-- **AI Cost:**: AI Cost charged by LLM providers.
-  These are available per provider, model, cache, database name (if cached),
-  embeddings provider (if cached), embeddings model (if cached), and workspace.
-- **AI Tokens** AI Tokens counted by LLM providers.
-  These are available per provider, model, cache, database name (if cached),
-  embeddings provider (if cached), embeddings model (if cached), token type, and workspace.
-- **AI LLM Latency** Time taken to return a response by LLM providers.
-  These are available per provider, model, cache, database name (if cached),
-  embeddings provider (if cached), embeddings model (if cached), and workspace.
-- **AI Cache Fetch Latency** Time taken to return a response from the cache.
-  These are available per provider, model, cache, database name (if cached),
-  embeddings provider (if cached), embeddings model (if cached), and workspace.
-- **AI Cache Embeddings Latency** Time taken to generate embedding during the cache.
-  These are available per provider, model, cache, database name (if cached),
-  embeddings provider (if cached), embeddings model (if cached), and workspace.
+- **AI Cost**: AI costs charged by LLM providers.
+- **AI Tokens**: AI tokens counted by LLM providers.
+  These are also available per token type in addition to the options listed previously.
+- **AI LLM Latency**: Time taken by LLM providers to return a response.
+- **AI Cache Fetch Latency**: Time taken to return a response from the cache.
+- **AI Cache Embeddings Latency**: Time taken to generate embeddings during cache operations.
+{% endif_version %}

 AI metrics are disabled by default as it may create high cardinality of metrics and may cause performance issues.
 To enable them, set `ai_metrics` to true in the Prometheus plugin configuration.
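
As a quick way to exercise the settings described in these pages, here is a minimal sketch of enabling the AI metrics and scraping them. It is illustrative only: it assumes a local {{site.base_gateway}} instance with the Admin API on the default port 8001, and it uses the `ai_metrics` field name referenced in ai-metrics.md; confirm the exact field name against the Prometheus plugin schema for your release.

```bash
# Sketch: enable the Prometheus plugin with AI metrics turned on.
# Assumes the Admin API is reachable at localhost:8001 (the default).
curl -X POST http://localhost:8001/plugins \
  --data "name=prometheus" \
  --data "config.ai_metrics=true" \
  --data "config.status_code_metrics=true"

# Scrape the metrics endpoint and keep only the AI-related series.
# The "ai_" prefix is an assumption about the metric names; adjust as needed.
curl -s http://localhost:8001/metrics | grep "ai_"
```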
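
The per-request log attributes documented in ai-analytics.md can be spot-checked the same way. A minimal sketch, assuming the File Log plugin writes one JSON entry per line to a hypothetical `/tmp/ai-requests.log` and that the key under `ai` is `ai-proxy`, standing in for the `[$plugin_name]` placeholder used in the examples above:

```bash
# Hypothetical log path and plugin key; adjust both to your File Log configuration.
# Prints the token usage block and the cache status for the most recent entry.
tail -n 1 /tmp/ai-requests.log | jq '{
  usage: .ai["ai-proxy"].usage,
  cache_status: .ai["ai-proxy"].cache.cache_status
}'
```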