Commit

Feat/add ai metrics docs (#7691)
* doc on ai metrics

* fix

* fix

* fix

* fix

* fix

* test visible

* add 3.8 version

* add 3.8 version

* add 3.8 version

* add 3.8 version

* add 3.8 version

* edit

* edit cost name

* doc ai metrics

* add latencies and cache data

* f

* ignore vale

* Revise and add some conditional rendering

Signed-off-by: Diana <[email protected]>

* Fix table formatting

Signed-off-by: Diana <[email protected]>

---------

Signed-off-by: Diana <[email protected]>
Co-authored-by: Diana <[email protected]>
AntoineJac and cloudjumpercat authored Aug 23, 2024
1 parent 826a7c1 commit 42c58d8
Showing 3 changed files with 191 additions and 32 deletions.
27 changes: 18 additions & 9 deletions app/_hub/kong-inc/prometheus/overview/_index.md
@@ -52,13 +52,16 @@ license signature. Those metrics are only exported on {{site.base_gateway}}.
timers, in Running or Pending state.

{% if_version gte:3.0.x %}
### Metrics disabled by default
The following metrics are disabled by default because they can create high metric cardinality and
cause performance issues:

#### Status code metrics
When `status_code_metrics` is set to true:
- **Status codes**: HTTP status codes returned by upstream services.
These are available per service, across all services, and per route per consumer.

#### Latency metrics
When `latency_metrics` is set to true:
- **Latencies Histograms**: Latency (in ms), as measured at Kong:
- **Request**: Total time taken by Kong and upstream services to serve
@@ -67,10 +70,12 @@ When `latency_metrics` is set to true:
plugins.
- **Upstream**: Time taken by the upstream service to respond to requests.

#### Bandwidth metrics
When `bandwidth_metrics` is set to true:
- **Bandwidth**: Total Bandwidth (egress/ingress) flowing through Kong.
This metric is available per service and as a sum across all services.

#### Upstream health metrics
When `upstream_health_metrics` is set to true:
- **Target Health**: The healthiness status (`healthchecks_off`, `healthy`, `unhealthy`, or `dns_error`) of targets
belonging to a given upstream as well as their subsystem (`http` or `stream`).
@@ -79,18 +84,22 @@ When `upstream_health_metrics` is set to true:
{% endif_version %}

{% if_version gte:3.8.x %}
When `ai_llm_metrics` is set to `true`:
- **AI Requests**: AI requests sent to LLM providers.
These are available per provider, model, cache, database name (if cached), and workspace.
- **AI Cost**: AI costs charged by LLM providers.
These are available per provider, model, cache, database name (if cached), and workspace.
- **AI Tokens**: AI tokens counted by LLM providers.
These are available per provider, model, cache, database name (if cached), token type, and workspace.
#### AI LLM metrics
All the following AI LLM metrics are available per provider, model, cache, database name (if cached), embeddings provider (if cached), embeddings model (if cached), and workspace.

For more details, see [AI Metrics](/gateway/latest/production/monitoring/ai-metrics/).
{% endif_version %}
When `ai_llm_metrics` is set to true:
- **AI Requests**: AI requests sent to LLM providers.
- **AI Cost**: AI costs charged by LLM providers.
- **AI Tokens**: AI tokens counted by LLM providers.
These are also available per token type in addition to the options listed previously.
- **AI LLM Latency**: Time taken by LLM providers to return a response.
- **AI Cache Fetch Latency**: Time taken to return a response from the cache.
- **AI Cache Embeddings Latency**: Time taken to generate embeddings for semantic caching.

For more details, see [AI Metrics](/gateway/{{ page.release }}/production/monitoring/ai-metrics/).
{% endif_version %}
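
Each of these metric families is toggled by a boolean field in the Prometheus plugin configuration. As a minimal sketch (the Admin API address `localhost:8001` and the global plugin scope are assumptions), the optional families can be enabled when the plugin is created; the AI flag described above can be added the same way on versions that support it:

```bash
# Sketch: create a global Prometheus plugin with the optional metric families enabled.
# Assumes the Admin API is reachable on localhost:8001; keep only the flags you need.
curl -X POST http://localhost:8001/plugins \
  --data "name=prometheus" \
  --data "config.status_code_metrics=true" \
  --data "config.latency_metrics=true" \
  --data "config.bandwidth_metrics=true" \
  --data "config.upstream_health_metrics=true"
```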

### Metrics output example
Here is an example of output you could expect from the `/metrics` endpoint:

```bash
146 changes: 133 additions & 13 deletions app/_src/gateway/production/logging/ai-analytics.md
@@ -11,6 +11,7 @@ Each AI plugin returns a set of tokens.

All log entries include the following attributes:

{% if_version lte:3.7.x %}
```json
"ai": {
"payload": { "request": "[$optional_payload_request_]" },
@@ -45,23 +46,142 @@ All log entries include the following attributes:
}
}
}

```
{% endif_version %}
{% if_version gte:3.8.x %}
```json
"ai": {
"payload": { "request": "[$optional_payload_request_]" },
"[$plugin_name_1]": {
"payload": { "response": "[$optional_payload_response]" },
"usage": {
"prompt_token": 28,
"total_tokens": 48,
"completion_token": 20,
"cost": 0.0038,
"time_per_token": 133
},
"meta": {
"request_model": "command",
"provider_name": "cohere",
"response_model": "command",
"plugin_id": "546c3856-24b3-469a-bd6c-f6083babd2cd",
"llm_latency": 2670
}
},
"[$plugin_name_2]": {
"payload": { "response": "[$optional_payload_response]" },
"usage": {
"prompt_token": 89,
"total_tokens": 145,
"completion_token": 56,
"cost": 0.0012,
"time_per_token": 87
},
"meta": {
"request_model": "gpt-35-turbo",
"provider_name": "azure",
"response_model": "gpt-35-turbo",
"plugin_id": "5df193be-47a3-4f1b-8c37-37e31af0568b",
"llm_latency": 4927
}
}
}
```
{% endif_version %}

### Log details

Each log entry includes the following details:

Property | Description
---------|-------------
`ai.payload.request` | The request payload.
`ai.[$plugin_name].payload.response` |The response payload.
`ai.[$plugin_name].usage.prompt_token` | Number of tokens used for prompting.
`ai.[$plugin_name].usage.completion_token` | Number of tokens used for completion.
`ai.[$plugin_name].usage.total_tokens` | Total number of tokens used.
`ai.[$plugin_name].usage.cost` | The total cost of the request (input and output cost).
`ai.[$plugin_name].meta.request_model` | Model used for the AI request.
`ai.[$plugin_name].meta.provider_name` | Name of the AI service provider.
`ai.[$plugin_name].meta.response_model` | Model used for the AI response.
`ai.[$plugin_name].meta.plugin_id` | Unique identifier of the plugin.
<!--vale off-->

| Property | Description |
| --------- | ------------- |
| `ai.payload.request` | The request payload. |
| `ai.[$plugin_name].payload.response` | The response payload. |
| `ai.[$plugin_name].usage.prompt_token` | Number of tokens used for prompting. |
| `ai.[$plugin_name].usage.completion_token` | Number of tokens used for completion. |
| `ai.[$plugin_name].usage.total_tokens` | Total number of tokens used. |
| `ai.[$plugin_name].usage.cost` | The total cost of the request (input and output cost). |

{% if_version gte:3.8.x %}
| `ai.[$plugin_name].usage.time_per_token` | The average time to generate an output token, in milliseconds. |
{% endif_version %}

| `ai.[$plugin_name].meta.request_model` | Model used for the AI request. |
| `ai.[$plugin_name].meta.provider_name` | Name of the AI service provider. |
| `ai.[$plugin_name].meta.response_model` | Model used for the AI response. |
| `ai.[$plugin_name].meta.plugin_id` | Unique identifier of the plugin. |

{% if_version gte:3.8.x %}
| `ai.[$plugin_name].meta.llm_latency` | The time, in milliseconds, it took the LLM provider to generate the full response. |
| `ai.[$plugin_name].cache.cache_status` | The cache status. This can be `Hit`, `Miss`, `Bypass`, or `Refresh`. |
| `ai.[$plugin_name].cache.fetch_latency` | The time, in milliseconds, it took to return a cache response. |
| `ai.[$plugin_name].cache.embeddings_provider` | For semantic caching, the provider used to generate the embeddings. |
| `ai.[$plugin_name].cache.embeddings_model` | For semantic caching, the model used to generate the embeddings. |
| `ai.[$plugin_name].cache.embeddings_latency` | For semantic caching, the time taken to generate the embeddings. |
{% endif_version %}

<!--vale on-->
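
To illustrate how these attributes nest, the following sketch extracts the usage fields from a captured log entry with `jq`. The file name `kong-ai-log.json` and the plugin key `ai-proxy` are hypothetical placeholders for your own logging output and plugin name.

```bash
# Hypothetical example: pull token usage and cost out of one logged entry.
# Replace kong-ai-log.json and "ai-proxy" with your own log file and plugin name.
jq '.ai["ai-proxy"] | {
      prompt_tokens:     .usage.prompt_token,
      completion_tokens: .usage.completion_token,
      cost:              .usage.cost
    }' kong-ai-log.json
```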

{% if_version gte:3.8.x %}
### Cache logging

If you're using the [AI Semantic Cache plugin](/hub/kong-inc/), log entries include additional details about caching:

```json
"ai": {
"payload": { "request": "[$optional_payload_request_]" },
"[$plugin_name_1]": {
"payload": { "response": "[$optional_payload_response]" },
"usage": {
"prompt_token": 28,
"total_tokens": 48,
"completion_token": 20,
"cost": 0.0038,
"time_per_token": 133
},
"meta": {
"request_model": "command",
"provider_name": "cohere",
"response_model": "command",
"plugin_id": "546c3856-24b3-469a-bd6c-f6083babd2cd",
"llm_latency": 2670
},
"cache": {
"cache_status": "Hit",
"fetch_latency": 21
}
},
"[$plugin_name_2]": {
"payload": { "response": "[$optional_payload_response]" },
"usage": {
"prompt_token": 89,
"total_tokens": 145,
"completion_token": 56,
"cost": 0.0012,
},
"meta": {
"request_model": "gpt-35-turbo",
"provider_name": "azure",
"response_model": "gpt-35-turbo",
"plugin_id": "5df193be-47a3-4f1b-8c37-37e31af0568b",
},
"cache": {
"cache_status": "Hit",
"fetch_latency": 444,
"embeddings_provider": "openai",
"embeddings_model": "text-embedding-3-small",
"embeddings_latency": 424
}
}
}
```

{:.note}
> **Note:**
> When returning a cache response, `time_per_token` and `llm_latency` are omitted.
> A cached response can come from either a semantic cache or an exact cache. If it comes from the semantic cache, the log entry includes additional details such as the embeddings provider, embeddings model, and embeddings latency.
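
The `cache` object alone is enough to tell whether a logged response was served from the cache and whether the hit was semantic or exact. The following is a sketch using the same hypothetical log file (`kong-ai-log.json`) and plugin key (`ai-proxy`) as the earlier `jq` example; substitute your own names.

```bash
# Hypothetical example: report cache status; the embeddings fields are only present
# for semantic cache hits, so fall back to a marker for exact cache hits.
jq '.ai["ai-proxy"].cache | {
      cache_status,
      fetch_latency,
      embeddings_provider: (.embeddings_provider // "exact cache"),
      embeddings_model:    (.embeddings_model // "exact cache")
    }' kong-ai-log.json
```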
{% endif_version %}

50 changes: 40 additions & 10 deletions app/_src/gateway/production/monitoring/ai-metrics.md
@@ -37,12 +37,27 @@ dashboard](https://grafana.com/grafana/dashboards/21162-kong-cx-ai/):

## Available metrics

{% if_version lte:3.7.x %}
- **AI Requests**: AI requests sent to LLM providers.
These are available per provider, model, cache, database name (if cached), and workspace.
- **AI Cost**: AI costs charged by LLM providers.
These are available per provider, model, cache, database name (if cached), and workspace.
- **AI Tokens**: AI tokens counted by LLM providers.
These are available per provider, model, cache, database name (if cached), token type, and workspace.
{% endif_version %}

{% if_version gte:3.8.x %}
All the following AI LLM metrics are available per provider, model, cache, database name (if cached), embeddings provider (if cached), embeddings model (if cached), and workspace.

When `ai_llm_metrics` is set to true:
- **AI Requests**: AI requests sent to LLM providers.
- **AI Cost**: AI costs charged by LLM providers.
- **AI Tokens**: AI tokens counted by LLM providers.
These are also available per token type in addition to the options listed previously.
- **AI LLM Latency**: Time taken by LLM providers to return a response.
- **AI Cache Fetch Latency**: Time taken to return a response from the cache.
- **AI Cache Embeddings Latency**: Time taken to generate embeddings for semantic caching.
{% endif_version %}

AI metrics are disabled by default because they can create high metric cardinality and
cause performance issues. To enable them, set `ai_metrics` to true in the Prometheus plugin configuration.
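
As a minimal sketch (assuming the Admin API listens on `localhost:8001` and the plugin is applied globally, with the field name as given above), the flag can be set when the plugin is created:

```bash
# Sketch: create a global Prometheus plugin with AI metrics enabled.
# Assumes the Admin API is reachable on localhost:8001.
curl -X POST http://localhost:8001/plugins \
  --data "name=prometheus" \
  --data "config.ai_metrics=true"
```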
@@ -63,16 +78,31 @@ Transfer-Encoding: chunked
Connection: keep-alive
Access-Control-Allow-Origin: *

# HELP ai_requests_total AI requests total per ai_provider in Kong
# TYPE ai_requests_total counter
ai_requests_total{ai_provider="provider1",ai_model="model1",cache_status="hit",vector_db="redis",embeddings_provider="openai",embeddings_model="text-embedding-3-large",workspace="workspace1"} 100
# HELP ai_cost_total AI requests cost per ai_provider/cache in Kong
# TYPE ai_cost_total counter
ai_cost_total{ai_provider="provider1",ai_model="model1",cache_status="hit",vector_db="redis",embeddings_provider="openai",embeddings_model="text-embedding-3-large",workspace="workspace1"} 50
# HELP ai_tokens_total AI tokens total per ai_provider/cache in Kong
# TYPE ai_tokens_total counter
ai_tokens_total{ai_provider="provider1",ai_model="model1",cache_status="hit",vector_db="redis",embeddings_provider="openai",embeddings_model="text-embedding-3-large",token_type="input",workspace="workspace1"} 1000
ai_tokens_total{ai_provider="provider1",ai_model="model1",cache_status="hit",vector_db="redis",embeddings_provider="openai",embeddings_model="text-embedding-3-large",token_type="output",workspace="workspace1"} 2000
{% if_version gte:3.0.x %}
# HELP ai_llm_requests_total AI requests total per ai_provider in Kong
# TYPE ai_llm_requests_total counter
ai_llm_requests_total{ai_provider="provider1",ai_model="model1",cache_status="hit",vector_db="redis",embeddings_provider="openai",embeddings_model="text-embedding-3-large",workspace="workspace1"} 100
# HELP ai_llm_cost_total AI requests cost per ai_provider/cache in Kong
# TYPE ai_llm_cost_total counter
ai_llm_cost_total{ai_provider="provider1",ai_model="model1",cache_status="hit",vector_db="redis",embeddings_provider="openai",embeddings_model="text-embedding-3-large",workspace="workspace1"} 50
# HELP ai_llm_provider_latency AI latencies per ai_provider in Kong
# TYPE ai_llm_provider_latency bucket
ai_llm_provider_latency_ms_bucket{ai_provider="provider1",ai_model="model1",cache_status="",vector_db="",embeddings_provider="",embeddings_model="",workspace="workspace1",le="+Inf"} 2
# HELP ai_llm_tokens_total AI tokens total per ai_provider/cache in Kong
# TYPE ai_llm_tokens_total counter
ai_llm_tokens_total{ai_provider="provider1",ai_model="model1",cache_status="",vector_db="",embeddings_provider="",embeddings_model="",token_type="prompt_tokens",workspace="workspace1"} 1000
ai_llm_tokens_total{ai_provider="provider1",ai_model="model1",cache_status="",vector_db="",embeddings_provider="",embeddings_model="",token_type="completion_tokens",workspace="workspace1"} 2000
ai_llm_tokens_total{ai_provider="provider1",ai_model="model1",cache_status="hit",vector_db="redis",embeddings_provider="openai",embeddings_model="text-embedding-3-large",token_type="total_tokens",workspace="workspace1"} 3000
# HELP ai_cache_fetch_latency AI cache latencies per ai_provider/database in Kong
# TYPE ai_cache_fetch_latency bucket
ai_cache_fetch_latency{ai_provider="provider1",ai_model="model1",cache_status="hit",vector_db="redis",embeddings_provider="openai",embeddings_model="text-embedding-3-large",workspace="workspace1",le="+Inf"} 2
# HELP ai_cache_embeddings_latency AI cache latencies per ai_provider/database in Kong
# TYPE ai_cache_embeddings_latency bucket
ai_cache_embeddings_latency{ai_provider="provider1",ai_model="model1",cache_status="hit",vector_db="redis",embeddings_provider="openai",embeddings_model="text-embedding-3-large",workspace="workspace1",le="+Inf"} 2
# HELP ai_llm_provider_latency AI cache latencies per ai_provider/database in Kong
# TYPE ai_llm_provider_latency bucket
ai_llm_provider_latency{ai_provider="provider1",ai_model="model1",cache_status="hit",vector_db="redis",embeddings_provider="openai",embeddings_model="text-embedding-3-large",workspace="workspace1",le="+Inf"} 2
{% endif_version %}
```
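
Once scraped, these series can be aggregated like any other Prometheus counters and histograms. The query below is a sketch that assumes a Prometheus server at `localhost:9090` is already scraping this node; it sums AI spend per provider over the last hour using the `ai_llm_cost_total` counter shown above (on versions that expose the older `ai_cost_total` name, substitute that metric).

```bash
# Sketch: total AI cost per provider over the last hour, queried from Prometheus.
# Assumes a Prometheus server at localhost:9090 that scrapes the /metrics endpoint above.
curl -G http://localhost:9090/api/v1/query \
  --data-urlencode 'query=sum by (ai_provider) (increase(ai_llm_cost_total[1h]))'
```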

{:.note}
