Commit

Feat/add ai metrics docs (#7691)
* doc on ai metrics

* fix

* fix

* fix

* fix

* fix

* test visible

* add 3.8 version

* add 3.8 version

* add 3.8 version

* add 3.8 version

* add 3.8 version

* edit

* edit cost name

* doc ai metrics

* add latencies and cache data

* f

* ignore vale

* Revise and add some conditional rendering

Signed-off-by: Diana <[email protected]>

* Fix table formatting

Signed-off-by: Diana <[email protected]>

---------

Signed-off-by: Diana <[email protected]>
Co-authored-by: Diana <[email protected]>
AntoineJac and cloudjumpercat authored Aug 23, 2024
1 parent 826a7c1 commit 42c58d8
Showing 3 changed files with 191 additions and 32 deletions.
27 changes: 18 additions & 9 deletions app/_hub/kong-inc/prometheus/overview/_index.md
@@ -52,13 +52,16 @@ license signature. Those metrics are only exported on {{site.base_gateway}}.
timers, in Running or Pending state.

{% if_version gte:3.0.x %}
### Metrics disabled by default
The following metrics are disabled by default because they can create high metric cardinality and
cause performance issues:

#### Status code metrics
When `status_code_metrics` is set to true:
- **Status codes**: HTTP status codes returned by upstream services.
These are available per service, across all services, and per route per consumer.

#### Latency metrics
When `latency_metrics` is set to true:
- **Latencies Histograms**: Latency (in ms), as measured at Kong:
- **Request**: Total time taken by Kong and upstream services to serve
@@ -67,10 +70,12 @@ When `latency_metrics` is set to true:
plugins.
- **Upstream**: Time taken by the upstream service to respond to requests.

#### Bandwidth metrics
When `bandwidth_metrics` is set to true:
- **Bandwidth**: Total Bandwidth (egress/ingress) flowing through Kong.
This metric is available per service and as a sum across all services.

#### Upstream health metrics
When `upstream_health_metrics` is set to true:
- **Target Health**: The healthiness status (`healthchecks_off`, `healthy`, `unhealthy`, or `dns_error`) of targets
belonging to a given upstream as well as their subsystem (`http` or `stream`).
@@ -79,18 +84,22 @@ When `upstream_health_metrics` is set to true:
{% endif_version %}

{% if_version gte:3.8.x %}
When `ai_llm_metrics` is set to `true`:
- **AI Requests**: AI requests sent to LLM providers.
These are available per provider, model, cache, database name (if cached), and workspace.
- **AI Cost**: AI costs charged by LLM providers.
These are available per provider, model, cache, database name (if cached), and workspace.
- **AI Tokens**: AI tokens counted by LLM providers.
These are available per provider, model, cache, database name (if cached), token type, and workspace.
#### AI LLM metrics
All the following AI LLM metrics are available per provider, model, cache, database name (if cached), embeddings provider (if cached), embeddings model (if cached), and workspace.

For more details, see [AI Metrics](/gateway/latest/production/monitoring/ai-metrics/).
{% endif_version %}
When `ai_llm_metrics` is set to true:
- **AI Requests**: AI requests sent to LLM providers.
- **AI Cost**: AI costs charged by LLM providers.
- **AI Tokens**: AI tokens counted by LLM providers.
These are also available per token type in addition to the options listed previously.
- **AI LLM Latency**: Time taken by LLM providers to return a response.
- **AI Cache Fetch Latency**: Time taken to return a response from the cache.
- **AI Cache Embeddings Latency**: Time taken to generate embeddings for semantic caching.

For more details, see [AI Metrics](/gateway/{{ page.release }}/production/monitoring/ai-metrics/).
{% endif_version %}
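
Each of these metric families is toggled by a boolean field in the Prometheus plugin configuration. As a minimal sketch (the Admin API address `localhost:8001` and the global plugin scope are assumptions), the optional families can be enabled when the plugin is created; the AI flag described above can be added the same way on versions that support it:

```bash
# Sketch: create a global Prometheus plugin with the optional metric families enabled.
# Assumes the Admin API is reachable on localhost:8001; keep only the flags you need.
curl -X POST http://localhost:8001/plugins \
  --data "name=prometheus" \
  --data "config.status_code_metrics=true" \
  --data "config.latency_metrics=true" \
  --data "config.bandwidth_metrics=true" \
  --data "config.upstream_health_metrics=true"
```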

### Metrics output example
Here is an example of output you could expect from the `/metrics` endpoint:

```bash
146 changes: 133 additions & 13 deletions app/_src/gateway/production/logging/ai-analytics.md
@@ -11,6 +11,7 @@ Each AI plugin returns a set of tokens.

All log entries include the following attributes:

{% if_version lte:3.7.x %}
```json
"ai": {
"payload": { "request": "[$optional_payload_request_]" },
@@ -45,23 +46,142 @@ All log entries include the following attributes:
}
}
}

```
{% endif_version %}
{% if_version gte:3.8.x %}
```json
"ai": {
"payload": { "request": "[$optional_payload_request_]" },
"[$plugin_name_1]": {
"payload": { "response": "[$optional_payload_response]" },
"usage": {
"prompt_token": 28,
"total_tokens": 48,
"completion_token": 20,
"cost": 0.0038,
"time_per_token": 133
},
"meta": {
"request_model": "command",
"provider_name": "cohere",
"response_model": "command",
"plugin_id": "546c3856-24b3-469a-bd6c-f6083babd2cd",
"llm_latency": 2670
}
},
"[$plugin_name_2]": {
"payload": { "response": "[$optional_payload_response]" },
"usage": {
"prompt_token": 89,
"total_tokens": 145,
"completion_token": 56,
"cost": 0.0012,
"time_per_token": 87
},
"meta": {
"request_model": "gpt-35-turbo",
"provider_name": "azure",
"response_model": "gpt-35-turbo",
"plugin_id": "5df193be-47a3-4f1b-8c37-37e31af0568b",
"llm_latency": 4927
}
}
}
```
{% endif_version %}

### Log details

Each log entry includes the following details:

Property | Description
---------|-------------
`ai.payload.request` | The request payload.
`ai.[$plugin_name].payload.response` |The response payload.
`ai.[$plugin_name].usage.prompt_token` | Number of tokens used for prompting.
`ai.[$plugin_name].usage.completion_token` | Number of tokens used for completion.
`ai.[$plugin_name].usage.total_tokens` | Total number of tokens used.
`ai.[$plugin_name].usage.cost` | The total cost of the request (input and output cost).
`ai.[$plugin_name].meta.request_model` | Model used for the AI request.
`ai.[$plugin_name].meta.provider_name` | Name of the AI service provider.
`ai.[$plugin_name].meta.response_model` | Model used for the AI response.
`ai.[$plugin_name].meta.plugin_id` | Unique identifier of the plugin.
<!--vale off-->

| Property | Description |
| --------- | ------------- |
| `ai.payload.request` | The request payload. |
| `ai.[$plugin_name].payload.response` | The response payload. |
| `ai.[$plugin_name].usage.prompt_token` | Number of tokens used for prompting. |
| `ai.[$plugin_name].usage.completion_token` | Number of tokens used for completion. |
| `ai.[$plugin_name].usage.total_tokens` | Total number of tokens used. |
| `ai.[$plugin_name].usage.cost` | The total cost of the request (input and output cost). |

{% if_version gte:3.8.x %}
| `ai.[$plugin_name].usage.time_per_token` | The average time to generate an output token, in milliseconds. |
{% endif_version %}

| `ai.[$plugin_name].meta.request_model` | Model used for the AI request. |
| `ai.[$plugin_name].meta.provider_name` | Name of the AI service provider. |
| `ai.[$plugin_name].meta.response_model` | Model used for the AI response. |
| `ai.[$plugin_name].meta.plugin_id` | Unique identifier of the plugin. |

{% if_version gte:3.8.x %}
| `ai.[$plugin_name].meta.llm_latency` | The time, in milliseconds, it took the LLM provider to generate the full response. |
| `ai.[$plugin_name].cache.cache_status` | The cache status. This can be `Hit`, `Miss`, `Bypass`, or `Refresh`. |
| `ai.[$plugin_name].cache.fetch_latency` | The time, in milliseconds, it took to return a cache response. |
| `ai.[$plugin_name].cache.embeddings_provider` | For semantic caching, the provider used to generate the embeddings. |
| `ai.[$plugin_name].cache.embeddings_model` | For semantic caching, the model used to generate the embeddings. |
| `ai.[$plugin_name].cache.embeddings_latency` | For semantic caching, the time taken to generate the embeddings. |
{% endif_version %}

<!--vale on-->
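
To illustrate how these attributes nest, the following sketch extracts the usage fields from a captured log entry with `jq`. The file name `kong-ai-log.json` and the plugin key `ai-proxy` are hypothetical placeholders for your own logging output and plugin name.

```bash
# Hypothetical example: pull token usage and cost out of one logged entry.
# Replace kong-ai-log.json and "ai-proxy" with your own log file and plugin name.
jq '.ai["ai-proxy"] | {
      prompt_tokens:     .usage.prompt_token,
      completion_tokens: .usage.completion_token,
      cost:              .usage.cost
    }' kong-ai-log.json
```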

{% if_version gte:3.8.x %}
### Cache logging

If you're using the [AI Semantic Cache plugin](/hub/kong-inc/), log entries include additional details about caching:

```json
"ai": {
"payload": { "request": "[$optional_payload_request_]" },
"[$plugin_name_1]": {
"payload": { "response": "[$optional_payload_response]" },
"usage": {
"prompt_token": 28,
"total_tokens": 48,
"completion_token": 20,
"cost": 0.0038,
"time_per_token": 133
},
"meta": {
"request_model": "command",
"provider_name": "cohere",
"response_model": "command",
"plugin_id": "546c3856-24b3-469a-bd6c-f6083babd2cd",
"llm_latency": 2670
},
"cache": {
"cache_status": "Hit",
"fetch_latency": 21
}
},
"[$plugin_name_2]": {
"payload": { "response": "[$optional_payload_response]" },
"usage": {
"prompt_token": 89,
"total_tokens": 145,
"completion_token": 56,
"cost": 0.0012,
},
"meta": {
"request_model": "gpt-35-turbo",
"provider_name": "azure",
"response_model": "gpt-35-turbo",
"plugin_id": "5df193be-47a3-4f1b-8c37-37e31af0568b",
},
"cache": {
"cache_status": "Hit",
"fetch_latency": 444,
"embeddings_provider": "openai",
"embeddings_model": "text-embedding-3-small",
"embeddings_latency": 424
}
}
}
```

{:.note}
> **Note:**
> When returning a cache response, `time_per_token` and `llm_latency` are omitted.
> A cached response can come from either a semantic cache or an exact cache. If it comes from the semantic cache, the log entry includes additional details such as the embeddings provider, embeddings model, and embeddings latency.
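
The `cache` object alone is enough to tell whether a logged response was served from the cache and whether the hit was semantic or exact. The following is a sketch using the same hypothetical log file (`kong-ai-log.json`) and plugin key (`ai-proxy`) as the earlier `jq` example; substitute your own names.

```bash
# Hypothetical example: report cache status; the embeddings fields are only present
# for semantic cache hits, so fall back to a marker for exact cache hits.
jq '.ai["ai-proxy"].cache | {
      cache_status,
      fetch_latency,
      embeddings_provider: (.embeddings_provider // "exact cache"),
      embeddings_model:    (.embeddings_model // "exact cache")
    }' kong-ai-log.json
```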
{% endif_version %}

50 changes: 40 additions & 10 deletions app/_src/gateway/production/monitoring/ai-metrics.md
@@ -37,12 +37,27 @@ dashboard](https://grafana.com/grafana/dashboards/21162-kong-cx-ai/):

## Available metrics

{% if_version lte:3.7.x %}
- **AI Requests**: AI requests sent to LLM providers.
These are available per provider, model, cache, database name (if cached), and workspace.
- **AI Cost**: AI costs charged by LLM providers.
These are available per provider, model, cache, database name (if cached), and workspace.
- **AI Tokens**: AI tokens counted by LLM providers.
These are available per provider, model, cache, database name (if cached), token type, and workspace.
{% endif_version %}

{% if_version gte:3.8.x %}
All the following AI LLM metrics are available per provider, model, cache, database name (if cached), embeddings provider (if cached), embeddings model (if cached), and workspace.

When `ai_llm_metrics` is set to true:
- **AI Requests**: AI requests sent to LLM providers.
- **AI Cost**: AI costs charged by LLM providers.
- **AI Tokens**: AI tokens counted by LLM providers.
These are also available per token type in addition to the options listed previously.
- **AI LLM Latency**: Time taken by LLM providers to return a response.
- **AI Cache Fetch Latency**: Time taken to return a response from the cache.
- **AI Cache Embeddings Latency**: Time taken to generate embeddings for semantic caching.
{% endif_version %}

AI metrics are disabled by default because they can create high metric cardinality and
cause performance issues. To enable them, set `ai_metrics` to true in the Prometheus plugin configuration.
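
As a minimal sketch (assuming the Admin API listens on `localhost:8001` and the plugin is applied globally, with the field name as given above), the flag can be set when the plugin is created:

```bash
# Sketch: create a global Prometheus plugin with AI metrics enabled.
# Assumes the Admin API is reachable on localhost:8001.
curl -X POST http://localhost:8001/plugins \
  --data "name=prometheus" \
  --data "config.ai_metrics=true"
```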
@@ -63,16 +78,31 @@ Transfer-Encoding: chunked
Connection: keep-alive
Access-Control-Allow-Origin: *

# HELP ai_requests_total AI requests total per ai_provider in Kong
# TYPE ai_requests_total counter
ai_requests_total{ai_provider="provider1",ai_model="model1",cache_status="hit",vector_db="redis",embeddings_provider="openai",embeddings_model="text-embedding-3-large",workspace="workspace1"} 100
# HELP ai_cost_total AI requests cost per ai_provider/cache in Kong
# TYPE ai_cost_total counter
ai_cost_total{ai_provider="provider1",ai_model="model1",cache_status="hit",vector_db="redis",embeddings_provider="openai",embeddings_model="text-embedding-3-large",workspace="workspace1"} 50
# HELP ai_tokens_total AI tokens total per ai_provider/cache in Kong
# TYPE ai_tokens_total counter
ai_tokens_total{ai_provider="provider1",ai_model="model1",cache_status="hit",vector_db="redis",embeddings_provider="openai",embeddings_model="text-embedding-3-large",token_type="input",workspace="workspace1"} 1000
ai_tokens_total{ai_provider="provider1",ai_model="model1",cache_status="hit",vector_db="redis",embeddings_provider="openai",embeddings_model="text-embedding-3-large",token_type="output",workspace="workspace1"} 2000
{% if_version gte:3.0.x %}
# HELP ai_llm_requests_total AI requests total per ai_provider in Kong
# TYPE ai_llm_requests_total counter
ai_llm_requests_total{ai_provider="provider1",ai_model="model1",cache_status="hit",vector_db="redis",embeddings_provider="openai",embeddings_model="text-embedding-3-large",workspace="workspace1"} 100
# HELP ai_llm_cost_total AI requests cost per ai_provider/cache in Kong
# TYPE ai_llm_cost_total counter
ai_llm_cost_total{ai_provider="provider1",ai_model="model1",cache_status="hit",vector_db="redis",embeddings_provider="openai",embeddings_model="text-embedding-3-large",workspace="workspace1"} 50
# HELP ai_llm_provider_latency AI latencies per ai_provider in Kong
# TYPE ai_llm_provider_latency bucket
ai_llm_provider_latency_ms_bucket{ai_provider="provider1",ai_model="model1",cache_status="",vector_db="",embeddings_provider="",embeddings_model="",workspace="workspace1",le="+Inf"} 2
# HELP ai_llm_tokens_total AI tokens total per ai_provider/cache in Kong
# TYPE ai_llm_tokens_total counter
ai_llm_tokens_total{ai_provider="provider1",ai_model="model1",cache_status="",vector_db="",embeddings_provider="",embeddings_model="",token_type="prompt_tokens",workspace="workspace1"} 1000
ai_llm_tokens_total{ai_provider="provider1",ai_model="model1",cache_status="",vector_db="",embeddings_provider="",embeddings_model="",token_type="completion_tokens",workspace="workspace1"} 2000
ai_llm_tokens_total{ai_provider="provider1",ai_model="model1",cache_status="hit",vector_db="redis",embeddings_provider="openai",embeddings_model="text-embedding-3-large",token_type="total_tokens",workspace="workspace1"} 3000
# HELP ai_cache_fetch_latency AI cache latencies per ai_provider/database in Kong
# TYPE ai_cache_fetch_latency bucket
ai_cache_fetch_latency{ai_provider="provider1",ai_model="model1",cache_status="hit",vector_db="redis",embeddings_provider="openai",embeddings_model="text-embedding-3-large",workspace="workspace1",le="+Inf"} 2
# HELP ai_cache_embeddings_latency AI cache latencies per ai_provider/database in Kong
# TYPE ai_cache_embeddings_latency bucket
ai_cache_embeddings_latency{ai_provider="provider1",ai_model="model1",cache_status="hit",vector_db="redis",embeddings_provider="openai",embeddings_model="text-embedding-3-large",workspace="workspace1",le="+Inf"} 2
# HELP ai_llm_provider_latency AI cache latencies per ai_provider/database in Kong
# TYPE ai_llm_provider_latency bucket
ai_llm_provider_latency{ai_provider="provider1",ai_model="model1",cache_status="hit",vector_db="redis",embeddings_provider="openai",embeddings_model="text-embedding-3-large",workspace="workspace1",le="+Inf"} 2
{% endif_version %}
```
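
Once scraped, these series can be aggregated like any other Prometheus counters and histograms. The query below is a sketch that assumes a Prometheus server at `localhost:9090` is already scraping this node; it sums AI spend per provider over the last hour using the `ai_llm_cost_total` counter shown above (on versions that expose the older `ai_cost_total` name, substitute that metric).

```bash
# Sketch: total AI cost per provider over the last hour, queried from Prometheus.
# Assumes a Prometheus server at localhost:9090 that scrapes the /metrics endpoint above.
curl -G http://localhost:9090/api/v1/query \
  --data-urlencode 'query=sum by (ai_provider) (increase(ai_llm_cost_total[1h]))'
```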

{:.note}
