Update metrics docs. (#3939)
mbobrovskyi authored Jan 17, 2025
1 parent d289679 commit a1c4c4b
Showing 2 changed files with 24 additions and 14 deletions.
1 change: 1 addition & 0 deletions site/content/en/docs/installation/_index.md
@@ -264,6 +264,7 @@ The currently supported features are:
| `KeepQuotaForProvReqRetry` | `false` | Deprecated | 0.9 | 0.9 |
| `ManagedJobsNamespaceSelector` | `true` | Beta | 0.10 | |
| `LocalQueueDefaulting` | `false` | Alpha | 0.10 | |
| `LocalQueueMetrics` | `false` | Alpha | 0.10 | |

## What's next

37 changes: 23 additions & 14 deletions site/content/en/docs/reference/metrics.md
@@ -6,7 +6,8 @@ description: >
Prometheus metrics exported by Kueue
---
Kueue exposes [prometheus](https://prometheus.io) metrics to monitor the health
of the system and the status of [ClusterQueues](/docs/concepts/cluster_queue).
of the system and the status of [ClusterQueues](/docs/concepts/cluster_queue)
and [LocalQueues](/docs/concepts/local_queue).

## Kueue health

@@ -15,7 +16,7 @@ Use the following metrics to monitor the health of the kueue controllers:

| Metric name | Type | Description | Labels |
| -------------------------------------------- | ----------- | ----------------------------------------------------------------------------------------------------------------------------------------------- | ----------------------------------------------------------- |
| `kueue_admission_attempts_total` | Counter | The total number of attempts to[admit](/docs/concepts#admission) workloads. Each admission attempt might try to admit more than one workload. | `result`: possible values are `success` or `inadmissible` |
| `kueue_admission_attempts_total` | Counter | The total number of attempts to [admit](/docs/concepts#admission) workloads. Each admission attempt might try to admit more than one workload. | `result`: possible values are `success` or `inadmissible` |
| `kueue_admission_attempt_duration_seconds` | Histogram | The latency of an admission attempt. | `result`: possible values are `success` or `inadmissible` |
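
As a rough illustration of how the `result` label on `kueue_admission_attempts_total` might be used, the sketch below computes the share of successful admission attempts from counter values that have already been scraped. The helper name `admission_success_ratio` and the sample numbers are hypothetical. Because the metric is a cumulative counter, a real dashboard would more likely compare increases over a time window.

```python
def admission_success_ratio(attempts_by_result):
    """Return the fraction of admission attempts whose result is `success`.

    `attempts_by_result` maps a `result` label value (`success` or
    `inadmissible`) to the value of `kueue_admission_attempts_total`.
    Returns None when no attempts have been recorded.
    """
    total = sum(attempts_by_result.values())
    if total == 0:
        return None
    return attempts_by_result.get("success", 0) / total


# Hypothetical counter values, not taken from a real cluster.
sample = {"success": 90.0, "inadmissible": 10.0}
print(admission_success_ratio(sample))
```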

## ClusterQueue status
@@ -34,22 +35,28 @@ Use the following metrics to monitor the status of your ClusterQueues:
| `kueue_admission_checks_wait_time_seconds` | Histogram | The time from when a workload got the quota reservation until admission. | `cluster_queue`: the name of the ClusterQueue |
| `kueue_admitted_active_workloads` | Gauge | The number of admitted Workloads that are active (unsuspended and not finished) | `cluster_queue`: the name of the ClusterQueue |
| `kueue_cluster_queue_status` | Gauge | Reports the status of the ClusterQueue | `cluster_queue`: The name of the ClusterQueue<br> `status`: Possible values are `pending`, `active` or `terminated`. For a ClusterQueue, the metric only reports a value of 1 for one of the statuses. |
| `kueue_reserving_active_workloads` | Gauge | The number of Workloads that are reserving quota, per `cluster_queue`. | `cluster_queue`: the name of the ClusterQueue |
| `kueue_admission_cycle_preemption_skips` | Gauge | The number of Workloads in the ClusterQueue that got preemption candidates but had to be skipped because other ClusterQueues needed the same resources in the same cycle | `cluster_queue`: the name of the ClusterQueue |
| `kueue_preempted_workloads_total` | Counter | The number of preempted workloads per `preempting_cluster_queue` | `preempting_cluster_queue`: the name of the ClusterQueue<br> `reason`: possible values are `InClusterQueue` means that the workload was preempted by a workload in the same ClusterQueue; `InCohortReclamation` means that the workload was preempted by a workload in the same cohort due to reclamation of nominal quota; `InCohortFairSharing` means that the workload was preempted by a workload in the same cohort due to fair sharing; `InCohortReclaimWhileBorrowing` means that the workload was preempted by a workload in the same cohort due to reclamation of nominal quota while borrowing |

## LocalQueue Status (alpha)

The following metrics are available only if `LocalQueueMetrics` feature gate is enabled. Check the [Change the feature gates configuration](/docs/installation/#change-the-feature-gates-configuration) section of the [Installation](/docs/installation/) for details.

| Metric Name | Type | Description | Labels |
| ------------------------------------------------ | ----------- | ------------------------------------------------------------------------------------------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `local_queue_pending_workloads` | Gauge | The number of pending workloads, per 'local_queue' and 'status'. | `name`: the name of the LocalQueue<br />`namespace`: the namespace that the LocalQueue resides in<br />`status`: can be either `active` for the number of active pending workloads or `inadmissible` |
| `local_queue_quota_reserved_workloads_total` | Counter | The number of workloads with quota reserved in a LocalQueue | `name`: the name of the LocalQueue<br />`namespace`: the namespace that the LocalQueue resides in |
| `local_queue_quota_reserved_wait_time_seconds` | Histogram | The time between a workload was created or requeued until it got quota reservation, per`local_queue` | `name`: the name of the LocalQueue<br />`namespace`: the namespace that the LocalQueue resides in |
| `local_queue_admitted_workloads_total` | Counter | The total number of admitted workloads per`local_queue` | `name`: the name of the LocalQueue<br />`namespace`: the namespace that the LocalQueue resides in |
| `local_queue_admission_wait_time_seconds` | Histogram | The time between a workload was created or requeued until admission, per`local_queue` | `name`: the name of the LocalQueue<br />`namespace`: the namespace that the LocalQueue resides in |
| `local_queue_evicted_workloads_total` | Counter | The number of evicted workloads per`local_queue` | `name`: the name of the LocalQueue<br />`namespace`: the namespace that the LocalQueue resides in<br />`reason`: the reason the workload was pre-empted. It can have the following values ["Preempted", "PodsReadyTimeout", "AdmissionCheck", "ClusterQueueStopped", "Deactivated"] |
| `local_queue_reserving_active_workloads` | Gauge | The number of Workloads that are reserving quota, per`localQueue` | `name`: the name of the LocalQueue<br />`namespace`: the namespace that the LocalQueue resides in |
| `local_queue_admitted_active_workloads` | Gauge | The number of admitted Workloads that are active (unsuspended and not finished), per`localQueue` | `name`: the name of the LocalQueue<br />`namespace`: the namespace that the LocalQueue resides in |
| `local_queue_status` | Gauge | Reports a LocalQueue's`active` status (ability to schedule workloads) | `name`: the name of the LocalQueue<br />`namespace`: the namespace that the LocalQueue resides in<br />`active`: one of [`True`, `False`, `Unknown`] and exclusively one is positive at any given time |
| `local_queue_resource_usage` | Gauge | Reports the LocalQueue's total resource usage within all the`flavors` | `name`: the name of the LocalQueue<br />`namespace`: the namespace that the LocalQueue resides in<br />`flavor`: the name of the flavor which resources are being consumed from<br />`resource`: the resource which is being consumed |
| Metric Name | Type | Description | Labels |
|--------------------------------------------------------|-----------|-------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| `kueue_local_queue_pending_workloads` | Gauge | The number of pending workloads, per `local_queue` and `status`. | `name`: the name of the LocalQueue<br />`namespace`: the namespace that the LocalQueue resides in<br />`status`: can be either `active` for the number of active pending workloads or `inadmissible` |
| `kueue_local_queue_quota_reserved_workloads_total` | Counter | The number of workloads with quota reserved in a LocalQueue | `name`: the name of the LocalQueue<br />`namespace`: the namespace that the LocalQueue resides in |
| `kueue_local_queue_quota_reserved_wait_time_seconds` | Histogram | The time between a workload was created or requeued until it got quota reservation, per `local_queue` | `name`: the name of the LocalQueue<br />`namespace`: the namespace that the LocalQueue resides in |
| `kueue_local_queue_admitted_workloads_total` | Counter | The total number of admitted workloads per `local_queue` | `name`: the name of the LocalQueue<br />`namespace`: the namespace that the LocalQueue resides in |
| `kueue_local_queue_admission_checks_wait_time_seconds` | Histogram | The time from when a workload got the quota reservation until admission, per `local_queue` | `name`: the name of the LocalQueue<br />`namespace`: the namespace that the LocalQueue resides in |
| `kueue_local_queue_admission_wait_time_seconds` | Histogram | The time between a workload was created or requeued until admission, per `local_queue` | `name`: the name of the LocalQueue<br />`namespace`: the namespace that the LocalQueue resides in |
| `kueue_local_queue_evicted_workloads_total` | Counter | The number of evicted workloads per `local_queue` | `name`: the name of the LocalQueue<br />`namespace`: the namespace that the LocalQueue resides in<br />`reason`: the reason the workload was evicted. It can have the following values ["Preempted", "PodsReadyTimeout", "AdmissionCheck", "ClusterQueueStopped", "Deactivated"] |
| `kueue_local_queue_reserving_active_workloads` | Gauge | The number of Workloads that are reserving quota, per `localQueue` | `name`: the name of the LocalQueue<br />`namespace`: the namespace that the LocalQueue resides in |
| `kueue_local_queue_admitted_active_workloads` | Gauge | The number of admitted Workloads that are active (unsuspended and not finished), per `localQueue` | `name`: the name of the LocalQueue<br />`namespace`: the namespace that the LocalQueue resides in |
| `kueue_local_queue_status` | Gauge | Reports a LocalQueue's `active` status (ability to schedule workloads) | `name`: the name of the LocalQueue<br />`namespace`: the namespace that the LocalQueue resides in<br />`active`: one of [`True`, `False`, `Unknown`] and exclusively one is positive at any given time |
| `kueue_local_queue_resource_reservation` | Gauge | Reports the LocalQueue's total resource reservation within all the `flavors` | `name`: the name of the LocalQueue<br />`namespace`: the namespace that the LocalQueue resides in<br />`flavor`: the name of the flavor which resources are being consumed from<br />`resource`: the resource which is being consumed |
| `kueue_local_queue_resource_usage` | Gauge | Reports the LocalQueue's total resource usage within all the `flavors` | `name`: the name of the LocalQueue<br />`namespace`: the namespace that the LocalQueue resides in<br />`flavor`: the name of the flavor which resources are being consumed from<br />`resource`: the resource which is being consumed |
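
To show how the `name`, `namespace` and `status` labels fit together, here is a minimal sketch that sums `kueue_local_queue_pending_workloads` across `status` values for each LocalQueue. It reads Prometheus text exposition output with a deliberately simplified parser that ignores timestamps and escaped quotes. The function name, queue names and values are made up for illustration.

```python
import re
from collections import defaultdict

# Matches lines such as: metric_name{label="value",...} 12
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')

METRIC = "kueue_local_queue_pending_workloads"


def pending_by_local_queue(exposition_text):
    """Sum pending workloads over `status`, keyed by (namespace, name)."""
    totals = defaultdict(float)
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if match is None or match.group("name") != METRIC:
            continue
        labels = dict(LABEL_RE.findall(match.group("labels")))
        totals[(labels["namespace"], labels["name"])] += float(match.group("value"))
    return dict(totals)


# Hypothetical scrape output; queue names and values are invented.
SAMPLE = """
# TYPE kueue_local_queue_pending_workloads gauge
kueue_local_queue_pending_workloads{name="team-a",namespace="default",status="active"} 3
kueue_local_queue_pending_workloads{name="team-a",namespace="default",status="inadmissible"} 2
kueue_local_queue_pending_workloads{name="team-b",namespace="batch",status="active"} 1
"""

print(pending_by_local_queue(SAMPLE))
```

Keying on both `namespace` and `name` matters because LocalQueues are namespaced, so two queues in different namespaces can share a name.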

### Optional metrics

@@ -58,7 +65,9 @@ The following metrics are available only if `metrics.enableClusterQueueResources

| Metric name | Type | Description | Labels |
| --------------------------------------- | ------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `kueue_cluster_queue_resource_reservation` | Gauge | Reports the cluster_queue's total resource reservation within all the flavors | `cohort`: The cohort in which the queue belongs<br> `cluster_queue`: The name of the ClusterQueue<br> `flavor`: referenced flavor<br> `resource`: The resource name |
| `kueue_cluster_queue_resource_usage` | Gauge | Reports the ClusterQueue's total resource usage | `cohort`: The cohort in which the queue belongs<br> `cluster_queue`: The name of the ClusterQueue<br> `flavor`: referenced flavor<br> `resource`: The resource name |
| `kueue_cluster_queue_nominal_quota` | Gauge | Reports the ClusterQueue's resource quota | `cohort`: The cohort in which the queue belongs<br> `cluster_queue`: The name of the ClusterQueue<br> `flavor`: referenced flavor<br> `resource`: The resource name |
| `kueue_cluster_queue_borrowing_limit` | Gauge | Reports the ClusterQueue's resource borrowing limit | `cohort`: The cohort in which the queue belongs<br> `cluster_queue`: The name of the ClusterQueue<br> `flavor`: referenced flavor<br> `resource`: The resource name |
| `kueue_cluster_queue_lending_limit` | Gauge | Reports the cluster_queue's resource lending limit within all the flavors | `cohort`: The cohort in which the queue belongs<br> `cluster_queue`: The name of the ClusterQueue<br> `flavor`: referenced flavor<br> `resource`: The resource name |
| `kueue_cluster_queue_weighted_share` | Gauge | Reports a value representing the maximum of the ratios of usage above nominal quota to the lendable resources in the cohort, among all the resources provided by the ClusterQueue. | `cluster_queue`: The name of the ClusterQueue |
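
One way the optional resource metrics could be combined is to divide `kueue_cluster_queue_resource_usage` by `kueue_cluster_queue_nominal_quota` for each `cluster_queue`, `flavor` and `resource` combination. The sketch below assumes the gauge values have already been collected into dictionaries. The helper name, queue names and numbers are hypothetical.

```python
def quota_utilization(usage, nominal_quota):
    """Return usage divided by nominal quota for each series key.

    Both arguments map (cluster_queue, flavor, resource) tuples to gauge
    values. Keys with a zero quota are skipped to avoid division by zero.
    A result above 1.0 suggests the ClusterQueue is borrowing.
    """
    utilization = {}
    for key, quota in nominal_quota.items():
        if quota > 0:
            utilization[key] = usage.get(key, 0.0) / quota
    return utilization


# Hypothetical gauge values for a single ClusterQueue.
usage = {("team-a-cq", "default-flavor", "cpu"): 6.0}
quota = {
    ("team-a-cq", "default-flavor", "cpu"): 8.0,
    ("team-a-cq", "default-flavor", "memory"): 0.0,
}
print(quota_utilization(usage, quota))
```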
