Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add documentation for Prometheus metrics in Training Operator #3894

Open
wants to merge 5 commits into
base: master
Choose a base branch
from
Open
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
66 changes: 66 additions & 0 deletions content/en/docs/components/training/user-guides/prometheus.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,66 @@
+++
title = "Prometheus Monitoring"
description = "Prometheus Metrics for the Training Operator"
weight = 70
+++

This guide explains how to monitor Kubeflow training jobs using Prometheus metrics. The Training Operator exposes these metrics, providing essential insights into the status of distributed machine learning workloads.

{{< note >}}
Metrics are only generated in response to specific events. For example, job creation metrics will only appear after a job has been created. If a metric is not visible, it may be because the corresponding event has not occurred yet.
{{< /note >}}

## Prometheus Metrics for Training Operator
The Training Operator includes a built-in `/metrics` endpoint exposes Prometheus metrics. This feature is enabled by default and requires no additional configuration for basic use.

### Configuring Metrics Port
By default, metrics are exposed on port 8080. If you want to change the default port for metrics exporting, simply add the `metrics-bind-address` argument.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The default is only available if it's enabled though, right?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we mention that the metrics can be scraped from any IP address by default?
And then we can mention that limiting the IP address by "x.x.x.x:xxxx".


For example, to change the metrics port to 8082:
```yaml
# deployment.yaml for the Training Operator
spec:
containers:
- command:
- /manager
image: kubeflow/training-operator
name: training-operator
ports:
- containerPort: 8080
- containerPort: 9443
name: webhook-server
protocol: TCP
args:
- "--metrics-bind-address=:8082" # Metrics port changed to 8082
```
Comment on lines +16 to +35
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@andreyvelich Thank you for the suggestion. I've added a section on how to configure the default port for metrics exporting. Please let me know if you have further comments.

### Accessing the Metrics
The method to access these metrics may vary depending on your Kubernetes setup and environment. For example, use the following command for local environments:
```
kubectl port-forward -n kubeflow deployment/training-operator 8080:8080
```

Then you'll see metrics in this format via `http://localhost:8080/metrics`:
```
# HELP training_operator_jobs_created_total Counts number of jobs created
# TYPE training_operator_jobs_created_total counter
training_operator_jobs_created_total{framework="tensorflow",job_namespace="kubeflow"} 7
```

## List of Job Metrics

| Metric name | Description | Labels |
|------------------------------------|---------|--------------------------|------------------------------------------------------|
| `training_operator_jobs_created_total` | Total number of jobs created | `namespace`, `framework` |
| `training_operator_jobs_deleted_total` | Total number of jobs deleted | `namespace`, `framework` |
| `training_operator_jobs_successful_total` | Total number of successful jobs | `namespace`, `framework` |
| `training_operator_jobs_failed_total` | Total number of failed jobs | `namespace`, `framework` |
| `training_operator_jobs_restarted_total` | Total number of restarted jobs | `namespace`, `framework`|

Labels information can be interpreted as follows:
| Label name | Description |
|------------------------------------|---------|--------------------------|
| `namespace` | The Kubernetes namespace where the job is running |
| `framework` | The machine learning framework used (e.g. TensorFlow,PyTorch) |



Comment on lines +65 to +66
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change