Add documentation for Prometheus metrics in Training Operator #3894

Status: Open (5 commits to merge into base branch master)

sophie0730

Follow up from kubeflow/training-operator#2254

Description

This PR adds a new section to the documentation explaining how to monitor Kubeflow training jobs using Prometheus metrics exposed by the Training Operator.

Changes

  • Added a new page titled "Prometheus Monitoring" under the Training Operator documentation.
  • Explained how to access the Prometheus metrics for the Training Operator.
  • Provided a detailed list of relevant metrics, including descriptions and label information.


Hi @sophie0730. Thanks for your PR.

I'm waiting for a kubeflow member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.


[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign jeffwan for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Member

@Arhell Arhell left a comment

/ok-to-test

Member

@andreyvelich andreyvelich left a comment

Thank you for this @sophie0730!
@kubeflow/wg-training-leads @StefanoFioravanzo @hbelmiro Please take a look.

The Training Operator includes a built-in `/metrics` endpoint that exposes Prometheus metrics. This feature is enabled by default and requires no additional configuration for basic use.

#### Accessing the Metrics
By default, the metrics are exposed on port 8080. The method to access these metrics may vary depending on your Kubernetes setup and environment.
Member

Please explain how to change the default port for metrics exporting: https://github.com/kubeflow/training-operator/blob/master/cmd/training-operator.v1/main.go#L83

Author

I've already added the information on changing the default port for metrics exporting. Thanks.
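
For readers following this thread, here is a minimal sketch of how a Prometheus server might scrape the endpoint discussed above. It is an illustration, not part of the PR: the job name and the target host `training-operator.kubeflow.svc` are assumptions about how the operator's Service is named in a given cluster. Only port 8080 and the `/metrics` path come from the text under review.

```yaml
# prometheus.yml fragment (sketch; names below are assumptions, not from the PR)
scrape_configs:
  - job_name: training-operator            # hypothetical job name
    metrics_path: /metrics                  # endpoint path named in the doc
    static_configs:
      - targets:
          # Assumed Service DNS name; replace it with whatever address reaches
          # the operator in your cluster. 8080 is the default port from the doc.
          - "training-operator.kubeflow.svc:8080"
```

If the port is changed through `metrics-bind-address`, the port in the target would need to change to match.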

@@ -0,0 +1,48 @@
+++
title = "Prometheus Monitoring"
Member

Can you please add this guide under the user guides?

Author

Sure. I have moved this guide under the user guides section.

content/en/docs/components/training/prometheus.md (outdated, resolved)

Important: Metrics are only generated in response to specific events. For example, job creation metrics will only appear after a job has been created. If a metric is not visible, it may be because the corresponding event has not occurred yet.

These metrics help you understand how your training jobs are doing. You can use this information to fix problems and make your jobs run better.
Member

I think you can move this statement to the beginning of this doc.

Author

Thank you for the suggestion! After reviewing, I think this sentence (line 46) duplicates information already mentioned at the beginning of the document. To avoid redundancy, I’ve removed it. Thanks!

| `framework` | The machine learning framework used (e.g. TensorFlow, PyTorch) |


Important: Metrics are only generated in response to specific events. For example, job creation metrics will only appear after a job has been created. If a metric is not visible, it may be because the corresponding event has not occurred yet.
Member

Author

I've incorporated the Note template, thanks for the suggestion!
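
To illustrate the event-driven behaviour described in the quoted note, here is a sketch of a Prometheus alerting rule. The metric name `training_operator_jobs_failed_total` is an assumption (the lines quoted in this thread name only the `framework` label), and the group name, alert name, and thresholds are hypothetical.

```yaml
# rules.yml fragment (sketch under the assumptions stated above)
groups:
  - name: training-operator-example        # hypothetical group name
    rules:
      - alert: TrainingJobsFailing          # hypothetical alert name
        # Assumed counter name; check the operator's /metrics output for the
        # exact series. Until the first failure event occurs, the series does
        # not exist, so this expression returns no data and the alert stays
        # inactive.
        expr: sum by (framework) (increase(training_operator_jobs_failed_total[15m])) > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Training jobs are failing for framework {{ $labels.framework }}"
```

An empty query result for such a rule is therefore not by itself a sign that scraping is broken; it may only mean that no matching event has happened yet.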

sophie0730 and others added 3 commits October 7, 2024 22:30
Co-authored-by: Andrey Velichkevich <[email protected]>
Signed-off-by: Sophie Hsu <[email protected]>
1. Add configuring metrics port section
2. Remove duplicate sentence
3. Use Note template for the consistent style
4. Move the doc under the user-guides directory

Signed-off-by: Sophie Hsu <[email protected]>
Comment on lines +16 to +35
### Configuring Metrics Port
By default, metrics are exposed on port 8080. If you want to change the default port for metrics exporting, simply add the `metrics-bind-address` argument.

For example, to change the metrics port to 8082:
```yaml
# deployment.yaml for the Training Operator
spec:
  containers:
    - command:
        - /manager
      image: kubeflow/training-operator
      name: training-operator
      ports:
        - containerPort: 8080
        - containerPort: 9443
          name: webhook-server
          protocol: TCP
      args:
        - "--metrics-bind-address=:8082" # Metrics port changed to 8082
```
Author

@andreyvelich Thank you for the suggestion. I've added a section on how to configure the default port for metrics exporting. Please let me know if you have further comments.

Co-authored-by: Helber Belmiro <[email protected]>
Signed-off-by: Sophie Hsu <[email protected]>
Contributor

@hbelmiro hbelmiro left a comment

/lgtm

Member

@terrytangyuan terrytangyuan left a comment

This is great to have!

The Training Operator includes a built-in `/metrics` endpoint that exposes Prometheus metrics. This feature is enabled by default and requires no additional configuration for basic use.

### Configuring Metrics Port
By default, metrics are exposed on port 8080. If you want to change the default port for metrics exporting, simply add the `metrics-bind-address` argument.
Member

The default is only available if it's enabled though, right?

Member

@tenzen-y tenzen-y left a comment

Thanks for doing this.

The Training Operator includes a built-in `/metrics` endpoint that exposes Prometheus metrics. This feature is enabled by default and requires no additional configuration for basic use.

### Configuring Metrics Port
By default, metrics are exposed on port 8080. If you want to change the default port for metrics exporting, simply add the `metrics-bind-address` argument.
Member

Could we mention that the metrics can be scraped from any IP address by default?
Then we can mention that the IP address can be limited with "x.x.x.x:xxxx".
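
A sketch of what the suggested addition could look like. With the default value `:8080` the host part is empty, so the endpoint listens on all interfaces; giving an explicit host restricts it. The address `127.0.0.1` is only an illustrative placeholder for the `x.x.x.x` in the comment above.

```yaml
# Container args fragment for the Training Operator Deployment (sketch)
args:
  # Default is ":8080": empty host, so the endpoint is reachable on every
  # interface of the pod. An explicit host limits where it listens.
  # 127.0.0.1 is a placeholder; with it, only processes inside the same pod
  # (for example a sidecar) could reach the endpoint.
  - "--metrics-bind-address=127.0.0.1:8080"
```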

Comment on lines +65 to +66


Member

Suggested change

6 participants