Add documentation for Prometheus metrics in Training Operator #3894
base: master
Signed-off-by: Sophie Hsu <[email protected]>
Hi @sophie0730. Thanks for your PR. I'm waiting for a kubeflow member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
/ok-to-test
Thank you for this @sophie0730!
@kubeflow/wg-training-leads @StefanoFioravanzo @hbelmiro Please take a look.
The Training Operator includes a built-in `/metrics` endpoint that exposes Prometheus metrics. This feature is enabled by default and requires no additional configuration for basic use.

#### Accessing the Metrics
By default, the metrics are exposed on port 8080. The method to access these metrics may vary depending on your Kubernetes setup and environment.
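One way to access the endpoint is to let an existing Prometheus server scrape it directly. The fragment below is a minimal sketch, not a tested configuration: the job name is hypothetical, and the target assumes the operator is reachable in-cluster through a Service named `training-operator` in the `kubeflow` namespace on the default port 8080. Adjust both to match your deployment.

```yaml
# prometheus.yml fragment (sketch; names below are assumptions)
scrape_configs:
  - job_name: training-operator          # hypothetical job name
    metrics_path: /metrics               # endpoint described above
    static_configs:
      - targets:
          # Assumed in-cluster Service DNS name and the default metrics port
          - training-operator.kubeflow.svc:8080
```

If the metrics port is changed from the default, the port in `targets` needs to change with it.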
Please explain how to change the default port for metrics exporting: https://github.com/kubeflow/training-operator/blob/master/cmd/training-operator.v1/main.go#L83
I've already added the information on changing the default port for metrics exporting. Thanks.
@@ -0,0 +1,48 @@
+++
title = "Prometheus Monitoring"
Please can you add this guide under user guides ?
Sure. Have moved this guide under user guides section.
Important: Metrics are only generated in response to specific events. For example, job creation metrics will only appear after a job has been created. If a metric is not visible, it may be because the corresponding event has not occurred yet.

These metrics help you understand how your training jobs are doing. You can use this information to fix problems and make your jobs run better.
I think, you can move this statement to the beginning of this doc.
Thank you for the suggestion! After reviewing, I think this sentence (line 46) duplicates information already mentioned at the beginning of the document. To avoid redundancy, I’ve decided to remove it. Thanks!
| `framework` | The machine learning framework used (e.g. TensorFlow, PyTorch) |

Important: Metrics are only generated in response to specific events. For example, job creation metrics will only appear after a job has been created. If a metric is not visible, it may be because the corresponding event has not occurred yet.
If you want you can use the Note template, like here: https://www.kubeflow.org/docs/about/membership/#:~:text=Members-,Note,-Detailed%20documentation%20for
I've incorporated the Note template, thanks for the suggestion!
Co-authored-by: Andrey Velichkevich <[email protected]> Signed-off-by: Sophie Hsu <[email protected]>
1. Add configuring metrics port section
2. Remove duplicate sentence
3. Use Note template for the consistent style
4. Move the doc under the user-guides directory

Signed-off-by: Sophie Hsu <[email protected]>
### Configuring Metrics Port
By default, metrics are exposed on port 8080. If you want to change the default port for metrics exporting, simply add the `metrics-bind-address` argument.

For example, to change the metrics port to 8082:
```yaml
# deployment.yaml for the Training Operator
spec:
  containers:
    - command:
        - /manager
      image: kubeflow/training-operator
      name: training-operator
      ports:
        - containerPort: 8080
        - containerPort: 9443
          name: webhook-server
          protocol: TCP
      args:
        - "--metrics-bind-address=:8082" # Metrics port changed to 8082
```
@andreyvelich Thank you for the suggestion. I've added a section on how to configure the default port for metrics exporting. Please let me know if you have further comments.
Co-authored-by: Helber Belmiro <[email protected]> Signed-off-by: Sophie Hsu <[email protected]>
/lgtm
This is great to have!
The Training Operator includes a built-in `/metrics` endpoint that exposes Prometheus metrics. This feature is enabled by default and requires no additional configuration for basic use.

### Configuring Metrics Port
By default, metrics are exposed on port 8080. If you want to change the default port for metrics exporting, simply add the `metrics-bind-address` argument.
The default is only available if it's enabled though, right?
Thanks for doing this.
The Training Operator includes a built-in `/metrics` endpoint that exposes Prometheus metrics. This feature is enabled by default and requires no additional configuration for basic use.

### Configuring Metrics Port
By default, metrics are exposed on port 8080. If you want to change the default port for metrics exporting, simply add the `metrics-bind-address` argument.
Could we mention that the metrics can be scraped from any IP address by default?
We could then mention that the address can be limited by specifying it as "x.x.x.x:xxxx".
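A minimal sketch of what the restricted form might look like in the Deployment, assuming `metrics-bind-address` accepts a `host:port` value in the usual listen-address style. The address `127.0.0.1` is only an illustrative choice and is not taken from the PR.

```yaml
# Fragment of the Training Operator container spec (sketch, values are assumptions)
args:
  # No host part: the endpoint listens on all interfaces of the pod
  # - "--metrics-bind-address=:8080"
  # Host part given: the endpoint listens only on that address
  - "--metrics-bind-address=127.0.0.1:8080"
```

Note that binding to a loopback address would make the endpoint unreachable from a Prometheus server running in another pod, so the address should be one that the scraper can actually reach.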
Follow up from kubeflow/training-operator#2254
Description
This PR adds a new section to the documentation explaining how to monitor Kubeflow training jobs using Prometheus metrics exposed by the Training Operator.
Changes