Add documentation for Prometheus metrics in Training Operator #3894

sophie0730 · 2024-10-05T17:45:20Z

Follow up from kubeflow/training-operator#2254

Description

This PR adds a new section to the documentation explaining how to monitor Kubeflow training jobs using Prometheus metrics exposed by the Training Operator.

Changes

Rewrote and added a new page titled "Prometheus Monitoring" under the Training Operator documentation.
Explained how to access the Prometheus metrics for the Training Operator.
Provided a detailed list of relevant metrics, including description and label information.

Signed-off-by: Sophie Hsu <[email protected]>

google-oss-prow · 2024-10-05T17:45:31Z

Hi @sophie0730. Thanks for your PR.

I'm waiting for a kubeflow member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

google-oss-prow · 2024-10-05T17:45:38Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign jeffwan for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

content/en/docs/components/training/OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Arhell

/ok-to-test

andreyvelich

Thank you for this @sophie0730!
@kubeflow/wg-training-leads @StefanoFioravanzo @hbelmiro Please take a look.

andreyvelich · 2024-10-07T12:38:01Z

content/en/docs/components/training/prometheus.md

+The Training Operator includes a built-in `/metrics` endpoint that exposes Prometheus metrics. This feature is enabled by default and requires no additional configuration for basic use.
+
+#### Accessing the Metrics
+By default, the metrics are exposed on port 8080. The method to access these metrics may vary depending on your Kubernetes setup and environment.


Please explain how to change the default port for metrics exporting: https://github.com/kubeflow/training-operator/blob/master/cmd/training-operator.v1/main.go#L83

I've already added the information on changing the default port for metrics exporting. Thanks.
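
For readers following this thread, here is a minimal sketch of how a Prometheus server might be pointed at the endpoint discussed above. It is an illustration, not part of the PR: the job name and the Service DNS name are assumptions, and only the `/metrics` path and the default port 8080 come from the doc under review.

```yaml
# prometheus.yml fragment (sketch; job name and target are assumptions)
scrape_configs:
  - job_name: training-operator          # hypothetical job name
    metrics_path: /metrics               # endpoint named in the doc under review
    static_configs:
      - targets:
          # Assumed Service DNS name; 8080 is the default metrics port
          - training-operator.kubeflow.svc:8080
```

If the metrics port is changed through `metrics-bind-address`, the target port here would need to change with it.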

andreyvelich · 2024-10-07T12:40:05Z

content/en/docs/components/training/prometheus.md

@@ -0,0 +1,48 @@
+++
+title = "Prometheus Monitoring"


Can you please add this guide under user guides?

Sure. I have moved this guide under the user guides section.

andreyvelich · 2024-10-07T12:42:36Z

content/en/docs/components/training/prometheus.md

+
+Important: Metrics are only generated in response to specific events. For example, job creation metrics will only appear after a job has been created. If a metric is not visible, it may be because the corresponding event has not occurred yet.
+
+These metrics help you understand how your training jobs are doing. You can use this information to fix problems and make your jobs run better.


I think you can move this statement to the beginning of this doc.

Thank you for the suggestion! After reviewing, I think this sentence (line 46) duplicates information already mentioned at the beginning of the document. To avoid redundancy, I’ve decided to remove it. Thanks!

andreyvelich · 2024-10-07T12:43:18Z

content/en/docs/components/training/prometheus.md

+| `framework` | The machine learning framework used (e.g. TensorFlow, PyTorch) |
+
+
+Important: Metrics are only generated in response to specific events. For example, job creation metrics will only appear after a job has been created. If a metric is not visible, it may be because the corresponding event has not occurred yet.


If you want, you can use the Note template, like here: https://www.kubeflow.org/docs/about/membership/#:~:text=Members-,Note,-Detailed%20documentation%20for

I've incorporated the Note template, thanks for the suggestion!

sophie0730 · 2024-10-07T16:33:30Z

content/en/docs/components/training/user-guides/prometheus.md

+### Configuring Metrics Port
+By default, metrics are exposed on port 8080. If you want to change the default port for metrics exporting, simply add the `metrics-bind-address` argument.
+
+For example, to change the metrics port to 8082:
+```yaml
+# deployment.yaml for the Training Operator
+spec:
+  containers:
+    - command:
+        - /manager
+      image: kubeflow/training-operator
+      name: training-operator
+      ports:
+        - containerPort: 8080
+        - containerPort: 9443
+          name: webhook-server
+          protocol: TCP
+      args:
+        - "--metrics-bind-address=:8082" # Metrics port changed to 8082
+```


@andreyvelich Thank you for the suggestion. I've added a section on how to configure the default port for metrics exporting. Please let me know if you have further comments.
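
One detail the example above leaves open: the `ports` list still declares `containerPort: 8080` even though the metrics listener moves to 8082. In Kubernetes, `containerPort` is mainly informational, so the listener works either way, but keeping the two in sync may avoid confusion. A possible sketch, where the port name `metrics` is a hypothetical label and not taken from the operator's shipped manifest:

```yaml
# deployment.yaml fragment (sketch, not the operator's shipped manifest)
spec:
  containers:
    - name: training-operator
      args:
        - "--metrics-bind-address=:8082"
      ports:
        - containerPort: 8082   # declared to match the new metrics port
          name: metrics         # hypothetical port name
          protocol: TCP
```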

hbelmiro

/lgtm

terrytangyuan

This is great to have!

terrytangyuan · 2024-10-10T01:04:31Z

content/en/docs/components/training/user-guides/prometheus.md

+The Training Operator includes a built-in `/metrics` endpoint that exposes Prometheus metrics. This feature is enabled by default and requires no additional configuration for basic use.
+
+### Configuring Metrics Port
+By default, metrics are exposed on port 8080. If you want to change the default port for metrics exporting, simply add the `metrics-bind-address` argument.


The default is only available if it's enabled though, right?

tenzen-y

Thanks for doing this.

tenzen-y · 2024-10-10T19:41:32Z

content/en/docs/components/training/user-guides/prometheus.md

+The Training Operator includes a built-in `/metrics` endpoint that exposes Prometheus metrics. This feature is enabled by default and requires no additional configuration for basic use.
+
+### Configuring Metrics Port
+By default, metrics are exposed on port 8080. If you want to change the default port for metrics exporting, simply add the `metrics-bind-address` argument.


Could we mention that, by default, the metrics can be scraped from any IP address?
Then we can mention that the listening address can be restricted by specifying "x.x.x.x:xxxx".
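
A minimal sketch of what this suggestion could look like in the Deployment arguments, assuming `metrics-bind-address` accepts a `host:port` value as the "x.x.x.x:xxxx" form above implies. The loopback address is only an illustrative choice:

```yaml
# Container args fragment (sketch; the address is an illustrative assumption)
args:
  # ":8080" (the default) listens on all interfaces of the pod.
  # A host:port value limits the listener to that one address.
  - "--metrics-bind-address=127.0.0.1:8080"
```

Note that binding to loopback means a Prometheus server running in another pod could no longer reach the endpoint directly, so the address would need to be one the scraper can actually reach.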

tenzen-y · 2024-10-10T19:41:46Z

content/en/docs/components/training/user-guides/prometheus.md

+
+

Suggested change

Add Prometheus metrics guild for Training Operator

5de277d

Signed-off-by: Sophie Hsu <[email protected]>

google-oss-prow bot added the needs-ok-to-test label Oct 5, 2024

google-oss-prow bot added the size/M label Oct 5, 2024

google-oss-prow bot requested review from andreyvelich and gaocegege October 5, 2024 17:45

Arhell reviewed Oct 6, 2024

google-oss-prow bot added ok-to-test and removed needs-ok-to-test labels Oct 6, 2024

andreyvelich reviewed Oct 7, 2024

sophie0730 and others added 3 commits October 7, 2024 22:30

Correct formating in Label description

f4f9ff7

Co-authored-by: Andrey Velichkevich <[email protected]> Signed-off-by: Sophie Hsu <[email protected]>

Merge remote-tracking branch 'upstream/master' into doc/prometheus-monitoring

05d606a

Incorporate feedback:

0457165

1. Add configuring metrics port section
2. Remove duplicate sentence
3. Use Note template for the consistent style
4. Move the doc under the user-guides directory

Signed-off-by: Sophie Hsu <[email protected]>

sophie0730 commented Oct 7, 2024

hbelmiro reviewed Oct 7, 2024

Clarify labels information interpretation

0b4c662

Co-authored-by: Helber Belmiro <[email protected]> Signed-off-by: Sophie Hsu <[email protected]>

hbelmiro reviewed Oct 8, 2024

google-oss-prow bot assigned hbelmiro Oct 8, 2024

google-oss-prow bot added the lgtm label Oct 8, 2024

terrytangyuan reviewed Oct 10, 2024

tenzen-y reviewed Oct 10, 2024
