
A metrics collector for Kubeflow Pipeline Metrics artifacts #2019

Open
votti opened this issue Nov 17, 2022 · 13 comments

Comments

@votti

votti commented Nov 17, 2022

/kind feature

Describe the solution you'd like
Currently an aim is to do parameter tuning over pipelines in Katib (#1914, #1993).

Kubeflow pipelines allow for dedicated metrics artifacts:
https://kubeflow-pipelines.readthedocs.io/en/master/source/dsl.html?h=metrics#kfp.dsl.Metrics
https://www.kubeflow.org/docs/components/pipelines/v1/sdk/pipelines-metrics/

Having a dedicated Katib sidecar metrics collector that collects the metrics from these artifacts would make pipelines and Katib work together quite nicely.

The current workaround is to use the stdout collector, but this causes issues with the complex commands in pipeline components (#1914; a dedicated issue will follow soon).
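For reference, a KFP v1 component exports such metrics as a small JSON artifact (the mlpipeline-metrics output). Roughly, the file a dedicated collector would have to read looks like this (metric names here are just placeholders):

{
  "metrics": [
    {"name": "accuracy-score", "numberValue": 0.92, "format": "PERCENTAGE"},
    {"name": "loss", "numberValue": 0.31, "format": "RAW"}
  ]
}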

Anything else you would like to add:


Love this feature? Give it a 👍 We prioritize the features with the most 👍

@votti
Author

votti commented Nov 19, 2022

I think I may give this a go: I would try to build it in Python, analogous to the tfevent-metricscollector.
Does this sound like a reasonable approach? I am also happy to hear any other suggestions.
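To make the idea a bit more concrete, here is a minimal sketch of the core of such a collector, assuming the KFP v1 metrics JSON format above; the actual reporting to the Katib DB manager (which the tfevent-metricscollector does over gRPC) is left out here:

import json
from typing import List, Tuple


def parse_kfp_metrics(path: str) -> List[Tuple[str, str]]:
    """Read a KFP v1 Metrics artifact and return (name, value) pairs."""
    with open(path) as f:
        doc = json.load(f)
    pairs = []
    for entry in doc.get("metrics", []):
        # Katib stores observation values as strings, so convert here.
        pairs.append((entry["name"], str(entry["numberValue"])))
    return pairs


if __name__ == "__main__":
    # The path matches where KFP mounts the metrics artifact in my setup.
    for name, value in parse_kfp_metrics("/tmp/outputs/mlpipeline_metrics/data"):
        print(f"{name}={value}")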

@votti
Author

votti commented Feb 9, 2023

Small update:
I now have a metrics collector for Kubeflow v1 pipelines (modeled after the tfevent-metricscollector) that I think should work; according to the logs it already manages to capture the pipeline metrics artifacts.

What I am failing to do is pass the current trial name to the custom collector in the metricsCollectorSpec.
Essentially I am using a configuration very similar to the custom metrics collector example here:
https://github.com/kubeflow/katib/blob/master/examples/v1beta1/metrics-collector/custom-metrics-collector.yaml#L13-L35

My CLI metrics collector takes an argument "-t" or "--trial_name" with the trial name to use for reporting (exactly like the tfevent-metricscollector).
Does anyone have a hint on how to configure this so that the current trial name is passed as an argument?
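For context, the CLI mirrors the flags that the Katib injection webhook builds for sidecar collectors (see the Go snippet in my next comment). A rough argparse sketch of that interface; the -path flag is my own addition for the artifact location:

import argparse


def parse_args():
    parser = argparse.ArgumentParser(description="KFP v1 metrics collector")
    # Flag names follow the tfevent-metricscollector convention.
    parser.add_argument("-t", "--trial_name", help="name of the current Katib trial")
    parser.add_argument("-m", "--metric_names", help="metric names to report")
    parser.add_argument("-o-type", dest="objective_type", help="objective type of the experiment")
    parser.add_argument("-s-db", dest="db_manager_addr", help="Katib DB manager address")
    parser.add_argument("-path", dest="metrics_file", help="path to the KFP metrics artifact")
    return parser.parse_args()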

@votti
Author

votti commented Feb 10, 2023

I am now a bit confused:
Reading the source code of the metrics collector sidecar injection webhook (inject_webhook), it looks to me as if the trial name should actually be added to the args:

args := []string{"-t", trial.Name, "-m", metricNames, "-o-type", string(trial.Spec.Objective.Type), "-s-db", katibmanagerv1beta1.GetDBManagerAddr()}

Yet looking at the pods Katib creates, all of these arguments seem to be missing.
Is there anything I am not seeing?

My current section to specify the metrics collector:

  metricsCollectorSpec:
    source:
      fileSystemPath:
        path: "/tmp/outputs/mlpipeline_metrics/data"
        kind: File
    collector:
      customCollector:
        image: votti/kfpv1-metricscollector:v0.0.7
        imagePullPolicy: Always
        name: custom-metrics-logger-and-collector
      kind: Custom

This creates a container specification like the following:

 - image: votti/kfpv1-metricscollector:v0.0.7
   imagePullPolicy: Always
   name: custom-metrics-logger-and-collector
   resources: {}
   terminationMessagePath: /dev/termination-log
   terminationMessagePolicy: File
   volumeMounts:
   - mountPath: /tmp/outputs/mlpipeline_metrics
     name: metrics-volume
   - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
     name: kube-api-access-rnmkw
     readOnly: true
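In case it helps: one possible (untested) workaround while the arguments are not injected could be to expose the pod name to the custom collector via the Kubernetes downward API and derive the trial name from it inside the collector; for Job-based trials the pod name starts with the trial name. This is purely an assumption on my side, not a documented Katib mechanism:

    collector:
      customCollector:
        image: votti/kfpv1-metricscollector:v0.0.7
        imagePullPolicy: Always
        name: custom-metrics-logger-and-collector
        env:
        - name: POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
      kind: Custom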

@andreyvelich
Member

Thank you for working on this @votti!
Would it be easier to use a push-based metrics collector for such use cases (ref: #577)?
Then we wouldn't even need a sidecar to collect metrics.

cc @johnugeorge @gaocegege @tenzen-y

votti pushed a commit to d-one/katib that referenced this issue Feb 10, 2023
Closely modelled after the tfevent-metricscollector.
Currently not yet working, as the arguments from the
`inject_webhook` are somehow not passed.

Addresses: kubeflow#2019
votti added a commit to votti/katib-exploration that referenced this issue Feb 10, 2023
This uses the new custom KFP V1 metrics collector that can directly extract
metrics from Kubeflow Pipeline metrics artifacts.

With this collector, measuring metrics of a Kubeflow pipeline
only requires: a) adding a label marking which step is the `model-training`
step, b) disabling caching for this step, and
c) configuring the Katib metrics collector.

Also, all the information is now added such that the Katib pipeline
can be run via the KatibClient.

Addresses:
- kubeflow/katib#1914
- kubeflow/katib#2019

@votti
Author

votti commented Feb 10, 2023

I now managed to implement a working metrics collector for Kubeflow Pipeline V1 Metrics artifacts:
https://github.com/d-one/katib/tree/feature/kfpv1-metricscollector/cmd/metricscollector/v1beta1/kfpv1-metricscollector

For a full example of how this is used, see: https://github.com/votti/katib-exploration/blob/main/notebooks/mnist_pipeline_v1.ipynb

Regarding the push-based approach: I think it is an interesting idea to build a dedicated Kubeflow Pipelines component that can push metrics to Katib.
The challenge I see here is how to pass the current trial_name. Otherwise the component could be built quite similarly to the kfpv1-metricscollector.
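A rough sketch of what such a push component could look like with the KFP v1 SDK; the trial_name would have to come in as a pipeline parameter, and the actual call to Katib is stubbed out, since passing the trial name is exactly the open question:

from kfp.components import create_component_from_func


def report_metrics_to_katib(trial_name: str, accuracy: float):
    # Placeholder: a real implementation would report the value to Katib
    # (e.g. via the DB manager, as the existing metrics collectors do).
    print(f"Would report accuracy={accuracy} for trial {trial_name}")


report_metrics_op = create_component_from_func(
    report_metrics_to_katib, base_image="python:3.10"
)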

votti added a commit to d-one/katib that referenced this issue Feb 15, 2023
This example illustrates how a full kfp pipeline can
be tuned using Katib.

It is based on a metrics collector that collects Kubeflow
pipeline metrics (kubeflow#2019), used as a Custom Collector.

Addresses: kubeflow#1914, kubeflow#2019
votti pushed a commit to d-one/katib that referenced this issue Jul 18, 2023
Closely modelled after the tfevent-metricscollector.
Currently not yet working, as the arguments from the
`inject_webhook` are somehow not passed.

Addresses: kubeflow#2019
votti added a commit to d-one/katib that referenced this issue Jul 18, 2023
This example illustrates how a full kfp pipeline can
be tuned using Katib.

It is based on a metrics collector that collects Kubeflow
pipeline metrics (kubeflow#2019), used as a Custom Collector.

Addresses: kubeflow#1914, kubeflow#2019
@github-actions

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@AlexandreBrown

Hello, any update for KFP v2?
Cheers!

@andreyvelich
Member

@AlexandreBrown We've worked on a Katib + KFP example in this PR: #2118
Any help and review for this PR are appreciated!

@AlexandreBrown

> @AlexandreBrown We've worked on a Katib + KFP example in this PR: #2118
> Any help and review for this PR are appreciated!

Great to see progress. Was this PR made for KFP v2 or only for v1?

@tenzen-y
Member

> Great to see progress. Was this PR made for KFP v2 or only for v1?

That PR is only for v1.

@votti
Author

votti commented Aug 29, 2023

@AlexandreBrown
This is based on V1, as I only managed to compile the pipeline as an Argo Workflow manifest with KFP V1.
If there is a way to export a KFP V2 pipeline as an Argo Workflow, it should be straightforward to use V2 as well.
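For reference, this is the kind of KFP V1 compile step I mean; the compiler emits an Argo Workflow manifest that can then be wrapped in the Katib trial template (the pipeline here is just a placeholder):

import kfp
from kfp import dsl


@dsl.pipeline(name="example-pipeline")
def pipeline_func():
    # Placeholder step; the real pipeline would contain the training steps.
    dsl.ContainerOp(name="echo", image="alpine:3.17", command=["echo", "hello"])


# KFP v1 compiles the pipeline function into an Argo Workflow manifest (YAML).
kfp.compiler.Compiler().compile(pipeline_func, "pipeline_workflow.yaml")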

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@tenzen-y
Member

/lifecycle frozen
