metrics: Add metrics label filter configuration #1444

Conversation

nap32
Contributor

@nap32 nap32 commented Sep 10, 2023

Currently, metrics are all-or-nothing.
Certain labels may cause cardinality issues.

This patch introduces a new configuration option, MetricsLabelFilter: an allow-list covering the namespace, workload, pod, and binary labels. Metrics that use these labels only include the ones listed in the filter.
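For illustration, a minimal Go sketch of the allow-list idea described above; the helper name filterLabels and the variable knownFilterLabels are hypothetical, not this PR's actual code:

package main

import "fmt"

// The labels the filter can act on, in a fixed order
// (mirroring the idea behind consts.KnownMetricLabelFilters).
var knownFilterLabels = []string{"namespace", "workload", "pod", "binary"}

// filterLabels keeps only the known labels that appear in the
// user-supplied allow-list; anything else is dropped.
func filterLabels(configured []string) []string {
	allowed := make(map[string]struct{}, len(configured))
	for _, l := range configured {
		allowed[l] = struct{}{}
	}
	var kept []string
	for _, l := range knownFilterLabels {
		if _, ok := allowed[l]; ok {
			kept = append(kept, l)
		}
	}
	return kept
}

func main() {
	// configured stands in for the user's allow-list, e.g. namespace and binary.
	fmt.Println(filterLabels([]string{"namespace", "binary"})) // [namespace binary]
}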

Fixes: #1037

@nap32 nap32 requested a review from a team as a code owner September 10, 2023 01:39
@nap32 nap32 requested a review from olsajiri September 10, 2023 01:39
@netlify

netlify bot commented Sep 10, 2023

Deploy Preview for tetragon ready!

🔨 Latest commit: 892b25d
🔍 Latest deploy log: https://app.netlify.com/sites/tetragon/deploys/65119725df4c010008fbfc35
😎 Deploy Preview: https://deploy-preview-1444--tetragon.netlify.app

@lambdanis lambdanis self-requested a review September 11, 2023 11:42
pkg/metrics/metrics.go (review comment resolved, outdated)
@lambdanis lambdanis added the release-note/minor (This PR introduces a minor user-visible change) and area/metrics (Related to prometheus metrics) labels Sep 11, 2023
@nap32 nap32 force-pushed the pr/nap32/1037-add-metric-configuration-options branch from ad14d78 to 96a3a35 Compare September 11, 2023 15:58
@jrfastab
Contributor

This looks reasonable to me. @lambdanis any opinions?

Contributor

@lambdanis lambdanis left a comment

Thanks for tackling this @nap32! The overall approach looks good to me. I left a few comments, mostly concerning maintainability - I think we could make this feature a bit easier to use (when adding metrics) and to extend.

Without analysing all possibilities in detail it's unclear to me what happens when a user passes a label that's not in KnownMetricLabelFilters or passes labels in a different order. Ideally, we would have some validation for such cases to prevent unexpected behaviour. But if it's a big effort, then having at least test cases covering them would be good, so that the behaviour is somehow documented.

Regarding tests, it looks like the global config change of MetricsLabelFilter leaks into other tests breaking them?

pkg/option/config.go (review comment resolved, outdated)
install/kubernetes/values.yaml (review comment resolved)
pkg/metrics/eventmetrics/eventmetrics.go (review comment resolved, outdated)
pkg/metrics/metrics.go (review comment resolved)
@nap32
Contributor Author

nap32 commented Sep 12, 2023

Thanks for tackling this @nap32! The overall approach looks good to me. I left a few comments, mostly concerning maintainability - I think we could make this feature a bit easier to use (when adding metrics) and to extend.

Awesome! I had responses to two of the comments you left for feedback, happy to revise.
I just want to make sure we're aligned and aware of the trade-off for developer usage.

Without analysing all possibilities in detail it's unclear to me what happens when a user passes a label that's not in KnownMetricLabelFilters or passes labels in a different order. Ideally, we would have some validation for such cases to prevent unexpected behaviour. But if it's a big effort, then having at least test cases covering them would be good, so that the behaviour is somehow documented.

FilterMetricLabels expects that the last strings passed to it are the labels/values listed in KnownMetricLabelFilters, in that order.
If those ordered labels aren't passed last, the wrong labels/values may be kept or dropped.
The behavior is similar to passing values in the wrong order to .WithLabelValues(...) (but it won't raise an error).

I don't think the ordering can be validated as currently implemented, because the function is used both for defining and for incrementing metrics.

If you'd prefer more explicit error handling, I can split the function into two and add some state to track the mapping.
I'll expand the comments to make the expected usage clearer.
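
As a rough illustration of that contract (a hypothetical sketch, not the actual FilterMetricLabels implementation): the metric-specific labels come first, the filterable labels come last in a fixed order, and the same allow-list is applied both when defining a metric and when incrementing it:

package main

import "fmt"

var knownFilterLabels = []string{"namespace", "workload", "pod", "binary"}

// allowList stands in for the user's configuration, e.g. namespace and binary enabled.
var allowList = map[string]bool{"namespace": true, "binary": true}

// filterTrailing assumes the last len(knownFilterLabels) entries of vals are the
// filterable labels (or their values) in the order of knownFilterLabels, and drops
// the ones not in the allow-list. Earlier entries pass through untouched, so passing
// the trailing entries out of order silently keeps the wrong ones.
func filterTrailing(vals ...string) []string {
	fixed := vals[:len(vals)-len(knownFilterLabels)]
	trailing := vals[len(vals)-len(knownFilterLabels):]
	out := append([]string{}, fixed...)
	for i, v := range trailing {
		if allowList[knownFilterLabels[i]] {
			out = append(out, v)
		}
	}
	return out
}

func main() {
	// Defining a metric: label names in the expected order.
	fmt.Println(filterTrailing("type", "namespace", "workload", "pod", "binary"))
	// [type namespace binary]

	// Incrementing it: values must follow the same order, much like
	// arguments to WithLabelValues.
	fmt.Println(filterTrailing("process_exec", "default", "my-deploy", "my-pod", "/usr/bin/curl"))
	// [process_exec default /usr/bin/curl]
}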

Regarding tests, it looks like the global config change of MetricsLabelFilter leaks into other tests breaking them?

I've corrected this and the other issues you've raised, thanks for pointing those out!

@nap32 nap32 force-pushed the pr/nap32/1037-add-metric-configuration-options branch from 96a3a35 to 6c6126c Compare September 12, 2023 18:12
pkg/option/config.go (review comment resolved, outdated)
@nap32 nap32 force-pushed the pr/nap32/1037-add-metric-configuration-options branch 2 times, most recently from 129a569 to 3a37893 Compare September 12, 2023 18:40
@lambdanis
Contributor

Thank you for the updates and responses @nap32. I think we should reuse consts.KnownMetricLabelFilters in metrics definitions and then we'll be good to go.

Regarding edge cases - one thing I was considering is a user passing in Helm values unexpected labels, for example ["type", "namespace", "ip", "pod"]. IIUC, this is a valid config, and will result in "namespace" and "binary" being included in metrics, and "workload" and "pod" excluded. The additional "type" and "ip" labels passed won't change anything, right?

@nap32
Contributor Author

nap32 commented Sep 13, 2023

Thank you for the updates and responses @nap32. I think we should reuse consts.KnownMetricLabelFilters in metrics definitions and then we'll be good to go.

Got it! I've introduced the change.

Regarding edge cases - one thing I was considering is a user passing in Helm values unexpected labels, for example ["type", "namespace", "ip", "pod"]. IIUC, this is a valid config, and will result in "namespace" and "binary" being included in metrics, and "workload" and "pod" excluded. The additional "type" and "ip" labels passed won't change anything, right?

I think there is a typo in the example above, but the expectation is correct.
If [ "namespace", "ip", "type", "binary" ] is passed as a configuration:

  • "namespace" and "binary" would be included in the metrics.
  • "ip" and "type" would be ignored, since they aren't in the consts.KnownMetricFilterLabels.
  • "workload" and "pod" would not be included in the metrics.

@nap32 nap32 force-pushed the pr/nap32/1037-add-metric-configuration-options branch from 3a37893 to fae0286 Compare September 13, 2023 20:49
@lambdanis
Contributor

I think there is a typo in the example above, but the expectation is correct.

Yes, thanks!

@nap32
Contributor Author

nap32 commented Sep 18, 2023

Hi @lambdanis -

I ran the suggested command from the lint helm chart job (cd install/kubernetes && ./test.sh) and the tests passed.
Is there something else I should be doing to validate the test locally?

Do you have any suggestions on how I might troubleshoot the e2e tests?
Is there any documentation on running these locally w/ my own kind cluster, or can I copy-paste the command?
I'm guessing the pod crashes and then the port forwarding fails because it's not there anymore:

Contributor

@lambdanis lambdanis left a comment

@nap32 I usually run Tetragon in a local kind cluster (see docs). Then using kubectl you can check why the pod is crashing, read the logs, etc.

I've run your PR locally; there are two main issues that need fixing: reading the config value and defining/registering metrics. I suggested how to approach them in the comments. Let me know if something is unclear.

Thanks a lot for working on that! And sorry for the delayed review.

cmd/tetragon/flags.go (review comment resolved, outdated)
pkg/metrics/consts/consts.go (review comment resolved)
pkg/metrics/metrics.go (review comment resolved, outdated)
pkg/metrics/eventmetrics/eventmetrics.go (review comment resolved, outdated)
@lambdanis lambdanis self-assigned this Sep 21, 2023
@nap32
Contributor Author

nap32 commented Sep 24, 2023

Working on testing the metric changes end-to-end --
I found that standing up a local Prometheus server is a reasonable way to test.

  1. Run the local tetragon build in kind as described here: https://tetragon.cilium.io/docs/contribution-guide/development-setup/#running-tetragon-in-kind
  2. Grab the ClusterIP for the metrics endpoint: kubectl get services -A
  3. Set the ClusterIP and metrics port in the prometheus-config.yaml provided below.
  4. You can't rely on a NodePort because kind's cluster is running in a container. Instead, kubectl port-forward service/prometheus 9090:9090.
  5. Open the Prometheus UI in your host's browser to query metrics and review labels: http://localhost:9090/
  6. Deploy the demo app: kubectl create -f https://raw.githubusercontent.com/cilium/cilium/v1.11/examples/minikube/http-sw-app.yaml

prometheus-service.yaml

apiVersion: v1
kind: Service
metadata:
  name: prometheus
spec:
  type: NodePort
  ports:
    - port: 9090
      targetPort: 9090
  selector:
    app: prometheus

prometheus-deployment.yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus
spec:
  replicas: 1
  selector:
    matchLabels:
      app: prometheus
  template:
    metadata:
      labels:
        app: prometheus
    spec:
      containers:
        - name: prometheus
          image: prom/prometheus:latest
          args:
            - "--config.file=/etc/prometheus/prometheus.yml"
          ports:
            - containerPort: 9090  # Prometheus default listen port, matching the Service targetPort
          volumeMounts:
            - name: prometheus-config-vol
              mountPath: /etc/prometheus/
      volumes:
        - name: prometheus-config-vol
          configMap:
            name: prometheus-config

prometheus-config.yaml

apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
    scrape_configs:
      - job_name: 'prometheus'
        static_configs:
          - targets: ['<TETRAGON CLUSTER IP>:2112']

@nap32 nap32 force-pushed the pr/nap32/1037-add-metric-configuration-options branch 2 times, most recently from f2bcebb to 164a14a Compare September 24, 2023 21:10
@nap32
Contributor Author

nap32 commented Sep 24, 2023

I'm not seeing the labels attached to metrics like I'd expect to;
I still need to work on this PR.

@nap32 nap32 force-pushed the pr/nap32/1037-add-metric-configuration-options branch from 164a14a to 861bb2b Compare September 25, 2023 00:34
@nap32 nap32 force-pushed the pr/nap32/1037-add-metric-configuration-options branch 2 times, most recently from 3afbd0f to 7bf5a69 Compare September 25, 2023 00:39
@lambdanis
Contributor

@nap32 I see there are conflicts with the main branch, can you rebase your PR?

It looks good to me now. I tested your changes locally and I saw the labels I expected.

For a simple local test you can also access the metrics endpoint directly, without running Prometheus. Port-forward the service:

kubectl -n kube-system port-forward svc/tetragon 2112:2112

and then you can read the metrics in the text format at localhost:2112/metrics.

@nap32 nap32 force-pushed the pr/nap32/1037-add-metric-configuration-options branch from 7bf5a69 to 892b25d Compare September 25, 2023 14:20
@nap32
Contributor Author

nap32 commented Sep 25, 2023

@nap32 I see there are conflicts with the main branch, can you rebase your PR?

Yes!

It looks good to me now. I tested your changes locally and I saw the labels I expected.

For a simple local test you can also access the metrics endpoint directly, without running Prometheus. Port-forward the service:

kubectl -n kube-system port-forward svc/tetragon 2112:2112

and then you can read the metrics in the text format at localhost:2112/metrics.

Wow, this is a simple and straightforward way to test -- thanks.
I was over-complicating it. 👍

Contributor

@lambdanis lambdanis left a comment

There is a single lint failure, but other than that it looks good!

pkg/metrics/metrics.go (review comment resolved, outdated)
@nap32 nap32 force-pushed the pr/nap32/1037-add-metric-configuration-options branch from 892b25d to b609b5e Compare September 25, 2023 18:33
pkg/metrics/metrics.go (review comment resolved, outdated)
Currently, metrics are all-or-nothing.
Certain labels may cause cardinality issues.

This patch introduces a new configuration option - MetricsLabelFilter.
It is an allow-list for configuring namespace, workload, pod, and binary.
Labels that utilize these fields will only add them if configured for it.

Fixes: cilium#1037

Signed-off-by: Nick Peluso <[email protected]>
@nap32 nap32 force-pushed the pr/nap32/1037-add-metric-configuration-options branch from b609b5e to f140e83 Compare September 27, 2023 15:27
Contributor

@lambdanis lambdanis left a comment

Thanks for contributing this! 🙇‍♀️

@michi-covalent michi-covalent merged commit 913b64a into cilium:main Sep 29, 2023
31 checks passed
Labels
area/metrics (Related to prometheus metrics), release-note/minor (This PR introduces a minor user-visible change)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Make prometheus metric pod label optional
4 participants