Wait for the old CRD Manager to stop before starting a new one #1778
PR Description
This PR fixes an issue that was reported on the community Slack. A user ran into this error when doing a config reload:
I could not reproduce the error, but I believe it happens because the new CRD Manager starts up before the old one has had a chance to stop and unregister its metrics. I'm not sure how to cover this in a unit test, since we'd need some way to make the CRD Manager stop slowly. We'd probably have to refactor the code to make it more unit-testable, so for now I hope we can fix the bug without a unit test.
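The general shape of the fix can be sketched with a `sync.WaitGroup`: the component cancels the old manager's context, waits for its goroutine to fully exit (which is when it unregisters its metrics), and only then starts the new manager. This is a minimal illustration, not the actual Alloy code; the `component` and `crdManager` types and their methods here are hypothetical stand-ins.

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// crdManager is a hypothetical stand-in for the component's CRD Manager.
type crdManager struct{ id int }

// run blocks until the context is cancelled, then cleans up.
// In the real component this is where metrics would be unregistered.
func (m *crdManager) run(ctx context.Context) {
	<-ctx.Done()
	// ... unregister metrics here before returning ...
}

// component restarts its manager on every config update.
type component struct {
	mut    sync.Mutex
	cancel context.CancelFunc
	wg     sync.WaitGroup
	nextID int
}

// update stops the old manager and waits for it to fully exit before
// starting a new one, so metric registration never overlaps.
func (c *component) update() {
	c.mut.Lock()
	defer c.mut.Unlock()

	if c.cancel != nil {
		c.cancel()
	}
	// Wait for the old manager to stop and unregister its metrics.
	c.wg.Wait()

	ctx, cancel := context.WithCancel(context.Background())
	c.cancel = cancel
	c.nextID++
	m := &crdManager{id: c.nextID}

	c.wg.Add(1)
	go func() {
		defer c.wg.Done()
		m.run(ctx)
	}()
}

func main() {
	c := &component{}
	c.update() // start manager 1
	c.update() // stop manager 1, wait, start manager 2
	c.cancel()
	c.wg.Wait()
	fmt.Println("managers started:", c.nextID)
}
```

Without the `c.wg.Wait()` call, the second `update` could register the new manager's metrics while the old manager still holds the same registrations, which matches the duplicate-registration error being reported.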
I tested my change locally with a config like this, just to make sure the WaitGroup functions ok:

Alloy config

I changed the `clustering/enabled` value, then triggered a config reload via `curl localhost:12345/-/reload`. Then I shut down Alloy using Ctrl + C. The reload and shutdown both went ok.

PR Checklist