Wait for the old CRD Manager to stop before starting a new one #1778

Open · ptodev wants to merge 2 commits into main from ptodev/fix-prom-operator-reload
Conversation

@ptodev (Contributor) commented Sep 27, 2024

PR Description

This PR fixes an issue which was reported on the community Slack. A user ran into this error when doing a config reload:

```
failed to register service discovery metrics: failed to create service discovery refresh metrics
error running crd manager
```

I could not reproduce the error, but I believe it happens because the new CRD Manager starts up before the old one has had a chance to stop and unregister its metrics. I'm not sure how to cover this in a unit test, since we'd need some way to make the CRD Manager stop more slowly. We'd probably have to refactor the code to be more unit-testable, so for now I hope we can fix the bug without a unit test for it.
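
To illustrate the idea, here is a minimal, self-contained sketch of the reload sequence (only `innerCtx`, `cancel`, and `runWg` come from the actual diff; everything else, including `runManager`, is a hypothetical stand-in for the CRD Manager's run loop):

```go
package main

import (
	"context"
	"log"
	"sync"
	"time"
)

// runManager stands in for the CRD Manager's run loop (hypothetical).
// It exits some time after its context is cancelled, simulating a slow
// teardown during which the metrics are still registered.
func runManager(ctx context.Context) {
	<-ctx.Done()
	time.Sleep(100 * time.Millisecond)
	log.Println("manager stopped and metrics unregistered")
}

func main() {
	ctx := context.Background()
	var runWg sync.WaitGroup

	// Start the "old" manager.
	innerCtx, cancel := context.WithCancel(ctx)
	runWg.Add(1)
	go func() {
		defer runWg.Done()
		runManager(innerCtx)
	}()

	// Config reload: stop the old manager and, crucially, wait for it to
	// exit before starting a new one, so its metrics are unregistered
	// before the new manager tries to register the same ones.
	cancel()
	runWg.Wait()

	innerCtx, cancel = context.WithCancel(ctx)
	runWg.Add(1)
	go func() {
		defer runWg.Done()
		runManager(innerCtx)
	}()

	// Shut down cleanly.
	cancel()
	runWg.Wait()
}
```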

I tested my change locally with a config like this, just to make sure the waitgroup logic works:

Alloy config

```
discovery.kubernetes "nodes" {
  kubeconfig_file = "/Users/paulintodev/.kube/config"
  role = "node"
}

discovery.kubernetes "services" {
  kubeconfig_file = "/Users/paulintodev/.kube/config"
  role = "service"
}

// ServiceMonitor resources
prometheus.operator.servicemonitors "servicemonitors" {
  client {
    kubeconfig_file = "/Users/paulintodev/.kube/config"
  }
  clustering {
    enabled = false
  }
  forward_to = []
}

// PodMonitor resources
prometheus.operator.podmonitors "podmonitors" {
  client {
    kubeconfig_file = "/Users/paulintodev/.kube/config"
  }
  clustering {
    enabled = false
  }
  forward_to = []
}

// Probe resources
prometheus.operator.probes "probes" {
  client {
    kubeconfig_file = "/Users/paulintodev/.kube/config"
  }
  clustering {
    enabled = false
  }
  forward_to = []
}

logging {
  level = "info"
  format = "logfmt"
}
```

I changed the `clustering` > `enabled` value, triggered a config reload via `curl localhost:12345/-/reload`, and then shut down Alloy with Ctrl+C. Both the reload and the shutdown went OK.

PR Checklist

  • CHANGELOG.md updated
  • Documentation added
  • Tests updated
  • Config converters updated

@ptodev ptodev requested a review from a team as a code owner September 27, 2024 12:56
```go
innerCtx, cancel = context.WithCancel(ctx)
runWg.Add(1)
```
Collaborator commented on the lines above:

Is it possible to get a test on this? Other than that, looks good.

@ptodev (Contributor, Author) replied:

I pushed a commit which outlines how a test could potentially work. Would this be ok? I am not sure how else to test it.
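
For reference, one way such a test could look (the actual commit is not shown here; this is a purely hypothetical sketch that artificially slows the old manager's shutdown and asserts that the reload path waits for it):

```go
package crdmanager_test // hypothetical package name

import (
	"context"
	"sync"
	"sync/atomic"
	"testing"
	"time"
)

func TestReloadWaitsForOldManager(t *testing.T) {
	var (
		runWg   sync.WaitGroup
		stopped atomic.Bool
	)

	innerCtx, cancel := context.WithCancel(context.Background())

	// "Old" manager whose shutdown is artificially slow.
	runWg.Add(1)
	go func() {
		defer runWg.Done()
		<-innerCtx.Done()
		time.Sleep(500 * time.Millisecond) // simulate slow teardown
		stopped.Store(true)
	}()

	// Simulate the reload path: cancel the old manager and wait for it.
	cancel()
	runWg.Wait()

	// Without the wait, a new manager could start here while the old
	// one's metrics were still registered.
	if !stopped.Load() {
		t.Fatal("reload proceeded before the old manager fully stopped")
	}
}
```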

@mattdurham mattdurham self-assigned this Sep 27, 2024
@ptodev ptodev force-pushed the ptodev/fix-prom-operator-reload branch from bd42108 to e6001c3 on October 16, 2024 at 17:57
@joke commented Oct 30, 2024

The same thing happens to me when I make configuration changes.
