Random ruler delivery errors to Alertmanager #4724
Hi @yafanasiev - would you be able to provide a snippet of your logs from the alertmanager instances? That would be very helpful.
Nothing out of order, just a bunch of these:
Interesting. I noticed we're using the headless service in the Helm chart. It shouldn't make any difference, but we should be using the standard service to load-balance over Alertmanager replicas. I checked our internal environments; although we've only had r232 deployed for a week, we do get very infrequent EOF errors, but they are so irregular that I would assume they're just pod restarts or node issues. I will run the Helm chart later this week and see if I can reproduce. It could be worth looking at
Tried a bit of testing over the past week. We updated to Mimir 2.8.0, and also scaled and tweaked the CoreDNS deployment to make sure this is not a DNS issue; still the same result. Metrics show that all requests to the Alertmanagers succeed, so this might be a connectivity issue, although we don't experience any issues with other workloads in the cluster. Maybe it would make sense to switch over to a ClusterIP service for the Alertmanagers?
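For context on the service discussion above: a headless Service (`clusterIP: None`) resolves in DNS to the individual Alertmanager pod IPs, so a client holding a connection open is pinned to a single pod and gets no load balancing from kube-proxy, whereas a standard ClusterIP Service fronts all replicas behind one virtual IP. A minimal sketch of the two, with illustrative names rather than the chart's actual manifests:

```yaml
# Illustrative only; names and labels are placeholders, not the mimir-distributed chart's manifests.
# Headless Service: DNS returns the pod IPs directly, clients connect to individual pods.
apiVersion: v1
kind: Service
metadata:
  name: mimir-alertmanager-headless
spec:
  clusterIP: None
  selector:
    app.kubernetes.io/component: alertmanager
  ports:
    - name: http-metrics
      port: 8080
---
# Standard ClusterIP Service: kube-proxy load-balances new connections across replicas.
apiVersion: v1
kind: Service
metadata:
  name: mimir-alertmanager
spec:
  type: ClusterIP
  selector:
    app.kubernetes.io/component: alertmanager
  ports:
    - name: http-metrics
      port: 8080
```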
Hey @yafanasiev. Did you solve this problem?
Hey @ItsMisterP. As of Mimir 2.10.3 (latest) we still observe the same behaviour. As it does not seem to have any actual impact on alerting functionality, we have deprioritized it on our side. Still, it would be good to find the culprit.
Yeah, facing the same issue...
Coming back to this one very late, apologies. I have noticed these errors in one of our dev environments at Grafana, which we deploy using Helm; however, whilst they aren't aligned with rollouts or restarts, they aren't continuous either. That said, I think it's worth eliminating the difference between jsonnet and Helm, so I've put up a PR here: #7892. You can test this manually by adding the following to your custom values YAML:
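A minimal sketch of what such a values override might look like, assuming the mimir-distributed chart's `mimir.structuredConfig` mechanism is used to point the ruler at the standard (non-headless) Alertmanager Service; the service host below is a placeholder that depends on your release name and namespace:

```yaml
# Hypothetical override: point the ruler at the standard (non-headless) Alertmanager Service.
# The host below is a placeholder; substitute your own release name and namespace.
mimir:
  structuredConfig:
    ruler:
      alertmanager_url: http://mimir-alertmanager.mimir-system.svc.cluster.local:8080/alertmanager
```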
I believe this is because the Prometheus code that the ruler uses hard-codes a 5-minute idle connection timeout for its HTTP client.
But the default HTTP server idle timeout for the Alertmanager is 2 minutes (inherited from dskit), so the ruler's client can end up reusing a connection that the Alertmanager server has already closed, which surfaces as these sporadic delivery errors.
The solution is to increase the idle timeout on the Alertmanager HTTP server, for example as sketched below.
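A minimal sketch of such a change, assuming the Alertmanager's server idle timeout is raised above the Prometheus client's 5-minute idle connection timeout via dskit's `server.http_server_idle_timeout` setting (flag `-server.http-idle-timeout`); the 6m value is an illustrative choice, not taken from this thread:

```yaml
# Config override for the Mimir Alertmanager component (sketch, not the final jsonnet/Helm change).
# Raise the HTTP server idle timeout above the ~5m idle connection timeout
# used by the Prometheus notifier client in the ruler.
server:
  http_server_idle_timeout: 6m
```

Equivalently, this can be set with `-server.http-idle-timeout=6m` on the alertmanager target.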
I'll work on a jsonnet and Helm update to add this option.
Describe the bug
The ruler randomly errors out when delivering alerts to the Alertmanager. The percentage of errors is small but consistent.
To Reproduce
Steps to reproduce the behavior:
Expected behaviour
All alerts are delivered without errors.
Environment
Additional Context
Both the ruler and alertmanager are configured to store their state in the same S3 bucket (different from the metrics one). Snappy compression is enabled across all components, and `persist_interval` for the alertmanager is changed to 30s. Both the ruler and alertmanager are scaled to 3 replicas, with the alertmanager deployed zone-aware. During the observation period (~3 days) no pod restarts, node changes, or similar issues were found in the cluster.

Please let me know if you need any additional info.
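For reference, a sketch of the state-persistence tweak mentioned above, assuming Mimir's `alertmanager.persist_interval` option (flag `-alertmanager.persist-interval`):

```yaml
# Sketch of the persist_interval override described above.
alertmanager:
  persist_interval: 30s
```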