Random ruler delivery errors to Alertmanager #4724

Closed
yafanasiev opened this issue Apr 13, 2023 · 9 comments · Fixed by #8192 · May be fixed by #9851

Comments

@yafanasiev

Describe the bug

The ruler randomly errors out when delivering alerts to Alertmanager. The percentage of errors is small but consistent.

To Reproduce

Steps to reproduce the behavior:

  1. Set up some firing alerts.
  2. Observe delivery errors through metrics and ruler logs (a metrics query sketch follows the log excerpt below):
ts=2023-04-13T16:04:46.063665875Z caller=notifier.go:532 level=error user=<redacted> alertmanager=http://10-70-0-182.mimir-alertmanager-headless.mimir.svc.cluster.local.:8080/alertmanager/api/v2/alerts count=1 msg="Error sending alert" err="Post \"http://10-70-0-182.mimir-alertmanager-headless.mimir.svc.cluster.local.:8080/alertmanager/api/v2/alerts\": EOF"                                                                                                                                                                                                                                                                          
ts=2023-04-13T16:04:46.064242233Z caller=notifier.go:532 level=error user=<redacted> alertmanager=http://10-70-87-7.mimir-alertmanager-headless.mimir.svc.cluster.local.:8080/alertmanager/api/v2/alerts count=1 msg="Error sending alert" err="Post \"http://10-70-87-7.mimir-alertmanager-headless.mimir.svc.cluster.local.:8080/alertmanager/api/v2/alerts\": EOF"                                                                                                                                                                                                                                                          
ts=2023-04-13T16:07:13.444813751Z caller=notifier.go:532 level=error user=<redacted> alertmanager=http://10-70-87-7.mimir-alertmanager-headless.mimir.svc.cluster.local.:8080/alertmanager/api/v2/alerts count=1 msg="Error sending alert" err="Post \"http://10-70-87-7.mimir-alertmanager-headless.mimir.svc.cluster.local.:8080/alertmanager/api/v2/alerts\": EOF"                                                                                                                                                                                                                                                                     
ts=2023-04-13T16:07:13.445020393Z caller=notifier.go:532 level=error user=<redacted> alertmanager=http://10-70-0-182.mimir-alertmanager-headless.mimir.svc.cluster.local.:8080/alertmanager/api/v2/alerts count=1 msg="Error sending alert" err="Post \"http://10-70-0-182.mimir-alertmanager-headless.mimir.svc.cluster.local.:8080/alertmanager/api/v2/alerts\": EOF"
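
For reference, a minimal PromQL sketch of the metrics side of step 2, assuming the ruler exposes the Prometheus notifier metrics under the cortex_prometheus_ prefix with a per-tenant user label; the metric names are an assumption, not taken from this report:

  # Fraction of ruler alert notifications that failed to be delivered, per tenant
  sum by (user) (rate(cortex_prometheus_notifications_errors_total[5m]))
    /
  sum by (user) (rate(cortex_prometheus_notifications_sent_total[5m]))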

Expected behaviour

All alerts are delivered without errors.

Environment

  • Infrastructure: AWS EKS v1.25 with multi-AZ Graviton nodes
  • Deployment tool: Helm chart mimir-distributed, version 4.4.0-weekly.232

Additional Context

Both ruler and alertmanager are configured to store their state in the same S3 bucket (separate from the one used for metrics). Snappy compression is enabled across all components, and persist_interval for alertmanager is set to 30s. Both ruler and alertmanager are scaled to 3 replicas, with alertmanager deployed zone-aware. During the observation period (~3 days) no pod restarts, node changes, or similar issues were found in the cluster.

Please let me know if you need any additional info.

[Screenshot 2023-04-13 at 20:52:22]

@stevesg
Contributor

stevesg commented May 5, 2023

Hi @yafanasiev - would you be able to provide a snippet of your logs from the alertmanager instances? That would be very helpful.

@yafanasiev
Author

Nothing out of the ordinary, just a bunch of these:

ts=2023-05-05T22:07:55.129470387Z caller=multitenant.go:523 level=info component=MultiTenantAlertmanager msg="synchronizing alertmanager configs for users"
ts=2023-05-05T22:08:10.128898769Z caller=multitenant.go:523 level=info component=MultiTenantAlertmanager msg="synchronizing alertmanager configs for users"
ts=2023-05-05T22:08:25.128885991Z caller=multitenant.go:523 level=info component=MultiTenantAlertmanager msg="synchronizing alertmanager configs for users"
ts=2023-05-05T22:08:40.129303338Z caller=multitenant.go:523 level=info component=MultiTenantAlertmanager msg="synchronizing alertmanager configs for users"
ts=2023-05-05T22:08:55.129632522Z caller=multitenant.go:523 level=info component=MultiTenantAlertmanager msg="synchronizing alertmanager configs for users"
ts=2023-05-05T22:09:10.129309207Z caller=multitenant.go:523 level=info component=MultiTenantAlertmanager msg="synchronizing alertmanager configs for users"
ts=2023-05-05T22:09:25.129543041Z caller=multitenant.go:523 level=info component=MultiTenantAlertmanager msg="synchronizing alertmanager configs for users"
ts=2023-05-05T22:09:40.129747895Z caller=multitenant.go:523 level=info component=MultiTenantAlertmanager msg="synchronizing alertmanager configs for users"
ts=2023-05-05T22:09:55.129029923Z caller=multitenant.go:523 level=info component=MultiTenantAlertmanager msg="synchronizing alertmanager configs for users"
ts=2023-05-05T22:10:10.129817347Z caller=multitenant.go:523 level=info component=MultiTenantAlertmanager msg="synchronizing alertmanager configs for users"
ts=2023-05-05T22:10:25.129035513Z caller=multitenant.go:523 level=info component=MultiTenantAlertmanager msg="synchronizing alertmanager configs for users"
ts=2023-05-05T22:10:40.129916923Z caller=multitenant.go:523 level=info component=MultiTenantAlertmanager msg="synchronizing alertmanager configs for users"
ts=2023-05-05T22:10:55.129552258Z caller=multitenant.go:523 level=info component=MultiTenantAlertmanager msg="synchronizing alertmanager configs for users"
ts=2023-05-05T22:11:10.129897851Z caller=multitenant.go:523 level=info component=MultiTenantAlertmanager msg="synchronizing alertmanager configs for users"
ts=2023-05-05T22:11:25.129470059Z caller=multitenant.go:523 level=info component=MultiTenantAlertmanager msg="synchronizing alertmanager configs for users"
ts=2023-05-05T22:11:40.128947251Z caller=multitenant.go:523 level=info component=MultiTenantAlertmanager msg="synchronizing alertmanager configs for users"
ts=2023-05-05T22:11:55.129487249Z caller=multitenant.go:523 level=info component=MultiTenantAlertmanager msg="synchronizing alertmanager configs for users"
ts=2023-05-05T22:12:10.129337338Z caller=multitenant.go:523 level=info component=MultiTenantAlertmanager msg="synchronizing alertmanager configs for users"
ts=2023-05-05T22:12:25.128901797Z caller=multitenant.go:523 level=info component=MultiTenantAlertmanager msg="synchronizing alertmanager configs for users"

@stevesg
Contributor

stevesg commented May 8, 2023

Interesting. I noticed we're using the headless service in the Helm chart; that shouldn't make any difference, but we should be using the standard service to load-balance across alertmanager replicas.

I checked our internal environments. Although we only had r232 deployed for a week, we do see very infrequent EOF errors, but they are so irregular that I would assume they're just pod restarts or node issues.

I will run the Helm chart later this week and see if I can reproduce.

It could be worth looking at sum by (pod, status_code) (cortex_request_duration_seconds_count{route="api_v1_alerts"}) to see whether the number of requests received by alertmanager matches the number of notifications sent by the ruler, and whether alertmanager is registering the requests as failures (as opposed to the ruler losing connectivity for some reason, which is more typical for EOFs).
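
For anyone following along, a hedged sketch of that comparison, expressed as rates; the ruler-side metric name (cortex_prometheus_notifications_sent_total) is an assumption, while the alertmanager-side query is the one quoted above:

  # Alerts API requests received by alertmanager, per pod and status code
  sum by (pod, status_code) (rate(cortex_request_duration_seconds_count{route="api_v1_alerts"}[5m]))

  # Notifications sent by the ruler, for comparison
  sum(rate(cortex_prometheus_notifications_sent_total[5m]))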

@yafanasiev
Author

Tried a bit of testing over the past week. We updated to Mimir 2.8.0, and also scaled and tweaked the CoreDNS deployment to make sure this is not a DNS issue - still the same result. Metrics show that all requests to Alertmanagers succeed, so this might be a connectivity issue, although we don't experience any issues with other workloads in the cluster. Would it make sense to switch over to a ClusterIP service for Alertmanagers?
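
As a rough illustration of that idea, a values sketch under the assumption that the regular (ClusterIP) alertmanager Service mirrors the headless one seen in the logs (name mimir-alertmanager, namespace mimir, port 8080); the actual names depend on the Helm release:

mimir:
  structuredConfig:
    ruler:
      # Hypothetical override: point the ruler at the regular Service
      # instead of the headless one. Host and port are taken from the
      # hostnames in the log excerpts above and may differ per release.
      alertmanager_url: http://mimir-alertmanager.mimir.svc.cluster.local:8080/alertmanager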

@ItsMisterP

Hey @yafanasiev. Did you solve this problem?

@yafanasiev
Author

Hey @ItsMisterP. As of Mimir 2.10.3 (latest) we still observe the same behaviour:
[Screenshot 2023-10-27 at 16:57:32]

As it does not seem to have any actual impact on alerting functionality, we have deprioritized it on our side. Still, it would be good to find the culprit.

@ItsMisterP

Yeah facing the same issue...

@stevesg
Contributor

stevesg commented Apr 12, 2024

Coming back to this one very late - apologies.

I have noticed these errors in one of our dev environments at Grafana, which we deploy using Helm. Whilst they aren't aligned with rollouts or restarts, they aren't continuous either.

That said, I think it's worth eliminating the difference between jsonnet and Helm, so I've put up a PR here: #7892

You can test this manually by adding the following to your custom values yaml:

mimir:
  structuredConfig:
    ruler:
      alertmanager_url: dnssrvnoa+http://_http-metrics._tcp.{{ template "mimir.fullname" . }}-alertmanager.{{ .Release.Namespace }}.svc.{{ .Values.global.clusterDomain }}/alertmanager
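
To sanity-check an SRV-based URL like this before rolling it out, one option is to resolve the record from a debug pod that has dig. The hostname below is built from the headless service name visible in the log excerpts; substitute whatever host the rendered alertmanager_url ends up pointing at:

dig +short SRV _http-metrics._tcp.mimir-alertmanager-headless.mimir.svc.cluster.local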

@stevesg
Contributor

stevesg commented May 27, 2024

I believe this is because the Prometheus code that the ruler uses hard-codes a 5-minute idle connection timeout:
https://github.com/prometheus/common/blob/6b9921f9eba2cd74f2caca0d713bb0a6eb7ef1b9/config/http_config.go#L55

		idleConnTimeout: 5 * time.Minute,

But the default HTTP server idle timeout for Alertmanager is 2 minutes (inherited from dskit):
https://github.com/grafana/dskit/blob/ab41af527fd542eced4c6618122748935d03bdcc/server/server.go#L187

	f.DurationVar(&cfg.HTTPServerIdleTimeout, "server.http-idle-timeout", 120*time.Second, "Idle timeout for HTTP server")

Because the server-side idle timeout is shorter than the client-side one, Alertmanager can close a connection that the ruler still considers reusable, and the ruler's next POST on that stale connection fails with EOF. The solution is to increase the idle timeout on Alertmanager above 5 minutes, e.g.

-server.http-idle-timeout=360s

I'll work on a jsonnet and Helm update to add this option.
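
Until that lands, one possible way to set the flag through the Helm chart, assuming the mimir-distributed chart exposes a per-component extraArgs map that is rendered into CLI flags (worth verifying against your chart version's values):

alertmanager:
  extraArgs:
    # Rendered as -server.http-idle-timeout=360s, i.e. longer than the
    # ruler's hard-coded 5-minute idle connection timeout.
    server.http-idle-timeout: 360s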
