Random ruler delivery errors to Alertmanager #4724

Closed
yafanasiev opened this issue Apr 13, 2023 · 9 comments · Fixed by #8192 · May be fixed by #9851

Comments

@yafanasiev

Describe the bug

The ruler randomly errors out when delivering alerts to Alertmanager. The percentage of errors is small but consistent.

To Reproduce

Steps to reproduce the behavior:

  1. Set up some firing alerts.
  2. Observe delivery errors through metrics and ruler logs (a metrics query sketch follows the log excerpt below):
ts=2023-04-13T16:04:46.063665875Z caller=notifier.go:532 level=error user=<redacted> alertmanager=http://10-70-0-182.mimir-alertmanager-headless.mimir.svc.cluster.local.:8080/alertmanager/api/v2/alerts count=1 msg="Error sending alert" err="Post \"http://10-70-0-182.mimir-alertmanager-headless.mimir.svc.cluster.local.:8080/alertmanager/api/v2/alerts\": EOF"                                                                                                                                                                                                                                                                          
ts=2023-04-13T16:04:46.064242233Z caller=notifier.go:532 level=error user=<redacted> alertmanager=http://10-70-87-7.mimir-alertmanager-headless.mimir.svc.cluster.local.:8080/alertmanager/api/v2/alerts count=1 msg="Error sending alert" err="Post \"http://10-70-87-7.mimir-alertmanager-headless.mimir.svc.cluster.local.:8080/alertmanager/api/v2/alerts\": EOF"                                                                                                                                                                                                                                                          
ts=2023-04-13T16:07:13.444813751Z caller=notifier.go:532 level=error user=<redacted> alertmanager=http://10-70-87-7.mimir-alertmanager-headless.mimir.svc.cluster.local.:8080/alertmanager/api/v2/alerts count=1 msg="Error sending alert" err="Post \"http://10-70-87-7.mimir-alertmanager-headless.mimir.svc.cluster.local.:8080/alertmanager/api/v2/alerts\": EOF"                                                                                                                                                                                                                                                                     
ts=2023-04-13T16:07:13.445020393Z caller=notifier.go:532 level=error user=<redacted> alertmanager=http://10-70-0-182.mimir-alertmanager-headless.mimir.svc.cluster.local.:8080/alertmanager/api/v2/alerts count=1 msg="Error sending alert" err="Post \"http://10-70-0-182.mimir-alertmanager-headless.mimir.svc.cluster.local.:8080/alertmanager/api/v2/alerts\": EOF"
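
For reference, a minimal PromQL sketch of the metrics side of step 2, assuming the ruler exposes the Prometheus notifier metrics under the cortex_prometheus_ prefix with a per-tenant user label; the metric names are an assumption, not taken from this report:

  # Fraction of ruler alert notifications that failed to be delivered, per tenant
  sum by (user) (rate(cortex_prometheus_notifications_errors_total[5m]))
    /
  sum by (user) (rate(cortex_prometheus_notifications_sent_total[5m]))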

Expected behaviour

All alerts are delivered without errors.

Environment

  • Infrastructure: AWS EKS v1.25 with multi-AZ Graviton nodes
  • Deployment tool: Helm chart mimir-distributed, version 4.4.0-weekly.232

Additional Context

Both ruler and alertmanager are configured to store their state in the same S3 bucket (separate from the one used for metrics). Snappy compression is enabled across all components, and persist_interval for alertmanager is set to 30s. Both ruler and alertmanager are scaled to 3 replicas, with alertmanager deployed zone-aware. During the observation period (~3 days) no pod restarts, node changes, or similar issues were found in the cluster.

Please let me know if you need any additional info.

[Screenshot 2023-04-13 at 20:52:22]

@stevesg
Contributor

stevesg commented May 5, 2023

Hi @yafanasiev - would you be able to provide a snippet of your logs from the alertmanager instances? That would be very helpful.

@yafanasiev
Author

Nothing out of the ordinary, just a bunch of these:

ts=2023-05-05T22:07:55.129470387Z caller=multitenant.go:523 level=info component=MultiTenantAlertmanager msg="synchronizing alertmanager configs for users"
ts=2023-05-05T22:08:10.128898769Z caller=multitenant.go:523 level=info component=MultiTenantAlertmanager msg="synchronizing alertmanager configs for users"
ts=2023-05-05T22:08:25.128885991Z caller=multitenant.go:523 level=info component=MultiTenantAlertmanager msg="synchronizing alertmanager configs for users"
ts=2023-05-05T22:08:40.129303338Z caller=multitenant.go:523 level=info component=MultiTenantAlertmanager msg="synchronizing alertmanager configs for users"
ts=2023-05-05T22:08:55.129632522Z caller=multitenant.go:523 level=info component=MultiTenantAlertmanager msg="synchronizing alertmanager configs for users"
ts=2023-05-05T22:09:10.129309207Z caller=multitenant.go:523 level=info component=MultiTenantAlertmanager msg="synchronizing alertmanager configs for users"
ts=2023-05-05T22:09:25.129543041Z caller=multitenant.go:523 level=info component=MultiTenantAlertmanager msg="synchronizing alertmanager configs for users"
ts=2023-05-05T22:09:40.129747895Z caller=multitenant.go:523 level=info component=MultiTenantAlertmanager msg="synchronizing alertmanager configs for users"
ts=2023-05-05T22:09:55.129029923Z caller=multitenant.go:523 level=info component=MultiTenantAlertmanager msg="synchronizing alertmanager configs for users"
ts=2023-05-05T22:10:10.129817347Z caller=multitenant.go:523 level=info component=MultiTenantAlertmanager msg="synchronizing alertmanager configs for users"
ts=2023-05-05T22:10:25.129035513Z caller=multitenant.go:523 level=info component=MultiTenantAlertmanager msg="synchronizing alertmanager configs for users"
ts=2023-05-05T22:10:40.129916923Z caller=multitenant.go:523 level=info component=MultiTenantAlertmanager msg="synchronizing alertmanager configs for users"
ts=2023-05-05T22:10:55.129552258Z caller=multitenant.go:523 level=info component=MultiTenantAlertmanager msg="synchronizing alertmanager configs for users"
ts=2023-05-05T22:11:10.129897851Z caller=multitenant.go:523 level=info component=MultiTenantAlertmanager msg="synchronizing alertmanager configs for users"
ts=2023-05-05T22:11:25.129470059Z caller=multitenant.go:523 level=info component=MultiTenantAlertmanager msg="synchronizing alertmanager configs for users"
ts=2023-05-05T22:11:40.128947251Z caller=multitenant.go:523 level=info component=MultiTenantAlertmanager msg="synchronizing alertmanager configs for users"
ts=2023-05-05T22:11:55.129487249Z caller=multitenant.go:523 level=info component=MultiTenantAlertmanager msg="synchronizing alertmanager configs for users"
ts=2023-05-05T22:12:10.129337338Z caller=multitenant.go:523 level=info component=MultiTenantAlertmanager msg="synchronizing alertmanager configs for users"
ts=2023-05-05T22:12:25.128901797Z caller=multitenant.go:523 level=info component=MultiTenantAlertmanager msg="synchronizing alertmanager configs for users"

@stevesg
Contributor

stevesg commented May 8, 2023

Interesting. I noticed we're using the headless service in the Helm chart; that shouldn't make any difference, but we should be using the standard service to load-balance across alertmanager replicas.

I checked our internal environments. Although we only had r232 deployed for a week, we do see very infrequent EOF errors, but they are so irregular that I would assume they're just pod restarts or node issues.

I will run the Helm chart later this week and see if I can reproduce.

It could be worth looking at sum by (pod, status_code) (cortex_request_duration_seconds_count{route="api_v1_alerts"}) to see whether the number of requests received by alertmanager matches the number of notifications sent by the ruler, and whether alertmanager is registering the requests as failures (as opposed to the ruler losing connectivity for some reason, which is more typical for EOFs).
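
For anyone following along, a hedged sketch of that comparison, expressed as rates; the ruler-side metric name (cortex_prometheus_notifications_sent_total) is an assumption, while the alertmanager-side query is the one quoted above:

  # Alerts API requests received by alertmanager, per pod and status code
  sum by (pod, status_code) (rate(cortex_request_duration_seconds_count{route="api_v1_alerts"}[5m]))

  # Notifications sent by the ruler, for comparison
  sum(rate(cortex_prometheus_notifications_sent_total[5m]))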

@yafanasiev
Author

Tried a bit of testing over the past week. We updated to Mimir 2.8.0, and also scaled and tweaked the CoreDNS deployment to make sure this is not a DNS issue - still the same result. Metrics show that all requests to Alertmanagers succeed, so this might be a connectivity issue, although we don't experience any issues with other workloads in the cluster. Would it make sense to switch over to a ClusterIP service for Alertmanagers?
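
As a rough illustration of that idea, a values sketch under the assumption that the regular (ClusterIP) alertmanager Service mirrors the headless one seen in the logs (name mimir-alertmanager, namespace mimir, port 8080); the actual names depend on the Helm release:

mimir:
  structuredConfig:
    ruler:
      # Hypothetical override: point the ruler at the regular Service
      # instead of the headless one. Host and port are taken from the
      # hostnames in the log excerpts above and may differ per release.
      alertmanager_url: http://mimir-alertmanager.mimir.svc.cluster.local:8080/alertmanager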

@ItsMisterP

Hey @yafanasiev. Did you solve this problem?

@yafanasiev
Author

Hey @ItsMisterP. As of Mimir 2.10.3 (latest) we still observe the same behaviour:
[Screenshot 2023-10-27 at 16:57:32]

As it does not seem to have any actual impact on alerting functionality, we have deprioritized it on our side. Still, it would be good to find the culprit.

@ItsMisterP

Yeah facing the same issue...

@stevesg
Contributor

stevesg commented Apr 12, 2024

Coming back to this one very late - apologies.

I have noticed these errors in one of our dev environments at Grafana, which we deploy using Helm. Whilst they aren't aligned with rollouts or restarts, they aren't continuous either.

That said, I think it's worth eliminating the difference between jsonnet and Helm, so I've put up a PR here: #7892

You can test this manually by adding the following to your custom values yaml:

mimir:
  structuredConfig:
    ruler:
      alertmanager_url: dnssrvnoa+http://_http-metrics._tcp.{{ template "mimir.fullname" . }}-alertmanager.{{ .Release.Namespace }}.svc.{{ .Values.global.clusterDomain }}/alertmanager
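
To sanity-check an SRV-based URL like this before rolling it out, one option is to resolve the record from a debug pod that has dig. The hostname below is built from the headless service name visible in the log excerpts; substitute whatever host the rendered alertmanager_url ends up pointing at:

dig +short SRV _http-metrics._tcp.mimir-alertmanager-headless.mimir.svc.cluster.local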

@stevesg
Contributor

stevesg commented May 27, 2024

I believe this is because the Prometheus code that the ruler uses hard-codes a 5-minute idle connection timeout:
https://github.com/prometheus/common/blob/6b9921f9eba2cd74f2caca0d713bb0a6eb7ef1b9/config/http_config.go#L55

		idleConnTimeout: 5 * time.Minute,

But the default HTTP server idle timeout for Alertmanager is 2 minutes (inherited from dskit):
https://github.com/grafana/dskit/blob/ab41af527fd542eced4c6618122748935d03bdcc/server/server.go#L187

	f.DurationVar(&cfg.HTTPServerIdleTimeout, "server.http-idle-timeout", 120*time.Second, "Idle timeout for HTTP server")

Because the server-side idle timeout is shorter than the client-side one, Alertmanager can close a connection that the ruler still considers reusable, and the ruler's next POST on that stale connection fails with EOF. The solution is to increase the idle timeout on Alertmanager above 5 minutes, e.g.

-server.http-idle-timeout=360s

I'll work on a jsonnet and Helm update to add this option.
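
Until that lands, one possible way to set the flag through the Helm chart, assuming the mimir-distributed chart exposes a per-component extraArgs map that is rendered into CLI flags (worth verifying against your chart version's values):

alertmanager:
  extraArgs:
    # Rendered as -server.http-idle-timeout=360s, i.e. longer than the
    # ruler's hard-coded 5-minute idle connection timeout.
    server.http-idle-timeout: 360s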
