chore(flags): add retries to is_postgres_connected_check for flag matching paths #26708

havenbarnes · 2024-12-06T02:06:09Z

Problem

We're still trying to root-cause the recent spike in healthcheck_failed flag evaluation errors (conversation here: https://posthog.slack.com/archives/C0185UNBSJZ/p1733369640212279), but for the short term this change will require 3 straight SELECT 1 failures before determining PG to be down during flag matching 🩹

We see in Loki logs that this healthcheck often fails with the following error, especially when the decide request is interacting with the PG writer:

django.db.utils.OperationalError: consuming input failed: query_wait_timeout server closed the connection unexpectedly This probably means the server terminated abnormally before or while processing the request.

We've also identified some over-reporting of healthcheck_failed: when write paths fail, we end up short-circuiting other requests that would only use read replicas. This will be addressed in a follow-up PR.

Changes


Does this work well for both Cloud and self-hosted?

Yes

How did you test this code?

Hitting decide locally works as expected and returns "errorsWhileComputingFlags": false.

neilkakkar · 2024-12-06T07:49:17Z

posthog/database_healthcheck.py

+                logger.exception(
+                    f"failed to connect to postgres {DATABASE_FOR_FLAG_MATCHING} node attempt {i + 1} of 3"
+                )
+                time.sleep(0.3)


flyby: this will make some requests unacceptably slow. It's much better imo to have error computing flags fire (which doesn't mean none of the flags were evaluated, just that the ones relying on the db weren't) than to slow down the response by up to a second occasionally

A better way to achieve the same as retries here would be to increase the query wait timeout setting, but ideally we'd tune the read replica connection settings and instances so these timeouts don't happen often. I bet when this issue happens it shows up on the pgbouncer graph as no. of waiting connections spiking (to pgbouncer, not to the db - that's been the case in the past)
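To make the latency concern above concrete, here is a minimal sketch of a retry loop of the kind the diff hunk suggests. It is not the actual PostHog implementation: the function name `is_connected_with_retries`, the injectable `probe` and `sleep` parameters, and the assumption that the loop sleeps after every failed attempt (including the last) are all illustrative. Only the 3 attempts and the 0.3 s delay come from the diff.

```python
import time
from typing import Callable, Tuple


def is_connected_with_retries(
    probe: Callable[[], None],
    attempts: int = 3,           # from the diff: "attempt {i + 1} of 3"
    delay_seconds: float = 0.3,  # from the diff: time.sleep(0.3)
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[bool, float]:
    """Return (connected, seconds_slept). `probe` must raise on failure.

    Hypothetical helper: stands in for running SELECT 1 against the database.
    """
    slept = 0.0
    for _ in range(attempts):
        try:
            probe()
            return True, slept
        except Exception:
            # Assumption: sleep after every failed attempt, including the last.
            sleep(delay_seconds)
            slept += delay_seconds
    return False, slept


def always_down() -> None:
    """Simulated probe that fails every time."""
    raise ConnectionError("simulated query_wait_timeout")


# Inject a no-op sleep so the example runs instantly.
connected, slept = is_connected_with_retries(always_down, sleep=lambda _s: None)
print(connected, round(slept, 1))
```

Under these assumptions, a request that hits a genuinely unavailable database spends roughly 0.9 s sleeping before the check gives up, on top of however long each failed probe takes. That is the "up to a second" slowdown raised in the review.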

Yep that makes sense. Was thinking too much about customers who are dealing with our worst case

chore(flags): add retries to is_postgres_connected_check for flag matching paths

dd2272c

havenbarnes requested a review from dmarticus December 6, 2024 02:06

tweak

a686337

neilkakkar reviewed Dec 6, 2024

havenbarnes closed this Dec 6, 2024
