Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

chore(flags): add retries to is_postgres_connected_check for flag matching paths #26708

Closed
wants to merge 2 commits into from

Conversation

havenbarnes
Copy link
Contributor

@havenbarnes havenbarnes commented Dec 6, 2024

Problem

We're still trying to root cause the recent spike in healthcheck_failed flag evaluation errors (conversation here https://posthog.slack.com/archives/C0185UNBSJZ/p1733369640212279), but for the short term this change will require 3 straight SELECT 1 failures before determining PG to be down when during flag matching 🩹

We see in Loki logs that this healthcheck is often failing with django.db.utils.OperationalError: consuming input failed: query_wait_timeout server closed the connection unexpectedly This probably means the server terminated abnormally before or while processing the request., especially when the decide request is interacting with the PG writer.

We've also identified some over-reporting of healthcheck_failed when write paths fail, but short-circuiting other requests would only use read replicas. This will be addressed in a follow up PR

Changes

👉 Stay up-to-date with PostHog coding conventions for a smoother review.

Does this work well for both Cloud and self-hosted?

Yes

How did you test this code?

Hitting decide locally works as expected and with "errorsWhileComputingFlags": false,

@havenbarnes havenbarnes requested a review from dmarticus December 6, 2024 02:06
logger.exception(
f"failed to connect to postgres {DATABASE_FOR_FLAG_MATCHING} node attempt {i + 1} of 3"
)
time.sleep(0.3)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

flyby that this will make some requests unacceptably slow. It's much better imo to have error computing flags fire (which doesn't mean none of the flags were evaluated, just that the ones relying on the db weren't), than to slow down the response by upto a second occasionally

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A better way to achieve the same as retries here would be to increase the query wait timeout setting, but ideally we'd tune the read replica connection settings and instances so these timeouts don't happen often. I bet when this issue happens it shows up on the pgbouncer graph as no. of waiting connections spiking (to pgbouncer, not to the db - that's been the case in the past)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep that makes sense. Was thinking too much about customers who are dealing with our worst case

@havenbarnes havenbarnes closed this Dec 6, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants