v2.44.0.cli-migrations-v3: No retry for postgres database connection, container frozen #10595

Open
raweber42 opened this issue Nov 13, 2024 · 0 comments
Labels
k/bug Something isn't working

Comments

@raweber42
Version Information

Server Version:
CLI Version (for CLI related issue): v2.44.0.cli-migrations-v3

Environment

self-hosted

What is the current behaviour?

We run Hasura in our Kubernetes cluster, with an ephemeral Postgres DB in a container deployed next to it. On startup, Hasura comes up faster than the Postgres DB, so, as expected, I see this log entry:

{"error":"connection error","path":"$","code":"postgres-error","internal":"connection to server at \"localhost\" (127.0.0.1), port 5433 failed: Connection refused\n\tIs the server running on that host and accepting TCP/IP connections?\nconnection to server at \"localhost\" (::1), port 5433 failed: Cannot assign requested address\n\tIs the server running on that host and accepting TCP/IP connections?\n"}

The problem is that Hasura does not retry the Postgres connection. Nothing further is logged until the Hasura container is killed once HASURA_GRAPHQL_MIGRATIONS_SERVER_TIMEOUT is reached. By the time Kubernetes spins up a new container, the Postgres DB is ready and everything works fine.

I don't recall seeing this issue before, and we have been using the same setup for several months, so it might be a regression.

The central question here is: is there a retry mechanism for the database connection of the temporary server that the cli-migrations-v3 image creates? From what I can see, there is not. Even when running the [tests](https://github.com/hasura/graphql-engine/tree/master/packaging/cli-migrations/v3/test) in the Hasura repo, I see the same behavior if the Postgres DB is not already available when Hasura starts up.

I am willing to contribute to the project to fix this, if necessary!

What is the expected behaviour?

The container retries the DB connection at a fixed (optionally configurable) interval.

How to reproduce the issue?

  1. Use the docker-compose file from the [test folder](https://github.com/hasura/graphql-engine/blob/master/packaging/cli-migrations/v3/test/docker-compose.yaml).
  2. Run docker-compose up
  3. Check the logs of the Hasura container and see that the database connection is never retried.
  4. See that the container gets killed once HASURA_GRAPHQL_MIGRATIONS_SERVER_TIMEOUT has been reached.

Screenshots or Screencast

Please provide any traces or logs that could help here.

In the [test](https://github.com/hasura/graphql-engine/blob/master/packaging/cli-migrations/v3/test/test.sh) script of the image I can see that the Postgres DB is spun up before the Hasura instance. Maybe it's a coincidence, but it supports my suspicion that the Hasura instance does not repeatedly check the DB connection.

Any possible solutions/workarounds you're aware of?

Implement polling/retrying for the database connection.
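
To illustrate the kind of polling I mean, here is a minimal sketch in Python. It is not how the cli-migrations-v3 image works today, and the function name `wait_for_port` and its parameters are made up for illustration. It only checks that the TCP port accepts connections, which does not guarantee that Postgres is ready to serve queries; a real fix would probably retry the actual database connection instead.

```python
import socket
import time


def wait_for_port(host, port, timeout=30.0, interval=1.0):
    """Poll a TCP port until it accepts a connection or `timeout` seconds pass.

    Hypothetical helper for illustration only. Returns True as soon as a
    connection succeeds, False if the deadline is reached first.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # Each attempt is bounded by `interval` so one hung connect
            # cannot consume the whole timeout budget.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Connection refused, unreachable, or timed out: retry unless
            # another sleep would take us past the deadline.
            if time.monotonic() + interval > deadline:
                return False
            time.sleep(interval)
```

For the setup in the log above, something like `wait_for_port("localhost", 5433, timeout=60.0, interval=2.0)` could run before the migrations step, with the interval and timeout exposed as configuration.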

Keywords

auto-migrate, cli-migrations-v3, database, postgres, metadata

@raweber42 raweber42 added the k/bug Something isn't working label Nov 13, 2024