database: improve handling of connection failure #203

Open
mdonadoni opened this issue Sep 14, 2023 · 0 comments

Connecting to an external database can take a long time or never complete:

$ psql -h db.example.org -p 1234  -U reana reana
[... wait forever ...]

When this happens, all the REANA components that depend on the database get stuck. As an example, the REST API of r-server will stop replying, and the logs will show:

*** uWSGI listen queue of socket "127.0.0.1:45841" (fd: 3) full !!! (100/100) ***
*** uWSGI listen queue of socket "127.0.0.1:45841" (fd: 3) full !!! (101/100) ***
*** uWSGI listen queue of socket "127.0.0.1:45841" (fd: 3) full !!! (101/100) ***
[...]

Note that these error messages might appear only after many minutes, once the connection queue is full. In the meantime, nothing is printed to the logs, which makes it difficult to understand what is going wrong. For example, one might think that the server is unreachable, perhaps because the Ingresses are not configured correctly.

Also note that in this case the components do not self-heal and a restart of the pod is needed.
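
To make the failure visible sooner, a component could check with a bounded wait whether the database host accepts connections at all. The following is only a minimal sketch using the Python standard library; `database_reachable` is a hypothetical helper, not existing REANA code, and it only tests that a TCP connection opens, not that PostgreSQL authenticates or answers queries.

```python
import socket


def database_reachable(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port opens within `timeout` seconds.

    Hypothetical helper for illustration: it checks network reachability only,
    not PostgreSQL authentication or query execution.
    """
    try:
        # create_connection raises an OSError subclass if the attempt times
        # out, is refused, or the host name cannot be resolved.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

With the example above, `database_reachable("db.example.org", 1234)` would return `False` after at most about five seconds instead of waiting forever, which would give the component a chance to log the problem.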

We should improve this, maybe by tweaking the configuration to put a timeout on database connections, so that we can at least return a 500 error in a timely manner and, if possible, self-heal when the database is back online.
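
One possible way to get such a timeout, sketched here under assumptions: libpq accepts a `connect_timeout` parameter (in seconds) in PostgreSQL connection URIs, so the timeout could be carried in the database URI itself. The helper name `build_db_uri` is hypothetical, and whether the REANA components pass the URI query string through to the driver unchanged would need to be checked. Note also that `connect_timeout` only bounds connection establishment, not queries that hang afterwards.

```python
from urllib.parse import quote, urlencode


def build_db_uri(host, port, user, dbname, connect_timeout=10):
    """Return a libpq-style URI that stops trying to connect after `connect_timeout` seconds.

    Hypothetical helper for illustration; the host, port, user and database
    values used below are the ones from the psql example in this issue.
    """
    query = urlencode({"connect_timeout": connect_timeout})
    return f"postgresql://{quote(user)}@{host}:{port}/{quote(dbname)}?{query}"


print(build_db_uri("db.example.org", 1234, "reana", "reana"))
# prints: postgresql://reana@db.example.org:1234/reana?connect_timeout=10
```

The same parameter can be tried by hand with `psql` by passing this URI as the connection string, which should make the command fail after the timeout instead of waiting forever.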
