
Reduce the number of created connections #1034

Open · gnn wants to merge 18 commits into dev from fixes/#799-reduce-the-number-of-created-connections
Conversation

@gnn gnn commented Nov 8, 2022

Fixes #799.

Before merging into dev-branch, please make sure that

  • the CHANGELOG.rst was updated.
  • new and adjusted code is formatted using black and isort.
  • the Dataset-version is updated when existing datasets are adjusted.
  • the branch was merged into the current "continuous-integration/run-everything"-branch.
  • the workflow is running successfully in Test mode.
  • the workflow is running successfully in Everything mode.

The first line of the docstring was too long: it exceeded both the absolute
limit of 79 characters and the limit of 72 characters for free flowing
text. So I shortened it. And since the easiest way to shorten it was to
move the remark in parentheses into the long description, that's what I
did.
The main complainer was `flake8` when run via pre-commit hooks, but all
linters respecting `# noqa` comments should be quieted by this.
According to the [SQLAlchemy documentation][0], the `Engine` should be
"held globally", but also "initialized per process". Since parallelizing
the workflow was the whole point of "egon-data", holding the `Engine`
globally wasn't really an option, but I went overboard with an API that
creates a new `Engine` for every `Session`. Fortunately, creating the
`Engine` through a factory function allows us to cache the `Engine` on a
per-process basis. This should hit the sweet spot demanded by the
[SQLAlchemy documentation][0].

[0]: https://docs.sqlalchemy.org/en/13/core/connections.html#basic-usage
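The per-process caching described above could look roughly like the following sketch (the function name `engine`, the `_engines` dictionary, and the SQLite URL are assumptions for illustration, not egon-data's actual code):

```python
import os

from sqlalchemy import create_engine

_engines = {}  # one cached Engine per process id

def engine(url="sqlite://"):  # the URL is a placeholder for the real DSN
    """Return the Engine for this process, creating it on first use.

    Keying the cache on the process id means a forked worker never
    inherits its parent's Engine: the new pid misses the cache and a
    fresh Engine is created, as the SQLAlchemy documentation demands.
    """
    pid = os.getpid()
    if pid not in _engines:
        _engines[pid] = create_engine(url)
    return _engines[pid]
```

Within one process, repeated calls return the identical `Engine` instance, so no additional connection pools are created.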
The "problem" that this solves is a weird quirk of SQLAlchemy: a `Query`
returns data which is structured differently depending on what is
queried. If a single mapped class is queried, the query returns a list of
instances of the mapped class, where each instance corresponds to a row
of the query result. This case is a bit harder to convert to
dictionaries, because one has to make use of the `__table__` attribute.
All other cases, i.e. querying multiple mapped classes, explicitly
listing the columns to query, or a combination of both, result in a list
of keyed tuples, which are much easier to convert to dictionaries.
The helper implemented here combines both cases.
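Such a helper might look like the following sketch (the `User` model and the name `rows_to_dicts` are assumptions for illustration, not the PR's actual code):

```python
from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()

class User(Base):
    __tablename__ = "users"
    id = Column(Integer, primary_key=True)
    name = Column(String)

def rows_to_dicts(rows):
    """Convert either query result shape to a list of dictionaries.

    Instances of a mapped class carry a `__table__` attribute listing
    their columns; keyed tuples instead offer `_asdict()`.
    """
    return [
        {column.name: getattr(row, column.name)
         for column in row.__table__.columns}
        if hasattr(row, "__table__")
        else dict(row._asdict())
        for row in rows
    ]
```

Both `session.query(User).all()` and `session.query(User.id, User.name).all()` then normalize to the same list of dictionaries.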
This context manager can be used wherever a `session` is needed to
interact with the ORM at the same time as a `connection` for more direct
access to the database. The `session` and the `connection` share the
same transaction, and everything will be properly committed and closed
when exiting the context manager.
Theoretically one could also use `session.execute` instead of using the
`connection` obtained from `db.access()`, but this is illustrative and
safe, since I'm not sure whether `session.execute` behaves exactly the
same as `connection.execute`.
Again, one could also have used `with engine.begin()` here, so in case
this fails, that's what we can try instead.
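As a sketch of such a context manager (the name `access` matches the text above, but this body is an assumption, not the PR's actual implementation):

```python
from contextlib import contextmanager

from sqlalchemy.orm import Session

@contextmanager
def access(engine):
    """Yield a `(session, connection)` pair sharing one transaction."""
    with engine.begin() as connection:  # opens the shared transaction
        session = Session(bind=connection)  # ORM work joins it
        try:
            yield session, connection
            session.flush()  # push pending ORM state before committing
        finally:
            session.close()
    # leaving `engine.begin()` commits, or rolls back on an exception
```

Because both objects use the same underlying connection, statements issued through the `session` are immediately visible to the `connection` and vice versa, before anything is committed.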
@gnn gnn force-pushed the fixes/#799-reduce-the-number-of-created-connections branch from 01dec7c to 67111cb Compare November 9, 2022 11:12
I tried to replace all instances I found where these functions used
`session.bind` outside of the `session`'s context manager. Using
objects outside their context manager is not a good pattern. These
instances worked, because `session.bind` effectively uses the underlying
engine, so it should be the same as `db.engine()`, but you never know.
Also, these uses were unnecessary because the `DataFrame`s could simply
be obtained from the actual query results. The `GeoDataFrame`s were
a little bit harder because they expect Shapely geometries and
GeoAlchemy2 defaults to a different datatype, but thankfully it also
supplies a conversion function.
In previous commits, this use of `read_postgis` was replaced with a
combination of `GeoDataFrame` and `DataFrame.from_records`. I couldn't
use the same technique here, because this `read_postgis` call has no
geometry column argument, which means that I don't know which column to
convert using `to_shape`. While this can probably be figured out, I don't
have the time for now, so it's a TODO for later.
So, in order to not use the session after it is closed (which is not
strictly wrong, because we only use the `bind` attribute, but it still
leaves the door open to unknown behaviour), I'm replacing the session
with a call to `db.engine()`. Due to the per-process caching of engines,
this doesn't incur additional connections, while it should also be
identical in behaviour to using `session.bind`.
Since default parameter values are evaluated at function definition time
and stay with the function for its entire lifetime, they are essentially
the same as module level variables (at least for top level functions,
that is). So `db.engine()` as a default parameter value has the same
problems as `db.engine()` at module level and should be removed
accordingly.
Fortunately removing it is as simple as setting the default parameter
value to `None` and then checking for `None` at the start of the
function body, which is what this commit does.
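The pitfall and the fix can be demonstrated without SQLAlchemy (`make_engine` is a hypothetical stand-in for `db.engine`):

```python
created = []

def make_engine():
    """Stand-in for `db.engine()`; records every creation."""
    created.append(object())
    return created[-1]

# Bad: `make_engine()` runs once, when the `def` statement is executed,
# so every call shares the engine created at import time.
def bad(engine=make_engine()):
    return engine

# Good: the default is `None`, so the factory runs inside each call,
# where it can e.g. consult the per-process cache.
def good(engine=None):
    if engine is None:
        engine = make_engine()
    return engine
```

Every call to `bad()` returns the single engine created at definition time, while `good()` defers creation to call time.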
Engines are not supposed to be shared across process boundaries. This is
ensured via returning distinct `engine` instances for distinct processes
from `db.engine()`. Storing `engine`s on a module level might subvert
this mechanism, so these variables get removed and replaced by
individual calls to `db.engine()`.
Some of these variables weren't even used in their module.
@gnn gnn force-pushed the fixes/#799-reduce-the-number-of-created-connections branch from 67111cb to b65ceec Compare November 10, 2022 03:26
These `session`s are used at or near the top of functions and are never
closed, potentially leaking connections. Using the `session_scoped`
decorator on the functions allows us to get a `session` for the whole
function which is automatically committed and closed at the end of the
function.
Note that one `sessionmaker` import gets removed because it's just
unused.
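A minimal version of such a decorator might look like this (`session_scoped` exists in egon-data, but this body and the placeholder `ENGINE` are assumptions for illustration):

```python
from functools import wraps

from sqlalchemy import create_engine
from sqlalchemy.orm import Session

ENGINE = create_engine("sqlite://")  # placeholder for `db.engine()`

def session_scoped(function):
    """Inject a fresh `session` keyword argument into `function`.

    The session spans the whole function call and is committed and
    closed automatically when the function returns.
    """
    @wraps(function)
    def wrapped(*args, **kwargs):
        with Session(ENGINE) as session, session.begin():
            return function(*args, session=session, **kwargs)
    return wrapped
```

The decorated function just declares a `session` parameter and never has to worry about closing it, so no connection can leak from it.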
That way the `DELETE` statement is guaranteed to interact correctly with
the rest of the database interactions in the function, which is
important, now that the whole function uses a single session.
These sessions are not opened via a context manager and thus have to be
closed manually in order to not potentially leak connections. It might
not be necessary but it's best to be on the safe side.
Also, these are `session` usages which I couldn't somehow refactor to
working with a context manager, so this is the minimal effort to stay on
the safe side w.r.t. connection leakage.
These `with` blocks created two transactions inside functions which
were wrapped in retrying error handlers. This could potentially lead to
always failing retries, because committed transactions can not be rolled
back, so errors in the second transaction trigger retries on
unchangeable state. In order to prevent this, it's best to have the
whole function be one transaction.
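The interaction can be sketched as follows (`with_retries` is a hypothetical stand-in for the retrying error handlers; the point is that each attempt is exactly one transaction, so a failed attempt is rolled back completely):

```python
from sqlalchemy import create_engine, text

ENGINE = create_engine("sqlite://")  # placeholder engine

def with_retries(attempts, function):
    """Run `function` inside one transaction, retrying on errors."""
    for attempt in range(attempts):
        try:
            with ENGINE.begin() as connection:  # one transaction per attempt
                return function(connection)
        except Exception:
            # `ENGINE.begin()` already rolled the attempt back, so the
            # next attempt starts from unchanged state.
            if attempt == attempts - 1:
                raise
```

With two transactions per function, work committed by the first transaction would survive a failure in the second and make every retry fail on that already-committed state; with one transaction, each retry starts clean.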
@gnn gnn force-pushed the fixes/#799-reduce-the-number-of-created-connections branch from b65ceec to b075f0f Compare November 10, 2022 04:22
@nesnoj nesnoj mentioned this pull request Nov 30, 2022
gnn commented Dec 7, 2022

Turns out, one doesn't have to do it all at once. Instead, a small start would've been enough to at least mitigate the issue. Therefore, I'll probably abandon this PR and split the work into smaller PRs.
See #1062 for a start.

Successfully merging this pull request may close these issues.

Postgres throws FATAL: sorry, too many clients already