
Reduce the number of created connections #1034

Open · gnn wants to merge 18 commits into dev from fixes/#799-reduce-the-number-of-created-connections
Conversation

@gnn gnn commented Nov 8, 2022

Fixes #799.

Before merging into dev-branch, please make sure that

  • the CHANGELOG.rst was updated.
  • new and adjusted code is formatted using black and isort.
  • the Dataset-version is updated when existing datasets are adjusted.
  • the branch was merged into the current "continuous-integration/run-everything"-branch.
  • the workflow is running successfully in Test mode.
  • the workflow is running successfully in Everything mode.

The first line of the docstring was too long: it exceeded both the absolute
limit of 79 characters and the limit of 72 characters for free flowing
text. So I shortened it. And since the easiest way to shorten it was to
move the remark in parentheses into the long description, that's what I
did.
The main complainer was `flake8` when run via pre-commit hooks, but all
linters respecting `# noqa` comments should be quieted by this.
According to the [SQLAlchemy documentation][0], the `Engine` should be
"held globally", but also "initialized per process". Since parallelizing
the workflow was the whole point of "egon-data", holding the `Engine`
globally wasn't really an option, but I went overboard with an API that
creates a new `Engine` for every `Session`. Fortunately, creating the
`Engine` through a factory function allows us to cache the `Engine` on a
per-process basis. This should hit the sweet spot demanded by the
[SQLAlchemy documentation][0].

[0]: https://docs.sqlalchemy.org/en/13/core/connections.html#basic-usage
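The per-process caching described above could look roughly like the following sketch (the function name `engine`, the `_engines` dictionary, and the SQLite URL are assumptions for illustration, not egon-data's actual code):

```python
import os

from sqlalchemy import create_engine

_engines = {}  # one cached Engine per process id

def engine(url="sqlite://"):  # the URL is a placeholder for the real DSN
    """Return the Engine for this process, creating it on first use.

    Keying the cache on the process id means a forked worker never
    inherits its parent's Engine: the new pid misses the cache and a
    fresh Engine is created, as the SQLAlchemy documentation demands.
    """
    pid = os.getpid()
    if pid not in _engines:
        _engines[pid] = create_engine(url)
    return _engines[pid]
```

Within one process, repeated calls return the identical `Engine` instance, so no additional connection pools are created.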
The "problem" that this solves is a weird quirk of SQLAlchemy: a `Query`
returns data which is structured differently depending on what is
queried. If a single mapped class is queried, the query returns a list of
instances of the mapped class, where each instance corresponds to a row
of the query result. This case is a bit harder to convert to
dictionaries, because one has to make use of the `__table__` attribute.
All other cases, i.e. querying multiple mapped classes, explicitly
listing the columns to query, or a combination of both, result in a list
of keyed tuples, which are much easier to convert to dictionaries.
The helper implemented here combines both cases.
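Such a helper might look like the following sketch (the `User` model and the name `rows_to_dicts` are assumptions for illustration, not the PR's actual code):

```python
from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()

class User(Base):
    __tablename__ = "users"
    id = Column(Integer, primary_key=True)
    name = Column(String)

def rows_to_dicts(rows):
    """Convert either query result shape to a list of dictionaries.

    Instances of a mapped class carry a `__table__` attribute listing
    their columns; keyed tuples instead offer `_asdict()`.
    """
    return [
        {column.name: getattr(row, column.name)
         for column in row.__table__.columns}
        if hasattr(row, "__table__")
        else dict(row._asdict())
        for row in rows
    ]
```

Both `session.query(User).all()` and `session.query(User.id, User.name).all()` then normalize to the same list of dictionaries.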
This context manager can be used wherever a `session` is needed to
interact with the ORM at the same time as a `connection` for more direct
access to the database. The `session` and the `connection` share the
same transaction, and everything will be properly committed and closed
when exiting the context manager.
Theoretically one could also use `session.execute` instead of using the
`connection` obtained from `db.access()`, but this is illustrative and
safe, since I'm not sure whether `session.execute` behaves exactly the
same as `connection.execute`.
Again, one could also have used `with engine.begin()` here, so in case
this fails, that's what we can try instead.
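As a sketch of such a context manager (the name `access` matches the text above, but this body is an assumption, not the PR's actual implementation):

```python
from contextlib import contextmanager

from sqlalchemy.orm import Session

@contextmanager
def access(engine):
    """Yield a `(session, connection)` pair sharing one transaction."""
    with engine.begin() as connection:  # opens the shared transaction
        session = Session(bind=connection)  # ORM work joins it
        try:
            yield session, connection
            session.flush()  # push pending ORM state before committing
        finally:
            session.close()
    # leaving `engine.begin()` commits, or rolls back on an exception
```

Because both objects use the same underlying connection, statements issued through the `session` are immediately visible to the `connection` and vice versa, before anything is committed.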
@gnn gnn force-pushed the fixes/#799-reduce-the-number-of-created-connections branch from 01dec7c to 67111cb Compare November 9, 2022 11:12
I tried to replace all instances I found where these functions used
`session.bind` outside of the `session`'s context manager. Using
objects outside their context manager is not a good pattern. These
instances worked, because `session.bind` effectively uses the underlying
engine, so it should be the same as `db.engine()`, but you never know.
Also, these uses were unnecessary because the `DataFrame`s could simply
be obtained from the actual query results. The `GeoDataFrame`s were
a little bit harder because they expect Shapely geometries and
GeoAlchemy2 defaults to a different datatype, but thankfully it also
supplies a conversion function.
In previous commits, this use of `read_postgis` was replaced with a
combination of `GeoDataFrame` and `DataFrame.from_records`. I couldn't
use the same technique here, because this `read_postgis` call has no
geometry column argument, which means that I don't know which column to
convert using `to_shape`. While this can probably be figured out, I don't
have the time for now, so it's a TODO for later.
So, in order to not use the session after it is closed (which is not
strictly wrong, because we only use the `bind` attribute, but it still
leaves the door open to unknown behaviour), I'm replacing the session
with a call to `db.engine()`. Due to the per-process caching of engines,
this doesn't incur additional connections, while it should also be
identical in behaviour to using `session.bind`.
Since default parameter values are evaluated at function definition time
and stay with the function for its entire lifetime, they are essentially
the same as module level variables (at least for top level functions,
that is). So `db.engine()` as a default parameter value has the same
problems as `db.engine()` at module level and should be removed
accordingly.
Fortunately removing it is as simple as setting the default parameter
value to `None` and then checking for `None` at the start of the
function body, which is what this commit does.
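The pitfall and the fix can be demonstrated without SQLAlchemy (`make_engine` is a hypothetical stand-in for `db.engine`):

```python
created = []

def make_engine():
    """Stand-in for `db.engine()`; records every creation."""
    created.append(object())
    return created[-1]

# Bad: `make_engine()` runs once, when the `def` statement is executed,
# so every call shares the engine created at import time.
def bad(engine=make_engine()):
    return engine

# Good: the default is `None`, so the factory runs inside each call,
# where it can e.g. consult the per-process cache.
def good(engine=None):
    if engine is None:
        engine = make_engine()
    return engine
```

Every call to `bad()` returns the single engine created at definition time, while `good()` defers creation to call time.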
Engines are not supposed to be shared across process boundaries. This is
ensured via returning distinct `engine` instances for distinct processes
from `db.engine()`. Storing `engine`s on a module level might subvert
this mechanism, so these variables get removed and replaced by
individual calls to `db.engine()`.
Some of these variables weren't even used in their module.
@gnn gnn force-pushed the fixes/#799-reduce-the-number-of-created-connections branch from 67111cb to b65ceec Compare November 10, 2022 03:26
These `session`s are used at or near the top of functions and are never
closed, potentially leaking connections. Using the `session_scoped`
decorator on the functions allows us to get a `session` for the whole
function which is automatically committed and closed at the end of the
function.
Note that one `sessionmaker` import gets removed because it's just
unused.
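A minimal version of such a decorator might look like this (`session_scoped` exists in egon-data, but this body and the placeholder `ENGINE` are assumptions for illustration):

```python
from functools import wraps

from sqlalchemy import create_engine
from sqlalchemy.orm import Session

ENGINE = create_engine("sqlite://")  # placeholder for `db.engine()`

def session_scoped(function):
    """Inject a fresh `session` keyword argument into `function`.

    The session spans the whole function call and is committed and
    closed automatically when the function returns.
    """
    @wraps(function)
    def wrapped(*args, **kwargs):
        with Session(ENGINE) as session, session.begin():
            return function(*args, session=session, **kwargs)
    return wrapped
```

The decorated function just declares a `session` parameter and never has to worry about closing it, so no connection can leak from it.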
That way the `DELETE` statement is guaranteed to interact correctly with
the rest of the database interactions in the function, which is
important, now that the whole function uses a single session.
These sessions are not opened via a context manager and thus have to be
closed manually in order to not potentially leak connections. It might
not be necessary but it's best to be on the safe side.
Also, these are `session` usages which I couldn't somehow refactor to
working with a context manager, so this is the minimal effort to stay on
the safe side w.r.t. connection leakage.
These `with` blocks created two transactions inside functions which
were wrapped in retrying error handlers. This could potentially lead to
always failing retries, because committed transactions can not be rolled
back, so errors in the second transaction trigger retries on
unchangeable state. In order to prevent this, it's best to have the
whole function be one transaction.
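The interaction can be sketched as follows (`with_retries` is a hypothetical stand-in for the retrying error handlers; the point is that each attempt is exactly one transaction, so a failed attempt is rolled back completely):

```python
from sqlalchemy import create_engine, text

ENGINE = create_engine("sqlite://")  # placeholder engine

def with_retries(attempts, function):
    """Run `function` inside one transaction, retrying on errors."""
    for attempt in range(attempts):
        try:
            with ENGINE.begin() as connection:  # one transaction per attempt
                return function(connection)
        except Exception:
            # `ENGINE.begin()` already rolled the attempt back, so the
            # next attempt starts from unchanged state.
            if attempt == attempts - 1:
                raise
```

With two transactions per function, work committed by the first transaction would survive a failure in the second and make every retry fail on that already-committed state; with one transaction, each retry starts clean.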
@gnn gnn force-pushed the fixes/#799-reduce-the-number-of-created-connections branch from b65ceec to b075f0f Compare November 10, 2022 04:22
@nesnoj nesnoj mentioned this pull request Nov 30, 2022
gnn commented Dec 7, 2022

Turns out, one doesn't have to do it all at once. Instead, a small start would've been enough to at least mitigate the issue. Therefore, I'll probably abandon this PR and split the work into smaller PRs.
See #1062 for a start.

Successfully merging this pull request may close these issues.

Postgres throws FATAL: sorry, too many clients already