-
Notifications
You must be signed in to change notification settings - Fork 4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Reduce the number of created connections #1034
Open
gnn
wants to merge
18
commits into
dev
Choose a base branch
from
fixes/#799-reduce-the-number-of-created-connections
base: dev
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Open
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
The first line of the docstring was too long, because it was longer than the absolute limit of 79 characters and because it was longer than the limit of 72 characters for free flowing text. So I shortened it. And because shortening was done easiest by moving the remark in parenthesis to the long description, that's what I did.
The main complainer was `flake8` when run via pre-commit hooks, but all linters respecting `# noqa` comments should be quieted by this.
According to the [SQLAlchemy documentation][0], the `Engine` should be "held globally", but also "initialized per process". Since parallelizing the workflow was the whole point of "egon-data" holding the `Engine` globally wasn't really an option, but I went overboard with an API that creates a new `Engine` for every `Session`. Fortunately creating the `Engine` through a factory function allows us to cache the `Engine` on a per process basis. This should hit the sweet spot demanded by the [SQLAlchemy documentation][0]. [0]: https://docs.sqlalchemy.org/en/13/core/connections.html#basic-usage
The "problem" that this solves is that SQLAlchemy has a weird quirk in that a `Query` returns data which is structured differently depending on what is queried. If a single mapped class is queried, the query returns a list of instances of the mapped class where each instance corresponds to a row of the query result. This case is a bit harder to convert to dictionaries, because one has to make use of the `__table__` attribute. All other cases, i.e. querying multiple mapped classes, explicitly listing the columns to query or a combination of both, results in a list of keyed tuples, which are much easier to convert to dictionaries. The helper implemented here combines both cases.
This context manager can be used everywhere, where a `session` is needed to interact with the ORM at the same time as a `connection` to get more direct access to the database. The `session` and the `connection` share the same transaction and everything will be properly committed and closed when exiting the context manager.
Theoretically one could also use `session.execute` instead of using the `connection` obtained from `db.access()`, but this is illustrative and safe, since I'm not sure whether `session.execute` behaves exactly the same as `connection.execute`.
Again, one could also have used `with engine.begin()` here, so in case this fails, that's what we can try instead.
gnn
force-pushed
the
fixes/#799-reduce-the-number-of-created-connections
branch
from
November 9, 2022 11:12
01dec7c
to
67111cb
Compare
I tried to replace all instances I found where these functions where used `session.bind`s outside of the `session`'s context manager. Using objects outside their context manager is not a good pattern. These instances worked, because `session.bind` effectively uses the underlying engine, so it should be the same as `db.engine()`, but you never know. Also, these uses where unnecessary because the `DataFrame`s could simply be obtained by using the actual query results. The `GeoDataFrame`s where a little bit harder because they expect Shapely geometries and Geoalchemy2 defaults to a different datatype, but thankfully it also supplies a conversion function.
In previous commits, this use of `read_postgis` was replaced with a combination of `GeoDataFrame` and `DataFrame.from_records`. I couldn't use the same technique here, because there's no `geom_column` argument to `read_postgis` which means that I don't know which column to convert using `to_shape`. While this can probably be figured out, I don't have the time for now so it's a TODO for later. So in order to not use the session after it is closed (which is not strictly wrong, because we only use the `bind` attribute, but it still leaves the door open to unknown behaviour), I'm replacing the session with a call to `db.engine()`. Due to the per-process caching of engines, this doesn't incur additional connections, while it also should be identical in behaviour to using `session.bind`.
Since default parameter values are evaluated at function definition time and stay with the function for its entire lifetime, they are essentially the same as module level variables (at least for top level functions, that is). So `db.engine()` as a default parameter value has the same problems as `db.engine()` at module level and should be removed accordingly. Fortunately removing it is as simple as setting the default parameter value to `None` and then checking for `None` at the start of the function body, which is what this commit does.
Engines are not supposed to be shared across process boundaries. This is ensured via returning distinct `engine` instances for distinct processes from `db.engine()`. Storing `engine`s on a module level might subvert this mechanism, so these variables get removed and replaced by individual calls to `db.engine()`. Some of these variable's weren't even used in their module.
gnn
force-pushed
the
fixes/#799-reduce-the-number-of-created-connections
branch
from
November 10, 2022 03:26
67111cb
to
b65ceec
Compare
These `session` are used at or near the top of functions and are never closed, potentially leaking connections. Using the `session_scoped` decorator on the functions allows us to get a `session` for the whole function which is automatically committed and closed at the end of the function. Note that one `sessionmaker` import gets removed because it's just unused.
That way the `DELETE` statement is guaranteed to interact correctly with the rest of the database interactions in the function, which is important, now that the whole function uses a single session.
These sessions are not opened via a context manager and thus have to be closed manually in order to not potentially leak connections. It might not be necessary but it's best to be on the safe side. Also, these are `session` usages which I couldn't somehow refactor to working with a context manager, so this is the minimal effort to stay on the safe side w.r.t. connection leakage.
These `with` blocks created two transactions inside functions which where wrapped in retrying error handlers. This could potentially lead to always failing retries because committed transactions can not be rolled back, so errors in the second transactions trigger retries on unchangeable state. In order to prevent this, it's best to have the whole function be one transaction.
gnn
force-pushed
the
fixes/#799-reduce-the-number-of-created-connections
branch
from
November 10, 2022 04:22
b65ceec
to
b075f0f
Compare
Turns out, one doesn't have to do it all at once. Instead a small start would've been enough to at least mitigate the issue. Therefore, I'll probably abandon this PR and move it to smaller PRs. |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Fixes #799.
Before merging into
dev
-branch, please make sure thatCHANGELOG.rst
was updated.black
andisort
.Dataset
-version is updated when existing datasets are adjusted.Test
mode.Everything
mode.