-
Notifications
You must be signed in to change notification settings - Fork 1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
SQLAlchemy 2.0 upgrades to ORM usage in /lib #16434
Conversation
37e9708
to
57eecc4
Compare
9e43f2a
to
84b9eed
Compare
340cc91
to
b545931
Compare
MappedType = Base | ||
|
||
|
||
class BaseRepository: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is this solving ? It looks like a lot of overhead
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You mean the class BaseRepository
? It's just a base class for all repositories which contains common functionality. So, instead of having self.session.get(self.Foo, primary_key
) in 100+ repository classes, you have it in one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you elaborate some more on this ? I may not get it, but for instance
parent_folder = trans.sa_session.query(trans.app.model.LibraryFolder).get(parent_id)
that you're replacing with parent_folder = LibraryFolderRepository(trans.sa_session).get(parent_id)
does not look more efficient and I don't think is something you'd do if you're familiar with sqlalchemy (unless that's a documented new pattern ?).
It seems that this would also be problematic if you're doing further queries and mainpulations where you're expecting ORM objects.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure! session.query(Foo).get(pkey)
is SQLAlchemy-specific code. It's low level and it's tightly coupled with SA's API. When that API changes (and it did), or when we decide to use a different API (for whatever reason - like filter vs where, scalars vs. execute, etc.), we'll have to change that particular chain of method calls in every place we use it - there are many hundreds, if not thousands of those, which are also not easily grepped ("get" is so generic). With a repository, you change it in one place only. There's that.
Then there's code comprehension: whereas the session.get(id)
vs. foo_repository.get(id)
is not particularly convincing, any other statement is: session.scalars(select(Foo).limit(1)).first()
to return the first Foo is less readable than foo_repository.get_first()
.
If one has to write low-level SA code to do any kind of data access, that makes learning low-level SA a prerequisite for writing any code that interacts with the db. Besides, being familiar with SQLAlchemy doesn't necessarily guarantee that, for example, one would know to use limit(1)
in the case above, or remember to do that. In other words, in my opinion, wrapping library-level data access code in a repository abstraction makes the code both more maintainable (by far!) and more comprehensible.
That said, you are absolutely right about the challenge of more complex queries and manipulations: we certainly don't want to rewrite parts of SA's API! My hope is that we'll be able to tuck away the vast majority of such low level code into repositories, and eliminate a ton of duplication in the process. But yes, it's possible that there will be some logic that is too complex to be abstracted away without adding a lot of conceptual overhead - if that's the case, we'll leave such code as is, of course. I'm working my way up from trivial to simple (those are duplicated all over the code base) - to more complex, so by the time we have complex db logic to untangle, there'll be a ton of improvement done already.
Data access repositories is a pattern. I'll provide a more detailed description as it relates to our code base in the PR description. Essentially, it boils down to eliminating duplication and encapsulating low level code/logic in objects and methods with names that describe what is being done/returned (e.g. foo repo returning all deleted bars) as opposed to how it is implemented (e.g. session/scalars/select/first/etc...)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tuck away the vast majority of such low level code into repositories
that IMO is not the right direction. I am -0.9 on this change, sorry.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That gives you much tighter type checking
to me these are highly questionable redundant annotations, given that everything returned through the session already has the proper type:
# (variable) users1: Sequence[User]
users1 = session.scalars(select(User)).all()
# (variable) user: User
user = session.query(User).one()
# (variable) user_iter: Iterator[User]
user_iter = iter(session.scalars(select(User)))
see https://docs.sqlalchemy.org/en/20/changelog/whatsnew_20.html#sql-expression-typing-examples
session.execute(select(Foo)).scalar_one() foo_repo.one()
that's not always going to be what we want for something called .one()
, so why not roll with the upstream syntax ? It's not like the sqlalchemy people hate good syntax / UX.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
to me these are highly questionable redundant annotations, given that everything returned through the session already has the proper type:
Yes, very nice indeed - it appears SA already does the right thing! Which means we can simplify repository code and get rid of that redundant typing!
that's not always going to be what we want for something called
.one()
, so why not roll with the upstream syntax ? It's not like the sqlalchemy people hate good syntax / UX.
I'll have the data to justify this (or not justify it) shortly - I'm not there yet. It's been my impression so far that there's a sufficient amount of repetitive and fairly detailed SA code that can be easily factored out into a repo class without hiding proper types and without the need to invent any new syntax, etc.
The main goal of this PR is updating the SA code in our code base to SA 2.0. The repos came as a natural addition, after I had copied and pasted the same SA code a sufficient number of times. However, the PR is still primarily about updating the SA code. The repos are, actually, making the process easier (having FooRepository.whatever
is like placing a check mark next to the fixed code block). Most importantly, this is not an irreversible change: replacing FooRepo.get_whatever with the corresponding SA code would be a trivial sed exercise. So, in the end, if it appears the layer isn't justified, we can just remove it - the PR will still update the SA code to 2.0.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here are just a few examples:
These are not complicated examples IMO - these are exactly the simple code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Something like get_with_filter_order_by_hid
- is indeed code that I think belongs in the model layer and should be tested in isolation. I think though just a static method on the HDA class or a function in that module that took in the sa_session along with the rest of the arguments you've created and just did the work directly would be simpler and feel more "python" to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, you're right - those are quite simple. I'll move on to more complex ones - I hope I'll find better examples.
The get_with_filter_order_by_hid
- bad approach altogether: I'm doing exactly what we shouldn't do - taking perfectly normal SA syntax and restating the same imperative logic as an ugly method name which doesn't improve anything beyond reducing duplicate code. Just an iteration - so please ignore.
I think I have a better option - just pushed. The method name, I think, should say what is being fetched/done. The repo module (or some module within the model) contains all the low level SA code that details how it's done. See repository.hda
and webapps.galaxy.services.history.py
. Naturally, there will be a lot of similar methods with different names - that can be then refactored in the repositories to eliminate any duplicate constructs.
Drop unused parameters and logic from select methods in legacy controller.
Move data access method to managers.users
Replace python iteration with SA batch update
Move data access method into manager (get_jobs_to_check_at_startup)
Move data access method to managers.quotas (get_quotas) Move data access method to managers.histories (get_len_files_by_history) Move data access method to managers.groups (get_group_by_name)
Co-authored-by: Marius van den Beek <[email protected]>
Co-authored-by: Marius van den Beek <[email protected]>
ee5623a
to
b57d6ed
Compare
SA 2.0 ORM usage to-do:
Requires more work:
.query.enable_eagerloads()
FormDefinition.to_dict
bug (ref)Ref: #12541
How to test the changes?
(Select all options that apply)
License