SQLAlchemy 2.0 upgrades to ORM usage in /lib #16434

jdavcs · 2023-07-21T19:18:25Z

SA 2.0 ORM usage to-do:

Requires more work:

lib/galaxy/job_execution (requires addressing SessionlessContext, which implements a small subset of SQLAlchemy session's API)
lib/galaxy/tools (same as above: SessionlessContext)
lib/galaxy/authnz (requires rewriting all the authnz unit tests, because they are all based on a mocked subset of SQLAlchemy)
lib/galaxy/webapps/reports: handle .query.enable_eagerloads()
Fix the FormDefinition.to_dict bug (ref)

Ref: #12541

How to test the changes?

(Select all options that apply)

I've included appropriate automated tests.
This is a refactoring of components with existing test coverage.
Instructions for manual testing are as follows:
1. [add testing steps and prerequisites here if you didn't write automated tests covering all your changes]

License

I agree to license these and all my past contributions to the core galaxy codebase under the MIT license.

mvdbeek · 2023-08-17T07:57:15Z

lib/galaxy/model/repositories/__init__.py

+MappedType = Base
+
+
+class BaseRepository:


What is this solving? It looks like a lot of overhead.

You mean the class BaseRepository? It's just a base class for all repositories which contains common functionality. So, instead of having self.session.get(self.Foo, primary_key) in 100+ repository classes, you have it in one.
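As a rough illustration of the idea described above (this is not the code from the PR), a base repository could keep the primary-key lookup in one place. The simplified LibraryFolder model, the constructor signature, and the in-memory SQLite engine are all assumptions made for this sketch.

```python
from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()


class LibraryFolder(Base):
    # Hypothetical, simplified stand-in for the real Galaxy model.
    __tablename__ = "library_folder"
    id = Column(Integer, primary_key=True)
    name = Column(String)


class BaseRepository:
    """Holds data access code shared by all repositories (illustrative only)."""

    def __init__(self, session, model_class):
        self.session = session
        self.model_class = model_class

    def get(self, primary_key):
        # The single place that knows the SQLAlchemy call for a primary-key lookup.
        return self.session.get(self.model_class, primary_key)


class LibraryFolderRepository(BaseRepository):
    def __init__(self, session):
        super().__init__(session, LibraryFolder)


engine = create_engine("sqlite://")  # in-memory database, for the sketch only
Base.metadata.create_all(engine)

with Session(engine) as session:
    session.add(LibraryFolder(id=1, name="root"))
    session.commit()
    folder = LibraryFolderRepository(session).get(1)
    folder_name = folder.name
    missing = LibraryFolderRepository(session).get(2)  # None when no row matches
```

Note that get() still returns a regular mapped object, so code that expects ORM instances keeps working with the result.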

Can you elaborate some more on this? I may not get it, but for instance, replacing parent_folder = trans.sa_session.query(trans.app.model.LibraryFolder).get(parent_id) with parent_folder = LibraryFolderRepository(trans.sa_session).get(parent_id) does not look more efficient, and I don't think it is something you'd do if you're familiar with SQLAlchemy (unless that's a documented new pattern?).

It seems that this would also be problematic if you're doing further queries and manipulations where you're expecting ORM objects.

Sure! session.query(Foo).get(pkey) is SQLAlchemy-specific code. It's low level and it's tightly coupled with SA's API. When that API changes (and it did), or when we decide to use a different API (for whatever reason - like filter vs where, scalars vs. execute, etc.), we'll have to change that particular chain of method calls in every place we use it - there are many hundreds, if not thousands of those, which are also not easily grepped ("get" is so generic). With a repository, you change it in one place only. There's that.

Then there's code comprehension: while session.get(Foo, id) vs. foo_repository.get(id) is not particularly convincing, almost any other statement is: session.scalars(select(Foo).limit(1)).first() to return the first Foo is less readable than foo_repository.get_first().

If one has to write low-level SA code to do any kind of data access, that makes learning low-level SA a prerequisite for writing any code that interacts with the db. Besides, being familiar with SQLAlchemy doesn't necessarily guarantee that, for example, one would know to use limit(1) in the case above, or remember to do that. In other words, in my opinion, wrapping library-level data access code in a repository abstraction makes the code both more maintainable (by far!) and more comprehensible.

That said, you are absolutely right about the challenge of more complex queries and manipulations: we certainly don't want to rewrite parts of SA's API! My hope is that we'll be able to tuck away the vast majority of such low level code into repositories, and eliminate a ton of duplication in the process. But yes, it's possible that there will be some logic that is too complex to be abstracted away without adding a lot of conceptual overhead - if that's the case, we'll leave such code as is, of course. I'm working my way up from trivial to simple (those are duplicated all over the code base) - to more complex, so by the time we have complex db logic to untangle, there'll be a ton of improvement done already.

Data access repositories are a known pattern. I'll provide a more detailed description as it relates to our code base in the PR description. Essentially, it boils down to eliminating duplication and encapsulating low level code/logic in objects and methods with names that describe what is being done/returned (e.g. foo repo returning all deleted bars) as opposed to how it is implemented (e.g. session/scalars/select/first/etc...).

tuck away the vast majority of such low level code into repositories

that IMO is not the right direction. I am -0.9 on this change, sorry.

That gives you much tighter type checking

to me these are highly questionable redundant annotations, given that everything returned through the session already has the proper type:

# (variable) users1: Sequence[User]
users1 = session.scalars(select(User)).all()

# (variable) user: User
user = session.query(User).one()

# (variable) user_iter: Iterator[User]
user_iter = iter(session.scalars(select(User)))

see https://docs.sqlalchemy.org/en/20/changelog/whatsnew_20.html#sql-expression-typing-examples

session.execute(select(Foo)).scalar_one()
foo_repo.one()

that's not always going to be what we want for something called .one(), so why not roll with the upstream syntax? It's not like the sqlalchemy people hate good syntax / UX.
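To make the concern about a method named .one() more concrete, here is a small sketch of what the upstream scalar_one() call does for zero, one, and several rows. The Foo model and the describe_scalar_one helper are hypothetical and exist only for this illustration.

```python
from sqlalchemy import Column, Integer, create_engine, select
from sqlalchemy.exc import MultipleResultsFound, NoResultFound
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()


class Foo(Base):
    # Hypothetical model used only for this illustration.
    __tablename__ = "foo"
    id = Column(Integer, primary_key=True)


def describe_scalar_one(session):
    """Report how scalar_one() behaves for the current table contents."""
    try:
        session.execute(select(Foo)).scalar_one()
    except NoResultFound:
        return "no rows"
    except MultipleResultsFound:
        return "multiple rows"
    return "exactly one row"


engine = create_engine("sqlite://")  # in-memory database, for the sketch only
Base.metadata.create_all(engine)

outcomes = []
with Session(engine) as session:
    outcomes.append(describe_scalar_one(session))  # empty table
    for _ in range(2):
        session.add(Foo())
        session.commit()
        outcomes.append(describe_scalar_one(session))  # one row, then two rows
```

A wrapper called one() has to pick one of these behaviours (raise, return None, return the first row), which is the ambiguity raised above.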

to me these are highly questionable redundant annotations, given that everything returned through the session already has the proper type:

Yes, very nice indeed - it appears SA already does the right thing! Which means we can simplify repository code and get rid of that redundant typing!

that's not always going to be what we want for something called .one(), so why not roll with the upstream syntax? It's not like the sqlalchemy people hate good syntax / UX.

I'll have the data to justify this (or not justify it) shortly - I'm not there yet. It's been my impression so far that there's a sufficient amount of repetitive and fairly detailed SA code that can be easily factored out into a repo class without hiding proper types and without the need to invent any new syntax, etc.

The main goal of this PR is updating the SA code in our code base to SA 2.0. The repos came as a natural addition, after I had copied and pasted the same SA code a sufficient number of times. However, the PR is still primarily about updating the SA code. The repos are, actually, making the process easier (having FooRepository.whatever is like placing a check mark next to the fixed code block). Most importantly, this is not an irreversible change: replacing FooRepo.get_whatever with the corresponding SA code would be a trivial sed exercise. So, in the end, if it appears the layer isn't justified, we can just remove it - the PR will still update the SA code to 2.0.

Here are just a few examples:

These are not complicated examples IMO; this is exactly the simple kind of code.

Something like get_with_filter_order_by_hid is indeed code that I think belongs in the model layer and should be tested in isolation. I think, though, that just a static method on the HDA class, or a function in that module that took in the sa_session along with the rest of the arguments you've created and did the work directly, would be simpler and feel more "python" to me.

Yes, you're right - those are quite simple. I'll move on to more complex ones - I hope I'll find better examples.

The get_with_filter_order_by_hid - bad approach altogether: I'm doing exactly what we shouldn't do - taking perfectly normal SA syntax and restating the same imperative logic as an ugly method name which doesn't improve anything beyond reducing duplicate code. Just an iteration - so please ignore.

I think I have a better option - just pushed. The method name, I think, should say what is being fetched/done. The repo module (or some module within the model) contains all the low level SA code that details how it's done. See repository.hda and webapps.galaxy.services.history.py. Naturally, there will be a lot of similar methods with different names - that can be then refactored in the repositories to eliminate any duplicate constructs.

jdavcs added kind/refactoring cleanup or refactoring of existing code, no functional changes area/database Galaxy's database or data access layer labels Jul 21, 2023

jdavcs added this to the 23.2 milestone Jul 21, 2023

jdavcs force-pushed the dev_sa20_fix14 branch from 1f86cad to 6667144 Compare July 21, 2023 21:13

jdavcs force-pushed the dev_sa20_fix14 branch from d4342c4 to dce5ab7 Compare August 7, 2023 18:56

jdavcs mentioned this pull request Aug 7, 2023

Move database access code out of galaxy.util #16526

Merged

4 tasks

jdavcs force-pushed the dev_sa20_fix14 branch 10 times, most recently from 37e9708 to 57eecc4 Compare August 13, 2023 20:51

jdavcs changed the title ~~[WIP] Towards SQLAlchemy 2.0 (upgrades to SA ORM usage in /lib)~~ [WIP] Data access abstraction layer + SQLAlchemy 2.0 upgrades to ORM usage in /lib Aug 13, 2023

jdavcs force-pushed the dev_sa20_fix14 branch 2 times, most recently from 9e43f2a to 84b9eed Compare August 13, 2023 21:03

This was referenced Aug 14, 2023

Remove unnecessary check: item cannot be None #16550

Merged

Rename to_dict to populate in FormDefintion to fix bug #16553

Merged

jdavcs force-pushed the dev_sa20_fix14 branch 7 times, most recently from 340cc91 to b545931 Compare August 16, 2023 22:41

mvdbeek reviewed Aug 17, 2023

jdavcs force-pushed the dev_sa20_fix14 branch from b545931 to 7232627 Compare August 17, 2023 14:01

jdavcs and others added 26 commits September 25, 2023 09:23

Fix SA2.0 ORM usage in galaxy.celery

5547675

Fix SA2.0 ORM usage in galaxy.quota

9bfc041

Fix SA2.0 ORM usage in galaxy.security

eeba6d6

Fix SA2.0 ORM usage in galaxy.queue_worker

3250ed8

Fix SA2.0 ORM usage in galaxy.workflow

19a1f90

Fix SA2.0 ORM usage in galaxy.tools

41fa1cb

Fix SA2.0 ORM usage in managers.cloud

efb0977

Fix SA2.0 ORM usage in managers.dbkeys

efe5224

Fix SA2.0 ORM usage in galaxy.webapps.base; refactor

0fd6aec

Drop unused parameters and logic from select methods in legacy controller.

Fix SA2.0 ORM usage in managers.group_roles

ca63bc8

Fix SA2.0 ORM usage in managers.group_users

1a665c3

Fix SA2.0 ORM usage in managers.groups

0e4df32

Move data access method to managers.users

Fix SA2.0 ORM usage in galaxy.webapps/reports [partially]

2084da7

Fix SA2.0 ORM usage in managers.api_keys; refactor

09c2efc

Replace python iteration with SA batch update

Fix SA2.0 ORM usage in galaxy/tool_shed/util

f355a19

Fix SA2.0 ORM usage in in galaxy.jobs [partially]

9e92c70

Move data access method into manager (get_jobs_to_check_at_startup)

Fix SA2.0 ORM usage in webapps.galaxy.controllers.history

8581d05

Fix SA2.0 ORM usage in webapps.galaxy.services

de1321b

Move data access method to managers.quotas (get_quotas)
Move data access method to managers.histories (get_len_files_by_history)
Move data access method to managers.groups (get_group_by_name)

Refactor get_user_by_username, get_user_by_email; use across code base

7a06b00

Fix SA2.0 ORM usage in galaxy.visualization

f5d6578

Update lib/galaxy/tools/parameters/basic.py

6da8065

Co-authored-by: Marius van den Beek <[email protected]>

Update lib/galaxy/webapps/galaxy/controllers/history.py

0b1cccd

Co-authored-by: Marius van den Beek <[email protected]>

Move get_quotas into services

1da8b1d

Raise exception if no Library found in api method

23aa506

Move get_fasta_hdas_by_history into services

4b6b85e

Move method into User model class

b57d6ed

jdavcs force-pushed the dev_sa20_fix14 branch from ee5623a to b57d6ed Compare September 25, 2023 13:23

Check if user is not null before accessing user attribute

4256d87

mvdbeek merged commit 815d8b7 into galaxyproject:dev Sep 26, 2023
42 checks passed

jdavcs mentioned this pull request Jan 13, 2024

[23.2] Rollback invalidated transaction #17280

Merged

4 tasks


Review comments of Aug 17, 2023, by author, in page order: mvdbeek, jdavcs, mvdbeek, jdavcs, mvdbeek, mvdbeek, jdavcs, jmchilton, jmchilton, jdavcs
