Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SQLAlchemy 2.0 upgrades to ORM usage in /lib #16434

Merged
merged 29 commits into from
Sep 26, 2023

Conversation

jdavcs
Copy link
Member

@jdavcs jdavcs commented Jul 21, 2023

SA 2.0 ORM usage to-do:

  • lib/galaxy/actions
  • lib/galaxy/celery
  • lib/galaxy/datatypes
  • lib/galaxy/jobs
  • lib/galaxy/managers
  • lib/galaxy/metadata
  • lib/galaxy/model
  • lib/galaxy/queue_worker
  • lib/galaxy/quota
  • lib/galaxy/security
  • lib/galaxy/tool_shed
  • lib/galaxy/util
  • lib/galaxy/visualization
  • lib/galaxy/web
  • lib/galaxy/webapps/base
  • lib/galaxy/webapps/galaxy/api
  • lib/galaxy/webapps/galaxy/controllers
  • lib/galaxy/webapps/galaxy/services
  • lib/galaxy/webapps/reports (except enable_eagerloads)
  • lib/galaxy/workflow
  • lib/galaxy_test
  • lib/tool_shed

Requires more work:

  • lib/galaxy/job_execution (requires addressing SessionlessContext, which implements a small subset of SQLAlchemy session's API)
  • lib/galaxy/tools (same as above: SessionlessContext)
  • lib/galaxy/authnz (requires rewriting all the authnz unit tests, because they are all based on a mocked subset of SQLAlchemy)
  • lib/galaxy/webapps/reports: handle .query.enable_eagerloads()
  • Fix the FormDefinition.to_dict bug (ref)

Ref: #12541

How to test the changes?

(Select all options that apply)

  • I've included appropriate automated tests.
  • This is a refactoring of components with existing test coverage.
  • Instructions for manual testing are as follows:
    1. [add testing steps and prerequisites here if you didn't write automated tests covering all your changes]

License

  • I agree to license these and all my past contributions to the core galaxy codebase under the MIT license.

@jdavcs jdavcs added kind/refactoring cleanup or refactoring of existing code, no functional changes area/database Galaxy's database or data access layer labels Jul 21, 2023
@jdavcs jdavcs added this to the 23.2 milestone Jul 21, 2023
@jdavcs jdavcs force-pushed the dev_sa20_fix14 branch 10 times, most recently from 37e9708 to 57eecc4 Compare August 13, 2023 20:51
@jdavcs jdavcs changed the title [WIP] Towards SQLAlchemy 2.0 (upgrades to SA ORM usage in /lib) [WIP] Data access abstraction layer + SQLAlchemy 2.0 upgrades to ORM usage in /lib Aug 13, 2023
@jdavcs jdavcs force-pushed the dev_sa20_fix14 branch 2 times, most recently from 9e43f2a to 84b9eed Compare August 13, 2023 21:03
@jdavcs jdavcs force-pushed the dev_sa20_fix14 branch 7 times, most recently from 340cc91 to b545931 Compare August 16, 2023 22:41
MappedType = Base


class BaseRepository:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is this solving ? It looks like a lot of overhead

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You mean the class BaseRepository? It's just a base class for all repositories which contains common functionality. So, instead of having self.session.get(self.Foo, primary_key) in 100+ repository classes, you have it in one.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you elaborate some more on this ? I may not get it, but for instance
parent_folder = trans.sa_session.query(trans.app.model.LibraryFolder).get(parent_id) that you're replacing with parent_folder = LibraryFolderRepository(trans.sa_session).get(parent_id) does not look more efficient and I don't think is something you'd do if you're familiar with sqlalchemy (unless that's a documented new pattern ?).

It seems that this would also be problematic if you're doing further queries and mainpulations where you're expecting ORM objects.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure! session.query(Foo).get(pkey) is SQLAlchemy-specific code. It's low level and it's tightly coupled with SA's API. When that API changes (and it did), or when we decide to use a different API (for whatever reason - like filter vs where, scalars vs. execute, etc.), we'll have to change that particular chain of method calls in every place we use it - there are many hundreds, if not thousands of those, which are also not easily grepped ("get" is so generic). With a repository, you change it in one place only. There's that.

Then there's code comprehension: whereas the session.get(id) vs. foo_repository.get(id) is not particularly convincing, any other statement is: session.scalars(select(Foo).limit(1)).first() to return the first Foo is less readable than foo_repository.get_first().

If one has to write low-level SA code to do any kind of data access, that makes learning low-level SA a prerequisite for writing any code that interacts with the db. Besides, being familiar with SQLAlchemy doesn't necessarily guarantee that, for example, one would know to use limit(1) in the case above, or remember to do that. In other words, in my opinion, wrapping library-level data access code in a repository abstraction makes the code both more maintainable (by far!) and more comprehensible.

That said, you are absolutely right about the challenge of more complex queries and manipulations: we certainly don't want to rewrite parts of SA's API! My hope is that we'll be able to tuck away the vast majority of such low level code into repositories, and eliminate a ton of duplication in the process. But yes, it's possible that there will be some logic that is too complex to be abstracted away without adding a lot of conceptual overhead - if that's the case, we'll leave such code as is, of course. I'm working my way up from trivial to simple (those are duplicated all over the code base) - to more complex, so by the time we have complex db logic to untangle, there'll be a ton of improvement done already.

Data access repositories is a pattern. I'll provide a more detailed description as it relates to our code base in the PR description. Essentially, it boils down to eliminating duplication and encapsulating low level code/logic in objects and methods with names that describe what is being done/returned (e.g. foo repo returning all deleted bars) as opposed to how it is implemented (e.g. session/scalars/select/first/etc...)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tuck away the vast majority of such low level code into repositories

that IMO is not the right direction. I am -0.9 on this change, sorry.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That gives you much tighter type checking

to me these are highly questionable redundant annotations, given that everything returned through the session already has the proper type:

# (variable) users1: Sequence[User]
users1 = session.scalars(select(User)).all()

# (variable) user: User
user = session.query(User).one()

# (variable) user_iter: Iterator[User]
user_iter = iter(session.scalars(select(User)))

see https://docs.sqlalchemy.org/en/20/changelog/whatsnew_20.html#sql-expression-typing-examples

session.execute(select(Foo)).scalar_one() foo_repo.one()

that's not always going to be what we want for something called .one(), so why not roll with the upstream syntax ? It's not like the sqlalchemy people hate good syntax / UX.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to me these are highly questionable redundant annotations, given that everything returned through the session already has the proper type:

Yes, very nice indeed - it appears SA already does the right thing! Which means we can simplify repository code and get rid of that redundant typing!

that's not always going to be what we want for something called .one(), so why not roll with the upstream syntax ? It's not like the sqlalchemy people hate good syntax / UX.

I'll have the data to justify this (or not justify it) shortly - I'm not there yet. It's been my impression so far that there's a sufficient amount of repetitive and fairly detailed SA code that can be easily factored out into a repo class without hiding proper types and without the need to invent any new syntax, etc.

The main goal of this PR is updating the SA code in our code base to SA 2.0. The repos came as a natural addition, after I had copied and pasted the same SA code a sufficient number of times. However, the PR is still primarily about updating the SA code. The repos are, actually, making the process easier (having FooRepository.whatever is like placing a check mark next to the fixed code block). Most importantly, this is not an irreversible change: replacing FooRepo.get_whatever with the corresponding SA code would be a trivial sed exercise. So, in the end, if it appears the layer isn't justified, we can just remove it - the PR will still update the SA code to 2.0.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here are just a few examples:

These are not complicated examples IMO - these are exactly the simple code.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something like get_with_filter_order_by_hid - is indeed code that I think belongs in the model layer and should be tested in isolation. I think though just a static method on the HDA class or a function in that module that took in the sa_session along with the rest of the arguments you've created and just did the work directly would be simpler and feel more "python" to me.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, you're right - those are quite simple. I'll move on to more complex ones - I hope I'll find better examples.

The get_with_filter_order_by_hid - bad approach altogether: I'm doing exactly what we shouldn't do - taking perfectly normal SA syntax and restating the same imperative logic as an ugly method name which doesn't improve anything beyond reducing duplicate code. Just an iteration - so please ignore.

I think I have a better option - just pushed. The method name, I think, should say what is being fetched/done. The repo module (or some module within the model) contains all the low level SA code that details how it's done. See repository.hda and webapps.galaxy.services.history.py. Naturally, there will be a lot of similar methods with different names - that can be then refactored in the repositories to eliminate any duplicate constructs.

jdavcs and others added 26 commits September 25, 2023 09:23
Drop unused parameters and logic from select methods in legacy controller.
Move data access method to managers.users
Replace python iteration with SA batch update
Move data access method into manager (get_jobs_to_check_at_startup)
Move data access method to managers.quotas (get_quotas)
Move data access method to managers.histories (get_len_files_by_history)
Move data access method to managers.groups (get_group_by_name)
@mvdbeek mvdbeek merged commit 815d8b7 into galaxyproject:dev Sep 26, 2023
42 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/database Galaxy's database or data access layer kind/refactoring cleanup or refactoring of existing code, no functional changes
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants