
[24.0] Fix source history update_time being updated when importing a public history #17728

Merged (8 commits, Mar 21, 2024)

Conversation

@jmchilton (Member) commented Mar 14, 2024

Includes a failing test for history copying updating the history update_time.

I believe I've seen this behavior for a long time, but I'm happy to work on a fix for it, and a test is a good place to start.

Fixes #17702

How to test the changes?

(Select all options that apply)

  • I've included appropriate automated tests.
  • This is a refactoring of components with existing test coverage.
  • Instructions for manual testing are as follows:
    1. [add testing steps and prerequisites here if you didn't write automated tests covering all your changes]

License

  • I agree to license these and all my past contributions to the core galaxy codebase under the MIT license.

@jdavcs (Member) commented Mar 20, 2024

I think the update bug does not appear at the model level because the functionality is defined correctly there. I have another unit test that verifies this (from a draft, post-SA2.0 branch): ce17483. It checks that history.update_time is updated after an HDA in the history is created or updated, but not after an HDA is copied - i.e., the expected behavior.

@mvdbeek (Member) commented Mar 20, 2024

e3f135d should fix it. I debugged this by checking at which point the history update_time changed, which was right after creating the HDA. I then removed the source HDA via session.expunge and committed the session, which emitted a SQLAlchemy warning that the copied_from_history_dataset_association backref could not be followed. Setting just the id instead of adding the whole instance fixed it.

@mvdbeek mvdbeek changed the title [24.0] Failing test for history copying updating the history update_time. [24.0] Fix source history update_time being updated when importing a public history Mar 20, 2024
@mvdbeek (Member) commented Mar 20, 2024

well, that's fun, the copied_from_history_dataset_association_id just disappears after commit ... 😆

@martenson (Member):
sidenote: We might be doing the same thing with copied_from_library_dataset_dataset_association and others?

@mvdbeek (Member) commented Mar 20, 2024

[Screenshot 2024-03-20 at 19 35 42]

... what is this ???

@jdavcs (Member) commented Mar 20, 2024

Most likely, when you changed the arg to copy, it broke something. For example, you are no longer assigning copied_from_history_dataset_association in the HDA constructor. In general, there should be nothing wrong with passing the HDA rather than assigning the hda.id: doing self.foo = bar should not add bar to session.dirty, I think. So I don't think that's the root cause here.

@mvdbeek (Member) commented Mar 20, 2024

It is though, see #17728 (comment)

@jdavcs (Member) commented Mar 20, 2024

Yes, I've read that comment. What adds the original HDA to session.dirty is the metadata. As soon as you instantiate the new HDA (inside the HDA's copy method), the original HDA (i.e., self) gains 2 new attributes:

'_metadata': {'data_lines': 1, 'dbkey': ['?']},
'_metadata_collection': <galaxy.model.metadata.MetadataCollection object at 0x7f4814d5a5b0>,

I suppose that's what makes it dirty. Where they come from I don't know yet - just got there, looking now.

@mvdbeek (Member) commented Mar 20, 2024

That still doesn't make sense, since removing copied_from_history_dataset_association=self fixes the test, and we're still copying the metadata

@mvdbeek (Member) commented Mar 20, 2024

And dirty is just a heuristic that doesn't necessarily mean anything is getting updated; see https://docs.sqlalchemy.org/en/20/orm/session_api.html#sqlalchemy.orm.Session.dirty
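For illustration, here is a minimal, self-contained example of that distinction (a toy model, not Galaxy code): an object lands in Session.dirty as soon as an attribute set event fires, while Session.is_modified() checks whether there is an actual net change to flush.

```python
from sqlalchemy import create_engine
from sqlalchemy.orm import DeclarativeBase, Mapped, Session, mapped_column


class Base(DeclarativeBase):
    pass


class Item(Base):
    __tablename__ = "item"
    id: Mapped[int] = mapped_column(primary_key=True)
    name: Mapped[str]


engine = create_engine("sqlite://")
Base.metadata.create_all(engine)

with Session(engine) as session:
    item = Item(name="a")
    session.add(item)
    session.commit()

    item.name = item.name  # a set event fires even though the value is unchanged
    print(item in session.dirty)      # expected: True  (optimistic bookkeeping)
    print(session.is_modified(item))  # expected: False (no net change to flush)
```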

@mvdbeek (Member) commented Mar 20, 2024

This fixes the unit test:

diff --git a/lib/galaxy/model/__init__.py b/lib/galaxy/model/__init__.py
index 3b8dc6fa2d..53d238ee7a 100644
--- a/lib/galaxy/model/__init__.py
+++ b/lib/galaxy/model/__init__.py
@@ -4985,7 +4985,6 @@ class HistoryDatasetAssociation(DatasetInstance, HasTags, Dictifiable, UsesAnnot
         self,
         hid=None,
         history=None,
-        copied_from_history_dataset_association=None,
         copied_from_library_dataset_dataset_association=None,
         sa_session=None,
         **kwd,
@@ -4999,7 +4998,6 @@ class HistoryDatasetAssociation(DatasetInstance, HasTags, Dictifiable, UsesAnnot
         self.hid = hid
         # Relationships
         self.history = history
-        self.copied_from_history_dataset_association = copied_from_history_dataset_association
         self.copied_from_library_dataset_dataset_association = copied_from_library_dataset_dataset_association
 
     def __strict_check_before_flush__(self):
@@ -5100,10 +5098,10 @@ class HistoryDatasetAssociation(DatasetInstance, HasTags, Dictifiable, UsesAnnot
         hda.purged = self.purged
 
         hda.copy_tags_to(copy_tags)
-        object_session(self).add(hda)
+        session = object_session(self)
+        session.add(hda)
         hda.metadata = self.metadata
         if flush:
-            session = object_session(self)
             with transaction(session):
                 session.commit()
         return hda

which I admit is crazy ...

@jdavcs (Member) commented Mar 20, 2024

> That still doesn't make sense, since removing copied_from_history_dataset_association=self fixes the test, and we're still copying the metadata

I didn't test that with your latest commit, so copied_from_history_dataset_association is still passed to the new HDA in my test. And it appears that upon creating the copied HDA, the metadata is modified on the source HDA.

mvdbeek added 3 commits March 20, 2024 20:45

…ory_dataset_association.id

Otherwise SQLAlchemy will think `None` is the value we've set.

A lesson we seem to have already learnt with https://github.com/galaxyproject/galaxy/blob/429c63e8563f98237770e43d94b25fdacf4ede58/lib/galaxy/model/__init__.py#L5785-L5788

Was added in galaxyproject@75ba9cf, and I'm reverting this part back to the original code, which now passes since I've added the correct type hint to DatasetInstance.
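The pitfall that commit message describes can be sketched outside Galaxy. Below is a self-contained toy reproduction as I understand it (the Item class and its attribute names are illustrative, not Galaxy's model): the constructor's explicit relationship assignment, even of None, is treated as an intentional change, so the directly-set foreign key id gets blanked out when the session flushes.

```python
from typing import Optional

from sqlalchemy import ForeignKey, create_engine
from sqlalchemy.orm import DeclarativeBase, Mapped, Session, mapped_column, relationship


class Base(DeclarativeBase):
    pass


class Item(Base):
    __tablename__ = "item"
    id: Mapped[int] = mapped_column(primary_key=True)
    copied_from_id: Mapped[Optional[int]] = mapped_column(ForeignKey("item.id"))
    copied_from: Mapped[Optional["Item"]] = relationship(remote_side="Item.id")

    def __init__(self, copied_from=None):
        # Like the removed constructor parameter: an explicit assignment,
        # even of None, is recorded as a change on the relationship.
        self.copied_from = copied_from


engine = create_engine("sqlite://")
Base.metadata.create_all(engine)

with Session(engine) as session:
    source = Item()
    session.add(source)
    session.commit()

    copy = Item()                    # constructor sets copied_from = None
    copy.copied_from_id = source.id  # then only the foreign key id is set
    session.add(copy)
    session.commit()

    print(copy.copied_from_id)  # expected: None -- the id "disappears" at flush
```

Dropping the constructor parameter, as the diff above does, removes the explicit None assignment, so the directly-set id survives the flush.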
@jdavcs (Member) commented Mar 21, 2024

I've investigated this and I think I now understand what's going on.
When we do new_foo = foo.copy(), foo should not change. In our case, the original HDA does change, which leads to an UPDATE statement being sent to the database, which fires the trigger and updates the update_time of the history to which the source HDA belongs. Thus, the updated history is a symptom of a deeper problem.

The root cause is that something in the source HDA changes when nothing should. Here's what happens:

  1. History.copy calls the copy method of each HDA in that history
  2. HDA.copy creates a new instance of an HDA, passing self.dbkey as the dbkey argument.
  3. HDA.dbkey is a property; when it's accessed, it accesses the metadata property.
  4. If the HDA object doesn't have its metadata initialized, it will instantiate a MetadataCollection object, which in its constructor accesses the _metadata attribute of the HDA, which is the key for the metadata Column object. This adds a non-empty _metadata (and _metadata_collection) attr to the HDA, which, via SQLAlchemy, moves the HDA to the session.dirty set.
  5. Then we flush. A flush triggers the before_flush event handler, which loops over the objects in session.dirty and calls the __create_version__ method on each HDA.
  6. In __create_version__, we collect any modified values and unconditionally set or increment the version of the HDA. That new version value is what causes the HDA to be updated in the database, which in turn updates the history.

I think the solution is not to update the HDA version if nothing in the HDA has changed. If so, the implementation might be as simple as this: jdavcs@518cb58.
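The idea boils down to something like the hypothetical guard below (an illustrative sketch, not necessarily what jdavcs@518cb58 actually does); it would be called from the before_flush handler described in step 5, and hda.version stands for the HDA's version column.

```python
from sqlalchemy.orm import Session


def maybe_create_version(hda, session: Session) -> None:
    """Hypothetical guard: only bump the version when there is a real net change."""
    # An object can sit in session.dirty without any net change; bumping the
    # version in that case is itself what forces the UPDATE (and the trigger).
    if not session.is_modified(hda, include_collections=False):
        return
    hda.version = (hda.version or 1) + 1
```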

Does this make sense?

@mvdbeek (Member) commented Mar 21, 2024

It's a good commit; however, on this branch the source HDA isn't marked dirty at all, so accessing HDA.dbkey isn't automatically marking that HDA as dirty - there's still some step X involved. I'll pull in your commit.

@jmchilton jmchilton marked this pull request as ready for review March 21, 2024 14:31
@github-actions github-actions bot added this to the 24.1 milestone Mar 21, 2024
@jdavcs (Member) commented Mar 21, 2024

Yep, found step X. Accessing hda._metadata does add the 2 attributes to hda.__dict__, but that's not what marks it as dirty. That happens in the HDA constructor when we do this:

new_hda.copied_from_history_dataset_association = source_hda

I don't know exactly where that happens inside SA, but it does happen, and it's due to the back_populates ties between the 2 relationships defined on the HDA model. I don't think this is incorrect. In this branch it doesn't happen because we bypass the relationship (we no longer make that assignment) and instead assign the id directly. And bypassing this step (and all the ORM logic that it triggered) is a good thing, I think. So, 2 fixes in 1 PR!
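To make that mechanism concrete, here is a self-contained toy sketch (the HDA class below is a stand-in, not Galaxy's model): assigning the relationship lets back_populates register the new object on the source's collection, so the persistent source shows up in session.dirty, whereas assigning only the foreign key id leaves it alone.

```python
from typing import List, Optional

from sqlalchemy import ForeignKey, create_engine
from sqlalchemy.orm import DeclarativeBase, Mapped, Session, mapped_column, relationship


class Base(DeclarativeBase):
    pass


class HDA(Base):
    __tablename__ = "hda"
    id: Mapped[int] = mapped_column(primary_key=True)
    copied_from_id: Mapped[Optional[int]] = mapped_column(ForeignKey("hda.id"))
    copied_from: Mapped[Optional["HDA"]] = relationship(
        back_populates="copies", remote_side="HDA.id"
    )
    copies: Mapped[List["HDA"]] = relationship(back_populates="copied_from")


engine = create_engine("sqlite://")
Base.metadata.create_all(engine)

with Session(engine) as session:
    source = HDA()
    session.add(source)
    session.commit()

    # Assigning the relationship: back_populates also registers the new
    # object on source.copies, so the persistent source is flagged.
    copy_a = HDA(copied_from=source)
    session.add(copy_a)
    print(source in session.dirty)  # expected: True

    session.rollback()

    # Assigning only the foreign key id bypasses the relationship machinery
    # and leaves the source object untouched.
    copy_b = HDA(copied_from_id=source.id)
    session.add(copy_b)
    print(source in session.dirty)  # expected: False
```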

@jdavcs jdavcs merged commit 7af9c3d into galaxyproject:release_24.0 Mar 21, 2024
53 checks passed
@jdavcs jdavcs modified the milestones: 24.1, 24.0 Mar 25, 2024