[24.0] Simplify nested collection joins #17817

@mvdbeek mvdbeek commented Mar 22, 2024

Simplifies nested collection joins by joining via the dataset collection element, bringing in the dataset collection only when we're filtering against the collection.

With this change, the collection.dataset_instances property is much faster for a list:list:list collection with a single item (457 ms before vs. 6 ms after).

Before: https://explain.dalibo.com/plan/d9gd3gfe5b104343
After: https://explain.dalibo.com/plan/19he141a5e21afb4

The query changes from

SELECT history_dataset_association.id, history_dataset_association.history_id, history_dataset_association.dataset_id, history_dataset_association.create_time, history_dataset_association.update_time, history_dataset_association.state, history_dataset_association.copied_from_history_dataset_association_id, history_dataset_association.copied_from_library_dataset_dataset_association_id, history_dataset_association.name, history_dataset_association.info, history_dataset_association.blurb, history_dataset_association.peek, history_dataset_association.tool_version, history_dataset_association.extension, history_dataset_association.metadata_deferred, history_dataset_association.parent_id, history_dataset_association.designation, history_dataset_association.deleted, history_dataset_association.visible, history_dataset_association.extended_metadata_id, history_dataset_association.version, history_dataset_association.hid, history_dataset_association.purged, history_dataset_association.validated_state, history_dataset_association.validated_state_message, history_dataset_association.hidden_beneath_collection_instance_id, dataset_1.id AS id_1, dataset_1.job_id, dataset_1.create_time AS create_time_1, dataset_1.update_time AS update_time_1, dataset_1.state AS state_1, dataset_1.deleted AS deleted_1, dataset_1.purged AS purged_1, dataset_1.purgable, dataset_1.object_store_id, dataset_1.external_filename, dataset_1._extra_files_path, dataset_1.created_from_basename, dataset_1.file_size, dataset_1.total_size, dataset_1.uuid
FROM dataset_collection AS dataset_collection_1
JOIN dataset_collection_element AS dataset_collection_element_1 ON dataset_collection_element_1.dataset_collection_id = dataset_collection_1.id
JOIN dataset_collection AS dataset_collection_2 ON dataset_collection_2.id = dataset_collection_element_1.child_collection_id AND dataset_collection_element_1.dataset_collection_id = dataset_collection_1.id
LEFT OUTER JOIN dataset_collection_element AS dataset_collection_element_2 ON dataset_collection_element_2.dataset_collection_id = dataset_collection_2.id
JOIN dataset_collection AS dataset_collection_3 ON dataset_collection_3.id = dataset_collection_element_2.child_collection_id AND dataset_collection_element_2.dataset_collection_id = dataset_collection_2.id
LEFT OUTER JOIN dataset_collection_element AS dataset_collection_element_3 ON dataset_collection_element_3.dataset_collection_id = dataset_collection_3.id
JOIN history_dataset_association ON history_dataset_association.id = dataset_collection_element_3.hda_id
JOIN dataset ON dataset.id = history_dataset_association.dataset_id
LEFT OUTER JOIN dataset AS dataset_1 ON dataset_1.id = history_dataset_association.dataset_id
WHERE dataset_collection_1.id = 6343673

to

SELECT history_dataset_association.id, history_dataset_association.history_id, history_dataset_association.dataset_id, history_dataset_association.create_time, history_dataset_association.update_time, history_dataset_association.state, history_dataset_association.copied_from_history_dataset_association_id, history_dataset_association.copied_from_library_dataset_dataset_association_id, history_dataset_association.name, history_dataset_association.info, history_dataset_association.blurb, history_dataset_association.peek, history_dataset_association.tool_version, history_dataset_association.extension, history_dataset_association.metadata_deferred, history_dataset_association.parent_id, history_dataset_association.designation, history_dataset_association.deleted, history_dataset_association.visible, history_dataset_association.extended_metadata_id, history_dataset_association.version, history_dataset_association.hid, history_dataset_association.purged, history_dataset_association.validated_state, history_dataset_association.validated_state_message, history_dataset_association.hidden_beneath_collection_instance_id, dataset_1.id AS id_1, dataset_1.job_id, dataset_1.create_time AS create_time_1, dataset_1.update_time AS update_time_1, dataset_1.state AS state_1, dataset_1.deleted AS deleted_1, dataset_1.purged AS purged_1, dataset_1.purgable, dataset_1.object_store_id, dataset_1.external_filename, dataset_1._extra_files_path, dataset_1.created_from_basename, dataset_1.file_size, dataset_1.total_size, dataset_1.uuid
FROM dataset_collection AS dataset_collection_1
JOIN dataset_collection_element AS dataset_collection_element_1 ON dataset_collection_element_1.dataset_collection_id = dataset_collection_1.id
JOIN dataset_collection_element AS dataset_collection_element_2 ON dataset_collection_element_2.dataset_collection_id = dataset_collection_element_1.child_collection_id
JOIN dataset_collection_element AS dataset_collection_element_3 ON dataset_collection_element_3.dataset_collection_id = dataset_collection_element_2.child_collection_id
JOIN history_dataset_association ON history_dataset_association.id = dataset_collection_element_3.hda_id
LEFT OUTER JOIN dataset AS dataset_1 ON dataset_1.id = history_dataset_association.dataset_id
WHERE dataset_collection_1.id = 6343673
ORDER BY dataset_collection_element_1.element_index, dataset_collection_element_2.element_index, dataset_collection_element_3.element_index
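
To make the difference between the two join shapes above easier to see, here is a minimal sketch that runs both against an in-memory SQLite database. The schema is a hypothetical, heavily reduced stand-in for the Galaxy tables (only the columns the joins touch), the short aliases (c1, e1, hda) are mine, and the data is a single-item list:list:list collection. It only illustrates that both shapes reach the same dataset; it says nothing about the PostgreSQL timings reported here.

```python
import sqlite3

# Hypothetical, reduced schema: only the columns used by the joins.
SCHEMA = """
CREATE TABLE dataset_collection (id INTEGER PRIMARY KEY);
CREATE TABLE history_dataset_association (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dataset_collection_element (
    id INTEGER PRIMARY KEY,
    dataset_collection_id INTEGER,
    child_collection_id INTEGER,
    hda_id INTEGER,
    element_index INTEGER
);
"""

# Old shape: every nesting level joins dataset_collection and then its elements.
OLD_QUERY = """
SELECT hda.name
FROM dataset_collection AS c1
JOIN dataset_collection_element AS e1 ON e1.dataset_collection_id = c1.id
JOIN dataset_collection AS c2 ON c2.id = e1.child_collection_id
LEFT OUTER JOIN dataset_collection_element AS e2 ON e2.dataset_collection_id = c2.id
JOIN dataset_collection AS c3 ON c3.id = e2.child_collection_id
LEFT OUTER JOIN dataset_collection_element AS e3 ON e3.dataset_collection_id = c3.id
JOIN history_dataset_association AS hda ON hda.id = e3.hda_id
WHERE c1.id = ?
"""

# New shape: hop from element to element through child_collection_id.
NEW_QUERY = """
SELECT hda.name
FROM dataset_collection AS c1
JOIN dataset_collection_element AS e1 ON e1.dataset_collection_id = c1.id
JOIN dataset_collection_element AS e2 ON e2.dataset_collection_id = e1.child_collection_id
JOIN dataset_collection_element AS e3 ON e3.dataset_collection_id = e2.child_collection_id
JOIN history_dataset_association AS hda ON hda.id = e3.hda_id
WHERE c1.id = ?
ORDER BY e1.element_index, e2.element_index, e3.element_index
"""


def build_db():
    """Create a list:list:list collection (ids 1 -> 2 -> 3) holding one dataset."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO dataset_collection (id) VALUES (?)", [(1,), (2,), (3,)])
    conn.execute("INSERT INTO history_dataset_association (id, name) VALUES (10, 'sample.fastq')")
    conn.executemany(
        "INSERT INTO dataset_collection_element "
        "(dataset_collection_id, child_collection_id, hda_id, element_index) "
        "VALUES (?, ?, ?, ?)",
        [(1, 2, None, 0), (2, 3, None, 0), (3, None, 10, 0)],
    )
    return conn


conn = build_db()
old_rows = conn.execute(OLD_QUERY, (1,)).fetchall()
new_rows = conn.execute(NEW_QUERY, (1,)).fetchall()
print(old_rows, new_rows)
```

In the new shape the intermediate dataset_collection rows are never touched: the child_collection_id on one element is matched directly against the dataset_collection_id of the next level's elements, so each nesting level costs one join instead of two.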

How to test the changes?

(Select all options that apply)

  • I've included appropriate automated tests.
  • This is a refactoring of components with existing test coverage.

License

  • I agree to license these and all my past contributions to the core galaxy codebase under the MIT license.


mvdbeek commented Mar 22, 2024

OK, I had to turn this back into outer joins for the tests to pass, but that's only a minor increase to 9.6 ms. That's probably not worth tweaking now; this is still roughly 48 times faster than the original 457 ms.

@mvdbeek mvdbeek marked this pull request as ready for review March 22, 2024 18:59
@github-actions github-actions bot added this to the 24.1 milestone Mar 22, 2024

@jmchilton jmchilton left a comment

Amazing - very impressive!

@martenson martenson modified the milestones: 24.1, 24.0 Mar 24, 2024
@martenson martenson merged commit bb0b2ac into galaxyproject:release_24.0 Mar 24, 2024
50 checks passed