Directions to Improve Distributed Data Handling in Galaxy #11787

jmchilton commented Apr 5, 2021

Some important tangents to the distributed data question are:

  • Enabling more kinds of ObjectStores - potentially useful for distributed data, especially in a multi-user Galaxy context.
  • Scalable Interfaces for Getting Data Into Galaxy.

Once the data is in "Galaxy" (i.e. is part of an object store abstraction configured within Galaxy and tracked in Galaxy's database), the biggest question is how to make extended metadata collection robust and useful with Pulsar. This should be workable within the context of Kubernetes and outside of it.

How does distributed data work with Pulsar on Kubernetes?

Can distributed data work with Pulsar outside of Kubernetes?

Getting truly distributed data working with SLURM would vastly simplify the process of meta-scheduling on usegalaxy.org and could potentially provide important abstractions for re-working that approach for Condor flocking, Amazon Batch, etc.

galaxyproject/pulsar#250

There are, I think, a lot more open-ended questions related to working with initial datasets and getting data into Galaxy. We've spent years tackling questions about how to use data in a distributed fashion once it is in Galaxy, but we haven't really tackled dealing with remote data that isn't already in a Galaxy object store.

How can we populate Galaxy with metadata previously generated?

An important tangent here for building up those abstractions and import/export functionality is allowing model stores to work with workflow invocations (#9024). With this in place, one could imagine distributing whole workflow invocations to local clusters, where we bypass issues like caching, and then just shipping the resulting models back to the source Galaxy.

How can we use Galaxy tools without metadata? Can we do CLI generation without metadata for certain tools/workflows?

  • Extensions to the tool specification to allow simple files instead of datasets ("data"). The tool wrappers would be given just a path (perhaps with a meaningful name) instead of the rest of the metadata.
    Allow Simple "File" Inputs to Tools #11785
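
To make the distinction concrete, here is a purely illustrative sketch (the dictionary shapes are assumptions, not Galaxy's actual wrapper/template objects) of the difference between what a wrapper effectively sees for a "data" input today and what a simple "file" input might provide:

```python
# Illustrative only - not Galaxy's actual template namespace.

# Roughly what a tool has access to for a "data" input today:
dataset_input = {
    "path": "/galaxy/files/000/dataset_42.dat",
    "name": "sample1",
    "ext": "fastqsanger.gz",
    "metadata": {"dbkey": "hg38", "sequences": 125000},
}

# What a proposed simple "file" input (#11785) might provide instead -
# just a path, perhaps with a meaningful name, and none of the metadata:
file_input = {
    "path": "/staging/inputs/sample1.fastq.gz",
    "name": "sample1",
}
```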

How can we use Galaxy tools without metadata? Can we defer metadata generation until immediately prior to the job?

If we could generate command lines as part of multi-container job pods, we could probably delay metadata generation until that point also.

This is scoped in #10873 and would likely be a substantial task, but we've laid a lot of groundwork, and it would allow us to make progress on things like converting Galaxy workflows wholesale to other workflow languages.

The previous task/section would still be useful, because we could prevent metadata generation on the worker in cases where we've annotated that the tool has no use for it.

We have symlinked data for datasets; can we do something similar with remote data, in the sense that the data doesn't exist in a Galaxy object store at all?

The first big question here, I think, is whether we need to store things in the database at all, or whether we can just have a UI for collecting "files" (maybe from file source plugin URIs, maybe from a file selector, maybe from URLs) and running tools/workflows on them.

If we just allow invoking workflows and tools on URIs:

If we did this, we could break the UI and backend into separate pieces of work: the UI could initially be used with a backend step that "materializes" the selected files as datasets, until we are able to get away without that step (e.g. by implementing "dataset" creation on inputs as part of the first job that uses them - the questions tackled above).

An MVP User Story Might Include the Following Two Big Tasks

  • Implement something like the upload from tool form that selects URIs instead of producing datasets.
  • Implement options on the backend for materializing URIs before the job runs. The API should consume {src: 'uri', uri: <uri>} in addition to {src: 'hda', id: <id>} (a sketch of such a payload follows this list).
    If we combined this with Allow Simple "File" Inputs to Tools #11785, we could bypass materialization of datasets and just use files directly in this case.
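
A minimal sketch of what such a tool-execution request might look like, assuming the proposed 'uri' src were accepted alongside the existing 'hda' one (the server URL, API key, history id, and URI below are placeholders):

```python
import requests

GALAXY_URL = "https://galaxy.example.org"  # placeholder server
API_KEY = "YOUR_API_KEY"                   # placeholder key

# Hypothetical payload: a proposed {src: 'uri'} input alongside the existing
# {src: 'hda'} convention. The backend would materialize the URI as a dataset
# before the job runs (or hand the file to the tool directly, per #11785).
payload = {
    "tool_id": "cat1",
    "history_id": "f2db41e1fa331b3e",
    "inputs": {
        "input1": {"src": "uri", "uri": "gxfiles://project-data/sample1.fastq.gz"},
        "queries_0|input2": {"src": "hda", "id": "a799d38679e985db"},
    },
}

response = requests.post(
    f"{GALAXY_URL}/api/tools",
    json=payload,
    headers={"x-api-key": API_KEY},
)
response.raise_for_status()
print(response.json())
```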

Once that is done, we could start optimizing things like incorporating a collection builder (e.g. #9114) driven by the URIs and augmenting dataset sniffing to narrow the candidate files down to those compatible with a given tool/workflow execution.

If we want something like datasets in the database corresponding to the remote data:

If the database is needed and we do want something that looks like HDAs, should they be datasets or something else? Is the right approach to store a dereferenced dataset - adding a deferred column to the dataset and using the existing attached URI models (https://github.com/galaxyproject/galaxy/pull/7487/files)? If not, what does that look like?
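
As a rough illustration only (the table and column names below are assumptions, not the actual schema from #7487), the deferred-dataset-plus-URI-source shape being described might look roughly like this in SQLAlchemy:

```python
from sqlalchemy import Boolean, Column, ForeignKey, Integer, Text
from sqlalchemy.orm import declarative_base, relationship

Base = declarative_base()


class Dataset(Base):
    # Illustrative model - names are assumptions, not Galaxy's actual mapping.
    __tablename__ = "dataset"
    id = Column(Integer, primary_key=True)
    state = Column(Text)
    # Proposed flag: the data lives at a remote URI and has not yet been
    # materialized into any configured object store.
    deferred = Column(Boolean, default=False)
    sources = relationship("DatasetSource", back_populates="dataset")


class DatasetSource(Base):
    # URI source record attached to the dataset.
    __tablename__ = "dataset_source"
    id = Column(Integer, primary_key=True)
    dataset_id = Column(Integer, ForeignKey("dataset.id"))
    source_uri = Column(Text)  # e.g. gxfiles://..., https://..., drs://...
    dataset = relationship("Dataset", back_populates="sources")
```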


xref

jgoecks commented Apr 8, 2021

Thanks for starting this conversation, @jmchilton.

We have symlinked data for datasets; can we do something similar with remote data, in the sense that the data doesn't exist in a Galaxy object store at all?

This is essential for Galaxy to scale to large datasets, and all other work should flow from the assumption that Galaxy must work with remote data, from upload to metadata generation to tool execution.

As an aside, there should be an abstraction layer over remote data to access/reference/copy via URI, DRS, etc.
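
A hedged sketch of what such an abstraction layer could look like - the interface below is purely hypothetical, not an existing Galaxy API; concrete implementations might wrap file source plugin URIs, DRS object IDs, signed HTTP URLs, and so on:

```python
from abc import ABC, abstractmethod
from typing import IO


class RemoteDataResolver(ABC):
    """Hypothetical abstraction over remote data references (URI, DRS, ...)."""

    @abstractmethod
    def supports(self, reference: str) -> bool:
        """Return True if this resolver understands the reference (scheme/prefix)."""

    @abstractmethod
    def stat(self, reference: str) -> dict:
        """Return lightweight metadata (name, size, checksum) without copying data."""

    @abstractmethod
    def open(self, reference: str) -> IO[bytes]:
        """Stream the bytes, e.g. for sniffing or metadata generation."""

    @abstractmethod
    def localize(self, reference: str, target_path: str) -> str:
        """Copy the data to a local path only when a job actually needs it."""
```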
