Directions to Improve Distributed Data Handling in Galaxy #11787

jmchilton commented Apr 5, 2021

Some important tangents to the distributed data question are:

  • Enabling more kinds of ObjectStores - potentially useful for distributed data, especially in a multi-user Galaxy context.
  • Scalable Interfaces for Getting Data Into Galaxy.

Once the data is in "Galaxy" (i.e. is part of an object store abstraction configured within Galaxy and tracked in Galaxy's database), the biggest question is how to make extended metadata collection robust and useful with Pulsar. This should be workable within the context of Kubernetes and outside of it.

How does distributed data work with Pulsar on Kubernetes?

Can distributed data work with Pulsar outside of Kubernetes?

Getting truly distributed data working with SLURM would vastly simplify the process of meta-scheduling on usegalaxy.org and could potentially provide important abstractions for re-working that approach for Condor flocking, Amazon Batch, etc.

galaxyproject/pulsar#250

There are, I think, a lot more open-ended questions related to working with initial datasets and getting data into Galaxy. We've spent years tackling questions about how to use data in a distributed fashion once it is in Galaxy, but we haven't really tackled dealing with remote data that isn't already in a Galaxy object store.

How can we populate Galaxy with metadata previously generated?

An important tangent here for building up those abstractions and import/export functionality is allowing model stores to work with workflow invocations (#9024). With this in place, one could imagine distributing whole workflow invocations to local clusters, where we bypass issues like caching, and then just shipping the resulting models back to the source Galaxy.

How can we use Galaxy tools without metadata? Can we do CLI generation without metadata for certain tools/workflows?

  • Extensions to the tool specification to allow simple files instead of datasets ("data"). The tool wrappers would be given just a path (perhaps with a meaningful name) instead of the rest of the metadata.
    Allow Simple "File" Inputs to Tools #11785
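
To make the distinction concrete, here is a purely illustrative sketch (the dictionary shapes are assumptions, not Galaxy's actual wrapper/template objects) of the difference between what a wrapper effectively sees for a "data" input today and what a simple "file" input might provide:

```python
# Illustrative only - not Galaxy's actual template namespace.

# Roughly what a tool has access to for a "data" input today:
dataset_input = {
    "path": "/galaxy/files/000/dataset_42.dat",
    "name": "sample1",
    "ext": "fastqsanger.gz",
    "metadata": {"dbkey": "hg38", "sequences": 125000},
}

# What a proposed simple "file" input (#11785) might provide instead -
# just a path, perhaps with a meaningful name, and none of the metadata:
file_input = {
    "path": "/staging/inputs/sample1.fastq.gz",
    "name": "sample1",
}
```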

How can we use Galaxy tools without metadata? Can we defer metadata generation until immediately prior to the job?

If we could generate command lines as part of multi-container job pods, we could probably delay metadata generation until that point also.

This is scoped in #10873 and would likely be a substantial task, but we've laid a lot of groundwork, and it would allow us to make progress on things like converting Galaxy workflows wholesale to other workflow languages.

The previous task/section would still be useful, because we could prevent metadata generation on the worker in cases where we've annotated that the tool has no use for it.

We have symlinked data for datasets; can we do something similar with remote data, in the sense that the data doesn't exist in a Galaxy object store at all?

The first big question here, I think, is whether we need to store things in the database at all, or whether we can just have a UI for collecting "files" (maybe from file source plugin URIs, maybe from a file selector, maybe from URLs) and running tools/workflows on them.

If we just allow invoking workflows and tools on URIs:

If we did this, we could break the UI and backend into separate pieces of work: the UI could initially be used with a backend step that "materializes" the selected files as datasets, until we are able to get away without that step (e.g. by implementing "dataset" creation on inputs as part of the first job that uses them - the questions tackled above).

An MVP User Story Might Include the Following Two Big Tasks

  • Implement something like the upload from tool form that selects URIs instead of producing datasets.
  • Implement options on the backend for materializing URIs before the job runs. The API should consume {src: 'uri', uri: <uri>} in addition to {src: 'hda', id: <id>} (a sketch of such a payload follows this list).
    If we combined this with Allow Simple "File" Inputs to Tools #11785, we could bypass materialization of datasets and just use files directly in this case.
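
A minimal sketch of what such a tool-execution request might look like, assuming the proposed 'uri' src were accepted alongside the existing 'hda' one (the server URL, API key, history id, and URI below are placeholders):

```python
import requests

GALAXY_URL = "https://galaxy.example.org"  # placeholder server
API_KEY = "YOUR_API_KEY"                   # placeholder key

# Hypothetical payload: a proposed {src: 'uri'} input alongside the existing
# {src: 'hda'} convention. The backend would materialize the URI as a dataset
# before the job runs (or hand the file to the tool directly, per #11785).
payload = {
    "tool_id": "cat1",
    "history_id": "f2db41e1fa331b3e",
    "inputs": {
        "input1": {"src": "uri", "uri": "gxfiles://project-data/sample1.fastq.gz"},
        "queries_0|input2": {"src": "hda", "id": "a799d38679e985db"},
    },
}

response = requests.post(
    f"{GALAXY_URL}/api/tools",
    json=payload,
    headers={"x-api-key": API_KEY},
)
response.raise_for_status()
print(response.json())
```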

Once that is done, we could start optimizing things like incorporating a collection builder (e.g. #9114) driven by the URIs and augmenting dataset sniffing to narrow the candidate files down to those compatible with a given tool/workflow execution.

If we want something like datasets in the database corresponding to the remote data:

If the database is needed and we do want something that looks like HDAs, should they be datasets or something else? Is the right approach to store a dereferenced dataset - adding a deferred column to the dataset and using the existing attached URI models (https://github.com/galaxyproject/galaxy/pull/7487/files)? If not, what does that look like?
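
As a rough illustration only (the table and column names below are assumptions, not the actual schema from #7487), the deferred-dataset-plus-URI-source shape being described might look roughly like this in SQLAlchemy:

```python
from sqlalchemy import Boolean, Column, ForeignKey, Integer, Text
from sqlalchemy.orm import declarative_base, relationship

Base = declarative_base()


class Dataset(Base):
    # Illustrative model - names are assumptions, not Galaxy's actual mapping.
    __tablename__ = "dataset"
    id = Column(Integer, primary_key=True)
    state = Column(Text)
    # Proposed flag: the data lives at a remote URI and has not yet been
    # materialized into any configured object store.
    deferred = Column(Boolean, default=False)
    sources = relationship("DatasetSource", back_populates="dataset")


class DatasetSource(Base):
    # URI source record attached to the dataset.
    __tablename__ = "dataset_source"
    id = Column(Integer, primary_key=True)
    dataset_id = Column(Integer, ForeignKey("dataset.id"))
    source_uri = Column(Text)  # e.g. gxfiles://..., https://..., drs://...
    dataset = relationship("Dataset", back_populates="sources")
```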


xref

jgoecks commented Apr 8, 2021

Thanks for starting this conversation, @jmchilton.

We have symlinked data for datasets; can we do something similar with remote data, in the sense that the data doesn't exist in a Galaxy object store at all?

This is essential for Galaxy to scale to large datasets, and all other work should flow from the assumption that Galaxy must work with remote data, from upload to metadata generation to tool execution.

As an aside, there should be an abstraction layer over remote data to access/reference/copy via URI, DRS, etc.
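
A hedged sketch of what such an abstraction layer could look like - the interface below is purely hypothetical, not an existing Galaxy API; concrete implementations might wrap file source plugin URIs, DRS object IDs, signed HTTP URLs, and so on:

```python
from abc import ABC, abstractmethod
from typing import IO


class RemoteDataResolver(ABC):
    """Hypothetical abstraction over remote data references (URI, DRS, ...)."""

    @abstractmethod
    def supports(self, reference: str) -> bool:
        """Return True if this resolver understands the reference (scheme/prefix)."""

    @abstractmethod
    def stat(self, reference: str) -> dict:
        """Return lightweight metadata (name, size, checksum) without copying data."""

    @abstractmethod
    def open(self, reference: str) -> IO[bytes]:
        """Stream the bytes, e.g. for sniffing or metadata generation."""

    @abstractmethod
    def localize(self, reference: str, target_path: str) -> str:
        """Copy the data to a local path only when a job actually needs it."""
```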
