We have symlinked data for datasets; can we do something similar with remote data, in the sense that the data doesn't exist in a Galaxy object store at all?
This is essential for Galaxy to scale to large datasets, and all other work should flow from the assumption that Galaxy must work with remote data, from upload to metadata generation to tool execution.
As an aside, there should be an abstraction layer over remote data to access/reference/copy via URI, DRS, etc.
Some important tangents to the distributed data question are:
- Enabling more kinds of ObjectStores - potentially useful for distributed data, especially in a multi-user Galaxy context.
- Scalable Interfaces for Getting Data Into Galaxy.
Once the data is in "Galaxy" (i.e. is part of an object store abstraction configured within Galaxy and tracked in Galaxy's database), the biggest question is how to make extended metadata collection robust and useful with Pulsar. This should be workable both within the context of Kubernetes and outside it.
How does distributed data work with Pulsar on Kubernetes?
Can distributed data work with Pulsar outside of Kubernetes?
Getting truly distributed data working with SLURM would vastly simplify meta-scheduling on usegalaxy.org and could potentially provide important abstractions for re-working that approach for Condor flocking, Amazon Batch, etc. (xref galaxyproject/pulsar#250).
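To make the SLURM case a bit more concrete, here is a minimal sketch - not Galaxy or Pulsar code, and `resolve_input` plus the staging layout are made up for illustration - of the per-input decision a runner would have to make if data were truly distributed: use the bytes where they already live (shared filesystem), otherwise stage them next to the job.

```python
# Hypothetical sketch (not Galaxy/Pulsar code): the kind of decision a runner
# would need to make per input if data were "truly distributed" -- use the
# bytes where they already live, otherwise stage them to the compute node.
import os
import shutil
import urllib.request
from urllib.parse import urlparse


def resolve_input(uri: str, staging_dir: str) -> str:
    """Return a local path for `uri`, staging a copy only when required."""
    parsed = urlparse(uri)
    if parsed.scheme in ("", "file"):
        path = parsed.path or uri
        if os.path.exists(path):
            # Shared filesystem (e.g. SLURM cluster with NFS): no copy needed.
            return path
        raise FileNotFoundError(path)
    # Remote object (plain http/https here for simplicity): stage into the job dir.
    os.makedirs(staging_dir, exist_ok=True)
    local_path = os.path.join(staging_dir, os.path.basename(parsed.path) or "input.dat")
    with urllib.request.urlopen(uri) as response, open(local_path, "wb") as out:
        shutil.copyfileobj(response, out)
    return local_path


if __name__ == "__main__":
    print(resolve_input("file:///etc/hostname", "/tmp/job_staging"))
```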
There are a lot more open-ended questions, I think, related to working with initial datasets and getting data into Galaxy. We've spent years tackling how to use data in a distributed fashion once it is in Galaxy, but we haven't really tackled dealing with remote data that isn't already in a Galaxy object store.
How can we populate Galaxy with metadata previously generated?
Write documentation for model store discovery CLI and tooling. Identify limitations of this approach and outline improvements.
CLI description is in https://github.com/galaxyproject/galaxy/blob/dev/lib/galaxy/model/store/build_objects.py#L15.
Implement admin-only APIs for importing model store data archives to make this more usable (currently the import needs to target the database directly); a hedged sketch of what such a call might look like follows below.
An important tangent here, for building up those abstractions and the import/export functionality, is allowing model stores to work with workflow invocations (#9024). With that in place, one could imagine distributing whole workflow invocations to local clusters, bypassing issues like caching, and then just shipping the resulting models back to the source Galaxy.
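As a rough sketch of what an admin-only import API could look like - the route, payload shape, and options below are purely hypothetical, since implementing them is the task - an admin might upload an exported model store archive like this:

```python
# Hypothetical sketch only: no such endpoint exists yet (that is the task).
# Assumes an admin API key and a model store archive produced by the
# build-objects tooling; the URL and payload shape are illustrative.
import requests

GALAXY_URL = "https://galaxy.example.org"
ADMIN_API_KEY = "..."  # admin-scoped key; the endpoint would be admin-only

with open("invocation_export.tar.gz", "rb") as archive:
    response = requests.post(
        f"{GALAXY_URL}/api/model_stores",   # illustrative route
        params={"key": ADMIN_API_KEY},
        files={"archive": archive},         # upload the exported store
        data={"target_history": "new"},     # illustrative option
        timeout=600,
    )
response.raise_for_status()
print(response.json())
```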
How can we use Galaxy tools without metadata? Can we do CLI generation without metadata for certain tools/workflows?
Allow Simple "File" Inputs to Tools #11785
How can we use Galaxy tools without metadata? Can we defer metadata generation until immediately prior to the job?
If we could generate command lines as part of multi-container job pods, we could probably delay metadata generation until that point also.
This is scoped in #10873 and would likely be a substantial task but we've laid a lot of groundwork and it would allow us to make progress on things like converting Galaxy workflows wholesale to other workflow languages.
The previous task/section would still be useful because we could skip metadata generation on the worker in cases where we've annotated that the tool has no use for it.
We have symlinked data for datasets; can we do something similar with remote data, in the sense that the data doesn't exist in a Galaxy object store at all?
The first big question here, I think, is: do we need to store things in the database, or can we just have a UI for collecting "files" (maybe from file source plugin URIs, maybe from a file selector, maybe from URLs) and running tools/workflows on them?
If we just allow invoking workflows and tools on URIs:
If we did this, we could break the UI and backend into separate pieces of work, and the UI could be used with a backend step that "materialized" the selected files as datasets until we are able to get away without that (e.g. implement "dataset" creation on inputs as part of the first job that uses them - questions tackled above).
An MVP user story might include the following two big tasks:
- Implement something like the upload-from-tool form that selects URIs instead of producing datasets.
- Implement options on the backend for materializing URIs before the job runs. The API should consume `{src: 'uri', uri: <uri>}` in addition to `{src: 'hda', id: <id>}` (see the sketch below).
If we combined this with Allow Simple "File" Inputs to Tools #11785, we could bypass materialization of datasets and just use the files directly in this case.
Once that is done, we could start optimizing things like incorporating a collection builder (e.g. #9114) over the URIs and augmenting dataset sniffing to narrow the possible files to those compatible with a given tool/workflow execution.
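As a sketch of what invoking a tool on a URI might look like through the tool execution API: the `{src: 'uri', ...}` form is the proposed extension and does not exist yet, and the tool id, history id, and parameter names below are illustrative.

```python
# Sketch of the proposed URI-based tool execution; only the {"src": "hda", ...}
# form exists today, the {"src": "uri", ...} form is the extension discussed here.
import requests

GALAXY_URL = "https://galaxy.example.org"
API_KEY = "..."

payload = {
    "tool_id": "Grep1",                       # any simple text tool, for illustration
    "history_id": "abc123",
    "inputs": {
        # Proposed extension: reference the remote file directly by URI ...
        "input": {"src": "uri", "uri": "s3://my-bucket/reads/sample1.fastq.gz"},
        # ... instead of requiring an existing dataset:
        # "input": {"src": "hda", "id": "f2db41e1fa331b3e"},
        "pattern": "^@",
    },
}

response = requests.post(
    f"{GALAXY_URL}/api/tools", json=payload, params={"key": API_KEY}, timeout=60
)
response.raise_for_status()
print(response.json())
```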
If we want something like datasets in the database corresponding to the remote data:
If the database is needed and we do want something that looks like HDAs, should they be datasets or something else? Would storing a dereferenced dataset be enough - adding a deferred column to the dataset and attaching the existing URI models (https://github.com/galaxyproject/galaxy/pull/7487/files)? If not, what does that look like?
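As a rough illustration of the deferred-dataset idea - these models are hypothetical and heavily simplified, not the actual Galaxy schema or the code in #7487: a dataset row could carry no file at all, only a flag plus an attached source URI to dereference later.

```python
# Hypothetical, simplified models illustrating a "deferred" dataset backed by a URI.
from sqlalchemy import Boolean, Column, ForeignKey, Integer, String, create_engine
from sqlalchemy.orm import Session, declarative_base, relationship

Base = declarative_base()


class Dataset(Base):
    __tablename__ = "dataset"
    id = Column(Integer, primary_key=True)
    state = Column(String(64), default="deferred")
    deferred = Column(Boolean, default=False)       # the proposed "deferred" column
    sources = relationship("DatasetSource", back_populates="dataset")


class DatasetSource(Base):
    __tablename__ = "dataset_source"
    id = Column(Integer, primary_key=True)
    dataset_id = Column(Integer, ForeignKey("dataset.id"))
    source_uri = Column(String(1024))                # where the bytes really live
    dataset = relationship("Dataset", back_populates="sources")


engine = create_engine("sqlite://")
Base.metadata.create_all(engine)
with Session(engine) as session:
    remote = Dataset(deferred=True, sources=[DatasetSource(source_uri="https://example.org/a.vcf")])
    session.add(remote)
    session.commit()
    print(remote.id, remote.sources[0].source_uri)
```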