Discuss Better Support for Deeply Nested Dataset Structures #4013
Comments
Here is an example from my discussion with @shiltemann @yhoogstrate David van Zessen and Andrew Stubbs. Suppose you have data from multiple patients. Each patient has three (or, really, any number of) types of biopsies taken. Each biopsy is sequenced in several technical replicates with a paired-end approach. In addition, there are other types of data about a patient, such as smoker/non-smoker, age, sex, etc. So it looks something like this:
This is taken from this image:
To upload such a structure into Galaxy, users must be able to create a spreadsheet-like manifest in which they can associate individual files with appropriate metadata. This example is very similar to the ChIP-seq example we discussed during the team meeting.
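To make the manifest idea concrete, here is a minimal sketch of how a spreadsheet-like manifest could be grouped into the nested patient / biopsy / replicate / pair shape described above. This is not Galaxy code: the column names, file names, and the `nest_manifest` helper are all assumptions made up for illustration.

```python
import csv
import io

# Hypothetical manifest: column names and file names are invented for illustration.
MANIFEST = """\
patient,biopsy,replicate,direction,file,smoker,age,sex
P1,tumor,r1,forward,P1_tumor_r1_1.fastq,yes,54,F
P1,tumor,r1,reverse,P1_tumor_r1_2.fastq,yes,54,F
P1,blood,r1,forward,P1_blood_r1_1.fastq,yes,54,F
P1,blood,r1,reverse,P1_blood_r1_2.fastq,yes,54,F
P2,tumor,r1,forward,P2_tumor_r1_1.fastq,no,61,M
P2,tumor,r1,reverse,P2_tumor_r1_2.fastq,no,61,M
"""


def nest_manifest(text):
    """Group rows into patient -> biopsy -> replicate -> {direction: file}.

    Patient-level metadata (smoker, age, sex) is collected separately,
    keyed by patient identifier.
    """
    nested = {}
    metadata = {}
    for row in csv.DictReader(io.StringIO(text)):
        pair = (
            nested.setdefault(row["patient"], {})
            .setdefault(row["biopsy"], {})
            .setdefault(row["replicate"], {})
        )
        pair[row["direction"]] = row["file"]
        metadata[row["patient"]] = {key: row[key] for key in ("smoker", "age", "sex")}
    return nested, metadata


nested, metadata = nest_manifest(MANIFEST)
```

Each row carries one file plus the identifiers that place it in the hierarchy, so the same flat table can describe any nesting depth; the patient-level columns are simply repeated on every row.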
This is also somewhat related to the ISA-tab discussions we had for the metabolomics datatypes. It would be nice to have a general concept of uploading data as an archive with a self-describing format that can be converted into list-of-list and so on.
I've updated the original issue with whiteboard pictures - a huge thanks to Jen for taking these.
Just for clarification and discussion purposes. My understanding of this is the following:
thank you @guerler
@guerler and others, yes, this sounds really really great. @nekrut asked us to provide some more info on our process, so here are my 2 cents:
Most of our users just upload the data as separate files through the upload menu (unless the files are very big). With the drag-and-drop feature and multiple file select, it is easy enough to upload many files at once this way as well.
Not sure if you were thinking of doing this upon upload, but often our users would want to change their initial design later (e.g. remove poor quality samples, fix mistakes, change/add metadata, etc.) or build their collection from data already on Galaxy (think shared data libraries, imported from data sources, or uploaded by others and shared with them). So the ability to edit a manifest file, or build one from scratch in Galaxy from items in the history, would be great.
Our experimental design/manifest usually looks like the one described by Anton. To give you a concrete example, right now we have an experiment where we are analyzing 100 samples with mothur, with 3 technical replicates each and metadata associated with each of them. Additionally, we have 10 negative control samples, also consisting of 3 replicates each. Each sample has one negative control associated with it, but each negative control is associated with 10 of the samples. So one of the features/metadata of a sample could be a reference to another dataset as well.
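As a rough sketch of the experiment just described, the manifest below gives every sample row a `control` field that refers to another entry in the same manifest. The identifiers, the field names, and the rule that consecutive blocks of 10 samples share a control are all assumptions for illustration; only the counts come from the description above.

```python
from collections import defaultdict

# Hypothetical identifiers; only the counts follow the experiment described above.
controls = [f"NC{i:02d}" for i in range(1, 11)]   # 10 negative controls
samples = [f"S{i:03d}" for i in range(1, 101)]    # 100 samples
replicates = ["rep1", "rep2", "rep3"]             # 3 technical replicates each

manifest = []
for index, sample in enumerate(samples):
    # Assumption: consecutive blocks of 10 samples share one negative control.
    control = controls[index // 10]
    for replicate in replicates:
        manifest.append(
            {"sample": sample, "replicate": replicate, "type": "sample", "control": control}
        )
for control in controls:
    for replicate in replicates:
        manifest.append(
            {"sample": control, "replicate": replicate, "type": "negative_control", "control": None}
        )

# Invert the reference to see which samples each negative control serves.
samples_per_control = defaultdict(set)
for row in manifest:
    if row["control"] is not None:
        samples_per_control[row["control"]].add(row["sample"])
```

The point of the sketch is that the `control` column is metadata whose value is a reference to another dataset rather than a plain string, which is what makes this design harder to express than a simple nested list.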
I've created two big issues for what I see as the next big steps in the direction outlined here - #4707 for the advanced dataset input piece and #4733 for getting large amounts of nested data into Galaxy. We can keep this open for general comments - but specific comments about those two big issues I guess should be redirected to said issues?
Alright, I think action points 3 and 4 would help a lot with things that came up in #740
I'm going to close this issue - it was a good conversation, it shaped half a year of my development time, and I'm proud of the outcome. I don't think we are done by any means, but the landscape has really shifted - we've made a lot of progress on all of these issues with 18.05, I think - and we should have a new discussion at some point that reflects the current state of things and the new constructs we have to address these concerns. @mvdbeek and I will discuss a bunch of the enhancements we've made to tackle these problems at the GCC.
We had an in-depth discussion about deeply nested data structures. This issue summarizes that meeting and our perceived takeaways, and we will close it once we have concrete issues (action items 😄) from that.
Some obvious existing GUI issues are #2495 and #3689 - and those just need to be fixed.
Here is the @jmchilton summary of this part of the meeting.
`list:list:pair`s instead of a `list:list:list*:pair` (where the `list*` always contains element identifiers `condition` and `control`) - the workflow would be executable today in Galaxy. The part that isn't there with the proposed approach is to take Macs without modification and feed the "control" sublist and the "condition" sublist to separate inputs.
As I have been thinking about the meeting, I had some more thoughts about record types, and my final impression was that simply adding constraints to lists would get us farther, faster than building up record types - which I see as being more general (and potentially too general to be useful in the context of our GUI). I want to implement record types - but the GUI problems seem more tractable with list constraints.
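To illustrate what a list constraint could mean in practice, here is a minimal sketch that treats a collection as nested dictionaries, checks that every list at a given depth has exactly the element identifiers `condition` and `control`, and then splits the collection on that level so each branch could go to a separate input. This is not Galaxy's implementation or API; the function names, the dictionary representation, and the file names are assumptions for illustration only.

```python
REQUIRED = {"condition", "control"}


def lists_at_depth(collection, depth):
    """Yield every sub-collection found `depth` levels below the top."""
    if depth == 0:
        yield collection
        return
    for child in collection.values():
        yield from lists_at_depth(child, depth - 1)


def satisfies_constraint(collection, depth, required=REQUIRED):
    """True if every list at `depth` has exactly the required element identifiers."""
    return all(set(sub) == set(required) for sub in lists_at_depth(collection, depth))


def split_on(collection, depth, identifier):
    """Drop the constrained level, keeping only the `identifier` branch."""
    if depth == 0:
        return collection[identifier]
    return {key: split_on(child, depth - 1, identifier) for key, child in collection.items()}


# Hypothetical nested collection: experiment -> sample -> {condition, control} -> pair.
data = {
    "exp1": {
        "s1": {
            "condition": {"forward": "s1_cond_1.fq", "reverse": "s1_cond_2.fq"},
            "control": {"forward": "s1_ctrl_1.fq", "reverse": "s1_ctrl_2.fq"},
        },
        "s2": {
            "condition": {"forward": "s2_cond_1.fq", "reverse": "s2_cond_2.fq"},
            "control": {"forward": "s2_ctrl_1.fq", "reverse": "s2_ctrl_2.fq"},
        },
    }
}

if satisfies_constraint(data, depth=2):
    condition_input = split_on(data, 2, "condition")
    control_input = split_on(data, 2, "control")
```

Once the constraint holds, splitting is mechanical and yields two collections with one fewer level of nesting, which is the shape a tool with separate "control" and "condition" inputs would need.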
Digesting all of that I'm tempted to create these concrete issues:
We didn't discuss where the user would do this - but since the meeting I have been thinking workflow inputs is a good start as well as during creation itself. When uploading/creating collections - you can create the list constraints directly or import an input collection definition from a workflow via one of its inputs.