Rule-based Uploader / Builder #5365
@jmchilton Are you aiming for 18.01? Or should the milestone be updated? |
@nsoranzo Breaking Anton's heart I've switched this to 18.05. |
jmchilton#64 should fix this up a bit, resolving both the production build issues and the upload initialization failures that were breaking qunit tests. |
Updated issue description with links to followup issues and other related issues. |
Hi @jmchilton this sounds very exciting! and will this:
help with this issue? #740 |
No, I don't see how it would, to be honest. I saw your screenshot in that issue where you are dragging bits and pieces of the collection out and into a multi-select drop box. This PR does allow creating nested lists via the GUI, however, and if you have nested lists upfront I'm really hoping you won't even want to do the thing outlined in #740. You can have multiple replicates with different conditions or whatever and take advantage of that to avoid needing to specify the grouping manually after the fact. The tool you showed had a group name and then multiple files per group - well, that really should have an option of just taking in a nested list, using the outer list identifier as the group name and the inner lists as the files you select. If parts of the analysis require things to be in a flatter fashion - there is a collection operation to build a flattened collection, so you can have two views of the same data for different parts of the analysis. Then if everything stays in collections throughout you'll always have the element identifier.
If this model is too simplistic - and you need to really filter the data deeply and get even more different views of it - and if what you are doing manually is something that can be represented by the rule builder here - I did outline creating a collection operation tool out of the rule language in this issue (#5381). That would allow really arbitrary different organizations and filtered views of the data and could be applied in the middle of workflows for instance - so it would be great for manual or automated analyses. |
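To make the "two views of the same data" idea concrete, here is a minimal Python sketch. It is not Galaxy code: a nested list is modelled as a plain dict keyed by element identifier, and the function names, the `_` separator and the file names are all assumptions made for illustration.

```python
from typing import Any, Dict, List


def groups_from_nested(nested: Dict[str, Dict[str, str]]) -> Dict[str, List[str]]:
    """Read a list:list as groups: outer identifier -> files in the inner list."""
    return {group: list(inner.values()) for group, inner in nested.items()}


def flatten(nested: Dict[str, Any], separator: str = "_", prefix: str = "") -> Dict[str, str]:
    """Build a flat view whose identifiers join the nested identifiers.

    The separator is an assumption of this sketch.
    """
    flat: Dict[str, str] = {}
    for identifier, value in nested.items():
        name = f"{prefix}{separator}{identifier}" if prefix else identifier
        if isinstance(value, dict):
            flat.update(flatten(value, separator, name))
        else:
            flat[name] = value
    return flat


# Hypothetical nested list: condition -> replicate -> file
nested = {
    "Basal": {"rep1": "basal_1.counts", "rep2": "basal_2.counts"},
    "LP": {"rep1": "lp_1.counts"},
}
groups = groups_from_nested(nested)
flat = flatten(nested)
```

Both views are derived from the same nested structure, so the element identifiers are never lost along the way.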
Some things I noticed (in chronological order)
Looking the job up in the database I see this in the stderr:
Turns out this is also broken through the regular upload, I will check if this is a regression or not. |
(I'm behind a proxy, this upload issue probably has nothing to do with the PR, except for the missing indication that something went wrong) |
For the collection building from history items:
|
Thanks for trying it out @mvdbeek!
This is the instinct I'm trying to break. I'm not trying to build a crappy Excel - I'm trying to force people into writing programs that can be recorded and rerun for munging their data and that scale to arbitrary numbers of rows. In that spirit of pretending I know better than users I explicitly disable editing. At a lower level the way this works is that every time a rule is added or edited everything is regenerated from the initial data. If there are manual edits in the middle - the rules won't reapply - this breaks down.
That said - how about I offer an olive branch to this instinct to edit - since both you and Anton had the same instinct. As long as there are no rules applied to the data - so when you first land on the page - you can edit and I'll record it as a pseudo rule "Manually Edited Data" and note that it isn't reproducible or exportable. Then as soon as rules start being applied to the data the table becomes read-only and I add some mouseover that explains why it is read-only.
Another way this could work is we keep the rule editor the same - read-only - and I could implement your other idea of allowing pasting directly into a spreadsheet, but we could do that on the previous page during upload. So instead of pasting into the textbox we figure out how to paste into a spreadsheet and then allow editing from there; as soon as the build button is clicked and the data is brought over to the rule builder, things are frozen. This would make it very clear what is input and what is the set of rules the user is building - and that inputs are modifiable and the rules are reproducible/executable on the backend/etc....
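The "regenerate everything from the initial data" behaviour can be sketched roughly as follows. This is an illustrative Python model, not the actual Vue implementation; the rule names (`drop_header`, `swap_columns`) and the example rows are hypothetical.

```python
from typing import Callable, List

Row = List[str]
Rule = Callable[[List[Row]], List[Row]]


def drop_header(count: int) -> Rule:
    """Hypothetical rule: remove the first `count` rows."""
    def rule(rows: List[Row]) -> List[Row]:
        return rows[count:]
    return rule


def swap_columns(a: int, b: int) -> Rule:
    """Hypothetical rule: swap two columns in every row."""
    def rule(rows: List[Row]) -> List[Row]:
        swapped = []
        for row in rows:
            new_row = list(row)
            new_row[a], new_row[b] = new_row[b], new_row[a]
            swapped.append(new_row)
        return swapped
    return rule


def apply_rules(initial_rows: List[Row], rules: List[Rule]) -> List[Row]:
    """Replay every rule, in order, starting from a copy of the initial data."""
    rows = [list(row) for row in initial_rows]
    for rule in rules:
        rows = rule(rows)
    return rows


initial = [
    ["sample", "url"],
    ["s1", "ftp://example.org/s1.fastq"],
    ["s2", "ftp://example.org/s2.fastq"],
]
rules = [drop_header(1)]
table = apply_rules(initial, rules)  # preview after the first rule
rules.append(swap_columns(0, 1))
table = apply_rules(initial, rules)  # adding a rule replays all rules from scratch
```

Because the table is always derived from `initial` plus the rule list, a manual edit made to `table` in the middle would be thrown away on the next replay, which is why editing is only safe before any rule exists.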
I've added that as an enhancement to #5380.
I feel like this sometimes works for me and sometimes doesn't. I can spend some more time working on the stylesheet and figure out what is wrong. I've added this as a bug to #5380.
I thought I had tested this but clearly not. I'll definitely fix that. I've added this as a bug to #5380.
There is a discussion of this in #5380 already - the cardinal sin like always is that Galaxy doesn't represent jobs (or workflows) in the history panel so these errors aren't seen. If you keep the modal dialog open it will poll on the job and report errors right in the panel I think - at least it did for me at various times. Short of putting jobs in the history panel there are two fixes we could apply. We could add an explicit output to the data fetching tool added in #5220, the way some tools do, that could summarize what happened and would turn red on failure. Or we could do some harder work and update the new data fetch API to prebuild skeletons for all datasets and collections that need to be created before the job runs (right now these are just discovered after the job is complete). The log file approach has always struck me as kind of hacky - but it might be useful to other consumers of this API so who knows. Pre-creating everything sounds great - but is tricky to implement and could result in a lot of pointless red stuff in the history if things fail. We could also blend the approaches and sort of discover a log file if the job fails. I'm not sure how to proceed here - happy to hear advice though. |
I didn't even start to think about non-HDAs. I'll report an error if such things are used for now - obviously down the road some combinations of these things will be super powerful and super awesome though. |
This is great. I am wondering if we can consolidate the other collection builders e.g. in the Collection uploader tab, the history panel and this one into a single collection builder ui? |
The collection tab is unified with the history panel in the sense that they both launch the same components after the files or datasets are selected. That initial step of selecting datasets or files is pretty different though. This makes sense to me - in one case the datasets are in the history and in the other case the datasets don't yet exist in your history. I guess one could eliminate the upload collection tab and require files be added to the history first - but that is how it was in the past, and many people called it confusing and it resulted in a lot of extra error-prone clicking that can be avoided by skipping the history step. One could imagine replacing the existing collection builders (the things after the initial upload or after things are selected in the history) with this work - because indeed this can do a superset of what they can do, but I'd argue we shouldn't do that. This just isn't as simple as the paired-end list builder or the list builder when building small lists (let's say a few dozen items or less). While I really think we should work toward making this as simple as possible - this approach will always be more complex I think, and so it is targeted at more technically sophisticated users. From the above description:
The other thing to keep in mind is that this new option is not necessarily a collection upload option - since you can use rules to upload individual datasets. Check out the first example in the tutorial I'm building https://github.com/jmchilton/training-material/blob/rules/topics/introduction/tutorials/galaxy-intro-rules/tutorial.md for instance. I use uploading datasets as a hook into defining rules and such. That is the pushback - where I might agree is, say, in the history panel - it would be nice to have a "create collection" option for instance that opens up a dialog or something like the new viz panel that describes the 4 different ways to build a collection in detail instead of having the 4 individual options. Something along the lines of what @martenson did for libraries in #5080 - but maybe big buttons like in the viz instead of a dropdown - and then we could synchronize the help language across libraries and histories and maybe add some images and such that can be shared throughout? The history => collection builder would then be a two-step process, and everyone explicitly shot down wizards a long time ago when we first started talking about this. Integrating this into the uploader the same way would be a three-step process - definitely a wizard. It seems like multi-step wizard-ish interfaces would give us more room to explain things and synchronize UI elements - so I think I'm in favor of doing that, but it was explicitly shot down by the powers that be in the past. |
@mvdbeek I have updated this using #5609, which causes datasets and collections to be pre-created during job submission and refreshes the history just after the job is submitted. So if there is some runtime issue with creating collections or datasets there will now be a red collection or dataset. The job details are not available in the GUI for direct output collections - that would be really helpful for end users trying to debug problems I think - but this is a sort of general Galaxy issue. I may take a crack at this generally outside the context of this PR but it would enhance this PR. |
@jmchilton Can you rebase? After the merge of #5220 this has a few conflicts. |
Vue-based component for defining collections by applying rules to a list of files or more general spreadsheet style information (e.g. sample sheets or tabular data from data sources containing URL or FTP file paths for files along with metadata). The widget is fairly complex but very broadly is broken into two panes - one to preview how rules are applied to build up tabular data defining collections (each row corresponding to a file with columns for metadata and such) and one that displays defined rules and allows for editing of these rules and creation of new ones.
The goal behind defining rules this way instead of allowing the user to interact with the spreadsheet display directly is to enable scaling up collection creation. If a user wishes to upload hundreds of datasets - interacting with a widget directly for each input doesn't scale well and would be error prone. If a user wishes to upload hundreds of thousands of datasets - even loading this information in the GUI may not scale (though I've been impressed with the performance so far of this approach) and so we can potentially just display a preview of some of the rows and process the final set of rules on the backend.
Since we can handle an arbitrary number of columns this way, we can define multiple list identifiers per file and so we can easily construct nested lists. Hence this allows creation of not just potentially larger collections but arbitrarily complex lists as well. Paired identifiers via indicator columns are also implemented.
In order to operate over lists of datasets directly - the multi-select history widget now has a new option "Build Collection from Rules" alongside the other collection builders. This mode uses the well established dataset collection API to build collections from HDAs.
In order to operate on lists of FTP files or URLs - the upload widget has a new "Rule-based" tab that allows users to paste in tabular data or select a history dataset and then send this tabular data to the new builder widget. This will be extended to include FTP directories for instance over time. This mode uses the new data fetch API to build collections and handle uploads of arbitrary collections of files.
The preview of the tabular data generated via rules is done via [Handsontable](https://handsontable.com/) - a JavaScript spreadsheet widget with a VueJS [wrapper component](https://github.com/handsontable/vue-handsontable-official). This turns out to be a fairly nice application for reactive components - as rules are added or modified the spreadsheet just naturally updates. In my hands the widget scales very nicely - I've uploaded files with tens of thousands of rows and rules modifying the data and changing the spreadsheet do not seem to cause significant delays in the web browser.
…lation without a vue runtime for nested local components in vue SFCs, and other situations.
I think it is ready for the big stage. The majority of things work as expected and, besides a few validation problems, I encountered nothing that would stop me from using this efficiently. I will point out only one thing here: I think the editor/rule builder needs more canvas, the modal is too much of a size constraint. It is a very exciting feature @jmchilton and I think many people will love this! Thank you for your review @mvdbeek. I will put this on https://test.galaxyproject.org and hopefully @nekrut @blankenberg @jgoecks and other power users will give it a spin so we can see what it can do in a real setting. |
I deployed the new rule-based dataset/collection uploader on https://test.galaxyproject.org/ p.s. sorry for leaving this open for so long @jmchilton 😞 |
This is great!!! 👏 🎉 I've only played with it a small bit so far but it works beautifully for importing the counts files I need for a tutorial I'm working on (here: https://www.bioconductor.org/help/workflows/RNAseq123/) e.g. if I paste this into the Rule-builder
I can import the counts files from GEO directly into collections for the groups Basal/LP/MP really easily. 😄 Yay!!! Would be great to be able to add hashtags to save on doing that after importing eg add hashtags to the collections using the group column (Basal/LP/MP). Then it would be super quick to go from grabbing the counts, to groups ready for the next step in the tutorial - differential expression with limma-voom. I have a few other small feedback points (should I put them here?):
Thanks for this 😄 |
@mblue9 Thanks for the quick feedback - I'll try to get at least some of this implemented this week. I've tracked a lot of these issues on a new issue here (#5822). The one about using previous rules is a little more intricate so I put it on a longer term tracking issue here #5381. I'll think about the metadata issue too - not sure what caused that. The RNA-seq example looks exciting! |
Thanks to you @jmchilton for working on this! Just to say, the previous rules thing is only a nice-to-have for me at this stage, I'm not that fussed about that one atm. Much more interested in #5381 (comment) |
Vue-based component for defining collections and dataset uploads by applying rules to a list of files or more general spreadsheet style information (e.g. sample sheets or tabular data from data sources containing URL or FTP file paths for files along with metadata). The widget is fairly complex but very broadly is broken into two panes - one to preview how rules are applied to build up tabular data defining collections (each row corresponding to a file with columns for metadata and such) and one that displays defined rules and allows for editing of these rules and creation of new ones.
The goal behind defining rules this way instead of allowing the user to interact with the spreadsheet display directly is to enable scaling up collection creation. If a user wishes to upload hundreds of datasets - interacting with a widget directly for each input doesn't scale well and would be error prone. If a user wishes to upload hundreds of thousands of datasets - even loading this information in the GUI may not scale (though I've been impressed with the performance so far of this approach) and so we can potentially just display a preview of some of the rows and process the final set of rules on the backend.
Since we can handle an arbitrary number of columns this way, we can define multiple list identifiers per file and so we can easily construct nested lists. Hence this allows creation of not just potentially larger collections but arbitrarily complex lists as well. Paired identifiers via indicator columns are also implemented. This is our first GUI-based approach to allowing the creation of nested lists and enables a majority of the user stories I outlined in #4733.
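As a rough illustration of how several list identifier columns per row give rise to a nested list, here is a minimal Python sketch. It models the idea only and is not the builder's actual code; `build_nested`, the column indices and the example URLs are assumptions.

```python
from typing import Any, Dict, List


def build_nested(rows: List[List[str]], identifier_columns: List[int], value_column: int) -> Dict[str, Any]:
    """Nest rows by each identifier column in turn (outermost first).

    One identifier column yields a flat list, two yield a list:list, and so on.
    """
    if not identifier_columns:
        raise ValueError("at least one list identifier column is required")
    root: Dict[str, Any] = {}
    for row in rows:
        node = root
        # Walk or create one level per outer identifier
        for column in identifier_columns[:-1]:
            node = node.setdefault(row[column], {})
        # The last identifier names the element itself
        node[row[identifier_columns[-1]]] = row[value_column]
    return root


# Hypothetical table: condition, replicate, URL
rows = [
    ["treated", "rep1", "http://example.org/t1.tsv"],
    ["treated", "rep2", "http://example.org/t2.tsv"],
    ["control", "rep1", "http://example.org/c1.tsv"],
]
nested = build_nested(rows, identifier_columns=[0, 1], value_column=2)
flat = build_nested(rows[:2], identifier_columns=[1], value_column=2)
```

The depth of the result follows directly from how many columns are marked as list identifiers, which is why no special GUI is needed per nesting level.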
In order to operate over lists of datasets directly - the multi-select history widget now has a new option "Build Collection from Rules" alongside the other collection builders. This mode uses the well established dataset collection API to build collections from HDAs.
In order to operate on lists of FTP files or URLs - the upload widget has a new "Rule-based" tab that allows users to paste in tabular data -or- select a history dataset -or- use their FTP directory contents and then send this tabular data to the new builder widget. This mode uses the new data fetch API (#5220) to build collections and handle uploads of arbitrary collections of files.
The preview of the tabular data generated via rules is done via Handsontable - a JavaScript spreadsheet widget with a VueJS wrapper component. This turns out to be a fairly nice application for reactive components - as rules are added or modified the spreadsheet just naturally updates. In my hands the widget scales very nicely - I've uploaded files with tens of thousands of rows and rules modifying the data and changing the spreadsheet do not seem to cause significant delays in the web browser.
I was in the middle of developing a tutorial / training material section and detailed test cases for this component since it is more complex and less obvious than typical Galaxy GUI but @nekrut wants me to open a PR now and get something rougher than I'd like in. So this works I think - but it has some rough edges. I've created an issue to track these rough edges with #5380 and reviewers can decide what needs more polish before an initial merge and what can wait for follow up commits - I'll also track reviewer comments there.
Another thing to keep in mind, I think this is a power user feature like notebooks - I'll try to add lots of in-app help and documentation - but ultimately users are creating a program for defining how to input data into Galaxy. I think this is a good direction to move parts of the GUI - not just serving users without programming knowledge but providing paths to learn and incentivize learning these skills. That said the target audience is a bit different and so hopefully reviewers are on board for that vision also, and even the existing paired list creator uses regex for instance.
This PR now includes the first three test cases outlined in #5379 and so numerous screenshots are produced every time the PR is updated. Click the Selenium tests, then Build Artifacts, then screenshots. Here is an example:
The screenshots available:
The first use case - featuring pasting data into the browser, stripping out header information, and uploading individual datasets from ENA (not in a collection).
The second use case uses the same data but loads it from a history dataset instead of copying into the web browser and builds a collection (a flat list in this case) from the data.
The third example uses a new dataset from ENA and demonstrates building a list of pairs along with many more rule operations including hiding columns, swapping columns, splitting up a cell that has two URLs with a regular expression, extracting and mapping paired identifier information.
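The splitting and paired identifier steps of the third use case can be approximated in plain Python. This is a hedged sketch, not the rule builder's implementation: it assumes the two URLs in a cell are separated by a semicolon and that mates are marked `_1` / `_2` in the file name, and the helper names and example URLs are invented for illustration.

```python
import re
from typing import List, Pattern

# Assumed cell layout: "<url for mate 1>;<url for mate 2>"
TWO_URLS = re.compile(r"^(.*);(.*)$")
# Assumed file naming: ..._1.fastq... / ..._2.fastq...
PAIR_SUFFIX = re.compile(r"_([12])\.fastq")
PAIR_NAMES = {"1": "forward", "2": "reverse"}


def split_column(rows: List[List[str]], column: int, pattern: Pattern[str]) -> List[List[str]]:
    """Replace one column with the groups captured by a regular expression."""
    result = []
    for row in rows:
        match = pattern.match(row[column])
        if match is None:
            raise ValueError(f"cell does not match pattern: {row[column]!r}")
        result.append(row[:column] + list(match.groups()) + row[column + 1:])
    return result


def paired_identifier(url: str) -> str:
    """Extract the mate number from a URL and map it to forward/reverse."""
    match = PAIR_SUFFIX.search(url)
    if match is None:
        raise ValueError(f"no paired indicator found in: {url!r}")
    return PAIR_NAMES[match.group(1)]


rows = [["sampleA", "ftp.example.org/a_1.fastq.gz;ftp.example.org/a_2.fastq.gz"]]
split = split_column(rows, 1, TWO_URLS)
# One output row per file: list identifier, URL, paired identifier
per_file = [[row[0], url, paired_identifier(url)] for row in split for url in row[1:]]
```

Each resulting row carries a list identifier and a paired identifier, which is the information a list of pairs needs.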
Related issues: