Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow selecting subset of collection by group tag #5457

Merged
merged 9 commits into from
Aug 7, 2018

Conversation

mvdbeek
Copy link
Member

@mvdbeek mvdbeek commented Feb 4, 2018

This would allow selecting tagged subsets of (nested) collections. To use this you need to tag the individual collection elements with group:<some group name>. All groups defined this way will be selectable in the tool form.

Together with this patch against the deseq2 wrapper one can setup a deseq2 experiment from an abitrarily nested collection like this:
screen shot 2018-02-04 at 12 02 02

That's a fairly simple way to do this, I'd be curious to know if people think that something like this could work out ?
xref #740

@jmchilton
Copy link
Member

Thanks for throwing this out there - I appreciate the conversation and the prototype to ground it! I've been thinking about this since you opened it and I'm going to continue to - there are some things really awesome here and some things that make me nervous (but I'm always nervous).

  • In this example, would the tags come during the initial upload/import and we would need to develop a system for deciding what non-name tags get propagated or do you expect the previous tool in the chain to annotate it outputs with tags (we could certainly add a syntax for that)?
  • Do the files in the different levels potentially overlap (i.e. might a file belong to multiple factors)?
  • Would you ever use deseq2 with a large number of factors or in workflows where the number of factors vary based on the inputs.

@mvdbeek
Copy link
Member Author

mvdbeek commented Feb 5, 2018

In this example, would the tags come during the initial upload/import and we would need to develop a system for deciding what non-name tags get propagated or do you expect the previous tool in the chain to annotate it outputs with tags (we could certainly add a syntax for that)?

I think the grouping information has to be supplied during upload / when building the collection, I don't think this can be inferred realistically (Of course I could be wrong!). This morning I wrote a tool that adds tags based on a text file that maps element identifiers to tags, this is very similar to the relabel identifiers from file tool. And yes, those tags should be propagated by default, similar to the name tags.

Do the files in the different levels potentially overlap (i.e. might a file belong to multiple factors)?

Yes, I think that's frequently the case (the example here is from the deseq2 iuc test data, where a file belongs to one of the "paired" and "treated" factor levels).

Would you ever use deseq2 with a large number of factors

That depends a bit on what you would call a large number. I guess more than 4 would be uncommon, but each factor may again have multiple levels (think time course experiments).

or in workflows where the number of factors vary based on the inputs.

Yes, I think that is also common since that depends on the experimental design. Currently (and with this proposition as well) we have to modify each workflow if the experimental design is changed (i.e we would need to add or remove factors and factor levels in the deseq2 tool). Still with this approach we at least don't need to re-wire the connections in the workflow as we have to do with multiple flat collections. In fact since now the inputs are unchanged this can simply be runtime parameters.

@mvdbeek
Copy link
Member Author

mvdbeek commented Feb 5, 2018

A minor inconvenience I see is that tags don't work for anonymous users (so this is a bit harder to test), but I guess we'd make a special variant of Tag anyway if we go with this approach.

@mvdbeek
Copy link
Member Author

mvdbeek commented Feb 6, 2018

Also tags like group:treatment and group:day1 are not supposed to co-exist on the same dataset, I guess. It seems to work on sqlite, but not with postgresql:

IntegrityError: (psycopg2.IntegrityError) duplicate key value violates unique constraint "tag_name_key"
DETAIL:  Key (name)=(group) already exists.
 [SQL: 'INSERT INTO tag (type, parent_id, name) VALUES (%(type)s, %(parent_id)s, %(name)s) RETURNING tag.id'] [parameters: {'parent_id': None, 'type': 0, 'name': u'group'}]
galaxy.tools.execute WARNING 2018-02-06 15:22:38,808 [p:2122,w:1,m:0] [uWSGIWorker1Core1] There was a failure executing a job for tool [__TAG_FROM_FILE__] - Error executing tool: (psycopg2.IntegrityError) duplicate key value violates unique constraint "tag_name_key"

@mvdbeek
Copy link
Member Author

mvdbeek commented Feb 6, 2018

Actually that works via the history UI, but not if I set this with #5462 ... maybe each tag needs to be set individually. Is this supposed to work ? I guess it should, given that you can add many name tags ?!

EDIT: yep, need to flush after each new tag,

jmchilton added a commit to jmchilton/galaxy that referenced this pull request Jul 12, 2018
Based on galaxyproject#5457, certain paths to creating tag values with : must already work - just not when parsing from a raw string I guess.
@@ -2103,6 +2201,7 @@ def to_dict(self, trans, other_values=None):
genomebuild=GenomeBuildParameter,
select=SelectToolParameter,
color=ColorToolParameter,
select_tag=SelectTagParameter,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't call data parameters select_data and this doesn't apply to all tags - so I'm going to switch this to group_tag - I hope that is okay? I'll add an example tool, test case, and update the XSD. Let me know if there is any issue with this.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that makes more sense!

- Switch select_tag to group_tag as a parameter name - since it is restricted to group_tags and we don't call data parameters select_data.
- Add minimal XSD documentation for this new parameter type.
- Add example tools - for single and multiple tags.
- Add API tests for these tools and the tag handling.
@jmchilton
Copy link
Member

jmchilton commented Jul 24, 2018

Merged with latest dev and pushed a commit that:

  • Switches select_tag to group_tag as a parameter name - discussed above.
  • Adds minimal XSD documentation for this new parameter type.
  • Adds example tools - for single and multiple tags.
  • Add API tests for these tools and the tag handling.

Unfortunately the multiple parameter test doesn't pass locally for me and I think it is because the tags are being reordered vs. what is sent in through the API. If that isn't stable I wonder if we should just drop "multiple=True" as an option and require a repeat? This would allow multiply selecting the same tag for applications that might require it... it would also allow multiply selecting the same tag for applications that might want to prohibit it. Not sure.

I might try to take a stab at making it stable though.

@jmchilton
Copy link
Member

It turns out the order was made unstable by throwing things into a set to ensure supplied tags were distinct - I switched the logic here to use a list instead of set to ensure the tags are distinct and the test case passed. That said - I have no clue why we would enforce that - it feels like if the user is supplying duplicate tags in a specific order they must have a reason (not that the UI allows it - but it should I assume - same with multi-data parameters) and we should just respect that.

@mvdbeek Mind if I make the switch of not enforcing the tags are distinct in from_json - or if you want to keep it mind if I hide it behind a boolean tool param attribute called distinct? After that I'm +1 on this PR now.

@mvdbeek
Copy link
Member Author

mvdbeek commented Aug 6, 2018

it feels like if the user is supplying duplicate tags in a specific order they must have a reason

If a group is present it should correspond to 1 or more datasets, so if a group has multiple corresponding datasets, the group would appear multiple times.
At least in the differential expression example that wouldn't be helpful.

I can see how it would be helpful though in other scenarios, so a distinct attribute sounds good to me. I wonder if a with_replacement might be helpful as well, so you could pick as many times as you wanted if with_replacement was true ? Or is that the default and I want without_replacement 😕 ? Or should we go stepwise and see if we actually need it ?

@jmchilton
Copy link
Member

If a group is present it should correspond to 1 or more datasets, so if a group has multiple corresponding datasets, the group would appear multiple times.

I was I believe talking about making things distinct when the user has already selected the input and is running the tool and has somehow selected a tag more than once. I feel like you responded to making things distinct during the dropdown building. I feel like for the dropdown things are distinct and that makes perfect sense - we don't an option to change that.

I also don't understand your with_replacement attribute - is that different name for it or does that do something different?

@mvdbeek
Copy link
Member Author

mvdbeek commented Aug 6, 2018

I was I believe talking about making things distinct when the user has already selected the input and is running the tool and has somehow selected a tag more than once.

Oh, yeah, that we shouldn't do.
I mean, we shouldn't make it distinct if a user explicitly chose the same tag multiple times.

I also don't understand your with_replacement attribute - is that different name for it or does that do something different?

That was still me thinking about whether a user can select the same thing twice. Which isn't specific to group tags, that probably applies to all multiple="True" select parameters.

# Check if a dataset is selected
if not history_items:
return []
tags = set()
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So in practice we just make this a list, right ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that one is fine? I think we just drop the and tag not in tag_list at https://github.com/galaxyproject/galaxy/pull/5457/files#diff-9baf995401cfeb779edf8731ebaf0d2dR1064.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh right, I think it was a bit too late for me yesterday ...

@mvdbeek mvdbeek changed the title RFC: allow selecting subset of collection by group tag Allow selecting subset of collection by group tag Aug 7, 2018
@mvdbeek
Copy link
Member Author

mvdbeek commented Aug 7, 2018

Alright, then I guess this is ready for a final round of review!

@galaxybot galaxybot added this to the 18.09 milestone Aug 7, 2018
@jmchilton jmchilton merged commit 55436c3 into galaxyproject:dev Aug 7, 2018
@jmchilton
Copy link
Member

Amazing, thanks so much for pushing the platform in this direction - and explaining it to me slowly and repeatedly. This is very much needed work!

@galaxybot
Copy link
Contributor

This PR was merged without a 'kind/' tag, please correct.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants