Rename output in workflow fails on paired dataset collection #1675

Closed
dmaticzka opened this issue Feb 4, 2016 · 14 comments

Comments

@dmaticzka

The "Rename dataset" workflow feature fails for paired dataset collections. When renaming the output with e.g. #{input_1} or #{library}.bam, the filenames generated when saving to a file contain an empty string in place of the substituted name, e.g. "Galaxy3-[.bam].bam". The names of the pairs in the output dataset collection displayed by Galaxy are fine, however.

[Screenshot from 2016-02-04 14:10:05]

This happens for fastq-join, bowtie2 and hisat2 so it does not seem to be tool-related. For bowtie2 also reported at biostars: https://biostar.usegalaxy.org/p/14911/
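
To make the symptom concrete, here is a minimal sketch of how an unresolved rename variable could end up as an empty name in the saved filename. Both functions are hypothetical illustrations, not Galaxy's actual code; the "Galaxy<hid>-[<name>].<ext>" pattern is inferred from the filenames quoted in this issue, and the set of allowed characters is an assumption.

```python
import re
import string

# Assumption: characters kept in a download filename; anything else becomes "_".
VALID_CHARS = set(string.ascii_letters + string.digits + "._-()")


def expand_rename(template, replacements):
    """Replace each #{key} in the template; unknown keys become an empty string."""
    return re.sub(
        r"#\{(\w+)\}",
        lambda match: replacements.get(match.group(1), ""),
        template,
    )


def download_filename(hid, name, ext):
    """Build a filename following the pattern seen in this issue (hypothetical helper)."""
    safe = "".join(c if c in VALID_CHARS else "_" for c in name)
    return "Galaxy%d-[%s].%s" % (hid, safe, ext)


# No value is available for #{library}, so only the literal ".bam" survives.
broken = download_filename(3, expand_rename("#{library}.bam", {}), "bam")
print(broken)  # Galaxy3-[.bam].bam

# With a value supplied, the same template gives a usable name.
fixed = download_filename(3, expand_rename("#{library}.bam", {"library": "sampleA"}), "bam")
print(fixed)  # Galaxy3-[sampleA.bam].bam
```

If the replacement lookup returns nothing for a paired collection input, every element ends up with the same uninformative filename, which matches what is reported above.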

@bgruening
Member

I would like to hijack this issue and raise the general dataset naming issue again.
We have discussed several times before how we should name datasets and preserve a useful filename across datasets. For example here: https://trello.com/c/dQA7Y5vS.

The problem has become even more complicated now that we have started to use collections. Lines such as https://github.com/galaxyproject/tools-devteam/blob/master/tools/bowtie2/bowtie2_wrapper.xml#L485 are of only limited use in single-input mode, and nearly useless with collections. In the workflow editor we can rename datasets with #{input_1} or similar constructs; this is doable but not user-friendly and, more importantly, we have no such mechanism in analysis mode.

Not to mention downloading datasets, which often end up with unusable filenames. I guess it's time to discuss this issue once and for all and finally fix it. Maybe the Galaxy team can make a start during the retreat and discuss possibilities?

@mvdbeek
Member

mvdbeek commented Feb 18, 2016

https://github.com/galaxyproject/tools-devteam/blob/master/tools/bowtie2/bowtie2_wrapper.xml#L485

I guess this should be .element_identifier instead of .name; .element_identifier defaults to .name outside of collections.

Maybe the resolution of ${on_string} could be improved.

In general, I think element_identifier should be available in the workflow editor.
Though I have to say I haven't used paired dataset collections at all, so perhaps this is more complicated than I am aware :/
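
The fallback described above can be sketched roughly as follows. `InputWrapper`, `collection_identifier` and `output_label` are hypothetical names used only for illustration; they are not Galaxy's real wrapper classes, and the "tool on input" label format is an assumption modelled on the labels mentioned in this thread.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class InputWrapper:
    """Hypothetical stand-in for what a tool template sees for one input."""

    name: str
    # Assumption: only set when the input is an element of a collection.
    collection_identifier: Optional[str] = None

    @property
    def element_identifier(self) -> str:
        # Outside of collections, fall back to the plain dataset name.
        if self.collection_identifier is not None:
            return self.collection_identifier
        return self.name


def output_label(tool: str, wrapper: InputWrapper) -> str:
    """Label an output from the identifier rather than the raw dataset name."""
    return "%s on %s" % (tool, wrapper.element_identifier)


single = InputWrapper(name="reads.fastq")
element = InputWrapper(name="data 6", collection_identifier="sampleA")

print(output_label("Bowtie2", single))   # Bowtie2 on reads.fastq
print(output_label("Bowtie2", element))  # Bowtie2 on sampleA
```

Using the identifier keeps the single-input behaviour unchanged while giving collection elements a label that still points back to the sample.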

@mvdbeek mvdbeek self-assigned this Feb 18, 2016
@mvdbeek mvdbeek added this to the 16.04 milestone Feb 18, 2016
@bwlang
Contributor

bwlang commented Feb 18, 2016

Yes yes yes

This is the single largest complaint I get from users.

Please could things be named using both the tool and the original input
file names?

Brad

@martenson martenson modified the milestones: 16.07, 16.04 Apr 5, 2016
@lparsons
Contributor

Dataset naming is the single biggest issue I have now. Here is an attempt to collect various related issues. Perhaps someone on the Galaxy team would like to create one large ticket to collect dataset naming issues (and add to the roadmap #1928?) @jmchilton, @martenson?

Enhancements:

  1. Ability to name datasets using the element identifier: Expose element_identifier as a workflow parameter variable. #2006
  2. Ability to define the collection name in a workflow: Enhancement: Ability to name collection in a workflow #2398
  3. Name datasets according to both collection name and element identifier: Feature request - Filenames based on dataset collection identifier #2140, Download files in a list of datasets with the name from the list #2023

Bug Fixes:

  1. Renaming using the input from paired dataset collections: Rename output in workflow fails on paired dataset collection #1675, Rename output file on workflow #1686

@martenson martenson modified the milestones: 16.10, 16.07 Jul 27, 2016
@lparsons
Contributor

+1 to get this fixed in 16.10 (and backported?)

@jmchilton
Member

jmchilton commented Sep 30, 2016

I'm going to skip the middle comments here - they are serious issues and they need to be addressed - it is just that we don't really know how to address them and there isn't agreement across the team or community on how to. It is too big for this particular issue.

The issue here is that the GUI isn't showing you the "name" of the dataset - it is showing you the element identifier for that element in the collection. I don't consider this to be a bug - in most cases you want the element identifier and the "name" of collection items is irrelevant. If there is a rename post job action on a collection mapping step - the collection itself should probably be renamed usually instead of the items in the collection. There is a feature request issue I created for that - #1680. There should also be a way to see the dataset name in the GUI for people that want to IMO - but I doubt @carlfeberhard agrees and I can see the case against it pretty easily.

tl;dr) The names have changed - we just aren't showing them.

@dmaticzka
Author

I'm not concerned with what is shown by the GUI, my problem is that the rename does not work for paired collections on the file name level. Rename just drops everything resulting in filenames like "Galaxy12-[].bam". With that, there's no way to know what data this was and where it came from.

The current alternative of not doing the rename action results in filenames like "Galaxy9-[Bowtie2_on_data_2_and_data_1__aligned_reads_(sorted_BAM)].bam", here also no association between files and the elements shown by the GUI is possible. Being able to show the dataset name in the GUI would allow this, but it wouldn't be pretty :)

I have no issue with the use of element identifiers by the GUI --- my concern here is being able to identify which set belongs to which input and that works nicely when using only the GUI.

@martenson martenson modified the milestones: 17.01, 16.10 Nov 16, 2016
@martenson martenson assigned jmchilton and mvdbeek and unassigned mvdbeek Jan 12, 2017
@martenson martenson modified the milestones: 17.01, 17.05 Jan 12, 2017
jmchilton added a commit to jmchilton/galaxy that referenced this issue Apr 27, 2017
…ollections.

xref galaxyproject#1675

This is of limited utility since we don't really expose the name - and intentionally so. Related open bugs/enhancements that still need to be addressed are:

 - Applying rename to the collection (in addition to the elements) - galaxyproject#1680.
 - Download of collection elements with element identifier instead of the name: galaxyproject#2023 / galaxyproject#2140.
@jmchilton
Member

> The current alternative of not doing the rename action results in filenames like "Galaxy9-[Bowtie2_on_data_2_and_data_1__aligned_reads_(sorted_BAM)].bam", here also no association between files and the elements shown by the GUI is possible.

#3985 fixes the downloaded name so hopefully this whole issue is now moot. As such I guess I'm going to close this as a duplicate of #2140.

(If this proves not quite enough and what is actually desired is for the collection itself to be renamed by the PJA - there is another open issue #1680. Hopefully #3985 is good enough though.)

@dpryan79
Contributor

dpryan79 commented May 9, 2017

Just to clarify, after #3985, will the post-job action on paired-collections actually work to rename the individual (usually hidden) history elements? My current issue is related to what @dmaticzka reported, though in my case trying to use a post-job rename action results in the following error:

galaxy.workflow.run ERROR 2017-05-09 11:41:40,511 Failed to schedule Workflow[id=1533,name=PE DNA mapping (May 9th 2017)], problem occurred on WorkflowStep[index=2,type=tool].
Traceback (most recent call last):
  File "/galaxy-central/lib/galaxy/workflow/run.py", line 169, in invoke
    jobs = self._invoke_step( step )
  File "/galaxy-central/lib/galaxy/workflow/run.py", line 239, in _invoke_step
    jobs = step.module.execute( self.trans, self.progress, self.workflow_invocation, step )
  File "/galaxy-central/lib/galaxy/workflow/modules.py", line 1110, in execute
    self._handle_post_job_actions( step, job, invocation.replacement_dict )
  File "/galaxy-central/lib/galaxy/workflow/modules.py", line 1153, in _handle_post_job_actions
    ActionBox.execute( self.trans.app, self.trans.sa_session, pja, job, replacement_dict )
  File "/galaxy-central/lib/galaxy/jobs/actions/post.py", line 398, in execute
    ActionBox.actions[pja.action_type].execute(app, sa_session, pja, job, replacement_dict)
  File "/galaxy-central/lib/galaxy/jobs/actions/post.py", line 151, in execute
    replacement = hdca.name
AttributeError: 'DatasetCollectionElement' object has no attribute 'name'

The major annoyance is that without post-job renaming, while everything is still nicely labeled inside of collections, the element identifier isn't actually changed, so feeding a collection of mapped bam files into multiBamSummary (to use an example of a tool that uses element identifiers to label samples) still results in everything being labeled "bowtie2 on data 6 and 2" or something like that.

@jmchilton
Member

@dpryan79 The element identifier shouldn't be "bowtie2 on data 6 and 2" - that would be really odd. The element identifier should be preserved from the beginning of the workflow throughout in most cases. This is the newest multiBamSummary that includes deeptools/deepTools#500?

@dpryan79
Contributor

dpryan79 commented May 9, 2017

Yes, this is the most recent version, so that's what the element identifier is actually getting set as (this also matches what the hidden history items are named as). I'm running Galaxy 17.01, so if this is changed in the upcoming 17.05 then consider me already happy :)

@chambm
Contributor

chambm commented Jun 21, 2017

@dpryan79 I'm running up to date release_17.05 (as of yesterday) and got the same error you posted above when trying a PJA rename on paired collection.

@dpryan79
Contributor

@chambm :(

@chambm
Contributor

chambm commented Jun 21, 2017

Changing post.py:150 from:
replacement = hdca.name
to
replacement = hdca.element_identifier
Fixed it for me.
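
A slightly more defensive variant of the same idea, as a sketch only: the classes and the `replacement_for` helper below are hypothetical stand-ins and not the code in post.py. The assumption is that the rename action may receive either an object with a `name` or a collection element that only has an `element_identifier`.

```python
class PlainDataset:
    """Hypothetical stand-in for a history dataset, which has a name."""

    def __init__(self, name):
        self.name = name


class CollectionElement:
    """Hypothetical stand-in for a collection element, which has no name."""

    def __init__(self, element_identifier):
        self.element_identifier = element_identifier


def replacement_for(input_obj):
    """Prefer element_identifier, then name, then fall back to an empty string."""
    for attr in ("element_identifier", "name"):
        value = getattr(input_obj, attr, None)
        if value:
            return value
    return ""


element = CollectionElement("sampleA")

# Accessing .name directly reproduces the kind of error in the traceback above.
try:
    element.name
except AttributeError as exc:
    print("AttributeError:", exc)

print(replacement_for(element))                      # sampleA
print(replacement_for(PlainDataset("reads.fastq")))  # reads.fastq
```

Looking the attribute up with a fallback avoids the AttributeError for collection elements without changing what plain datasets resolve to.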
