Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cutadapt #7

Merged
merged 41 commits into from
Jun 21, 2024
Merged

cutadapt #7

merged 41 commits into from
Jun 21, 2024

Conversation

tverbeiren
Copy link
Contributor

Description

Add cutadapt to the repo.

@tverbeiren
Copy link
Contributor Author

Some food for thought:

Prefixes for FASTA files

-g and -G require an argument of the form file://<file>. But only if one wants to read from a FASTA file. In another pipeline, we solved this by letting the user specify a file and then in the script prepending file: to it. If we don't do it like this, Viash will not be able to verify the existence of the input file.

But what if the input file is not a FASTA file? Then the -g and -G take a string argument. Moreover, if we allow Viash to accept a string, how can we tell it needs to have the file: prefix added?

-a, -A, -b and -B behave the same.

Infrastructure argument

Cutadapt has a -j/--cores argument to specify the number of cores to run on. I suggest to not allow the user to set that for the Viash component. --cores 0 auto-detects.

What is the output?

Cutadapt allows to specify the output filenames of R1 and R2 separately. This is done using a form of variable substitution: -o "{name}_R1_001.fastq". I think, however, it makes more sense to provide an output directory, so that it does not matter anymore if cutadapt is run on paired or unpaired data when it comes to the output of the component.

Of course, downstream this will have to be managed.

@rcannood
Copy link
Contributor

Some food for thought:

Prefixes for FASTA files

-g and -G require an argument of the form file://<file>. But only if one wants to read from a FASTA file. In another pipeline, we solved this by letting the user specify a file and then in the script prepending file: to it. If we don't do it like this, Viash will not be able to verify the existence of the input file.

But what if the input file is not a FASTA file? Then the -g and -G take a string argument. Moreover, if we allow Viash to accept a string, how can we tell it needs to have the file: prefix added?
-a, -A, -b and -B behave the same.

Alternatively, you could define two arguments:

arguments:
  - name: --adapter_file
    type: file
  - name: --adapter_string
    type: string

If both are defined, throw an error.

Infrastructure argument

Cutadapt has a -j/--cores argument to specify the number of cores to run on. I suggest to not allow the user to set that for the Viash component. --cores 0 auto-detects.

Makes sense!

What is the output?

Cutadapt allows to specify the output filenames of R1 and R2 separately. This is done using a form of variable substitution: -o "{name}_R1_001.fastq". I think, however, it makes more sense to provide an output directory, so that it does not matter anymore if cutadapt is run on paired or unpaired data when it comes to the output of the component.

Of course, downstream this will have to be managed.

will respond later ^^

Copy link
Contributor

@rcannood rcannood left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great PR! It's always funny to see GitHub not wanting to show our config.vsh.yaml because it's too large :)

I requested some changes! 🙇

src/cutadapt/config.vsh.yaml Outdated Show resolved Hide resolved
src/cutadapt/config.vsh.yaml Outdated Show resolved Hide resolved
src/cutadapt/config.vsh.yaml Outdated Show resolved Hide resolved
src/cutadapt/config.vsh.yaml Outdated Show resolved Hide resolved
src/cutadapt/config.vsh.yaml Outdated Show resolved Hide resolved
src/cutadapt/config.vsh.yaml Outdated Show resolved Hide resolved
src/cutadapt/config.vsh.yaml Outdated Show resolved Hide resolved
Comment on lines 399 to 408
- name: --output
type: file
description: |
Write trimmed reads to this directory and name the files using {name}.
FASTQ or FASTA format is chosen depending on input.
Summary report is sent to standard output.
default: output
direction: output
required: true
must_exist: true
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we split up the output to separate output files?

CONTRIBUTING.md:

Step 8: Add arguments for output files

...
Note:
Preferably, these outputs should not be directores but files. For example, if a tool outputs a directory foo/ containing files foo/bar.txt and foo/baz.txt, there should be two output arguments --bar and --baz (as opposed to one output argument which outputs the whole foo/ directory).

Especially since cutadapt also doesn't rely in an output directory,

    ${par_json:+--json "${par_output}/report.json"} \
    --output "$par_output/{name}_R1_001.fastq" \
    --paired-output "$par_output/{name}_R2_001.fastq" \

Would need to be changed to

    ${par_json:+--json "${par_json}"} \
    --output "$par_output" \
    ${par_output_r2:+--paired-output "$par_output_r2"} \

or something to that extend

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cutadapt implicitly creates a directory because it generates a number of output fastq files depending on the input data and the arguments provided. Hence the {name} variable. I would like to avoid that users have to set those 'variable' output paths explicitly.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that it would be cumbersome if the users would have to set the variable output paths explicitly. I now understand that cutadapt returns a variable number of outputs depending on how many adapters the user provides.

I'm still a bit on the fence because creating one big directory with all of the outputs makes it harder for downstream components to process the files.

Can we set something up with type: file, multiple: true such that, by default, the user can use --output to get all of the outputs in one directory, but if they provide paths with wildcards the fasta's / fastq's will be written to those paths instead?

  • --output_dir: contains all of the files below
  • --report_txt: report.txt
  • --report_json: report.json
  • --output: output_R1.fasta/q if paired, else output.fasta/q
  • --output_r2: output_*_R2.fasta/q

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just remembered that type: file, multiple: true, direction: output is currently broken and will be fixed in viash-io/viash#639. Will try to finish this PR ASAP.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not really blocking I think because the bug appears in the VDSL3 module.

Copy link
Contributor Author

@tverbeiren tverbeiren Feb 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Coming back to your suggestion, @rcannood:

I have a version in development that does what you suggest: take both an --output_dir as well as an --output argument. The first is just one directory, the second is multiple: true. This allows me to, for instance, run this in the tests:

"$meta_executable" \
  --output_dir out_test1 \
  --output out_test1/1_001.fasta \
  --output out_test1/unknown_001.fasta \
  ...

Or, alternatively --output out_test1/1_001.fasta;out_test1/unknown_001.fasta.

I agree it's nice to be able to capture the output explicitly rather than a directory, but it requires one to know the result of the cutadapt step prior to running it which will depend on the case at hand.

Ideally though, I would prefer to do something like:

"$meta_executable" \
  --output_dir out_test1 \
  --output out_test1/*.fast? \
  ...

This, however, does not work because of the way the executable is built (set -f disallows globbing):

# check whether required files exist
if [ ! -z "$VIASH_PAR_OUTPUT" ]; then
  IFS=';'
  set -f
  for file in $VIASH_PAR_OUTPUT; do
    unset IFS
    if [ ! -e "$file" ]; then
      ViashError "Output file '$file' does not exist."
      exit 1
    fi
  done
  set +f
fi

BTW: The way we deal with this in a Nextflow pipeline is by adding a map and glob inside the file() function of Nextflow.

Now, how to proceed?

  1. Look into adapting the behaviour of Viash? (set -f)
  2. Stick to what works with the individual output as illustrated above?
  3. Just output the directory name and omit the individual output files?
  4. (1) and (2)?

src/cutadapt/script.sh Outdated Show resolved Hide resolved
src/cutadapt/script.sh Outdated Show resolved Hide resolved
@tverbeiren tverbeiren mentioned this pull request Feb 26, 2024
10 tasks
@rcannood
Copy link
Contributor

Currently waiting for the next version of Viash to be released so we can get {type: file, multiple: true, direction: output} to work.

@rcannood
Copy link
Contributor

Viash was updated to 0.9.0-rc3 in #51, so we can finally update this PR 😁

@tverbeiren tverbeiren requested a review from rcannood June 5, 2024 15:10
@tverbeiren
Copy link
Contributor Author

tverbeiren commented Jun 5, 2024

Update: The --paired-output cutadapt argument is supported differently in this version. If paired input is provided, paired output is automatically written to $output_dir/<sample>_R[12]_001.... The output argument globs over the reads as well.

We could introduce --output_dir_r2 and --output_r2 but I don't see much sense in practical applications of that. It adds complexity and would only be useful when different reads need to end up in different directories.

WDYT @rcannood ?

Copy link
Contributor

@rcannood rcannood left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor change requested, otherwise LGTM

src/cutadapt/config.vsh.yaml Outdated Show resolved Hide resolved
@tverbeiren
Copy link
Contributor Author

It turns out .references should be under .links.

@rcannood rcannood merged commit ce9b845 into main Jun 21, 2024
6 of 7 checks passed
@rcannood rcannood deleted the dev/cutadapt branch June 21, 2024 16:03
dorien-er pushed a commit that referenced this pull request Jul 11, 2024
* First commit, clone of cutadapt in htrnaseq + help.txt

* Add config

* Don't allow multiple: true when providing a FASTA file with adapters

* First version of script

* Updates and fixes - se/pe

* Add tests and fix --json argument

* Add software version

* Better consistency in using snake_case

* Update src/cutadapt/config.vsh.yaml

Co-authored-by: Robrecht Cannoodt <[email protected]>

* Update src/cutadapt/config.vsh.yaml

Co-authored-by: Robrecht Cannoodt <[email protected]>

* Update src/cutadapt/config.vsh.yaml

Co-authored-by: Robrecht Cannoodt <[email protected]>

* Specify --input and --input_r2 as separate arguments

* Avoid specifying default arg values

* Add more information to `--minimum_length` and `maximum_length`

* Add --cpus by means of $meta_cpus and set proper default

* Allow multiple for adapters/fasta and add test

* change multiple_sep to ';'

* add example

* simplify code with a helper function

* create directories in test

* use a different output extension if --fasta is provided

* decrease code duplication by separating optional outputs from paired/unpaired output arguments

* write custom tests for cutadapt

* fix _r2 arguments

* add debug flag as not to always print the cli command

* remove comment

* Update to Viash 0.9.0-RC4

* Ability to specify output globbing patterns

* Avoid the need for both output_dir and output

* Move fields from `info` to `links`

Co-authored-by: Robrecht Cannoodt <[email protected]>

* Move references back to the info field

* apologies, I proposed a wrong syntax

---------

Co-authored-by: Robrecht Cannoodt <[email protected]>
rcannood added a commit that referenced this pull request Aug 21, 2024
* first version

* complete script for qualimap

* add escaping character before leading hashtag (#50)

* add escaping character before leading hashtag

* update changelog

* Update CHANGELOG.md

Co-authored-by: Robrecht Cannoodt <[email protected]>

* replace escaping \ by \\

---------

Co-authored-by: Robrecht Cannoodt <[email protected]>

* Samtools collate (#49)

* initial commit dedup

* Revert "initial commit dedup"

This reverts commit 38f586b.

* Initial commit, whole component is functional

* Update viash (#51)

* update viash

* update readme

* update changelog

* update changelog

* fix incorrect heading detection

* update again

* clean up readme

* Samtools view (#48)

* initial commit dedup

* Revert "initial commit dedup"

This reverts commit 38f586b.

* initial version with a few tests, script, and config file

* update changelog, add one test

* add a 4th test, fix option names in the script

* Fix name of component in config

* remove option named with a number

* add must_exist to input file argument

* removed "default: null" from one of the arguments in config

* remove utf8 characters from config

* Update CHANGELOG.md

---------

Co-authored-by: Robrecht Cannoodt <[email protected]>

* Samtools fastq (#52)

* initial commit dedup

* Revert "initial commit dedup"

This reverts commit 38f586b.

* Initial commit, config, script, help and test_data

* Update changelog, add tests, fix argument naming errors, add test data

* update changelog, remove gffread namespace field

---------

Co-authored-by: Robrecht Cannoodt <[email protected]>

* format URL in the description (#55)

* format URL in the description

* update changelog

* Change name in _viash.yaml (#60)

* Update operational code (#63)

* update readme

* switch ci to toolbox

* update to viash 0.9.0-RC6

* edit keywords

* fix version

* update biobox

* cutadapt (#7)

* First commit, clone of cutadapt in htrnaseq + help.txt

* Add config

* Don't allow multiple: true when providing a FASTA file with adapters

* First version of script

* Updates and fixes - se/pe

* Add tests and fix --json argument

* Add software version

* Better consistency in using snake_case

* Update src/cutadapt/config.vsh.yaml

Co-authored-by: Robrecht Cannoodt <[email protected]>

* Update src/cutadapt/config.vsh.yaml

Co-authored-by: Robrecht Cannoodt <[email protected]>

* Update src/cutadapt/config.vsh.yaml

Co-authored-by: Robrecht Cannoodt <[email protected]>

* Specify --input and --input_r2 as separate arguments

* Avoid specifying default arg values

* Add more information to `--minimum_length` and `maximum_length`

* Add --cpus by means of $meta_cpus and set proper default

* Allow multiple for adapters/fasta and add test

* change multiple_sep to ';'

* add example

* simplify code with a helper function

* create directories in test

* use a different output extension if --fasta is provided

* decrease code duplication by separating optional outputs from paired/unpaired output arguments

* write custom tests for cutadapt

* fix _r2 arguments

* add debug flag as not to always print the cli command

* remove comment

* Update to Viash 0.9.0-RC4

* Ability to specify output globbing patterns

* Avoid the need for both output_dir and output

* Move fields from `info` to `links`

Co-authored-by: Robrecht Cannoodt <[email protected]>

* Move references back to the info field

* apologies, I proposed a wrong syntax

---------

Co-authored-by: Robrecht Cannoodt <[email protected]>

* update changelog

* update readme

* Update salmon quant arguments (#57)

* Make index an optional argument

* FIx argument type and add optional argument

* FEAT: add bedtools getfasta. (#59)

* FEAT: add bedtools getfasta.

* Add PR number to CHANGELOG

* Add star genomegenerate component (#58)

* Add star genomegenerate component

* Update changelog

* Rename component

* Update test

* Update CHANGELOG.md

---------

Co-authored-by: Robrecht Cannoodt <[email protected]>

* fix package config (#65)

* Delete src/bgzip directory (#64)

It was moved to toolbox

* Output alignments to the transcriptome (#56)

* Output alignments to  the transcriptome

* Change argument name

* BUG: pear component failure is ignored (#70)

* FEAT + BUG: cutadapt; allowing disabling demultiplexing and fix par_quality_cutoff_r2 (#69)

* FEAT: Disable cutadapt demultiplexing by default

* Cutadapt: fix --par_quality_cutoff_r2

* FEAT: update busco to 5.7.1 (#72)

* FEAT: update busco to 5.7.1

* Typo

* Samtools fasta (#53)

* initial commit dedup

* Revert "initial commit dedup"

This reverts commit 38f586b.

* Fasta component

* change script resource to samtools_fastq script, with dummy argument to specify the command

* add dummy argument to samtools_fastq to share the script with samtools_fasta

* fix path to script in config

* Update src/samtools/samtools_fastq/script.sh

Co-authored-by: Robrecht Cannoodt <[email protected]>

* Change default fields to examples

* Two more default fields changed to examples

* Minor formatting changes

* Markdown formatting changes in configs

---------

Co-authored-by: Robrecht Cannoodt <[email protected]>

* Umi tools dedup (#54)

* initial commit dedup

* Revert "initial commit dedup"

This reverts commit 38f586b.

* inital commit dedup

* Working component with one test

* Update test 1 and test data, fix some arg types in config and script

* test data files and changes to script

* Add third test and test data

* Fix typo in script

* remove utf8 characters in config

* Add choices fields and change default fields to exampels

* Minor formatting changes

* md formatting changes in config

* Fix typo (#79)

* add vscode to gitignore

* update multiple separator (#81)

* update multiple separator

* update changelog

* Update src/multiqc/config.vsh.yaml

Co-authored-by: Robrecht Cannoodt <[email protected]>

* Update src/multiqc/config.vsh.yaml

Co-authored-by: Robrecht Cannoodt <[email protected]>

* Update src/multiqc/config.vsh.yaml

Co-authored-by: Robrecht Cannoodt <[email protected]>

* Update src/multiqc/config.vsh.yaml

Co-authored-by: Robrecht Cannoodt <[email protected]>

* update ifs

---------

Co-authored-by: Robrecht Cannoodt <[email protected]>

* add test data

* add tests

* update changelog

* remove unrequired test data

* update descriptions

* update changelog

* update help text

* Update src/qualimap/qualimap_rnaseq/script.sh

Co-authored-by: Robrecht Cannoodt <[email protected]>

* update unit tests

* update unit tests

* addres pr changes request

* add version

* remove whitespace multiqc

* Apply suggestions from code review

Co-authored-by: Robrecht Cannoodt <[email protected]>

* address pr comments

* Update CHANGELOG.md

* fix doi

* Fix name

* update version and container image

* write software version to file

---------

Co-authored-by: dorien-er <[email protected]>
Co-authored-by: Leila011 <[email protected]>
Co-authored-by: Robrecht Cannoodt <[email protected]>
Co-authored-by: emmarousseau <[email protected]>
Co-authored-by: Sai Nirmayi Yasa <[email protected]>
Co-authored-by: Dries Schaumont <[email protected]>
Co-authored-by: Dorien <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants