Merge branch 'main' into qualimap
dorien-er authored Jul 29, 2024
2 parents e6420cd + da414e7 commit 28cd122
Showing 95 changed files with 3,940 additions and 444 deletions.
6 changes: 4 additions & 2 deletions .github/workflows/test.yaml
@@ -1,9 +1,11 @@
name: Component Testing
name: Test components

on:
pull_request:
push:
branches:
- main

jobs:
test:
uses: viash-hub/toolbox/.github/workflows/test.yaml@main
uses: viash-io/viash-actions/.github/workflows/test.yaml@v6
62 changes: 48 additions & 14 deletions CHANGELOG.md
@@ -1,31 +1,66 @@
# biobox x.x.x

## BUG FIXES
## BREAKING CHANGES

* `pear`: fix component not exiting with the correct exitcode when PEAR fails.
* `star/star_align_reads`: Change all arguments from `--camelCase` to `--snake_case` (PR #62).

* `cutadapt`: fix `--par_quality_cutoff_r2` argument.
* `star/star_genome_generate`: Change all arguments from `--camelCase` to `--snake_case` (PR #62).

* `cutadapt`: demultiplexing is now disabled by default. It can be re-enabled by using `demultiplex_mode`.
## NEW FUNCTIONALITY

* `multiqc`: update multiple separator to `;` (PR #81).
* `star/star_align_reads`: Add star solo related arguments (PR #62).

* `bd_rhapsody/bd_rhapsody_make_reference`: Create a reference for the BD Rhapsody pipeline (PR #75).

* `umitools/umitools_dedup`: Deduplicate reads based on the mapping co-ordinate and the UMI attached to the read (PR #54).

* `seqtk`:
- `seqtk/seqtk_sample`: Subsamples sequences from FASTA/Q files (PR #68).
- `seqtk/seqtk_subseq`: Extract sequences (complete or partial) from FASTA/FASTQ files
based on a provided file of sequence IDs or region coordinates (PR #85).

* `agat/agat_convert_sp_gff2gtf`: convert any GTF/GFF file into a proper GTF file (PR #76).

## MINOR CHANGES

* `busco` components: update BUSCO to `5.7.1`.
* `busco` components: update BUSCO to `5.7.1` (PR #72).

## NEW FEATURES

* `qualimap/qualimap_rnaseq`: RNA-seq QC analysis using qualimap (PR #74).

# biobox 0.1.0

## BREAKING CHANGES
* Update CI to reusable workflow in `viash-io/viash-actions` (PR #86).

## DOCUMENTATION

* Extend the contributing guidelines (PR #82):

- Update format to Viash 0.9.

- Descriptions should be formatted in markdown.

* Change default `multiple_sep` to `;` (PR #25). This aligns with an upcoming breaking change in
Viash 0.9.0 in order to avoid issues with the current default separator `:` unintentionally
splitting up certain file paths.
- Add defaults to descriptions, not as a default of the argument.

- Explain parameter expansion.

- Mention that the contents of the output of components in tests should be checked.

* Add authorship to existing components (PR #88).

## BUG FIXES

* `pear`: fix component not exiting with the correct exitcode when PEAR fails (PR #70).

* `cutadapt`: fix `--par_quality_cutoff_r2` argument (PR #69).

* `cutadapt`: demultiplexing is now disabled by default. It can be re-enabled by using `demultiplex_mode` (PR #69).

* `multiqc`: update multiple separator to `;` (PR #81).


# biobox 0.1.0

## NEW FEATURES

@@ -74,12 +109,11 @@
- `samtools/samtools_fastq`: Converts a SAM/BAM/CRAM file to FASTQ (PR #52).
- `samtools/samtools_fastq`: Converts a SAM/BAM/CRAM file to FASTA (PR #53).

* `umi_tools`:
- `umi_tools/umi_tools_extract`: Flexible removal of UMI sequences from fastq reads (PR #71).

* `falco`: A C++ drop-in replacement of FastQC to assess the quality of sequence read data (PR #43).

* `umitools`:
- `umitools_dedup`: Deduplicate reads based on the mapping co-ordinate and the UMI attached to the read (PR #54).

* `bedtools`:
- `bedtools_getfasta`: extract sequences from a FASTA file for each of the
intervals defined in a BED/GFF/VCF file (PR #59).
@@ -104,4 +138,4 @@

* Add escaping character before leading hashtag in the description field of the config file (PR #50).

* Format URL in biobase/bcl_convert description (PR #55).
* Format URL in biobase/bcl_convert description (PR #55).
151 changes: 91 additions & 60 deletions CONTRIBUTING.md
@@ -65,22 +65,21 @@ runners:
Fill in the relevant metadata fields in the config. Here is an example of the metadata of an existing component.
```yaml
functionality:
name: arriba
description: Detect gene fusions from RNA-Seq data
keywords: [Gene fusion, RNA-Seq]
links:
homepage: https://arriba.readthedocs.io/en/latest/
documentation: https://arriba.readthedocs.io/en/latest/
repository: https://github.com/suhrig/arriba
issue_tracker: https://github.com/suhrig/arriba/issues
references:
doi: 10.1101/gr.257246.119
bibtex: |
@article{
... a bibtex entry in case the doi is not available ...
}
license: MIT
name: arriba
description: Detect gene fusions from RNA-Seq data
keywords: [Gene fusion, RNA-Seq]
links:
homepage: https://arriba.readthedocs.io/en/latest/
documentation: https://arriba.readthedocs.io/en/latest/
repository: https://github.com/suhrig/arriba
issue_tracker: https://github.com/suhrig/arriba/issues
references:
doi: 10.1101/gr.257246.119
bibtex: |
@article{
... a bibtex entry in case the doi is not available ...
}
license: MIT
```
### Step 4: Find a suitable container
@@ -162,7 +161,7 @@ argument_groups:
type: file
description: |
File in SAM/BAM/CRAM format with main alignments as generated by STAR
(Aligned.out.sam). Arriba extracts candidate reads from this file.
(`Aligned.out.sam`). Arriba extracts candidate reads from this file.
required: true
example: Aligned.out.bam
```
@@ -175,7 +174,7 @@ Several notes:

* Input arguments can have `multiple: true` to allow the user to specify multiple files (see the sketch below).


* The description should be formatted in markdown.
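
A minimal sketch illustrating both notes (the `--input` name and example value are purely illustrative):

```yaml
argument_groups:
  - name: Inputs
    arguments:
      - name: --input
        type: file
        description: |
          One or more FASTQ files to process. Multiple values are separated by `;`.
        required: true
        multiple: true
        example: reads_R1.fastq
```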

### Step 8: Add arguments for the output files

@@ -220,7 +219,7 @@ argument_groups:

Note:

* Preferably, these outputs should not be directores but files. For example, if a tool outputs a directory `foo/` containing files `foo/bar.txt` and `foo/baz.txt`, there should be two output arguments `--bar` and `--baz` (as opposed to one output argument which outputs the whole `foo/` directory).
* Preferably, these outputs should not be directories but files. For example, if a tool outputs a directory `foo/` containing files `foo/bar.txt` and `foo/baz.txt`, there should be two output arguments `--bar` and `--baz` (as opposed to one output argument which outputs the whole `foo/` directory).
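
A minimal sketch of what that might look like, reusing the hypothetical `--bar`/`--baz` names:

```yaml
argument_groups:
  - name: Outputs
    arguments:
      - name: --bar
        type: file
        direction: output
        description: Path to write the bar output file to.
        example: bar.txt
      - name: --baz
        type: file
        direction: output
        description: Path to write the baz output file to.
        example: baz.txt
```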

### Step 9: Add arguments for the other arguments

@@ -230,6 +229,8 @@ Finally, add all other arguments to the config file. There are a few exceptions:

* Arguments related to printing information, such as printing the version (`-v`, `--version`) or the help (`-h`, `--help`), should not be added to the config file.

* If the tool's help lists default values, do not set them as argument defaults; mention them in the description instead. Example: `description: <Explanation of parameter>. Default: 10.`

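A minimal sketch of this convention (the `--min_quality` argument is illustrative, not taken from a real tool):

```yaml
arguments:
  - name: --min_quality
    type: integer
    description: |
      Minimum base quality required to keep a read. Default: 10.
```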

### Step 10: Add a Docker engine

@@ -275,10 +276,13 @@ Next, we need to write a runner script that runs the tool with the input arguments
## VIASH START
## VIASH END
# unset flags
[[ "$par_option" == "false" ]] && unset par_option
xxx \
--input "$par_input" \
--output "$par_output" \
$([ "$par_option" = "true" ] && echo "--option")
${par_option:+--option}
```

When building a Viash component, Viash will automatically replace the `## VIASH START` and `## VIASH END` lines (and anything in between) with environment variables based on the arguments specified in the config.
Expand All @@ -291,82 +295,107 @@ As an example, this is what the Bash script for the `arriba` component looks lik
## VIASH START
## VIASH END
# unset flags
[[ "$par_skip_duplicate_marking" == "false" ]] && unset par_skip_duplicate_marking
[[ "$par_extra_information" == "false" ]] && unset par_extra_information
[[ "$par_fill_gaps" == "false" ]] && unset par_fill_gaps
arriba \
-x "$par_bam" \
-a "$par_genome" \
-g "$par_gene_annotation" \
-o "$par_fusions" \
${par_known_fusions:+-k "${par_known_fusions}"} \
${par_blacklist:+-b "${par_blacklist}"} \
${par_structural_variants:+-d "${par_structural_variants}"} \
$([ "$par_skip_duplicate_marking" = "true" ] && echo "-u") \
$([ "$par_extra_information" = "true" ] && echo "-X") \
$([ "$par_fill_gaps" = "true" ] && echo "-I")
# ...
${par_extra_information:+-X} \
${par_fill_gaps:+-I}
```
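
For reference, during development and testing Viash fills the block between `## VIASH START` and `## VIASH END` with variable assignments derived from the config. A hypothetical sketch of what that injected block could look like for the example above (values are made up):

```bash
## VIASH START
# Hypothetical values injected by Viash; names follow the par_* convention.
par_bam="test_data/A.bam"
par_genome="test_data/genome.fasta"
par_gene_annotation="test_data/annotation.gtf"
par_fusions="fusions.tsv"
par_skip_duplicate_marking="false"
## VIASH END
```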

Notes:

* If your arguments can contain special characters (e.g. `$`), use Bash's [`@Q` parameter transformation](https://www.gnu.org/software/bash/manual/html_node/Shell-Parameter-Expansion.html) to quote the value safely. Example: `-x ${par_bam@Q}`.

* Optional arguments can be passed to the command conditionally using Bash [parameter expansion](https://www.gnu.org/software/bash/manual/html_node/Shell-Parameter-Expansion.html). For example: `${par_known_fusions:+-k ${par_known_fusions@Q}}`

* If your tool allows for multiple inputs using a separator other than `;` (which is the default Viash multiple separator), you can substitute these values with a command like: `par_disable_filters=$(echo $par_disable_filters | tr ';' ',')`.
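
A minimal sketch combining these patterns; the tool name `xxx` and its flags are placeholders, not a real CLI:

```bash
#!/bin/bash

## VIASH START
## VIASH END

# unset boolean flags that were set to "false"
[[ "$par_verbose" == "false" ]] && unset par_verbose

# convert Viash's `;` multiple separator into the comma separator the tool expects
par_disable_filters=$(echo "$par_disable_filters" | tr ';' ',')

xxx \
  --input ${par_input@Q} \
  ${par_disable_filters:+--disable-filters "$par_disable_filters"} \
  ${par_verbose:+--verbose}
```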


### Step 12: Create test script

If the unit test requires test resources, these should be provided in the `test_resources` section of the component.

```yaml
functionality:
# ...
test_resources:
- type: bash_script
path: test.sh
- type: file
path: test_data
test_resources:
- type: bash_script
path: test.sh
- type: file
path: test_data
```

Create a test script at `src/xxx/test.sh` that runs the component with the test data. This script should run the component (available with `$meta_executable`) with the test data and check if the output is as expected. The script should exit with a non-zero exit code if the output is not as expected. For example:

```bash
#!/bin/bash
set -e
## VIASH START
## VIASH END
echo "> Run xxx with test data"
#############################################
# helper functions
assert_file_exists() {
[ -f "$1" ] || { echo "File '$1' does not exist" && exit 1; }
}
assert_file_doesnt_exist() {
[ ! -f "$1" ] || { echo "File '$1' exists but shouldn't" && exit 1; }
}
assert_file_empty() {
[ ! -s "$1" ] || { echo "File '$1' is not empty but should be" && exit 1; }
}
assert_file_not_empty() {
[ -s "$1" ] || { echo "File '$1' is empty but shouldn't be" && exit 1; }
}
assert_file_contains() {
grep -q "$2" "$1" || { echo "File '$1' does not contain '$2'" && exit 1; }
}
assert_file_not_contains() {
grep -q "$2" "$1" && { echo "File '$1' contains '$2' but shouldn't" && exit 1; }
}
assert_file_contains_regex() {
grep -q -E "$2" "$1" || { echo "File '$1' does not contain '$2'" && exit 1; }
}
assert_file_not_contains_regex() {
grep -q -E "$2" "$1" && { echo "File '$1' contains '$2' but shouldn't" && exit 1; }
}
#############################################
echo "> Run $meta_name with test data"
"$meta_executable" \
--input "$meta_resources_dir/test_data/input.txt" \
--input "$meta_resources_dir/test_data/reads_R1.fastq" \
--output "output.txt" \
--option
echo ">> Checking output"
[ ! -f "output.txt" ] && echo "Output file output.txt does not exist" && exit 1
```
echo ">> Check if output exists"
assert_file_exists "output.txt"
echo ">> Check if output is empty"
assert_file_not_empty "output.txt"
For example, this is what the test script for the `arriba` component looks like:
echo ">> Check if output is correct"
assert_file_contains "output.txt" "some expected output"
```bash
#!/bin/bash
echo "> All tests succeeded!"
```

## VIASH START
## VIASH END
Notes:

echo "> Run arriba with blacklist"
"$meta_executable" \
--bam "$meta_resources_dir/test_data/A.bam" \
--genome "$meta_resources_dir/test_data/genome.fasta" \
--gene_annotation "$meta_resources_dir/test_data/annotation.gtf" \
--blacklist "$meta_resources_dir/test_data/blacklist.tsv" \
--fusions "fusions.tsv" \
--fusions_discarded "fusions_discarded.tsv" \
--interesting_contigs "1,2"
echo ">> Checking output"
[ ! -f "fusions.tsv" ] && echo "Output file fusions.tsv does not exist" && exit 1
[ ! -f "fusions_discarded.tsv" ] && echo "Output file fusions_discarded.tsv does not exist" && exit 1
* Always check the contents of the output file. If the output is not deterministic, you can use regular expressions to check it.

echo ">> Check if output is empty"
[ ! -s "fusions.tsv" ] && echo "Output file fusions.tsv is empty" && exit 1
[ ! -s "fusions_discarded.tsv" ] && echo "Output file fusions_discarded.tsv is empty" && exit 1
```
* If possible, generate your own test data instead of copying it from an external resource.
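
To run the unit test locally, `viash test` can be pointed at the component's config file (the path below is illustrative):

```bash
viash test src/xxx/config.vsh.yaml
```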

### Step 12: Create a `/var/software_versions.txt` file
### Step 13: Create a `/var/software_versions.txt` file

For the sake of transparency and reproducibility, we require that the versions of the software used in the component are documented.

@@ -378,6 +407,8 @@ engines:
image: quay.io/biocontainers/xxx:0.1.0--py_0
setup:
- type: docker
# note: /var/software_versions.txt should contain:
# arriba: "2.4.0"
run: |
echo "xxx: \"0.1.0\"" > /var/software_versions.txt
```
14 changes: 14 additions & 0 deletions src/_authors/angela_o_pisco.yaml
@@ -0,0 +1,14 @@
name: Angela Oliveira Pisco
info:
role: Contributor
links:
github: aopisco
orcid: "0000-0003-0142-2355"
linkedin: aopisco
organizations:
- name: Insitro
href: https://insitro.com
role: Director of Computational Biology
- name: Open Problems
href: https://openproblems.bio
role: Core Member
10 changes: 10 additions & 0 deletions src/_authors/dorien_roosen.yaml
@@ -0,0 +1,10 @@
name: Dorien Roosen
info:
links:
email: [email protected]
github: dorien-er
linkedin: dorien-roosen
organizations:
- name: Data Intuitive
href: https://www.data-intuitive.com
role: Data Scientist
11 changes: 11 additions & 0 deletions src/_authors/dries_schaumont.yaml
@@ -0,0 +1,11 @@
name: Dries Schaumont
info:
links:
email: [email protected]
github: DriesSchaumont
orcid: "0000-0002-4389-0440"
linkedin: dries-schaumont
organizations:
- name: Data Intuitive
href: https://www.data-intuitive.com
role: Data Scientist