Merge branch 'main' into seqtk_subseq
rcannood authored Jul 18, 2024
2 parents 32c084d + e8b82b5 commit 3dfc028
Showing 74 changed files with 1,510 additions and 140 deletions.
6 changes: 4 additions & 2 deletions .github/workflows/test.yaml
@@ -1,9 +1,11 @@
name: Component Testing
name: Test components

on:
pull_request:
push:
branches:
- main

jobs:
test:
uses: viash-hub/toolbox/.github/workflows/test.yaml@main
uses: viash-io/viash-actions/.github/workflows/test.yaml@v6
54 changes: 37 additions & 17 deletions CHANGELOG.md
@@ -1,27 +1,52 @@
# biobox x.x.x

## BUG FIXES
## NEW FEATURES

* `pear`: fix component not exiting with the correct exitcode when PEAR fails.
* `bd_rhapsody/bd_rhapsody_make_reference`: Create a reference for the BD Rhapsody pipeline (PR #75).

* `cutadapt`: fix `--par_quality_cutoff_r2` argument.
* `umitools/umitools_dedup`: Deduplicate reads based on the mapping co-ordinate and the UMI attached to the read (PR #54).

* `cutadapt`: demultiplexing is now disabled by default. It can be re-enabled by using `demultiplex_mode`.
* `seqtk`:
- `seqtk/seqtk_sample`: Subsamples sequences from FASTA/Q files (PR #68).
- `seqtk/seqtk_subseq`: Extract the sequences (complete or subsequence) from the FASTA/FASTQ files
based on a provided sequence IDs or region coordinates file (PR #85).

* `multiqc`: update multiple separator to `;` (PR #81).
* `agat/agat_convert_sp_gff2gtf`: convert any GTF/GFF file into a proper GTF file (PR #76).

## MINOR CHANGES

* `busco` components: update BUSCO to `5.7.1`.
* `busco` components: update BUSCO to `5.7.1` (PR #72).

# biobox 0.1.0
* Update CI to reusable workflow in `viash-io/viash-actions` (PR #86).

## BREAKING CHANGES
## DOCUMENTATION

* Change default `multiple_sep` to `;` (PR #25). This aligns with an upcoming breaking change in
Viash 0.9.0 in order to avoid issues with the current default separator `:` unintentionally
splitting up certain file paths.
* Extend the contributing guidelines (PR #82):

- Update format to Viash 0.9.

- Descriptions should be formatted in markdown.

- Add defaults to descriptions, not as a default of the argument.

- Explain parameter expansion.

- Mention that the contents of the output of components in tests should be checked.

* Add authorship to existing components (PR #88).

## BUG FIXES

* `pear`: fix component not exiting with the correct exitcode when PEAR fails (PR #70).

* `cutadapt`: fix `--par_quality_cutoff_r2` argument (PR #69).

* `cutadapt`: demultiplexing is now disabled by default. It can be re-enabled by using `demultiplex_mode` (PR #69).

* `multiqc`: update multiple separator to `;` (PR #81).


# biobox 0.1.0

## NEW FEATURES

Expand Down Expand Up @@ -73,17 +98,12 @@

* `falco`: A C++ drop-in replacement of FastQC to assess the quality of sequence read data (PR #43).

* `umitools`:
- `umitools_dedup`: Deduplicate reads based on the mapping co-ordinate and the UMI attached to the read (PR #54).
* `seqtk/seqtk_sample`: Sample sequences from FASTA/Q(.gz) files to FASTA/Q (PR #68).

* `bedtools`:
- `bedtools_getfasta`: extract sequences from a FASTA file for each of the
intervals defined in a BED/GFF/VCF file (PR #59).

* `seqtk`:
- `subseq`: Extract the sequences (complete or subsequence) from the FASTA/FASTQ files
based on a provided sequence IDs or region coordinates file (PR #85).

## MINOR CHANGES

* Uniformize component metadata (PR #23).
151 changes: 91 additions & 60 deletions CONTRIBUTING.md
@@ -65,22 +65,21 @@ runners:
Fill in the relevant metadata fields in the config. Here is an example of the metadata of an existing component.
```yaml
functionality:
name: arriba
description: Detect gene fusions from RNA-Seq data
keywords: [Gene fusion, RNA-Seq]
links:
homepage: https://arriba.readthedocs.io/en/latest/
documentation: https://arriba.readthedocs.io/en/latest/
repository: https://github.com/suhrig/arriba
issue_tracker: https://github.com/suhrig/arriba/issues
references:
doi: 10.1101/gr.257246.119
bibtex: |
@article{
... a bibtex entry in case the doi is not available ...
}
license: MIT
name: arriba
description: Detect gene fusions from RNA-Seq data
keywords: [Gene fusion, RNA-Seq]
links:
homepage: https://arriba.readthedocs.io/en/latest/
documentation: https://arriba.readthedocs.io/en/latest/
repository: https://github.com/suhrig/arriba
issue_tracker: https://github.com/suhrig/arriba/issues
references:
doi: 10.1101/gr.257246.119
bibtex: |
@article{
... a bibtex entry in case the doi is not available ...
}
license: MIT
```
### Step 4: Find a suitable container
@@ -162,7 +161,7 @@ argument_groups:
type: file
description: |
File in SAM/BAM/CRAM format with main alignments as generated by STAR
(Aligned.out.sam). Arriba extracts candidate reads from this file.
(`Aligned.out.sam`). Arriba extracts candidate reads from this file.
required: true
example: Aligned.out.bam
```
@@ -175,7 +174,7 @@ Several notes:

* Input arguments can have `multiple: true` to allow the user to specify multiple files.


* The description should be formatted in markdown.

### Step 8: Add arguments for the output files

@@ -220,7 +219,7 @@ argument_groups:

Note:

* Preferably, these outputs should not be directores but files. For example, if a tool outputs a directory `foo/` containing files `foo/bar.txt` and `foo/baz.txt`, there should be two output arguments `--bar` and `--baz` (as opposed to one output argument which outputs the whole `foo/` directory).
* Preferably, these outputs should not be directories but files. For example, if a tool outputs a directory `foo/` containing files `foo/bar.txt` and `foo/baz.txt`, there should be two output arguments `--bar` and `--baz` (as opposed to one output argument which outputs the whole `foo/` directory).

### Step 9: Add arguments for the other arguments

@@ -230,6 +229,8 @@ Finally, add all other arguments to the config file. There are a few exceptions:

* Arguments related to printing the information such as printing the version (`-v`, `--version`) or printing the help (`-h`, `--help`) should not be added to the config file.

* If the help lists defaults, do not add them as defaults but to the description. Example: `description: <Explanation of parameter>. Default: 10.`


### Step 10: Add a Docker engine

@@ -275,10 +276,13 @@ Next, we need to write a runner script that runs the tool with the input argumen
## VIASH START
## VIASH END
# unset flags
[[ "$par_option" == "false" ]] && unset par_option
xxx \
--input "$par_input" \
--output "$par_output" \
$([ "$par_option" = "true" ] && echo "--option")
${par_option:+--option}
```

When building a Viash component, Viash will automatically replace the `## VIASH START` and `## VIASH END` lines (and anything in between) with environment variables based on the arguments specified in the config.
@@ -291,82 +295,107 @@ As an example, this is what the Bash script for the `arriba` component looks lik
## VIASH START
## VIASH END
# unset flags
[[ "$par_skip_duplicate_marking" == "false" ]] && unset par_skip_duplicate_marking
[[ "$par_extra_information" == "false" ]] && unset par_extra_information
[[ "$par_fill_gaps" == "false" ]] && unset par_fill_gaps
arriba \
-x "$par_bam" \
-a "$par_genome" \
-g "$par_gene_annotation" \
-o "$par_fusions" \
${par_known_fusions:+-k "${par_known_fusions}"} \
${par_blacklist:+-b "${par_blacklist}"} \
${par_structural_variants:+-d "${par_structural_variants}"} \
$([ "$par_skip_duplicate_marking" = "true" ] && echo "-u") \
$([ "$par_extra_information" = "true" ] && echo "-X") \
$([ "$par_fill_gaps" = "true" ] && echo "-I")
# ...
${par_extra_information:+-X} \
${par_fill_gaps:+-I}
```

Notes:

### Step 12: Create test script
* If your arguments can contain special variables (e.g. `$`), you can use quoting (need to find a documentation page for this) to make sure you can use the string as input. Example: `-x ${par_bam@Q}`.

* Optional arguments can be passed to the command conditionally using Bash [parameter expansion](https://www.gnu.org/software/bash/manual/html_node/Shell-Parameter-Expansion.html). For example: `${par_known_fusions:+-k ${par_known_fusions@Q}}`

* If your tool allows for multiple inputs using a separator other than `;` (which is the default Viash multiple separator), you can substitute these values with a command like: `par_disable_filters=$(echo $par_disable_filters | tr ';' ',')`.
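
As a minimal sketch of the notes above, the following script shows how unset flags, `${var:+...}` expansion and the `tr` substitution interact. The parameter values, the `-f` option and the `show_args` helper are hypothetical and exist only for illustration; in a real component Viash would inject the `par_*` variables and the arguments would be passed to the tool itself.

```bash
#!/bin/bash

# Hypothetical values standing in for what Viash would inject
# between the VIASH START/END markers.
par_input='reads 1.fastq'
par_known_fusions='known fusions.tsv'
par_fill_gaps='false'
par_disable_filters='top_expressed;low_coverage'

# Unset boolean flags that are "false" so that ${var:+...} expands to nothing.
[ "$par_fill_gaps" = "false" ] && unset par_fill_gaps

# Convert Viash's ';' separator to the ',' that a (hypothetical) tool expects.
par_disable_filters=$(echo "$par_disable_filters" | tr ';' ',')

# Hypothetical stand-in for the tool: prints one received argument per line,
# which makes the effect of quoting and word splitting visible.
show_args() {
  printf '%s\n' "$@"
}

show_args \
  --input "$par_input" \
  ${par_known_fusions:+-k "$par_known_fusions"} \
  ${par_fill_gaps:+-I} \
  ${par_disable_filters:+-f "$par_disable_filters"}
```

Because `par_fill_gaps` is unset, `-I` is not passed at all, while the values containing spaces each arrive as a single argument thanks to the inner double quotes. Note that `${var@Q}` (Bash 4.4 and later) expands to a shell-quoted string whose quote characters are literal, so it is mainly useful when the result is parsed again, for example by `eval` or when printing a command; for direct invocation, plain double quotes as in this sketch are sufficient.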


### Step 12: Create test script

If the unit test requires test resources, these should be provided in the `test_resources` section of the component.

```yaml
functionality:
# ...
test_resources:
- type: bash_script
path: test.sh
- type: file
path: test_data
test_resources:
- type: bash_script
path: test.sh
- type: file
path: test_data
```

Create a test script at `src/xxx/test.sh` that runs the component with the test data. This script should run the component (available with `$meta_executable`) with the test data and check if the output is as expected. The script should exit with a non-zero exit code if the output is not as expected. For example:

```bash
#!/bin/bash
set -e
## VIASH START
## VIASH END
echo "> Run xxx with test data"
#############################################
# helper functions
assert_file_exists() {
[ -f "$1" ] || { echo "File '$1' does not exist" && exit 1; }
}
assert_file_doesnt_exist() {
[ ! -f "$1" ] || { echo "File '$1' exists but shouldn't" && exit 1; }
}
assert_file_empty() {
[ ! -s "$1" ] || { echo "File '$1' is not empty but should be" && exit 1; }
}
assert_file_not_empty() {
[ -s "$1" ] || { echo "File '$1' is empty but shouldn't be" && exit 1; }
}
assert_file_contains() {
grep -q "$2" "$1" || { echo "File '$1' does not contain '$2'" && exit 1; }
}
assert_file_not_contains() {
grep -q "$2" "$1" && { echo "File '$1' contains '$2' but shouldn't" && exit 1; }
}
assert_file_contains_regex() {
grep -q -E "$2" "$1" || { echo "File '$1' does not contain '$2'" && exit 1; }
}
assert_file_not_contains_regex() {
grep -q -E "$2" "$1" && { echo "File '$1' contains '$2' but shouldn't" && exit 1; }
}
#############################################
echo "> Run $meta_name with test data"
"$meta_executable" \
--input "$meta_resources_dir/test_data/input.txt" \
--input "$meta_resources_dir/test_data/reads_R1.fastq" \
--output "output.txt" \
--option
echo ">> Checking output"
[ ! -f "output.txt" ] && echo "Output file output.txt does not exist" && exit 1
```
echo ">> Check if output exists"
assert_file_exists "output.txt"
echo ">> Check if output is empty"
assert_file_not_empty "output.txt"
For example, this is what the test script for the `arriba` component looks like:
echo ">> Check if output is correct"
assert_file_contains "output.txt" "some expected output"
```bash
#!/bin/bash
echo "> All tests succeeded!"
```

## VIASH START
## VIASH END
Notes:

echo "> Run arriba with blacklist"
"$meta_executable" \
--bam "$meta_resources_dir/test_data/A.bam" \
--genome "$meta_resources_dir/test_data/genome.fasta" \
--gene_annotation "$meta_resources_dir/test_data/annotation.gtf" \
--blacklist "$meta_resources_dir/test_data/blacklist.tsv" \
--fusions "fusions.tsv" \
--fusions_discarded "fusions_discarded.tsv" \
--interesting_contigs "1,2"
echo ">> Checking output"
[ ! -f "fusions.tsv" ] && echo "Output file fusions.tsv does not exist" && exit 1
[ ! -f "fusions_discarded.tsv" ] && echo "Output file fusions_discarded.tsv does not exist" && exit 1
* Do always check the contents of the output file. If the output is not deterministic, you can use regular expressions to check the output.

echo ">> Check if output is empty"
[ ! -s "fusions.tsv" ] && echo "Output file fusions.tsv is empty" && exit 1
[ ! -s "fusions_discarded.tsv" ] && echo "Output file fusions_discarded.tsv is empty" && exit 1
```
* If possible, generate your own test data instead of copying it from an external resource.
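
One caveat when combining the helper functions above with `set -e`: a helper of the form `grep -q ... && { ...; exit 1; }` returns grep's non-zero status when the pattern is absent, so the calling script is likely to stop even though the check passed. The variant below is a sketch of one way to avoid this, using an `if` statement so the function returns 0 in the passing case. The temporary file and its contents are hypothetical and only make the example runnable.

```bash
#!/bin/bash
set -e

# Returns 0 when the pattern is absent, so `set -e` does not stop the script;
# exits with status 1 when the pattern is found.
assert_file_not_contains() {
  if grep -q "$2" "$1"; then
    echo "File '$1' contains '$2' but shouldn't"
    exit 1
  fi
}

# Hypothetical output file, created here only to make the sketch runnable.
output_file=$(mktemp)
echo "some expected output" > "$output_file"

assert_file_not_contains "$output_file" "unexpected"
echo "> All tests succeeded!"
```

The same pattern applies to `assert_file_not_contains_regex` by adding `-E` to the `grep` call.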

### Step 12: Create a `/var/software_versions.txt` file
### Step 13: Create a `/var/software_versions.txt` file

For the sake of transparency and reproducibility, we require that the versions of the software used in the component are documented.

@@ -378,6 +407,8 @@ engines:
image: quay.io/biocontainers/xxx:0.1.0--py_0
setup:
- type: docker
# note: /var/software_versions.txt should contain:
# arriba: "2.4.0"
run: |
echo "xxx: \"0.1.0\"" > /var/software_versions.txt
```
14 changes: 14 additions & 0 deletions src/_authors/angela_o_pisco.yaml
@@ -0,0 +1,14 @@
name: Angela Oliveira Pisco
info:
role: Contributor
links:
github: aopisco
orcid: "0000-0003-0142-2355"
linkedin: aopisco
organizations:
- name: Insitro
href: https://insitro.com
role: Director of Computational Biology
- name: Open Problems
href: https://openproblems.bio
role: Core Member
10 changes: 10 additions & 0 deletions src/_authors/dorien_roosen.yaml
@@ -0,0 +1,10 @@
name: Dorien Roosen
info:
links:
email: [email protected]
github: dorien-er
linkedin: dorien-roosen
organizations:
- name: Data Intuitive
href: https://www.data-intuitive.com
role: Data Scientist
11 changes: 11 additions & 0 deletions src/_authors/dries_schaumont.yaml
@@ -0,0 +1,11 @@
name: Dries Schaumont
info:
links:
email: [email protected]
github: DriesSchaumont
orcid: "0000-0002-4389-0440"
linkedin: dries-schaumont
organizations:
- name: Data Intuitive
href: https://www.data-intuitive.com
role: Data Scientist
10 changes: 10 additions & 0 deletions src/_authors/emma_rousseau.yaml
@@ -0,0 +1,10 @@
name: Emma Rousseau
info:
links:
email: [email protected]
github: emmarousseau
linkedin: emmarousseau1
organizations:
- name: Data Intuitive
href: https://www.data-intuitive.com
role: Bioinformatician