Merge branch 'main' into add-agat_convert_minimap2_bam2gff

viash-hub · Nov 8, 2024 · fc6c524
2 parents 852f11b + b3fcd52
commit fc6c524

Showing 243 changed files with 24,110 additions and 203 deletions.
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -1,5 +1,59 @@
 # biobox x.x.x
 
+## NEW FUNCTIONALITY
+
+* `agat`:
+  - `agat/agat_convert_genscan2gff`: convert a genscan file into a GFF file (PR #100).
+  - `agat/agat_sp_add_introns`: add intron features to gtf/gff file without intron features (PR #104).
+  - `agat/agat_sp_filter_feature_from_kill_list`: remove features in a GFF file based on a kill list (PR #105).
+  - `agat/agat_sp_merge_annotations`: merge different gff annotation files in one (PR #106).
+  - `agat/agat_sp_statistics`: provides exhaustive statistics of a gtf/gff file (PR #107).
+  - `agat/agat_sq_stat_basic`: provide basic statistics of a gtf/gff file (PR #110).
+
+* `bd_rhapsody/bd_rhapsody_sequence_analysis`: BD Rhapsody Sequence Analysis CWL pipeline (PR #96).
+
+* `bedtools`:
+   - `bedtools/bedtools_bamtobed`: Converts BAM alignments to BED6 or BEDPE format (PR #109).
+
+* `rsem/rsem_calculate_expression`: Calculate expression levels (PR #93).
+
+* `rseqc`:
+  - `rseqc/rseqc_inner_distance`: Calculate inner distance between read pairs (PR #159).
+  - `rseqc/rseqc_inferexperiment`: Infer strandedness from sequencing reads (PR #158).
+  - `rseqc/bam_stat`: Generate statistics from a bam file (PR #155).
+
+* `nanoplot`: Plotting tool for long read sequencing data and alignments (PR #95).
+
+## BUG FIXES
+
+* `falco`: Fix a typo in the `--reverse_complement` argument (PR #157).
+
+* `cutadapt`: Fix the non-functional `action` parameter (PR #161).
+
+* `bbmap_bbsplit`: Change argument type of `build` to `file` and add output argument `index` (PR #162).
+
+* `kallisto/kallisto_index`: Fix command script to use `--threads` option (PR #162).
+
+* `kallisto/kallisto_quant`: Change type of argument `output_dir` to `file` and add output argument `log` (PR #162).
+
+* `rsem/rsem_calculate_expression`: Fix output handling (PR #162).
+
+* `sortmerna`: Change type of argument `aligned` to `file`; update docker image; accept more than two reference files (PR #162).
+
+* `umi_tools/umi_tools_extract`: Remove `umi_discard_reads` option and change `log2stderr` to input argument (PR #162).
+
+## MINOR CHANGES
+
+* `agat_convert_bed2gff`: change type of argument `inflate_off` from `boolean_false` to `boolean_true` (PR #160).
+
+* `cutadapt`: change type of argument `no_indels` and `no_match_adapter_wildcards` from `boolean_false` to `boolean_true` (PR #160).
+
+* Upgrade to Viash 0.9.0.
+
+* `bbmap_bbsplit`: Move to namespace `bbmap` (PR #162).
+
+# biobox 0.2.0
+
 ## BREAKING CHANGES
 
 * `star/star_align_reads`: Change all arguments from `--camelCase` to `--snake_case` (PR #62).
@@ -20,18 +74,52 @@
                 based on a provided sequence IDs or region coordinates file (PR #85).
 
 * `agat`:
+  - `agat_convert_sp_gff2gtf`: convert any GTF/GFF file into a proper GTF file (PR #76).
+  - `agat_convert_bed2gff`: convert bed file to gff format (PR #97).
+  - `agat_convert_embl2gff`: convert an EMBL file into GFF format (PR #99).
   - `agat/agat_convert_sp_gff2gtf`: convert any GTF/GFF file into a proper GTF file (PR #76).
   - `agat/agat_convert_bed2gff`: convert bed file to gff format (PR #97).
   - `agat/agat_convert_minimap2_bam2gff`: convert output from minimap2 (bam or sam) into gff file (PR #113).
+  - `agat/agat_convert_mfannot2gff`: convert MFannot "masterfile" annotation to gff format (PR #112).
   - `agat/agat_convert_embl2gff`: convert an EMBL file into GFF format (PR #99).
   - `agat/agat_convert_sp_gff2tsv`: convert gtf/gff file into tabulated file (PR #102).
   - `agat/agat_convert_sp_gxf2gxf`: fixes and/or standardizes any GTF/GFF file into full sorted GTF/GFF file (PR #103).
 
 * `bedtools`:
   - `bedtools/bedtools_intersect`: Allows one to screen for overlaps between two sets of genomic features (PR #94).
   - `bedtools/bedtools_sort`: Sorts a feature file (bed/gff/vcf) by chromosome and other criteria (PR #98).
+  - `bedtools/bedtools_genomecov`: Compute the coverage of a feature file (bed/gff/vcf/bam) among a genome (PR #128).
+  - `bedtools/bedtools_groupby`: Summarizes a dataset column based upon common column groupings. Akin to the SQL "group by" command (PR #123).
+  - `bedtools/bedtools_merge`: Merges overlapping BED/GFF/VCF entries into a single interval (PR #118).
   - `bedtools/bedtools_bamtofastq`: Convert BAM alignments to FASTQ files (PR #101).
   - `bedtools/bedtools_bedtobam`: Converts genomic feature records (bed/gff/vcf) to BAM format (PR #111).
+  - `bedtools/bedtools_bed12tobed6`: Converts BED12 files to BED6 files (PR #140).
+  - `bedtools/bedtools_links`: Creates an HTML file with links to an instance of the UCSC Genome Browser for all features / intervals in a (bed/gff/vcf) file (PR #137).
+
+* `qualimap/qualimap_rnaseq`: RNA-seq QC analysis using qualimap (PR #74). 
+
+* `rsem/rsem_prepare_reference`: Prepare transcript references for RSEM (PR #89).
+
+* `bcftools`:
+  - `bcftools/bcftools_concat`: Concatenate or combine VCF/BCF files (PR #145).
+  - `bcftools/bcftools_norm`: Left-align and normalize indels, check if REF alleles match the reference, split multiallelic sites into multiple rows; recover multiallelics from multiple rows (PR #144).
+  - `bcftools/bcftools_annotate`: Add or remove annotations from a VCF/BCF file (PR #143).
+  - `bcftools/bcftools_stats`: Parses VCF or BCF and produces a txt stats file which can be plotted using plot-vcfstats (PR #142).
+  - `bcftools/bcftools_sort`: Sorts BCF/VCF files by position and other criteria (PR #141).
+
+* `fastqc`: High throughput sequence quality control analysis tool (PR #92).
+
+* `sortmerna`: Local sequence alignment tool for mapping, clustering, and filtering rRNA from
+  metatranscriptomic data (PR #146).
+
+* `fq_subsample`: Sample a subset of records from single or paired FASTQ files (PR #147).
+
+* `kallisto`:
+    - `kallisto_index`: Create a kallisto index (PR #149).
+    - `kallisto_quant`: Quantifying abundances of transcripts from RNA-Seq data, or more generally of target sequences using high-throughput sequencing reads (PR #152).
+
+* `trimgalore`: Quality and adapter trimming for fastq files (PR #117). 
+
 
 ## MINOR CHANGES
 
@@ -120,13 +208,18 @@
     - `samtools/samtools_fastq`: Converts a SAM/BAM/CRAM file to FASTA (PR #53).
 
 * `umi_tools`:
-    -`umi_tools/umi_tools_extract`: Flexible removal of UMI sequences from fastq reads (PR #71).
+    - `umi_tools/umi_tools_extract`: Flexible removal of UMI sequences from fastq reads (PR #71).
+    - `umi_tools/umi_tools_prepareforrsem`: Fix paired-end reads in name sorted BAM file to prepare for RSEM (PR #148).
 
 * `falco`: A C++ drop-in replacement of FastQC to assess the quality of sequence read data (PR #43).
 
 * `bedtools`:
     - `bedtools_getfasta`: extract sequences from a FASTA file for each of the
                            intervals defined in a BED/GFF/VCF file (PR #59).
+
+* `bbmap`:
+    - `bbmap_bbsplit`: Split sequencing reads by mapping them to multiple references simultaneously (PR #138).
+
 
 ## MINOR CHANGES
 

diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
@@ -231,6 +231,12 @@ Finally, add all other arguments to the config file. There are a few exceptions:
 
 * If the help lists defaults, do not add them as defaults but to the description. Example: `description: <Explanation of parameter>. Default: 10.`
 
+Note:
+
+* Prefer using `boolean_true` over `boolean_false`. This avoids confusion when specifying values for this argument in a Nextflow workflow.
+  For example, consider the CLI option `--no-indels` for `cutadapt`. If the config for `cutadapt` would specify an argument `no_indels` of type `boolean_false`,
+  the script of the component must pass a `--no-indels` argument to `cutadapt` when `par_no_indels` is set to `false`. This becomes problematic when setting a value for this argument using `fromState` in a Nextflow workflow: with `fromState: ["no_indels": true]`, the value that gets passed to the script is `true` and the `--no-indels` flag would *not* be added to the options for `cutadapt`. This is inconsistent with what one might expect when interpreting `["no_indels": true]`.
+  When using `boolean_true`, the reasoning becomes simpler because its value no longer represents the effect of the argument, but whether or not the flag is set.
 
 ### Step 10: Add a Docker engine
 

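Review note: the `boolean_true` versus `boolean_false` reasoning in the CONTRIBUTING.md note above can be traced in a few lines of Bash. This is a minimal sketch, not code from the repository. The function names `flag_for_boolean_true` and `flag_for_boolean_false` are hypothetical, and the sketch assumes, as the note describes, that the script receives the parameter value as the string `true` or `false`.

```bash
#!/bin/bash

# Hypothetical helper: with type `boolean_true`, the value states whether the
# flag is set, so "true" adds --no-indels and "false" omits it.
flag_for_boolean_true() {
  local par_no_indels="$1"
  [[ "$par_no_indels" == "false" ]] && unset par_no_indels
  echo "${par_no_indels:+--no-indels}"
}

# Hypothetical helper: with type `boolean_false`, the logic is inverted, so
# "false" adds --no-indels and "true" omits it.
flag_for_boolean_false() {
  local par_no_indels="$1"
  [[ "$par_no_indels" == "true" ]] && unset par_no_indels
  echo "${par_no_indels:+--no-indels}"
}

# Print the flag each variant would emit for both possible values.
for value in true false; do
  printf 'value=%-5s boolean_true: [%s]  boolean_false: [%s]\n' \
    "$value" \
    "$(flag_for_boolean_true "$value")" \
    "$(flag_for_boolean_false "$value")"
done
```

Under these assumptions, only the `boolean_true` variant emits `--no-indels` for `["no_indels": true]`, which is the behaviour a workflow author would likely expect.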
diff --git a/_viash.yaml b/_viash.yaml
@@ -7,7 +7,7 @@ links:
   issue_tracker: https://github.com/viash-hub/biobox/issues
   repository: https://github.com/viash-hub/biobox
 
-viash_version: 0.9.0-RC7
+viash_version: 0.9.0
 
 config_mods: |
   .requirements.commands := ['ps']
diff --git a/src/agat/agat_convert_bed2gff/config.vsh.yaml b/src/agat/agat_convert_bed2gff/config.vsh.yaml
@@ -49,7 +49,7 @@ argument_groups:
       - name: --inflate_off
         description: |
           By default we inflate the block fields (blockCount, blockSizes, blockStarts) to create subfeatures of the main feature (primary_tag). The type of subfeature created is based on the inflate_type parameter. If you do not want this inflating behaviour you can deactivate it by using the --inflate_off option.
-        type: boolean_false
+        type: boolean_true
       - name: --inflate_type
         description: |
           Feature type (3rd column in gff) created when inflate parameter activated [default: exon].

diff --git a/src/agat/agat_convert_bed2gff/script.sh b/src/agat/agat_convert_bed2gff/script.sh
@@ -4,7 +4,7 @@
 ## VIASH END
 
 # unset flags
-[[ "$par_inflate_off" == "true" ]] && unset par_inflate_off
+[[ "$par_inflate_off" == "false" ]] && unset par_inflate_off
 [[ "$par_verbose" == "false" ]] && unset par_verbose
 
 # run agat_convert_sp_bed2gff.pl

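Review note: the one-word change above flips which value suppresses the flag. The idiom can be sketched as follows, assuming (as the `== "false"` tests in the script suggest) that an unrequested `boolean_true` argument reaches the script as the string `false`. `build_args` is a hypothetical stand-in that prints the options instead of running AGAT.

```bash
#!/bin/bash

# Hypothetical stand-in for the component script: prints the options that
# would be passed on, one per line, instead of calling AGAT.
build_args() {
  local par_inflate_off="$1" par_inflate_type="$2"

  # unset flags: "false" means the flag was not requested
  [[ "$par_inflate_off" == "false" ]] && unset par_inflate_off

  # ${var:+word} expands to word only when var is set and non-empty,
  # so unset or empty parameters contribute nothing to the command line.
  printf '%s\n' \
    ${par_inflate_off:+--inflate_off} \
    ${par_inflate_type:+--inflate_type "${par_inflate_type}"}
}

build_args true exon
build_args false exon
```

The inner double quotes keep a value containing spaces together as a single word, which is why the expansion itself can stay unquoted.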
diff --git a/src/agat/agat_convert_genscan2gff/config.vsh.yaml b/src/agat/agat_convert_genscan2gff/config.vsh.yaml
@@ -0,0 +1,95 @@
+name: agat_convert_genscan2gff
+namespace: agat
+description: |
+  The script takes a GENSCAN file as input and translates it into GFF
+  format. The GENSCAN format is described [here](http://genome.crg.es/courses/Bioinformatics2003_genefinding/results/genscan.html).
+  
+  **Known problem** 
+
+  You must have submitted only the DNA sequence to GENSCAN, without any header! The tool expects only DNA
+  sequences and does not crash/warn if a header is submitted along with the
+  sequence. E.g. if you have a header ">seq", s-e-q are seen as the first 3
+  nucleotides of the sequence. All prediction locations are then shifted
+  accordingly. (Checked only on the [online version](http://argonaute.mit.edu/GENSCAN.html);
+  I don't know if there is the same problem elsewhere.)
+keywords: [gene annotations, GFF conversion, GENSCAN]
+links:
+  homepage: https://github.com/NBISweden/AGAT
+  documentation: https://agat.readthedocs.io/en/latest/tools/agat_convert_genscan2gff.html
+  issue_tracker: https://github.com/NBISweden/AGAT/issues
+  repository: https://github.com/NBISweden/AGAT
+references: 
+  doi: 10.5281/zenodo.3552717
+license: GPL-3.0
+requirements:
+  - commands: [agat]
+authors:
+  - __merge__: /src/_authors/leila_paquay.yaml
+    roles: [ author, maintainer ]
+
+argument_groups:
+  - name: Inputs
+    arguments:
+      - name: --genscan
+        alternatives: [-g]
+        description: Input genscan bed file that will be converted.
+        type: file
+        required: true
+        direction: input
+  - name: Outputs
+    arguments:       
+      - name: --output
+        alternatives: [-o, --out, --outfile, --gff]
+        description: Output GFF file. If no output file is specified, the output will be written to STDOUT.
+        type: file
+        direction: output
+        required: true
+        example: output.gff
+  - name: Arguments
+    arguments:
+      - name: --source
+        description: |
+          The source informs about the tool used to produce the data and is stored in 2nd field of a gff file. Example: Stringtie, Maker, Augustus, etc. [default: data]
+        type: string
+        required: false
+        example: Stringtie
+      - name: --primary_tag
+        description: |
+          The primary_tag corresponds to the data type and is stored in 3rd field of a gff file. Example: gene, mRNA, CDS, etc. [default: gene]
+        type: string
+        required: false
+        example: gene
+      - name: --inflate_type
+        description: |
+          Feature type (3rd column in gff) created when inflate parameter activated [default: exon].
+        type: string
+        required: false
+        example: exon
+      - name: --verbose
+        description: add verbosity
+        type: boolean_true
+      - name: --config
+        alternatives: [-c]
+        description: |
+          AGAT config file. By default AGAT takes the original agat_config.yaml shipped with AGAT. The `--config` option gives you the possibility to use your own AGAT config file (located elsewhere or named differently).
+        type: file
+        required: false
+        example: custom_agat_config.yaml
+resources:
+  - type: bash_script
+    path: script.sh
+test_resources:
+  - type: bash_script
+    path: test.sh
+  - type: file
+    path: test_data
+engines:
+  - type: docker
+    image: quay.io/biocontainers/agat:1.4.0--pl5321hdfd78af_0
+    setup:
+      - type: docker
+        run: |
+          agat --version | sed 's/AGAT\s\(.*\)/agat: "\1"/' > /var/software_versions.txt
+runners:
+  - type: executable
+  - type: nextflow
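Review note: the Docker `setup` step above rewrites the tool's version line into a `name: "version"` entry. The sketch below shows that transformation on a sample line. It assumes `agat --version` prints something of the form `AGAT v1.4.0`, which this diff does not show. `format_version` is a hypothetical helper, and it uses the POSIX class `[[:space:]]` in place of the GNU-specific `\s`.

```bash
#!/bin/bash

# Hypothetical helper: rewrite a line such as "AGAT v1.4.0" as a YAML-style
# entry, mirroring the sed expression in the Docker setup step.
format_version() {
  sed 's/AGAT[[:space:]]\(.*\)/agat: "\1"/'
}

# Assumed sample input; the real output of `agat --version` may differ.
echo "AGAT v1.4.0" | format_version
```

If the real version line has a different shape, the expression leaves it unchanged, so the resulting `software_versions.txt` is worth checking after the image is built.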
diff --git a/src/agat/agat_convert_genscan2gff/help.txt b/src/agat/agat_convert_genscan2gff/help.txt
@@ -0,0 +1,94 @@
+```sh
+agat_convert_genscan2gff.pl --help
+```
+ ------------------------------------------------------------------------------
+|   Another GFF Analysis Toolkit (AGAT) - Version: v1.4.0                      |
+|   https://github.com/NBISweden/AGAT                                          |
+|   National Bioinformatics Infrastructure Sweden (NBIS) - www.nbis.se         |
+ ------------------------------------------------------------------------------
+
+Name:
+    agat_convert_genscan2gff.pl
+
+Description:
+    The script takes a genscan file as input, and will translate it in gff
+    format. The genscan format is described here:
+    http://genome.crg.es/courses/Bioinformatics2003_genefinding/results/gens
+    can.html /!\ vvv Known problem vvv /!\ You must have submited only DNA
+    sequence, wihtout any header!! Indeed the tool expects only DNA
+    sequences and does not crash/warn if an header is submited along the
+    sequence. e.g If you have an header ">seq" s-e-q are seen as the 3 first
+    nucleotides of the sequence. Then all prediction location are shifted
+    accordingly. (checked only on the online version
+    http://argonaute.mit.edu/GENSCAN.html. I don't know if there is the same
+    pronlem elsewhere.) /!\ ^^^ Known problem ^^^^ /!\
+
+Usage:
+        agat_convert_genscan2gff.pl --genscan infile.bed [ -o outfile ]
+        agat_convert_genscan2gff.pl -h
+
+Options:
+    --genscan or -g
+            Input genscan bed file that will be convert.
+
+    --source
+            The source informs about the tool used to produce the data and
+            is stored in 2nd field of a gff file. Example:
+            Stringtie,Maker,Augustus,etc. [default: data]
+
+    --primary_tag
+            The primary_tag corresponf to the data type and is stored in 3rd
+            field of a gff file. Example: gene,mRNA,CDS,etc. [default: gene]
+
+    --inflate_off
+            By default we inflate the block fields (blockCount, blockSizes,
+            blockStarts) to create subfeatures of the main feature
+            (primary_tag). Type of subfeature created based on the
+            inflate_type parameter. If you don't want this inflating
+            behaviour you can deactivate it by using the option
+            --inflate_off.
+
+    --inflate_type
+            Feature type (3rd column in gff) created when inflate parameter
+            activated [default: exon].
+
+    --verbose
+            add verbosity
+
+    -o , --output , --out , --outfile or --gff
+            Output GFF file. If no output file is specified, the output will
+            be written to STDOUT.
+
+    -c or --config
+            String - Input agat config file. By default AGAT takes as input
+            agat_config.yaml file from the working directory if any,
+            otherwise it takes the orignal agat_config.yaml shipped with
+            AGAT. To get the agat_config.yaml locally type: "agat config
+            --expose". The --config option gives you the possibility to use
+            your own AGAT config file (located elsewhere or named
+            differently).
+
+    -h or --help
+            Display this helpful text.
+
+Feedback:
+  Did you find a bug?:
+    Do not hesitate to report bugs to help us keep track of the bugs and
+    their resolution. Please use the GitHub issue tracking system available
+    at this address:
+
+                https://github.com/NBISweden/AGAT/issues
+
+     Ensure that the bug was not already reported by searching under Issues.
+     If you're unable to find an (open) issue addressing the problem, open a new one.
+     Try as much as possible to include in the issue when relevant:
+     - a clear description,
+     - as much relevant information as possible,
+     - the command used,
+     - a data sample,
+     - an explanation of the expected behaviour that is not occurring.
+
+  Do you want to contribute?:
+    You are very welcome, visit this address for the Contributing
+    guidelines:
+    https://github.com/NBISweden/AGAT/blob/master/CONTRIBUTING.md
diff --git a/src/agat/agat_convert_genscan2gff/script.sh b/src/agat/agat_convert_genscan2gff/script.sh
@@ -0,0 +1,21 @@
+#!/bin/bash
+
+set -eo pipefail
+
+## VIASH START
+## VIASH END
+
+# unset flags
+[[ "$par_inflate_off" == "false" ]] && unset par_inflate_off
+[[ "$par_verbose" == "false" ]] && unset par_verbose
+
+# run agat_convert_genscan2gff
+agat_convert_genscan2gff.pl \
+  --genscan "$par_genscan" \
+  --output "$par_output" \
+  ${par_source:+--source "${par_source}"} \
+  ${par_primary_tag:+--primary_tag "${par_primary_tag}"} \
+  ${par_inflate_off:+--inflate_off} \
+  ${par_inflate_type:+--inflate_type "${par_inflate_type}"} \
+  ${par_verbose:+--verbose} \
+  ${par_config:+--config "${par_config}"}