Merge branch 'main' into add-agat_convert_minimap2_bam2gff
jakubmajercik authored Nov 8, 2024
2 parents 852f11b + b3fcd52 commit fc6c524
Showing 243 changed files with 24,110 additions and 203 deletions.
95 changes: 94 additions & 1 deletion CHANGELOG.md
@@ -1,5 +1,59 @@
# biobox x.x.x

## NEW FUNCTIONALITY

* `agat`:
- `agat/agat_convert_genscan2gff`: convert a genscan file into a GFF file (PR #100).
- `agat/agat_sp_add_introns`: add intron features to gtf/gff file without intron features (PR #104).
- `agat/agat_sp_filter_feature_from_kill_list`: remove features in a GFF file based on a kill list (PR #105).
- `agat/agat_sp_merge_annotations`: merge different gff annotation files in one (PR #106).
- `agat/agat_sp_statistics`: provides exhaustive statistics of a gtf/gff file (PR #107).
- `agat/agat_sq_stat_basic`: provide basic statistics of a gtf/gff file (PR #110).

* `bd_rhapsody/bd_rhapsody_sequence_analysis`: BD Rhapsody Sequence Analysis CWL pipeline (PR #96).

* `bedtools`:
- `bedtools/bedtools_bamtobed`: Converts BAM alignments to BED6 or BEDPE format (PR #109).

* `rsem/rsem_calculate_expression`: Calculate expression levels (PR #93).

* `rseqc`:
- `rseqc/rseqc_inner_distance`: Calculate inner distance between read pairs (PR #159).
- `rseqc/rseqc_inferexperiment`: Infer strandedness from sequencing reads (PR #158).
- `rseqc/bam_stat`: Generate statistics from a bam file (PR #155).

* `nanoplot`: Plotting tool for long read sequencing data and alignments (PR #95).

## BUG FIXES

* `falco`: Fix a typo in the `--reverse_complement` argument (PR #157).

* `cutadapt`: Fix the non-functional `action` parameter (PR #161).

* `bbmap_bbsplit`: Change argument type of `build` to `file` and add output argument `index` (PR #162).

* `kallisto/kallisto_index`: Fix command script to use `--threads` option (PR #162).

* `kallisto/kallisto_quant`: Change type of argument `output_dir` to `file` and add output argument `log` (PR #162).

* `rsem/rsem_calculate_expression`: Fix output handling (PR #162).

* `sortmerna`: Change type of argument `aligned` to `file`; update docker image; accept more than two reference files (PR #162).

* `umi_tools/umi_tools_extract`: Remove `umi_discard_reads` option and change `log2stderr` to input argument (PR #162).

## MINOR CHANGES

* `agat_convert_bed2gff`: change type of argument `inflate_off` from `boolean_false` to `boolean_true` (PR #160).

* `cutadapt`: change type of argument `no_indels` and `no_match_adapter_wildcards` from `boolean_false` to `boolean_true` (PR #160).

* Upgrade to Viash 0.9.0.

* `bbmap_bbsplit`: Move to namespace `bbmap` (PR #162).

# biobox 0.2.0

## BREAKING CHANGES

* `star/star_align_reads`: Change all arguments from `--camelCase` to `--snake_case` (PR #62).
@@ -20,18 +74,52 @@
based on a provided sequence IDs or region coordinates file (PR #85).

* `agat`:
- `agat_convert_sp_gff2gtf`: convert any GTF/GFF file into a proper GTF file (PR #76).
- `agat_convert_bed2gff`: convert bed file to gff format (PR #97).
- `agat_convert_embl2gff`: convert an EMBL file into GFF format (PR #99).
- `agat/agat_convert_sp_gff2gtf`: convert any GTF/GFF file into a proper GTF file (PR #76).
- `agat/agat_convert_bed2gff`: convert bed file to gff format (PR #97).
- `agat/agat_convert_minimap2_bam2gff`: convert output from minimap2 (bam or sam) into gff file (PR #113).
- `agat/agat_convert_mfannot2gff`: convert MFannot "masterfile" annotation to gff format (PR #112).
- `agat/agat_convert_embl2gff`: convert an EMBL file into GFF format (PR #99).
- `agat/agat_convert_sp_gff2tsv`: convert gtf/gff file into tabulated file (PR #102).
- `agat/agat_convert_sp_gxf2gxf`: fixes and/or standardizes any GTF/GFF file into full sorted GTF/GFF file (PR #103).

* `bedtools`:
- `bedtools/bedtools_intersect`: Allows one to screen for overlaps between two sets of genomic features (PR #94).
- `bedtools/bedtools_sort`: Sorts a feature file (bed/gff/vcf) by chromosome and other criteria (PR #98).
- `bedtools/bedtools_genomecov`: Compute the coverage of a feature file (bed/gff/vcf/bam) among a genome (PR #128).
- `bedtools/bedtools_groupby`: Summarizes a dataset column based upon common column groupings. Akin to the SQL "group by" command (PR #123).
- `bedtools/bedtools_merge`: Merges overlapping BED/GFF/VCF entries into a single interval (PR #118).
- `bedtools/bedtools_bamtofastq`: Convert BAM alignments to FASTQ files (PR #101).
- `bedtools/bedtools_bedtobam`: Converts genomic feature records (bed/gff/vcf) to BAM format (PR #111).
- `bedtools/bedtools_bed12tobed6`: Converts BED12 files to BED6 files (PR #140).
- `bedtools/bedtools_links`: Creates an HTML file with links to an instance of the UCSC Genome Browser for all features / intervals in a (bed/gff/vcf) file (PR #137).

* `qualimap/qualimap_rnaseq`: RNA-seq QC analysis using qualimap (PR #74).

* `rsem/rsem_prepare_reference`: Prepare transcript references for RSEM (PR #89).

* `bcftools`:
- `bcftools/bcftools_concat`: Concatenate or combine VCF/BCF files (PR #145).
- `bcftools/bcftools_norm`: Left-align and normalize indels, check if REF alleles match the reference, split multiallelic sites into multiple rows; recover multiallelics from multiple rows (PR #144).
- `bcftools/bcftools_annotate`: Add or remove annotations from a VCF/BCF file (PR #143).
- `bcftools/bcftools_stats`: Parses VCF or BCF and produces a txt stats file which can be plotted using plot-vcfstats (PR #142).
- `bcftools/bcftools_sort`: Sorts BCF/VCF files by position and other criteria (PR #141).

* `fastqc`: High throughput sequence quality control analysis tool (PR #92).

* `sortmerna`: Local sequence alignment tool for mapping, clustering, and filtering rRNA from
metatranscriptomic data (PR #146).

* `fq_subsample`: Sample a subset of records from single or paired FASTQ files (PR #147).

* `kallisto`:
- `kallisto_index`: Create a kallisto index (PR #149).
- `kallisto_quant`: Quantifying abundances of transcripts from RNA-Seq data, or more generally of target sequences using high-throughput sequencing reads (PR #152).

* `trimgalore`: Quality and adapter trimming for fastq files (PR #117).


## MINOR CHANGES

@@ -120,13 +208,18 @@
- `samtools/samtools_fastq`: Converts a SAM/BAM/CRAM file to FASTA (PR #53).

* `umi_tools`:
-`umi_tools/umi_tools_extract`: Flexible removal of UMI sequences from fastq reads (PR #71).
- `umi_tools/umi_tools_extract`: Flexible removal of UMI sequences from fastq reads (PR #71).
- `umi_tools/umi_tools_prepareforrsem`: Fix paired-end reads in name sorted BAM file to prepare for RSEM (PR #148).

* `falco`: A C++ drop-in replacement of FastQC to assess the quality of sequence read data (PR #43).

* `bedtools`:
- `bedtools_getfasta`: extract sequences from a FASTA file for each of the
intervals defined in a BED/GFF/VCF file (PR #59).

* `bbmap`:
- `bbmap_bbsplit`: Split sequencing reads by mapping them to multiple references simultaneously (PR #138).


## MINOR CHANGES

6 changes: 6 additions & 0 deletions CONTRIBUTING.md
@@ -231,6 +231,12 @@ Finally, add all other arguments to the config file. There are a few exceptions:

* If the help lists defaults, do not add them as defaults but to the description. Example: `description: <Explanation of parameter>. Default: 10.`

Note:

* Prefer using `boolean_true` over `boolean_false`. This avoids confusion when specifying values for this argument in a Nextflow workflow.
For example, consider the CLI option `--no-indels` for `cutadapt`. If the config for `cutadapt` specified an argument `no_indels` of type `boolean_false`,
the script of the component would have to pass a `--no-indels` argument to `cutadapt` when `par_no_indels` is set to `false`. This becomes problematic when setting a value for this argument using `fromState` in a Nextflow workflow: with `fromState: ["no_indels": true]`, the value that gets passed to the script is `true` and the `--no-indels` flag would *not* be added to the options for `cutadapt`. This is inconsistent with what one might expect when interpreting `["no_indels": true]`.
When using `boolean_true`, the reasoning becomes simpler because its value no longer represents the effect of the argument, but whether or not the flag is set.
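
The difference can be sketched in a few lines of bash. This is only an illustration of the reasoning in the note: `flag_if_true` and `flag_if_false` are hypothetical helpers, not part of biobox or Viash, and `par_no_indels` stands in for the value a component script would receive.

```bash
#!/bin/bash

# Hypothetical helper for an argument declared as `boolean_true`:
# the CLI flag is emitted when the parameter value is "true".
flag_if_true() {
  local value="$1" flag="$2"
  [[ "$value" == "true" ]] && printf '%s\n' "$flag"
  return 0
}

# Hypothetical helper for an argument declared as `boolean_false`:
# the CLI flag is emitted when the parameter value is "false".
flag_if_false() {
  local value="$1" flag="$2"
  [[ "$value" == "false" ]] && printf '%s\n' "$flag"
  return 0
}

# Assumed value, as it might arrive via fromState: ["no_indels": true]
par_no_indels="true"

# With boolean_true, `true` means "add the flag".
echo "boolean_true : cutadapt $(flag_if_true "$par_no_indels" --no-indels)"

# With boolean_false, the same `true` means "do not add the flag".
echo "boolean_false: cutadapt $(flag_if_false "$par_no_indels" --no-indels)"
```

Under these assumptions, only the `boolean_true` variant adds `--no-indels` when the workflow passes `true`, which matches what a reader of `["no_indels": true]` would likely expect.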

### Step 10: Add a Docker engine

2 changes: 1 addition & 1 deletion _viash.yaml
@@ -7,7 +7,7 @@ links:
issue_tracker: https://github.com/viash-hub/biobox/issues
repository: https://github.com/viash-hub/biobox

-viash_version: 0.9.0-RC7
+viash_version: 0.9.0

config_mods: |
.requirements.commands := ['ps']
2 changes: 1 addition & 1 deletion src/agat/agat_convert_bed2gff/config.vsh.yaml
@@ -49,7 +49,7 @@ argument_groups:
- name: --inflate_off
description: |
By default we inflate the block fields (blockCount, blockSizes, blockStarts) to create subfeatures of the main feature (primary_tag). The type of subfeature created is based on the inflate_type parameter. If you do not want this inflating behaviour you can deactivate it by using the --inflate_off option.
-type: boolean_false
+type: boolean_true
- name: --inflate_type
description: |
Feature type (3rd column in gff) created when inflate parameter activated [default: exon].
2 changes: 1 addition & 1 deletion src/agat/agat_convert_bed2gff/script.sh
@@ -4,7 +4,7 @@
## VIASH END

# unset flags
-[[ "$par_inflate_off" == "true" ]] && unset par_inflate_off
+[[ "$par_inflate_off" == "false" ]] && unset par_inflate_off
[[ "$par_verbose" == "false" ]] && unset par_verbose

# run agat_convert_sp_bed2gff.pl
95 changes: 95 additions & 0 deletions src/agat/agat_convert_genscan2gff/config.vsh.yaml
@@ -0,0 +1,95 @@
name: agat_convert_genscan2gff
namespace: agat
description: |
The script takes a GENSCAN file as input and translates it into GFF
format. The GENSCAN format is described [here](http://genome.crg.es/courses/Bioinformatics2003_genefinding/results/genscan.html).
**Known problem**
You must have submitted only the DNA sequence, without any header! The tool expects only DNA
sequences and does not crash or warn if a header is submitted along with the
sequence. E.g., if you have a header ">seq", s-e-q are seen as the first 3
nucleotides of the sequence, and all prediction locations are shifted
accordingly. (Checked only on the [online version](http://argonaute.mit.edu/GENSCAN.html);
I don't know if the same problem exists elsewhere.)
keywords: [gene annotations, GFF conversion, GENSCAN]
links:
homepage: https://github.com/NBISweden/AGAT
documentation: https://agat.readthedocs.io/en/latest/tools/agat_convert_genscan2gff.html
issue_tracker: https://github.com/NBISweden/AGAT/issues
repository: https://github.com/NBISweden/AGAT
references:
doi: 10.5281/zenodo.3552717
license: GPL-3.0
requirements:
- commands: [agat]
authors:
- __merge__: /src/_authors/leila_paquay.yaml
roles: [ author, maintainer ]

argument_groups:
- name: Inputs
arguments:
- name: --genscan
alternatives: [-g]
description: Input genscan bed file that will be converted.
type: file
required: true
direction: input
- name: Outputs
arguments:
- name: --output
alternatives: [-o, --out, --outfile, --gff]
description: Output GFF file. If no output file is specified, the output will be written to STDOUT.
type: file
direction: output
required: true
example: output.gff
- name: Arguments
arguments:
- name: --source
description: |
The source informs about the tool used to produce the data and is stored in 2nd field of a gff file. Example: Stringtie, Maker, Augustus, etc. [default: data]
type: string
required: false
example: Stringtie
- name: --primary_tag
description: |
The primary_tag corresponds to the data type and is stored in 3rd field of a gff file. Example: gene, mRNA, CDS, etc. [default: gene]
type: string
required: false
example: gene
- name: --inflate_type
description: |
Feature type (3rd column in gff) created when inflate parameter activated [default: exon].
type: string
required: false
example: exon
- name: --verbose
description: add verbosity
type: boolean_true
- name: --config
alternatives: [-c]
description: |
AGAT config file. By default AGAT takes the original agat_config.yaml shipped with AGAT. The `--config` option gives you the possibility to use your own AGAT config file (located elsewhere or named differently).
type: file
required: false
example: custom_agat_config.yaml
resources:
- type: bash_script
path: script.sh
test_resources:
- type: bash_script
path: test.sh
- type: file
path: test_data
engines:
- type: docker
image: quay.io/biocontainers/agat:1.4.0--pl5321hdfd78af_0
setup:
- type: docker
run: |
agat --version | sed 's/AGAT\s\(.*\)/agat: "\1"/' > /var/software_versions.txt
runners:
- type: executable
- type: nextflow
94 changes: 94 additions & 0 deletions src/agat/agat_convert_genscan2gff/help.txt
@@ -0,0 +1,94 @@
```sh
agat_convert_genscan2gff.pl --help
```
------------------------------------------------------------------------------
| Another GFF Analysis Toolkit (AGAT) - Version: v1.4.0 |
| https://github.com/NBISweden/AGAT |
| National Bioinformatics Infrastructure Sweden (NBIS) - www.nbis.se |
------------------------------------------------------------------------------

Name:
agat_convert_genscan2gff.pl

Description:
The script takes a genscan file as input, and will translate it in gff
format. The genscan format is described here:
http://genome.crg.es/courses/Bioinformatics2003_genefinding/results/gens
can.html /!\ vvv Known problem vvv /!\ You must have submited only DNA
sequence, wihtout any header!! Indeed the tool expects only DNA
sequences and does not crash/warn if an header is submited along the
sequence. e.g If you have an header ">seq" s-e-q are seen as the 3 first
nucleotides of the sequence. Then all prediction location are shifted
accordingly. (checked only on the online version
http://argonaute.mit.edu/GENSCAN.html. I don't know if there is the same
pronlem elsewhere.) /!\ ^^^ Known problem ^^^^ /!\

Usage:
agat_convert_genscan2gff.pl --genscan infile.bed [ -o outfile ]
agat_convert_genscan2gff.pl -h

Options:
--genscan or -g
Input genscan bed file that will be convert.

--source
The source informs about the tool used to produce the data and
is stored in 2nd field of a gff file. Example:
Stringtie,Maker,Augustus,etc. [default: data]

--primary_tag
The primary_tag corresponf to the data type and is stored in 3rd
field of a gff file. Example: gene,mRNA,CDS,etc. [default: gene]

--inflate_off
By default we inflate the block fields (blockCount, blockSizes,
blockStarts) to create subfeatures of the main feature
(primary_tag). Type of subfeature created based on the
inflate_type parameter. If you don't want this inflating
behaviour you can deactivate it by using the option
--inflate_off.

--inflate_type
Feature type (3rd column in gff) created when inflate parameter
activated [default: exon].

--verbose
add verbosity

-o , --output , --out , --outfile or --gff
Output GFF file. If no output file is specified, the output will
be written to STDOUT.

-c or --config
String - Input agat config file. By default AGAT takes as input
agat_config.yaml file from the working directory if any,
otherwise it takes the orignal agat_config.yaml shipped with
AGAT. To get the agat_config.yaml locally type: "agat config
--expose". The --config option gives you the possibility to use
your own AGAT config file (located elsewhere or named
differently).

-h or --help
Display this helpful text.

Feedback:
Did you find a bug?:
Do not hesitate to report bugs to help us keep track of the bugs and
their resolution. Please use the GitHub issue tracking system available
at this address:

https://github.com/NBISweden/AGAT/issues

Ensure that the bug was not already reported by searching under Issues.
If you're unable to find an (open) issue addressing the problem, open a new one.
Try as much as possible to include in the issue when relevant:
- a clear description,
- as much relevant information as possible,
- the command used,
- a data sample,
- an explanation of the expected behaviour that is not occurring.

Do you want to contribute?:
You are very welcome, visit this address for the Contributing
guidelines:
https://github.com/NBISweden/AGAT/blob/master/CONTRIBUTING.md
21 changes: 21 additions & 0 deletions src/agat/agat_convert_genscan2gff/script.sh
@@ -0,0 +1,21 @@
#!/bin/bash

set -eo pipefail

## VIASH START
## VIASH END

# unset flags
[[ "$par_inflate_off" == "true" ]] && unset par_inflate_off
[[ "$par_verbose" == "false" ]] && unset par_verbose

# run agat_convert_genscan2gff
agat_convert_genscan2gff.pl \
--genscan "$par_genscan" \
--output "$par_output" \
${par_source:+--source "${par_source}"} \
${par_primary_tag:+--primary_tag "${par_primary_tag}"} \
${par_inflate_off:+--inflate_off} \
${par_inflate_type:+--inflate_type "${par_inflate_type}"} \
${par_verbose:+--verbose} \
${par_config:+--config "${par_config}"}
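
The script above relies on bash's `${parameter:+word}` expansion, which produces `word` only when the parameter is set and non-empty. The following is a minimal, self-contained sketch of that pattern; `build_args` is a hypothetical function written for illustration and is not part of the component.

```bash
#!/bin/bash

# Hypothetical illustration of the optional-argument pattern used above.
# Takes a source string and a "true"/"false" verbose value, and prints
# one resulting CLI argument per line.
build_args() {
  local par_source="$1" par_verbose="$2"

  # A boolean_true parameter that is "false" is unset, so that the
  # ${var:+...} expansion below yields nothing for it.
  [[ "$par_verbose" == "false" ]] && unset par_verbose

  # Each expansion contributes its words only if the variable is
  # set and non-empty; the inner quotes keep values with spaces intact.
  local args=(
    ${par_source:+--source "${par_source}"}
    ${par_verbose:+--verbose}
  )

  # Print nothing at all when no optional argument applies.
  (( ${#args[@]} )) && printf '%s\n' "${args[@]}"
  return 0
}

build_args "Stringtie" "true"
build_args "" "false"
```

In this sketch, an empty `par_source` or a `false` `par_verbose` simply drops the corresponding option, so the command line only ever contains options the caller actually provided.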