Add agat sq stat basic #110

Merged on Nov 2, 2024 (18 commits)
2 changes: 1 addition & 1 deletion CHANGELOG.md
@@ -8,6 +8,7 @@
- `agat/agat_sp_filter_feature_from_kill_list`: remove features in a GFF file based on a kill list (PR #105).
- `agat/agat_sp_merge_annotations`: merge different gff annotation files into one (PR #106).
- `agat/agat_sp_statistics`: provides exhaustive statistics of a gtf/gff file (PR #107).
- `agat/agat_sq_stat_basic`: provide basic statistics of a gtf/gff file (PR #110).

* `bd_rhapsody/bd_rhapsody_sequence_analysis`: BD Rhapsody Sequence Analysis CWL pipeline (PR #96).

@@ -68,7 +69,6 @@
- `agat/agat_convert_sp_gff2tsv`: convert gtf/gff file into tabulated file (PR #102).
- `agat/agat_convert_sp_gxf2gxf`: fixes and/or standardizes any GTF/GFF file into full sorted GTF/GFF file (PR #103).


* `bedtools`:
- `bedtools/bedtools_intersect`: Allows one to screen for overlaps between two sets of genomic features (PR #94).
- `bedtools/bedtools_sort`: Sorts a feature file (bed/gff/vcf) by chromosome and other criteria (PR #98).
92 changes: 92 additions & 0 deletions src/agat/agat_sq_stat_basic/config.vsh.yaml
@@ -0,0 +1,92 @@
name: agat_sq_stat_basic
namespace: agat
description: |
  The script aims to provide basic statistics of a gtf/gff file.
keywords: [gene annotations, gff, statistics]
links:
  homepage: https://github.com/NBISweden/AGAT
  documentation: https://agat.readthedocs.io/en/latest/tools/agat_sq_stat_basic.html
  issue_tracker: https://github.com/NBISweden/AGAT/issues
  repository: https://github.com/NBISweden/AGAT
references:
  doi: 10.5281/zenodo.3552717
license: GPL-3.0
requirements:
  - commands: [agat]
authors:
  - __merge__: /src/_authors/leila_paquay.yaml
    roles: [ author, maintainer ]
argument_groups:
  - name: Inputs
    arguments:
      - name: --gff
        alternatives: [-i, --file, --input]
        description: |
          Input GTF/GFF file.
        type: file
        required: true
        multiple: true
        direction: input
        example: input.gff
      - name: --genome_size
        alternatives: [-g]
        description: |
          Genome size, used to calculate the percentage of the genome represented by each feature type. Provide an integer here, or pass a FASTA file with --genome_size_fasta instead. If both are provided, only the value of --genome_size is considered.
        type: integer
        required: false
        direction: input
        example: 10000
      - name: --genome_size_fasta
        description: |
          FASTA file of the genome, used to calculate the percentage of the genome represented by each feature type. The genome size is calculated from the FASTA on the fly. Alternatively, pass the size directly as an integer with --genome_size. If both are provided, only the value of --genome_size is considered.
        type: file
        required: false
        direction: input
        example: genome.fasta
  - name: Outputs
    arguments:
      - name: --output
        alternatives: [-o]
        description: |
          Output file. The result is in tabulated format.
        type: file
        direction: output
        required: true
        example: output.txt
  - name: Arguments
    arguments:
      - name: --inflate
        description: |
          Inflate the statistics to take features with multiple parents
          into account. To avoid redundant information, some GFF files
          factorize identical features, e.g. one exon used in two
          different isoforms is defined only once and has multiple
          parents. By default the script counts such a feature only
          once. The inflate option counts the feature and its size as
          many times as it has parents.
        type: boolean_true
      - name: --config
        alternatives: [-c]
        description: |
          AGAT config file. By default AGAT takes the original agat_config.yaml shipped with AGAT. The `--config` option gives you the possibility to use your own AGAT config file (located elsewhere or named differently).
        type: file
        required: false
        example: custom_agat_config.yaml
resources:
  - type: bash_script
    path: script.sh
test_resources:
  - type: bash_script
    path: test.sh
  - type: file
    path: test_data
engines:
  - type: docker
    image: quay.io/biocontainers/agat:1.4.0--pl5321hdfd78af_0
    setup:
      - type: docker
        run: |
          agat --version | sed 's/AGAT\s\(.*\)/agat: "\1"/' > /var/software_versions.txt
runners:
  - type: executable
  - type: nextflow
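
The `--inflate` description in this config can be illustrated with a toy example. The sketch below is not AGAT's implementation; it is a small awk-based approximation, assuming GFF3-style `Parent=` attributes. The toy records and the `count_exons` helper are hypothetical.

```bash
#!/bin/bash
set -eo pipefail

# Hypothetical toy annotation: exon2 is shared by two isoforms (two parents).
toy_gff=$(mktemp)
printf 'chr1\tsrc\texon\t1\t100\t.\t+\t.\tID=exon1;Parent=mrna1\n' > "$toy_gff"
printf 'chr1\tsrc\texon\t201\t250\t.\t+\t.\tID=exon2;Parent=mrna1,mrna2\n' >> "$toy_gff"

# count_exons FILE INFLATE -> prints "<count> <total size>"
# INFLATE=0 counts each exon line once; INFLATE=1 counts it once per parent.
count_exons() {
  awk -F'\t' -v inflate="$2" '
    $3 == "exon" {
      weight = 1
      if (inflate && match($9, /Parent=[^;]+/)) {
        # "Parent=a,b" -> "a,b"; the number of ids is the weight
        parents = substr($9, RSTART + 7, RLENGTH - 7)
        weight = split(parents, ids, ",")
      }
      n += weight
      size += weight * ($5 - $4 + 1)
    }
    END { print n + 0, size + 0 }
  ' "$1"
}

default_stats=$(count_exons "$toy_gff" 0)
inflated_stats=$(count_exons "$toy_gff" 1)
echo "default:  $default_stats"
echo "inflated: $inflated_stats"
rm -f "$toy_gff"
```

With this toy input the default count is 2 exons covering 150 bases, while the inflated count is 3 exons covering 200 bases, because the shared 50-base exon is counted once per parent.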
79 changes: 79 additions & 0 deletions src/agat/agat_sq_stat_basic/help.txt
@@ -0,0 +1,79 @@
```sh
agat_sq_stat_basic.pl --help
```

------------------------------------------------------------------------------
| Another GFF Analysis Toolkit (AGAT) - Version: v1.4.0 |
| https://github.com/NBISweden/AGAT |
| National Bioinformatics Infrastructure Sweden (NBIS) - www.nbis.se |
------------------------------------------------------------------------------


Name:
agat_sq_stat_basic.pl

Description:
The script aims to provide basic statistics of a gtf/gff file.

Usage:
agat_sq_stat_basic.pl -i <input file> [-g <integer or fasta> -o <output file>]
agat_sq_stat_basic.pl --help

Options:
-i, --gff, --file or --input
STRING: Input GTF/GFF file. Several files can be processed at
once: -i file1 -i file2

-g, --genome
That input is designed to know the genome size in order to
calculate the percentage of the genome represented by each kind
of feature type. You can provide an INTEGER or the genome in
fasta format. If you provide the fasta, the genome size will be
calculated on the fly.

--inflate
Inflate the statistics taking into account feature with
multi-parents. Indeed to avoid redundant information, some gff
factorize identical features. e.g: one exon used in two
different isoform will be defined only once, and will have
multiple parent. By default the script count such feature only
once. Using the inflate option allows to count the feature and
its size as many time there are parents.

-o or --output
STRING: Output file. If no output file is specified, the output
will be written to STDOUT. The result is in tabulate format.

-c or --config
String - Input agat config file. By default AGAT takes as input
agat_config.yaml file from the working directory if any,
otherwise it takes the original agat_config.yaml shipped with
AGAT. To get the agat_config.yaml locally type: "agat config
--expose". The --config option gives you the possibility to use
your own AGAT config file (located elsewhere or named
differently).

--help or -h
Display this helpful text.

Feedback:
Did you find a bug?:
Do not hesitate to report bugs to help us keep track of the bugs and
their resolution. Please use the GitHub issue tracking system available
at this address:

https://github.com/NBISweden/AGAT/issues

Ensure that the bug was not already reported by searching under Issues.
If you're unable to find an (open) issue addressing the problem, open a new one.
Try as much as possible to include in the issue when relevant:
- a clear description,
- as much relevant information as possible,
- the command used,
- a data sample,
- an explanation of the expected behaviour that is not occurring.

Do you want to contribute?:
You are very welcome, visit this address for the Contributing
guidelines:
https://github.com/NBISweden/AGAT/blob/master/CONTRIBUTING.md
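
The help text for `-g` says that the genome size is calculated on the fly when a FASTA file is given. As a rough illustration of what that amounts to (this is an assumption about the idea, not AGAT's actual code), the total length can be obtained by summing the sequence lines of all records. The `fasta_total_length` helper and the toy FASTA are hypothetical.

```bash
#!/bin/bash
set -eo pipefail

# fasta_total_length FILE -> total number of residues over all records
fasta_total_length() {
  awk '
    /^>/ { next }                                  # skip header lines
    { gsub(/[ \t\r]/, ""); total += length($0) }   # ignore stray whitespace
    END { print total + 0 }
  ' "$1"
}

# Hypothetical toy genome: chr1 has 12 bases over two lines, chr2 has 4.
toy_fasta=$(mktemp)
printf '>chr1\nACGTACGT\nACGT\n>chr2\nGGCC\n' > "$toy_fasta"

genome_size=$(fasta_total_length "$toy_fasta")
echo "genome size: $genome_size"
rm -f "$toy_fasta"
```

An integer obtained this way could be passed to `-g` directly, which avoids re-reading a large FASTA on every run.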
31 changes: 31 additions & 0 deletions src/agat/agat_sq_stat_basic/script.sh
@@ -0,0 +1,31 @@
#!/bin/bash

set -eo pipefail

## VIASH START
## VIASH END

# unset flags
[[ "$par_inflate" == "false" ]] && unset par_inflate

# Convert the semicolon-separated list of file names to multiple --gff arguments.
# An array keeps file names containing spaces intact.
input_files=()
IFS=";" read -ra file_names <<< "$par_gff"
for file in "${file_names[@]}"; do
  input_files+=(--gff "$file")
done

# take care of --genome (can originally be either a fasta file or an integer);
# --genome_size takes precedence over --genome_size_fasta
if [[ -n "$par_genome_size" ]]; then
  genome_arg=$par_genome_size
elif [[ -n "$par_genome_size_fasta" ]]; then
  genome_arg=$par_genome_size_fasta
fi

# run agat_sq_stat_basic.pl
agat_sq_stat_basic.pl \
  "${input_files[@]}" \
  ${genome_arg:+--genome "${genome_arg}"} \
  --output "${par_output}" \
  ${par_inflate:+--inflate} \
  ${par_config:+--config "${par_config}"}
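
The `${var:+word}` expansions in this script only emit an option when the corresponding `par_*` variable is set and non-empty. A minimal sketch of that behaviour follows; the `build_args` helper and its sample values are hypothetical and only mimic the flag handling, without calling AGAT.

```bash
#!/bin/bash
set -eo pipefail

# build_args INFLATE CONFIG -> prints the resulting arguments, one per line
build_args() {
  local par_inflate="$1" par_config="$2"
  # mimic the "unset flags" step: an unset boolean arrives as the string "false"
  [[ "$par_inflate" == "false" ]] && unset par_inflate
  local args=(
    ${par_inflate:+--inflate}
    ${par_config:+--config "${par_config}"}
  )
  if ((${#args[@]})); then
    printf '%s\n' "${args[@]}"
  fi
}

with_flags=$(build_args true "my config.yaml")
without_flags=$(build_args false "")
echo "$with_flags"
echo "without flags: '${without_flags}'"
```

In the first call three arguments are produced (`--inflate`, `--config`, and the file name as a single word despite the space); in the second call nothing is emitted at all.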
36 changes: 36 additions & 0 deletions src/agat/agat_sq_stat_basic/test.sh
@@ -0,0 +1,36 @@
#!/bin/bash

set -eo pipefail

## VIASH START
## VIASH END

test_dir="${meta_resources_dir}/test_data"

# create temporary directory and clean up on exit
TMPDIR=$(mktemp -d "$meta_temp_dir/$meta_name-XXXXXX")
function clean_up {
  [[ -d "$TMPDIR" ]] && rm -rf "$TMPDIR"
}
trap clean_up EXIT


echo "> Run $meta_name with test data"
"$meta_executable" \
--gff "$test_dir/1.gff" \
--output "$TMPDIR/output.txt"

echo ">> Checking output"
[ ! -f "$TMPDIR/output.txt" ] && echo "Output file output.txt does not exist" && exit 1

echo ">> Check if output is empty"
[ ! -s "$TMPDIR/output.txt" ] && echo "Output file output.txt is empty" && exit 1

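echo ">> Check if output matches expected output"
# With `set -e`, a bare failing diff would abort before the message is printed,
# so the comparison is tested directly in the if condition.
if ! diff "$TMPDIR/output.txt" "$test_dir/agat_sq_stat_basic_1.gff"; then
  echo "Output file output.txt does not match expected output"
  exit 1
fi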

echo "> Test successful"
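
With `set -e` active, a bare failing `diff` ends the script before a later `$?` check can run, so the comparison is best placed inside an `if` condition. A minimal sketch of that pattern as a reusable helper is shown here; the `assert_same_content` name and the toy files are hypothetical.

```bash
#!/bin/bash
set -eo pipefail

# assert_same_content ACTUAL EXPECTED: returns 0 if the files are identical,
# otherwise prints a message and returns 1.
assert_same_content() {
  if ! diff "$1" "$2" > /dev/null; then
    echo "Output file $1 does not match expected output"
    return 1
  fi
}

work_dir=$(mktemp -d)
printf 'a\n' > "$work_dir/actual.txt"
printf 'a\n' > "$work_dir/same.txt"
printf 'b\n' > "$work_dir/different.txt"

# Capture the status explicitly so that `set -e` does not end the demo early.
assert_same_content "$work_dir/actual.txt" "$work_dir/same.txt" \
  && match_status=0 || match_status=$?
assert_same_content "$work_dir/actual.txt" "$work_dir/different.txt" \
  && mismatch_status=0 || mismatch_status=$?

echo "match: $match_status, mismatch: $mismatch_status"
rm -rf "$work_dir"
```

In a real test script the helper would simply be called on its own line, so that a mismatch prints the message and then stops the script through `set -e`.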