Add agat convert genscan2gff #100

Merged
merged 25 commits into from
Sep 16, 2024
3 changes: 3 additions & 0 deletions CHANGELOG.md
@@ -21,6 +21,9 @@

* `agat/agat_convert_sp_gff2gtf`: convert any GTF/GFF file into a proper GTF file (PR #76).

* `agat/agat_convert_genscan2gff`: convert a genscan file into a GFF file (PR #100).


## MINOR CHANGES

* `busco` components: update BUSCO to `5.7.1` (PR #72).
97 changes: 97 additions & 0 deletions src/agat/agat_convert_genscan2gff/config.vsh.yaml
@@ -0,0 +1,97 @@
name: agat_convert_genscan2gff
namespace: agat
description: |
  The script takes a genscan file as input and translates it into GFF format.
  The genscan format is described here:
  http://genome.crg.es/courses/Bioinformatics2003_genefinding/results/genscan.html

  /!\ vvv Known problem vvv /!\
  You must have submitted only the DNA sequence, without any header! The tool
  expects only DNA sequences and does not crash or warn if a header is
  submitted along with the sequence. E.g. if you have a header ">seq", s-e-q
  are seen as the first 3 nucleotides of the sequence, and all prediction
  locations are then shifted accordingly. (Checked only on the online version
  http://argonaute.mit.edu/GENSCAN.html. I don't know if the same problem
  exists elsewhere.)
  /!\ ^^^ Known problem ^^^ /!\
keywords: [gene annotations, GFF conversion]
links:
  homepage: https://github.com/NBISweden/AGAT
  documentation: https://agat.readthedocs.io/en/latest/tools/agat_convert_genscan2gff.html
  issue_tracker: https://github.com/NBISweden/AGAT/issues
  repository: https://github.com/NBISweden/AGAT
references:
  doi: 10.5281/zenodo.3552717
license: GPL-3.0
authors:
  - __merge__: /src/_authors/leila_paquay.yaml
    roles: [ author, maintainer ]

argument_groups:
  - name: Inputs
    arguments:
      - name: --genscan
        alternatives: [-g]
        description: Input genscan file that will be converted.
        type: file
        required: true
        direction: input
  - name: Outputs
    arguments:
      - name: --output
        alternatives: [-o, --out, --outfile, --gff]
        description: Output GFF file.
        type: file
        direction: output
        required: true
        example: output.gff
  - name: Arguments
    arguments:
      - name: --source
        description: |
          The source informs about the tool used to produce the data and is stored in the 2nd field of a GFF file. Example: Stringtie, Maker, Augustus, etc. [default: data]
        type: string
        required: false
        example: Stringtie
      - name: --primary_tag
        description: |
          The primary_tag corresponds to the data type and is stored in the 3rd field of a GFF file. Example: gene, mRNA, CDS, etc. [default: gene]
        type: string
        required: false
        example: gene
      - name: --inflate_off
        description: |
          By default we inflate the block fields (blockCount, blockSizes, blockStarts) to create subfeatures of the main feature (primary_tag). The type of subfeature created is based on the inflate_type parameter. If you don't want this inflating behaviour, you can deactivate it with the option --inflate_off.
        type: boolean_false
      - name: --inflate_type
        description: |
          Feature type (3rd column in GFF) created when the inflate parameter is activated [default: exon].
        type: string
        required: false
        example: exon
      - name: --verbose
        description: Add verbosity.
        type: boolean_true
      - name: --config
        alternatives: [-c]
        description: |
          Input AGAT config file. By default AGAT takes as input the agat_config.yaml file from the working directory if any, otherwise it takes the original agat_config.yaml shipped with AGAT. To get the agat_config.yaml locally type: "agat config --expose". The --config option gives you the possibility to use your own AGAT config file (located elsewhere or named differently).
        type: file
        required: false
        example: custom_agat_config.yaml
resources:
  - type: bash_script
    path: script.sh
test_resources:
  - type: bash_script
    path: test.sh
  - type: file
    path: test_data
engines:
  - type: docker
    image: quay.io/biocontainers/agat:1.4.0--pl5321hdfd78af_0
    setup:
      - type: docker
        run: |
          agat --version | sed 's/AGAT\s\(.*\)/agat: "\1"/' > /var/software_versions.txt
runners:
  - type: executable
  - type: nextflow
94 changes: 94 additions & 0 deletions src/agat/agat_convert_genscan2gff/help.txt
Original file line number Diff line number Diff line change
@@ -0,0 +1,94 @@
```sh
agat_convert_genscan2gff.pl --help
```
------------------------------------------------------------------------------
| Another GFF Analysis Toolkit (AGAT) - Version: v1.4.0 |
| https://github.com/NBISweden/AGAT |
| National Bioinformatics Infrastructure Sweden (NBIS) - www.nbis.se |
------------------------------------------------------------------------------

Name:
agat_convert_genscan2gff.pl

Description:
The script takes a genscan file as input, and will translate it into gff
format. The genscan format is described here:
http://genome.crg.es/courses/Bioinformatics2003_genefinding/results/genscan.html
/!\ vvv Known problem vvv /!\ You must have submitted only DNA
sequence, without any header!! Indeed the tool expects only DNA
sequences and does not crash/warn if a header is submitted along the
sequence. e.g. If you have a header ">seq", s-e-q are seen as the first 3
nucleotides of the sequence. Then all prediction locations are shifted
accordingly. (checked only on the online version
http://argonaute.mit.edu/GENSCAN.html. I don't know if there is the same
problem elsewhere.) /!\ ^^^ Known problem ^^^^ /!\

Usage:
agat_convert_genscan2gff.pl --genscan infile.bed [ -o outfile ]
agat_convert_genscan2gff.pl -h

Options:
--genscan or -g
Input genscan bed file that will be converted.

--source
The source informs about the tool used to produce the data and
is stored in 2nd field of a gff file. Example:
Stringtie,Maker,Augustus,etc. [default: data]

--primary_tag
The primary_tag corresponds to the data type and is stored in 3rd
field of a gff file. Example: gene,mRNA,CDS,etc. [default: gene]

--inflate_off
By default we inflate the block fields (blockCount, blockSizes,
blockStarts) to create subfeatures of the main feature
(primary_tag). Type of subfeature created based on the
inflate_type parameter. If you don't want this inflating
behaviour you can deactivate it by using the option
--inflate_off.

--inflate_type
Feature type (3rd column in gff) created when inflate parameter
activated [default: exon].

--verbose
add verbosity

-o , --output , --out , --outfile or --gff
Output GFF file. If no output file is specified, the output will
be written to STDOUT.

-c or --config
String - Input agat config file. By default AGAT takes as input
agat_config.yaml file from the working directory if any,
otherwise it takes the original agat_config.yaml shipped with
AGAT. To get the agat_config.yaml locally type: "agat config
--expose". The --config option gives you the possibility to use
your own AGAT config file (located elsewhere or named
differently).

-h or --help
Display this helpful text.

Feedback:
Did you find a bug?:
Do not hesitate to report bugs to help us keep track of the bugs and
their resolution. Please use the GitHub issue tracking system available
at this address:

https://github.com/NBISweden/AGAT/issues

Ensure that the bug was not already reported by searching under Issues.
If you're unable to find an (open) issue addressing the problem, open a new one.
Try as much as possible to include in the issue when relevant:
- a clear description,
- as much relevant information as possible,
- the command used,
- a data sample,
- an explanation of the expected behaviour that is not occurring.

Do you want to contribute?:
You are very welcome, visit this address for the Contributing
guidelines:
https://github.com/NBISweden/AGAT/blob/master/CONTRIBUTING.md
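
The known problem described in the help text above can be avoided by stripping FASTA header lines before submitting a sequence to GENSCAN. A minimal sketch, assuming a hypothetical input file `input.fa` and a hypothetical helper `strip_fasta_headers` (neither is part of AGAT or of this component):

```sh
#!/bin/bash

# Hypothetical helper (not part of AGAT): print a FASTA file without its
# header lines, i.e. without any line starting with ">".
strip_fasta_headers() {
  grep -v '^>' "$1"
}

# Build a small illustrative FASTA file in a temporary directory.
tmp_dir=$(mktemp -d)
printf '>seq\nACGTACGT\nTTGACA\n' > "$tmp_dir/input.fa"

# Only the sequence lines remain, so no header characters can be
# mistaken for nucleotides and shift the predicted coordinates.
strip_fasta_headers "$tmp_dir/input.fa" > "$tmp_dir/sequence_only.txt"
cat "$tmp_dir/sequence_only.txt"
```

Whether a locally installed GENSCAN shows the same behaviour as the online version is not stated in the help text, so this step is only a precaution.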
19 changes: 19 additions & 0 deletions src/agat/agat_convert_genscan2gff/script.sh
Original file line number Diff line number Diff line change
@@ -0,0 +1,19 @@
#!/bin/bash

## VIASH START
## VIASH END

set -eo pipefail

# Unset flags so that the ${var:+...} expansions below emit nothing for them.
# --inflate_off is a boolean_false argument: it is "true" unless the flag was given.
[[ "$par_inflate_off" == "true" ]] && unset par_inflate_off
# --verbose is a boolean_true argument: it is "false" unless the flag was given.
[[ "$par_verbose" == "false" ]] && unset par_verbose

# run agat_convert_genscan2gff
agat_convert_genscan2gff.pl \
  --genscan "$par_genscan" \
  --output "$par_output" \
  ${par_source:+--source "${par_source}"} \
  ${par_primary_tag:+--primary_tag "${par_primary_tag}"} \
  ${par_inflate_off:+--inflate_off} \
  ${par_inflate_type:+--inflate_type "${par_inflate_type}"} \
  ${par_verbose:+--verbose} \
  ${par_config:+--config "${par_config}"}
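
The optional arguments in the script above rely on the `${var:+word}` parameter expansion, which expands to `word` only when `var` is set and non-empty. A small sketch of that behaviour, using a hypothetical `build_args` function and made-up values (the variable names mirror the `par_*` variables, but nothing here calls AGAT):

```sh
#!/bin/bash

# Hypothetical helper for illustration only: print the optional flags that
# would be passed on, one per line, given the current par_* variables.
build_args() {
  # Mirror the "unset flags" step: a boolean_false argument holds "true"
  # until its flag is given, and a boolean_true argument holds "false".
  [[ "$par_inflate_off" == "true" ]] && unset par_inflate_off
  [[ "$par_verbose" == "false" ]] && unset par_verbose

  # Each expansion contributes words only if its variable is set and non-empty.
  local args=(
    ${par_source:+--source "${par_source}"}
    ${par_inflate_off:+--inflate_off}
    ${par_verbose:+--verbose}
  )
  printf '%s\n' "${args[@]}"
}

# Case: --inflate_off was given (value "false"), --verbose was not,
# and a source string was provided.
par_source="genscan"
par_inflate_off="false"
par_verbose="false"
( build_args )   # run in a subshell so the unset calls do not leak out
```

With these values the sketch prints `--source`, `genscan` and `--inflate_off` on separate lines, and omits `--verbose`.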
27 changes: 27 additions & 0 deletions src/agat/agat_convert_genscan2gff/test.sh
Original file line number Diff line number Diff line change
@@ -0,0 +1,27 @@
#!/bin/bash

## VIASH START
## VIASH END

set -eo pipefail

test_dir="${meta_resources_dir}/test_data"
out_dir="${meta_resources_dir}/out_data"

# make sure the output directory exists before writing to it
mkdir -p "$out_dir"

echo "> Run $meta_name with test data"
"$meta_executable" \
  --genscan "$test_dir/test.genscan" \
  --output "$out_dir/output.gff"

echo ">> Checking output"
[ ! -f "$out_dir/output.gff" ] && echo "Output file output.gff does not exist" && exit 1

echo ">> Check if output is empty"
[ ! -s "$out_dir/output.gff" ] && echo "Output file output.gff is empty" && exit 1

echo ">> Check if output matches expected output"
if ! diff "$out_dir/output.gff" "$test_dir/agat_convert_genscan2gff_1.gff"; then
  echo "Output file output.gff does not match expected output"
  exit 1
fi

echo "> Test successful"
25 changes: 25 additions & 0 deletions src/agat/agat_convert_genscan2gff/test_data/agat_convert_genscan2gff_1.gff
@@ -0,0 +1,25 @@
##gff-version 3
unknown genscan gene 2223 4605 75.25 + . ID=gene_1
unknown genscan mRNA 2223 4605 75.25 + . ID=mrna_1;Parent=gene_1
unknown genscan exon 2223 3020 75.25 + . ID=exon_1;Parent=mrna_1
unknown genscan exon 4249 4605 13.03 + . ID=exon_2;Parent=mrna_1
unknown genscan CDS 2223 3020 75.25 + 0 ID=cds_1;Parent=mrna_1
unknown genscan CDS 4249 4605 13.03 + 0 ID=cds_2;Parent=mrna_1
unknown genscan gene 6829 8789 20.06 - . ID=gene_2
unknown genscan mRNA 6829 8789 20.06 - . ID=mrna_2;Parent=gene_2
unknown genscan exon 6829 7297 20.06 - . ID=exon_3;Parent=mrna_2
unknown genscan exon 7730 7888 12.78 - . ID=exon_4;Parent=mrna_2
unknown genscan exon 8029 8185 7.45 - . ID=exon_5;Parent=mrna_2
unknown genscan exon 8278 8546 17.45 - . ID=exon_6;Parent=mrna_2
unknown genscan exon 8647 8789 18.65 - . ID=exon_7;Parent=mrna_2
unknown genscan CDS 6829 7297 20.06 - 1 ID=cds_3;Parent=mrna_2
unknown genscan CDS 7730 7888 12.78 - 1 ID=cds_4;Parent=mrna_2
unknown genscan CDS 8029 8185 7.45 - 2 ID=cds_5;Parent=mrna_2
unknown genscan CDS 8278 8546 17.45 - 1 ID=cds_6;Parent=mrna_2
unknown genscan CDS 8647 8789 18.65 - 0 ID=cds_7;Parent=mrna_2
unknown genscan gene 10209 11924 16.18 + . ID=gene_3
unknown genscan mRNA 10209 11924 16.18 + . ID=mrna_3;Parent=gene_3
unknown genscan exon 10209 11313 16.18 + . ID=exon_8;Parent=mrna_3
unknown genscan exon 11850 11924 3.27 + . ID=exon_9;Parent=mrna_3
unknown genscan CDS 10209 11313 16.18 + 0 ID=cds_8;Parent=mrna_3
unknown genscan CDS 11850 11924 3.27 + 2 ID=cds_9;Parent=mrna_3
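
In the expected output above, every `Parent` attribute points at an `ID` defined on an earlier line (gene, then mRNA, then exon and CDS). A rough sketch of how that relationship could be checked with `awk`, using a hypothetical `check_parents` function and a three-line excerpt of the file; this check is not part of the test in this PR, and it assumes the attribute column contains no spaces:

```sh
#!/bin/bash

# Hypothetical helper: succeed only if every Parent=... value in a GFF file
# refers to an ID=... that appeared on an earlier line.
check_parents() {
  awk '
    /^#/ { next }                      # skip directives and comments
    {
      n = split($NF, kv, ";")          # attributes are the last column
      id = ""; parent = ""
      for (i = 1; i <= n; i++) {
        if (kv[i] ~ /^ID=/)     id = substr(kv[i], 4)
        if (kv[i] ~ /^Parent=/) parent = substr(kv[i], 8)
      }
      if (parent != "" && !(parent in seen)) {
        printf "line %d: unknown parent %s\n", NR, parent
        bad = 1
      }
      if (id != "") seen[id] = 1
    }
    END { exit bad + 0 }
  ' "$1"
}

# Three lines taken from the expected output, written with tab separators.
tmp_gff=$(mktemp)
printf 'unknown\tgenscan\tgene\t2223\t4605\t75.25\t+\t.\tID=gene_1\n' > "$tmp_gff"
printf 'unknown\tgenscan\tmRNA\t2223\t4605\t75.25\t+\t.\tID=mrna_1;Parent=gene_1\n' >> "$tmp_gff"
printf 'unknown\tgenscan\texon\t2223\t3020\t75.25\t+\t.\tID=exon_1;Parent=mrna_1\n' >> "$tmp_gff"

check_parents "$tmp_gff" && echo "all parents resolved"
```

A structural check like this complements the byte-for-byte `diff` in `test.sh`, since it would still pass if only scores or whitespace changed.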
11 changes: 11 additions & 0 deletions src/agat/agat_convert_genscan2gff/test_data/script.sh
Original file line number Diff line number Diff line change
@@ -0,0 +1,11 @@
#!/bin/bash

# clone repo
if [ ! -d /tmp/agat_source ]; then
  git clone --depth 1 --single-branch --branch master https://github.com/NBISweden/AGAT /tmp/agat_source
fi

# copy test data
cp -r /tmp/agat_source/t/scripts_output/in/test.genscan src/agat/agat_convert_genscan2gff/test_data/test.genscan
cp -r /tmp/agat_source/t/scripts_output/out/agat_convert_genscan2gff_1.gff src/agat/agat_convert_genscan2gff/test_data/agat_convert_genscan2gff_1.gff
