Add agat convertspgff2gtf #76

Merged
merged 26 commits into from
Jul 17, 2024
Commits
4e6ede8
Fill in the metadata
Leila011 Jun 7, 2024
a72d08a
add help.txt
Leila011 Jun 19, 2024
7b3e329
add test data
Leila011 Jun 19, 2024
8f047d6
update help.txt
Leila011 Jun 19, 2024
8a8c265
add arguments for input file, output file and other arguments
Leila011 Jun 28, 2024
4a5344c
add a Docker engine
Leila011 Jun 28, 2024
320bb88
Write a runner script
Leila011 Jun 28, 2024
29d1d5a
correct --gtf_version choices
Leila011 Jun 28, 2024
beeadfb
update description
Leila011 Jun 28, 2024
6c7df8f
update keywords
Leila011 Jun 28, 2024
e7181c2
Create test script
Leila011 Jun 28, 2024
53de9ff
Create a /var/software_versions.txt file
Leila011 Jun 28, 2024
90c6366
remove duplicated argument
Leila011 Jun 29, 2024
d32fae5
update config
Leila011 Jul 5, 2024
62ba4ba
Merge remote-tracking branch 'biobox/main' into add-agat_convertspgff…
Leila011 Jul 5, 2024
46d4573
change name to agat_convert_sp_gff2gtf
Leila011 Jul 8, 2024
260fe1c
update license
Leila011 Jul 8, 2024
88565ff
replace module name by $meta_name in test.sh
Leila011 Jul 8, 2024
f2dc58d
Add more info to --gtf_version description
Leila011 Jul 8, 2024
2046191
remove extra \
Leila011 Jul 9, 2024
e1f7efa
add additional test: check if the D column in the first line of the G…
Leila011 Jul 9, 2024
d91e8df
update changelog
Leila011 Jul 9, 2024
67d3cd8
Markdown: add newline before listing
Leila011 Jul 15, 2024
0805222
add test to check if the header contains the right GTF version
Leila011 Jul 15, 2024
18858dc
cleanup
Leila011 Jul 15, 2024
78da7b5
fix formatting
rcannood Jul 17, 2024
3 changes: 3 additions & 0 deletions CHANGELOG.md
@@ -73,6 +73,9 @@
  - `bedtools_getfasta`: extract sequences from a FASTA file for each of the
    intervals defined in a BED/GFF/VCF file (PR #59).

* `agat`:
  - `agat_convert_sp_gff2gtf`: convert any GTF/GFF file into a proper GTF file (PR #76).

## MINOR CHANGES

* Uniformize component metadata (PR #23).
90 changes: 90 additions & 0 deletions src/agat/agat_convert_sp_gff2gtf/config.vsh.yaml
@@ -0,0 +1,90 @@
name: agat_convert_sp_gff2gtf
namespace: agat
description: |
  The script aims to convert any GTF/GFF file into a proper GTF file. Full
  information about the format can be found here:
  https://agat.readthedocs.io/en/latest/gxf.html You can choose among 7
  different GTF types (1, 2, 2.1, 2.2, 2.5, 3 or relax). Depending on the
  version selected, the script will filter out the features that are not
  accepted. For GTF2.5 and 3, every level1 feature (e.g. nc_gene,
  pseudogene) will be converted into a gene feature and every level2 feature
  (e.g. mRNA, ncRNA) will be converted into a transcript feature. Using the
  "relax" option you will produce a GTF-like output keeping all original
  feature types (3rd column). No modification will occur, e.g. mRNA to
  transcript.

  To be fully GTF compliant, all features must have a gene_id and a
  transcript_id attribute. The gene_id is a unique identifier for the genomic
  source of the transcript, which is used to group transcripts into genes. The
  transcript_id is a unique identifier for the predicted transcript, which
  is used to group features into transcripts.
keywords: [gene annotations, GTF conversion]
links:
  homepage: https://github.com/NBISweden/AGAT
  documentation: https://agat.readthedocs.io/
  issue_tracker: https://github.com/NBISweden/AGAT/issues
  repository: https://github.com/NBISweden/AGAT
references:
  doi: 10.5281/zenodo.3552717
license: GPL-3.0
argument_groups:
  - name: Inputs
    arguments:
      - name: --gff
        alternatives: [-i]
        description: Input GFF/GTF file that will be read
        type: file
        required: true
        direction: input
        example: input.gff
  - name: Outputs
    arguments:
      - name: --output
        alternatives: [-o, --out, --outfile, --gtf]
        description: Output GTF file. If no output file is specified, the output will be written to STDOUT.
        type: file
        direction: output
        required: true
        example: output.gtf
  - name: Arguments
    arguments:
      - name: --gtf_version
        description: |
          Version of the GTF output (1, 2, 2.1, 2.2, 2.5, 3 or relax). Default value from AGAT config file (relax for the default config). The script option has the higher priority.

          * relax: all feature types are accepted.
          * GTF3 (9 feature types accepted): gene, transcript, exon, CDS, Selenocysteine, start_codon, stop_codon, three_prime_utr and five_prime_utr.
          * GTF2.5 (8 feature types accepted): gene, transcript, exon, CDS, UTR, start_codon, stop_codon, Selenocysteine.
          * GTF2.2 (9 feature types accepted): CDS, start_codon, stop_codon, 5UTR, 3UTR, inter, inter_CNS, intron_CNS and exon.
          * GTF2.1 (6 feature types accepted): CDS, start_codon, stop_codon, exon, 5UTR, 3UTR.
          * GTF2 (4 feature types accepted): CDS, start_codon, stop_codon, exon.
          * GTF1 (5 feature types accepted): CDS, start_codon, stop_codon, exon, intron.
        type: string
        choices: [relax, "1", "2", "2.1", "2.2", "2.5", "3"]
        required: false
        example: "3"
      - name: --config
        alternatives: [-c]
        description: |
          Input AGAT config file. By default AGAT takes as input the agat_config.yaml file from the working directory if any, otherwise it takes the original agat_config.yaml shipped with AGAT. To get the agat_config.yaml locally type: "agat config --expose". The --config option gives you the possibility to use your own AGAT config file (located elsewhere or named differently).
        type: file
        required: false
        example: custom_agat_config.yaml
resources:
  - type: bash_script
    path: script.sh
test_resources:
  - type: bash_script
    path: test.sh
  - type: file
    path: test_data
engines:
  - type: docker
    image: quay.io/biocontainers/agat:1.4.0--pl5321hdfd78af_0
    setup:
      - type: docker
        run: |
          agat --version | sed 's/AGAT\s\(.*\)/agat: "\1"/' > /var/software_versions.txt
runners:
  - type: executable
  - type: nextflow
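
Note on the Docker `setup` step in this config: it records the tool version by rewriting the output of `agat --version` with `sed`. The following is a minimal sketch of that rewrite. It assumes the command prints a line of the form `AGAT v1.4.0` (that exact string is an assumption, not taken from this PR) and that GNU `sed` is available, since `\s` is a GNU extension. The `format_version` helper is hypothetical.

```shell
#!/bin/bash

# Hypothetical helper for illustration: applies the same sed expression as
# the setup step to whatever is piped in.
format_version() {
  sed 's/AGAT\s\(.*\)/agat: "\1"/'
}

# Assumed sample of what `agat --version` might print.
echo "AGAT v1.4.0" | format_version
```

With that assumed input, the result is a single `key: "value"` line, which is the shape written to `/var/software_versions.txt`.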
102 changes: 102 additions & 0 deletions src/agat/agat_convert_sp_gff2gtf/help.txt
@@ -0,0 +1,102 @@
```sh
agat_convert_sp_gff2gtf.pl --help
```
------------------------------------------------------------------------------
| Another GFF Analysis Toolkit (AGAT) - Version: v1.4.0 |
| https://github.com/NBISweden/AGAT |
| National Bioinformatics Infrastructure Sweden (NBIS) - www.nbis.se |
------------------------------------------------------------------------------


Name:
agat_convert_sp_gff2gtf.pl

Description:
The script aims to convert any GTF/GFF file into a proper GTF file. Full
information about the format can be found here:
https://agat.readthedocs.io/en/latest/gxf.html You can choose among 7
different GTF types (1, 2, 2.1, 2.2, 2.5, 3 or relax). Depending the
version selected the script will filter out the features that are not
accepted. For GTF2.5 and 3, every level1 feature (e.g nc_gene
pseudogene) will be converted into gene feature and every level2 feature
(e.g mRNA ncRNA) will be converted into transcript feature. Using the
"relax" option you will produce a GTF-like output keeping all original
feature types (3rd column). No modification will occur e.g. mRNA to
transcript.

To be fully GTF compliant all feature have a gene_id and a transcript_id
attribute. The gene_id is unique identifier for the genomic source of
the transcript, which is used to group transcripts into genes. The
transcript_id is a unique identifier for the predicted transcript, which
is used to group features into transcripts.

Usage:
agat_convert_sp_gff2gtf.pl --gff infile.gff [ -o outfile ]
agat_convert_sp_gff2gtf -h

Options:
--gff, --gtf or -i
Input GFF/GTF file that will be read

--gtf_version version of the GTF output (1,2,2.1,2.2,2.5,3 or relax).
Default value from AGAT config file (relax for the default config). The
script option has the higher priority.
relax: all feature types are accepted.

GTF3 (9 feature types accepted): gene, transcript, exon, CDS,
Selenocysteine, start_codon, stop_codon, three_prime_utr and
five_prime_utr

GTF2.5 (8 feature types accepted): gene, transcript, exon, CDS,
UTR, start_codon, stop_codon, Selenocysteine

GTF2.2 (9 feature types accepted): CDS, start_codon, stop_codon,
5UTR, 3UTR, inter, inter_CNS, intron_CNS and exon

GTF2.1 (6 feature types accepted): CDS, start_codon, stop_codon,
exon, 5UTR, 3UTR

GTF2 (4 feature types accepted): CDS, start_codon, stop_codon,
exon

GTF1 (5 feature types accepted): CDS, start_codon, stop_codon,
exon, intron

-o , --output , --out , --outfile or --gtf
Output GTF file. If no output file is specified, the output will
be written to STDOUT.

-c or --config
String - Input agat config file. By default AGAT takes as input
agat_config.yaml file from the working directory if any,
otherwise it takes the orignal agat_config.yaml shipped with
AGAT. To get the agat_config.yaml locally type: "agat config
--expose". The --config option gives you the possibility to use
your own AGAT config file (located elsewhere or named
differently).

-h or --help
Display this helpful text.

Feedback:
Did you find a bug?:
Do not hesitate to report bugs to help us keep track of the bugs and
their resolution. Please use the GitHub issue tracking system available
at this address:

https://github.com/NBISweden/AGAT/issues

Ensure that the bug was not already reported by searching under Issues.
If you're unable to find an (open) issue addressing the problem, open a new one.
Try as much as possible to include in the issue when relevant:
- a clear description,
- as much relevant information as possible,
- the command used,
- a data sample,
- an explanation of the expected behaviour that is not occurring.

Do you want to contribute?:
You are very welcome, visit this address for the Contributing
guidelines:
https://github.com/NBISweden/AGAT/blob/master/CONTRIBUTING.md
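
To make the per-version feature lists in the help text concrete, here is a rough sketch of what restricting a file to the four GTF2 feature types amounts to. It is only an illustration written with `awk` and is not how AGAT implements the filter. The `keep_gtf2_types` and `sample_rows` names are hypothetical, and the real conversion also has to supply `gene_id` and `transcript_id` attributes, which this sketch does not attempt.

```shell
#!/bin/bash

# Hypothetical helper: keep only rows whose 3rd column is one of the GTF2
# feature types (CDS, start_codon, stop_codon, exon). Input is tab-separated.
keep_gtf2_types() {
  awk -F'\t' '$3 == "CDS" || $3 == "start_codon" || $3 == "stop_codon" || $3 == "exon"'
}

# Three rows modelled on 0_test.gff, with the attribute column shortened.
# printf reuses the format for each group of nine arguments.
sample_rows() {
  printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
    scaffold625 maker gene 337818 343277 . + . 'ID=CLUHARG00000005458' \
    scaffold625 maker exon 337818 337971 . + . 'ID=CLUHART00000008717:exon:1404' \
    scaffold625 maker CDS 337915 337971 . + 0 'ID=CLUHART00000008717:cds'
}

# Only the exon and CDS rows pass the filter; the gene row is dropped.
sample_rows | keep_gtf2_types
```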

10 changes: 10 additions & 0 deletions src/agat/agat_convert_sp_gff2gtf/script.sh
@@ -0,0 +1,10 @@
#!/bin/bash

## VIASH START
## VIASH END

agat_convert_sp_gff2gtf.pl \
-i "$par_gff" \
-o "$par_output" \
${par_gtf_version:+--gtf_version "${par_gtf_version}"} \
${par_config:+--config "${par_config}"}
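
Note on the runner script: `--gtf_version` and `--config` are only passed when the corresponding `par_*` variable is set and non-empty, via the `${var:+word}` expansion. Here is a small sketch of that behaviour, using a hypothetical `show_args` helper in place of `agat_convert_sp_gff2gtf.pl` so it runs without AGAT installed:

```shell
#!/bin/bash

# Hypothetical stand-in for the real script: prints each received argument
# on its own line so the effect of the expansion is visible.
show_args() {
  printf '%s\n' "$@"
}

# Mirrors the call in script.sh, with show_args instead of the AGAT script.
run_with_optional_args() {
  show_args \
    -i "$par_gff" \
    -o "$par_output" \
    ${par_gtf_version:+--gtf_version "${par_gtf_version}"} \
    ${par_config:+--config "${par_config}"}
}

par_gff="input.gff"
par_output="output.gtf"
par_gtf_version="2.5"   # set: the flag and its value are both passed
par_config=""           # empty: the --config flag is dropped entirely

run_with_optional_args
```

Because the inner `"${par_config}"` is quoted, a value containing spaces still arrives as a single argument even though the outer expansion is unquoted.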
37 changes: 37 additions & 0 deletions src/agat/agat_convert_sp_gff2gtf/test.sh
@@ -0,0 +1,37 @@
#!/bin/bash

## VIASH START
## VIASH END

test_dir="${meta_resources_dir}/test_data"

echo "> Run $meta_name with test data"
"$meta_executable" \
--gff "$test_dir/0_test.gff" \
--output "output.gtf"

echo ">> Checking output"
[ ! -f "output.gtf" ] && echo "Output file output.gtf does not exist" && exit 1

echo ">> Check if output is empty"
[ ! -s "output.gtf" ] && echo "Output file output.gtf is empty" && exit 1

echo ">> Check if the conversion resulted in the right GTF format"
idGFF=$(head -n 2 "$test_dir/0_test.gff" | grep -o 'ID=[^;]*' | cut -d '=' -f 2-)
expectedGTF="gene_id \"$idGFF\"; ID \"$idGFF\";"
extractedGTF=$(head -n 3 "output.gtf" | grep -o 'gene_id "[^"]*"; ID "[^"]*";')
[ "$extractedGTF" != "$expectedGTF" ] && echo "Output file output.gtf does not have the right format" && exit 1

rm output.gtf

echo "> Run $meta_name with test data and GTF version 2.5"
"$meta_executable" \
--gff "$test_dir/0_test.gff" \
--output "output.gtf" \
--gtf_version "2.5"

echo ">> Check if the output file header displays the right GTF version"
grep -qF "##gtf-version 2.5" "output.gtf"
[ $? -ne 0 ] && echo "Output file output.gtf header does not display the right GTF version" && exit 1

echo "> Test successful"
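
Note on the format check in this test: it derives the expected attribute string from the first feature line of the GFF input and compares it with what `grep -o` pulls out of the GTF output. Below is a sketch of both extractions on inline sample lines. The GFF line is taken from `0_test.gff`; the GTF line is an assumed illustration of the attribute layout the test expects, not captured AGAT output. The two helper function names are hypothetical.

```shell
#!/bin/bash

# Pull the value of the ID attribute, as test.sh does for the GFF input.
extract_gff_id() {
  grep -o 'ID=[^;]*' | cut -d '=' -f 2-
}

# Pull the 'gene_id "..."; ID "...";' pair out of GTF attribute text.
extract_gtf_ids() {
  grep -o 'gene_id "[^"]*"; ID "[^"]*";'
}

# First feature line of 0_test.gff.
gff_line='scaffold625 maker gene 337818 343277 . + . ID=CLUHARG00000005458;Name=TUBB3_2'
# Assumed shape of the converted line (illustrative only).
gtf_line='scaffold625 maker gene 337818 343277 . + . gene_id "CLUHARG00000005458"; ID "CLUHARG00000005458"; Name "TUBB3_2";'

idGFF=$(echo "$gff_line" | extract_gff_id)
expectedGTF="gene_id \"$idGFF\"; ID \"$idGFF\";"
extractedGTF=$(echo "$gtf_line" | extract_gtf_ids)

if [ "$extractedGTF" = "$expectedGTF" ]; then
  echo "match: $extractedGTF"
else
  echo "mismatch"
fi
```

One property worth keeping in mind: `grep -o` prints every match on its own line, so if more than one of the inspected output lines matched the pattern, `extractedGTF` would span several lines and the string comparison would fail.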
36 changes: 36 additions & 0 deletions src/agat/agat_convert_sp_gff2gtf/test_data/0_test.gff
@@ -0,0 +1,36 @@
##gff-version 3
scaffold625 maker gene 337818 343277 . + . ID=CLUHARG00000005458;Name=TUBB3_2
scaffold625 maker mRNA 337818 343277 . + . ID=CLUHART00000008717;Parent=CLUHARG00000005458
scaffold625 maker exon 337818 337971 . + . ID=CLUHART00000008717:exon:1404;Parent=CLUHART00000008717
scaffold625 maker exon 340733 340841 . + . ID=CLUHART00000008717:exon:1405;Parent=CLUHART00000008717
scaffold625 maker exon 341518 341628 . + . ID=CLUHART00000008717:exon:1406;Parent=CLUHART00000008717
scaffold625 maker exon 341964 343277 . + . ID=CLUHART00000008717:exon:1407;Parent=CLUHART00000008717
scaffold625 maker CDS 337915 337971 . + 0 ID=CLUHART00000008717:cds;Parent=CLUHART00000008717
scaffold625 maker CDS 340733 340841 . + 0 ID=CLUHART00000008717:cds;Parent=CLUHART00000008717
scaffold625 maker CDS 341518 341628 . + 2 ID=CLUHART00000008717:cds;Parent=CLUHART00000008717
scaffold625 maker CDS 341964 343033 . + 2 ID=CLUHART00000008717:cds;Parent=CLUHART00000008717
scaffold625 maker five_prime_UTR 337818 337914 . + . ID=CLUHART00000008717:five_prime_utr;Parent=CLUHART00000008717
scaffold625 maker three_prime_UTR 343034 343277 . + . ID=CLUHART00000008717:three_prime_utr;Parent=CLUHART00000008717
scaffold789 maker gene 558184 564780 . + . ID=CLUHARG00000003852;Name=PF11_0240
scaffold789 maker mRNA 558184 564780 . + . ID=CLUHART00000006146;Parent=CLUHARG00000003852
scaffold789 maker exon 558184 560123 . + . ID=CLUHART00000006146:exon:995;Parent=CLUHART00000006146
scaffold789 maker exon 561401 561519 . + . ID=CLUHART00000006146:exon:996;Parent=CLUHART00000006146
scaffold789 maker exon 564171 564235 . + . ID=CLUHART00000006146:exon:997;Parent=CLUHART00000006146
scaffold789 maker exon 564372 564780 . + . ID=CLUHART00000006146:exon:998;Parent=CLUHART00000006146
scaffold789 maker CDS 558191 560123 . + 0 ID=CLUHART00000006146:cds;Parent=CLUHART00000006146
scaffold789 maker CDS 561401 561519 . + 2 ID=CLUHART00000006146:cds;Parent=CLUHART00000006146
scaffold789 maker CDS 564171 564235 . + 0 ID=CLUHART00000006146:cds;Parent=CLUHART00000006146
scaffold789 maker CDS 564372 564588 . + 1 ID=CLUHART00000006146:cds;Parent=CLUHART00000006146
scaffold789 maker five_prime_UTR 558184 558190 . + . ID=CLUHART00000006146:five_prime_utr;Parent=CLUHART00000006146
scaffold789 maker three_prime_UTR 564589 564780 . + . ID=CLUHART00000006146:three_prime_utr;Parent=CLUHART00000006146
scaffold789 maker mRNA 558184 564780 . + . ID=CLUHART00000006147;Parent=CLUHARG00000003852
scaffold789 maker exon 558184 560123 . + . ID=CLUHART00000006147:exon:997;Parent=CLUHART00000006147
scaffold789 maker exon 561401 561519 . + . ID=CLUHART00000006147:exon:998;Parent=CLUHART00000006147
scaffold789 maker exon 562057 562121 . + . ID=CLUHART00000006147:exon:999;Parent=CLUHART00000006147
scaffold789 maker exon 564372 564780 . + . ID=CLUHART00000006147:exon:1000;Parent=CLUHART00000006147
scaffold789 maker CDS 558191 560123 . + 0 ID=CLUHART00000006147:cds;Parent=CLUHART00000006147
scaffold789 maker CDS 561401 561519 . + 2 ID=CLUHART00000006147:cds;Parent=CLUHART00000006147
scaffold789 maker CDS 562057 562121 . + 0 ID=CLUHART00000006147:cds;Parent=CLUHART00000006147
scaffold789 maker CDS 564372 564588 . + 1 ID=CLUHART00000006147:cds;Parent=CLUHART00000006147
scaffold789 maker five_prime_UTR 558184 558190 . + . ID=CLUHART00000006147:five_prime_utr;Parent=CLUHART00000006147
scaffold789 maker three_prime_UTR 564589 564780 . + . ID=CLUHART00000006147:three_prime_utr;Parent=CLUHART00000006147
9 changes: 9 additions & 0 deletions src/agat/agat_convert_sp_gff2gtf/test_data/script.sh
@@ -0,0 +1,9 @@
#!/bin/bash

# clone repo
if [ ! -d /tmp/agat_source ]; then
git clone --depth 1 --single-branch --branch master https://github.com/NBISweden/AGAT /tmp/agat_source
fi

# copy test data
cp -r /tmp/agat_source/t/gff_syntax/in/0_test.gff src/agat/agat_convert_sp_gff2gtf/test_data