Add agat sp complement annotations #129

Open
wants to merge 15 commits into
base: main
from
1 change: 1 addition & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
@@ -25,6 +25,7 @@
- `agat/agat_convert_embl2gff`: convert an EMBL file into GFF format (PR #99).
- `agat/agat_convert_sp_gff2tsv`: convert gtf/gff file into tabulated file (PR #102).
- `agat/agat_convert_sp_gxf2gxf`: fixes and/or standardizes any GTF/GFF file into full sorted GTF/GFF file (PR #103).
- `agat/agat_sp_complement_annotations`: complement a reference annotation with other annotations (PR #129).

* `bedtools`:
- `bedtools/bedtools_intersect`: Allows one to screen for overlaps between two sets of genomic features (PR #94).
95 changes: 95 additions & 0 deletions src/agat/agat_sp_complement_annotations/config.vsh.yaml
@@ -0,0 +1,95 @@
name: agat_sp_complement_annotations
namespace: agat
description: |
  This script complements a reference annotation with other annotations.

  * An l1 feature from the addfile.gff that does not overlap an l1 feature from the reference annotation will be added.
  * An l1 feature from the addfile.gff without a CDS that overlaps an l1 feature with a CDS from the reference annotation will be added.
  * An l1 feature from the addfile.gff with a CDS that overlaps an l1 feature without a CDS from the reference annotation will be added.
  * An l1 feature from the addfile.gff with a CDS that overlaps an l1 feature with a CDS from the reference annotation will be added only if the CDSs don't overlap.
  * An l1 feature from the addfile.gff without a CDS that overlaps an l1 feature without a CDS from the reference annotation will be added only if none of the l3 features overlap.

  ! It is sufficient that only one isoform is overlapping to prevent the whole gene (l1 feature) from the addfile.gff to be added in the output.
Contributor
Suggested change
! It is sufficient that only one isoform is overlapping to prevent the whole gene (l1 feature) from the addfile.gff to be added in the output.
It is sufficient that only one isoform is overlapping to prevent the whole gene (l1 feature) from the addfile.gff to be added in the output.
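
The overlap rules in the description above can be read as a small decision table. The sketch below is only an illustration of that table, not AGAT's implementation: `decide_overlapping_l1` and its yes/no arguments are hypothetical names, and the isoform caveat (one overlapping isoform blocks the whole gene) is not modelled. An l1 feature that overlaps nothing in the reference is always added, so the function covers only the overlapping case.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of AGAT): prints "add" or "skip" for an l1
# feature from the add file that overlaps an l1 feature in the reference.
#   $1  add feature has a CDS        (yes|no)
#   $2  reference feature has a CDS  (yes|no)
#   $3  the CDSs overlap             (yes|no), used when both have a CDS
#   $4  any l3 features overlap      (yes|no), used when neither has a CDS
decide_overlapping_l1() {
  local add_cds=$1 ref_cds=$2 cds_overlap=$3 l3_overlap=$4
  if [[ $add_cds != "$ref_cds" ]]; then
    # Exactly one of the two features is coding: always added.
    echo "add"
  elif [[ $add_cds == yes ]]; then
    # Both coding: added only if the CDSs do not overlap.
    if [[ $cds_overlap == no ]]; then echo "add"; else echo "skip"; fi
  else
    # Neither coding: added only if no l3 features overlap.
    if [[ $l3_overlap == no ]]; then echo "add"; else echo "skip"; fi
  fi
}

decide_overlapping_l1 yes yes yes no   # both coding, CDSs overlap
decide_overlapping_l1 no yes no no     # only the reference is coding
```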

keywords: [gene annotations, GFF]
links:
  homepage: https://github.com/NBISweden/AGAT
  documentation: https://agat.readthedocs.io/en/latest/tools/agat_sp_complement_annotations.html
  issue_tracker: https://github.com/NBISweden/AGAT/issues
  repository: https://github.com/NBISweden/AGAT
references:
  doi: 10.5281/zenodo.3552717
Contributor
Suggested change
doi: 10.5281/zenodo.3552717
https://doi.org/10.5281/zenodo.3552717

license: GPL-3.0
requirements:
  - commands: [agat]
authors:
  - __merge__: /src/_authors/leila_paquay.yaml
    roles: [ author, maintainer ]
argument_groups:
  - name: Inputs
    arguments:
      - name: --ref
        alternatives: [-r, -i]
        description: Input GTF/GFF file used as reference.
        type: file
        required: true
        direction: input
        example: reference.gff
      - name: --add
        alternatives: [-a]
        description: |
          Annotation(s) file you would like to use to complement the
          reference annotation. You can specify as much file you want like.
          The order you provide these files matter. Once the reference file has been
          complemented by file1, this new annotation becomes the new
          reference that will be complemented by file2 etc.
          So, be aware of what you want if you use several addfiles.
Comment on lines +40 to +45
Contributor
Suggested change
Annotation(s) file you would like to use to complement the
reference annotation. You can specify as much file you want like.
The order you provide these files matter. Once the reference file has been
complemented by file1, this new annotation becomes the new
reference that will be complemented by file2 etc.
So, be aware of what you want if you use several addfiles.
Annotation file(s) you would like to use to complement the reference annotation. You can specify as many files as you like. The order you provide these files matter. Once the reference file has been complemented by file1, this new annotation becomes the new reference that will be complemented by file2, etc. So, be aware of what you want if you use several addfiles.
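
To make the order dependence concrete, the sketch below prints the chain of single-file runs that the description corresponds to conceptually: each step's output becomes the reference for the next add file. `print_chained_runs` and the `stepN.gff` names are hypothetical, nothing is executed, and whether separate runs would give exactly the same file as one multi-`--add` run is an assumption this sketch does not verify.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the step-by-step runs implied by the sequential
# behaviour. step1.gff, step2.gff, ... are made-up intermediate file names.
print_chained_runs() {
  local current=$1; shift
  local step=0 add
  for add in "$@"; do
    step=$((step + 1))
    echo "agat_sp_complement_annotations.pl --ref $current --add $add --out step$step.gff"
    # The freshly complemented annotation is the reference for the next file.
    current="step$step.gff"
  done
}

print_chained_runs annotation_ref.gff addfile1.gff addfile2.gff
```

Swapping `addfile1.gff` and `addfile2.gff` changes which file gets to fill a locus first, which is why the two orders can give different results.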

        type: file
        required: true
        direction: input
        multiple: true
        example: addfile1.gff
  - name: Outputs
    arguments:
      - name: --output
        alternatives: [-o, --out, --outfile]
        description: Output gff3 containing the reference annotation with all the non-overlapping newly added genes from addfiles.gff.
        type: file
        direction: output
        required: true
        example: output.gff
  - name: Arguments
    arguments:
      - name: --size_min
        alternatives: [-s]
        description: |
          Option to keep the non-overlapping gene only if the CDS size (in
          nucleotide) is over the minimum size defined. Default = 0 that
          means all of them are kept.
Comment on lines +65 to +67
Contributor
Suggested change
Option to keep the non-overlapping gene only if the CDS size (in
nucleotide) is over the minimum size defined. Default = 0 that
means all of them are kept.
Option to keep the non-overlapping gene only if the CDS size (in nucleotides) is over the minimum size defined. The default is 0, meaning all of them are kept.

        type: integer
        required: false
        example: 100
Contributor
Suggested change
example: 100
example: 100
min: 0
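
For a sense of scale when choosing `--size_min`, the CDS size of a gene can be worked out by hand from its CDS rows. The sketch below sums the six CDS segments of NAC001 from `test_data/9_test.gff`; GFF coordinates are 1-based and inclusive, so each segment contributes `end - start + 1` nucleotides. `cds_length` and `passes_size_min` are hypothetical helpers, and the strict "greater than" comparison is only a guess based on the word "over" in the description, not AGAT's confirmed behaviour.

```bash
#!/usr/bin/env bash
# Sum CDS segment lengths given as "start-end" pairs (1-based, inclusive).
cds_length() {
  local total=0 segment start end
  for segment in "$@"; do
    start=${segment%-*}   # text before the dash
    end=${segment#*-}     # text after the dash
    total=$((total + end - start + 1))
  done
  echo "$total"
}

# Hypothetical filter: succeed only when the CDS length is strictly greater
# than size_min (assumed reading of "over the minimum size").
passes_size_min() {
  local length=$1 size_min=${2:-0}
  (( length > size_min ))
}

# CDS segments of NAC001 in test_data/9_test.gff
nac001=$(cds_length 3760-3913 3996-4276 4486-4605 4706-5095 5174-5326 5439-5627)
echo "$nac001"   # 1287
```

With `--size_min 100`, as in the config example, this gene would pass under the assumed comparison.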

      - name: --config
        alternatives: [-c]
        description: |
          AGAT config file. By default AGAT takes the original agat_config.yaml shipped with AGAT. The `--config` option gives you the possibility to use your own AGAT config file (located elsewhere or named differently).
        type: file
        required: false
        example: custom_config.yaml
resources:
  - type: bash_script
    path: script.sh
test_resources:
  - type: bash_script
    path: test.sh
  - type: file
    path: test_data
engines:
  - type: docker
    image: quay.io/biocontainers/agat:1.4.0--pl5321hdfd78af_0
    setup:
      - type: docker
        run: |
          agat --version | sed 's/AGAT\s\(.*\)/agat: "\1"/' > /var/software_versions.txt
runners:
  - type: executable
  - type: nextflow
Contributor
Suggested change
- type: nextflow
- type: nextflow
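
The `sed` expression in the docker setup step can be tried on its own. The sketch below assumes, purely for illustration, that `agat --version` prints a line of the form `AGAT v1.4.0`; the real output may differ. `format_agat_version` is a hypothetical wrapper, and it uses `[[:space:]]` in place of the GNU-only `\s` so that it also runs with BSD sed.

```bash
#!/usr/bin/env bash
# Turn a line such as "AGAT v1.4.0" into the YAML entry  agat: "v1.4.0".
# [[:space:]] stands in for the GNU-only \s used in the config.
format_agat_version() {
  sed 's/AGAT[[:space:]]\(.*\)/agat: "\1"/'
}

# The input line is an assumed example of what `agat --version` prints.
printf 'AGAT v1.4.0\n' | format_agat_version
```

A line that does not match the pattern passes through `sed` unchanged, so it is worth checking `/var/software_versions.txt` in the built image.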

91 changes: 91 additions & 0 deletions src/agat/agat_sp_complement_annotations/help.txt
@@ -0,0 +1,91 @@
```sh
agat_sp_complement_annotations.pl --help
```

------------------------------------------------------------------------------
| Another GFF Analysis Toolkit (AGAT) - Version: v1.4.0 |
| https://github.com/NBISweden/AGAT |
| National Bioinformatics Infrastructure Sweden (NBIS) - www.nbis.se |
------------------------------------------------------------------------------


Name:
agat_sp_complement_annotations.pl

Description:
The script allows to complement a reference annotation with other
annotations. A l1 feature from the addfile.gff that does not overlap a
l1 feature from the reference annotation will be added. A l1 feature
from the addfile.gff without a CDS that overlaps a l1 feature with a CDS
from the reference annotation will be added. A l1 feature from the
addfile.gff with a CDS that overlaps a l1 feature without a CDS from the
reference annotation will be added. A l1 feature from the addfile.gff
with a CDS that overlaps a l1 feature with a CDS from the reference
annotation will be added only if the CDSs don't overlap. A l1 feature
from the addfile.gff without a CDS that overlaps a l1 feature without a
CDS from the reference annotation will be added only if none of the l3
features overlap. /!\ It is sufficiant that only one isoform is
overlapping to prevent the whole gene (l1 feature) from the addfile.gff
to be added in the output.

Usage:
agat_sp_complement_annotations.pl --ref annotation_ref.gff --add addfile1.gff --add addfile2.gff --out outFile
agat_sp_complement_annotations.pl --help

Options:
--ref, -r or -i
Input GTF/GFF file used as reference.

--add or -a
Annotation(s) file you would like to use to complement the
reference annotation. You can specify as much file you want like
so: -a addfile1 -a addfile2 -a addfile3 /!\ The order you
provide these files matter. Once the reference file has been
complemented by file1, this new annotation becomes the new
reference that will be complemented by file2 etc. /!\ The result
with -a addfile1 -a addfile2 will differ to the result from -a
addfile2 -a addfile1. So, be aware of what you want if you use
several addfiles.

--size_min or -s
Option to keep the non-overlping gene only if the CDS size (in
nucleotide) is over the minimum size defined. Default = 0 that
means all of them are kept.

--out, --output, --outfile or -o
Output gff3 containing the reference annotation with all the
non-overlapping newly added genes from addfiles.gff.

-c or --config
String - Input agat config file. By default AGAT takes as input
agat_config.yaml file from the working directory if any,
otherwise it takes the orignal agat_config.yaml shipped with
AGAT. To get the agat_config.yaml locally type: "agat config
--expose". The --config option gives you the possibility to use
your own AGAT config file (located elsewhere or named
differently).

--help or -h
Display this helpful text.

Feedback:
Did you find a bug?:
Do not hesitate to report bugs to help us keep track of the bugs and
their resolution. Please use the GitHub issue tracking system available
at this address:

https://github.com/NBISweden/AGAT/issues

Ensure that the bug was not already reported by searching under Issues.
If you're unable to find an (open) issue addressing the problem, open a new one.
Try as much as possible to include in the issue when relevant:
- a clear description,
- as much relevant information as possible,
- the command used,
- a data sample,
- an explanation of the expected behaviour that is not occurring.

Do you want to contribute?:
You are very welcome, visit this address for the Contributing
guidelines:
https://github.com/NBISweden/AGAT/blob/master/CONTRIBUTING.md
25 changes: 25 additions & 0 deletions src/agat/agat_sp_complement_annotations/script.sh
@@ -0,0 +1,25 @@
#!/bin/bash

set -eo pipefail

## VIASH START
## VIASH END

# unset flags
[[ "$par_verbose" == "false" ]] && unset par_verbose

# Convert the semicolon-separated list of file names into repeated --add arguments.
# An array keeps file names containing spaces intact.
add_args=()
IFS=";" read -ra file_names <<< "$par_add"
for file in "${file_names[@]}"; do
  add_args+=(--add "$file")
done

# run agat_sp_complement_annotations.pl
agat_sp_complement_annotations.pl \
  --ref "$par_ref" \
  "${add_args[@]}" \
  -o "$par_output" \
  ${par_size_min:+--size_min "${par_size_min}"} \
  ${par_config:+--config "${par_config}"} \
  ${par_verbose:+--verbose}
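
The step that turns the semicolon-separated `par_add` value into repeated `--add` flags can be checked in isolation. The sketch below is one way to do that with a bash array, so that a path containing spaces stays a single argument; `build_add_args` is a hypothetical helper, and it assumes file names never contain a semicolon.

```bash
#!/usr/bin/env bash
# Split a semicolon-separated list into repeated "--add <file>" arguments.
# The result is stored in the global array add_args.
build_add_args() {
  local joined=$1 file
  local -a file_names
  add_args=()
  IFS=";" read -ra file_names <<< "$joined"
  for file in "${file_names[@]}"; do
    add_args+=(--add "$file")
  done
}

build_add_args "addfile1.gff;my annotations/addfile2.gff"
printf '%s\n' "${#add_args[@]}"    # number of arguments: two flags, two paths
printf '[%s]\n' "${add_args[@]}"   # one bracketed line per argument
```

Expanding the array as `"${add_args[@]}"` on the command line passes each flag and path as its own word.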
56 changes: 56 additions & 0 deletions src/agat/agat_sp_complement_annotations/test.sh
@@ -0,0 +1,56 @@
#!/bin/bash

set -eo pipefail

## VIASH START
## VIASH END

test_dir="${meta_resources_dir}/test_data"

# create temporary directory and clean up on exit
TMPDIR=$(mktemp -d "$meta_temp_dir/$meta_functionality_name-XXXXXX")
function clean_up {
  [[ -d "$TMPDIR" ]] && rm -rf "$TMPDIR"
}
trap clean_up EXIT

echo "> Run $meta_name with test data"
"$meta_executable" \
  --ref "$test_dir/25_test.gff" \
  --add "$test_dir/9_test.gff" \
  --output "$TMPDIR/output.gff"

echo ">> Checking output"
[ ! -f "$TMPDIR/output.gff" ] && echo "Output file output.gff does not exist" && exit 1

echo ">> Check if output is empty"
[ ! -s "$TMPDIR/output.gff" ] && echo "Output file output.gff is empty" && exit 1

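echo ">> Check if output matches expected output"
# Test diff inside the if: under `set -e` a bare failing diff would end the
# script before the message below could be printed.
if ! diff "$TMPDIR/output.gff" "$test_dir/agat_sp_complement_annotations_1.gff"; then
  echo "Output file output.gff does not match expected output"
  exit 1
fi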

rm -rf "$TMPDIR/output.gff"

echo "> Run $meta_name with test data"
"$meta_executable" \
  --ref "$test_dir/agat_sp_complement_annotations_ref.gff" \
  --add "$test_dir/agat_sp_complement_annotations_add.gff" \
  --output "$TMPDIR/output.gff"

echo ">> Checking output"
[ ! -f "$TMPDIR/output.gff" ] && echo "Output file output.gff does not exist" && exit 1

echo ">> Check if output is empty"
[ ! -s "$TMPDIR/output.gff" ] && echo "Output file output.gff is empty" && exit 1

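echo ">> Check if output matches expected output"
# Test diff inside the if: under `set -e` a bare failing diff would end the
# script before the message below could be printed.
if ! diff "$TMPDIR/output.gff" "$test_dir/agat_sp_complement_annotations_2.gff"; then
  echo "Output file output.gff does not match expected output"
  exit 1
fi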
echo "> Test successful"
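
Because the test runs under `set -e`, a comparison that should report its own error message has to be evaluated as a condition; a bare failing `diff` ends the script immediately. The sketch below shows one possible shape for such a check. `assert_same_file` is a hypothetical helper, not something viash provides.

```bash
#!/usr/bin/env bash
set -eo pipefail

# Hypothetical helper: compare two files and report a mismatch. diff runs as
# the condition of an if, so `set -e` does not end the script before the
# message is printed.
assert_same_file() {
  local actual=$1 expected=$2
  if ! diff "$actual" "$expected"; then
    echo "Output file $actual does not match expected output $expected"
    return 1
  fi
}

# Small self-contained demonstration with temporary files.
workdir=$(mktemp -d)
printf 'a\n' > "$workdir/actual.gff"
printf 'a\n' > "$workdir/expected.gff"
assert_same_file "$workdir/actual.gff" "$workdir/expected.gff" && echo "files match"
rm -rf "$workdir"
```

Leaving the `diff` output visible, as here, also puts the differing lines in the test log.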
32 changes: 32 additions & 0 deletions src/agat/agat_sp_complement_annotations/test_data/25_test.gff
@@ -0,0 +1,32 @@
# gffread all_merged.stringtie.gtf -E -F -o -
# gffread v0.9.9
##gff-version 3
scaffold1 StringTie transcript 2551 2965 1000.00 . . ID=MSTRG.1.1;geneID=MSTRG.1
scaffold1 StringTie exon 2551 2965 1000.00 . . Parent=MSTRG.1.1;cov=68.607231
scaffold1 StringTie transcript 8147 13353 1000.00 - . ID=MSTRG.6.1;geneID=MSTRG.6
scaffold1 StringTie exon 8147 8981 1000.00 - . Parent=MSTRG.6.1;cov=529.868042
scaffold1 StringTie exon 9082 9171 1000.00 - . Parent=MSTRG.6.1;cov=451.066681
scaffold1 StringTie exon 9328 9433 1000.00 - . Parent=MSTRG.6.1;cov=548.889893
scaffold1 StringTie exon 9682 9875 1000.00 - . Parent=MSTRG.6.1;cov=416.032471
scaffold1 StringTie exon 10018 10228 1000.00 - . Parent=MSTRG.6.1;cov=268.398773
scaffold1 StringTie exon 10436 10511 1000.00 - . Parent=MSTRG.6.1;cov=263.012329
scaffold1 StringTie exon 10665 10744 1000.00 - . Parent=MSTRG.6.1;cov=262.177094
scaffold1 StringTie exon 10901 10996 1000.00 - . Parent=MSTRG.6.1;cov=285.484375
scaffold1 StringTie exon 11277 11348 1000.00 - . Parent=MSTRG.6.1;cov=272.513885
scaffold1 StringTie exon 11521 11718 1000.00 - . Parent=MSTRG.6.1;cov=323.955170
scaffold1 StringTie exon 11802 12004 1000.00 - . Parent=MSTRG.6.1;cov=258.021729
scaffold1 StringTie exon 12106 13353 1000.00 - . Parent=MSTRG.6.1;cov=192.039612
scaffold1 StringTie transcript 21499 23178 1000.00 . . ID=MSTRG.7.1;geneID=MSTRG.7
scaffold1 StringTie exon 21499 23178 1000.00 . . Parent=MSTRG.7.1;cov=207.398804
scaffold1 StringTie transcript 44218 47964 1000.00 - . ID=MSTRG.11.1;geneID=MSTRG.11
scaffold1 StringTie exon 44218 45365 1000.00 - . Parent=MSTRG.11.1;cov=3001.629883
scaffold1 StringTie exon 47660 47706 1000.00 - . Parent=MSTRG.11.1;cov=4399.870117
scaffold1 StringTie exon 47827 47964 1000.00 - . Parent=MSTRG.11.1;cov=2103.559082
scaffold1 StringTie transcript 44218 47964 1000.00 - . ID=MSTRG.11.2;geneID=MSTRG.11
scaffold1 StringTie exon 44218 45365 1000.00 - . Parent=MSTRG.11.2;cov=487.085846
scaffold1 StringTie exon 47660 47718 1000.00 - . Parent=MSTRG.11.2;cov=557.812744
scaffold1 StringTie exon 47824 47964 1000.00 - . Parent=MSTRG.11.2;cov=242.265823
scaffold1 StringTie transcript 44427 47958 1000.00 - . ID=MSTRG.11.3;geneID=MSTRG.11
scaffold1 StringTie exon 44427 45365 1000.00 - . Parent=MSTRG.11.3;cov=2892.249023
scaffold1 StringTie exon 47660 47723 1000.00 - . Parent=MSTRG.11.3;cov=2083.479492
scaffold1 StringTie exon 47827 47958 1000.00 - . Parent=MSTRG.11.3;cov=734.545044
20 changes: 20 additions & 0 deletions src/agat/agat_sp_complement_annotations/test_data/9_test.gff
@@ -0,0 +1,20 @@
##gff-version 3
#!gff-spec-version 1.14
#!source-version NCBI C++ formatter 0.2
##Type DNA NC_003070.9
NC_003070.9 RefSeq source 1 30427671 . + . organism=Arabidopsis thaliana;mol_type=genomic DNA;db_xref=taxon:3702;chromosome=1;ecotype=Columbia
NC_003070.9 RefSeq gene 3631 5899 . + . ID=NC_003070.9:NAC001;locus_tag=AT1G01010;
NC_003070.9 RefSeq exon 3631 3913 . + . ID=NM_099983.2;Parent=NC_003070.9:NAC001;gbkey=mRNA;locus_tag=AT1G01010;
NC_003070.9 RefSeq exon 3996 4276 . + . ID=NM_099983.2;Parent=NC_003070.9:NAC001;gbkey=mRNA;locus_tag=AT1G01010;
NC_003070.9 RefSeq exon 4486 4605 . + . ID=NM_099983.2;Parent=NC_003070.9:NAC001;gbkey=mRNA;locus_tag=AT1G01010;
NC_003070.9 RefSeq exon 4706 5095 . + . ID=NM_099983.2;Parent=NC_003070.9:NAC001;gbkey=mRNA;locus_tag=AT1G01010;
NC_003070.9 RefSeq exon 5174 5326 . + . ID=NM_099983.2;Parent=NC_003070.9:NAC001;gbkey=mRNA;locus_tag=AT1G01010;
NC_003070.9 RefSeq exon 5439 5899 . + . ID=NM_099983.2;Parent=NC_003070.9:NAC001;gbkey=mRNA;locus_tag=AT1G01010;
NC_003070.9 RefSeq CDS 3760 3913 . + 0 ID=NM_099983.2;Parent=NC_003070.9:NAC001;locus_tag=AT1G01010;
NC_003070.9 RefSeq CDS 3996 4276 . + 2 ID=NM_099983.2;Parent=NC_003070.9:NAC001;locus_tag=AT1G01010;
NC_003070.9 RefSeq CDS 4486 4605 . + 0 ID=NM_099983.2;Parent=NC_003070.9:NAC001;locus_tag=AT1G01010;
NC_003070.9 RefSeq CDS 4706 5095 . + 0 ID=NM_099983.2;Parent=NC_003070.9:NAC001;locus_tag=AT1G01010;
NC_003070.9 RefSeq CDS 5174 5326 . + 0 ID=NM_099983.2;Parent=NC_003070.9:NAC001;locus_tag=AT1G01010;
NC_003070.9 RefSeq CDS 5439 5627 . + 0 ID=NM_099983.2;Parent=NC_003070.9:NAC001;locus_tag=AT1G01010;
NC_003070.9 RefSeq start_codon 3760 3762 . + 0 ID=NM_099983.2;Parent=NC_003070.9:NAC001;locus_tag=AT1G01010;
NC_003070.9 RefSeq stop_codon 5628 5630 . + 0 ID=NM_099983.2;Parent=NC_003070.9:NAC001;locus_tag=AT1G01010;