This workflow combines multiple software tools, mainly for whole-genome annotation of eukaryotes.
The following tools are required. Available options and compatibility may depend on the software version.
- augustus v.3.4.0
- bedtools v.2.26.0
- BLAST v.2.6.0+
- BLAT v.36x2
- busco v.5.4.3
- cd-hit v.4.8.1
- exonerate v.2.4.0
- genewise v.2.4.1
- genometools v.1.6.2
- geta v.2.44
- gffread
- hisat2 v.2.1.0
- HMMER v.3.3.2
- interproscan v.5.56-89.0
- mafft v.7.508
- magicblast v.1.4.0
- maker v.3.01.03
- PASA v.2.5.2
- Pfam database
- parafly r2013-01-21
- RepeatMasker v.4.1.2-p1
- RepeatModeler v.2.0.1
- samtools v.1.7
- SNAP v.2006-07-28
- stringtie v.2.2.1
- TransDecoder v.5.5.0
- trimmomatic v.0.38
git clone https://github.com/unavailable-2374/Genome-Wide-annotation-pipeline.git
If you have little experience compiling software, we recommend using conda (or mamba) for most of the software installation.
cd Genome-Wide-annotation-pipeline
echo 'export PATH=/PATH/TO/bin:$PATH' >> ~/.bashrc
mamba env create -f anno_tools.yml
conda activate GWAP
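After activating the environment, it can help to confirm that the required tools are actually on PATH. The following is a minimal sketch, not part of the pipeline: the function name `missing_tools` is hypothetical, and the executable names passed to it are assumptions that may differ in your installation.

```shell
# Print each command that cannot be found on PATH, one per line.
# Returns 1 if at least one command is missing, 0 otherwise.
missing_tools() {
    local rc=0
    local tool
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            printf '%s\n' "$tool"
            rc=1
        fi
    done
    return "$rc"
}

# Example check for a few of the tools listed above (names are assumptions).
missing=$(missing_tools augustus hisat2 samtools stringtie) ||
    printf 'Not found on PATH:\n%s\n' "$missing" >&2
```

Extend the argument list with the executables of the other tools as they are named on your system.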
Download and concatenate the Pfam databases:
wget ftp://ftp.ebi.ac.uk/pub/databases/Pfam/releases/Pfam27.0/Pfam-A.hmm.gz
wget ftp://ftp.ebi.ac.uk/pub/databases/Pfam/releases/Pfam27.0/Pfam-B.hmm.gz
gzip -dc Pfam-A.hmm.gz > Pfam-AB.hmm
gzip -dc Pfam-B.hmm.gz >> Pfam-AB.hmm
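A quick sanity check on the concatenated file is to count its profiles. This sketch assumes the standard HMMER3 text format, in which every profile has one line starting with `NAME`; the function name `count_hmm_profiles` is hypothetical and not part of the pipeline.

```shell
# Count the profiles in an HMMER3 text database by counting NAME lines.
# Usage: count_hmm_profiles Pfam-AB.hmm
count_hmm_profiles() {
    grep -c '^NAME ' "$1"
}
```

If the concatenation worked, the count for Pfam-AB.hmm should equal the sum of the counts for the decompressed Pfam-A and Pfam-B files.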
Usage:
perl GWAP.pl [options]
For example:
perl GWAP.pl --genome genome.fasta -1 rna_1.1.fq.gz,rna_2.1.fq.gz -2 rna_1.2.fq.gz,rna_2.2.fq.gz --protein homolog.fasta --out_prefix out --cpu 80 --gene_prefix Vitis --Pfam_db /PATH-to/Pfam-AB.hmm
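The example above passes two RNA-seq libraries as comma-separated lists, with the files in the same order for -1 and -2. With many libraries, building those lists by hand is error-prone, so a small helper can do it. This is a sketch only; `join_by_comma` is a hypothetical helper name, and the file names are the ones from the example.

```shell
# Join all arguments into one comma-separated string (no trailing comma).
join_by_comma() {
    local IFS=,
    printf '%s\n' "$*"
}

# Keep the library order identical in both lists so the mates stay paired.
reads_1=$(join_by_comma rna_1.1.fq.gz rna_2.1.fq.gz)
reads_2=$(join_by_comma rna_1.2.fq.gz rna_2.2.fq.gz)

echo "-1 $reads_1 -2 $reads_2"
# prints: -1 rna_1.1.fq.gz,rna_2.1.fq.gz -2 rna_1.2.fq.gz,rna_2.2.fq.gz
```

The two variables can then be passed to GWAP.pl as the values of -1 and -2.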
Parameters:
[General]
--genome <string> Required
Genome file in FASTA format.
-1 <string> -2 <string> Required
FASTQ files containing paired-end RNA-seq data. If the data come from multiple libraries, give multiple FASTQ files separated by commas. Gzip-compressed files (.gz) are also accepted.
--protein <string> Required
Homologous protein sequences in FASTA format (sequences derived from multiple species are recommended).
--augustus_species <string> Required when --use_existed_augustus_species is not provided
Species identifier for Augustus. The HMM files produced by Augustus training will be created with this prefix. If HMM files for this identifier already exist, the program first deletes that directory and then starts the Augustus training steps.
[Other]
--out_prefix <string> default: out
Prefix for output files.
--use_existed_augustus_species <string> Required when --augustus_species is not provided
Species identifier for Augustus. This parameter conflicts with --augustus_species. When it is set, --augustus_species is ignored, the HMM files from a previous Augustus training must already exist, and the Augustus training step is skipped (this saves a lot of running time).
--RM_species <string> default: None
Species identifier for RepeatMasker. Acceptable values are listed in the file $dirname/RepeatMasker_species.txt. For example: Eukaryota for eukaryotes, Fungi for fungi, Viridiplantae for plants, Metazoa for animals. When this parameter is set, repeats in the genome sequences are searched against the Repbase database.
--RM_lib <string> default: None
A FASTA file of repeat sequences, usually the output of RepeatModeler. If not set, RepeatModeler is run automatically to produce this file, which is time-consuming.
--augustus_species_start_from <string> default: None
Species identifier for Augustus. The optimization step of Augustus training starts from the parameter file of this species, so choosing a closely related species may save considerable time.
--cpu <int> default: 4
Number of threads.
--strand_specific default: False
Enable use of the strand-specific information provided by the "XS" tag in SAM alignments. When this parameter is set, the "--rna-strandness" parameter of hisat2 should usually be set to "RF".
--Pfam_db <string> default: None
Absolute path of the protein family HMM database used to filter false-positive gene models. Multiple databases can be given; separate the database file prefixes with commas.
--gene_prefix <string> default: gene
Prefix of the gene IDs shown in the output files.
--help|-h Display this help info
Version: 1.0
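Because --augustus_species deletes and retrains existing parameters while --use_existed_augustus_species reuses them, it can be useful to pick the option automatically. The sketch below assumes that trained parameters are stored under $AUGUSTUS_CONFIG_PATH/species/<identifier>, which is the usual Augustus layout but is not confirmed here for this pipeline; `augustus_flag` is a hypothetical helper name.

```shell
# Print the GWAP.pl option to use for a given Augustus species identifier.
# Assumption: trained parameters live in $AUGUSTUS_CONFIG_PATH/species/<id>.
augustus_flag() {
    local species="$1"
    local config_dir="${AUGUSTUS_CONFIG_PATH:-}"
    if [ -n "$config_dir" ] && [ -d "$config_dir/species/$species" ]; then
        # Parameters already exist: reuse them and skip training.
        printf '%s %s\n' --use_existed_augustus_species "$species"
    else
        # No parameters found: train a new species model.
        printf '%s %s\n' --augustus_species "$species"
    fi
}
```

For example, `augustus_flag Vitis` prints the option and identifier as one line, which can be appended to the GWAP.pl command line.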