GeneId Training Tutorial
Training Tutorial
IMPORTANT: The user interested in training geneid should have access to a unix/linux working environment and have at least basic knowledge of linux command line instructions, BASH shell scripting and the UNIX-based languages gawk and perl.
Furthermore, some programs are C-based and require compilation. Instructions on how to compile should be contained within the program packages. Finally, it is likely that the user will have to make the gawk and perl scripts executable after downloading them. This can be done by running the following linux command: chmod 755 filename.awk (or filename.pl).
The user should also check that the "shebangs" for gawk, perl and bash at the top of every script indicate the proper path of the corresponding executable. For example, "#!/usr/bin/perl" is known as the perl shebang; it is placed at the top of a perl script and tells the operating system where to find the perl interpreter.
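For example (the script names below are taken from tools used later in this tutorial; the interpreter paths printed by "head" should match what "which" reports):
#!/bin/bash
# make the downloaded scripts executable
chmod 755 FastaToTbl TblToFastaFile Translate gff2cds
# inspect the shebang (first line) of a script and locate the interpreters
head -1 FastaToTbl
which gawk perl bash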
Initial set of files required for training/evaluation of geneid:
- Fasta sequences containing the training set gene models: pmarinus4training.fa
- Fasta sequences containing the evaluation set gene models: pmarinus4evaluation.fa
- Sequence including gene models of training set in tabular format: pmarinus4training.tbl
- Sequence including gene models of evaluation set in tabular format: pmarinus4evaluation.tbl
- Annotation file corresponding to training set gene models: pmarinus4training.gff
- Annotation file corresponding to evaluation set gene models: pmarinus4evaluation.gff
NOTE: For Perkinsus marinus we used 765 sequences as part of the training set and 250 gene models to test the newly developed parameter file.
1) In the first step of the training process the CDS and intronic regions (in fasta format) of each of the annotated gene models are extracted from the fasta files given their annotation coordinates (gff file). The CDS and intronic fastas can be obtained by using the in-house program SSgff (tar-gzipped version).
The following wrapper BASH script gives an example of how the CDS and intron sequences can be extracted from the genomic (training) fastas above:
#!/bin/bash
mkdir -p ./intron
mkdir -p ./cds
mkdir -p ./fastas
cd ./fastas/
# obtain multi-fastas
TblToFastaFile pmarinus4training.tbl
while read genomic_id gene_id
do
echo $genomic_id
echo $gene_id
egrep -w "$gene_id\$" <b>pmarinus4training.gff</b> > /tmp/$$gff
<b>SSgff -cE</b> ../fastas/$genomic_id /tmp/$$gff \
| sed -e 's/:/_/' -e 's/ CDS//' \
> ../cds/$gene_id<br>
<b>SSgff -iE</b> ../fastas/$genomic_id /tmp/$$gff \
| sed -e 's/:/_/' -e 's/ Intron.*//' \
> ../intron/$gene_id
done < locus_id_training
NOTE: "locus_id_training" contains the list of all sequences and genes (fields $1 and $9 of the gff file). In the example above the genomic sequences are placed in a directory "fastas" and we the output CDS and intron fastas are output into directories "CDS" and "Intron".
A series of 765 single- or multi-fasta files will be created in each of these two directories; these should be concatenated into single multi-fasta files (see the sketch below). It will also be necessary to convert these files to tabular format, which can be done using the following awk script: FastaToTbl.
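A minimal sketch of the concatenation step (the output names match those used in the scripts that follow):
#!/bin/bash
# concatenate the per-gene CDS and intron fastas into single multi-fasta files
cat ./cds/* > all.cds.fa
cat ./intron/* > all.intron.fa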
Finally, the CDS sequences should be converted to proteins and the user should look for any sequences with in-frame stops and remove them from the training/evaluation set. Nucleotide CDS sequences can be converted to protein sequences using the Translate script (to run Translate the user will also need genetic.code).
The user will also use TblToFastaFile to convert the multi-fasta file into separate fasta files.
The following wrapper BASH script lists the steps described above:
#!/bin/bash
# Convert fasta files to tabular fasta
FastaToTbl all.cds.fa > all.cds.tbl
FastaToTbl all.intron.fa > all.intron.tbl
# Translate the CDS
mkdir -p ./protein ; cd ./protein
FastaToTbl ../cds/* | Translate | TblToFastaFile
cd ..
# Checking for in-frame stop codons in the coding regions
FastaToTbl ./protein/* | grep "[A-Z]\*[A-Z]" |\
gawk '{print $1}' | uniq
In a second step, the start and stop codons and all splice sites of each of the annotated genes are extracted, given the fasta files and gff coordinates, again using SSgff (tar-gzipped version).
The following wrapper BASH script gives an example of how the splice sites and start/stop codons can be extracted from the genomic (training) fastas above:
#!/bin/bash
mkdir -p ./sites
while read genomic_id gene_id
do
echo $genomic_id
echo $gene_id
egrep -w $gene_id pmarinus4training.gff > /tmp/$$gff
SSgff -dabeE ../fastas/$genomic_id /tmp/$$gff \
> /tmp/$$all_sites
for site in Acceptor Donor Stop Start
do
echo $site
grep -A 1 $site /tmp/$$all_sites | sed '/--/d' \
>> ../sites/${site}_sites
done
done < locus_id_training
NOTE: "locus_id_training" contains the list of all sequences and genes (fields $1 and $9 of the gff file). In the example above the genomic sequences are placed in a directory "fastas" and we the output CDS and intron fastas are output into directories "sites". Four multi-fasta files will be created for each type of site (i.e. "Donor_sites").
These files should be converted to tabular format as described above using FastaToTbl.
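For example (the per-site file names follow the extraction script above):
#!/bin/bash
# convert each extracted site multi-fasta to tabular format
for site in Acceptor Donor Start Stop
do
FastaToTbl ./sites/${site}_sites > ${site}_sites.tbl
done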
It is also suggested that the user filter out those splice sites and start codons that are not canonical from the training set. This can be done by employing the following BASH script (example for Donor sites):
#!/bin/bash
#####Donors
gawk '{print $2}' ./Donor_sites.tbl |\
egrep -v '^[atcgNATCGn]{31}GT' > seqs_Don
#11 out of 4277 non-standard donors
egrep -f seqs_Don Donor_sites.tbl | gawk '{print $1}' - \
| sort | uniq > seqs_wrong_Don
egrep -vf seqs_wrong_Don Donor_sites.tbl > Donor_sites_filter1.tbl
#4266 canonical donor sites
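Acceptor sites and start codons can be filtered analogously. The sketch below assumes output file names matching the file list further down; the flank lengths in the regular expressions (28 and 30) are assumptions and must be adjusted to match the actual position of the AG/ATG in the sites extracted by SSgff for this species:
#!/bin/bash
#####Acceptors (flank length before the AG is an assumption; verify against Acceptor_sites.tbl)
gawk '{print $2}' ./Acceptor_sites.tbl |\
egrep -v '^[atcgNATCGn]{28}AG' > seqs_Acc
egrep -f seqs_Acc Acceptor_sites.tbl | gawk '{print $1}' - \
| sort | uniq > seqs_wrong_Acc
egrep -vf seqs_wrong_Acc Acceptor_sites.tbl > Acceptor_sites_filter1.tbl
#####Starts (same caveat about the flank length)
gawk '{print $2}' ./Start_sites.tbl |\
egrep -v '^[atcgNATCGn]{30}ATG' > seqs_Sta
egrep -f seqs_Sta Start_sites.tbl | gawk '{print $1}' - \
| sort | uniq > seqs_wrong_Sta
egrep -vf seqs_wrong_Sta Start_sites.tbl > Start_sites_complete_filter1.tbl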
In the following step of the training process the user will derive frequency matrices for the donors, acceptors and start codons.
However, to do this the user will first have to parse the sequences (normally the sequences in which the gene models are embedded) for the background signals ("GT", "AG" or "ATG") flanked by approximately 30 nucleotides of sequence on each side.
The awk program "FindRegexp" and the BASH script "BoundarySites.sh" can be used to obtain these background sequences in an appropriate format. Furthermore, in order to exclude the real donors, acceptors and starts from the background, the user can use a combination of awk and linux commands (as shown below). It should be pointed out that this program can be very slow when dealing with large fasta files.
The following BASH script demonstrates how the user can obtain the background data for "GT", "AG" and "ATG":
#!/bin/bash
############# Obtaining all the GT sites
FindRegexp GT < pmarinus4training.tbl \
| join pmarinus4training.tbl - | BoundarySites.sh -2 31 \
| gawk '{for (i=2;i<=NF;i+=2) print $1"_"$i,$(i+1)}' | sort \
> all.GTs
#981,326
# Obtaining the false donor sites
gawk '{print $1"_"$5+1}' pmarinus4training.gff | sort \
| join -v 1 all.GTs - \
> all.false_GTs.tbl
#979,004
############# Obtaining all the AG sites
FindRegexp AG < pmarinus4training.tbl \
| join pmarinus4training.tbl - | BoundarySites.sh +4 28 \
| gawk '{for (i=2;i<=NF;i+=2) print $1"_"$i,$(i+1)}' | sort \
> all.AGs
#1,089,374
# Obtaining the false acceptor sites
gawk '{print $1"_"$4-2}' pmarinus4training.gff | sort \
| join -v 1 all.AGs - \
> all.false_AGs.tbl
#1,087,071
############# Obtaining all the ATG sites
FindRegexp ATG < pmarinus4training.tbl \
| join pmarinus4training.tbl - | BoundarySites.sh +2 30 \
| gawk '{for (i=2;i<=NF;i+=2) print $1"_"$i,$(i+1)}' | sort \
> all.ATGs
#304,920
# Obtaining the false Start sites
gawk '{print $1"_"$4}' pmarinus4training.gff | sort \
| join -v 1 all.ATGs - \
> all.false_ATGs.tbl
#304,456
If the user goes through the training protocol described in the BASH scripts of the previous section she should now have the following files required in subsequent steps:
Set of files required in subsequent geneid training steps
- Multi-fasta of training CDS sequences: all.cds.fa
- CDS sequences in tabular format: all.cds.tbl
- Multi-fasta of training intronic sequences: all.intron.fa
- Intronic sequences in tabular format: all.intron.tbl
- Donor Sites in tabular format: Donor_sites_filter1.tbl
- Acceptor sites in tabular format: Acceptor_sites_filter1.tbl
- Start codons in tabular format: Start_sites_complete_filter1.tbl
- Background "false" GT sites in tabular format: all.false_GTs.tbl.gz (gzip compressed)
- Background "false" AG sites in tabular format: all.false_AGs.tbl.gz (gzip compressed)
- Background "false" ATG sites in tabular format: /all.false_ATGs.tbl.gz (gzip compressed)
The files above should be all the user needs to compute the Markov models or PWMs for the splice sites and to derive a model for coding DNA.
Once the user has obtained a large set of background ATG, GT and AGs ("false" start codon and splice sites) and extracted the real splice sites and start codon profiles from the annotated sequences as described in the previous section she can proceed to calculate the frequencies of nucleotides at the positions surrounding the different sites (false and real) by using the awk script frequency.awk.
Then the user should determine the range around the sites which is informative (i.e. where the positional frequency of nucleotides flanking the sites is significantly different from the same positional frequencies in the "false" background sites) by running another awk script: information.awk. This will give the user an indication of the range of nucleotides around the splice sites and start codon which are useful for building the position weight matrices (PWMs).
Furthermore the information content within the real splice sites and start codon profiles can be displayed as a graph (using the pictogram (tar-gzipped version) tool), in which the nucleotide height at any given position is proportional to the frequency of the base at that position with the information content being displayed as bit scores. The following example shows the Donor site profile diagram the user would obtain for P. marinus. This example shows that the Donor profile includes nucleotides carrying significant information from positions -2 to +3 (with reference to the GT dinucleotide).
Having calculated the frequencies of all real and background sites the user can proceed to obtain PWMs for the splice sites and start codon. PWMs are calculated as log-likelihood ratios of the probability of observing a nucleotide at a given position in a real site over the probability of observing the same nucleotide at the same position in a "false" site (as described in more detail elsewhere). PWMs are computed using the awk script aux4.awk. Furthermore, using another awk script (files.awk) the user can restrict the PWM to the range of nucleotides found to be informative (see above) and save it in a format that can be directly introduced into the newly developed geneid parameter file for this species.
The following BASH script exemplifies what has been described in the above paragraphs on how to compute a PWM for a donor site. PWMs for acceptor sites and start codons were obtained in a similar way.
#!/bin/bash
############ Computing frequencies and the amount of information
#### background GTs
frequency.awk 1 all.false_GTs.tbl > freq.GTs.1
#### real donor sites
frequency.awk 1 Donor_sites_filter2.tbl > freq.donors.1
#### checking for significant positions to be subsequently used in PWM
information.awk freq.GTs.1 freq.donors.1 | gawk 'NF==2' | more
#positions 30-36 seem to be the most useful (informative) to build the PWM
#### Plot a graph showing the proportion of nucleotides around the donor site
gawk '{print substr($2,25,15)}' Donor_sites_filter2.tbl > Donor_sites.sub
pictogram Donor_sites.sub Donor -bits -land
#### COMPUTE THE PWM for the DONOR SITE
gawk -f aux4.awk freq.GTs.1 freq.donors.1 > log_mat.donors
files.awk 30 36 log_mat.donors > donor.geneid
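The acceptor and start PWMs are computed analogously. A sketch for the acceptor site is shown below; the range 20 34 passed to files.awk is illustrative only and should be replaced with the informative range reported by information.awk for this species:
#!/bin/bash
#### background AGs and real acceptor sites (file names follow earlier steps)
frequency.awk 1 all.false_AGs.tbl > freq.AGs.1
frequency.awk 1 Acceptor_sites_filter1.tbl > freq.acceptors.1
#### check which positions are informative
information.awk freq.AGs.1 freq.acceptors.1 | gawk 'NF==2' | more
#### compute the acceptor PWM and restrict it to the informative range
gawk -f aux4.awk freq.AGs.1 freq.acceptors.1 > log_mat.acceptors
files.awk 20 34 log_mat.acceptors > acceptor.geneid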
If the user follows the steps above and calculates PWMs for each of the two splice sites and the start codon she will generate the following files that can be introduced directly into the newly developed parameter file:
PWMs for start codon and splice sites that should be incorporated into parameter file
- PWM for donor site in geneid format : donor.geneid
- PWM for acceptor site in geneid format: acceptor.geneid
- PWM for start codon in geneid format: starts.geneid
Once the user has computed PWMs for all splice sites and the start codon she can proceed to derive a model for the coding potential. The model for coding DNA will be a Markov model of order 4 or 5 depending on the amount of coding/non-coding data available. In order to estimate the number of coding and non-coding nucleotides the user can run the following BASH script:
#!/bin/bash
gawk '{ l=length($2); L+=l; print l} END{ print "TOTAL: "L;}' all.cds.tbl
#800,322 (markov5)
gawk '{ l=length($2); L+=l; print l} END{ print "TOTAL: "L;}' all.intron.tbl
#364,591 (markov5)
For this species we were provided with enough data to compute a more accurate Markov model of order 5 which was estimated on both exonic (all.cds.tbl) and intronic (all.intron.tbl) sequences. For the exon sequences we estimate the transition probability distribution of each nucleotide given the pentanucleotide preceding it for each of the three possible frames and an initial probability matrix from the pentamers observed at each codon position using the awk program MarkovMatrices.awk.
For the intron sequences a single transition matrix was computed, as well as a single initial probability matrix, using the awk script MarkovMatrices-noframe.awk. Once the Initial and Transition probability matrices are calculated the user can proceed to compute the coding potential.
The coding potential is computed as a log-likelihood ratio (as described in more detail elsewhere) using the awk programs pro2log_ini.awk (ratios of Initial probability matrices between exonic and intronic sequences) and pro2log_tran.awk (ratios of Transition probability matrices between exonic and intronic sequences). The user should also open the actual programs to check for running options.
The following BASH script employs a combination of the scripts indicated above, awk and linux commands that will allow the user to compute the coding potential in a format which can be introduced into the geneid parameter file for this species.
#!/bin/bash
##########################
# Building coding statistic (Markov model order 5)
#Generating the initial and transition matrices
# without stop codon
#################
#markov5
##################
gawk '{print $1,substr($2,1,length($2)-3)}' all.cds_filter1.tbl | gawk -f MarkovMatrices.awk 5 set1.cds
sort +1 -2 -o set1.cds.5.initial set1.cds.5.initial
sort +1 -2 -o set1.cds.5.transition set1.cds.5.transition
gawk -f MarkovMatrices-noframe.awk 5 set1.intron all.intron_filter1.tbl
sort -o set1.intron.5.initial set1.intron.5.initial
sort -o set1.intron.5.transition set1.intron.5.transition
## Compute log-likelihood exon matrices, assuming intron
## matrices describing background probabilities
gawk -f pro2log_ini.awk set1.intron.5.initial set1.cds.5.initial \
> set1.cds-intron.5.initial
gawk -f pro2log_tran.awk set1.intron.5.transition set1.cds.5.transition \
> set1.cds-intron.5.transition
gawk 'BEGIN {p=-1}{if (((NR+2) % 3)==0) p+=1; print $2,p,$1,$3}' \
set1.cds-intron.5.initial > set1.cds-intron.5.initial.geneid
gawk 'BEGIN {p=-1}{if (((NR+2) % 3)==0) p+=1; print $2,p,$1,$4}' \
set1.cds-intron.5.transition > set1.cds-intron.5.transition.geneid
If the user goes through the training protocol described above she should end up with two coding potential probability matrix files which are ready to be inserted into the parameter file for this species:
Coding potential matrices that should be incorporated into the parameter file:
- coding potential probability distribution ratio (Initial) in geneid format: set1.cds-intron.5.initial.geneid
- coding potential probability distribution ratio (Transition) in geneid format: set1.cds-intron.5.transition.geneid
At this stage the user can start building an initial (un-optimized) version of the parameter file for Perkinsus marinus using the previously computed PWMs and the coding potential data. The user can obtain a "template" of a geneid parameter file here.
At this point the user can start introducing the start codon and splice site profiles into the appropriate places in the template file. It should be noted that at present the syntax has to be strictly followed.
The start site profile (starts.geneid) should be inserted after the comment "# These are the start transition probabilities". Furthermore each of the profiles has four parameters that should be considered when inserting them; for example for the start profile
Parameters to Predict Sites: Length, Offset, Cutoff and order (Markov chain)
Start_profile
13 8 -5 0
"10" corresponds to the length of the profile.
"5" corresponds to the offset position.
"-5" is the score cutoff threshold meaning that scores below this value are not considered.
"0" is the order. If the user had computed a Markov chain or order 1 or 2 to model the splice sites she would place a "1" or "2" in this position. For PWMs the order is always "0".
NOTE: In geneid the offset is calculated in relation to the exon (not intron). This can be more easily explained by looking at the following examples of each one of the profiles:
HOW TO CALCULATE THE OFFSET FOR THE DIFFERENT SITES:
#start profile:
1 2 3 4 5 6 **7 8 9** 10 11 12 13
C T A C A G **A T G** A T A G
In the example above the hypothetical start profile has 13 positions, the exon starts at position 7 (ATG) and the previous intron ends at position 6.
Therefore the offset in this example would be "6" (before first position of exon).
#donor profile:
1 2 3 4 5 6 **7 8** 9 10 11
A G C G T G **G T** A G C
In the example above the hypothetical donor profile has 11 positions, the intron starts at position 7 (GT) and the exon ends at position 6.
Therefore the offset in this example would be "5" (before last position of exon).
#acceptor profile:
1 2 3 4 5 6 **7 8** 9 10 11 12
C T A C A G **A G** A T A G
In the example above the hypothetical acceptor profile has 12 positions, the exon starts at position 9 (AG) and the intron ends at position 8.
Therefore the offset in this example would be "8" (before first position of exon)
Following the rules above, the different sites profile parameters for Perkinsus marinus should be:
# Parameters to Predict Sites: Lenght, Offset, Cutoff and order (Markov chain)
Start_profile
13 8 -5 0
# These are the start transition probabilities
Acceptor_profile
15 13 -5 0
# These are the acceptor transition probabilities
Donor_Profile
7 1 -5 0
# These are the donor transition probabilities
Once the user has inserted the different PWM profiles following the instructions shown above she can proceed to insert the Markov chain matrices into the appropriate places in the parameter file.
But first the user must indicate the Markov order number below "Markov_oligo_logs_file" in the parameter file (in this case "5" for a Markov model of this order). The user should then insert the Markov Initial and Transition probability matrices (set1.cds-intron.5.initial.geneid and set1.cds-intron.5.transition.geneid) below the text "Markov_Initial_probability_matrix" and "Markov_Transition_probability_matrix" respectively.
# Parameters to Predict Exons: Markov chains(initial - transition)
Markov_oligo_logs_file
5
Markov_Initial_probability_matrix
Markov_Transition_probability_matrix
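As a convenience, this section of the parameter file can be assembled with a few shell commands; a minimal sketch is shown below (the output file name is illustrative, and the matrices are simply appended in the required order):
#!/bin/bash
# append the order-5 Markov coding-potential section to a working copy of
# the parameter file being built (file name is illustrative)
{
echo "Markov_oligo_logs_file"
echo "5"
echo "Markov_Initial_probability_matrix"
cat set1.cds-intron.5.initial.geneid
echo "Markov_Transition_probability_matrix"
cat set1.cds-intron.5.transition.geneid
} >> Pmarinus.param.draft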
The parameter file has another section at its very end ("Gene model") which, when optimized, can improve the gene prediction by increasing the stringency of the gene assemblies. For example, for Perkinsus marinus we know from information extracted from the training set that 99% of the 5076 introns range between 21 and 1600 nucleotides in length.
Therefore the user can force geneid to only assemble genes within these limits by setting low and high boundaries for intron length below "INTRAgenic connections" that are similar to the range of values observed in the training set. It could also prove useful to set a minimum (and maximum) intergenic distance (below "INTERgeneic connections"), especially in gene-dense organisms.
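A minimal sketch of how such limits can be derived from the training introns (assuming the intron sequences are in all.intron.tbl; the quantiles chosen are illustrative):
#!/bin/bash
# compute intron lengths and report approximate 0.5% and 99.5% quantiles
gawk '{print length($2)}' all.intron.tbl | sort -n \
| gawk '{len[NR]=$1} END{print NR" introns; 0.5%: "len[int(NR*0.005)+1]"  99.5%: "len[int(NR*0.995)]}'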
The values the user can insert for this species are shown below:
# Global Parameters
# GENE MODEL: Rules about gene assembling (GenAmic)
General_Gene_Model
# INTRAgenic connections
First+:Internal+ Internal+:Terminal+ 15:2000 block
Terminal-:Internal- First-:Internal- 15:2000 blockr
# External features
Promoter+ First+:Single+ 50:4000
Terminal+:Single+ aataaa+ 50:4000
First-:Single- Promoter- 50:4000
aataaa- Single-:Terminal- 50:4000
# INTERgeneic connections
aataaa+:Terminal+:Single+ Single+:First+:Promoter+ 150:Infinity
aataaa+:Terminal+:Single+ Single-:Terminal-:aataaa- 150:Infinity
Promoter-:First-:Single- Single+:First+:Promoter+ 150:Infinity
Promoter-:First-:Single- Single-:Terminal-:aataaa- 150:Infinity
# BEGINNING and END of prediction
Begin+ First+:Internal+:Terminal+:Single+ 0:Infinity
Begin- First-:Internal-:Terminal-:Single- 0:Infinity
First+:Internal+:Terminal+:Single+ End+ 0:Infinity
First-:Internal-:Terminal-:Single- End- 0:Infinity
At this point the user will have a working Perkinsus marinus parameter file albeit not yet optimized. There are two parameters internal to the parameter file of geneid (elaborated in more detail elsewhere) which can be modified to optimize gene predictions.
These parameters (shown here) basically set cutoffs on the scores of the predicted exons to minimize overprediction (eWF) and set the ratio of information from signal and coding statistics used (oWF). A more detailed explanation can be found elsewhere on our group's page.
"Optimization" basically consists of evaluating the geneid predictions on the training set against the training set annotations themselves and picking the combination of oWF and eWF that optimize the prediction accuracy as measured by accuracy at nucleotide, exon and gene level.
Very briefly, "evaluate" gives the user a measure of the "sensitivity" or the proportion of annotated nucleotides, exons or genes that are predicted correctly and the "specificity" or the proportion of the predicted nucleotides/exons/genes which are true-positive (i.e.) fall within the annotations; a more detailed explanation can be found elsewhere). The evaluation tool we use to can be downloaded from here (tar-gzipped version).
"Optimization" requires some processing of the input files using a combination of linux and awk command line instructions as well as the awk tool gff2cds. The steps involved in setting up the optimization process are shown in the BASH script below:
#!/bin/bash
###obtain list of training set fastas containing annotated genes with their respective length
gawk '{print $1,length($2)}' pmarinus4training.tbl | sort | uniq \
> cds_tmp4training
###convert gff-format gene annotations into "cds" format using gff2cds tool
gff2cds source="P.marinus" pmarinus4training.gff \
| sort | join cds_tmp4training - > pmarinus4training.cds
### set up gff file for compatibility with evaluation tool
gawk 'BEGIN{while (getline< ARGV[1]>0){len[$1]=$2;};ARGV[1]="";OFS="\t";} \
{if (NR==1) {ant=$1;print $1,$2,"Sequence",1,len[ant],".",".",".","."}; \
if ($1!=ant) {print "#$";ant=$1;print $1,$2,"Sequence",1,len[ant],".",".",\
".","."}; print }' <b>pmarinus4training.cds</b> <b>pmarinus4training.gff</b> > <b>pmarinus4training.eval.gff</b>
If the user follows the setup steps shown above she should end up with the following file: pmarinus4training.eval.gff, which will be necessary in the optimization process.
The optimization process itself requires a BASH script with embedded AWK commands. This script (Optimization_new.sh) basically allows the user to go through a user-selected range of oWF and eWF values. Geneid is run and the resulting predictions are evaluated for each combination of oWF and eWF (against the training set annotations). Before running this script the user should edit the "Optimization_new.sh" file to make sure that the paths of both geneid ($GENEID) and the evaluation ($EVAL) program are correctly indicated:
#section of Optimization_new.sh:
.
.
.
while [ $of -ne 0 ]
do
IeWF=$IeWFini
ef=`gawk "BEGIN{print ($IeWF <= $FeWF) ? 1 : 0}"`
while [ $ef -ne 0 ]
do
# update exon score in geneid.param.multi
gawk "{if (NR==27) {print 1-$IoWF+0.2, 1-$IoWF-0.1, 1-$IoWF-0.1, 1-$IoWF+0.2;} else if (NR==30) {print $IoWF, $IoWF, $IoWF, $IoWF;} \
else if (NR==36) {print $IeWF, $IeWF, $IeWF, $IeWF;} else print}" $PARAM > /tmp/tmp.$$
# make gene predictions $BIN/geneid/bin/geneid
$GENEID/geneid -GP /tmp/tmp.$$ $SEQFILE | gawk 'NR>5 {if ($2=="Sequence") print "#$"; if (substr($1,1,1)!="#") print }' - | egrep -v 'exon' - > /tmp/Predictions.$$.gff
# evaluate gene predictions
$EVAL/evaluation -sta /tmp/Predictions.$$.gff $CDSFILE | tail -2 | head -1 | \
gawk "{printf \"%6.2f %6.2f %6.2f %6.2f %6.2f %6.2f %6.2f %6.2f %6.2f %6.2f %6.2f %6.2f\\n\", $IoWF, $IeWF, \$1, \$2, \$3, \$4, \$5, \$6, \$7, \$8, \$12, \$13}"
## remove files
rm /tmp/Predictions.$$.gff
IeWF=`echo $IeWF $deWF | gawk '{print $1+$2}'`
ef=`gawk "BEGIN{print ($IeWF <= $FeWF) ? 1 : 0}"`
done
IoWF=`echo $IoWF $doWF | gawk '{print $1+$2}'`
of=`gawk "BEGIN{print ($IoWF <= $FoWF) ? 1 : 0}"`
done
.
.
.
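For reference, the relevant variables are plain shell assignments near the top of the script; the paths below are placeholders, and the positional-argument mapping is an assumption based on the command line shown further down:
# near the top of Optimization_new.sh (placeholder values; adjust to the local installation)
GENEID=/path/to/geneid/bin        # directory containing the geneid binary
EVAL=/path/to/Evaluation/bin      # directory containing the evaluation binary
PARAM=$1                          # parameter file (assumed first argument)
SEQFILE=$2                        # training multi-fasta (assumed second argument)
CDSFILE=$3                        # evaluation-ready gff (assumed third argument)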
The optimization script requires as input the newly built Perkinsus marinus parameter file (Pmarinus.param.gz (gzipped)), the multi-fasta sequence containing the annotations (pmarinus4training.fa) and the evaluation tool-compatible gff annotations contained in the fastas (properly sorted to match the order of the sequences in the fasta file): pmarinus4training.eval.gff. The actual command line instruction and output are as follows (the accuracy performance was sorted so that the bottom values should indicate the oWF/eWF values which optimize performance). It should be mentioned that this optimization process can be quite slow when dealing with large fastas, as is the case here.
#!/bin/bash
Optimization_new.sh Pmarinus.param pmarinus4training.fa pmarinus4training.eval.gff > pmarinus4training.param.optimization
###sorting of accuracy performance
sort +4n -5 +7n -8 +8nr -9 pmarinus4training.param.optimization
oWF eWF SN SP CC SNe SPe SNSP raME raWE raMG raWG
0.40 -3.00 0.95 0.12 0.26 0.77 0.10 0.44 0.07 0.87 0.00 0.87
0.40 -2.50 0.96 0.12 0.26 0.79 0.09 0.44 0.05 0.89 0.00 0.88
0.30 -2.50 0.96 0.12 0.26 0.78 0.09 0.44 0.04 0.89 0.00 0.87
0.35 -2.50 0.96 0.12 0.26 0.79 0.09 0.44 0.04 0.88 0.00 0.88
0.35 -3.00 0.95 0.12 0.27 0.77 0.10 0.44 0.07 0.87 0.00 0.86
oWF = exon factor
eWF = exon weight
SN = sensitivity nucleotide level
SP = specificity nucleotide level
CC = correlation SN/SP
SNe = sensitivity exon level
SPe = specificity exon level
SNSP = correlation SNe/Spe
raME = ratio missing exons
raWE = ratio wrong exons
raMG = ratio missing genes
raWG = ratio wrong genes
NOTE: From the output of "Optimization_new.sh" it will be obvious to the user that all accuracy estimates related to specificity are very low. The explanation for this is that the fasta files corresponding to Perkinsus marinus sequences are not fully annotated, and therefore geneid is predicting potentially real genes in regions that remain unannotated, hence lowering the apparent specificity of the predictions. Regardless, based on this optimization we select the combination of oWF/eWF values that seem to maximize performance. These values (0.35 and -3.00, respectively) are then inserted into the parameter file.
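A minimal sketch of how the selected values could be written into the parameter file, assuming the eWF/oWF lines occupy the same positions (NR 27, 30 and 36) that Optimization_new.sh edits above (the output file name is illustrative):
#!/bin/bash
oWF=0.35 ; eWF=-3.00
gawk "{if (NR==27) {print 1-$oWF+0.2, 1-$oWF-0.1, 1-$oWF-0.1, 1-$oWF+0.2;} else if (NR==30) {print $oWF, $oWF, $oWF, $oWF;} \
else if (NR==36) {print $eWF, $eWF, $eWF, $eWF;} else print}" Pmarinus.param > Pmarinus.optimized.param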
In a final step of the training process the fully optimized parameter file (pmarinus_geneid.param (gzip compressed)) is used to predict genes on the evaluation (test) set fastas (pmarinus4evaluation.fa). However, in order to minimize the low specificity caused by the lack of a complete annotation of the multi-fasta sequence, prior to evaluation we extract the annotations with 500 nt of flanking sequence from the original fasta file. In order to do so the user must run the awk script gff2gp.awk (setting the value of the line "context=" to 500) using pmarinus4evaluation.gff as the input file. The command line that should be used to obtain a golden path (gp)-compliant file is:
#!/bin/bash
gff2gp.awk pmarinus4evaluation.gff | sort > pmarinus4evaluation.gp
The user can then use the output of "gff2gp.awk", pmarinus4evaluation.gp, as the input of the perl script getgenes.pl, which can be used both to obtain a new set of fasta files with the number of flanking nucleotides previously selected (i.e. 500) and to convert the "pmarinus4evaluation.gp" file to gff format. It should be noted that "getgenes.pl" requires that another program (chromosomechunk) be in the same directory as the input file (pmarinus4evaluation.gp). It is also necessary that the user convert the multi-fasta pmarinus4evaluation.tbl into 250 separate fastas using the awk script TblToFastaFile and place them into a separate directory (i.e. ./fastas4evaluation/). Finally, "getgenes.pl" outputs the sequences in tabular format, which can be converted to fasta with TblToFasta.
The following wrapper BASH script lists the steps described above:
#!/bin/bash
## Convert the gff to gp (golden path)
## 500 flanking nucleotides
gff2gp.awk pmarinus4evaluation.gff | sort > pmarinus4evaluation.gp
##fix to ensure that getgenes.pl does not crash when flanking sequence extends past end of sequence
gawk '{print $1,$9}' pmarinus4evaluation.gff | sort | uniq > locus_id_evaluation
gawk '{print $1,length($2)}' pmarinus4evaluation.tbl | sort > all_fastas_length
join all_fastas_length locus_id_evaluation | gawk '{print $3,$1,$2}' > evaluation_contigs_length
join evaluation_contigs_length pmarinus4evaluation.gp > pmarinus4evaluation_mod.gp
gawk 'BEGIN{OFS=" ";FS=" ";}{if ($3<=$7){coord=1;tot=split($12,array,",");array[tot-1]=$3;s="";\
for(coord=1;coord<=tot-1;coord++){s=s""array[coord]","};print $1,$2,$5,$6,$3,$8,$9,$10,$11,s;}else{print $1,$2,$5,$6,$7,$8,$9,$10,$11,$12 }}' \
pmarinus4evaluation_mod.gp > pmarinus4evaluation.gp
### extract fastas from multi-fasta file and place them in a directory
mkdir -p ./fastas4evaluation/; cd ./fastas4evaluation/
TblToFastaFile ../pmarinus4evaluation.tbl
cd ..
###extract sequence fastas + flanking sequence (make sure "chromosomechunk" is in the pmarinus4evaluation.gp directory)
getgenes.pl ./fastas4evaluation/ pmarinus4evaluation.gp | \
gawk 'BEGIN{b="x"}{if ($1!=b){printf "\n%s ",$1}printf "%s",$6;b=$1}END{printf "\n"}' \
| sort | sed '/^$/d' - | gawk '{print $1, toupper($2)}' - > pmarinus4evaluation.gp.tbl
TblToFasta pmarinus4evaluation.gp.tbl > pmarinus4evaluation.gp.fa
mkdir -p ./fastas_gp4evaluation; cd ./fastas_gp4evaluation;
TblToFastaFile pmarinus4evaluation.gp.tbl
cd ..
###obtain gff annotation corresponding to extracted fastas
getgenes.pl ./fastas4evaluation/ pmarinus4evaluation.gp | \
gawk 'BEGIN{OFS="\t";pos=1;b="x"}{if ($1!=b){pos=1}; print $1,"P.marinus",$3,pos,pos+$5-1,".","+",".",$1$2; pos+=$5;b=$1 }' \
| egrep -v "(Intron|Utr)" - > <b>pmarinus4evaluation.gp.gff</b>
Once the user has obtained the gff (pmarinus4evaluation.gp.gff) and fasta (pmarinus4evaluation.gp.fa) output files she should proceed to set up the gff file for evaluation (pmarinus4evaluation.gp.eval.gff; shown below and in the "optimization" section above), and then evaluate the accuracy of the geneid predictions on the test sequences using the newly developed and optimized parameter file (Pmarinus.param.gz (gzipped)) and the evaluation tool (Evaluation.tgz (tar-gzipped version)). Setting up the gff for the evaluation process requires some processing of the input files using a combination of linux and awk command line instructions as well as the awk script gff2cds. The command line instructions necessary in setting up and running the evaluation process are shown in the BASH script below:
#!/bin/bash
###obtain list of training set fastas containing annotated genes with their respective length
gawk '{print $1,length($2)}' pmarinus4evaluation.gp.tbl | sort | uniq \
> cds_tmp_gp4evaluation
###convert gff-format gene annotations into "cds" format using gff2cds tool
gff2cds source="P.marinus" pmarinus4evaluation.gp.gff \
| sort | join cds_tmp_gp4evaluation - > pmarinus4evaluation.gp.cds
### set up gff file for compatibility with evaluation tool
gawk 'BEGIN{while (getline< ARGV[1]>0){len[$1]=$2;};ARGV[1]="";OFS="\t";} \
{if (NR==1) {ant=$1;print $1,$2,"Sequence",1,len[ant],".",".",".","."}; \
if ($1!=ant) {print "#$";ant=$1;print $1,$2,"Sequence",1,len[ant],".",".",\
".","."}; print }' pmarinus4evaluation.gp.cds pmarinus4evaluation.gp.gff > pmarinus4evaluation.gp.eval.gff
##run geneid on the 250 test fastas and evaluate based on previous optimization on the training set (oWF=0.35 eWF=-3.00)
geneid -vGP Pmarinus.param pmarinus4evaluation.gp.fa \
| gawk 'NR>5 {OFS="\t";if ($3=="Gene") print "#$"; $2="geneid_Pmarinus"; \
if (substr($1,1,1)!="#") print }' - | egrep -v 'exon' - \
> pmarinus.geneid.eval.gff
evaluation -sta pmarinus.geneid.eval.gff pmarinus4evaluation.gp.eval.gff | gawk '{OFS=" & ";$1=$1;print}' \
> evaluation.gp.geneid.pmarinus
SN SP CC SNe SPe SNSP raME raWE SNg SPg SNSPg raMG raWG raJG raSG
0.92 0.90 0.84 0.72 0.71 0.72 0.10 0.12 0.38 0.28 0.33 0.00 0.27 1.00 1.01
SN = sensitivity nucleotide level
SP = specificity nucleotide level
CC = correlation SN/SP
SNe = sensitivity exon level
SPe = specificity exon level
SNSP = correlation SNe/Spe
raME = ratio missing exons
raWE = ratio wrong exons
SNg = sensitivity gene level
SPg = specificity gene level
SNSPg = correlation SNg/SPg
raMG = ratio missing genes
raWG = ratio wrong genes
raJG = ratio joined genes
raSG = ratio split genes
At the end of the training/evaluation protocol the user should have a geneid parameter file which is reasonably well optimized for the species of interest. The average prediction performance observed for P. marinus is fairly typical for a geneid parameter file (80-90% of nucleotides, 70-80% of exons and 30-40% of genes accurately predicted) for a species of this complexity (evaluation.gp.geneid.pmarinus).
NOTE: There will also be instances in which the user will only have access to a limited number of gene models and will therefore not be able to set aside sequences for testing the newly developed parameter file. In these cases, to reduce the bias that would arise from testing accuracy on the same sequences as those in the training set, it is recommended that the user apply a "leave-one-out" testing strategy: one sequence at a time is excluded from the training process and the accuracy is subsequently determined on the excluded sequence. This process is then repeated for every sequence in the set.
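A skeleton of such a loop is sketched below; "retrain_geneid.sh" and "evaluate_single.sh" are hypothetical wrappers standing in for the training and evaluation steps described throughout this tutorial:
#!/bin/bash
# leave-one-out skeleton (illustrative only)
while read genomic_id gene_id
do
# exclude the current gene model from the training list
grep -v -w "$gene_id" locus_id_training > /tmp/loo_training.$$
# retrain geneid without it (hypothetical wrapper around the training steps above)
./retrain_geneid.sh /tmp/loo_training.$$ /tmp/loo.param.$$
# predict and evaluate on the excluded sequence only (hypothetical wrapper)
./evaluate_single.sh /tmp/loo.param.$$ $genomic_id $gene_id
done < locus_id_training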
If the user requires help with any aspect of the geneid training process, or with testing the accuracy of any newly developed geneid parameter file (for example by using this bootstrapping leave-one-out strategy), she should contact us for assistance.
If you encounter problems using geneid, or have suggestions on how to improve it, send an e-mail to
[email protected]
- E. Blanco, G. Parra and R. Guigo,
"Using geneid to Identify Genes.",
In A. Baxevanis, editor:
Current Protocols in Bioinformatics. Unit 4.3.
John Wiley & Sons Inc., New York (2002) (in press)
- E. Blanco, G. Parra, S. Castellano, J.F. Abril,
M. Burset, X. Fustero, X. Messeguer and R. Guigó
"Gene Prediction in the Post-Genomic Era."
IXth ISMB (Poster), Copenhagen, Denmark (2001)
- G. Parra, E. Blanco, and R. Guigo,
"Geneid in Drosophila",
Genome Research 10(4):511-515 (2000).
- R. Guigo,
"Assembling genes from predicted exons in linear time with dynamic programming",
Journal of Computational Biology, 5:681-702 (1998).
- M. Burset and R. Guigo
"Evaluation of gene structure prediction programs",
Genomics, 34(3):353-367 (1996).
- R. Guigo, S. Knudsen, N. Drake, and T. F. Smith,
"Prediction of gene structure",
Journal of Molecular Biology, 226:141-157 (1992).
The current version of geneid has been written by
Enrique Blanco,
Tyler Alioto and
Roderic Guigó.
The parameter files have been constructed by
Genis Parra,
Tyler Alioto
and Francisco Camara.
With contributions from Josep F. Abril, Moises Burset and Xavier Messeguer.
This training tutorial document was prepared by: Francisco Camara.
Copyright © 2002
geneid is under GNU General Public License.