-
Notifications
You must be signed in to change notification settings - Fork 2
Day 2: Genome Annotation Practical
For this practical session, you will need the complete genome is it was in the last step of the Day 1 tutorial, meaning 1 contig that is the complete genome, preferably in its correct orientation if it has defined ends.
The first step in finding genes in genomes, is finding the Open Reading Frames (ORFs). There is currently no consensus on the definition of an ORF which can cause confusion. For the purposes of this training, an ORF is defined as a sequence of DNA in a single frame between a start codon and a stop codon. This means that each stretch of DNA can contain many different ORFs, many of which are nested within larger ORFs.
Take the first 50,000 bp of your phage genome and use the NCBI tool ORFfinder to delineate all possible ORFs in your genome.
Use phage-appropriate parameters for the ORF length, genetic code and start codons.
What happens when you change the parameters?
Do the ORFs predicted here meet our definition of an ORF?
A protein coding sequence is that section of the ORF delineated by a start codon and a stop codon. But for the CDS to encode actual proteins, there needs to be a bacterial ribosome-binding site (Shine Dalgarno sequence) upstream from the start codon. Look for the consensus sequence AGGAGG
, or 4-mer or 5-mer permutations of this sequence, predicted in the majority of bacteriophages (Krishnamurthy & Wang, 2018).
Most annotation tools will combine the ORF calling and CDS calling in one tool, where the software makes the decisions for you.
We will use the annotation tool Prokka which in our experience works well for the majority of phages in terms of finding the CDSs.
In our VMs, we have pre-installed the programme. To test if the installation went well and to display the help text, type in the terminal.
prokka
# run the setup database command to make sure it is functioning well
prokka --setupdb
There are quite a few options to choose from when running prokka, but not all will be useful at this stage when having a first look. Replace <phage-code>
with the name of your phage and <genome-file.fasta>
with the name of the file containing the phage genome.
prokka --kingdom Viruses --gcode 11 --locustag <phage-code> --prefix <phage-code> \
--outdir prokka_<phage-code> <genome-file.fasta>
We'll now check the output files generated by prokka, by typing ls
followed by the name of the directory that was created in the previous step.
# I named my directory prokka_T4 so that's the example I will use here.
ls prokka_T4
This list will contain a lot of different files. If you are familiar with file extensions, you will know what is in most of them. For example, the .faa
file will contain all predicted proteins and their amino acid sequences in multifasta format, and the .gbk
file will contain the same information in GenBank format. We can use the less
command to view what's in the file, or the head
command to view the first 10 lines.
less <PROKKA_file.gbk>
LOCUS AF158101.6 168903 bp DNA linear 26-OCT-2021
DEFINITION Genus species strain strain.
ACCESSION
VERSION
KEYWORDS .
SOURCE Genus species
ORGANISM Genus species
Unclassified.
COMMENT Annotated using prokka 1.14.6 from
https://github.com/tseemann/prokka.
FEATURES Location/Qualifiers
source 1..168903
/organism="Genus species"
/mol_type="genomic DNA"
/strain="strain"
CDS complement(12..2189)
/gene="rIIA"
/locus_tag="DIDAOBOA_00001"
/inference="ab initio prediction:Prodigal:002006"
/inference="similar to AA sequence:UniProtKB:P03690"
/codon_start=1
/transl_table=11
/product="Protein rIIA"
/translation="MIITTEKETILGNGSKSKAFSITASPKVFKILSSDLYTNKIRAV
VRELITNMIDAHALNGNPEKFIIQVPGRLDPRFVCRDFGPGMSDFDIQGDDNSPGLYN
SYFSSSKAESNDFIGGFGLGSKSPFSYTDTFSITSYHKGEIRGYVAYMDGDGPQIKPT
FVKEMGPDDKTGIEIVVPVEEKDFRNFAYEVSYIMRPFKDLAIINGLDREIDYFPDFD
DYYGVNPERYWPDRGGLYAIYGGIVYPIDGVIRDRNWLSIRNEVNYIKFPMGSLDIAP
SREALSLDDRTRKNIIERVKELSEKAFNEDVKRFKESTSPRHTYRELMKMGYSARDYM
ISNSVKFTTKNLSYKKMQSMFEPDSKLCNAGVVYEVNLDPRLKRIKQSHETSAVASSY
RLFGINTTKINIVIDNIKNRVNIVRGLARALDDSEFNNTLNIHHNERLLFINPEVESQ
IDLLPDIMAMFESDEVNIHYLSEIEALVKSYIPKVVKSKAPRPKAATAFKFEIKDGRW
EKRNYLRLTSEADEITGYVAYMHRSDIFSMDGTTSLCHPSMNILIRMANLIGINEFYV
IRPLLQKKVKELGQCQCIFEALRDLYVDAFDDVDYDKYVGYSSSAKRYIDKIIKYPEL
DFMMKYFSIDEVSEEYTRLANMVSSLQGVYFNGGKDTIGHDIWTVTNLFDVLSNNASK
NSDKMVAEFTKKFRIVSDFIGYRNSLSDDEVSQIAKTMKALAA"
...
Here's where phage annotation truly gets tricky. No annotation tool accurately predicts all CDSs for all phages. Without doing any laboratory experiments, having a good look at your data will help find most of the "missing" CDSs.
The first step is downloading your annotated phage genome to your computer to have a look.
Explanation on how to use scp can be found in this blog post
The easiest way is to open a new tab or window in the terminal and use scp.
# use cd to navigate to the folder where you want to download your file(s).
scp <climb-address>:<path-to-file-on-climb> .
We suggest you use the software WinScp to download files, guide documents can be found here.
I (Evelien) have started to use UniPro UGene as my main visualisation and data exploration software. I am in no way affiliated with this programme, but do enjoy using it.
You may want to make a project specific for this workshop. Then open the .gbk
file created by prokka. This will automatically open 3 tracks in the UGene interface: the predicted features at the top, then a zoomed in track of the nucleotide sequence and the six translation frames with any annotations, and a list with features.
If you haven't checked the orientation of the genome in a previous step, you can have a quick look with UGene. If there's a reference genome that has nucleotide similarity, you can use the dotplot feature. If you know what the first CDSs is and on what strand, you can also check those. Sometimes, opening the circularised genome has been an arbitrary decision taken by scientists a couple of decades ago, sometimes it's up to you.
Here, we'll use the example of T4. The 2003 publication by Miller and colleagues describes:
Genes are listed sequentially as they appear in the GenBank file (accession no. AF158101), clockwise on the circular map (by convention) starting with the first base 5' of rIIB.
Looking at TABLE 1 of that publication, we can see that rIIB is encoded on the complementary strand.
Now, we have to find the rIIB gene, make sure that it in the reverse position (from right to left), find the start codon, take one base upstream and reopen the genome at that position. Once the correct coordinate is found, the genome can be reoriented using the seqfu
suite of tools.
# first I make the reverse complement, if it is necessary
seqfu rc T4.fasta > T4_rc.fasta
# then I am using the coordinate determined earlier to restart
seqfu rotate -i <integer> T4_rc.fasta > T4_rcr.fasta
If you've had to perform these actions, you will have to rerun prokka to proceed to the next step.
IMPORTANT: This method of reorienting the genome assumes that the assembly is otherwise correct and circular, i.e. no assembler-based repeats at the ends of the original contig and no missing sections. Use read mapping and - if necessary - PCR for verification.
Because we know we need to rerun prokka after restarting the genome, we might as well have a look at the functional annotations at the same time, i.e. the /gene and /product qualifiers of the GenBank file.
We can get an easy overview by looking at the .tsv
file generated by prokka. This can be opened in any type of text editor or by using the less
command.
Be mindful when opening in Excel which is notorious for changing gene names into dates without the possibility of changing it back.
For phages and without using a phage-specific reference database, prokka will likely output the vast majority of CDSs as "hypothetical protein". This is the default for when no confident predictions can be made.
One method to increase the automated functional assignments, is to use a specific/custom database when running prokka. This database can consist of one single well-annotated reference genome, the whole phage catalog of NCBI or a set of well-curated protein annotations such as PHROGS.
We will now use the .gbk
file of T4 itself to increase the functional annotations of our own T4 genome, using the --proteins
function. Y
.gbk
file from NCBI and upload it into your VM yourself!
prokka --kingdom Viruses --locustag T4-QIB --prefix T4-QIB \
--proteins references/T4.gb \
--outdir prokka_T4_db T4i_rcr.fasta
Prokka also cleans up a lot of the product names. You can check the .log
file to see which /product
names were modified. If you'd like to keep the unmodified names, use the --rawproduct
flag.
In general, keep the annotations simple. If you have additional information on a gene, you can use the /note
qualifier. Check whether all products comply with what you learned during the lecture.
.gbk
file you made to your computer and open it in UGene as shown before.
Take your time to scroll through the gene annotations in UGene. Zoom in to the predicted CDS so you can look for gaps. Verify that there aren't missing CDSs, by looking for unannotated stretches of sequence.
Make a note of locations where regulatory features may be present, e.g in the 5' end of a genome with defined ends, or in the intergenic region between CDSs on opposite strands.
The changes you may make to CDS annotations can include:
- extending the CDS by choosing a different initiation codon
- reducing the size of the CDS by choosing a different initiation codon
- adding (a) small CDS(s) in an unannotated region
We can update the predictions manually by incorporating data from BLAST and from HMM searches. There are a couple of different ways of performing these searches:
- Using the BLAST interface of NCBI. Make sure to limit your search to
viruses (taxid:10239)
orCaudoviricetes (taxid:2731619)
for tailed phages, to reduce computational demands. The entire nucleotide or protein database can be used in cases where no homology is found, but might not yield a result. - Using HHPred or HHBlits from the University of Tübingen Bioinformatics Toolkit
- The STEP3 virion prediction tool from Monash University
We can also add other information on features to the CDS annotations, such as prediction of transmembrane domains with Phobius, TMHMM or signal peptides with SignalP.
All changes can be made in UGene using the add
button or the edit
buttons to add new qualifiers, features or change the names.
Regulatory elements do not have to be annotated in phage genomes for them to be deposited as complete genomes, but it is good practice to do so.
We will use the programme ARNold to predict stem-loop structures that could represent rho-independent terminators. All you need to do is paste the nucleotide sequence of your phage genome in the webserver and click Run
.
The output is a simple text file with the stem-loop structure indicated in blue and red. You may want to narrow down the search results to
- motifs found with both programmes
- motifs with a free energy lower than -10 kcal/mol
- motifs containing a poly-T tail
- motifs in intergenic regions
# Example output for phage T4
3537 Rnamotif + CTTGAACAATAGCATGCTCATCATATTGGCGTGCaTATTTCTTAAAA -8.90
4319 Rnamotif - TTTAAATAAAAGGCCTTCGGGCCTTTAGCTTTATG -7.30
4320 Rnamotif + ATAAAGCTAAAGGCCCGAAGGCCTTTTATTTAAAA -6.40
4931 Rnamotif + ATTTCTTTTCTTCGCCTTCGCGAAGGCGATATTTCTTACCG -9.40
7093 Both + GTATTTGACATTGTCTGGAAATCAATTTCAGATGgTTTTTTCATTTG -5.50
14821 Both - AATAAATTTAAGGACTCCTTCGGGAGTCCTTTTTCATTTAA -13.00
14822 Both + TAAATGAAAAAGGACTCCCGAAGGAGTCCTTAAATTTATTCA NA
15775 Rnamotif - TTTTATTACCTTTGATGGCATTAACAAGTGCCATTAATTTTTGTACCAT -8.40
15831 Rnamotif + AGCTTTTACTATTTCTGCTATGGTAGGAATTTTACTCTGGA -4.80
22200 Both - TTATCAGCGTGTAGGGGCAGTTTCTGCCCTTGaTATGTTTCCGAC -10.90
23939 Both - GAATAAATCTAGGGACCTCCGGGTCCCTTTTTCACACAA -12.80
27598 Rnamotif + AATCAGCAGCAGATTGGACAGCTGTCCAATCTTTTGAATACCA -11.30
33712 Rnamotif - TAGGCATTTAGCTGGATATCATGGTTCAGTTTTATCTAAAA -4.70
Try it out for yourself. You can copy-paste the stem-loop sequence into the search field of UGene and use the Add annotation
feature to include terminators.
As far as we know, there are no tools looking for phage rho-dependent terminators.
Promoters are tricky, because there can be multiple different promoter sequence motifs within one phage, e.g. early, middle & late, or phage-specific promoters recognised by the phage RNA polymerase (if it encodes one). If the bacterial promoter sequences are well characterised, you can use these. Otherwise, a pattern recognition search upstream of each CDSs works best to find new promoters.
First, we'll need to extract the upstream sequences. We will use the script created by Andrea.
getUpstreamFromGbk -i prokka_T4/PROKKA_11152021.gbk -o upstream_T4 -l 100
Then we can use the output file to search for patterns using MEME. If all goes well, you will be able to find conserved ribosome binding site motifs and at least one putative promoter sequence. As with the terminators, you will now need to verify whether these promoters are logically situated given the location of the CDSs.
A very important item on your to-do list for phage genomics, should be providing the correct metadata for each genome. This can be done by providing the necessary source modifiers in your GenBank file, or at the time of submission. A detailed overview of all the possible information to use and the correct qualifiers, can be found on the INSDC website.
Prokka and other tools often add additional qualifiers to each of the features (genes, CDSs, tRNAs etc.) that are identified. There's a debate on whether that's "polluting the GenBank file" or being thorough in reporting information. We're not getting involved, as long as you don't add the additional information in the /product
or /gene
descriptions.
Any of the qualifiers can be edited or removed manually in either UGene or a text editor. Here, we'll show a way to systematically delete a set of qualifiers using grep
on the command line.
# we want to get rid of all the lines starting with /inference
grep -v "/inference" PROKKA_T4.gbk > clean.gbk
At this point, we're suggesting you check out the guidance by NCBI in the The GenBank Submissions Handbook. We hope to update this with more practical tips and tricks soon.
Phage Annotation Workshop 2021