Skip to content

Day 2: Genome Annotation Practical

evelien edited this page Nov 19, 2021 · 29 revisions

Phage annotation

For this practical session, you will need the complete genome is it was in the last step of the Day 1 tutorial, meaning 1 contig that is the complete genome, preferably in its correct orientation if it has defined ends.

🎥 YouTube screencast

1️⃣ ORF calling

The first step in finding genes in genomes, is finding the Open Reading Frames (ORFs). There is currently no consensus on the definition of an ORF which can cause confusion. For the purposes of this training, an ORF is defined as a sequence of DNA in a single frame between a start codon and a stop codon. This means that each stretch of DNA can contain many different ORFs, many of which are nested within larger ORFs.

Optional exercise

Take the first 50,000 bp of your phage genome and use the NCBI tool ORFfinder to delineate all possible ORFs in your genome.
Use phage-appropriate parameters for the ORF length, genetic code and start codons.
What happens when you change the parameters?
Do the ORFs predicted here meet our definition of an ORF?

2️⃣ Assigning the Coding Sequences (CDSs)

A protein coding sequence is that section of the ORF delineated by a start codon and a stop codon. But for the CDS to encode actual proteins, there needs to be a bacterial ribosome-binding site (Shine Dalgarno sequence) upstream from the start codon. Look for the consensus sequence AGGAGG, or 4-mer or 5-mer permutations of this sequence, predicted in the majority of bacteriophages (Krishnamurthy & Wang, 2018).

Most annotation tools will combine the ORF calling and CDS calling in one tool, where the software makes the decisions for you.

Annotation with Prokka

We will use the annotation tool Prokka which in our experience works well for the majority of phages in terms of finding the CDSs.

In our VMs, we have pre-installed the programme. To test if the installation went well and to display the help text, type in the terminal.

prokka
# run the setup database command to make sure it is functioning well
prokka --setupdb

There are quite a few options to choose from when running prokka, but not all will be useful at this stage when having a first look. Replace <phage-code> with the name of your phage and <genome-file.fasta> with the name of the file containing the phage genome.

prokka --kingdom Viruses --gcode 11 --locustag <phage-code> --prefix <phage-code> \
 --outdir prokka_<phage-code> <genome-file.fasta>

⚠️ the locustag needs to be registered with EMBL or NCBI before submitting the genome sequence. If in doubt, use a placeholder that we can swap out at the end.

We'll now check the output files generated by prokka, by typing ls followed by the name of the directory that was created in the previous step.

# I named my directory prokka_T4 so that's the example I will use here.
ls prokka_T4

This list will contain a lot of different files. If you are familiar with file extensions, you will know what is in most of them. For example, the .faa file will contain all predicted proteins and their amino acid sequences in multifasta format, and the .gbk file will contain the same information in GenBank format. We can use the less command to view what's in the file, or the head command to view the first 10 lines.

less <PROKKA_file.gbk>
LOCUS       AF158101.6            168903 bp    DNA     linear       26-OCT-2021
DEFINITION  Genus species strain strain.
ACCESSION   
VERSION
KEYWORDS    .
SOURCE      Genus species
  ORGANISM  Genus species
            Unclassified.
COMMENT     Annotated using prokka 1.14.6 from
            https://github.com/tseemann/prokka.
FEATURES             Location/Qualifiers
     source          1..168903
                     /organism="Genus species"
                     /mol_type="genomic DNA"
                     /strain="strain"
     CDS             complement(12..2189)
                     /gene="rIIA"
                     /locus_tag="DIDAOBOA_00001"
                     /inference="ab initio prediction:Prodigal:002006"
                     /inference="similar to AA sequence:UniProtKB:P03690"
                     /codon_start=1
                     /transl_table=11
                     /product="Protein rIIA"
                     /translation="MIITTEKETILGNGSKSKAFSITASPKVFKILSSDLYTNKIRAV
                     VRELITNMIDAHALNGNPEKFIIQVPGRLDPRFVCRDFGPGMSDFDIQGDDNSPGLYN
                     SYFSSSKAESNDFIGGFGLGSKSPFSYTDTFSITSYHKGEIRGYVAYMDGDGPQIKPT
                     FVKEMGPDDKTGIEIVVPVEEKDFRNFAYEVSYIMRPFKDLAIINGLDREIDYFPDFD
                     DYYGVNPERYWPDRGGLYAIYGGIVYPIDGVIRDRNWLSIRNEVNYIKFPMGSLDIAP
                     SREALSLDDRTRKNIIERVKELSEKAFNEDVKRFKESTSPRHTYRELMKMGYSARDYM
                     ISNSVKFTTKNLSYKKMQSMFEPDSKLCNAGVVYEVNLDPRLKRIKQSHETSAVASSY
                     RLFGINTTKINIVIDNIKNRVNIVRGLARALDDSEFNNTLNIHHNERLLFINPEVESQ
                     IDLLPDIMAMFESDEVNIHYLSEIEALVKSYIPKVVKSKAPRPKAATAFKFEIKDGRW
                     EKRNYLRLTSEADEITGYVAYMHRSDIFSMDGTTSLCHPSMNILIRMANLIGINEFYV
                     IRPLLQKKVKELGQCQCIFEALRDLYVDAFDDVDYDKYVGYSSSAKRYIDKIIKYPEL
                     DFMMKYFSIDEVSEEYTRLANMVSSLQGVYFNGGKDTIGHDIWTVTNLFDVLSNNASK
                     NSDKMVAEFTKKFRIVSDFIGYRNSLSDDEVSQIAKTMKALAA"
...

3️⃣ Visualisation of annotation and manual curation

Here's where phage annotation truly gets tricky. No annotation tool accurately predicts all CDSs for all phages. Without doing any laboratory experiments, having a good look at your data will help find most of the "missing" CDSs.

File transfers

The first step is downloading your annotated phage genome to your computer to have a look.

Explanation on how to use scp can be found in this blog post

Scp Mac/Linux

The easiest way is to open a new tab or window in the terminal and use scp.

# use cd to navigate to the folder where you want to download your file(s).
scp <climb-address>:<path-to-file-on-climb> .
On Windows

We suggest you use the software WinScp to download files, guide documents can be found here.

Cyberduck

🎥 youtube tutorial by Andrea

Visualisation with UGene

I (Evelien) have started to use UniPro UGene as my main visualisation and data exploration software. I am in no way affiliated with this programme, but do enjoy using it.

Open the GenBank file

You may want to make a project specific for this workshop. Then open the .gbk file created by prokka. This will automatically open 3 tracks in the UGene interface: the predicted features at the top, then a zoomed in track of the nucleotide sequence and the six translation frames with any annotations, and a list with features.

Checking the orientation of the genome

If you haven't checked the orientation of the genome in a previous step, you can have a quick look with UGene. If there's a reference genome that has nucleotide similarity, you can use the dotplot feature. If you know what the first CDSs is and on what strand, you can also check those. Sometimes, opening the circularised genome has been an arbitrary decision taken by scientists a couple of decades ago, sometimes it's up to you.

Here, we'll use the example of T4. The 2003 publication by Miller and colleagues describes:

Genes are listed sequentially as they appear in the GenBank file (accession no. AF158101), clockwise on the circular map (by convention) starting with the first base 5' of rIIB.

Looking at TABLE 1 of that publication, we can see that rIIB is encoded on the complementary strand.

Now, we have to find the rIIB gene, make sure that it in the reverse position (from right to left), find the start codon, take one base upstream and reopen the genome at that position. Once the correct coordinate is found, the genome can be reoriented using the seqfu suite of tools.

# first I make the reverse complement, if it is necessary
seqfu rc T4.fasta > T4_rc.fasta

# then I am using the coordinate determined earlier to restart
seqfu rotate -i <integer> T4_rc.fasta > T4_rcr.fasta

If you've had to perform these actions, you will have to rerun prokka to proceed to the next step.

IMPORTANT: This method of reorienting the genome assumes that the assembly is otherwise correct and circular, i.e. no assembler-based repeats at the ends of the original contig and no missing sections. Use read mapping and - if necessary - PCR for verification.

4️⃣ : Improving the functional annotations with Prokka

Because we know we need to rerun prokka after restarting the genome, we might as well have a look at the functional annotations at the same time, i.e. the /gene and /product qualifiers of the GenBank file. We can get an easy overview by looking at the .tsv file generated by prokka. This can be opened in any type of text editor or by using the less command.
Be mindful when opening in Excel which is notorious for changing gene names into dates without the possibility of changing it back.

For phages and without using a phage-specific reference database, prokka will likely output the vast majority of CDSs as "hypothetical protein". This is the default for when no confident predictions can be made.

One method to increase the automated functional assignments, is to use a specific/custom database when running prokka. This database can consist of one single well-annotated reference genome, the whole phage catalog of NCBI or a set of well-curated protein annotations such as PHROGS.

We will now use the .gbk file of T4 itself to increase the functional annotations of our own T4 genome, using the --proteins function. Y

⚠️ you will need to find the T4 .gbk file from NCBI and upload it into your VM yourself!

prokka --kingdom Viruses --locustag T4-QIB --prefix T4-QIB \
--proteins references/T4.gb \
--outdir prokka_T4_db T4i_rcr.fasta

Prokka also cleans up a lot of the product names. You can check the .log file to see which /product names were modified. If you'd like to keep the unmodified names, use the --rawproduct flag.

In general, keep the annotations simple. If you have additional information on a gene, you can use the /note qualifier. Check whether all products comply with what you learned during the lecture.

Checking the CDS annotations

⚠️ Download the latest .gbk file you made to your computer and open it in UGene as shown before.

Take your time to scroll through the gene annotations in UGene. Zoom in to the predicted CDS so you can look for gaps. Verify that there aren't missing CDSs, by looking for unannotated stretches of sequence.
Make a note of locations where regulatory features may be present, e.g in the 5' end of a genome with defined ends, or in the intergenic region between CDSs on opposite strands.

The changes you may make to CDS annotations can include:

  • extending the CDS by choosing a different initiation codon
  • reducing the size of the CDS by choosing a different initiation codon
  • adding (a) small CDS(s) in an unannotated region

5️⃣ Adding additional functional information

We can update the predictions manually by incorporating data from BLAST and from HMM searches. There are a couple of different ways of performing these searches:

  1. Using the BLAST interface of NCBI. Make sure to limit your search to viruses (taxid:10239) or Caudoviricetes (taxid:2731619) for tailed phages, to reduce computational demands. The entire nucleotide or protein database can be used in cases where no homology is found, but might not yield a result.
  2. Using HHPred or HHBlits from the University of Tübingen Bioinformatics Toolkit
  3. The STEP3 virion prediction tool from Monash University

We can also add other information on features to the CDS annotations, such as prediction of transmembrane domains with Phobius, TMHMM or signal peptides with SignalP.

All changes can be made in UGene using the add button or the edit buttons to add new qualifiers, features or change the names.

6️⃣ Regulatory elements

Regulatory elements do not have to be annotated in phage genomes for them to be deposited as complete genomes, but it is good practice to do so.

Terminators

We will use the programme ARNold to predict stem-loop structures that could represent rho-independent terminators. All you need to do is paste the nucleotide sequence of your phage genome in the webserver and click Run.

The output is a simple text file with the stem-loop structure indicated in blue and red. You may want to narrow down the search results to

  • motifs found with both programmes
  • motifs with a free energy lower than -10 kcal/mol
  • motifs containing a poly-T tail
  • motifs in intergenic regions
# Example output for phage T4
   3537  Rnamotif + CTTGAACAATAGCATGCTCATCATATTGGCGTGCaTATTTCTTAAAA -8.90
   4319  Rnamotif - TTTAAATAAAAGGCCTTCGGGCCTTTAGCTTTATG -7.30
   4320  Rnamotif + ATAAAGCTAAAGGCCCGAAGGCCTTTTATTTAAAA -6.40
   4931  Rnamotif + ATTTCTTTTCTTCGCCTTCGCGAAGGCGATATTTCTTACCG -9.40
   7093      Both + GTATTTGACATTGTCTGGAAATCAATTTCAGATGgTTTTTTCATTTG -5.50
  14821      Both - AATAAATTTAAGGACTCCTTCGGGAGTCCTTTTTCATTTAA -13.00
  14822      Both + TAAATGAAAAAGGACTCCCGAAGGAGTCCTTAAATTTATTCA NA
  15775  Rnamotif - TTTTATTACCTTTGATGGCATTAACAAGTGCCATTAATTTTTGTACCAT -8.40
  15831  Rnamotif + AGCTTTTACTATTTCTGCTATGGTAGGAATTTTACTCTGGA -4.80
  22200      Both - TTATCAGCGTGTAGGGGCAGTTTCTGCCCTTGaTATGTTTCCGAC -10.90
  23939      Both - GAATAAATCTAGGGACCTCCGGGTCCCTTTTTCACACAA -12.80
  27598  Rnamotif + AATCAGCAGCAGATTGGACAGCTGTCCAATCTTTTGAATACCA -11.30
  33712  Rnamotif - TAGGCATTTAGCTGGATATCATGGTTCAGTTTTATCTAAAA -4.70

Try it out for yourself. You can copy-paste the stem-loop sequence into the search field of UGene and use the Add annotation feature to include terminators.

As far as we know, there are no tools looking for phage rho-dependent terminators.

Promoters

Promoters are tricky, because there can be multiple different promoter sequence motifs within one phage, e.g. early, middle & late, or phage-specific promoters recognised by the phage RNA polymerase (if it encodes one). If the bacterial promoter sequences are well characterised, you can use these. Otherwise, a pattern recognition search upstream of each CDSs works best to find new promoters.

First, we'll need to extract the upstream sequences. We will use the script created by Andrea.

getUpstreamFromGbk -i prokka_T4/PROKKA_11152021.gbk  -o upstream_T4 -l 100 

Then we can use the output file to search for patterns using MEME. If all goes well, you will be able to find conserved ribosome binding site motifs and at least one putative promoter sequence. As with the terminators, you will now need to verify whether these promoters are logically situated given the location of the CDSs.

7️⃣ Tidying up your GenBank file

Adding source modifiers

A very important item on your to-do list for phage genomics, should be providing the correct metadata for each genome. This can be done by providing the necessary source modifiers in your GenBank file, or at the time of submission. A detailed overview of all the possible information to use and the correct qualifiers, can be found on the INSDC website.

Removing unwanted information from GenBank files

Prokka and other tools often add additional qualifiers to each of the features (genes, CDSs, tRNAs etc.) that are identified. There's a debate on whether that's "polluting the GenBank file" or being thorough in reporting information. We're not getting involved, as long as you don't add the additional information in the /product or /gene descriptions.

Any of the qualifiers can be edited or removed manually in either UGene or a text editor. Here, we'll show a way to systematically delete a set of qualifiers using grep on the command line.

# we want to get rid of all the lines starting with /inference
grep -v "/inference" PROKKA_T4.gbk > clean.gbk

8️⃣ Submission to INSDC

At this point, we're suggesting you check out the guidance by NCBI in the The GenBank Submissions Handbook. We hope to update this with more practical tips and tricks soon.