Day 3: Classification and Taxonomy Practical
If you only have a few genomes and are most comfortable with web-based searches, we recommend NCBI Virus.
You can search for relatives with the Search by sequence function, using either the whole genome (nucleotide search) or a single protein. The underlying algorithm is the same BLAST you have probably used before.
💡 Try searching with the genome sequence first. If this yields results with both high % coverage and % identity values, then you have close relatives. If not, you can use searches with proteins to find more distant relatives.
My search using the restarted T7 genome yields 499 results. The first thing to do is to sort them by % coverage (click on the column header) so that the database sequences sharing the highest fraction of their genome with the search query appear first. The top "hits" are now all isolates belonging to the same species as T7.
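If you prefer the command line, the same ranking can be sketched on a tab-separated hit table. The file name `hits.tsv`, the hit names, and the column layout (subject ID, % identity, % coverage) are assumptions for illustration and do not reflect the NCBI Virus export format.

```shell
# Rank hits by % coverage (column 3), then by % identity (column 2), both descending.
# Assumed layout: subject ID <TAB> % identity <TAB> % coverage
rank_hits() {
  sort -t "$(printf '\t')" -k3,3nr -k2,2nr "$1"
}

# Made-up example table
workdir=$(mktemp -d)
printf 'hitA\t98.5\t72\nhitB\t91.0\t99\nhitC\t99.2\t99\n' > "$workdir/hits.tsv"

# Show at most the 20 best hits
rank_hits "$workdir/hits.tsv" | head -n 20
```

Sorting on coverage first and identity second mirrors the advice above: a hit with high identity over only a small part of the genome should not outrank one that covers nearly all of it.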
From this list you can select the closest relatives for further inspection or download. We suggest downloading no more than 20 sequences for the next step.
- Use the Refine Search button to narrow down the number of search results shown
- Limit to Sequence type > RefSeq if you only want one representative of each species
- Use the select column to limit or expand the information shown
Once you have selected a subset of sequences, you can use the Align and Build phylogenetic tree buttons to further investigate the search results. However, this will not include your own phage sequence.
If you investigated your genome in the previous step, you may have already found information on the closest relatives in the Prokka output. You can also use the NCBI Virus interface to search for a specific taxon.
For example, the search for Tequatrovirus gives 1497 nucleotide sequences, of which 251 are flagged as complete and 83 are RefSeq records.
If your BLAST searches yield few results, you can alternatively submit your genome to the VipTree server, which will compare the translated genome with the reference genomes in its database and add it to the tree.
💡 VipTree has a lot of features that you can explore on your own time.
As discussed in the theory session, species and genus ranks are delineated using intergenomic distances or nucleic acid identities. For the purpose of this workshop, we will run VIRIDIC, developed by Cristina Moraru.
If you want to assess whether your phage falls within the boundaries of existing species and genera, you will need to include genomes from those species and genera when running VIRIDIC. Hopefully, your selection from the previous step will be sufficient.
❗ Don't forget to add your own genome to the file when submitting. This can be done manually by copy-pasting into the fasta file with the relatives or by using the cat command to join the two files.
cat T7_restart.fasta T7-likes.fasta > T7_viridic.fasta
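Before uploading, it can help to confirm that the combined file contains the expected number of records and no repeated identifiers. The sketch below uses small made-up files in a temporary directory; the helper names `count_seqs` and `duplicate_ids` are hypothetical and not part of any tool used in this workshop.

```shell
# Count FASTA records (header lines start with '>').
count_seqs() {
  grep -c '^>' "$1"
}

# Print sequence IDs (first word of each header) that occur more than once.
duplicate_ids() {
  grep '^>' "$1" | cut -d ' ' -f1 | sort | uniq -d
}

# Made-up stand-ins for your own genome and the downloaded relatives
workdir=$(mktemp -d)
printf '>my_phage\nACGT\n' > "$workdir/query.fasta"
printf '>rel1 example\nACGA\n>rel2 example\nACGG\n' > "$workdir/relatives.fasta"
cat "$workdir/query.fasta" "$workdir/relatives.fasta" > "$workdir/combined.fasta"

count_seqs "$workdir/combined.fasta"     # number of records in the joined file
duplicate_ids "$workdir/combined.fasta"  # prints nothing when all IDs are unique
```

If the count is one lower than expected, check whether the first file ended without a final newline, because `cat` would then glue the next header onto the last sequence line.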
When running VIRIDIC, create a project first; this will check the validity of the uploaded fasta file. Then click run.
As an example to investigate, we have pre-run the T7 genome with some selected genomes from NCBI Virus. The results are available from our repository for day 3.
❓ Can you clearly see the species and genus boundaries?
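One way to read the similarity values is to compare each pair against the demarcation thresholds. The sketch below assumes 95% for species and 70% for genus, the values commonly used for bacteriophages; check them against the theory session before relying on them. The input table (two genome names and a similarity percentage) and all values in it are made up, and the layout is not VIRIDIC's own export format.

```shell
# Label each genome pair by rank, given a similarity percentage in column 3.
# Thresholds are assumptions: >= 95 same species, >= 70 same genus.
classify_pairs() {
  awk -F '\t' -v sp=95 -v ge=70 '{
    sim = $3 + 0
    if (sim >= sp)      rank = "same species"
    else if (sim >= ge) rank = "same genus"
    else                rank = "different genus"
    print $1 "\t" $2 "\t" $3 "\t" rank
  }' "$1"
}

# Made-up example values
workdir=$(mktemp -d)
printf 'query\tphageA\t97.3\nquery\tphageB\t74.1\nquery\tphageC\t12.0\n' > "$workdir/sim.tsv"

classify_pairs "$workdir/sim.tsv"
```

Pairs that sit close to a threshold deserve a closer look in the heatmap rather than a purely mechanical call.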
In the same way that we looked for genomic relatives, we can search for particular sets of conserved proteins and download them quite easily.
Using the example of T7, we've seen that it belongs to the genus Teseptimavirus. Now we can search for Teseptimavirus and use Refine search on the Protein tab to filter for "DNA polymerase". We can now easily download all DNA polymerases.
Use the annotated genome to copy a marker gene sequence. The choice of sequence will depend on your phage. Use the ICTV Taxonomy history to look for taxonomy proposals and the marker genes they used, and consult the ICTV Reports or published manuscripts to determine the most appropriate marker gene. When in doubt, use the terminase large subunit, major capsid protein, or polymerase genes.
Use the protein sequence search function, refine your search to a manageable number of protein sequences (<50) that includes appropriate outliers, and download them.
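After downloading, a quick length table can help spot truncated or partial proteins before they go into an alignment. This is a minimal sketch on a made-up file; the helper name `seq_lengths` is hypothetical.

```shell
# Print each record's ID (first word of the header) and its sequence length.
seq_lengths() {
  awk '/^>/ { if (id != "") print id "\t" len; id = substr($1, 2); len = 0; next }
       { len += length($0) }
       END { if (id != "") print id "\t" len }' "$1"
}

# Made-up protein records; sequences may span several lines
workdir=$(mktemp -d)
printf '>p1 DNA polymerase\nMKV\nLA\n>p2 DNA polymerase\nMK\n' > "$workdir/proteins.fasta"

# Shortest records first, so unusually short ones stand out
seq_lengths "$workdir/proteins.fasta" | sort -t "$(printf '\t')" -k2,2n
```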
Phage Annotation Workshop 2021