Skip to content

Day 3: Classification and Taxonomy Practical

evelien edited this page Nov 25, 2021 · 7 revisions

Phage classification and taxonomy

🎥 Live screencast by Evelien

1️⃣ Finding the closest relatives for your phage

Using NCBI Virus sequence searches

When you only have a few genomes, and you are most comfortable with web-based searches, we can recommend NCBI Virus.

You can search for relatives with the Search by sequence function using the whole genome (nucleotide search) or a single protein. The underlying algorithm is the same BLAST you have probably used before.

💡 Try searching with the genome sequence first. If this yields results with both high % coverage and % identity values, then you have close relatives. If not, you can use searches with proteins to find more distant relatives.

My search using the restarted T7 genome yields 499 results. The first thing to do is to sort them according to % coverage (click on the column header) so that we can see the database sequences that share the highest fraction of their genome with our search query. Now we can see that the top "hits" are all isolates belonging to the same species as T7. From this list you can select the closest relatives for further inspection or download. We suggest not to download more than 20 sequences for the next step.

Tips

  • Use the Refine Search button to narrow down the number of search results shown
  • Limit to Sequence type > RefSeq if you only want one representative of each species
  • Use the select column to limit or expand the information shown

Further investigations within NCBI Virus

Once you have selected a subset of sequences, you can use the Align and Build phylogenetic tree buttons to further investigate the search results. However, this will not include your own phage sequence.

⚠️ If you select too many genomes, these features will not work.

Using NCBI Virus taxonomy information

If you have done the investigation of your genome in the previous step, you may have already found information on closest relatives from the Prokka information. You can also use the NCBI virus interface to search for a specific taxon.

For example, the search for Tequatrovirus gives 1497 nucleotide sequences, of which 251 are flagged as complete and 83 RefSeq records.

Using VipTree to find closest relatives

If your BLAST searches yield little results, you can alternatively submit your genome to the VipTree server which will compare the translated genome with the reference genomes in its database and add it to the tree.

💡 VipTree has a lot of features that you can explore on your own time.

2️⃣ Assessment of the intergenomic distances

As discussed in the theory session, at the species and genus level, we use intergenomic distance or nucleic acid identities to delineate species and genus ranks. For the purpose of this workshop, we will run VIRIDIC developed by Cristina Moraru.

If you want to assess whether your phage falls within the boundaries of existing species and genera, you will need to provide these when running VIRIDIC. Hopefully, your selection from the previous step will be sufficient.

❗ Don't forget to add your own genome to the file when submitting. This can be done manually by copy-pasting into the fasta file with the relatives or by using the cat command to join the two files.

cat T7_restart.fasta T7-likes.fasta > T7_viridic.fasta

When running VIRIDIC, create a project first, this will check the validity of the uploaded FASTA file. Then click run.

VIRIDIC output

We have pre-run the T7 genome with some selected genomes from NCBI Virus as an example to investigate available from our repository of day 3.

❓ Can you clearly see the species and genus boundaries?

3️⃣ A new environment for the taxonomer

You are now familiar with conda environments and how to create them. For this last day of the workshop, we prepared a new environment file with the tools for multiple sequence alignment and tree generation.

Today we'll use this repository as the source of the environment file.

  1. Check that you cloned the repository on day1

If in your home you have a directory called phage-annotation-workshop you did so, and you can check this by:

cd ~/phage-annotation-workshop

If you don't find it:

# Go to your home directory first
cd
# Clone the repository
git clone https://github.com/quadram-institute-bioscience/phage-annotation-workshop.git
  1. Update the repository

When you clone a repository you can retrieve the latest updates by entering the directory and invoking git pull:

# Enter the repository directory
cd ~/phage-annotation-workshop

# Ask git to download changed files
git pull
  1. Create the environment

Similarly to what we did on Day 1, we can use the newly added environment file to create the new environment:

# You will have now a taxon.yaml file to use as a recipe for a "taxon" environment
mamba env create -n taxon --file ~/phage-annotation-workshop/scripts/taxon.yaml

# Remember to activate it!
conda activate taxon

4️⃣ Building a marker gene tree

Collecting conserved genes with NCBI Virus

The same way we looked for genomic relatives, we can do a search for particular sets of conserved proteins and download them quite easily.

Search by name

Using the example of T7, we've seen that it belongs to the genus Teseptimavirus. Now we can do a search for Teseptimavirus and Refine search the Protein tab for "DNA polymerase". We can now easily download all DNA polymerases.

Search by sequence

Use the annotated genome to copy a marker gene sequence. The choice of sequence will depend on your phage. Use the ICTV Taxonomy history to look for taxonomy proposals and the marker genes used, use the ICTV Reports or published manuscripts to determine what is the most appropriate marker gene. When in doubt, use the terminase large subunit, major capsid protein, or polymerase genes.

Use the protein sequence search function, refine your search to a manageable number of protein sequences (<50) with appropriate outliers and download.

Multiple sequence alignment

The first step in building a phylogenetic tree is doing a multiple sequence alignment. There are many great tools out there that produce equivalent results. Today we will use MAFFT version 7 on the VM.

First, log in to your VM as during the previous days. We have created a convenient conda environment with the tools you will use

mamba env create -n taxon --file /data/tools/taxon.yaml
conda activate taxon

We can now run MAFFT

# to access help
mafft

# unsure about which settings to use
mafft  --auto {input_file} > {output_file}

💡 I (Evelien) always like to name my output files so that I can remember what I did. So if I start with T4_terL.fasta then I will name the output file something like mafft_T4_terL.

Checking the MSA

There are a couple of ways to visualise MSAs. We can have a quick look on the command-line with less. We can also open the file in UGene to have a closer look and see whether there are any issues with the selection of proteins we have made. In UGene, we can delete sequences if necessary, or use this information to update the original file and redo the alignments.

Trimming the alignment

We will use the tool TrimAl to perform a quick automatic trimming of the MSA, using the setting gappyout. There are many other possible settings and we refer you to the manual if you want to know more.

# to access the help function type trimal -h
trimal -gappyout -in mafft_terL_T4 -out mafft_gappyout_terL_T4

Checking the trimmed MSA

❓ Check the MSA again in UGene. What has changed? Is this MSA of good quality to move forward?

Tree building

We will use the [Iq-Tree[(http://www.iqtree.org/doc/Tutorial) suite to build the tree. This is a set of tools, each with their own reference
paper describing the methods. Depending on which parameters you use, you will need to cite multiple reference when you publish a tree built with Iq-Tree.

iqtree -s mafft_gappyout_terL_T4 \
-m MFP \
-B 1000 \
-alrt 1000 \
-T auto

Let's break down this command:

  • -s: input MSA
  • -m MFP: this step asks iqtree to perform a step to find the best substitution model for your dataset and subsequently use that model to build the phylogenetic tree
  • -B 1000: run 1000 ultrafast bootstraps to assess branch support
  • -alrt 1000: perform the approximate likelihood ratio test to assess branch support
  • -T auto: let the system decide how many threads to use

If all went well, you should have been able to follow the progress of the software and at the end have a number of new files, including the tree file. We suggest to check the .iqtree file first, because this shows you the entire process of the software to come to the final trees. You can now open the tree (.contree) file in your favourite phylogenetic tree editor.

5️⃣ One-click trees with Phylogeny.fr

If you don't have many sequences, and you want to have a quick look on a webserver on what the phylogenetic tree of a set of sequences looks like, we can recommend Phylogeny.fr One Click Mode. In the One Click Mode, all decisions about multiple sequence alignment, curation of the alignment and tree building algorithm have been mode for you. You supply it with a .fasta formatted file of the sequences (nucleotide or amino acid) that you want aligned and it will spit out a tree.

  • Alignment: MUSCLE
  • curation: Gblocks (can be switched off)
  • tree building/phylogeny: PhyML
  • tree rendering: TreeDyn