
Extra Multiple Sequence Alignment

Andrea Telatin edited this page Nov 19, 2021 · 11 revisions

Try to perform a small MSA and Tree

Multiple sequence alignment (MSA) identifies the homologous regions in a set of sequences and the "edit paths" from one sequence to another.

The tools

There are different options for MSA, and the ideal choice depends on the nature of the dataset (sequence length, dataset size, nucleotide or protein...). A good summary of the available options is given in this review.

During the workshop we will use MAFFT for the alignment, IQ-TREE for the tree, and Newick Utilities to display it.

The file formats

MSA

A common output format is "FASTA with gaps":

>seq1
--CAGTCGATCGGTAGCAGCTGACGTAGCAG--GAAGCT
>seq2
GGCAGTCGATC-GTAGCAGCTGACGTAGCAG--GAAGCT
>seq3
--CAGTCGATCGGTAGCAGCTGACGTAGCAG--CTAGC-
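
In an alignment every row must have the same length once gaps are included. As a minimal sketch (the function name `read_fasta` is made up for this page and is not part of any tool used in the workshop), the check can be done in a few lines of Python on the example above:

```python
def read_fasta(text):
    """Return {name: sequence} from FASTA-formatted text (gaps are kept)."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()
            records[name] = ""
        elif name is not None:
            # Sequences may be wrapped over several lines: concatenate them
            records[name] += line
    return records


# The "FASTA with gaps" example shown above
ALIGNMENT = """\
>seq1
--CAGTCGATCGGTAGCAGCTGACGTAGCAG--GAAGCT
>seq2
GGCAGTCGATC-GTAGCAGCTGACGTAGCAG--GAAGCT
>seq3
--CAGTCGATCGGTAGCAGCTGACGTAGCAG--CTAGC-
"""

records = read_fasta(ALIGNMENT)
lengths = {len(seq) for seq in records.values()}
if len(lengths) != 1:
    raise ValueError(f"rows have different lengths: {sorted(lengths)}")

# Print name, aligned length and number of gap characters for each row
for name, seq in records.items():
    print(name, len(seq), seq.count("-"))
```

To use it on a real file, read the file content first (for example with `open(path).read()`) and pass it to `read_fasta`.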

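Another popular format is PHYLIP, originally defined and used in Joe Felsenstein’s PHYLIP package. The first line gives the number of sequences and their length (here, 5 sequences of 42 residues). In each following line the first 10 characters are the sequence name, and the sequence itself can be split into blocks by spaces.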

      5    42
Turkey    AAGCTNGGGC ATTTCAGGGT GAGCCCGGGC AATACAGGGT AT
Salmo gairAAGCCTTGGC AGTGCAGGGT GAGCCGTGGC CGGGCACGGT AT
H. SapiensACCGGTTGGC CGTTCAGGGT ACAGGTTGGC CGTTCAGGGT AA
Chimp     AAACCCTTGC CGTTACGCTT AAACCGAGGC CGGGACACTC AT
Gorilla   AAACCCTTGC CGGTACGCTT AAACCATTGC CGGTACGCTT AA
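
The fixed-width layout is easy to get wrong by eye, so here is a small sketch of a reader for the example above. It assumes the simple "sequential" layout shown here (one line per sequence, 10-character names) and does not handle interleaved files; `read_phylip` is a hypothetical helper name, not a function from the PHYLIP package.

```python
def read_phylip(text, name_width=10):
    """Parse a simple sequential PHYLIP alignment into {name: sequence}."""
    lines = [line for line in text.splitlines() if line.strip()]
    # Header: number of sequences, then number of sites
    n_seqs, n_sites = (int(x) for x in lines[0].split())
    records = {}
    for line in lines[1:]:
        name = line[:name_width].strip()
        seq = line[name_width:].replace(" ", "")
        if len(seq) != n_sites:
            raise ValueError(f"{name}: expected {n_sites} sites, found {len(seq)}")
        records[name] = seq
    if len(records) != n_seqs:
        raise ValueError(f"expected {n_seqs} sequences, found {len(records)}")
    return records


# The PHYLIP example shown above
PHYLIP = """\
      5    42
Turkey    AAGCTNGGGC ATTTCAGGGT GAGCCCGGGC AATACAGGGT AT
Salmo gairAAGCCTTGGC AGTGCAGGGT GAGCCGTGGC CGGGCACGGT AT
H. SapiensACCGGTTGGC CGTTCAGGGT ACAGGTTGGC CGTTCAGGGT AA
Chimp     AAACCCTTGC CGTTACGCTT AAACCGAGGC CGGGACACTC AT
Gorilla   AAACCCTTGC CGGTACGCTT AAACCATTGC CGGTACGCTT AA
"""

phylip_records = read_phylip(PHYLIP)
for name, seq in phylip_records.items():
    print(f"{name:<10} {len(seq)}")
```

Note how "Salmo gair" and "H. Sapiens" fill all 10 name characters, so the sequence starts immediately after the name with no separating space.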

Trees

The Newick format is the last simple text file format covered in the workshop. It describes the topology and the attributes of a dendrogram.

It uses parentheses to group closely related nodes, for example: (A,B,(C,D)E)F; Here A, B, C and D are leaves, E is the internal node that groups C and D, and F is the root.

 /---------------+ A
 |                  
=+-F-------------+ B
 |                  
 |       /-------+ C
 \-------+ E        
         \-------+ D
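
To make the nesting explicit, the following is a minimal sketch of a recursive parser for this kind of string. It is an illustration only: it understands node names and optional `:length` suffixes, but not quoted labels or comments, and the names `parse_newick` and `leaf_names` are invented for this example.

```python
def parse_newick(text):
    """Parse a simple Newick string into nested dicts (name, length, children)."""
    text = text.strip()
    if not text.endswith(";"):
        raise ValueError("a Newick tree must end with ';'")
    pos = 0

    def parse_node():
        nonlocal pos
        children = []
        if text[pos] == "(":
            pos += 1  # skip "("
            children.append(parse_node())
            while text[pos] == ",":
                pos += 1  # skip ","
                children.append(parse_node())
            if text[pos] != ")":
                raise ValueError(f"expected ')' at position {pos}")
            pos += 1  # skip ")"
        # The label follows the (optional) group of children
        start = pos
        while text[pos] not in ",();":
            pos += 1
        name, _, length = text[start:pos].partition(":")
        return {
            "name": name,
            "length": float(length) if length else None,
            "children": children,
        }

    root = parse_node()
    if text[pos] != ";":
        raise ValueError(f"unexpected character at position {pos}")
    return root


def leaf_names(node):
    """Return the names of the leaves below a node, left to right."""
    if not node["children"]:
        return [node["name"]]
    names = []
    for child in node["children"]:
        names.extend(leaf_names(child))
    return names


tree = parse_newick("(A,B,(C,D)E)F;")
print(tree["name"])                 # name of the root
print(leaf_names(tree))             # all leaves
print(tree["children"][2]["name"])  # the internal node grouping C and D
```

For real work a dedicated library or Newick Utilities is a better choice; the sketch is only meant to show how the parentheses map to parent and child nodes.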
                    

MAFFT

mkdir msa_test

mafft ~/phage-annotation-workshop/day_3/msa/polymerases.faa > msa_test/polymerases.msa

💡 Try viewing the output file with less -S to see the gaps!

The tree

IQ-TREE can be used as described in the Day 3 practical:

iqtree -m MFP -B 1000 -alrt 1000 -s msa_test/polymerases.msa --prefix msa_test/iq

Newick format from the command line

We can even print trees in the terminal, although the result is not ideal:

nw_display poly.msa.treefile

 /-------------------------------------------------------------------------------------------------------+ DENOBDJG 00531
 |                                                                                                                       
 / DENOBDJG 02989                                                                                                        
 + 0/47                                                                                                                  
=\ DENOBDJG 01754                                                                                                        
 |                                                                                                                       
 |         /+ DENOBDJG 00191                                                                                             
 \---------+ 94/98                                                                                                       
           \ DENOBDJG 00144                                                                                              
                                                                                                                         
 |----------------------|-----------------------|----------------------|----------------------|-----------               
 0                   0.02                    0.04                   0.06                   0.08                          
 substitutions/site                                                                                                      
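
The labels on the internal nodes (0/47 and 94/98) are the two support values requested with -alrt 1000 and -B 1000: SH-aLRT support first, ultrafast bootstrap second. A small sketch of how such labels could be split and filtered follows; the function names are hypothetical, and the default thresholds (80 and 95) are an assumption based on commonly quoted guidance, so adjust them to your own criteria.

```python
def parse_support(label):
    """Split an 'SH-aLRT/UFBoot' node label into two floats."""
    alrt, ufboot = label.split("/")
    return float(alrt), float(ufboot)


def is_well_supported(label, min_alrt=80.0, min_ufboot=95.0):
    """True if both support values reach the (assumed) thresholds."""
    alrt, ufboot = parse_support(label)
    return alrt >= min_alrt and ufboot >= min_ufboot


# Node labels taken from the tree printed above
for label in ("0/47", "94/98"):
    print(label, is_well_supported(label))
```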
                                                                                                                        

Newick Utilities can also be used to convert a Newick file to SVG:

nw_display -s poly.msa.treefile > poly.tree.svg

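# On most Unix systems Inkscape can be run from the command line to convert the SVG to PDF.
# The flags below are the Inkscape 0.9x syntax; with Inkscape 1.x the equivalent
# should be: inkscape poly.tree.svg --export-filename=poly.pdf
inkscape -f poly.tree.svg -A poly.pdf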