Utilities for curating a hierarchical set of sequence profiles representing a protein superfamily.
If you use this software in a publication, please cite our paper that describes it:
Talevich, E. & Kannan, N. (2013) Structural and evolutionary adaptation of rhoptry kinases and pseudokinases, a family of coccidian virulence factors. BMC Evolutionary Biology 13:117 doi:10.1186/1471-2148-13-117
Available at: http://www.biomedcentral.com/1471-2148/13/117
Freely distributed under the permissive BSD 2-clause license (see LICENSE).
The installation consists of a Python library, fammerlib
, and two scripts,
fammer.py
and tmalign.py
.
Download the .zip file and unpack it, or clone this Git repository, to get the source code.
To use all the features of Fammer, you'll need the following third-party programs installed:
- Python 2.7
- MAFFT
- HMMer 3.0
- MAPGAPS (optional)
- TMalign (optional) -- for structural alignments
- FastTree (optional) -- for clustering
If you're on a Debian-based Linux system, check your package manager for these first to save yourself some time:
sudo apt-get install mafft hmmer tm-align fasttree python-pip
Then, install the Python library dependencies and Fammer itself as follows.
Install the Python packaging system pip or setuptools. Then run the setup script, and all Python dependencies will be pulled in:
python setup.py build python setup.py install
(You might need root privileges for the last step.)
Install the Python libraries Biopython, biofrills, biocma and networkx. Then run the setup script as above.
Global options:
-h
,--help
- Show a help message and basic usage.
--quiet
- Don't print status messages, only warnings and errors.
Sub-commands:
- build
- Build profiles from a given directory tree.
- update-fasta
- Replace original FASTA sequence sets with the ungapped sequences from the corresponding alignment (.aln) files, sorted by decreasing length.
- scan
- Scan and classify input sequences using a set of profiles.
- add
- Scan a target database with the given HMM profile set. Add hits that meet acceptance thresholds to the profile FASTA files.
- refine
- Leave-one-out validation of HMM profiles.
- cluster
- Split a sequence set into clusters (based on phylogeny).
Construct a profile database from a directory tree of family profile alignments.
Assume we have a directory tree set up under Superfamily/
as above.
Next, run fammer.py build Superfamily
to align all sequence files with
MAFFT, and (recursively up) align the consensus sequences of each subfamily
together:
Superfamily/ Group1/ subfam1_Group1.fasta subfam1_Group1.aln subfam2_Group1.fasta subfam2_Group1.aln subfam3_Group1.fasta subfam2_Group1.aln ... Group1-Unclassified.fasta Group1-Unclassified.aln Group1.aln ... Superfamily-Unclassified.fasta Superfamily-Unclassified.aln Superfamily.aln
The alignments are in un-wrapped Clustal format.
You can manually adjust the alignments and rebuild, if desired, perhaps
iteratively. Only the "parent" family alignments will be rebuilt as needed, e.g.
if subfam1_Group1.aln
is edited, then only Group1.aln
and
Superfamily.aln
will be rebuilt the next time fammer.py build
Superfamily
is called because the consensus sequences that constitute those
alignments may have changed. (It's like Make.)
Finally, use the option --hmmer
to build profiles:
Superfamily/ Group1/ subfam1_Group1.fasta subfam1_Group1.aln subfam2_Group1.hmm ... Group1.aln Group1.hmm ... Superfamily.aln Superfamily.hmm Superfamily_all.hmm # concatenated profiles Superfamily_all.hmm.{h3f,h3i,h3m,h3p} # indexes from hmmpress
The --mapgaps
option works similarly, if you have the necessary programs
installed.
The --clean
option can be included with any of the above commands to remove
intermediate files.
If you have included PDB structures in your directory tree and have a structure
alignment program installed, the --pdb
option will first create a structural
alignment of the PDBs in the directory, then use that alignment as the seed for
higher-up alignments:
Superfamily/ Group1/ subfam1_Group1.fasta subfam1_Group1.aln 1ATP.pdb 1O6K.pdb 3C4X.pdb ... Group1.pdb.seq # Alignment of 1ATP, 1O6K, 3C4X Group1.aln ... Superfamily.aln
In this example, the alignment generated by aligning the structures 1ATP, 1O6K
and 3C4X is passed to MAFFT as a seed for Group1.aln
, along with the
unaligned consensus sequences of each subfamily of Group1 (subfam1, subfam2,
...). The seed sequences are removed from Group1.aln after the alignment of
consensus sequences is completed. This can help correctly align the more
divergent families and groups to each other.
For nested directory trees, the option --tree
generates a Newick file
representing the structure of the directory tree. A tree based on the above
examples would look something like this (ignoring whitespace), created as
Superfamily.nwk
:
((subfam1_Group1, subfam2_Group1, subfam3_Group1, Group1-Unclassified)Group1, (subfam1_Group2, subfam2_Group2, subfam3_Group2, Group2-Unclassified)Group2, (subfam1_Group3, subfam2_Group3, subfam3_Group3, Group3-Unclassified)Group3, Superfamily-Unclassified)Superfamily;
This tree could be passed to RAxML as a constraint tree in an effort to identify deeper subfamilies, for example.
Convert the contents of the .aln
sequence alignment files back to unaligned
FASTA format, overwriting the corresponding .fasta
files.
After initially building a tree of sequence alignments, you might edit the
Clustal alignments, deleting spurious sequences or trimming the alignment to the
edges of a conserved domain. With update-fasta
, you update the contents of
the unaligned sequence files to match the .aln
files.
The next step is usually to either (a) do some sequence processing unrelated to fammer, e.g. clustering, or (b) realign everything. Since you've presumably removed some junk from the input sequences, the resulting alignments may be better.
Scan/search a set of sequences (FASTA) with the HMM profile database and assign a classification to each hit.
This is essentially a set of wrappers to process the output of hmmsearch
,
simplifying the results for common use cases. The three output forms are:
- summary (default):
- Print two formated columns for each profile in the given HMM profile database that matched at least one hit: the name of the profile and the number of hits for which it was the best match.
- table (
--table
):- For each sequence in the target sequence set that matched a profile in the HMM profile database, print the sequence ID/accession and the name of the best-matching profile, separated by a tab character.
- sequence sets (
--seqsets
):- For each profile and matching sequence set (as they'd appear in summary output), write a file containing the matching sequences. The output filenames indicate the name of the source sequence file name and the matching HMM profile names.
Note that --table
and --seqsets
can be combined.
Scan a target database with the given HMM profile set and add hits that meet a series of acceptance thresholds to the profile FASTA files.
Once you've constructed profiles from a collection of carefully selected
sequences representing each subfamily, you can use this command to scan another
sequence set and automatically add strong hits to the corresponding profile
sequence sets. The target database could be the *-Unclassified.fasta
sequence sets, to catch any classifiable members that were not noticed
initially, or a larger sequence database like refseq_proteins, if you're
confident in your coverage of the superfamily and want to improve the
sensitivity of your profiles.
Leave-one-out validation of sequence profiles. Unlike the other commands, this is non-recursive.
Given a target subdirectory and the name of the subdirectory's
*-Unclassified.fasta
file (if not specified, it looks for
dirname-Unclassified.fasta), scan each subfamily's sequence set (.fasta
)
with the corresponding HMM profile (.hmm
), and also scan the
-Unclassified.fasta
file with all the HMMs to obtain scores for each
sequence and each profile. Then, compare the scores of sequences in a subfamily,
starting with the worst-scoring sequence, to the highest-scoring "unclassified"
sequence by the same profile. If, for a given profile, a classified sequence
scores worst than an unclassified one, mark the classified one for removal from
the sequence profile.
Note that if a member of a known subfamily was mistakenly placed in
-Unclassified.fasta
(i.e. was missed by the initial classification), then
many of the legitimate members of the subfamily profile could score worse than
this high-scoring "unclassified" sequence and be erroneously marked for removal
from the profile. This is easy enough to spot in the logged output. One way to
avoid it is to first use the add
command with -Unclassified.fasta
as
the target, to catch and classify such sequences beforehand.
Extract clusters from a sequence set based on phylogenetic relationships.
Uses FastTree to quickly build a tree with branch support values, then extracts well-supported clades from the tree and writes the corresponding sequence sets to FASTA files. Unclustered sequences are written to another "Unique" file. Also writes a phyloXML tree file (.xml) showing clusters as colorized clades.
Serves the basic purpose of a build tool like Make or Rake: compare the time stamps of input and output files at each step of the build process, and only rebuild the outputs that are out of date. Also track intermediate files to clean up after the process successfully completes. See this blog post about it.
Align multiple structures using TMalign for pairwise alignments and a minimum spanning tree constructed from the pairwise TM-scores to assemble the pairwise alignments into a multiple sequence alignment. This module can also be used as a command-line script.
Begin by creating a directory tree with each subfamily's representative
sequences in an unaligned FASTA file. The FASTA file names must end in
.fasta
.
A simple un-nested layout looks like:
Superfamily/ subfam1.fasta subfam2.fasta subfam3.fasta ... Superfamily-Unclassified.fasta
Typically, a protein superfamily will have some members that cannot be cleanly classified into subfamilies.
Recursive nesting is also allowed (in fact, that's the point of this project). It looks like this:
Superfamily/ Group1/ subfam1_Group1.fasta subfam2_Group1.fasta subfam3_Group1.fasta ... Group1-Unclassified.fasta Group2/ subfam1_Group2.fasta subfam2_Group2.fasta subfam3_Group2.fasta ... Group2-Unclassified.fasta Group3/ subfam1_Group3.fasta subfam2_Group3.fasta subfam3_Group3.fasta ... Group3-Unclassified.fasta ... Superfamily-Unclassified.fasta
At least 2 sequences are needed to define each subfamily. The sequences are initially retrieved and organized manually. External databases can help; for example, KinBase provides a classification scheme and representative sequences for the protein kinase superfamily. The directory hierarchy defines the higher levels of organization of the superfamily.
Now build the initial alignments:
fammer.py build --clean Superfamily/
For best results, sequences should all cover the same conserved domain region and align exactly at N- and C-termini. For example, the eukaryotic protein kinase domain begins 7 residues before the glycine-rich loop (GxGxxG motif) and ends 12 residues after a conserved arginine in subdomain XI (RP[TS] motif). The consensus function used in Fammer includes the flanking sequences before the first conserved block or after the last conserved block, regardless of "gappiness", so even one sequence in a set can be used to define domain boundaries for the whole set.
Edit the subfamily-level alignments (.aln) to trim the sequences to the domain boundaries. I use Vim's block-selection mode (Ctrl-v); you might prefer JalView. Do not edit higher-level .aln files; just look at them to see which subfamily-level alignments have extended tails, then edit those .aln files.
When you think you've trimmed all the .aln files to the conserved domain region, update the corresponding unaligned sequence files (.fasta) to match:
fammer.py update-fasta Superfamily/
Rebuild the alignments whose source sequence sets have changed:
fammer.py build --clean Superfamily/
MAFFT may create better alignments for sets where the unalignable regions have now been removed. When this step completes, examine the top-level alignment (e.g. Superfamily.aln) -- are the domain boundaries aligned cleanly across all sequence sets? Repeat the process if any subfamilies need to be trimmed further.
If you realize you've trimmed a profile too far, use your version control system (you are using Git or Mercurial to take snapshots of the tree, right?) to revert the .fasta file to an earlier version, then rebuild and try trimming again.
If you never had the full-length sequences for a subfamily to begin with, try to find one representative full-length sequence from a database like UniProt, add it to the .fasta file, realign, and trim that sequence to the region it should cover. This won't improve the HMM or MAPGAPS profile for that subfamily much, but it will help the higher-level alignments that include the consensus sequence of that subfamily.
While MAFFT typically creates good alignments within a subfamily, for high-level profiles it may struggle to align the consensus sequences of very divergent families or groups to each other. In these cases, seeding with a structural alignment can help line up homologous regions.
Manually identify the high-quality solved crystal structures that correspond to families in your tree, and place those PDB files in the directory tree at the same level as the subfamily they represent.
Open the PDB file (.pdb) in a text editor and:
- Remove the ATOM records corresponding to residues outside the conserved domain.
- If multiple chains are present, choose the best, most complete chain and delete the others. (Otherwise, Fammer will take the first chain by default.)
To determine the domain boundaries, load a "reference" PDB (e.g. 1ATP for kinases) and the other PDB together in PyMOL and align using the command "cealign" or "fit". Visually find which residues correspond to the start and end of the conserved domain. If your PDB structure diverges from the reference structure drastically before or after a certain point (i.e. N- or C-terminal region of the domain is non-homologous -- excluding inserts), it may be best to truncate the PDB to remove the non-homologous portion as it cannot be aligned accurately.
Save the edited .pdb file, and rebuild the higher-level profiles:
fammer.py build --clean Superfamily/
Now that PDB files are present in the directory tree, Fammer will writes structural alignments in FASTA format to the .pdb.seq files.
Once complete, examine the top-level alignment (e.g. Superfamily.aln) for misalignments. To fix these, first edit the corresponding .pdb.seq file. Usually there is just one structure that was misaligned. Rebuild and re-edit as necessary. If a particular PDB is very poorly aligned to the others, it may be best to just remove it altogether -- it may have a different conformation from the others.