Home
PhyloPhlAn is a computational pipeline for reconstructing highly accurate and resolved phylogenetic trees based on whole-genome sequence information. The pipeline is scalable to thousands of genomes and uses the most conserved 400 proteins for extracting the phylogenetic signal. PhyloPhlAn also implements taxonomic curation, estimation, and insertion operations.
The main features of PhyloPhlAn are:
- completely automatic, as the user needs only to provide the (unannotated) protein sequences of the input genomes (as multifasta files of peptides - not nucleotides)
- very high topological accuracy and resolution because of the use of up to 400 previously identified most conserved proteins
- the possibility of integrating new genomes in the already reconstructed most comprehensive tree of life (3,171 microbial genomes)
- taxonomy estimation for the newly inserted genomes
- taxonomic curation for the produced phylogenetic trees
We are developing a new version of PhyloPhlAn; the new PhyloPhlAn wiki page is at https://bitbucket.org/nsegata/phylophlan/wiki/phylophlan2.
Please note that it is still an alpha release available in the dev branch of the repository.
PhyloPhlAn can be downloaded from https://bitbucket.org/nsegata/phylophlan/get/default.tar.gz or accessed from our repository at https://bitbucket.org/nsegata/phylophlan.
PhyloPhlAn can also be obtained using http://mercurial.selenic.com/ as follows:
$ hg clone https://bitbucket.org/nsegata/phylophlan
The package can also be downloaded as a compressed file in zip (https://bitbucket.org/nsegata/phylophlan/get/default.zip) and tar.bz2 (https://bitbucket.org/nsegata/phylophlan/get/default.tar.bz2) formats.
PhyloPhlAn has been developed and tested on Unix-based systems. On Windows or Mac systems, PhyloPhlAn may require some tweaking.
If you find the software or methodology useful, please cite the accompanying manuscript:
http://www.ncbi.nlm.nih.gov/pubmed/23942190
Nicola Segata, Daniela Börnigen, Xochitl C. Morgan, and Curtis Huttenhower.
Nature Communications 4, 2013
You can download PhyloPhlAn's tree of life with bootstrapping support (https://bitbucket.org/nsegata/phylophlan/wiki/bs_tree.reroot.nwk), in which the genome labels are encoded with the IMG (http://img.jgi.doe.gov/cgi-bin/w/main.cgi) taxon ID (prefixed with 't'). The same tree is also available with leaf nodes annotated with labels at four taxonomic levels:
- species: https://bitbucket.org/nsegata/phylophlan/wiki/bs_tree.reroot.spe_labels.nwk
- genus: https://bitbucket.org/nsegata/phylophlan/wiki/bs_tree.reroot.gen_labels.nwk
- family: https://bitbucket.org/nsegata/phylophlan/wiki/bs_tree.reroot.fam_labels.nwk
- phylum: https://bitbucket.org/nsegata/phylophlan/wiki/bs_tree.reroot.phy_labels.nwk
In addition, we provide two alignment archives: https://bitbucket.org/nsegata/phylophlan/src/ee2e2ed911c8/data/ppaalns/ppa.aln.tar.bz2 and https://bitbucket.org/nsegata/phylophlan/wiki/ppafull.aln.faa.tar.bz2.
The image below reports the comprehensive, automated, and high-resolution microbial tree of life with taxonomic annotations obtained with PhyloPhlAn. It contains a total of 3,737 microbial genomes.
A high-resolution version of this image is also available for download.
Software updates will be posted on the project page (https://bitbucket.org/nsegata/phylophlan/). You are more than welcome to use the issue tracker on Bitbucket (https://bitbucket.org/nsegata/phylophlan/issues), or to email the authors, to provide feedback, report bugs, and suggest or request new features.
If you have questions or comments, or you would like to be notified about new versions, new features, or any other news related to PhyloPhlAn, please join our mailing list:
https://groups.google.com/d/forum/phylophlan-users
If you would like to build a phylogenetic tree using any set of private or public genomes, all you need to do is create a folder inside the input folder and copy into it one multifasta file (with extension ".faa") for each genome, containing its peptide sequences. If you call this folder "my_genomes", the command you need to run is:
$ ./phylophlan.py -u my_genomes
When finished, the resulting tree will appear in the output/my_genomes folder.
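The input layout described above can be prepared with a short script. This is only a minimal sketch and not part of PhyloPhlAn: the helper names `looks_like_protein` and `stage_genomes`, the 90% nucleotide threshold, and the source directory are all hypothetical. The only facts taken from this page are the `input/` folder, one `.faa` file per genome, and the requirement that the files contain peptides rather than nucleotides.

```python
import shutil
from pathlib import Path

# Letters that dominate nucleotide FASTA files (assumed heuristic, not a PhyloPhlAn rule).
NUCLEOTIDES = set("ACGTUN")


def looks_like_protein(fasta_path, sample=2000):
    """Heuristic check: reject files whose residues are almost all nucleotide letters."""
    residues = []
    with open(fasta_path) as handle:
        for line in handle:
            if line.startswith(">"):
                continue  # skip FASTA headers
            residues.extend(line.strip().upper())
            if len(residues) >= sample:
                break
    if not residues:
        return False
    nucleotide_fraction = sum(r in NUCLEOTIDES for r in residues) / len(residues)
    return nucleotide_fraction < 0.9  # hypothetical threshold


def stage_genomes(source_dir, project, input_root="input"):
    """Copy every protein-looking .faa file from source_dir into <input_root>/<project>/."""
    target = Path(input_root) / project
    target.mkdir(parents=True, exist_ok=True)
    staged = []
    for faa in sorted(Path(source_dir).glob("*.faa")):
        if looks_like_protein(faa):
            shutil.copy(faa, target / faa.name)
            staged.append(faa.name)
    return staged
```

Calling `stage_genomes("/path/to/my/proteomes", "my_genomes")` from the PhyloPhlAn directory would populate `input/my_genomes`, after which the `-u my_genomes` command can be run. Files that look like nucleotide sequences are skipped rather than copied.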
You can try out this operation (-u) using an example included in the PhyloPhlAn package you downloaded, called example_corynebacteria and stored in the input folder. It contains a protein multifasta file for each of the 30 genomes available for the genus Corynebacterium (http://wikipedia.org/wiki/Corynebacterium) as of February 2012, plus two Streptomyces (http://wikipedia.org/wiki/Streptomyces) genomes as a meaningful outgroup. As mentioned above, the command for obtaining the phylogenetic tree is:
$ ./phylophlan.py -u example_corynebacteria --nproc 4
Using 4 threads (specified with --nproc 4) this operation should take no more than 4-5 minutes, but even using one processor only (default) should give you the results in 10 minutes or so.
In the output/example_corynebacteria/ folder you'll find a Newick (http://en.wikipedia.org/wiki/Newick_format) file of the resulting tree as provided by FastTree (http://www.microbesonline.org/fasttree/), and a PhyloXML (http://en.wikipedia.org/wiki/PhyloXML) file containing the same tree rerooted with a procedure that tries to maximize the distance from the root to any leaf. The two files are also available for download, and can be inspected with tree visualization software (http://en.wikipedia.org/wiki/List_of_phylogenetic_tree_visualization_software) and drawn with GraPhlAn (https://bitbucket.org/nsegata/graphlan). Figure 3B in the manuscript (http://www.ncbi.nlm.nih.gov/pubmed/xxxxxx) reports and discusses this example.
The full tree of life reported above was also originally generated in this way. Note that the concatenated alignment used to generate the tree with FastTree is stored in data/example_corynebacteria/aln.fna and can be used as input for other phylogenetic reconstruction software such as RAxML (https://github.com/stamatak/standard-RAxML) or MEGA (http://www.megasoftware.net/), among others (http://en.wikipedia.org/wiki/List_of_phylogenetics_software).
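Before handing the concatenated alignment to another tool, it can help to confirm that it is a well-formed alignment. The following is a minimal sketch, assuming the file is a plain FASTA alignment in which gaps are written as `-`; the function names are hypothetical and nothing here is part of PhyloPhlAn.

```python
def read_fasta(path):
    """Return a dict mapping each FASTA header (without '>') to its full sequence."""
    chunks = {}
    name = None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                name = line[1:].strip()
                chunks[name] = []
            elif name is not None:
                chunks[name].append(line)
    return {key: "".join(parts) for key, parts in chunks.items()}


def alignment_summary(path):
    """Return (number of sequences, alignment length, gap fraction per sequence)."""
    aln = read_fasta(path)
    lengths = {len(seq) for seq in aln.values()}
    if len(lengths) > 1:
        raise ValueError("sequences have different lengths; not a valid alignment")
    length = lengths.pop() if lengths else 0
    gaps = {
        name: (seq.count("-") / length if length else 0.0)
        for name, seq in aln.items()
    }
    return len(aln), length, gaps
```

For the example above, `alignment_summary("data/example_corynebacteria/aln.fna")` would return the number of genomes in the alignment, the number of aligned columns, and the fraction of gap characters for each genome. A genome with a very high gap fraction may be worth inspecting before tree reconstruction.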
PhyloPhlAn lets you insert a genome (or a set of genomes) into the already built microbial tree of life (containing >3,000 genomes, see figure and tree files above). In this case too, you need to create a dedicated folder (e.g. my_genomes_to_insert) in the input folder to store the protein multifasta files of interest. The command is:
$ ./phylophlan.py -i my_genomes_to_insert --nproc 16
We recommend using as many threads as possible (--nproc), because this operation is computationally demanding: it requires the alignments with the other 3,000 genomes to be updated and the full tree of life to be rebuilt.
The resulting tree file output/my_genomes_to_insert/my_genomes_to_insert.tree.int.nwk can be inspected with tree visualization software to check where the new genomes are placed and how they relate to already well-characterized strains.
As an example of insertion, the input folder of the PhyloPhlAn package includes three recently sequenced genomes that are not yet included in the PhyloPhlAn tree and repository. These are two Lactobacillus (http://wikipedia.org/wiki/Lactobacillus) genomes and one Sulfolobus (http://wikipedia.org/wiki/Sulfolobus) genome available in IMG (accessions http://img.jgi.doe.gov/cgi-bin/w/main.cgi?section=TaxonDetail&page=taxonDetail&taxon_oid=2511231185, http://img.jgi.doe.gov/cgi-bin/w/main.cgi?section=TaxonDetail&page=taxonDetail&taxon_oid=2519899592, and http://img.jgi.doe.gov/cgi-bin/w/main.cgi?section=TaxonDetail&page=taxonDetail&taxon_oid=2524023197, respectively).
$ ./phylophlan.py -i example_insertion --nproc 16
The resulting file example_insertion.tree.int.nwk now contains the thousands of genomes in the PhyloPhlAn repository as well as the three "new" genomes.
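A quick way to confirm that the inserted genomes actually appear in the integrated tree is to extract the leaf labels from the Newick file. The sketch below uses a simple regular expression rather than a full Newick parser, so it assumes unquoted labels without spaces or parentheses. It also assumes that inserted genomes are labeled with their input file base names; that is an assumption, not something this page states. The function names are hypothetical.

```python
import re

# A leaf label directly follows "(" or ","; internal node labels and support
# values follow ")" and are therefore not matched.
LEAF_PATTERN = re.compile(r"[(,]\s*([^():,;\s]+)")


def leaf_labels(newick_text):
    """Return the set of leaf labels found in a Newick string (unquoted labels only)."""
    return set(LEAF_PATTERN.findall(newick_text))


def missing_genomes(newick_path, expected):
    """Return the expected genome labels that are absent from the tree, sorted."""
    with open(newick_path) as handle:
        labels = leaf_labels(handle.read())
    return sorted(set(expected) - labels)
```

An empty list from `missing_genomes("output/example_insertion/example_insertion.tree.int.nwk", [...])` would indicate that every genome you listed was found among the leaves. For anything beyond this check, a proper parser such as the one in Biopython is a better choice.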
You can also ask PhyloPhlAn to try to automatically assign taxonomic labels to the genomes integrated into the tree of life (-i option introduced above). This is done by simply adding the -t flag (for taxonomic analysis) to the same command line:
$ ./phylophlan.py -i -t my_genomes_to_insert --nproc 16
In addition to the output/my_genomes_to_insert/my_genomes_to_insert.tree.int.nwk file, you will obtain tab-separated text files with the most confident taxonomic predictions for your genomes in the output/my_genomes_to_insert/ folder.
Suppose you don't know the taxonomic labels of the Lactobacillus and Sulfolobus genomes used as examples above, possibly because of insufficient phenotypic characterization or because you obtained them with metagenomic assembly. You can call the PhyloPhlAn taxonomic imputation pipeline as:
$ ./phylophlan.py -i -t example_insertion --nproc 16
and check the predictions in the output file, which we report below:
Sulfolobus_acidocaldarius_N8	d__Archaea.p__Crenarchaeota.c__Thermoprotei.o__Sulfolobales.f__Sulfolobaceae.g__Sulfolobus.s__?.t__?
Lactobacillus_rhamnosus_K_ATCC_8530	d__Bacteria.p__Firmicutes.c__Bacilli.o__Lactobacillales.f__Lactobacillaceae.g__Lactobacillus.s__rhamnosus.t__?
Lactobacillus_rhamnosus_LRHMDP3	d__Bacteria.p__Firmicutes.c__Bacilli.o__Lactobacillales.f__Lactobacillaceae.g__Lactobacillus.s__rhamnosus.t__?
As expected, all three genomes are assigned to the right genera. The two lactobacilli are also assigned to the right species (s__rhamnosus), whereas PhyloPhlAn does not find enough support to assign the Sulfolobus genome to the "acidocaldarius" species.
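The prediction lines shown above are easy to process programmatically. The following is a minimal sketch, assuming each line holds a genome name and a dot-separated lineage separated by whitespace, with `?` marking an unassigned level. The rank names attached to the one-letter prefixes (in particular `t` as the level below species) are an interpretation, and the function names are hypothetical.

```python
# Assumed meaning of the one-letter prefixes; the last one (t) is the level
# below species and is called "strain" here as an assumption.
RANKS = {
    "d": "domain", "p": "phylum", "c": "class", "o": "order",
    "f": "family", "g": "genus", "s": "species", "t": "strain",
}


def parse_prediction(line):
    """Split one prediction line into (genome name, {rank: taxon or None})."""
    genome, lineage = line.strip().split(None, 1)
    taxonomy = {}
    for field in lineage.strip().split("."):
        prefix, _, name = field.partition("__")
        taxonomy[RANKS.get(prefix, prefix)] = None if name == "?" else name
    return genome, taxonomy


def deepest_assigned_rank(taxonomy):
    """Return the most specific rank that received a label, or None."""
    assigned = [rank for rank in RANKS.values() if taxonomy.get(rank)]
    return assigned[-1] if assigned else None
```

Applied to the three lines above, `deepest_assigned_rank` would report `genus` for the Sulfolobus genome and `species` for the two Lactobacillus genomes, matching the discussion in the text.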
$ ./phylophlan.py -h
usage: phylophlan.py [-h] [-i] [-u] [-t] [--tax_test TAX_TEST] [-c]
                     [--cleanall] [--nproc N] [-v]
                     [PROJECT NAME]

NAME AND VERSION:
    PhyloPhlAn version 0.99 (8 May 2013)

AUTHORS:
    Nicola Segata ([email protected]) and Curtis Huttenhower ([email protected])

DESCRIPTION
    PhyloPhlAn is a computational pipeline for reconstructing highly accurate
    and resolved phylogenetic trees based on whole-genome sequence information.
    The pipeline is scalable to thousands of genomes and uses the most conserved
    400 proteins for extracting the phylogenetic signal.
    PhyloPhlAn also implements taxonomic curation, estimation, and insertion
    operations.

positional arguments:
  PROJECT NAME          The basename of the project corresponding to the name
                        of the input data folder inside input/. The input data
                        consist of a collection of multifasta files (extension
                        .faa) containing the proteins in each genome. If the
                        project already exists, the already executed steps are
                        not re-ran. The results will be stored in a folder
                        with the project basename in output/ Multiple project
                        can be generated and they safetely coexists.

optional arguments:
  -h, --help            show this help message and exit
  -i, --integrate       Integrate user genomes into the PhyloPhlAn tree
  -u, --user_tree       Build a phylogenetic tree using user genomes only
  -t, --taxonomic_analysis
                        Check taxonomic inconsistencies and refine/correct
                        taxonomic labels
  --tax_test TAX_TEST   nerrors:type:taxl:tmin:tex:name (alpha version,
                        experimental!)
  -c, --clean           Clean the final and partial data produced for the
                        specified project. (use --cleanall for removing
                        general installation and database files)
  --cleanall            Remove all instalation and database file leaving
                        untouched the initial compressed data that is
                        automatically extracted and formatted at the first
                        pipeline run. Projects are not remove (specify a
                        project and use -c for removing projects).
  --nproc N             The number of CPUs to use for parallelizing the
                        blasting [default 1, i.e. no parallelism]
  -v, --version         Prints the current PhyloPhlAn version and exit
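When PhyloPhlAn is driven from another script, the command lines used on this page can be assembled programmatically. The sketch below only uses flags that appear in the help text above (`-u`, `-i`, `-t`, `--nproc`); the function name `build_command` and its parameters are hypothetical. Note that this page only demonstrates `-t` together with `-i`, so combining it with `-u` is untested here.

```python
def build_command(project, mode, taxonomic=False, nproc=1, script="./phylophlan.py"):
    """Build the argument list for one PhyloPhlAn run.

    mode is "user_tree" (-u) or "integrate" (-i); taxonomic adds -t.
    """
    if mode not in ("user_tree", "integrate"):
        raise ValueError("mode must be 'user_tree' or 'integrate'")
    if nproc < 1:
        raise ValueError("nproc must be at least 1")
    command = [script, "-u" if mode == "user_tree" else "-i"]
    if taxonomic:
        command.append("-t")  # only shown together with -i on this page
    command.append(project)
    command += ["--nproc", str(nproc)]
    return command
```

The returned list can be passed to `subprocess.run(command, check=True)` from the PhyloPhlAn directory. Building a list rather than a shell string avoids quoting problems if a project name contains unusual characters.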
- MUSCLE (http://www.drive5.com/muscle/) version v3.8.31 or higher must be present in the system path and called "muscle"
- USEARCH (http://www.drive5.com/usearch/) version v5.2.32 (notice that version 6 is currently NOT supported) must be present in the system path and called "usearch"
- FastTree (http://www.microbesonline.org/fasttree/) version 2.1 or higher must be present in the system path and called "FastTree"
- Biopython (http://biopython.org/wiki/Download): it is actually a PyPhlAn dependency, but it is used inside PhyloPhlAn
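The requirements listed above can be checked before launching a long run. This is a minimal sketch using only the Python standard library; it verifies that the executables are reachable under the exact names given above and that the Biopython package (`Bio`) can be imported, but it does not check version numbers. The function names are hypothetical.

```python
import importlib.util
import shutil

# Executable names exactly as required in the list above.
REQUIRED_EXECUTABLES = ("muscle", "usearch", "FastTree")


def find_executables(names=REQUIRED_EXECUTABLES):
    """Map each executable name to its full path, or None if it is not on the PATH."""
    return {name: shutil.which(name) for name in names}


def missing_executables(names=REQUIRED_EXECUTABLES):
    """Return the names of the executables that could not be found."""
    return [name for name, path in find_executables(names).items() if path is None]


def biopython_available():
    """Return True if the Biopython package (imported as Bio) is installed."""
    return importlib.util.find_spec("Bio") is not None
```

If `missing_executables()` returns a non-empty list, or `biopython_available()` returns `False`, the corresponding dependency needs to be installed or added to the system path first.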
The authors of PhyloPhlAn would like to thank Ashlee Earl and the Human Microbiome Project Strains Working Group for insightful suggestions, Morgan Price for his helpful comments on applying FastTree, and Levi Waldron, Joshua Reyes, and Timothy Tickle for their suggestions on methodology and tree visualization.
Changes in version 0.99 (8 May 2013)
Updates:
- PyPhlAn dependency removal
- command line arguments simplified
Changes in version 0.98 (28 July 2012)
Bug fixes:
- missing data file added
Changes in version 0.97 (24 July 2012)
First public release