Home
PhyloPhlAn is an integrated pipeline for large-scale phylogenetic profiling of genomes and metagenomes.
Most likely the easiest way to understand how you can use PhyloPhlAn in your analysis is to check out the examples in the PhyloPhlAn tutorial.
There are two installation methods available. We recommend the Conda-based one, which guarantees that all PhyloPhlAn dependencies are automatically satisfied.
This requires a working Conda installation.
conda install -c bioconda phylophlan
Note 1: we recommend you install PhyloPhlAn in a new, dedicated environment so that all dependencies will be properly resolved by conda. This can be easily done with:
conda create -n "phylophlan" -c bioconda phylophlan=3.1.1
Note 2: for generating the four default configuration files, after the installation please execute:
phylophlan_write_default_configs.sh [output_folder]
Step 1: Get PhyloPhlAn from the GitHub repository
This requires git.
git clone https://github.com/biobakery/phylophlan
cd phylophlan
python setup.py install
Step 2: Install the Dependencies and Tools necessary to run PhyloPhlAn
To verify that PhyloPhlAn is properly installed, you can execute the following command:
phylophlan --version
that should output:
PhyloPhlAn version 3.0 (1 April 2020)
Note: The above version number and date might be different according to the version you have installed.
If you use PhyloPhlAn, please cite the following paper:
Precise phylogenetic analysis of microbial isolates and genomes from metagenomes using PhyloPhlAn 3.0
Francesco Asnicar, Andrew Maltez Thomas, Francesco Beghini, Claudia Mengoni, Serena Manara, Paolo Manghi, Qiyun Zhu, Mattia Bolzan, Fabio Cumbo, Uyen May, Jon G. Sanders, Moreno Zolfo, Evguenia Kopylova, Edoardo Pasolli, Rob Knight, Siavash Mirarab, Curtis Huttenhower, and Nicola Segata
Nat Commun 11, 2500 (2020)
DOI: https://doi.org/10.1038/s41467-020-16366-7
phylophlan -i <input_folder> \
-d <database> \
--diversity <low-medium-high> \
-f <configuration_file>
where:
- <input_folder> is the folder containing your input genomes and/or proteomes, a detailed description is available here
- <database> is the name of the database of markers to use, a detailed description is available here
- --diversity takes a value in {low, medium, high} and is used to automatically adapt the analysis to the type of phylogeny to build, a detailed description is available here
- <configuration_file> is the path to the configuration file necessary to properly run PhyloPhlAn 3.0, a detailed description is available here
PhyloPhlAn 3.0 takes FASTA files (also compressed in Gzip, .gz, and/or Bzip2, .bz2) as input.
Inputs can be genomes, proteomes, or a mix of both. By default, genomes and proteomes are distinguished by the .fna and .faa extensions, respectively.
If needed, the genome and proteome file extensions can be specified using the --genome_extension and --proteome_extension params, respectively.
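As a rough illustration of the extension convention described above, the following Python sketch classifies file names as genomes or proteomes. The function name `classify_input` and its arguments are assumptions made for this example and are not part of PhyloPhlAn.

```python
from pathlib import Path

# Compression suffixes mentioned above; stripped before checking the extension.
COMPRESSION_SUFFIXES = (".gz", ".bz2")


def classify_input(filename, genome_extension=".fna", proteome_extension=".faa"):
    """Return 'genome', 'proteome', or None for a file name (illustrative only)."""
    name = Path(filename).name
    for suffix in COMPRESSION_SUFFIXES:
        if name.endswith(suffix):
            name = name[: -len(suffix)]
            break
    if name.endswith(genome_extension):
        return "genome"
    if name.endswith(proteome_extension):
        return "proteome"
    return None


if __name__ == "__main__":
    for example in ["a.fna", "b.faa.gz", "c.fna.bz2", "notes.txt"]:
        print(example, classify_input(example))
```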
When using PhyloPhlAn 3.0, the user can customize each step of the pipeline to build the tree (marker genes identification, multiple sequence alignment, concatenation or gene trees inference, and phylogeny reconstruction) by specifying the desired tools in the configuration file. These steps should be tuned according to the type of markers present in the database and the input used in the analysis:
- when both markers and inputs are nucleotides, the phylogenetic analysis will be done on nucleotides and the configuration file should specify the tools and params to work with nucleotides
- when markers are proteins and inputs are a mix of genomes and proteomes, the analysis will proceed in translated sequence space, i.e., on amino acids. If the inputs are all genomes, the user can specify the --force_nucleotides parameter to perform the phylogenetic analysis on nucleotides. In that case, the configuration file should be created using the --force_nucleotides parameter with the phylophlan_write_config_file script.
The --diversity
parameter allows for three pre-defined options to set several parameters at once (e.g., trimming, subsampling, fragmentary removal, etc.) in accordance with the expected diversity of the phylogeny to be built.
The user can choose among three values:
Diversity | Description |
---|---|
low |
for species- and strain-level phylogenies |
medium |
for genus- and family-level phylogenies |
high |
for tree-of-life and higher-ranked taxonomic levels phylogenies |
If not specified, PhyloPhlAn 3.0 will automatically run with the --accurate
option, which will consider more phylogenetic positions and should result in a more accurate phylogenetic reconstruction.
The --fast
option can be specified to have a faster phylogenetic pipeline.
Both options will affect several other parameters that depend on the --diversity
parameter. A detailed description is available here.
All files produced by PhyloPhlAn 3.0 are available in the <input_folder>_<database> folder (or in the folder specified with --output_folder).
Inside there is a temporary folder (<input_folder>_<database>/tmp) that contains all the intermediate and temporary files produced during the analysis.
Depending on the configuration file, and hence on the type of phylogenetic analysis performed, the resulting output files may have different names.
For instance, using the supermatrix_aa.cfg configuration file, which can be automatically generated using the phylophlan_write_default_configs.sh script, the output files will be:
Filename | Description |
---|---|
RAxML_bestTree.input_folder_refined.tre | is the final (refined) phylogeny produced by RAxML starting from the FastTree phylogeny |
input_folder.tre | is the phylogeny built by FastTree |
input_folder.aln | is the multiple sequence alignment used as input for the phylogenies, in FASTA format |
The user can specify the number of CPUs to use with the --nproc
parameter:
phylophlan -i <input_folder> \
-d <database> \
--diversity <low-medium-high> \
-f <configuration_file> \
--nproc <N>
Please note that regardless of the number of CPUs specified with --nproc, PhyloPhlAn 3.0 will run:
- RAxML with no more than 20 CPUs when --nproc is greater than 20, as in our experience using more than 20 CPUs with RAxML does not shorten the computational time required for the phylogeny reconstruction.
- FastTree with 3 CPUs (as suggested in the FastTree FAQs). This is not regulated by the --nproc param because FastTree uses the OMP_NUM_THREADS variable, which is defined in the configuration file.
Note: if you specify with --nproc a higher number of CPUs than those available on your machine, you will experience a significant drop in performance, as also reported in the RAxML manual.
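The CPU rules above can be summarized with a small Python sketch. The function `effective_threads` is hypothetical and only mirrors the behaviour described in this section (a cap of 20 for RAxML and a fixed value of 3 for FastTree as set in the default configuration); it is not code taken from PhyloPhlAn.

```python
RAXML_MAX_THREADS = 20  # cap described above
FASTTREE_THREADS = 3    # OMP_NUM_THREADS value of the default configuration


def effective_threads(nproc):
    """Return the number of threads each tool would use for a given --nproc."""
    if nproc < 1:
        raise ValueError("nproc must be a positive integer")
    return {
        "raxml": min(nproc, RAXML_MAX_THREADS),
        # FastTree ignores --nproc: its thread count comes from the config file.
        "fasttree": FASTTREE_THREADS,
    }


if __name__ == "__main__":
    print(effective_threads(64))
```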
PhyloPhlAn 3.0 is able to automatically download two databases of universal markers for prokaryotes:
- PhyloPhlAn (-d phylophlan, 400 universal marker genes) presented in Segata, N et al. NatComm 4:2304 (2013)
- AMPHORA2 (-d amphora2, 136 universal marker genes) presented in Wu M, Scott AJ Bioinformatics 28.7 (2012)
Moreover, in addition to the two databases provided, as explained in the following database setup section, it is possible to retrieve a set of core proteins of a specific species, or even build custom databases starting from either a folder containing marker files or a multi-fasta file containing the marker sequences (e.g., multi-fasta file with the core genes sequences from Roary).
If you wish to download the databases and make them available offline, you can follow one of the following options:
Option 1. The easiest thing to do is to run phylophlan from a machine with an internet connection specifying the database you want to use and the location where to store it using the --databases_folder
param.
phylophlan [mandatory_params] -d phylophlan --databases_folder /my/databases/folder --verbose
phylophlan [mandatory_params] -d amphora2 --databases_folder /my/databases/folder --verbose
Note: You can kill the runs above as soon as the database is downloaded and set up.
Option 2. Download the phylophlan_databases.txt file, then download the files listed inside it and put them inside /my/databases/folder:
Note 1: The following commands assume that the current working directory is /my/databases/folder.
You can verify the md5 checksums of the .tar archives and compare them with those in the .md5 files just downloaded:
diff <(md5sum amphora2.tar) amphora2.md5
diff <(md5sum phylophlan.tar) phylophlan.md5
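Where md5sum is not available, the same check can be sketched in Python. The helper names below are hypothetical, and the code assumes that each .md5 file starts with the hexadecimal digest, as in the output of md5sum.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the md5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_md5_file(archive_path, md5_path):
    """Compare an archive with the digest stored in its .md5 file."""
    # Assumption: the first whitespace-separated token is the expected digest.
    with open(md5_path) as handle:
        expected = handle.read().split()[0]
    return md5_of_file(archive_path) == expected.lower()
```

For example, `matches_md5_file("amphora2.tar", "amphora2.md5")` would return True when the archive is intact.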
Then you need to untar the tar files and decompress their contents:
tar -xf amphora2.tar
bzcat amphora2/*.bz2 > amphora2/amphora2.faa
tar -xf phylophlan.tar
bunzip2 -k phylophlan/phylophlan.bz2
Finally, you can index the databases. In doing so, make sure you use the very same version of the tool that you will specify in the PhyloPhlAn configuration file when running your phylogenetic analysis. For instance, if you are going to use diamond you can index the databases with the following commands:
diamond makedb --in amphora2/amphora2.faa --db amphora2/amphora2
diamond makedb --in phylophlan/phylophlan.faa --db phylophlan/phylophlan
Note: Thanks to Eric Deveaud for the suggestions in putting this section together.
In this section, we provide as many details as possible for the parameters and configurations available in PhyloPhlAn 3.0.
When building a phylogeny, PhyloPhlAn 3.0 makes sure that input genomes/proteomes and markers respect a certain threshold of quality. It is possible to customize these thresholds through the two following parameters:
- --min_num_proteins <n>: used to discard proteomes (.faa) with less than the specified number of proteins. Default is 1.
- --min_len_protein <n>: this parameter is associated with the previous one and it is used to specify the minimum length of a protein in the proteomes. Proteins shorter than this value will not be considered. Default is 50.
The above two parameters have no effect when inputs are only genomes, see this section for more information.
- --min_num_markers <n>: input genomes or proteomes that map to less than the specified number of markers will be discarded. Default is 1, unless the database specified with -d is phylophlan or amphora2, in which case the default is 100 or 34, respectively.
- --min_num_entries <n>: database markers that are found in less than the specified number of inputs will be discarded. Default is 4.
- --remove_fragmentary_entries: if specified, the multiple sequence alignment (MSA) will be checked and cleaned from fragmentary entries. See --fragmentary_threshold for the threshold above which an entry will be considered fragmentary. Default is False.
- --fragmentary_threshold <n>: used to specify the fraction of gaps above which an input in the MSA is considered fragmentary and hence removed. Default is 0.85.
- --remove_only_gaps_entries: if specified, entries in the MSAs composed only of gaps will be removed. This is equivalent to specifying --remove_fragmentary_entries and --fragmentary_threshold 1. Default is False.
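To make the fragmentary-entry filter more concrete, here is a minimal Python sketch. It assumes that the MSA is available as a dictionary of equal-length strings and that an entry is removed when its gap fraction is greater than or equal to the threshold, a choice made here so that a threshold of 1 removes only all-gap entries. The function names are hypothetical and the exact comparison used by PhyloPhlAn may differ.

```python
def gap_fraction(sequence, gap_char="-"):
    """Fraction of gap characters in an aligned sequence."""
    if not sequence:
        return 1.0
    return sequence.count(gap_char) / len(sequence)


def remove_fragmentary(msa, threshold=0.85):
    """Keep only the entries whose gap fraction is below the threshold."""
    return {name: seq for name, seq in msa.items() if gap_fraction(seq) < threshold}


if __name__ == "__main__":
    msa = {
        "full": "A" * 20,
        "only_gaps": "-" * 20,
        "fragment": "A" * 2 + "-" * 18,  # gap fraction 0.90
        "partial": "A" * 5 + "-" * 15,   # gap fraction 0.75
    }
    print(sorted(remove_fragmentary(msa)))                 # default threshold
    print(sorted(remove_fragmentary(msa, threshold=1.0)))  # only-gaps removal
```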
The following table shows the parameters affected by the combination of the --diversity
and --accurate/--fast
parameters.
Diversity | --accurate | --fast |
---|---|---|
--diversity low | --submat pfasum60 --trim not_variant --remove_fragmentary_entries --not_variant_threshold 0.99 | --submat pfasum60 --trim greedy --remove_fragmentary_entries --fragmentary_threshold 0.85 --subsample fivehundred --scoring_function trident --gap_perc_threshold 0.67 |
--diversity medium | --submat pfasum60 --trim gap_trim --remove_fragmentary_entries --fragmentary_threshold 0.85 --subsample onehundred --scoring_function trident | --submat pfasum60 --trim greedy --remove_fragmentary_entries --fragmentary_threshold 0.75 --subsample fifty --scoring_function trident --not_variant_threshold 0.97 --gap_perc_threshold 0.75 |
--diversity high | --submat pfasum60 --trim greedy --remove_fragmentary_entries --fragmentary_threshold 0.75 --subsample twentyfive --scoring_function trident --not_variant_threshold 0.95 --gap_perc_threshold 0.85 | --submat pfasum60 --trim greedy --remove_fragmentary_entries --fragmentary_threshold 0.67 ( --subsample phylophlan or --subsample tenpercent ) --scoring_function trident --not_variant_threshold 0.9 --gap_perc_threshold 0.85 |
Note: if you manually specify one or more of the above parameters on the command line, they will override the automatic values set for those parameters by the specific combination of --diversity and --accurate/--fast.
You can specify the trimming strategy to use with the --trim parameter. The user can choose between four different options:
--trim | Description |
---|---|
gap_trim | performs what is specified in the trim section of the configuration file, which by default is trimAl with the -gappyout parameter, as presented in Capella-Gutiérrez S, et al. Bioinformatics 25.15 (2009) and in the trimAl website |
gap_perc | removes columns with a percentage of gaps above a certain threshold, regulated by the --gap_perc_threshold parameter, whose default value is 0.67 |
not_variant | removes columns of the multiple sequence alignment in which at least one amino acid appears with a frequency above the threshold set by the --not_variant_threshold parameter, whose default value is 0.99 |
greedy | performs all the above trimming options |
The default is None, in which case the trimming step will not be performed.
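The gap_perc and not_variant strategies can be sketched in Python as column filters. This is only an illustration under stated assumptions: the MSA is a list of equal-length strings, frequencies are computed over the whole column (gaps included in the denominator), and a column is removed when the value is strictly above the threshold. The function names are hypothetical and PhyloPhlAn's actual implementation may differ in these details.

```python
from collections import Counter


def keep_gap_perc(column, threshold=0.67):
    """Keep a column unless its fraction of gaps is above the threshold."""
    return column.count("-") / len(column) <= threshold


def keep_not_variant(column, threshold=0.99):
    """Keep a column unless one symbol's frequency is above the threshold."""
    counts = Counter(symbol for symbol in column if symbol != "-")
    if not counts:
        return True  # only gaps: left to the gap-based filter
    return max(counts.values()) / len(column) <= threshold


def trim(msa, keep_filters):
    """Return the MSA restricted to the columns accepted by every filter."""
    kept = [col for col in zip(*msa) if all(keep(col) for keep in keep_filters)]
    if not kept:
        return ["" for _ in msa]
    return ["".join(row) for row in zip(*kept)]


if __name__ == "__main__":
    msa = ["AC-T", "AG-T", "AT-A", "AAGA"]
    # Applying both filters mimics the idea behind the greedy option.
    print(trim(msa, [keep_gap_perc, keep_not_variant]))
```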
Site subsampling strategy allows retaining only a certain amount of phylogenetically relevant positions (selected based on the scoring function).
In PhyloPhlAn 3.0, you can specify the subsample strategy using the --subsample
parameter.
There are several options available that will set a different amount of retained positions:
--subsample | Description |
---|---|
phylophlan | uses the formula presented in Segata, N et al. NatComm 4:2304 (2013) to determine how many positions to retain for each of the 400 PhyloPhlAn markers |
onethousand | retains up to 1,000 positions for each marker |
sevenhundred | retains up to 700 positions for each marker |
fivehundred | retains up to 500 positions for each marker |
threehundred | retains up to 300 positions for each marker |
onehundred | retains up to 100 positions for each marker |
fifty | retains up to 50 positions for each marker |
twentyfive | retains up to 25 positions for each marker |
tenpercent | retains 10% of the positions for each marker |
twentyfivepercent | retains 25% of the positions for each marker |
fiftypercent | retains 50% of the positions for each marker |
Note: the --subsample phylophlan option works only when using the PhyloPhlAn database, specified via -d phylophlan.
The default is None. In this case, the subsampling will not be performed and the full-length alignment will be used.
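The subsampling options listed above can be read as a mapping from a strategy name to a number of retained positions, which the following Python sketch makes explicit. The function name and the rounding of the percentage-based options (rounded down here) are assumptions, and the phylophlan formula is deliberately not reproduced.

```python
FIXED_POSITIONS = {
    "onethousand": 1000,
    "sevenhundred": 700,
    "fivehundred": 500,
    "threehundred": 300,
    "onehundred": 100,
    "fifty": 50,
    "twentyfive": 25,
}
PERCENT_POSITIONS = {"tenpercent": 10, "twentyfivepercent": 25, "fiftypercent": 50}


def positions_to_retain(strategy, alignment_length):
    """Number of positions kept for one marker alignment (illustrative only)."""
    if strategy is None:
        return alignment_length  # default: no subsampling
    if strategy in FIXED_POSITIONS:
        return min(FIXED_POSITIONS[strategy], alignment_length)
    if strategy in PERCENT_POSITIONS:
        # Assumption: fractional results are rounded down.
        return alignment_length * PERCENT_POSITIONS[strategy] // 100
    if strategy == "phylophlan":
        raise NotImplementedError("formula from Segata et al. (2013) not reproduced here")
    raise ValueError(f"unknown subsample strategy: {strategy}")


if __name__ == "__main__":
    print(positions_to_retain("fivehundred", 1200))
    print(positions_to_retain("tenpercent", 1234))
```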
In PhyloPhlAn 3.0, a scoring function is used to assign a phylogenetic score to each column in the MSAs, that will be then used to rank the MSA positions to retain a subset of them (see Subsampling).
The --scoring_function parameter allows three different scoring functions:
--scoring_function | Description |
---|---|
muscle | implements the same scoring function defined in Edgar, RC NAR 32.5 (2004), when specifying the -scorefile param |
trident | implements the trident scoring function as presented in Valdar, WSJ. Proteins 48.2 (2002), which is a weighted combination of symbol diversity, stereochemical diversity, and gap cost |
random | assigns random scores to each position in the MSAs (for testing purposes only) |
Some of the functions for scoring the MSA columns need a substitution matrix to evaluate the expected substitution rates of amino acids.
Substitution matrices can be specified using the --submat param, which can assume one of the following values:
--submat | Description |
---|---|
vtml200 | substitution matrix proposed by Yamada K, Tomii K Bioinformatics 30.3 (2014) |
vtml240 | substitution matrix used in Edgar RC NAR 32.5 (2004) |
miqs | substitution matrix proposed by Tomii K and Kazunori Y Humana Press, New York, NY, 1415 (2016) |
pfasum60 | substitution matrix proposed by Keul F et al. BMC Bioinformatics 18.1 (2017) |
The substitution matrices presented above are distributed within PhyloPhlAn 3.0.
However, the set of substitution matrices could be extended with user-defined ones.
Users can generate their own substitution matrices using the scripts (generate_matrices.sh and serialize_matrix.py) provided in the phylophlan_substitution_matrices folder.
If you are running a gene tree pipeline, you also have to specify the --maas parameter, providing a mapping file that specifies the substitution model to use for each specific marker. Within PhyloPhlAn you can find the phylophlan.tsv file (inside the phylophlan_substitution_models folder) that lists the substitution models for each of the 400 universal markers of the PhyloPhlAn database.
The format of the file is very simple: it is a two-column, TAB-separated file, where the first column contains the name of the marker and the second the name of the substitution model to use.
For example, the first 6 lines of the phylophlan.tsv file:
p0000 PROTCATLG
p0001 PROTCATLG
p0002 PROTCATLG
p0003 PROTCATLG
p0004 PROTCATLG
p0005 PROTCATCPREVF
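A mapping file in this two-column format can be read with a few lines of Python, for instance to check a custom file before passing it to --maas. The function name `read_marker_models` is hypothetical, and the sketch simply skips lines that do not have two columns.

```python
import csv


def read_marker_models(path):
    """Read a TAB-separated marker-to-model mapping file into a dictionary."""
    models = {}
    with open(path, newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            if len(row) < 2:
                continue  # skip empty or malformed lines
            marker, model = row[0], row[1]
            models[marker] = model
    return models
```

For the lines shown above, `read_marker_models("phylophlan.tsv")["p0005"]` would be `"PROTCATCPREVF"`.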
PhyloPhlAn 3.0 implements the --mutation_rates
option that computes the amount of nucleotide or amino acid changes in each aligned marker.
In the output folder <input_folder>_<database>/mutation_rates/
, you can find a mutation rate table for all the markers whereas the <input_folder>_<database>/mutation_rates.tsv
output file contains the summarized mutation rates table for the complete multiple sequence alignment.
The upper-triangular of the mutation rates table contains the decimal value of the mutation rate (e.g., 0.01), while the lower-triangular contains the fraction (e.g., 1/100), which can be used to evaluate if the value is computed over a meaningful number of positions w.r.t. the length of the MSA.
Using the --sort
parameter it is possible to sort the markers and hence force PhyloPhlAn 3.0 to consider them in a specific order when concatenating the sequences.
When using the PhyloPhlAn database (-d phylophlan
), --sort
will be automatically set to True
.
Note: the sort preference is used only for the super-matrix approach (concatenation).
To build a custom database, we provide the phylophlan_setup_database
script to be run with the following syntax:
phylophlan_setup_database -i <input_file_or_folder> \
-d <database_name> \
-e <input_extension> \
-t <database_type>
where:
- <input_file_or_folder>: is the folder containing the markers' files or a multi-fasta file containing the markers
- <database_name>: is the database name chosen by the user (the name to provide to phylophlan when running it)
- <input_extension>: is the extension of the input file(s)
- <database_type>: has to be n if the user is using a nucleotide database or a if the user is using an amino acids database
The database will be created in the same folder of the input file(s), or you can specify an output folder with the -o
option.
The phylophlan_setup_database
script can also be used to automatically retrieve a set of core proteins of a specific species using the -g
option (instead of the -i
param). In this case, you need to specify the species name like -g s__<species_name>
. This is also going to be the default name of the database if not differently specified with -d
.
In this case, a set of UniRef90 species-specific proteins for the species_name provided will be downloaded. As UniRef90 IDs might change over time, you might see failed downloads in the output of the program for some of the proteins. The phylophlan_setup_database script will save them and re-try the download by using the UniRef APIs to resolve the old IDs into the new ones. If the second attempt also fails to download some of the UniRef90 proteins, those will be reported in the <species_name>_core_proteins_not_mapped.txt file, saved inside the database folder.
PhyloPhlAn 3.0 relies on the configuration file for handling the external software and their parameters.
A configuration file can be specified in phylophlan
with -f <config_file>
.
A configuration file is composed of different sections (some are mandatory and needed to ensure to execute the minimum steps in the pipeline to complete a phylogenetic analysis, and some are optional). Each section refers to a specific step in the phylogenetic pipeline and contains all the details for the external software to be correctly executed.
In PhyloPhlAn 3.0 you can find the phylophlan_write_default_configs.sh
script that will generate four ready-to-use configuration files:
supermatrix_aa.cfg
supermatrix_nt.cfg
supertree_aa.cfg
supertree_nt.cfg
More information about the supermatrix and supertree approaches is available in the following section.
If you want to generate your own configuration file, you can use the phylophlan_write_config_file
script.
Below is an example of the command used to create a customized configuration file where diamond is used instead of blastn and muscle instead of mafft, with respect to the supermatrix_nt.cfg configuration file generated by the phylophlan_write_default_configs.sh script:
python phylophlan_write_config_file \
-o custom_config_nt.cfg \
-d n \
--db_dna makeblastdb \
--map_dna diamond \
--msa muscle \
--trim trimal \
--tree1 fasttree \
--tree2 raxml
where:
- -o: is the output filename
- -d: indicates the type of database this configuration file is tailored for (a detailed description is available here)
- --db_dna, --map_dna, --msa, --trim, --tree1, --tree2: indicate the sections the configuration file will contain
Note 1: If you are going to use MAFFT and either /local-storage or /tmp is available on your system, it will be used for the temporary files by exporting the TMPDIR variable. If you want to change the temporary folder for MAFFT, add (or edit) the following line in the [msa] section of your config file where MAFFT is specified:
environment = TMPDIR=/path/to/temp/folder
Note 2: If you specified fasttree in your configuration file, the number of CPUs will be set to 3, as suggested in the FastTree FAQs. If you want to change the number of CPUs for fasttree, add (or edit) the following line in the [tree1] section of your config file where fasttree is specified:
environment = OMP_NUM_THREADS=3
Note 3: Please, if you are going to use DIAMOND in your analysis, be aware that there are known issues.
Note 4: Please, if you are going to use MAFFT in your analysis, be aware that there are known issues.
The following sections are strictly required in any configuration file:
Mandatory section | Description |
---|---|
--db_dna and/or --db_aa | specify the command to use for creating an indexed database; choices for db_dna: makeblastdb; choices for db_aa: usearch, diamond |
--map_dna and/or --map_aa | specify the software for mapping the database against genomes and proteomes, respectively; choices for map_dna: blastn, tblastn, diamond; choices for map_aa: usearch, diamond |
--msa | specify the software for performing the multiple-sequence alignment; choices are: muscle, mafft, opal, upp |
--tree1 | specify the software for inferring the phylogeny; choices are: fasttree, raxml, iqtree, astral, astrid |
Optional section | Description |
---|---|
--trim | specify the software trimal for performing the trimming of gappy regions |
--gene_tree1 | specify the software to use for building the single-gene trees; choices are: fasttree, raxml, iqtree |
--gene_tree2 | specify the software raxml for refining the phylogenies built at the gene_tree1 step |
--tree2 | specify the software raxml for refining the phylogeny built at the tree1 step |
PhyloPhlAn 3.0 allows users to integrate new tools that are not available in the framework, as well as their parameters, for each of the different steps. This is done by manually editing the configuration file or creating a new configuration file with the desired tools/parameters.
Important: The only requirement for this integration is that the input and output files of the tool to be integrated are in the same format used by PhyloPhlAn.
Here is an example section of a default supermatrix configuration file that uses MAFFT for MSA:
[msa]
program_name = mafft
params = --quiet --anysymbol --thread 1 --auto
version = --version
command_line = #program_name# #params# #input# > #output#
And here is the same section manually modified to use Clustal Omega that is not a default option in PhyloPhlAn 3.0 for the MSA:
[msa]
program_name = clustalo
input = -i
output = -o
params = --threads 1 --auto
version = --version
command_line = #program_name# #params# #input# #output#
A configuration file can be composed of several different sections, but there is a minimum set of sections that has to be present to complete a phylogenetic analysis.
The mandatory sections are:
- map_dna and/or map_aa
- msa
- tree1
The complete list of available sections is:
map_dna
map_aa
msa
trim
gene_tree1
gene_tree2
tree1
tree2
Each of the above sections can have several different options specified.
These are required in order to compose a command line that can run an external tool.
The set of mandatory options that each of the sections in a configuration file has to specify are:
program_name
command_line
The complete list of available options is:
program_name
params
threads
input
database
output_path
output
version
environment
command_line
In particular, the command_line
option specifies how the other options should be arranged in order to build a running command line. For instance, taking the following section of a configuration file as an example:
[msa]
program_name = mafft
params = --quiet --anysymbol --thread 1 --auto
version = --version
command_line = #program_name# #params# #input# > #output#
The command_line option specifies that the value of the program_name option comes first, followed by the value of the params option and then the input option.
After the input option, there is the output redirect sign (>) followed by the output option.
Note 1: if no input
option is specified, PhyloPhlAn 3.0 will read the input from the standard input.
Note 2: if no output
option is specified, PhyloPhlAn 3.0 will redirect the output to the output file.
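The placeholder mechanism of command_line can be illustrated with a short Python sketch based on the standard configparser module. It uses the Clustal Omega section shown earlier and assumes that #input# and #output# expand to the flag stored in the corresponding option followed by a file path. The function `build_command` and the file names are hypothetical, and the way PhyloPhlAn itself composes the command may differ.

```python
import configparser

EXAMPLE_CONFIG = """
[msa]
program_name = clustalo
input = -i
output = -o
params = --threads 1 --auto
version = --version
command_line = #program_name# #params# #input# #output#
"""


def build_command(section, input_file, output_file):
    """Expand the #option# placeholders of command_line (illustrative only)."""
    values = {key: value for key, value in section.items() if key != "command_line"}
    # Assumption: the input/output options hold the flag preceding the file path.
    values["input"] = " ".join(p for p in (section.get("input", ""), input_file) if p)
    values["output"] = " ".join(p for p in (section.get("output", ""), output_file) if p)
    command = section["command_line"]
    for option, value in values.items():
        command = command.replace("#" + option + "#", value)
    return command


if __name__ == "__main__":
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(EXAMPLE_CONFIG)
    print(build_command(parser["msa"], "in.faa", "out.aln"))
```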
PhyloPhlAn 3.0 allows performing either a Supermatrix (or concatenation) pipeline or a Supertree (or gene trees) pipeline.
The type of phylogenetic pipeline that will be executed is determined based on the settings present in the configuration file.
The Supermatrix pipeline is the default in PhyloPhlAn 3.0, as also determined by the mandatory sections.
In other words, when neither the gene_tree1 nor the gene_tree2 section is present in the configuration file, PhyloPhlAn 3.0 will perform a concatenation pipeline.
The Supertree (gene trees) approach is to be preferred when building a large phylogeny, and it requires the gene_tree1 section to be present in the configuration file.
In order to use a gene trees pipeline, the user has to manually edit the [tree1]
section in the configuration file in which the paths to the ASTRAL jar file and the example file for the version
option (needed to verify the correct installation of ASTRAL) need to be specified.
Below is an example template of the [tree1] section that needs to be edited:
[tree1]
command_line = #program_name# #input# #output#
program_name = java -jar /../path_to_astral/../astral.4.11.1.jar
input = -i
output = -o
version = -i /../path_to_astral/../astral-4.11.1/test_data/song_mammals.424.gene.tre
Note: the order of the options for the [tree1]
section can differ from the above example when the config is automatically generated.
PhyloPhlAn 3.0 allows you to assign to each bin that comes from a metagenomic assembly analysis its closest species-level genome bins (SGBs, as defined in Pasolli, E et al. Cell (2019)).
The only mandatory parameter is -i
, followed by the name of the input directory that contains the bins, for example:
phylophlan_metagenomic -i <input_folder>
Other parameters that can be specified are:
- -o: allows you to decide the output prefix that will be used for the two output directories and the output file. If not specified, the prefix used is <input_folder>, so the two output folders will be <input_folder>_dists and <input_folder>_sketches, and the output file will be <input_folder>.tsv
- -n: allows you to decide how many SGBs (sorted by increasing average genomic distance) will be reported for each input bin in the output file; the keyword all is accepted. If not specified, default is 10
- --nproc: allows you to set how many CPUs can be used. Default is 1
A practical example of its usage is given in the example 3. Metagenomic analysis of the Ethiopian cohort
The phylophlan_metagenomic
script has three different types of outputs: (1) list of the top -n/--how_many
SGBs sorted by their average Mash distance, (2) closest SGB, GGB, FGB, and reference genomes, and (3) "all vs. all" matrix of all pairwise Mash distances.
Output 1
Each line reports the bin name and the list of the closest SGBs (sorted by their increasing average Mash distance) in a tab-separated fashion.
The fields of each SGB are separated by a colon (:). For example:
my_bin (k|u)SGB_ID:taxa_level:taxonomy:average_mash_distance [(k|u)SGB_ID:taxa_level:taxonomy:average_mash_distance]
Where:
- my_bin: is the input bin name
- (k|u)SGB_ID: is the SGB ID and starts with either k or u to indicate whether it is a known or an unknown SGB
- taxa_level: can be either Species, Genus, Family, or Phylum, depending on the taxonomic level the SGB has been assigned to
- taxonomy: is the full taxonomic label assigned to the SGB
- average_mash_distance: is the average Mash distance of the input bin w.r.t. all the genomes in the SGB.
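A line in the Output 1 format can be parsed with the following Python sketch. The class and function names are hypothetical, and the code assumes only the layout described above: tab-separated entries, each with four colon-separated fields, the last of which is a number.

```python
from typing import NamedTuple


class SgbHit(NamedTuple):
    sgb_id: str
    taxa_level: str
    taxonomy: str
    average_mash_distance: float


def parse_output_line(line):
    """Split one line into the bin name and its list of SGB hits."""
    bin_name, *entries = line.rstrip("\n").split("\t")
    hits = []
    for entry in entries:
        # The first two and the last field are split off explicitly, so a
        # colon inside the taxonomy label would not break the parsing.
        sgb_id, taxa_level, rest = entry.split(":", 2)
        taxonomy, distance = rest.rsplit(":", 1)
        hits.append(SgbHit(sgb_id, taxa_level, taxonomy, float(distance)))
    return bin_name, hits


def is_known(hit):
    """True when the SGB ID starts with k (known SGB)."""
    return hit.sgb_id.startswith("k")
```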
Output 2
Similar to Output 1, with the difference that the information reported is for the closest SGB, then the closest GGB, followed by the closest FGB, and finally the closest reference genomes, according to their respective Mash distances.
Output 3
In this case, phylophlan_metagenomic produces a square matrix of all pairwise distances among the input bins only.
This feature is used for retrieving reference genomes of a specified taxonomy.
This is particularly useful when you need to build a tree to phylogenetically compare your genomes with those available in public databases.
The only mandatory parameter is -g <label>
used to specify the taxonomic label for which you need to download the reference genomes.
The <label> must represent any valid taxonomic level or the special case all:
- -g s__<species_name>: an example is given in 1. High-resolution phylogeny of 135 Staphylococcus aureus isolate genomes
- -g all: an example is given in 2. Build the tree of life and insert newly sequenced genomes into it
The phylophlan_strain_finder
script can be used to automatically detect subtrees in a phylogeny that are likely representing a strain, based on two measures that can be computed during the PhyloPhlAn 3.0 phylogenetic analysis: the phylogenetic distance and the mutation rates between all nodes of a subtree.
The threshold values for these two measures can be tuned using:
- --phylo_thr <num>: the normalized phylogenetic distance between any two nodes of the same subtree
- --mutrate_thr <num>: the mutation rate between any two nodes of the same subtree
When both of these conditions are satisfied for all nodes of a sub-tree, they are defined as the same strain.
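The rule above can be sketched in Python as a check over all pairs of leaves in a candidate subtree. This is an illustration only: the function name, the dictionary-based inputs, and the choice that a condition is satisfied when the value is less than or equal to its threshold are assumptions, not the actual phylophlan_strain_finder implementation.

```python
from itertools import combinations


def is_same_strain(leaves, phylo_dist, mut_rate, phylo_thr, mutrate_thr):
    """Check both thresholds for every pair of leaves in a subtree.

    phylo_dist and mut_rate map a sorted (leaf_a, leaf_b) tuple to a value.
    """
    for pair in combinations(sorted(leaves), 2):
        if phylo_dist[pair] > phylo_thr or mut_rate[pair] > mutrate_thr:
            return False
    return True


if __name__ == "__main__":
    # Hypothetical values for three genomes in one subtree.
    pairs = [("g1", "g2"), ("g1", "g3"), ("g2", "g3")]
    phylo = dict(zip(pairs, [0.010, 0.020, 0.015]))
    rates = dict(zip(pairs, [0.001, 0.002, 0.001]))
    print(is_same_strain(["g3", "g1", "g2"], phylo, rates, 0.05, 0.05))
```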
The phylophlan_strain_finder
script requires as input the phylogenetic tree (-i
param) and the mutation rates table with the -m
param:
phylophlan_strain_finder -i <input_tree> -m <mutation_rates.tsv>
Note: PhyloPhlAn 3.0 outputs the <mutation_rates.tsv>
table only if the parameter --mutation_rates
is specified when executing phylophlan
, as explained here.
The phylophlan_draw_metagenomic script can be used to visualize the results obtained from phylophlan_metagenomic. Its basic usage is:
phylophlan_draw_metagenomic -i <output_metagenomic> --map <bin2meta.tsv>
where:
- <output_metagenomic>: is the output file generated by phylophlan_metagenomic, as detailed above
- <bin2meta.tsv>: is a mapping file that links each bin to the metagenome it has been reconstructed from. It is a tab-separated file where the input bins are in the first column and metagenomes in the second column
Note: when building the mapping file, make sure the names used for bins are consistent with the ones used as inputs with phylophlan_metagenomic
A usage example of phylophlan_draw_metagenomic is given in the example 3. Metagenomic analysis of the Ethiopian cohort.
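The mapping file described above is a plain two-column, tab-separated table, so it can be generated with a short script. The sketch below assumes you already know which metagenome each bin comes from; the helper name write_bin2meta and the bin and sample names are hypothetical, and the bin names must match whatever was given to phylophlan_metagenomic.

```python
import csv
import tempfile
from pathlib import Path


def write_bin2meta(bin_to_metagenome, out_path):
    """Write a two-column, tab-separated mapping: bin name, then metagenome."""
    with open(out_path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        for bin_name, metagenome in sorted(bin_to_metagenome.items()):
            writer.writerow([bin_name, metagenome])


# Hypothetical bins and metagenomes, for illustration only.
example = {"bin_01": "sampleA", "bin_02": "sampleA", "bin_03": "sampleB"}

# Written to a temporary folder here so the demo leaves no files behind.
with tempfile.TemporaryDirectory() as tmp:
    mapping_file = Path(tmp) / "bin2meta.tsv"
    write_bin2meta(example, mapping_file)
    print(mapping_file.read_text(), end="")
```

The resulting file is what would be passed to the --map parameter.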
- Python (version >=3.0)
- NumPy (version >=1.12.1)
- Biopython (version >=1.70)
- DendroPy (version >=4.2.0)
PhyloPhlAn 3.0 also needs the following tools:
- At least one phylogenetic inference software tool: RAxML, FastTree, IQ-TREE
- At least one multiple sequence alignment tool: MUSCLE, MAFFT, Opal, UPP
- trimAl for the trimming of the multiple sequence alignment (optional)
- blast+ for database building and mapping of nucleotides databases
- USEARCH and/or DIAMOND for database building and mapping of nucleotides and/or amino acids databases
Because PhyloPhlAn 3.0 is a pipeline that relies on external software, the failure of one of the external tools can occasionally interrupt the execution.
If you use DIAMOND or MAFFT, be aware that they sometimes crash, most likely because of temporary files that were not correctly removed.
If PhyloPhlAn 3.0 crashes during the execution of either DIAMOND or MAFFT, we advise you to try the following, in this order:
- Remove the last directory that has been generated in the output/tmp folder and re-launch the command. PhyloPhlAn will restart from where it failed, so the computation done up to that point is not lost.
- If the previous solution does not work, restart PhyloPhlAn replacing the -i parameter with -c in order to clean all the output and output/tmp folders.
- If the previous solution also does not work, restart PhyloPhlAn with --clean_all, which will remove all installation and database files that are automatically generated at the first run of PhyloPhlAn 3.0.
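The first step above asks for the last directory generated in output/tmp. A minimal Python sketch for locating it follows, assuming that "last generated" can be approximated by the most recent modification time (an assumption, not something PhyloPhlAn guarantees). The helper name latest_subdir is hypothetical, and the sketch only reports the directory so that you can inspect it before removing anything.

```python
from pathlib import Path


def latest_subdir(tmp_dir):
    """Return the most recently modified sub-directory of tmp_dir, or None."""
    tmp = Path(tmp_dir)
    if not tmp.is_dir():
        return None
    subdirs = [p for p in tmp.iterdir() if p.is_dir()]
    # Assumption: modification time is a reasonable proxy for creation order.
    return max(subdirs, key=lambda p: p.stat().st_mtime, default=None)


# "output/tmp" follows the folder naming used in the steps above.
candidate = latest_subdir("output/tmp")
print(candidate)
```

If the reported directory is the one from the crashed step, it can then be removed manually (or with shutil.rmtree) before re-launching the same command.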
This is the main PhyloPhlAn 3.0 script. Other information is available here.
usage: phylophlan.py [-h] [-i PROJECT_NAME | -c CLEAN] [-o OUTPUT]
[-d DATABASE] [-t {n,a}] [-f CONFIG_FILE] --diversity
{low,medium,high} [--accurate | --fast] [--clean_all]
[--database_list] [-s SUBMAT] [--submat_list]
[--submod_list] [--nproc NPROC]
[--min_num_proteins MIN_NUM_PROTEINS]
[--min_len_protein MIN_LEN_PROTEIN]
[--min_num_markers MIN_NUM_MARKERS]
[--trim {gap_trim,gap_perc,not_variant,greedy}]
[--gap_perc_threshold GAP_PERC_THRESHOLD]
[--not_variant_threshold NOT_VARIANT_THRESHOLD]
[--subsample {phylophlan,onethousand,sevenhundred,fivehundred,threehundred,onehundred,fifty,twentyfive,tenpercent,twentyfivepercent,fiftypercent}]
[--unknown_fraction UNKNOWN_FRACTION]
[--scoring_function {trident,muscle,random}] [--sort]
[--remove_fragmentary_entries]
[--fragmentary_threshold FRAGMENTARY_THRESHOLD]
[--min_num_entries MIN_NUM_ENTRIES] [--maas MAAS]
[--remove_only_gaps_entries] [--mutation_rates]
[--force_nucleotides] [--input_folder INPUT_FOLDER]
[--data_folder DATA_FOLDER]
[--databases_folder DATABASES_FOLDER]
[--submat_folder SUBMAT_FOLDER]
[--submod_folder SUBMOD_FOLDER]
[--configs_folder CONFIGS_FOLDER]
[--output_folder OUTPUT_FOLDER]
[--genome_extension GENOME_EXTENSION]
[--proteome_extension PROTEOME_EXTENSION] [--update]
[--verbose] [-v]
PhyloPhlAn is an accurate, rapid, and easy-to-use method for large-scale
microbial genome characterization and phylogenetic analysis at multiple levels
of resolution. PhyloPhlAn can assign finished, draft, or metagenome-assembled
genomes (MAGs) to species-level genome bins (SGBs). For individual clades of
interest (e.g. newly sequenced genome sets), PhyloPhlAn reconstructs strain-
level phylogenies from among the closest species using clade-specific
maximally informative markers. At the other extreme of resolution, PhyloPhlAn
scales to very-large phylogenies comprising >17,000 microbial species
optional arguments:
-h, --help show this help message and exit
-i PROJECT_NAME, --input PROJECT_NAME
-c CLEAN, --clean CLEAN
Clean the output and partial data produced for the
specified project (default: None)
-o OUTPUT, --output OUTPUT
Output folder name, otherwise it will be the name of
the input folder concatenated with the name of the
database used (default: None)
-d DATABASE, --database DATABASE
The name of the database of markers to use (default:
None)
-t {n,a}, --db_type {n,a}
Specify the type of the database of markers, where "n"
stands for nucleotides and "a" for amino acids. If not
specified, PhyloPhlAn will automatically detect the
type of database (default: None)
-f CONFIG_FILE, --config_file CONFIG_FILE
The configuration file to load. Four ready-to-use
configuration files can be generated using the
"write_default_configs.sh" script present in the
"configs" folder (default: None)
--diversity {low,medium,high}
Specify the expected diversity of the phylogeny to
automatically adjust some parameters: "low": for
genus-/species-/strain-level phylogenies; "medium":
for class-/order-level phylogenies; "high": for
phylum-/tree-of-life size phylogenies (default: None)
--accurate Use more phylogenetic signal, which can result in more
accurate phylogeny; affected parameters depend on the
"--diversity" level (default: False)
--fast Perform a faster phylogeny reconstruction by
reducing the phylogenetic positions to be used; affected
parameters depend on the "--diversity" level (default:
False)
--clean_all Remove all installation and database files
automatically generated (default: False)
--database_list List of all the available databases that can be
specified with the -d/--database option (default:
False)
-s SUBMAT, --submat SUBMAT
Specify the substitution matrix to use. Available
substitution matrices can be listed with "--
submat_list" (default: None)
--submat_list List of all the available substitution matrices that
can be specified with the -s/--submat option (default:
False)
--submod_list List of all the available substitution models that can
be specified with the --maas option (default: False)
--nproc NPROC The number of cores to use (default: 1)
--min_num_proteins MIN_NUM_PROTEINS
Proteomes with less than this number of proteins will
be discarded (default: 1)
--min_len_protein MIN_LEN_PROTEIN
Proteins in proteomes shorter than this value will be
discarded (default: 50)
--min_num_markers MIN_NUM_MARKERS
Input genomes or proteomes that map to less than the
specified number of markers will be discarded
(default: 1)
--trim {gap_trim,gap_perc,not_variant,greedy}
Specify which type of trimming to perform: "gap_trim":
execute what specified in the "trim" section of the
configuration file; "gap_perc": remove columns with a
percentage of gaps above a certain threshold (see "--
gap_perc_threshold" parameter); "not_variant": remove
columns with at least one nucleotide/amino acid
appearing above a certain threshold (see "--
not_variant_threshold" parameter); "greedy": performs
all the above trimming steps; if not specified, no
trimming will be performed (default: None)
--gap_perc_threshold GAP_PERC_THRESHOLD
Specify the value used to consider a column not
variant when "--trim not_variant" is specified
(default: 0.67)
--not_variant_threshold NOT_VARIANT_THRESHOLD
Specify the value used to consider a column not
variant when "--trim not_variant" is specified
(default: 0.99)
--subsample {phylophlan,onethousand,sevenhundred,fivehundred,threehundred,onehundred,fifty,twentyfive,tenpercent,twentyfivepercent,fiftypercent}
The number of positions to retain from each single
marker. Available options are: "phylophlan": specific
number of positions for each PhyloPhlAn marker (only
when "--database phylophlan" is specified); "onethousand":
return the top 1000 positions; "sevenhundred": return the
top 700; "fivehundred": return the top 500; "threehundred"
return the top 300; "onehundred": return the top 100
positions; "fifty": return the top 50 positions;
"twentyfive": return the top 25 positions;
"fiftypercent": return the top 50 percent positions;
"twentyfivepercent": return the top 25% positions;
"tenpercent": return the top 10% positions; if not
specified, the complete alignment will be used
(default: None)
--unknown_fraction UNKNOWN_FRACTION
Define the amount of unknowns ("X" and "-") allowed in
each column of the MSA of the markers (default: 0.3)
--scoring_function {trident,muscle,random}
Specify which scoring function to use to evaluate
columns in the MSA results (default: None)
--sort If specified, the markers will be ordered. When using
the PhyloPhlAn database, it will be automatically set
to "True" (default: False)
--remove_fragmentary_entries
If specified, the MSAs will be checked and cleaned from
fragmentary entries. See --fragmentary_threshold for
the threshold values above which an entry will be
considered fragmentary (default: False)
--fragmentary_threshold FRAGMENTARY_THRESHOLD
The fraction of gaps in the MSA to be considered
fragmentary and hence discarded (default: 0.85)
--min_num_entries MIN_NUM_ENTRIES
The minimum number of entries to be present for each
of the markers in the database (default: 4)
--maas MAAS Select a mapping file that specifies the amino acid
substitution model to be used for each of the markers
for the gene tree reconstruction. The file must be tab-
separated (default: None)
--remove_only_gaps_entries
If specified, entries in the MSAs composed only of
gaps ("-") will be removed. This is equivalent to
specify "--remove_fragmentary_entries
--fragmentary_threshold 1" (default: False)
--mutation_rates If specified, will produce a mutation rates table for
each of the aligned markers and a summary table for
the concatenated MSA. This operation can take a long
time to finish (default: False)
--force_nucleotides If specified, force PhyloPhlAn to use nucleotide
sequences for the phylogenetic analysis, even in the
case of a amino acids database (default: False)
--update Update the databases file (default: False)
--verbose Make PhyloPhlAn verbose (default: False)
-v, --version Print the current PhyloPhlAn version and exit
Folder paths:
Parameters for setting folder locations
--input_folder INPUT_FOLDER
Path to the folder containing the input data (default:
input/)
--data_folder DATA_FOLDER
Path to the folder where to store the intermediate
files. Default is "tmp" inside the project's output
folder (default: None)
--databases_folder DATABASES_FOLDER
Path to the folder containing the database files
(default: phylophlan_databases/)
--submat_folder SUBMAT_FOLDER
Path to the folder containing the substitution
matrices to be used to compute the column score for
the subsampling step (default:
phylophlan_substitution_matrices/)
--submod_folder SUBMOD_FOLDER
Path to the folder containing the mapping file with
substitution models for each marker for the gene tree
building (default: phylophlan_substitution_models/)
--configs_folder CONFIGS_FOLDER
Path to the folder containing the configuration files
(default: phylophlan_configs/)
--output_folder OUTPUT_FOLDER
Path to the output folder where to save the results
(default: )
Filename extensions:
Parameters for setting the extensions of the input files
--genome_extension GENOME_EXTENSION
Extension for input genomes (default: .fna)
--proteome_extension PROTEOME_EXTENSION
Extension for input proteomes (default: .faa)
This script is used to build a custom database and it should be used if the user decides not to use one of the two databases provided. The output is a folder containing the markers ready to be used in phylophlan through the option -d, followed by the name of the said folder.
Other information here.
usage: phylophlan_setup_database.py [-h] [-i INPUT | -g GET_CORE_PROTEINS]
[--database_update] [-o OUTPUT]
[-d DB_NAME] [-e INPUT_EXTENSION]
[-t {n,a}] [-x OUTPUT_EXTENSION]
[--overwrite] [--verbose] [-v]
The phylophlan_setup_database.py script can be used to either format an input
folder or multi-fasta file to be used as database in phylophlan.py, or to
automatically download a pre-identified set of core UniRef90 proteins for the
taxonomic label of a given species
optional arguments:
-h, --help Show this help message and exit
-i INPUT, --input INPUT
Specify the path to either the folder containing the
marker files or the file of markers, in (multi-)fasta
format (default: None)
-g GET_CORE_PROTEINS, --get_core_proteins GET_CORE_PROTEINS
Specify the taxonomic label for which to download the set
of core proteins. The label must represent a species:
"--get_core_proteins s__Escherichia_coli" (default:
None)
--database_update Update the databases file (default: False)
-o OUTPUT, --output OUTPUT
Specify path to the output folder where to save the
database (default: None)
-d DB_NAME, --db_name DB_NAME
Specify the name of the output database (default:
None)
-e INPUT_EXTENSION, --input_extension INPUT_EXTENSION
Specify the extension of the input file(s) specified
via -i/--input (default: None)
-t {n,a}, --db_type {n,a}
Specify the type of the database, where "n" stands for
nucleotides and "a" for amino acids (default: None)
-x OUTPUT_EXTENSION, --output_extension OUTPUT_EXTENSION
Set the database output extension (default: None)
--overwrite If specified and the output file exists, it will be
overwritten (default: False)
--verbose Print more stuff (default: False)
-v, --version Print the current phylophlan_setup_database.py
version and exit
This script allows the user to customize the phylogenetic analysis by creating a personalized configuration file, deciding which software to use for every mandatory section among the available ones, as seen above. The output is a text file, so if the user desires to customize the parameters of the selected software according to specific needs and the type of the analysis to be executed, the user should open the generated configuration file with a text editor and then add/remove the specific options. Other information here.
usage: phylophlan_write_config_file.py [-h] -o OUTPUT -d {n,a}
(--db_dna {makeblastdb} | --db_aa {usearch,diamond})
[--map_dna {blastn,tblastn,diamond}]
[--map_aa {usearch,diamond}] --msa
{muscle,mafft,opal,upp}
[--trim {trimal}]
[--gene_tree1 {fasttree,raxml,iqtree}]
[--gene_tree2 {raxml}] --tree1
{fasttree,raxml,iqtree,astral,astrid}
[--tree2 {raxml}] [-a]
[--force_nucleotides] [--overwrite]
[--verbose] [-v]
The phylophlan_write_config_file.py script generates a configuration file to
be used with the phylophlan.py script. It implements some standard parameters
for the software integrated, but if needed, the parameters of the selected
software can be added/modified/removed by editing the generated configuration
file using a text editor
optional arguments:
-h, --help Show this help message and exit
-o OUTPUT, --output OUTPUT
Specify the output file where to write the
configurations (default: None)
-d {n,a}, --db_type {n,a}
Specify the type of the database, where "n" stands for
nucleotides and "a" for amino acids (default: None)
--db_dna {makeblastdb}
Add the "db_dna" section of the selected software that
will be used for building the indexed database
(default: None)
--db_aa {usearch,diamond}
Add the "db_aa" section of the selected software that
will be used for building the indexed database
(default: None)
--map_dna {blastn,tblastn,diamond}
Add the "map_dna" section of the selected software
that will be used for mapping the database against the
input genomes (default: None)
--map_aa {usearch,diamond}
Add the "map_aa" section of the selected software that
will be used for mapping the database against the
input proteomes (default: None)
--msa {muscle,mafft,opal,upp}
Add the "msa" section of the selected software that
will be used for producing the MSAs (default: None)
--trim {trimal} Add the "trim" section of the selected software that
will be used for the removal of the gappy regions of
the MSAs (default: None)
--gene_tree1 {fasttree,raxml,iqtree}
Add the "gene_tree1" section of the selected software
that will be used for building the phylogenies for the
markers in the database (default: None)
--gene_tree2 {raxml} Add the "gene_tree2" section of the selected software
that will be used for refining the phylogenies
previously built with what specified in the
"gene_tree1" section (default: None)
--tree1 {fasttree,raxml,iqtree,astral,astrid}
Add the "tree1" section of the selected software that
will be used for building the first phylogeny
(default: None)
--tree2 {raxml} Add the "tree2" section of the selected software that
will be used for refining the phylogeny previously
built with what specified in the "tree1" section
(default: None)
-a, --absolute_path Write the absolute path to the executable instead of
the executable name as found in the system path
environment (default: False)
--force_nucleotides If specified, sets parameters for phylogenetic analysis
software so that they use nucleotide sequences, even
in the case of a database of amino acids (default:
None)
--overwrite Overwrite output file if it exists (default: False)
--verbose Print more stuff (default: False)
-v, --version Print the current phylophlan_write_config_file.py
version and exit
For each bin that comes from a metagenomic assembly analysis, this script reports the closest species-level genome bins (SGBs). This is particularly useful when the user needs to analyze bins assembled from metagenomes. The main output file to consider will be a tsv file containing, for each bin of interest, information about the SGB it has been assigned to. Other information here.
usage: phylophlan_metagenomic.py [-h] [-i INPUT] [-o OUTPUT_PREFIX]
[-d DATABASE] [--database_list]
[--database_update] [-e INPUT_EXTENSION]
[-n HOW_MANY] [--nproc NPROC]
[--database_folder DATABASE_FOLDER]
[--only_input] [--add_ggb] [--add_fgb]
[--overwrite] [--verbose] [-v]
The phylophlan_metagenomic.py script assigns SGB and taxonomy to a given set of
input genomes. Outputs can be of three types: (1) for each input genome,
returns the list of the closest -n/--how_many SGBs sorted by average Mash
distance; (2) for each input genome, returns the closest SGB, GGB, FGB, and
reference genomes; (3) returns an all vs. all matrix with all the pairwise mash
distances
optional arguments:
-h, --help Show this help message and exit
-i INPUT, --input INPUT
Input folder containing the metagenomic bins to be
indexed (default: None)
-o OUTPUT_PREFIX, --output_prefix OUTPUT_PREFIX
Prefix used for the output folders: indexed bins,
distance estimations. If not specified, the input
folder will be used (default: None)
-d DATABASE, --database DATABASE
Database name. Available options can be listed using
the --database_list parameter (default: None)
--database_list List of all the available databases that can be
specified with the -d/--database option (default:
False)
--database_update Update the databases file (default: False)
-e INPUT_EXTENSION, --input_extension INPUT_EXTENSION
Specify the extension of the input file(s) specified
via -i/--input. If not specified, PhyloPhlAn will try
to infer it from the input files (default: None)
-n HOW_MANY, --how_many HOW_MANY
Specify the number of SGBs to report in the output;
"all" is a special value to report all the SGBs; this
param is not used when "--only_input" is specified
(default: 10)
--nproc NPROC The number of CPUs to use (default: 1)
--database_folder DATABASE_FOLDER
Path to the folder that contains the database file
(default: phylophlan_databases/)
--only_input If specified, provide a distance matrix between only
the input genomes provided (default: False)
--add_ggb If specified, add GGB assignments. If specified with
--add_fgb, then -n/--how_many will be set to 1 and
will be adding a column that reports the closest
reference genome (default: False)
--add_fgb If specified, add FGB assignments. If specified with
--add_ggb, then -n/--how_many will be set to 1 and
will be adding a column that reports the closest
reference genome (default: False)
--overwrite If specified, overwrite the output file if exists
(default: False)
--verbose Print more stuff (default: False)
-v, --version Print the current phylophlan_metagenomic.py version
and exit
This script is used to get reference genomes of a specified species. This is particularly useful when the user needs to build a tree to compare samples with existing reference genomes. When using the -g parameter, the output will be a directory with the requested genomes. Other information here.
usage: phylophlan_get_reference.py [-h] [-g GET | -l] [--database_update]
[-e OUTPUT_FILE_EXTENSION] [-o OUTPUT]
[-n HOW_MANY] [-m GENBANK_MAPPING]
[--verbose] [-v]
The phylophlan_get_reference.py script allows to download a specified number
(-n/--how_many) of reference genomes from the Genbank repository. Special case
"all" allows to download a specified number of reference genomes for all
available taxonomic species. With the -l/--list_clades params the
phylophlan_get_reference.py script returns the list of all species in the
database
optional arguments:
-h, --help Show this help message and exit
-g GET, --get GET Specify the taxonomic label for which download the set
of reference genomes. The label must represent a valid
taxonomic level or the special case "all" (default:
None)
-l, --list_clades Print for all taxa the total number of species and
reference genomes available (default: False)
--database_update Update the databases file (default: False)
-e OUTPUT_FILE_EXTENSION, --output_file_extension OUTPUT_FILE_EXTENSION
Specify extension of the output files
(default: .fna.gz)
-o OUTPUT, --output OUTPUT
Specify the path to the output folder where to save
the files, required when -g/--get is specified
(default: None)
-n HOW_MANY, --how_many HOW_MANY
Specify how many reference genomes to download, where
-1 stands for "all available" (default: 4)
-m GENBANK_MAPPING, --genbank_mapping GENBANK_MAPPING
The local GenBank mapping file. If not found, it will
be automatically downloaded (default:
assembly_summary_genbank.txt)
--verbose Print more stuff (default: False)
-v, --version Print the current phylophlan_get_reference.py version
and exit
This script can be used to perform analysis on trees built with phylophlan. The output is a table that contains the subtrees and information about the minimum, mean, and maximum distance between nodes in the subtree, the minimum, mean, and maximum mutation rate between nodes in the subtree, and the distance and mutation rate between each node in the subtree.
Other information here.
usage: phylophlan_strain_finder.py [-h] -i INPUT -m MUTATION_RATES
[--p_threshold P_THRESHOLD]
[--m_threshold M_THRESHOLD]
[--tree_format {newick,nexus,phyloxml,cdao,nexml}]
[-o OUTPUT] [--overwrite] [-s {;,,, }]
[--verbose] [-v]
The phylophlan_strain_finder.py script analyzes the phylogeny and the mutation
rates table generated from the phylophlan.py script and returns sub-trees
representing the same strain, according to both a phylogenetic threshold
(computed on the normalized pairwise phylogenetic distances) and a mutation
rate threshold (computed on the aligned sequences of the markers used in the
phylogenetic analysis)
optional arguments:
-h, --help Show this help message and exit
-i INPUT, --input INPUT
Specify the file of the phylogenetic tree as generated
from phylophlan.py (default: None)
-m MUTATION_RATES, --mutation_rates MUTATION_RATES
Specify the file of the mutation rates as generated
from phylophlan.py (default: None)
--p_threshold P_THRESHOLD
Maximum phylogenetic distance threshold for every pair
of nodes in the same subtree (inclusive) (default:
0.05)
--m_threshold M_THRESHOLD
Maximum mutation rate ratio for every pair of nodes in
the same subtree (inclusive) (default: 0.05)
--tree_format {newick,nexus,phyloxml,cdao,nexml}
Specify the format of the input tree (default: newick)
-o OUTPUT, --output OUTPUT
Specify the output filename. If not specified, it will
be stdout (default: None)
--overwrite Overwrite the output file if exists (default: False)
-s {;,,, }, --separator {;,,, }
Specify the separator to use in the output (default: )
--verbose Print more stuff (default: False)
-v, --version Print the current phylophlan_strain_finder.py version
and exit
This script can be used to visualize the results obtained with phylophlan_metagenomic. The outputs are two heatmaps, one showing the presence/absence of the top SGBs (customizable through --top) in the metagenomes, the other showing the number of kSGBs and uSGBs in each metagenome, and two corresponding output files containing the data used to build them.
Other information here.
usage: phylophlan_draw_metagenomic.py [-h] -i INPUT -m MAP [--top TOP]
[-o OUTPUT] [-s SEPARATOR] [--dpi DPI]
[-f F] [--verbose] [-v]
The phylophlan_draw_metagenomic.py script takes as input the output table
generated from the phylophlan_metagenomic.py script and produces two heatmap
figures: (1) presence/absence heatmap of the SGBs in the metagenomic samples;
and (2) heatmap showing the amount of kSGBs and uSGBs in each metagenome.
optional arguments:
-h, --help Show this help message and exit
-i INPUT, --input INPUT
The input file generated from
phylophlan_metagenomic.py (default: None)
-m MAP, --map MAP A mapping file that maps each bin to its metagenome
(default: None)
--top TOP The number of SGBs to display in the figure (default:
20)
-o OUTPUT, --output OUTPUT
Prefix of the output files (default: output_heatmap)
-s SEPARATOR, --separator SEPARATOR
The separator used in the mapping file (default: )
--dpi DPI Dpi resolution of the images (default: 200)
-f F Images output format (default: svg)
--verbose Print more stuff (default: False)
-v, --version Print the current phylophlan_draw_metagenomic.py
version and exit
The wiki of the first PhyloPhlAn implementation is available here, and the zip or tar.bz2 archives here, as described in:
PhyloPhlAn is a new method for improved phylogenetic and taxonomic placement of microbes
Nicola Segata, Daniela Börnigen, Xochitl C. Morgan, and Curtis Huttenhower
Nature Communications, vol. 4, p. 2304, Jul. 2013
DOI: 10.1038/ncomms3304