Home
PhyloPhlAn2 is an integrated pipeline for large-scale phylogenetic profiling of genomes and metagenomes.
Most likely the easiest way to understand how you can use PhyloPhlAn2 in your analysis is to check out the examples below:
- Phylogenetically characterized isolate genomes of a given species (S. aureus)
- Tree of life
- Metagenomics
- High-resolution phylogeny of known genomes reconstructed from metagenomes of a given species (E. coli)
- Phylogenetic characterization of unknown SGBs from the Proteobacteria phylum
In PhyloPhlAn2, the marker database used to extract the phylogenetic signal can be one of the two provided (PhyloPhlAn and AMPHORA2) or one defined by the user. A script that formats a user-defined database for use with PhyloPhlAn2 is provided. The phylogenetic pipeline can be configured entirely through a set of input parameters and a configuration file, in which the user specifies the preferred software for each step of the pipeline (scripts are provided to generate configuration files for different phylogenetic analyses). In addition, several parameters control whether and how the alignments are trimmed and subsampled: for instance, the type of trimming, the function used to score the multiple-sequence alignments, and the function that selects how many positions to consider for each multiple-sequence alignment.
PhyloPhlAn ver. 1.0: If you are looking for the tutorial of PhyloPhlAn ver. 1.0, have a look here
Three installation methods are available. We recommend the Conda-based ones, which guarantee that all PhyloPhlAn2 dependencies are satisfied automatically.
This option will be available soon
This requires a working Conda installation.
#!bash
conda install phylophlan2
Add `-c bioconda` if the Bioconda channel is not in your default channels list.
This installation procedure doesn't bring the external tools. You are advised to follow this procedure if you wish to install and configure the tools independently.
Step 1: Clone the `ppa2` conda environment
This requires a working Conda installation.
#!bash
conda env create fasnicar/ppa2
Step 2: Clone the PhyloPhlAn repository
This requires Mercurial.
#!bash
hg clone https://bitbucket.org/nsegata/phylophlan
PhyloPhlAn2 is at the moment available only in the `dev` branch of the repository, so you should switch branch after cloning the repository with `cd phylophlan && hg up dev`
Step 3: Install the Dependencies and Tools necessary to run the PhyloPhlAn2 pipeline
Step 4: Activate the `ppa2` conda environment
Before running PhyloPhlAn2, you need to activate the conda environment:
#!bash
conda activate ppa2
If correctly activated, you should see the `(ppa2)` prefix at the beginning of the command line.
Note: if you follow the medium installation process, be aware that every command has to be preceded by `./`, including those in the bash files provided for each example that execute all commands
Step 1: Get the latest PhyloPhlAn2 version from the repository
This requires Mercurial.
#!bash
hg clone https://bitbucket.org/nsegata/phylophlan
Step 2: Install the Dependencies and Tools necessary to run the PhyloPhlAn2 pipeline
Note: if you follow the hard installation process, be aware that every command has to be preceded by `./`, including those in the bash files provided for each example that execute all commands
In order to verify that PhyloPhlAn2 is properly installed, the following command:
#!bash
phylophlan2.py --version
should output something like below:
PhyloPhlAn2 version 0.40 (10 September 2019)
Note: if you have followed the medium or hard installation method in order to use PhyloPhlAn2 you should do one of the following.
- Change to the repository you have cloned:
#!bash
cd phylophlan/
- Alternatively, add this line at the end of the `.bashrc` file in your home directory, where `<path-to-directory>` is the absolute path to the repository you have cloned:
#!bash
export PATH="<path-to-directory>:$PATH"
#!bash
phylophlan2.py -i <input_folder> -d <database> --diversity <low-medium-high> -f <configuration_file>
where:
- `<input_folder>` is the folder containing your input genomes and/or proteomes, a detailed description is available here
- `<database>` is the name of the database of markers to use, a detailed description is available here
- `--diversity` takes a value in {`low`, `medium`, `high`} and is used to automatically set the analysis to the type of phylogeny to build, a detailed description is available here
- `<configuration_file>` is the path to the configuration file necessary to properly run PhyloPhlAn2, a detailed description is available here
PhyloPhlAn2 takes FASTA files (also compressed with Gzip, `.gz`, and/or Bzip2, `.bz2`) as input.
Inputs can be both genomes and proteomes, also mixed, and by default genomes and proteomes are distinguished by the `.fna` and `.faa` extensions, respectively.
If needed, the genome and proteome file extensions can be configured using the `--genome_extension` and `--proteome_extension` params, respectively.
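The extension rule above can be sketched in a few lines of Python. This is only an illustration of the convention, not PhyloPhlAn2's own code: the function name `classify_input` and the file names are hypothetical, and the handling of the compression suffix is an assumption based on the accepted `.gz` and `.bz2` formats.

```python
from pathlib import Path

# Compression suffixes accepted on top of the FASTA extension
COMPRESSION_SUFFIXES = {".gz", ".bz2"}


def classify_input(filename, genome_extension=".fna", proteome_extension=".faa"):
    """Return 'genome', 'proteome', or None according to the file extension.

    Hypothetical helper: the defaults mirror --genome_extension and
    --proteome_extension as described in the text.
    """
    suffixes = Path(filename).suffixes
    # Ignore a trailing compression suffix, if present
    if suffixes and suffixes[-1] in COMPRESSION_SUFFIXES:
        suffixes = suffixes[:-1]
    extension = suffixes[-1] if suffixes else ""
    if extension == genome_extension:
        return "genome"
    if extension == proteome_extension:
        return "proteome"
    return None


# Made-up file names, for illustration only
for name in ["isolate_1.fna", "isolate_2.faa.gz", "bin.12.fna.bz2", "notes.txt"]:
    print(name, "->", classify_input(name))
```

Files that match neither extension are reported as `None` here; what PhyloPhlAn2 does with such files is not covered by this sketch.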
When using PhyloPhlAn2 the user can customize each step of the pipeline used to build the tree (marker genes identification, multiple sequence alignment, concatenation or gene trees inference and phylogeny reconstruction) by specifying the desired tools in the configuration file. These should be selected according to the type of markers and input used in the analysis and will lead to diverse phylogenetic pipelines:
- when both markers and inputs are nucleotides, it will proceed in nucleotide space
- when markers are proteins and inputs are a mix of genomes and proteomes, it will proceed in translated sequence (amino acid) space. If the input is strictly genomic, one can specify the `--force_nucleotides` parameter to use a nucleotide pipeline even though the database is an amino acid one. In that case, the configuration file should be created consistently with the `--force_nucleotides` parameter.
The `--diversity` parameter provides three pre-defined options that set several other PhyloPhlAn2 parameters (e.g., trimming, subsampling, fragmentary removal, etc.) to values suited to the type of diversity expected in the phylogeny to be built.
The user can choose among three values:
Diversity | Description |
---|---|
`low` | for species- and strain-level phylogenies |
`medium` | for genus- and family-level phylogenies |
`high` | for tree-of-life and higher-ranked taxonomic levels phylogenies |
If not specified, PhyloPhlAn2 will automatically run with the `--accurate` option, which considers more phylogenetic positions and should therefore result in a more accurate phylogenetic reconstruction.
The `--fast` option can be specified for a faster phylogeny reconstruction.
Both options affect several other PhyloPhlAn2 params, setting optimal values based also on the `--diversity` chosen by the user. A detailed description is available here.
All files produced by PhyloPhlAn2 are available in the `<input_folder>_<database>` folder (or in the folder specified with `--output_folder`) created in the directory where the script is run.
Inside there is a temporary folder (`<input_folder>_<database>/tmp`) that contains all the intermediate files produced during the analysis pipeline.
Depending on the configuration file and hence on the pipeline executed, the resulting output files may have different names.
For instance, using the `supermatrix_aa.cfg` configuration file that can be automatically generated using the `phylophlan2_write_default_configs.sh` script provided with PhyloPhlAn2, the output files will be:
Filename | Description |
---|---|
RAxML_bestTree.input_folder_refined.tre | is the final (refined) phylogeny produced by RAxML starting from the FastTree phylogeny |
input_folder.tre | is the phylogeny built by FastTree |
input_folder.aln | is the multiple sequence alignment used as input for the phylogenies, in FASTA format |
The user can specify the number of CPUs to use with the `--nproc` parameter:
#!bash
phylophlan2.py -i <input_folder> -d <database> --diversity <low-medium-high> -f <configuration_file> --nproc <N>
Please note that regardless of the number of CPUs specified with `--nproc`, PhyloPhlAn2 will run:
- FastTree with 3 CPUs (as suggested in the FastTree FAQs). This is not regulated by the `--nproc` param, because FastTree uses the `OMP_NUM_THREADS` variable, which is defined in the configuration file.
- RAxML with no more than 20 CPUs when `--nproc` is greater than 20, as in our experience using more than 20 CPUs with RAxML does not shorten the computational time required for the phylogeny reconstruction.
Note: if you specify with `--nproc` a higher number of CPUs than those available on your machine, you will experience a significant drop in performance, as also reported in the RAxML manual.
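The CPU rules above amount to a small calculation, sketched below in Python. The function names are hypothetical and the code is not taken from PhyloPhlAn2; the value 3 for FastTree stands for the `OMP_NUM_THREADS` setting in the default configuration file, and 20 is the RAxML cap described in the text.

```python
def effective_cpus(nproc, fasttree_threads=3, raxml_cap=20):
    """Return the CPUs each tool would use for a given --nproc value.

    Hypothetical helper illustrating the rules in the text:
    FastTree ignores --nproc, RAxML is capped at 20.
    """
    return {
        "fasttree": fasttree_threads,
        "raxml": min(nproc, raxml_cap),
    }


def oversubscribed(nproc, available):
    """True when --nproc asks for more CPUs than the machine has."""
    return nproc > available


print(effective_cpus(8))
print(effective_cpus(64))
print(oversubscribed(64, available=32))
```

On a real machine, `os.cpu_count()` could supply the `available` value; it is passed explicitly here so the example behaves the same everywhere.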
PhyloPhlAn2 is able to automatically download two databases:
- PhyloPhlAn (`-d phylophlan`, 400 universal marker genes) presented in Segata, N et al. NatComm 4:2304 (2013)
- AMPHORA2 (`-d amphora2`, 136 universal marker genes) presented in Wu M, Scott AJ Bioinformatics 28.7 (2012)
Moreover, in addition to the two databases provided, as explained in the following database setup section, it is possible to retrieve a set of core proteins of a specific species, or even build custom databases starting from either a folder containing marker files or a multi-fasta file containing the marker sequences (e.g., multi-fasta file with the core genes sequences from Roary).
In this section, we provide as many details as possible for the other parameters and configurations available in PhyloPhlAn2.
When building a phylogeny PhyloPhlAn2 makes sure that input genomes/proteomes and markers respect a certain threshold of quality. It is possible to customize these controls through the following parameters:
- `--min_num_proteins <n>`: proteomes (.faa) with fewer than this number of proteins will be discarded. Default is 1
- `--min_len_protein <n>`: this parameter is associated with the previous `--min_num_proteins` and specifies the minimum length that proteins in proteomes (.faa) should have. Proteins shorter than this value will be discarded. Default is 50
The above parameters have no effect when the pipeline is strictly genomic (both markers and input are nucleotides) or when `--force_nucleotides` is specified in the command line; see this section for more information.
- `--min_num_markers <n>`: input genomes or proteomes that map to fewer than the specified number of markers will be discarded. Default is 0, unless the database specified with `-d` is `phylophlan` or `amphora2`, in which cases the default is 100 and 34, respectively
- `--min_num_entries <n>`: database markers that are found in fewer than the specified number of input entries will be discarded. Default is 4
- `--remove_fragmentary_entries`: if specified, the multiple sequence alignment (MSA) will be checked and cleaned from fragmentary entries. See `--fragmentary_threshold` for the threshold value above which an entry will be considered fragmentary. Default is false
- `--fragmentary_threshold <n>`: the fraction of gaps in a row of the MSA for the entry to be considered fragmentary and hence discarded. Default is 0.85
- `--remove_only_gaps_entries`: if specified, entries in the MSAs composed only of gaps will be removed. This is equivalent to specifying `--remove_fragmentary_entries` and `--fragmentary_threshold 1`. Default is false
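As a rough illustration of the fragmentary-entry check, the Python sketch below drops alignment rows by their fraction of gaps. It is not PhyloPhlAn2's implementation: the function names and the toy alignment are made up, and treating a row as fragmentary when its gap fraction is greater than or equal to the threshold is an assumption, chosen so that a threshold of 1 removes exactly the gap-only rows, as the text states for `--remove_only_gaps_entries`.

```python
GAP = "-"


def gap_fraction(sequence):
    """Fraction of gap characters in one aligned sequence."""
    return sequence.count(GAP) / len(sequence) if sequence else 1.0


def remove_fragmentary(msa, threshold=0.85):
    """Keep only the rows whose gap fraction is below the threshold.

    `msa` maps entry names to aligned sequences of equal length.
    Assumption: a row is discarded when gap_fraction >= threshold.
    """
    return {name: seq for name, seq in msa.items() if gap_fraction(seq) < threshold}


# Toy alignment, for illustration only
msa = {
    "genome_A": "MKTAYIAK",   # no gaps
    "genome_B": "MK--YIAK",   # 25% gaps
    "genome_C": "M-------",   # 87.5% gaps
    "genome_D": "--------",   # only gaps
}

print(sorted(remove_fragmentary(msa)))                # default threshold 0.85
print(sorted(remove_fragmentary(msa, threshold=1)))   # removes gap-only rows
```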
The following table shows which parameters are affected and how their values automatically change according to the combination of the `--diversity` and `--accurate`/`--fast` parameters.
| | `--diversity low` | `--diversity medium` | `--diversity high` |
---|---|---|---|
`--accurate` | `--trim not_variant --submat pfasum60 --remove_fragmentary_entries --not_variant_threshold 0.99` | `--trim gap_trim --remove_fragmentary_entries --fragmentary_threshold 0.85 --submat pfasum60 --subsample onehundred --scoring_function trident` | `--trim greedy --remove_fragmentary_entries --fragmentary_threshold 0.75 --submat pfasum60 --subsample twentyfive --scoring_function trident --not_variant_threshold 0.95 --gap_perc_threshold 0.85` |
`--fast` | `--trim greedy --remove_fragmentary_entries --fragmentary_threshold 0.85 --submat pfasum60 --subsample fivehundred --scoring_function trident --gap_perc_threshold 0.67` | `--trim greedy --remove_fragmentary_entries --fragmentary_threshold 0.75 --submat pfasum60 --subsample fifty --scoring_function trident --not_variant_threshold 0.97 --gap_perc_threshold 0.75` | `--trim greedy --remove_fragmentary_entries --fragmentary_threshold 0.67 --submat pfasum60 --subsample phylophlan` or `--subsample tenpercent`, `--scoring_function trident --not_variant_threshold 0.9 --gap_perc_threshold 0.85` |
Note: if you manually specify one or more of the above parameters in the command line, your values will override the automatic ones for the chosen combination of `--diversity` and `--accurate`/`--fast`.
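The preset-plus-override behaviour can be pictured as a dictionary lookup followed by an update. The Python sketch below encodes the values of the table above; the names `PRESETS` and `resolve_params` are hypothetical, and choosing `phylophlan` versus `tenpercent` from the database name in the `--fast`/`high` case is an assumption based on the note that `--subsample phylophlan` only works with `-d phylophlan`.

```python
COMMON = {"remove_fragmentary_entries": True, "submat": "pfasum60"}
TRIDENT = {"scoring_function": "trident"}

# Values transcribed from the table in the text
PRESETS = {
    ("accurate", "low"): {**COMMON, "trim": "not_variant", "not_variant_threshold": 0.99},
    ("accurate", "medium"): {**COMMON, **TRIDENT, "trim": "gap_trim",
                             "fragmentary_threshold": 0.85, "subsample": "onehundred"},
    ("accurate", "high"): {**COMMON, **TRIDENT, "trim": "greedy",
                           "fragmentary_threshold": 0.75, "subsample": "twentyfive",
                           "not_variant_threshold": 0.95, "gap_perc_threshold": 0.85},
    ("fast", "low"): {**COMMON, **TRIDENT, "trim": "greedy",
                      "fragmentary_threshold": 0.85, "subsample": "fivehundred",
                      "gap_perc_threshold": 0.67},
    ("fast", "medium"): {**COMMON, **TRIDENT, "trim": "greedy",
                         "fragmentary_threshold": 0.75, "subsample": "fifty",
                         "not_variant_threshold": 0.97, "gap_perc_threshold": 0.75},
    ("fast", "high"): {**COMMON, **TRIDENT, "trim": "greedy",
                       "fragmentary_threshold": 0.67,
                       "not_variant_threshold": 0.9, "gap_perc_threshold": 0.85},
}


def resolve_params(diversity, mode="accurate", database=None, **overrides):
    """Return the preset for (mode, diversity) with user overrides applied last."""
    params = dict(PRESETS[(mode, diversity)])
    if (mode, diversity) == ("fast", "high"):
        # Assumption: 'phylophlan' only with the PhyloPhlAn database, else 'tenpercent'
        params["subsample"] = "phylophlan" if database == "phylophlan" else "tenpercent"
    # Values given explicitly on the command line win over the preset
    params.update({key: value for key, value in overrides.items() if value is not None})
    return params


print(resolve_params("medium"))
print(resolve_params("low", trim="greedy"))
```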
You can specify the trimming strategy to use with the `--trim` parameter. The user can choose among four different options:
`--trim` | Description |
---|---|
`gap_trim` | performs what is specified in the trim section of the configuration file, which by default is trimAl with the `-gappyout` parameter, as presented in Capella-Gutiérrez S, et al. Bioinformatics 25.15 (2009) and in the trimAl website |
`gap_perc` | removes columns with a percentage of gaps above a certain threshold, regulated by the `--gap_perc_threshold` parameter, whose default value is 0.67 |
`not_variant` | removes columns of a multiple-sequence alignment in which at least one amino acid appears with a frequency above a certain threshold (set by the `--not_variant_threshold` parameter, whose default value is 0.95) |
`greedy` | performs all the above trimming options |
The default is `None`. In this case, the trimming step will not be performed.
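To make the two column filters concrete, here is a minimal Python sketch of `gap_perc` and `not_variant` on a toy alignment. It is an illustration under stated assumptions, not the PhyloPhlAn2 code: the function names are hypothetical, frequencies are computed over all rows of the column (gaps included in the denominator), and the combined call at the end leaves out the trimAl step that `gap_trim` would add.

```python
from collections import Counter

GAP = "-"


def keep_gap_perc(column, threshold=0.67):
    """Keep a column unless its fraction of gaps is above the threshold."""
    return column.count(GAP) / len(column) <= threshold


def keep_not_variant(column, threshold=0.95):
    """Keep a column unless one residue appears with a frequency above the threshold."""
    residues = Counter(symbol for symbol in column if symbol != GAP)
    if not residues:
        return True  # gap-only columns are left to the gap filter
    most_frequent = residues.most_common(1)[0][1]
    return most_frequent / len(column) <= threshold


def trim(msa, *filters):
    """Return the alignment restricted to the columns that pass every filter."""
    kept = [i for i, column in enumerate(zip(*msa)) if all(f(column) for f in filters)]
    return ["".join(sequence[i] for i in kept) for sequence in msa]


# Toy alignment: columns 0 and 4 are invariant, column 3 is 75% gaps
msa = ["MKT-A", "MRT-A", "MKS-A", "MKTQA"]

print(trim(msa, keep_gap_perc))                    # drops the gappy column
print(trim(msa, keep_not_variant))                 # drops the invariant columns
print(trim(msa, keep_gap_perc, keep_not_variant))  # both filters together
```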
The site subsampling strategy retains only a certain number of phylogenetically relevant positions (ranked by the scoring function).
In PhyloPhlAn2 you can specify the subsampling strategy using the `--subsample` parameter.
There are several options available, each retaining a different number of positions:
`--subsample` | Description |
---|---|
`phylophlan` | uses the formula presented in Segata, N et al. NatComm 4:2304 (2013) to determine how many positions to retain for each one of the 400 PhyloPhlAn markers |
`onethousand` | retains up to 1000 positions for each marker |
`sevenhundred` | retains up to 700 positions for each marker |
`fivehundred` | retains up to 500 positions for each marker |
`threehundred` | retains up to 300 positions for each marker |
`onehundred` | retains up to 100 positions for each marker |
`fifty` | retains up to 50 positions for each marker |
`twentyfive` | retains up to 25 positions for each marker |
`tenpercent` | retains 10% of the positions for each marker |
`twentyfivepercent` | retains 25% of the positions for each marker |
`fiftypercent` | retains 50% of the positions for each marker |
Note: the `--subsample phylophlan` option works only when using the PhyloPhlAn database, specified via `-d phylophlan`.
The default is `None`. In this case, the subsampling will not be performed and the full-length alignment will be used.
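The fixed-count and percentage options can be sketched as follows in Python. This is a simplified illustration, not PhyloPhlAn2's implementation: the names are hypothetical, the per-column scores are made up, higher scores are assumed to mean more relevant positions, percentages are rounded down, and the `phylophlan` formula is not reproduced.

```python
FIXED = {"onethousand": 1000, "sevenhundred": 700, "fivehundred": 500,
         "threehundred": 300, "onehundred": 100, "fifty": 50, "twentyfive": 25}
PERCENT = {"tenpercent": 10, "twentyfivepercent": 25, "fiftypercent": 50}


def positions_to_retain(option, alignment_length):
    """Number of columns kept for one marker under a --subsample option."""
    if option in FIXED:
        return min(FIXED[option], alignment_length)  # "up to N positions"
    return alignment_length * PERCENT[option] // 100  # assumption: rounded down


def subsample(msa, scores, option):
    """Keep the best-scoring columns, preserving their original order."""
    n = positions_to_retain(option, len(scores))
    best = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:n]
    kept = sorted(best)
    return ["".join(sequence[i] for i in kept) for sequence in msa]


# Toy marker alignment with made-up column scores
msa = ["ABCDEFGH", "abcdefgh"]
scores = [0.1, 0.9, 0.4, 0.8, 0.2, 0.7, 0.3, 0.6]

print(subsample(msa, scores, "fiftypercent"))  # keeps the 4 best columns
print(subsample(msa, scores, "twentyfive"))    # shorter than 25: kept whole
```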
A scoring function is used in PhyloPhlAn2 to assign a phylogenetic score to each column in the MSAs; the scores are then used to rank the MSA positions and retain a subset of them (see Subsampling).
The `--scoring_function` parameter allows three different scoring functions:
`--scoring_function` | Description |
---|---|
`muscle` | implements the same scoring function defined in Edgar, RC NAR 32.5 (2004), when specifying the `-scorefile` param |
`trident` | implements the trident scoring function as presented in Valdar, WSJ. Proteins 48.2 (2002), which is a weighted combination of symbol diversity, stereochemical diversity, and gap cost |
`random` | assigns random scores to each position in the MSAs |
Some of the functions for scoring the MSA columns need a substitution matrix to evaluate the substitution of amino acids.
Substitution matrices can be specified using the `--submat` param, which can take one of the following values.
`--submat` | Description |
---|---|
`vtml200` | substitution matrix proposed by Yamada K, Tomii K Bioinformatics 30.3 (2014) |
`vtml240` | substitution matrix used in Edgar RC NAR 32.5 (2004) |
`miqs` | substitution matrix proposed by Tomii K and Kazunori Y Humana Press, New York, NY, 1415 (2016) |
`pfasum60` | substitution matrix proposed by Keul F et al. BMC Bioinformatics 18.1 (2017) |
The substitution matrices presented above are distributed with PhyloPhlAn2.
However, the set of substitution matrices can be extended with user-defined ones.
Users can generate their own substitution matrices using the scripts (`generate_matrices.sh` and `serialize_matrix.py`) provided in the `phylophlan2_substitution_matrices` folder.
PhyloPhlAn2 has the `--mutation_rates` option that computes the amount of nucleotide or amino acid changes in each aligned marker.
In the output folder `<input_folder>_<database>/mutation_rates/`, you can find a mutation rate table for all the markers, whereas the `<input_folder>_<database>/mutation_rates.tsv` file contains the summarized mutation rates table for the complete multiple sequence alignment.
Using the `--sort` parameter it is possible to sort the markers and hence force PhyloPhlAn2 to consider them in a specific order when concatenating the sequences.
When using the PhyloPhlAn database (`-d phylophlan`), `--sort` will be automatically set to `True`.
Note: the sort preference is used for the super-matrix approach (concatenation) only.
To build a custom database, we provide the `phylophlan2_setup_database.py` script, to be run with the following syntax:
#!bash
phylophlan2_setup_database.py -i <input_file_or_folder> -d <database_name> -e <input_extension> -t <database_type>
where:
- `<input_file_or_folder>` is the folder containing the marker files or a multi-fasta file containing the markers
- `<database_name>` is the database name chosen by the user (the name to use when running PhyloPhlAn2)
- `<input_extension>` is the extension of the input file(s)
- `<database_type>` has to be `n` if it is a nucleotide database or `a` if it is an amino acid database (depending on the input provided by the user)
The database will be created in the same folder as the input file(s), or you can specify an output folder with the `-o` option.
The `phylophlan2_setup_database.py` script can also be used to automatically retrieve a set of core proteins of a specific species using the `-g` option (instead of the `-i` param). In this case, you need to specify the species name by typing `-g s__<species_name>`. This is also going to be the default name of the database, unless a different one is specified with `-d`.
The choice between `-i` and `-g` depends on whether the user already has a folder containing the markers or is asking PhyloPhlAn2 to download a set of core markers.
PhyloPhlAn2 relies on the configuration file for handling the external software.
A configuration file can be specified in `phylophlan2.py` with either `-f <configuration>` or `--config_file <configuration>`.
Each configuration file is composed of different sections (some are mandatory, to ensure to be able to complete a phylogenetic analysis, and some are optional). Each section refers to a specific step in the phylogenetic pipeline and contains all the details for the external software to use.
In PhyloPhlAn2 you can find the `phylophlan2_write_default_configs.sh` script that will generate four ready-to-use configuration files:
- `supermatrix_aa.cfg`
- `supermatrix_nt.cfg`
- `supertree_aa.cfg`
- `supertree_nt.cfg`
More information about the supermatrix and supertree approaches is available in the following section.
If you want to generate your own configuration file, you can use the `phylophlan2_write_config_file.py` script.
Below is an example of the command used to create a customized configuration file. With respect to the `supermatrix_nt.cfg` configuration file generated by the `phylophlan2_write_default_configs.sh` script, it uses `diamond` instead of `blastn` and `muscle` instead of `mafft`:
#!bash
python phylophlan2_write_config_file.py \
-o custom_config_nt.cfg \
-d n \
--db_dna makeblastdb \
--map_dna diamond \
--msa muscle \
--trim trimal \
--tree1 fasttree \
--tree2 raxml
where:
- `-o` is the output filename
- `-d` indicates the type of database this configuration file is tailored for, a detailed description is available here
- `--db_dna`, `--map_dna`, `--msa`, `--trim`, `--tree1`, `--tree2` indicate the sections the configuration file will contain
The following sections are strictly required to be defined in a configuration file:
Mandatory section | Description |
---|---|
`--db_dna` and/or `--db_aa` | Choices for `db_dna`: (`makeblastdb`), for `db_aa`: (`usearch`, `diamond`) |
`--map_dna` and/or `--map_aa` | specify the software for mapping the database against genomes and proteomes, respectively. Choices for `map_dna`: (`blastn`, `tblastn`, `diamond`), for `map_aa`: (`usearch`, `diamond`) |
`--msa` | specify the software for performing the multiple-sequence alignment. Choices are: `muscle`, `mafft`, `opal`, `upp` |
`--tree1` | specify the software for inferring the phylogeny. Choices are: `fasttree`, `raxml`, `iqtree`, `astral`, `astrid` |

Optional section | Description |
---|---|
`--trim` | specify the software `trimal` for performing the trimming of gappy regions |
`--gene_tree1` | specify the software to use for building the single-gene trees. Choices are: `fasttree`, `raxml`, `iqtree` |
`--gene_tree2` | specify the software `raxml` for refining the phylogenies built at the `gene_tree1` step |
`--tree2` | specify the software `raxml` for refining the phylogeny built at the `tree1` step |
PhyloPhlAn2 allows users to integrate new tools that are not available in the framework, as well as their parameters, for each of the different steps. This is done by manually editing the configuration file or creating a new configuration file with the desired tools/parameters.
The only requirement for this integration is that the input and output file formats of the tool are compatible with the framework's default tools.
Here is an example section of a default supermatrix configuration file that uses MAFFT for multiple-sequence alignment:
[msa]
program_name = mafft
params = --quiet --anysymbol --thread 1 --auto
version = --version
command_line = #program_name# #params# #input# > #output#
And here is the same section modified to use Clustal Omega, which is not a default option in PhyloPhlAn2 for the multiple-sequence alignment step:
[msa]
program_name = clustalo
input = -i
output = -o
params = --threads 1 --auto
version = --version
command_line = #program_name# #params# #input# #output#
A configuration file can be composed of several different sections, but there is a minimum set of sections that has to be present to complete a phylogenetic analysis. The mandatory sections are:
- either `map_dna` or `map_aa`
- `msa`
- `tree1`
The complete list of available sections is:
- `map_dna`
- `map_aa`
- `msa`
- `trim`
- `gene_tree1`
- `gene_tree2`
- `tree1`
- `tree2`
Each of the above sections can have several different options specified. These are required in order to compose a command line that can run an external tool. The mandatory options that each section in a configuration file has to specify are:
- `program_name`
- `command_line`
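A quick way to check these requirements on a configuration file is sketched below with Python's standard `configparser` module. The function `check_config` is hypothetical and is not part of PhyloPhlAn2; it only tests the mandatory sections and options listed above and assumes the file follows the INI-like layout shown in the examples.

```python
import configparser

MANDATORY_SECTIONS = ("msa", "tree1")  # plus one of map_dna / map_aa
MANDATORY_OPTIONS = ("program_name", "command_line")


def check_config(text):
    """Return a list of problems found in a configuration file given as text."""
    config = configparser.ConfigParser(interpolation=None)
    config.read_string(text)
    problems = []
    if not (config.has_section("map_dna") or config.has_section("map_aa")):
        problems.append("missing section: map_dna or map_aa")
    for section in MANDATORY_SECTIONS:
        if not config.has_section(section):
            problems.append(f"missing section: {section}")
    for section in config.sections():
        for option in MANDATORY_OPTIONS:
            if not config.has_option(section, option):
                problems.append(f"[{section}] lacks option: {option}")
    return problems


# A deliberately incomplete configuration: only the [msa] section is present
ONLY_MSA = """
[msa]
program_name = mafft
params = --quiet --anysymbol --thread 1 --auto
version = --version
command_line = #program_name# #params# #input# > #output#
"""

for problem in check_config(ONLY_MSA):
    print(problem)
```

An empty list means that the mandatory parts are present; it says nothing about whether the external tools are installed.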
The complete list of available options is:
- `program_name`
- `params`
- `threads`
- `input`
- `database`
- `output_path`
- `output`
- `version`
- `environment`
- `command_line`
In particular, the `command_line` option specifies how the other options should be arranged in order to build a running command line. For instance, taking this section of a configuration file:
[msa]
program_name = mafft
params = --quiet --anysymbol --thread 1 --auto
version = --version
command_line = #program_name# #params# #input# > #output#
In the `command_line` option it is specified that first comes the information provided in the `program_name` option, followed by the information in the `params` option. Then comes the information about the `input` option. Note that in this configuration no `input` option is specified, so in this case PhyloPhlAn2 will read the input from the standard input. After the `input` option there is the output redirect sign (`>`) followed by the `output` option. Note that also in this case there is no parameter to specify an output file, so PhyloPhlAn2 will redirect the output to the output file.
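The placeholder mechanism can be illustrated with plain string substitution. The Python sketch below is only a model of the template idea, using the two `[msa]` sections shown earlier; the function `compose` and the file names are hypothetical, and the sketch simply writes the file name where the placeholder is, without modelling how PhyloPhlAn2 itself handles standard input or output redirection.

```python
import configparser

CONFIG = """
[msa_mafft]
program_name = mafft
params = --quiet --anysymbol --thread 1 --auto
command_line = #program_name# #params# #input# > #output#

[msa_clustalo]
program_name = clustalo
input = -i
output = -o
params = --threads 1 --auto
command_line = #program_name# #params# #input# #output#
"""


def compose(section, input_file, output_file):
    """Fill the #...# placeholders of command_line with the section's options."""
    # When an 'input' or 'output' option exists, its flag precedes the file name
    inp = f"{section['input']} {input_file}" if "input" in section else input_file
    out = f"{section['output']} {output_file}" if "output" in section else output_file
    command = section["command_line"]
    command = command.replace("#program_name#", section["program_name"])
    command = command.replace("#params#", section.get("params", ""))
    command = command.replace("#input#", inp).replace("#output#", out)
    return " ".join(command.split())  # collapse repeated spaces


config = configparser.ConfigParser(interpolation=None)
config.read_string(CONFIG)

# Hypothetical file names, for illustration only
print(compose(config["msa_mafft"], "marker.faa", "marker.aln"))
print(compose(config["msa_clustalo"], "marker.faa", "marker.aln"))
```

The two sections are given distinct names here only so that both can live in one string; in a real configuration file each would be called `[msa]`.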
PhyloPhlAn2 can execute either a Supermatrix (or concatenation) pipeline or a Supertree (or gene trees) pipeline.
The type of phylogenetic pipeline executed is determined by the settings in the configuration file.
The Supermatrix pipeline is the default, determined also by the mandatory sections.
In other words, when neither the `gene_tree1` nor the `gene_tree2` section is present in the configuration file, PhyloPhlAn2 will perform a concatenation pipeline.
This approach is to be preferred when building a large phylogeny but will build a tree with unresolved branch length tips.
For a Supertree pipeline, the required section in the configuration file is `gene_tree1`.
In order to use a gene trees pipeline, the user has to manually edit the `[tree1]` section in the configuration file, specifying the paths to the ASTRAL jar file and to the example file for the `version` option (needed to verify the correct installation of ASTRAL).
Below is the `[tree1]` section example template that needs to be edited:
[tree1]
command_line = #program_name# #input# #output#
program_name = java -jar /../path_to_astral/../astral.4.11.1.jar
input = -i
output = -o
version = -i /../path_to_astral/../astral-4.11.1/test_data/song_mammals.424.gene.tre
Note: the order of the options for the `[tree1]` section can differ from the above example.
PhyloPhlAn2 now allows you to assign to each bin that comes from a metagenomic assembly analysis its closest species-level genome bins (SGBs, see Pasolli, E et al. Cell (2019) for further details).
The only mandatory parameter is `-i`, followed by the name of the input directory that contains the bins, for example:
#!bash
phylophlan2_metagenomic.py -i input_folder
Other parameters that can be specified are:
- `-o`: sets the output prefix used for the two output directories and the output file. If not specified, the prefix is `<input_folder>`, as in the previous example, where the two output folders will be `input_folder_dists` and `input_folder_sketches`, and the output file will be `input_folder.tsv`
- `-n`: sets how many SGBs (sorted by increasing average genomic distance) will be reported for each input bin in the output file. The keyword `--all` is also accepted. If not specified, the default is `10`
- `--nproc`: sets how many CPUs can be used. If not specified, the default is `1`
A practical example of its usage is given in the example 3. Metagenomics
The `phylophlan2_metagenomic.py` script has three different types of outputs: (1) list of the top `-n/--how_many` SGBs sorted by their average Mash distance, (2) closest SGB, GGB, FGB, and reference genomes, and (3) "all vs. all" matrix of all pairwise Mash distances.
Output 1
Each line reports the bin name and the list of the closest SGBs (sorted by their increasing average Mash distance) in a tab-separated fashion. The fields of each SGB are separated by `:`. For example:
my_bin (k|u)SGB_ID:taxa_level:taxonomy:average_mash_distance [(k|u)SGB_ID:taxa_level:taxonomy:average_mash_distance]
Where:
- `my_bin` is the input bin name;
- `(k|u)SGB_ID` is the SGB ID and starts with either `k` or `u` to indicate whether it is a known or an unknown SGB;
- `taxa_level` can be either `Species`, `Genus`, `Family`, or `Phylum`, depending on the taxonomic level at which the SGB has been assigned;
- `taxonomy` is the full taxonomic label assigned to the SGB;
- `average_mash_distance` is the average Mash distance of the input bin w.r.t. all the genomes in the SGB.
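A line in this format can be split back into its fields with a few lines of Python. The parser below is a hypothetical sketch based only on the layout described above; the example line, including the SGB IDs, taxonomy strings, and distances, is invented for illustration.

```python
from typing import NamedTuple


class SGBHit(NamedTuple):
    sgb_id: str
    known: bool                 # True for kSGB, False for uSGB
    taxa_level: str
    taxonomy: str
    average_mash_distance: float


def parse_line(line):
    """Split one tab-separated output line into the bin name and its SGB hits."""
    bin_name, *entries = line.rstrip("\n").split("\t")
    hits = []
    for entry in entries:
        # The first two fields and the last one are split off around the taxonomy
        sgb_id, taxa_level, rest = entry.split(":", 2)
        taxonomy, distance = rest.rsplit(":", 1)
        hits.append(SGBHit(sgb_id, sgb_id.startswith("k"), taxa_level,
                           taxonomy, float(distance)))
    return bin_name, hits


# Invented example line, for illustration only
EXAMPLE = ("my_bin\t"
           "kSGB_1234:Species:k__Bacteria|p__Firmicutes|s__Example_species:0.031\t"
           "uSGB_5678:Genus:k__Bacteria|p__Firmicutes|g__Example_genus:0.112")

bin_name, hits = parse_line(EXAMPLE)
print(bin_name)
for hit in hits:
    print(hit)
```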
Output 2
Similar to Output 1, with the difference that the information reported is for the closest SGB, then the closest GGB, followed by the closest FGB, and finally the closest reference genomes, according to their respective Mash distances.
Output 3
In this case, `phylophlan2_metagenomic.py` produces a square matrix of all pairwise distances among the input bins only.
Note: this works with up to 100,000 input bins. If you have more than that, you should divide your input bins into batches of no more than 100,000 bins each.
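Splitting a large set of bins into batches can be done with a short generator. The Python sketch below is a generic illustration with a hypothetical function name; it only groups the bin names and leaves the copying or linking of the files into per-batch input folders to the user.

```python
from itertools import islice


def batches(items, size=100_000):
    """Yield consecutive lists of at most `size` items."""
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


# Made-up bin names, for illustration only
bin_names = [f"bin_{i}.fna" for i in range(250_000)]
for number, batch in enumerate(batches(bin_names), start=1):
    print(f"batch {number}: {len(batch)} bins")
```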
This feature is used for getting reference genomes of specified species. This is particularly useful when you need to build a tree to phylogenetically compare your samples with existing ones. The only mandatory parameter is `-g <label>`, used to specify the taxonomic label for which you need to download the set of references. The `<label>` must represent a valid taxonomic level or the special case `all`, and is best used as:
- `-g s__<species_name>` in low diversity trees. A practical example of its usage is given in 1. Phylogenetically characterized isolate genomes of a given species (S. aureus)
- `-g all` in high diversity trees. A practical example of its usage is given in 2. Tree of life
This script can be used to analyze trees generated with `phylophlan2.py`: it outputs a number of subtrees built according to the relative similarity of nodes, where similarity is defined by two criteria whose stringency the user can set through two parameters:
- `--p_thr <num>`: defines the phylogenetic threshold to test on the tree. The phylogenetic distance between any two nodes from the same subtree will be less than this threshold;
- `--m_thr <num>`: defines the mutation rate threshold to test on the tree. The mutation rate between any two nodes from the same subtree will be less than this threshold.
You need to provide the tree file with `-i` and the mutation rates table with `-m`:
phylophlan2_strain_finder.py -i <input_tree> --tree_format <input_tree_format> -m <mutation_rates.tsv>
Notice that you have the `<mutation_rates.tsv>` table only if you use the parameter `--mutation_rates` when executing `phylophlan2.py`, as explained above.
Drawing heatmaps to visualize the output from phylophlan2_metagenomic.py (`phylophlan2_draw_metagenomic.py` usage)
The `phylophlan2_draw_metagenomic.py` script can be used to visualize the results obtained from `phylophlan2_metagenomic.py` and its basic usage is:
phylophlan2_draw_metagenomic.py -i <output_metagenomic> --map <bin2meta.tsv>
where:
- `<output_metagenomic>` is the output file generated by `phylophlan2_metagenomic.py` as detailed above;
- `<bin2meta.tsv>`, passed with the `--map` parameter, is a mapping file that links each bin to the metagenome it has been reconstructed from. It is a tab-separated file where the input bins are in the first column and metagenomes in the second column.
Note: when building the mapping file make sure the names used for bins are consistent with the ones used as inputs with `phylophlan2_metagenomic.py`.
A usage example of `phylophlan2_draw_metagenomic.py` is given in the example 3. Metagenomics
- Python (version >=3.0)
- NumPy (version >=1.12.1)
- Biopython (version >=1.70)
- DendroPy (version >=4.2.0)
PhyloPhlAn2 also needs the following tools:
- At least one phylogenetic inference software tool: RAxML, FastTree, IQ-TREE
- At least one multiple sequence alignment tool: MUSCLE, MAFFT, Opal, UPP
- trimAl for the trimming of the multiple sequence alignment (optional)
- blast+ for database building and mapping of nucleotides databases
- USEARCH and/or DIAMOND for database building and mapping of nucleotides and/or amino acids databases.
We have noticed that DIAMOND occasionally crashes, most likely because of temporary files that were not removed. If PhyloPhlAn2 crashes while running DIAMOND, remove the last directory that has been generated in the output/tmp folder and re-launch the main command to restart PhyloPhlAn2 from where it failed.
In general, since PhyloPhlAn2 is a pipeline that interacts with external software, the failure of one of the steps may occasionally interrupt the execution of phylophlan2.py. To continue the analysis, we advise trying the following, in this order:
- Remove the last directory that has been generated in the output/tmp folder and re-launch the main command to restart PhyloPhlAn2 from where it failed, so as not to lose the computation made up to that point.
- If the previous solution did not work, execute the command that is crashing but replace the -i parameter with -c in order to delete both the output and output/tmp folders.
- If the previous solution did not work, execute the command with --clean_all; this will remove all installation and database files that are automatically generated at the first run of PhyloPhlAn2.
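The first step above can be sketched in Python as follows. This is only an illustration under assumptions: it takes "the last directory that has been generated" to mean the sub-directory with the most recent modification time, and the function names are hypothetical. By default nothing is deleted, so the selected directory can be checked first.

```python
import os
import shutil


def newest_subdirectory(tmp_folder):
    """Return the path of the most recently modified sub-directory, or None."""
    candidates = [
        entry.path for entry in os.scandir(tmp_folder) if entry.is_dir()
    ]
    if not candidates:
        return None
    return max(candidates, key=os.path.getmtime)


def remove_newest_subdirectory(tmp_folder, dry_run=True):
    """Delete the newest sub-directory; with dry_run=True only report it."""
    target = newest_subdirectory(tmp_folder)
    if target is not None and not dry_run:
        shutil.rmtree(target)
    return target
```

Calling `remove_newest_subdirectory("output/tmp")` only returns the candidate path; passing `dry_run=False` actually removes it, after which the main command can be re-launched.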
This is the main PhyloPhlAn2 script; other information is available here
usage: phylophlan2.py [-h] [-i PROJECT_NAME | -c CLEAN] [-o OUTPUT]
[-d DATABASE] [-t {n,a}] [-f CONFIG_FILE] --diversity
{low,medium,high} [--accurate | --fast] [--clean_all]
[--database_list] [-s SUBMAT] [--submat_list]
[--submod_list] [--nproc NPROC]
[--min_num_proteins MIN_NUM_PROTEINS]
[--min_len_protein MIN_LEN_PROTEIN]
[--min_num_markers MIN_NUM_MARKERS]
[--trim {gappy,not_variant,greedy}]
[--not_variant_threshold NOT_VARIANT_THRESHOLD]
[--subsample {phylophlan,onethousand,sevenhundred,\
fivehundred,threehundred,onehundred,fifty,twentyfive,\
tenpercent,twentyfivepercent,fiftypercent}]
[--unknown_fraction UNKNOWN_FRACTION]
[--scoring_function {trident,muscle,random}] [--sort]
[--remove_fragmentary_entries]
[--fragmentary_threshold FRAGMENTARY_THRESHOLD]
[--min_num_entries MIN_NUM_ENTRIES] [--maas MAAS]
[--remove_only_gaps_entries] [--mutation_rates]
[--force_nucleotides] [--input_folder INPUT_FOLDER]
[--data_folder DATA_FOLDER]
[--databases_folder DATABASES_FOLDER]
[--submat_folder SUBMAT_FOLDER]
[--submod_folder SUBMOD_FOLDER]
[--configs_folder CONFIGS_FOLDER]
[--output_folder OUTPUT_FOLDER]
[--genome_extension GENOME_EXTENSION]
[--proteome_extension PROTEOME_EXTENSION] [--verbose]
[-v]
optional arguments:
-h, --help show this help message and exit
-i PROJECT_NAME, --input PROJECT_NAME
-c CLEAN, --clean CLEAN
Clean the output and partial data produced for the
specified project (default: None)
-o OUTPUT, --output OUTPUT
Output folder name, otherwise it will be the name of
the input folder concatenated with the name of the
database used (default: None)
-d DATABASE, --database DATABASE
The name of the database of markers to use. (default:
None)
-t {n,a}, --db_type {n,a}
Specify the type of the database of markers, where "n"
stands for nucleotides and "a" for amino acids. If not
specified, PhyloPhlAn2 will automatically detect the
type of database (default: None)
-f CONFIG_FILE, --config_file CONFIG_FILE
The configuration file to load, four ready-to-use
configuration files can be generated using the
"write_default_configs.sh" script present in the
"configs" folder (default: None)
--diversity {low,medium,high}
Specify the "diversity" of phylogeny to build in order
to automatically adjust some parameters. Values can
be: "low": for genus-/species-/strain-level
phylogenies; "medium": for class-/order-level
phylogenies; "high": for tree-of-life size phylogenies
(default: None)
--accurate If specified will set some parameters that should
provide a more accurate phylogeny reconstruction.
Affected parameters vary depending also on the "--
diversity" parameter (default: False)
--fast If specified will set some parameters that should
provide a faster phylogeny reconstruction. Affected
parameters vary depending also on the "--diversity"
parameter (default: False)
--clean_all Remove all installation and database files that are
automatically generated at the first run of PhyloPhlAn
(default: False)
--database_list List of all the available databases that can be
specified with the -d (or --database) option (default:
False)
-s SUBMAT, --submat SUBMAT
Specify the substitution matrix to use, the available
substitution matrices can be listed using the "--
submat_list" parameter (default: None)
--submat_list List of all the available substitution matrices that
can be specified with the --submat option (default:
False)
--submod_list List of all the available substitution models that can
be specified with the --maas option (default: False)
--nproc NPROC The number of CPUs to use (default: 1)
--min_num_proteins MIN_NUM_PROTEINS
Proteomes (.faa) with less than this number of
proteins will be discarded (default: 1)
--min_len_protein MIN_LEN_PROTEIN
Proteins in proteomes (.faa) shorter than this value
will be discarded (default: 50)
--min_num_markers MIN_NUM_MARKERS
Input genomes or proteomes that map to less than the
specified number of markers will be discarded
(default: 0)
--trim {gappy,not_variant,greedy}
Specify which type of trimming to perform: "gappy"
will perform what specified in the "trim" section of
the configuration file to remove gappy columns
(suggested, trimal --gappyout); "not_variant" will
remove columns that have at least one nucleotide/amino
acid appearing above a certain threshold (see "--
not_variant_threshold" parameter); "greedy" performs
both "gappy" and "not_variant"; "None", no trimming
will be performed (default: None)
--not_variant_threshold NOT_VARIANT_THRESHOLD
Specify the value used to consider a column not
variant when "--trim not_variant" is specified
(default: 0.99)
--subsample {phylophlan,onethousand,sevenhundred,\
fivehundred,threehundred,onehundred,fifty,twentyfive,\
tenpercent,twentyfivepercent,fiftypercent}
Specify which function to use to compute the number of
positions to retain from single marker MSAs for the
concatenated MSA. "phylophlan" compute the number of
position for each marker as in PhyloPhlAn (almost!)
(works only when --database phylophlan); "onethousand"
return the top 1000 positions; "sevenhundred" return
the top 700; "fivehundred" return the top 500;
"threehundred" return the top 300; "onehundred" return
the top 100 positions; "fifty" return the top 50
positions; "twentyfive" return the top 25 positions;
"None", the complete alignment will be used (default:
None)
--unknown_fraction UNKNOWN_FRACTION
Define the amount of unknowns ("X" and "-") allowed in
each column of the MSA of the markers (default: 0.3)
--scoring_function {trident,muscle,random}
Specify which scoring function to use to evaluate
columns in the MSA results (default: None)
--sort If specified, the markers will be ordered, when using
the PhyloPhlAn database, it will be automatically set
to "True" (default: False)
--remove_fragmentary_entries
If specified the MSAs will be checked and cleaned from
fragmentary entries. See --fragmentary_threshold for
the threshold values above which an entry will be
considered fragmentary (default: False)
--fragmentary_threshold FRAGMENTARY_THRESHOLD
The fraction of gaps in the MSA to be considered
fragmentary and hence discarded (default: 0.85)
--min_num_entries MIN_NUM_ENTRIES
The minimum number of entries to be present for each
of the markers in the database (default: 4)
--maas MAAS Select a mapping file that specifies the substitution
model of amino acid to use for each of the markers for
the gene tree reconstruction. File must be tab-
separated (default: None)
--remove_only_gaps_entries
If specified, entries in the MSAs composed only of
gaps ("-") will be removed. This is equivalent to
specifying: --remove_fragmentary_entries
--fragmentary_threshold 1 (default: False)
--mutation_rates If specified will produced a mutation rates table for
each of the aligned markers and a summary table for
the concatenated MSA. This operation can take a long
time to finish (default: False)
--force_nucleotides If specified force PhyloPhlAn to use nucleotide
sequences for the phylogenetic analysis, even in the
case of a database of amino acids (default: False)
--verbose Makes PhyloPhlAn verbose (default: False)
-v, --version Prints the current PhyloPhlAn version and exit
Folder paths:
Parameters for setting the folder locations
--input_folder INPUT_FOLDER
Path to the folder containing the input data (default:
input/)
--data_folder DATA_FOLDER
Path to the folder where to store the intermediate
files, default is "tmp" inside the project's output
folder (default: None)
--databases_folder DATABASES_FOLDER
Path to the folder that contains the database files
(default: phylophlan_databases/)
--submat_folder SUBMAT_FOLDER
Path to the folder containing the substitution
matrices to use to compute the column score for the
subsampling step (default:
phylophlan_substitution_matrices/)
--submod_folder SUBMOD_FOLDER
Path to the folder containing the substitution models
mapping file for building the gene trees (default:
phylophlan_substitution_models/)
--configs_folder CONFIGS_FOLDER
Path to the folder containing the configuration files
that contains the software to use for the phylogenetic
analysis (default: phylophlan_configs/)
--output_folder OUTPUT_FOLDER
Path to the output folder where to save the results
(default: )
Filename extensions:
Parameters for setting the extensions of the input files
--genome_extension GENOME_EXTENSION
Set the extension for the genomes in your inputs
(default: .fna)
--proteome_extension PROTEOME_EXTENSION
Set the extension for the proteomes in your inputs
(default: .faa)
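The usage above implies a few constraints: --diversity is always required, -i and -c are mutually exclusive, and so are --accurate and --fast. The following sketch shows one way to assemble a command line that respects them. It is a hypothetical helper and not part of PhyloPhlAn2. The function name and its arguments are assumptions, and only flags listed in the help above are used.

```python
def build_phylophlan2_command(project, diversity, database=None,
                              config_file=None, mode=None, nproc=1,
                              clean=False):
    """Return the argument list for a phylophlan2.py call."""
    if diversity not in ("low", "medium", "high"):
        raise ValueError("diversity must be low, medium, or high")
    if mode not in (None, "accurate", "fast"):
        raise ValueError("mode must be None, 'accurate', or 'fast'")
    # -i and -c are mutually exclusive: exactly one of them is emitted.
    command = ["phylophlan2.py", "-c" if clean else "-i", project,
               "--diversity", diversity]
    if database is not None:
        command += ["-d", database]
    if config_file is not None:
        command += ["-f", config_file]
    if mode is not None:
        # --accurate and --fast are mutually exclusive as well.
        command.append("--" + mode)
    command += ["--nproc", str(nproc)]
    return command
```

The resulting list can be passed to `subprocess.run(command, check=True)` to launch the analysis.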
This script builds a custom database and should be used when neither of the two provided databases is suitable. The output is a folder containing the markers, ready to be used in phylophlan2.py through the -d option followed by the name of that folder.
Other information here
usage: phylophlan2_setup_database.py [-h] (-i INPUT | -g GET_CORE_PROTEINS)
[-o OUTPUT] [-d DB_NAME]
[-e INPUT_EXTENSION] [-t {n,a}]
[-x OUTPUT_EXTENSION] [--overwrite]
[--verbose] [-v]
optional arguments:
-h, --help show this help message and exit
-i INPUT, --input INPUT
Specify the path to either the folder containing the
marker files or the file of markers, in (multi-)fasta
format (default: None)
-g GET_CORE_PROTEINS, --get_core_proteins GET_CORE_PROTEINS
Specify the taxonomic label for which download the set
of core proteins. The label must represent a species:
"--get_core_proteins s__Escherichia_coli" (default:
None)
-o OUTPUT, --output OUTPUT
Specify path to the output folder where to save the
database (default: None)
-d DB_NAME, --db_name DB_NAME
Specify the name of the output database (default:
None)
-e INPUT_EXTENSION, --input_extension INPUT_EXTENSION
Specify the extension of the input file(s) specified
via -i/--input (default: None)
-t {n,a}, --db_type {n,a}
Specify the type of the database, where "n" stands for
nucleotides and "a" for amino acids (default: None)
-x OUTPUT_EXTENSION, --output_extension OUTPUT_EXTENSION
Set the database output extension (default: None)
--overwrite If specified and the output file exists it will be
overwritten (default: False)
--verbose Prints more stuff (default: False)
-v, --version Prints the current phylophlan2_setup_database.py
version and exit
This script allows the user to customize the phylogenetic analysis by creating a personalized configuration file, choosing which software to use for every mandatory section among the available ones, as seen above. The output is a text file, so to customize the parameters of the selected software according to the type of analysis to be executed, simply open the generated configuration file with a text editor and add or remove the specific options. Other information here
usage: phylophlan2_write_config_file.py [-h] -o OUTPUT -d {n,a}
(--db_dna {makeblastdb} | --db_aa {usearch,diamond})
[--map_dna {blastn,tblastn,diamond}]
[--map_aa {usearch,diamond}] --msa
{muscle,mafft,opal,upp}
[--trim {trimal}]
[--gene_tree1 {fasttree,raxml,iqtree}]
[--gene_tree2 {raxml}] --tree1
{fasttree,raxml,iqtree,astral,astrid}
[--tree2 {raxml}] [-a]
[--force_nucleotides] [--overwrite]
[--verbose] [-v]
optional arguments:
-h, --help show this help message and exit
-o OUTPUT, --output OUTPUT
Specify the output file where to write the
configurations (default: None)
-d {n,a}, --db_type {n,a}
Specify the type of the database, where "n" stands for
nucleotides and "a" for amino acids (default: None)
--db_dna {makeblastdb}
Add the "db_dna" section of the selected software that
will be used for building the indexed database
(default: None)
--db_aa {usearch,diamond}
Add the "db_aa" section of the selected software that
will be used for building the indexed database
(default: None)
--map_dna {blastn,tblastn,diamond}
Add the "map_dna" section of the selected software
that will be used for mapping the database against the
input genomes (default: None)
--map_aa {usearch,diamond}
Add the "map_aa" section of the selected software that
will be used for mapping the database against the
input proteomes (default: None)
--msa {muscle,mafft,opal,upp}
Add the "msa" section of the selected software that
will be used for producing the MSAs (default: None)
--trim {trimal} Add the "trim" section of the selected software that
will be used for the gappy regions removal of the MSAs
(default: None)
--gene_tree1 {fasttree,raxml,iqtree}
Add the "gene_tree1" section of the selected software
that will be used for building the phylogenies for the
markers in the database (default: None)
--gene_tree2 {raxml} Add the "gene_tree2" section of the selected software
that will be used for refining the phylogenies
previously built with what specified in the
"gene_tree1" section (default: None)
--tree1 {fasttree,raxml,iqtree,astral,astrid}
Add the "tree1" section of the selected software that
will be used for building the first phylogeny
(default: None)
--tree2 {raxml} Add the "tree2" section of the selected software that
will be used for refining the phylogeny previously
built with what specified in the "tree1" section
(default: None)
-a, --absolute_path Write the absolute path to the executable instead of
the executable name as found in the system path
environment (default: False)
--force_nucleotides If specified sets parameters for phylogenetic analysis
software so that they use nucleotide sequences, even
in the case of a database of amino acids (default:
None)
--overwrite Overwrite output file if it exists (default: False)
--verbose Prints more stuff (default: False)
-v, --version Prints the current phylophlan2_write_config_file.py
version and exit
This script assigns each bin that comes from a metagenomic assembly analysis to its closest species-level genome bins (SGBs). This is particularly useful when the user needs to identify bins assembled from metagenomes. The main output file to consider is a TSV file that reports, for each input bin, information about the SGBs it has been assigned to. Other information here
usage: phylophlan2_metagenomic.py [-h] -i INPUT [-o OUTPUT_PREFIX]
[-d DATABASE] [-m MAPPING]
[-e INPUT_EXTENSION] [-n HOW_MANY]
[--nproc NPROC]
[--database_folder DATABASE_FOLDER]
[--only_input] [--add_ggb] [--add_fgb]
[--overwrite] [--verbose] [-v]
optional arguments:
-h, --help show this help message and exit
-i INPUT, --input INPUT
Input folder containing the metagenomic bins to be
indexed (default: None)
-o OUTPUT_PREFIX, --output_prefix OUTPUT_PREFIX
Prefix used for the output folders: indexed bins,
distance estimations. If not specified, the input
folder will be used (default: None)
-d DATABASE, --database DATABASE
Specify the name of the database, if not found locally
will be automatically downloaded (default: SGB.Jan19)
-m MAPPING, --mapping MAPPING
Specify the name of the mapping file, if not found
locally will be automatically downloaded (default:
SGB.Jan19)
-e INPUT_EXTENSION, --input_extension INPUT_EXTENSION
Specify the extension of the input file(s) specified
via -i/--input. If not specified will try to infer it
from the input files (default: None)
-n HOW_MANY, --how_many HOW_MANY
Specify the number of SGBs to report in the output;
"all" is a special value to report all the SGBs; this
param is not used when "--only_input" is specified
(default: 10)
--nproc NPROC The number of CPUs to use (default: 1)
--database_folder DATABASE_FOLDER
Path to the folder that contains the database file
(default: phylophlan2_databases/)
--only_input If specified provides a distance matrix between only
the input genomes provided (default: False)
--add_ggb If specified adds GGB assignments. If specified with
--add_fgb, then -n/--how_many will be set to 1 and
will be adding a column that reports the closest
reference genome (default: False)
--add_fgb If specified adds FGB assignments. If specified with
--add_ggb, then -n/--how_many will be set to 1 and
will be adding a column that reports the closest
reference genome (default: False)
--overwrite If specified overwrite the output file if exists
(default: False)
--verbose Prints more stuff (default: False)
-v, --version Prints the current phylophlan2_metagenomic.py version
and exit
This script is used to get the reference genomes of a specified species. This is particularly useful when the user needs to build a tree to compare samples with existing genomes. When using the -g parameter, the output will be a directory with the requested genomes. Other information here
usage: phylophlan2_get_reference.py [-h] (-g GET | -l)
[-e OUTPUT_FILE_EXTENSION] [-o OUTPUT]
[-n HOW_MANY] [-m GENBANK_MAPPING]
[--verbose] [-v]
optional arguments:
-h, --help show this help message and exit
-g GET, --get GET Specify the taxonomic label for which download the set
of reference genomes. The label must represents a
valid taxonomic level or the special case "all"
(default: None)
-l, --list_clades Print for all taxa the total number of species and
reference genomes available (default: False)
-e OUTPUT_FILE_EXTENSION, --output_file_extension OUTPUT_FILE_EXTENSION
Specify path to the extension of the output files
(default: .fna.gz)
-o OUTPUT, --output OUTPUT
Specify path to the output folder where to save the
files, required when -g/--get is specified (default:
None)
-n HOW_MANY, --how_many HOW_MANY
Specify how many reference genomes to download for each species,
where -1 stands for "all available" (default: 4)
-m GENBANK_MAPPING, --genbank_mapping GENBANK_MAPPING
The local GenBank mapping file, if not found it will
be automatically downloaded (default:
assembly_summary_genbank.txt)
--verbose Prints more stuff (default: False)
-v, --version Prints the current phylophlan2_get_reference.py version
and exit
This script can be used to analyze trees built with phylophlan2.py. The output is a table that lists the subtrees and, for each subtree, the minimum, mean, and maximum distance between its nodes, the minimum, mean, and maximum mutation rate between its nodes, and the distance and mutation rate between each pair of nodes.
Other information here
usage: phylophlan2_strain_finder.py [-h] -i INPUT -m MUTATION_RATES
[--p_threshold P_THRESHOLD]
[--m_threshold M_THRESHOLD]
[--tree_format {newick,nexus,phyloxml,cdao,nexml}]
[-o OUTPUT] [--overwrite] [-s {,, ,;}]
[--verbose] [-v]
optional arguments:
-h, --help show this help message and exit
-i INPUT, --input INPUT
Specify the file of the phylogenetic tree as generated from
phylophlan2.py (default: None)
-m MUTATION_RATES, --mutation_rates MUTATION_RATES
Specify the file of the mutation rates as generated from
phylophlan2.py (default: None)
--p_threshold P_THRESHOLD
Maximum phylogenetic distance threshold for every pair
of nodes in the same subtree (inclusive) (default:
0.05)
--m_threshold M_THRESHOLD
Maximum mutation rate ratio for every pair of nodes in
the same subtree (inclusive) (default: 0.05)
--tree_format {newick,nexus,nexml,phyloxml,cdao}
Specify the format of the input tree. (default:
NEWICK)
-o OUTPUT, --output OUTPUT
Specify the name of the output filename, if not specified
default is stdout (default: None)
--overwrite If specified, will overwrite the output file if it
exists (default: False)
-s {",","\t",";"}, --separator {",","\t",";"}
Specify the separator you want in the output (default:
"\t")
--verbose Write more stuff (default: False)
-v, --version Prints the current phylophlan2_strain_finder.py version
and exit
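The two thresholds described in the help above are inclusive and apply to every pair of nodes in a subtree. The following sketch illustrates that condition together with the minimum, mean, and maximum summary. It is not the implementation used by phylophlan2_strain_finder.py: the function names, the dictionaries keyed by sorted node pairs, and the example values used with them are all hypothetical.

```python
from itertools import combinations
from statistics import mean


def within_thresholds(nodes, distance, mutation_rate,
                      p_threshold=0.05, m_threshold=0.05):
    """True if every pair of nodes is within both thresholds (inclusive)."""
    for pair in combinations(sorted(nodes), 2):
        if distance[pair] > p_threshold:
            return False
        if mutation_rate[pair] > m_threshold:
            return False
    return True


def summarize(values):
    """Return the (minimum, mean, maximum) of a list of numbers."""
    values = list(values)
    return min(values), mean(values), max(values)
```

Because the comparison uses `>`, a pair whose distance or mutation rate is exactly equal to the threshold is still accepted.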
This script can be used to visualize the results obtained with phylophlan2_metagenomic.py. The outputs are two heatmaps, one showing the presence/absence of the top SGBs (customizable through --top) in the metagenomes, the other showing the number of kSGBs and uSGBs in each metagenome, plus two corresponding output files containing the data used to build them.
Other information here
usage: phylophlan2_draw_metagenomic.py [-h] -i INPUT --map MAP [--top TOP]
[-o OUTPUT] [-s SEPARATOR] [--dpi DPI]
[-f F] [--verbose] [-v]
optional arguments:
-h, --help show this help message and exit
-i INPUT, --input INPUT
Specify the input tsv file generated from
‘phylophlan2_metagenomic.py’ (default: None)
--map MAP Specify a mapping tsv file that maps for each bin its
metagenome (default: None)
--top TOP Specify the number of SGBs to display in the figure,
if not specified is set to 20 (default: 20)
-o OUTPUT, --output OUTPUT
Specify the prefix of the output file and image,
otherwise it will be set to default to output_heatmap
(default: output_heatmap)
-s SEPARATOR, --separator SEPARATOR
Specify the separator used in the mapping file,
default is tab (default: '\t')
--dpi DPI Specify the dpi of the images. Default is 200
(default: 200)
-f F Specify deisired format for images. Default is svg
(default: svg)
--verbose Prints more stuff (default: False)
-v, --version Prints the current phylophlan2_draw_metagenomic.py
version and exit
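To make the presence/absence heatmap more concrete, the sketch below computes such a matrix from two already-parsed dictionaries. It is a hypothetical illustration and not the code of phylophlan2_draw_metagenomic.py. It assumes one SGB per bin and ranks the "top" SGBs by the number of metagenomes in which they occur, which may differ from what the script does.

```python
from collections import Counter


def presence_absence(bin2sgb, bin2meta, top=20):
    """Return (top_sgbs, metagenomes, matrix) with 1/0 presence values."""
    # Collect the set of SGBs observed in each metagenome.
    observed = {}
    for bin_name, sgb in bin2sgb.items():
        observed.setdefault(bin2meta[bin_name], set()).add(sgb)
    # Assumption: rank SGBs by how many metagenomes contain them,
    # breaking ties alphabetically so the result is deterministic.
    counts = Counter(sgb for sgbs in observed.values() for sgb in sgbs)
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    top_sgbs = [sgb for sgb, _ in ranked[:top]]
    metagenomes = sorted(observed)
    # One row per SGB, one column per metagenome.
    matrix = [
        [int(sgb in observed[meta]) for meta in metagenomes]
        for sgb in top_sgbs
    ]
    return top_sgbs, metagenomes, matrix
```

Each row of the returned matrix corresponds to one SGB and each column to one metagenome, which is the layout a heatmap function would typically expect.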