PhyloPhlAn 3.0

PhyloPhlAn2 is an integrated pipeline for large-scale phylogenetic profiling of genomes and metagenomes.

Most likely the easiest way to understand how you can use PhyloPhlAn2 in your analysis is to check out the examples below:

In PhyloPhlAn2 the markers database used to extract the phylogenetic signal can be one of the two provided (PhyloPhlAn and AMPHORA2) or one defined by the user. A script that formats a user-defined database for use with PhyloPhlAn2 is provided. The phylogenetic pipeline can be entirely configured through a set of input parameters and a configuration file, in which the user specifies the preferred software for each step of the pipeline; scripts are provided to generate configuration files for different phylogenetic analyses. In addition, several parameters control the trimming and subsampling of the alignments produced, for instance the type of trimming, the function used to score the multiple-sequence alignments, and the function that selects how many positions to consider for each multiple-sequence alignment.

PhyloPhlAn ver. 1.0: If you are looking for the tutorial of PhyloPhlAn ver. 1.0, have a look here

[TOC]

Installation

There are three installation methods available. We recommend the Conda-based ones, which guarantee that all PhyloPhlAn2 dependencies are satisfied automatically.

Conda package [easy]

This option will be available soon

~~This requires a working Conda installation.~~

#!bash

conda install phylophlan2

~~Potentially add -c bioconda in case the Bioconda channel is not in your default channels list.~~

Conda environment [medium]

This installation procedure does not install the external tools. Follow it if you wish to install and configure the tools independently.

Step 1: Clone the ppa2 conda environment

This requires a working Conda installation.

#!bash

conda env create fasnicar/ppa2

Step 2: Clone the PhyloPhlAn repository

This requires Mercurial.

#!bash

hg clone https://bitbucket.org/nsegata/phylophlan

PhyloPhlAn2 is at the moment available only in the dev branch of the repository, so you should change branch after having cloned the repository with cd phylophlan && hg up dev

Step 3: Install the Dependencies and Tools necessary to run the PhyloPhlAn2 pipeline

Step 4: Activate the ppa2 conda environment

Before running PhyloPhlAn2, you need to activate the conda environment:

#!bash

conda activate ppa2

If correctly activated, at the beginning of the command line you should see the (ppa2) prefix.

Note: if you follow the medium installation process you should be aware that every command has to be preceded by ./, including those in the bash files provided for each example that execute all commands

Repository from Bitbucket [hard]

Step 1: Get the latest PhyloPhlAn2 version from the repository

This requires Mercurial.

#!bash

hg clone https://bitbucket.org/nsegata/phylophlan

Step 2: Install the Dependencies and Tools necessary to run the PhyloPhlAn2 pipeline

Note: if you follow the hard installation process you should be aware that every command has to be preceded by ./, including those in the bash files provided for each example that execute all commands

Test PhyloPhlAn2 installation

In order to verify that PhyloPhlAn2 is properly installed, the following command:

#!bash

phylophlan2.py --version

should output something like below:

PhyloPhlAn2 version 0.40 (10 September 2019)

Note: if you have followed the medium or hard installation method in order to use PhyloPhlAn2 you should do one of the following.

Change to the repository you have cloned:

#!bash

cd phylophlan/

Alternatively, append the following line to the .bashrc file in your home directory, where <path-to-directory> is the absolute path to the repository you have cloned:

#!bash

export PATH="<path-to-directory>:$PATH"

Basic usage

#!bash

phylophlan2.py -i <input_folder> -d <database> --diversity <low-medium-high> -f <configuration_file>

where:

<input_folder> is the folder containing your input genomes and/or proteomes, a detailed description is available here
<database> is the name of the database of markers to use, a detailed description is available here
--diversity takes a value in {low, medium, high} and is used to automatically tune the analysis to the type of phylogeny to build, a detailed description is available here
<configuration_file> is the path to the configuration file necessary to properly run PhyloPhlAn2, a detailed description is available here

Input Files

PhyloPhlAn2 takes FASTA files (also compressed in Gzip, .gz and/or Bzip2, .bz2) as input. Inputs can be both genomes and proteomes, also mixed, and by default genomes and proteomes are distinguished by the .fna and .faa extension, respectively.

If needed, genomes and proteomes file extensions can be configured using the --genome_extension and --proteome_extension params, respectively.
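
As an illustration of the naming convention above, the following Python sketch separates a list of file names into genomes and proteomes. The function `classify_inputs` and its arguments are hypothetical and are not part of PhyloPhlAn2; only the default extensions and the compressed suffixes are taken from the text.

```python
def classify_inputs(filenames, genome_ext=".fna", proteome_ext=".faa"):
    """Split file names into (genomes, proteomes) by extension.

    Hypothetical helper, not PhyloPhlAn2 code: it only mirrors the
    convention described above, ignoring a trailing .gz or .bz2.
    """
    genomes, proteomes = [], []
    for name in filenames:
        base = name
        # Strip one compression suffix, if present
        for suffix in (".gz", ".bz2"):
            if base.endswith(suffix):
                base = base[: -len(suffix)]
                break
        if base.endswith(genome_ext):
            genomes.append(name)
        elif base.endswith(proteome_ext):
            proteomes.append(name)
        # Anything else is not recognized as an input and is skipped
    return genomes, proteomes


genomes, proteomes = classify_inputs(
    ["a.fna", "b.faa.gz", "c.fna.bz2", "notes.txt"]
)
print(genomes)    # ['a.fna', 'c.fna.bz2']
print(proteomes)  # ['b.faa.gz']
```

Passing different values for `genome_ext` and `proteome_ext` plays the role that --genome_extension and --proteome_extension play on the real command line.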

Nucleotide or Amino acid pipeline

When using PhyloPhlAn2 the user can customize each step of the pipeline used to build the tree (marker genes identification, multiple sequence alignment, concatenation or gene trees inference and phylogeny reconstruction) by specifying the desired tools in the configuration file. These should be selected according to the type of markers and input used in the analysis and will lead to diverse phylogenetic pipelines:

when both markers and inputs are nucleotides it will proceed in the nucleotide space
when markers are proteins and inputs are a mix of genomes and proteomes it will proceed in translated sequence space, i.e., amino acids. If the input is strictly genomic, one can specify the --force_nucleotides parameter to use a nucleotide pipeline even though the database is an amino acid one. In that case the configuration file should also be created with the --force_nucleotides parameter.

Diversity

The --diversity parameter offers three pre-defined options that set several other PhyloPhlAn2 parameters (e.g., trimming, subsampling, fragmentary removal, etc.) to values suited to the expected diversity of the phylogeny to be built.

The user can choose among three values:

Diversity	Description
`low`	for species- and strain-level phylogenies
`medium`	for genus- and family-level phylogenies
`high`	for tree-of-life and higher-ranked taxonomic levels phylogenies

Accurate or Fast

If not specified, PhyloPhlAn2 will automatically run with the --accurate option, which considers more phylogenetic positions and should therefore yield a more accurate phylogenetic reconstruction.

The --fast option can be specified to have a faster phylogeny pipeline reconstruction.

Both options will affect several other PhyloPhlAn2 params, setting optimal parameters based also on the --diversity chosen by the user. A detailed description is available here.

Output

All PhyloPhlAn2 produced files are available in the <input_folder>_<database> folder (or in the folder specified with --output_folder) created in the directory where the script is run.

Inside there is a temporary folder (<input_folder>_<database>/tmp) that contains all the intermediate files produced during the analysis pipeline.

Depending on the configuration file and hence on the pipeline executed, the resulting output files may have different names.

For instance, using the supermatrix_aa.cfg configuration file that can be automatically generated using the phylophlan2_write_default_configs.sh script provided with PhyloPhlAn2, the output files will be:

Filename	Description
RAxML_bestTree.input_folder_refined.tre	is the final (refined) phylogeny produced by RAxML starting from the FastTree phylogeny
input_folder.tre	is the phylogeny built by FastTree
input_folder.aln	is the multiple sequence alignment used as input for the phylogenies, in FASTA format

Parallel computations

The user can specify the number of CPUs to use with the --nproc parameter:

#!bash

phylophlan2.py -i <input_folder> -d <database> --diversity <low-medium-high> -f <configuration_file> --nproc <N>

Please note that regardless of the number of CPUs specified with --nproc, PhyloPhlAn2 will run:

FastTree with 3 CPUs (as suggested in the FastTree FAQs) and, in any case, this is not regulated by the --nproc param because FastTree uses the OMP_NUM_THREADS variable, which is defined in the configuration file.
RAxML with no more than 20 CPUs, even when --nproc is greater than 20, as in our experience using more than 20 CPUs with RAxML does not shorten the computational time required for the phylogeny reconstruction.

Note: if you specify with --nproc a higher number of CPUs than are available on your machine, you will experience a significant drop in performance, as also reported in the RAxML manual.
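
The two rules above amount to a small piece of arithmetic, sketched here in Python. The function `effective_cpus` is hypothetical; only the values 20 (RAxML cap) and 3 (FastTree threads) come from the text.

```python
def effective_cpus(nproc, raxml_cap=20, fasttree_threads=3):
    """Return the CPUs each tool would use for a given --nproc.

    Illustrative only: this is not PhyloPhlAn2 code.
    """
    return {
        "raxml": min(nproc, raxml_cap),
        # FastTree ignores --nproc; its thread count is set through
        # OMP_NUM_THREADS in the configuration file
        "fasttree": fasttree_threads,
    }


print(effective_cpus(8))   # {'raxml': 8, 'fasttree': 3}
print(effective_cpus(64))  # {'raxml': 20, 'fasttree': 3}
```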

Databases

PhyloPhlAn2 is able to automatically download two databases:

PhyloPhlAn (-d phylophlan, 400 universal marker genes) presented in Segata, N et al. NatComm 4:2304 (2013)
AMPHORA2 (-d amphora2, 136 universal marker genes) presented in Wu M, Scott AJ Bioinformatics 28.7 (2012)

Moreover, in addition to the two databases provided, as explained in the following database setup section, it is possible to retrieve a set of core proteins of a specific species, or even build custom databases starting from either a folder containing marker files or a multi-fasta file containing the marker sequences (e.g., multi-fasta file with the core genes sequences from Roary).

Expert usage

In this section, we provide as many details as possible for the other parameters and configurations available in PhyloPhlAn2.

Input and phylogenetic markers quality control

When building a phylogeny PhyloPhlAn2 makes sure that input genomes/proteomes and markers respect a certain threshold of quality. It is possible to customize these controls through the following parameters:

--min_num_proteins <n>: used to make sure that proteomes (.faa) with less than this number of proteins will be discarded. Default is 1
--min_len_protein <n>: this parameter is associated with the previous --min_num_proteins and it is used to specify the minimum length proteins in proteomes (.faa) should have. Proteins that are shorter than this value will be discarded. Default is 50

The above parameters have no effect when the pipeline is strictly genomic (both markers and input are nucleotides) or when --force_nucleotides is specified in the command line; see this section for more information.

--min_num_markers <n>: input genomes or proteomes that map to fewer than the specified number of markers will be discarded. Default is 0, unless the database specified with -d is phylophlan or amphora2, in which case the default is 100 or 34, respectively
--min_num_entries <n>: database markers that are found in less than the specified number of input entries will be discarded. Default is 4
--remove_fragmentary_entries: if specified the multiple sequence alignment (MSA) will be checked and cleaned from fragmentary entries. See --fragmentary_threshold for the threshold values above which an entry will be considered fragmentary. Default is false
--fragmentary_threshold <n>: used to specify the fraction of gaps in each row in the MSA to be considered fragmentary and hence discarded. Default is 0.85
--remove_only_gaps_entries: if specified, entries in the MSAs composed only of gaps will be removed. This is equivalent to specifying --remove_fragmentary_entries and --fragmentary_threshold 1. Default is false
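
To make the fragmentary-entry check concrete, here is a minimal Python sketch of the idea. It is not PhyloPhlAn2's implementation: the function `remove_fragmentary` is hypothetical, and it assumes that an entry is discarded when its fraction of gaps is greater than or equal to the threshold, so that a threshold of 1 removes only the all-gap entries, as stated for --remove_only_gaps_entries.

```python
def remove_fragmentary(msa, threshold=0.85, gap="-"):
    """Drop aligned entries whose fraction of gaps reaches the threshold.

    msa maps entry names to aligned sequences of equal length.
    Assumption: an entry is discarded when gap_fraction >= threshold.
    """
    kept = {}
    for name, seq in msa.items():
        gap_fraction = seq.count(gap) / len(seq) if seq else 1.0
        if gap_fraction < threshold:
            kept[name] = seq
    return kept


msa = {
    "g1": "MKV-LLA-TT",  # 20% gaps
    "g2": "---------T",  # 90% gaps
    "g3": "----------",  # 100% gaps
}
print(sorted(remove_fragmentary(msa)))               # ['g1']
print(sorted(remove_fragmentary(msa, threshold=1)))  # ['g1', 'g2']
```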

Accurate or Fast

The following table shows which parameters are affected and how their values will automatically change according to the combination of the --diversity and --accurate/--fast parameters.

	`--diversity low`	`--diversity medium`	`--diversity high`
`--accurate`	`--trim not_variant` `--submat pfasum60` `--remove_fragmentary_entries` `--not_variant_threshold 0.99`	`--trim gap_trim` `--remove_fragmentary_entries` `--fragmentary_threshold 0.85` `--submat pfasum60` `--subsample onehundred` `--scoring_function trident`	`--trim greedy` `--remove_fragmentary_entries` `--fragmentary_threshold 0.75` `--submat pfasum60` `--subsample twentyfive` `--scoring_function trident` `--not_variant_threshold 0.95` `--gap_perc_threshold 0.85`
`--fast`	`--trim greedy` `--remove_fragmentary_entries` `--fragmentary_threshold 0.85` `--submat pfasum60` `--subsample fivehundred` `--scoring_function trident` `--gap_perc_threshold 0.67`	`--trim greedy` `--remove_fragmentary_entries` `--fragmentary_threshold 0.75` `--submat pfasum60` `--subsample fifty` `--scoring_function trident` `--not_variant_threshold 0.97` `--gap_perc_threshold 0.75`	`--trim greedy` `--remove_fragmentary_entries` `--fragmentary_threshold 0.67` `--submat pfasum60` `--subsample phylophlan` or `--subsample tenpercent` `--scoring_function trident` `--not_variant_threshold 0.9` `--gap_perc_threshold 0.85`

Note: any of the above parameters that you specify manually on the command line will override the automatic value set for it by the combination of --diversity and --accurate/--fast.
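
The override behaviour can be pictured as a dictionary lookup followed by an update. The Python sketch below transcribes two cells of the table (`--accurate` and `--fast` with `--diversity low`); the names `PRESETS` and `resolve_parameters` are hypothetical and the code is not taken from PhyloPhlAn2.

```python
# Two cells of the table above, keyed by (mode, diversity)
PRESETS = {
    ("accurate", "low"): {
        "trim": "not_variant",
        "submat": "pfasum60",
        "remove_fragmentary_entries": True,
        "not_variant_threshold": 0.99,
    },
    ("fast", "low"): {
        "trim": "greedy",
        "remove_fragmentary_entries": True,
        "fragmentary_threshold": 0.85,
        "submat": "pfasum60",
        "subsample": "fivehundred",
        "scoring_function": "trident",
        "gap_perc_threshold": 0.67,
    },
}


def resolve_parameters(mode, diversity, user_args=None):
    """Start from the preset, then let user-supplied values win."""
    params = dict(PRESETS[(mode, diversity)])
    params.update(user_args or {})
    return params


params = resolve_parameters("fast", "low", {"subsample": "onehundred"})
print(params["subsample"])  # onehundred (user value overrides the preset)
print(params["trim"])       # greedy (preset value is kept)
```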

Trimming

You can specify the trimming strategy to use with the --trim parameter. The user can choose between four different options:

`--trim`	Description
`gap_trim`	performs what is specified in the `trim` section of the configuration file, which by default is trimAl with the `-gappyout` parameter, as presented in Capella-Gutiérrez S, et al. Bioinformatics 25.15 (2009) and in the trimAl website
`gap_perc`	removes columns with a fraction of gaps above a certain threshold, regulated by the `--gap_perc_threshold` parameter, whose default value is 0.67
`not_variant`	removes columns of the multiple-sequence alignment in which a single amino acid appears with a frequency above a certain threshold (set by the `--not_variant_threshold` parameter, whose default value is 0.95)
`greedy`	performs all the above trimming options

The default is None. In this case, the trimming step will not be performed.
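
The `gap_perc` and `not_variant` strategies can be illustrated with a few lines of Python. This is a sketch under stated assumptions, not PhyloPhlAn2's code: the function names are hypothetical, frequencies are computed over all sequences in a column, and "above" is read as strictly greater than the threshold.

```python
from collections import Counter


def _select(seqs, keep):
    """Rebuild each sequence from the retained column indices."""
    return ["".join(s[i] for i in keep) for s in seqs]


def trim_gap_perc(seqs, threshold=0.67, gap="-"):
    """Remove columns whose fraction of gaps is above the threshold."""
    n = len(seqs)
    keep = [
        i for i in range(len(seqs[0]))
        if sum(s[i] == gap for s in seqs) / n <= threshold
    ]
    return _select(seqs, keep)


def trim_not_variant(seqs, threshold=0.95, gap="-"):
    """Remove columns where one symbol's frequency is above the threshold."""
    n = len(seqs)
    keep = []
    for i in range(len(seqs[0])):
        counts = Counter(s[i] for s in seqs if s[i] != gap)
        top = counts.most_common(1)[0][1] if counts else 0
        if top / n <= threshold:
            keep.append(i)
    return _select(seqs, keep)


seqs = ["AC-G", "AT-G", "AG-C", "ACAG"]
print(trim_gap_perc(seqs))     # ['ACG', 'ATG', 'AGC', 'ACG']
print(trim_not_variant(seqs))  # ['C-G', 'T-G', 'G-C', 'CAG']
print(trim_not_variant(trim_gap_perc(seqs)))  # ['CG', 'TG', 'GC', 'CG']
```

The last line chains the two filters; the `gap_trim` step is not reproduced here because it depends on the external tool named in the configuration file.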

Subsampling

The site subsampling strategy retains only a certain number of phylogenetically relevant positions, selected according to the scoring function.

In PhyloPhlAn2 you can specify the subsample strategy using the --subsample parameter.

There are several options available that will set a different amount of retained positions:

`--subsample`	Description
`phylophlan`	uses the formula presented in Segata, N et al. NatComm 4:2304 (2013) to determine how many positions to retain for each one of the 400 PhyloPhlAn markers
`onethousand`	retains up to 1000 positions for each marker
`sevenhundred`	retains up to 700 positions for each marker
`fivehundred`	retains up to 500 positions for each marker
`threehundred`	retains up to 300 positions for each marker
`onehundred`	retains up to 100 positions for each marker
`fifty`	retains up to 50 positions for each marker
`twentyfive`	retains up to 25 positions for each marker
`tenpercent`	retains 10% of the positions for each marker
`twentyfivepercent`	retains 25% of the positions for each marker
`fiftypercent`	retains 50% of the positions for each marker

Note: the --subsample phylophlan option works when using the PhyloPhlAn database only, specified via -d phylophlan

The default is None. In this case, the subsampling will not be performed and the full-length alignment will be used.
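
A fixed-size or percentage-based subsampling can be sketched as "rank the columns by score and keep the best ones". The Python functions below are hypothetical and only illustrate that idea; they assume that a higher score marks a more relevant position and that the retained columns keep their original order in the alignment.

```python
def subsample_positions(scores, max_positions):
    """Return indices of the best-scoring columns, in alignment order.

    At most max_positions columns are kept (fewer if the marker is shorter).
    """
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:max_positions])


def subsample_fraction(scores, fraction):
    """Keep a fraction of the columns, e.g. 0.1 for `tenpercent`."""
    return subsample_positions(scores, int(len(scores) * fraction))


# Made-up per-column scores for a 10-column marker alignment
scores = [0.1, 0.9, 0.4, 0.8, 0.2, 0.7, 0.3, 0.6, 0.5, 0.0]
print(subsample_positions(scores, 3))   # [1, 3, 5]
print(subsample_fraction(scores, 0.5))  # [1, 3, 5, 7, 8]
```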

Scoring function

A scoring function is used in PhyloPhlAn2 to assign to each column in the MSAs a phylogenetic score, that will be then used to rank the MSA positions to retain a subset (see Subsampling).

The --scoring_function parameter allows three different scoring functions:

`--scoring_function`	Description
`muscle`	implements the same scoring function defined in Edgar, RC NAR 32.5 (2004), when specifying the `-scorefile` param
`trident`	implements the `trident` scoring function as presented in Valdar, WSJ. Proteins 48.2 (2002), which is a weighted combination of symbol diversity, stereochemical diversity, and gap cost
`random`	assigns random scores to each position in the MSAs

Substitution matrices

Some of the functions for scoring the MSA columns need a substitution matrix to evaluate the substitution of amino acids.

Substitution matrices can be specified using the --submat param that could assume one of the following values.

`--submat`	Description
`vtml200`	substitution matrix proposed by Yamada K, Tomii K Bioinformatics 30.3 (2014)
`vtml240`	substitution matrix used in Edgar RC NAR 32.5 (2004)
`miqs`	substitution matrix proposed by Tomii K and Kazunori Y Humana Press, New York, NY, 1415 (2016)
`pfasum60`	substitution matrix proposed by Keul F et al. BMC Bioinformatics 18.1 (2017)

The substitution matrices presented above are distributed with PhyloPhlAn2. However, the set of substitution matrices can be extended with user-defined ones: users can generate their own substitution matrices using the scripts (generate_matrices.sh and serialize_matrix.py) provided in the phylophlan2_substitution_matrices folder.

Mutation rates table

PhyloPhlAn2 has the --mutation_rates option that computes the amount of nucleotide or amino acid changes in each aligned marker.

In the output folder <input_folder>_<database>/mutation_rates/, you can find a mutation rate table for all the markers whereas the <input_folder>_<database>/mutation_rates.tsv file contains the summarized mutation rates table for the complete multiple sequence alignment.

Sorting

Using the --sort parameter it is possible to sort the markers and hence force PhyloPhlAn2 to consider them in a specific order when concatenating the sequences.

When using the PhyloPhlAn database (-d phylophlan), --sort will be automatically set to True.

Note: the sort preference is used for the super-matrix approach (concatenation) only.

Database setup (`phylophlan2_setup_database.py` usage)

To build a custom database, we provide the phylophlan2_setup_database.py script to be run with the following syntax:

#!bash

phylophlan2_setup_database.py -i <input_file_or_folder> -d <database_name> -e <input_extension> -t <database_type>

where:

<input_file_or_folder> is the folder containing markers files or a multi-fasta file containing the markers
<database_name> is the database name chosen by the user (the name to use when running PhyloPhlAn2)
<input_extension> is the extension of the input file(s)
<database_type> has to be n if it is a nucleotide database or a if it is an amino acids database (depending on the input provided by the user)

The database will be created in the same folder as the input file(s), or you can specify an output folder with the -o option.

The phylophlan2_setup_database.py script can be used to automatically retrieve a set of core proteins of a specific species using the -g option (instead of the -i param). In this case, you need to specify the species name by typing -g s__<species_name>. This is also going to be the default name of the database if not differently specified with -d.
The decision between -i or -g depends on whether the user already has a folder containing the markers or is asking PhyloPhlAn2 to download a set of core markers.

Configuration File (`phylophlan2_write_default_configs.sh` usage)

PhyloPhlAn2 relies on the configuration file for handling the external software.

A configuration file can be specified in phylophlan2.py with either -f <configuration> or --config_file <configuration>.

Each configuration file is composed of different sections (some are mandatory, to ensure to be able to complete a phylogenetic analysis, and some are optional). Each section refers to a specific step in the phylogenetic pipeline and contains all the details for the external software to use.

In PhyloPhlAn2 you can find the phylophlan2_write_default_configs.sh script that will generate four ready-to-use configuration files:

supermatrix_aa.cfg
supermatrix_nt.cfg
supertree_aa.cfg
supertree_nt.cfg

More information about the supermatrix and supertree approaches is available in the following section.

Note: Please be careful if you have specified diamond in your configuration file and are going to use it in your analysis; see the Known Issues section.

Generate a custom configuration file (`phylophlan2_write_config_file.py` usage)

If you want to generate your own configuration file, you can use the phylophlan2_write_config_file.py script. Below, an example of the command used to create a customized configuration file is provided. It uses diamond instead of blastn and muscle instead of mafft, with respect to the supermatrix_nt.cfg configuration file generated by the phylophlan2_write_default_configs.sh script:

#!bash

python phylophlan2_write_config_file.py \
    -o custom_config_nt.cfg \
    -d n \
    --db_dna makeblastdb \
    --map_dna diamond \
    --msa muscle \
    --trim trimal \
    --tree1 fasttree \
    --tree2 raxml

where:

-o is the output filename
-d indicates the type of database this configuration file is tailored for, a detailed description is available here
--db_dna, --map_dna, --msa, --trim, --tree1, --tree2, indicate the sections the configuration file will contain

Note: Please be careful if you have specified diamond in your configuration file and are going to use it in your analysis; see the Known Issues section.

Mandatory sections

The following sections are strictly required to be defined in a configuration file:

Mandatory section	Description
`--db_dna` and/or `--db_aa`	Choices for `db_dna`: (`makeblastdb`), for `db_aa`: (`usearch`, `diamond`)
`--map_dna` and/or `--map_aa`	specify the software for mapping the database against genomes and proteomes, respectively. Choices for `map_dna`: (`blastn`, `tblastn`, `diamond`), for `map_aa`: (`usearch`, `diamond`)
`--msa`	specify the software for performing the multiple-sequence alignment. Choices are: `muscle`, `mafft`, `opal`, `upp`
`--tree1`	specify the software for inferring the phylogeny. Choices are: `fasttree`, `raxml`, `iqtree`, `astral`, `astrid`

Optional sections

Optional section	Description
`--trim`	specify the software `trimal` for performing the trimming of gappy regions
`--gene_tree1`	specify the software to use for building the single-gene trees. Choices are `fasttree`, `raxml`, `iqtree`
`--gene_tree2`	specify the software `raxml` for refining the phylogenies built at the `gene_tree1` step
`--tree2`	specify the software `raxml` for refining the phylogeny built at the `tree1` step

Integrating new tools in the framework

PhyloPhlAn2 allows users to integrate new tools that are not available in the framework, as well as their parameters, for each of the different steps. This is done by manually editing the configuration file or creating a new configuration file with the desired tools/parameters.

The only requirement for this integration is that the input and output file formats of the tool are compatible with the framework's default tools.

Here is an example section of a default supermatrix configuration file that uses MAFFT for multiple-sequence alignment:

[msa]
program_name = mafft
params = --quiet --anysymbol --thread 1 --auto
version = --version
command_line = #program_name# #params# #input# > #output#

And here is the same section modified to use Clustal Omega, which is not a default option in PhyloPhlAn2 for the multiple-sequence alignment step:

[msa]
program_name = clustalo
input = -i
output = -o
params = --threads 1 --auto
version = --version
command_line = #program_name# #params# #input# #output#

Configuration variables explained

A configuration file can be composed of several different sections, but there is a minimum set of sections that has to be present to complete a phylogenetic analysis. The mandatory sections are:

either map_dna or map_aa
msa
tree1

The complete list of available sections is:

map_dna
map_aa
msa
trim
gene_tree1
gene_tree2
tree1
tree2

Each of the above sections can have several different options specified. These are required in order to compose a command line that can run an external tool. The mandatory options that each section in a configuration file has to specify are:

program_name
command_line

The complete list of available options is:

program_name
params
threads
input
database
output_path
output
version
environment
command_line
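
The mandatory sections and options listed above can be checked mechanically. The following Python sketch uses the standard `configparser` module to do so; the function `check_config` is hypothetical, the `[tree1]` example is deliberately incomplete, and this is not how PhyloPhlAn2 itself validates its configuration.

```python
import configparser

MANDATORY_OPTIONS = ("program_name", "command_line")


def check_config(text):
    """Return a list of problems found in a configuration file's text."""
    config = configparser.ConfigParser(interpolation=None)
    config.read_string(text)
    problems = []
    # Mandatory sections: either map_dna or map_aa, plus msa and tree1
    if not (config.has_section("map_dna") or config.has_section("map_aa")):
        problems.append("missing section: map_dna or map_aa")
    for section in ("msa", "tree1"):
        if not config.has_section(section):
            problems.append(f"missing section: {section}")
    # Mandatory options in every section that is present
    for section in config.sections():
        for option in MANDATORY_OPTIONS:
            if not config.has_option(section, option):
                problems.append(f"[{section}] missing option: {option}")
    return problems


# Illustrative text: no mapping section, and [tree1] lacks command_line
example = """
[msa]
program_name = mafft
params = --quiet --anysymbol --thread 1 --auto
version = --version
command_line = #program_name# #params# #input# > #output#

[tree1]
program_name = fasttree
"""
for problem in check_config(example):
    print(problem)
```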

In particular, the command_line option specifies how the other options should be arranged in order to build a running command line. For instance, taking this section of a configuration file:

[msa]
program_name = mafft
params = --quiet --anysymbol --thread 1 --auto
version = --version
command_line = #program_name# #params# #input# > #output#

The command_line option specifies that the value of the program_name option comes first, followed by the value of the params option. Next comes the input option. Note that in this configuration no input option is specified, so in this case PhyloPhlAn2 will read the input from the standard input. After the input option there is the output redirect sign (>) followed by the output option. Note that here, too, no parameter is given to specify an output file, so PhyloPhlAn2 will redirect the output to the output file.
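
The placeholder substitution can be pictured with the simplified Python model below. It is a sketch, not PhyloPhlAn2's implementation: `compose_command` is a hypothetical function, the section names `msa_mafft` and `msa_clustalo` exist only so that both examples fit in one text (a real file would call the section `[msa]`), and the file names are made up. The model assumes that an option such as `input = -i` is emitted as a flag followed by the file name, and it does not reproduce how PhyloPhlAn2 actually handles standard input or the `>` redirect.

```python
import configparser

CONFIG_TEXT = """
[msa_mafft]
program_name = mafft
params = --quiet --anysymbol --thread 1 --auto
version = --version
command_line = #program_name# #params# #input# > #output#

[msa_clustalo]
program_name = clustalo
input = -i
output = -o
params = --threads 1 --auto
version = --version
command_line = #program_name# #params# #input# #output#
"""


def compose_command(options, input_file, output_file):
    """Fill the #...# placeholders of a section's command_line."""
    command = options["command_line"]
    command = command.replace("#program_name#", options["program_name"])
    command = command.replace("#params#", options.get("params", ""))
    # Assumption: a flag such as "-i" precedes the file name when given
    for key, path in (("input", input_file), ("output", output_file)):
        flag = options.get(key, "")
        command = command.replace(f"#{key}#", f"{flag} {path}".strip())
    return " ".join(command.split())  # normalize whitespace


config = configparser.ConfigParser(interpolation=None)
config.read_string(CONFIG_TEXT)
mafft_cmd = compose_command(dict(config["msa_mafft"]), "marker.faa", "marker.aln")
clustalo_cmd = compose_command(dict(config["msa_clustalo"]), "marker.faa", "marker.aln")
print(mafft_cmd)
# mafft --quiet --anysymbol --thread 1 --auto marker.faa > marker.aln
print(clustalo_cmd)
# clustalo --threads 1 --auto -i marker.faa -o marker.aln
```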

Supermatrix (concatenation) or Supertree (gene trees) approach

PhyloPhlAn2 can execute either a Supermatrix (or concatenation) pipeline or a Supertree (or gene trees) pipeline.

The type of phylogenetic pipeline that will be executed is determined by the settings in the configuration file.

Supermatrix (concatenation)

The Supermatrix pipeline is the default, determined also by the mandatory sections.

In other words, when neither gene_tree1 nor gene_tree2 sections are present in the configuration file, PhyloPhlAn2 will perform a concatenation pipeline.

Supertree (gene trees)

This approach is to be preferred when building a large phylogeny but will build a tree with unresolved branch length tips. For a Supertree pipeline, the required section in the configuration file is: gene_tree1.

In order to use a gene trees pipeline, the user has to manually edit the [tree1] section in the configuration file in which the paths to the ASTRAL jar file and the example file for the version option (needed to verify the correct installation of ASTRAL) need to be specified.

Below is the [tree1] section example template that needs to be edited:

[tree1]
command_line = #program_name# #input# #output#
program_name = java -jar /../path_to_astral/../astral.4.11.1.jar
input = -i
output = -o
version = -i /../path_to_astral/../astral-4.11.1/test_data/song_mammals.424.gene.tre

Note: the order of the options for the [tree1] section can differ from the above example.
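
The rule that selects the pipeline type, as described in the two subsections above, can be summarized in a few lines of Python. The function `pipeline_type` is hypothetical; treating a `gene_tree2` section without `gene_tree1` as an error is an assumption based on `gene_tree2` refining the trees built at the `gene_tree1` step.

```python
def pipeline_type(sections):
    """Classify a configuration by its section names.

    No gene_tree1/gene_tree2 section means a supermatrix (concatenation)
    pipeline; gene_tree1 is required for a supertree (gene trees) pipeline.
    """
    sections = set(sections)
    if "gene_tree1" in sections:
        return "supertree"
    if "gene_tree2" in sections:
        # Assumption: refinement makes no sense without gene_tree1
        raise ValueError("gene_tree2 requires gene_tree1")
    return "supermatrix"


print(pipeline_type(["map_aa", "msa", "trim", "tree1", "tree2"]))  # supermatrix
print(pipeline_type(["map_aa", "msa", "gene_tree1", "tree1"]))     # supertree
```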

Assigning SGBs (`phylophlan2_metagenomic.py` usage)

PhyloPhlAn2 now allows you to assign to each bin from a metagenomic assembly its closest species-level genome bins (SGBs, see Pasolli, E et al. Cell (2019) for further details).

The only mandatory parameter is -i followed by the name of the input directory that contains the bins, for example:

#!bash
phylophlan2_metagenomic.py -i input_folder

Other parameters that can be specified are:

-o: allows you to decide the output prefix that will be used for the two output directories and the output file. If not specified, the prefix used is <input_folder> as in the previous example where the two output folders will be input_folder_dists and input_folder_sketches, and the output file will be input_folder.tsv
-n: allows you to decide how many SGBs (sorted by increasing average genomic distance) will be reported for each input bin in the output file. Also, the keyword --all is accepted. If not specified, default is 10
--nproc: allows you to specify how many CPUs can be used. If not specified, default is 1

A practical example of its usage is given in the example 3. Metagenomics

Output description

The phylophlan2_metagenomic.py script has three different types of outputs: (1) list of the top -n/--how_many SGBs sorted by their average Mash distance, (2) closest SGB, GGB, FGB, and reference genomes, and (3) "all vs. all" matrix of all pairwise Mash distances.

Output 1

Each line reports the bin name and the list of the closest SGBs (sorted by increasing average Mash distance), separated by tabs. The fields of each SGB are separated by a colon (:). For example:

my_bin	(k|u)SGB_ID:taxa_level:taxonomy:average_mash_distance	[(k|u)SGB_ID:taxa_level:taxonomy:average_mash_distance]

Where:

my_bin is the input bin name;
(k|u)SGB_ID is the SGB ID and starts with either k or u to indicate whether it is a known or an unknown SGB;
taxa_level can be either Species, Genus, Family, or Phylum, depending on the taxonomic level at which the SGB has been assigned;
taxonomy is the full taxonomic label assigned to the SGB;
average_mash_distance is the average Mash distance of the input bin w.r.t. all the genomes in the SGB.
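
A line in this format can be parsed with a short Python function. The sketch below is not shipped with PhyloPhlAn2: `SGBHit` and `parse_line` are hypothetical names, and the example line uses invented IDs, taxonomy labels, and distances. The taxonomy is split off from the right so that the parser still works if the label itself contains a colon.

```python
from typing import NamedTuple


class SGBHit(NamedTuple):
    sgb_id: str
    known: bool  # True when the ID starts with "k"
    taxa_level: str
    taxonomy: str
    avg_mash_distance: float


def parse_line(line):
    """Parse one tab-separated line into (bin_name, [SGBHit, ...])."""
    bin_name, *entries = line.rstrip("\n").split("\t")
    hits = []
    for entry in entries:
        sgb_id, taxa_level, rest = entry.split(":", 2)
        taxonomy, distance = rest.rsplit(":", 1)
        hits.append(
            SGBHit(sgb_id, sgb_id.startswith("k"), taxa_level, taxonomy, float(distance))
        )
    return bin_name, hits


# Invented example line, for illustration only
line = (
    "bin_01\t"
    "kSGB_1234:Species:k__Bacteria|s__Example_species:0.0312\t"
    "uSGB_5678:Genus:k__Bacteria|g__Example_genus:0.1507\n"
)
name, hits = parse_line(line)
print(name)                       # bin_01
print(hits[0].sgb_id, hits[0].known)  # kSGB_1234 True
print(hits[1].avg_mash_distance)  # 0.1507
```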

Output 2

Similar to Output 1, except that the information reported is for the closest SGB, then the closest GGB, followed by the closest FGB, and finally the closest reference genomes, according to their respective Mash distances.

Output 3

In this case, phylophlan2_metagenomic.py produces a square matrix of all pairwise distances among the input bins only.

Note: this works with up to 100,000 input bins. If you have more than that you should divide your input bins into batches of no more than 100,000 bins each.

Getting reference genomes of a specified species (`phylophlan2_get_reference.py` usage)

This feature is used for getting reference genomes of specified species. This is particularly useful when you need to build a tree to phylogenetically compare your samples with existing ones. The only mandatory parameter is -g <label> used to specify the taxonomic label for which you need to download the set of references. The <label> must represent any valid taxonomic level or the special case all and is best used:

-g s__<species_name> in low diversity trees. A practical example of its usage is given in 1. Phylogenetically characterized isolate genomes of a given species (S. aureus)
-g all in high diversity trees. A practical example of its usage is given in 2. Tree of life

Finding strains in trees (`phylophlan2_strain_finder.py` usage)

This script can be used to analyze trees generated with phylophlan2.py. It outputs a number of subtrees built according to the relative similarity of nodes, where similarity is defined by two criteria whose stringency the user can set through two parameters, which are applied together:

--p_thr <num>: it defines the phylogenetic threshold to test on the tree. The phylogenetic distance between any two nodes of the same subtree will be less than this threshold;
--m_thr <num>: it defines the mutation rate to test on the tree. The mutation rate between any two nodes of the same subtree will be less than this threshold.

You need to provide the tree file with -i and the mutation rates table with -m:

#!bash

phylophlan2_strain_finder.py -i <input_tree> --tree_format <input_tree_format> -m <mutation_rates.tsv>

Notice that you have the <mutation_rates.tsv> table only if you use the parameter --mutation_rates when executing phylophlan2.py, as explained above

Drawing heatmaps to visualize the output from phylophlan2_metagenomic.py (`phylophlan2_draw_metagenomic.py` usage)

The phylophlan2_draw_metagenomic.py script can be used to visualize the results obtained from phylophlan2_metagenomic.py. Its basic usage is:

#!bash

phylophlan2_draw_metagenomic.py -i <output_metagenomic> --map <bin2meta.tsv>

where:

<output_metagenomic> is the output file generated by phylophlan2_metagenomic.py as detailed above;
<bin2meta.tsv> passed with the --map parameter is a mapping file that links each bin to the metagenome it has been reconstructed from. It is a tab-separated file where the input bins are in the first column and metagenomes in the second column.

Note: when building the mapping file make sure the names used for bins are consistent with the ones used as inputs with phylophlan2_metagenomic.py.
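
As a sketch of how such a mapping file could be assembled, the snippet below assumes a hypothetical layout in which each metagenome has its own subfolder containing its bins, and that a bin is identified by its file name without the extension. Both are assumptions: adapt them so that the bin names match the ones given as input to phylophlan2_metagenomic.py.

```python
import csv
import tempfile
from pathlib import Path


def write_bin2meta(bins_root, out_tsv, extension=".fna"):
    """Write a two-column TSV: bin name first, then the metagenome it comes from."""
    rows = []
    for bin_path in sorted(Path(bins_root).glob("*/*" + extension)):
        # Assumption: parent folder name = metagenome, file name minus extension = bin.
        rows.append((bin_path.name[: -len(extension)], bin_path.parent.name))
    with open(out_tsv, "w", newline="") as handle:
        csv.writer(handle, delimiter="\t", lineterminator="\n").writerows(rows)
    return rows


# Demo on a throwaway folder with made-up metagenome and bin names.
demo_root = Path(tempfile.mkdtemp())
for meta, bin_name in [("sampleA", "bin.1"), ("sampleA", "bin.2"), ("sampleB", "bin.1b")]:
    (demo_root / meta).mkdir(exist_ok=True)
    (demo_root / meta / (bin_name + ".fna")).write_text(">contig\nACGT\n")

rows = write_bin2meta(demo_root, demo_root / "bin2meta.tsv")
mapping_text = (demo_root / "bin2meta.tsv").read_text()
print(mapping_text)
```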

A usage example of phylophlan2_draw_metagenomic.py is given in the example 3. Metagenomics

Requirements

Dependencies

Python (version >=3.0)
NumPy (version >=1.12.1)
Biopython (version >=1.70)
DendroPy (version >=4.2.0)

External Tools

PhyloPhlAn2 also needs the following tools:

At least one phylogenetic inference software tool: RAxML, FastTree, IQ-TREE
At least one multiple sequence alignment tool: MUSCLE, MAFFT, Opal, UPP
trimAl for the trimming of the multiple sequence alignment (optional)
blast+ for database building and mapping of nucleotide databases
USEARCH and/or DIAMOND for database building and mapping of nucleotide and/or amino acid databases.
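
A quick way to check which of these tools are reachable from your PATH is sketched below. The executable names are assumptions (they differ between installations and versions), so edit the candidate lists to match your system before relying on the report.

```python
import shutil


def find_tools(candidates):
    """For each category, map every candidate executable to its path (or None)."""
    return {
        label: {name: shutil.which(name) for name in names}
        for label, names in candidates.items()
    }


# Candidate executable names are assumptions, adjust them to your installation.
CANDIDATES = {
    "phylogenetic inference (at least one)": ["raxmlHPC", "FastTree", "iqtree"],
    "multiple sequence alignment (at least one)": ["muscle", "mafft", "opal", "upp"],
    "trimming (optional)": ["trimal"],
    "nucleotide databases (blast+)": ["makeblastdb", "blastn", "tblastn"],
    "nucleotide/amino acid databases": ["usearch", "diamond"],
}

report = find_tools(CANDIDATES)
for label, tools in report.items():
    print(label)
    for name, path in tools.items():
        print(f"  {name}: {path or 'not found'}")
```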

Known Issues

DIAMOND can occasionally crash, most likely because of temporary files that were not removed. If PhyloPhlAn2 crashes while running DIAMOND, remove the last directory that was generated in the output/tmp folder and re-launch the main command to restart PhyloPhlAn2 from where it failed.

More generally, since PhyloPhlAn2 is a pipeline that interacts with external software, the failure of one step can occasionally interrupt the execution of phylophlan2.py. To continue the analysis we advise trying the following, in this order:

Remove the last directory that was generated in the output/tmp folder and re-launch the main command, so that PhyloPhlAn2 restarts from where it failed and the computation done up to that point is not lost.
If that does not work, run the failing command replacing the -i parameter with -c, in order to delete the output and output/tmp folders of the project, and then re-launch the original command.
If that does not work either, run the command with --clean_all, which removes all installation and database files that are automatically generated at the first run of PhyloPhlAn2.
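
The first recovery step can be scripted. The sketch below interprets "the last directory that was generated" as the subdirectory of output/tmp with the most recent modification time, which is an assumption, as are the function name, the dry_run flag and the demo folder names. Check the reported directory before deleting anything.

```python
import os
import shutil
import tempfile
from pathlib import Path


def remove_latest_tmp_dir(tmp_folder, dry_run=True):
    """Find (and, if dry_run is False, delete) the most recently modified subdirectory."""
    subdirs = [p for p in Path(tmp_folder).iterdir() if p.is_dir()]
    if not subdirs:
        return None
    latest = max(subdirs, key=lambda p: p.stat().st_mtime)
    if not dry_run:
        shutil.rmtree(latest)
    return latest


# Demo on a throwaway folder; the step_* names are made up for illustration.
demo_tmp = Path(tempfile.mkdtemp())
for i, name in enumerate(["step_1", "step_2", "step_3"]):
    (demo_tmp / name).mkdir()
    os.utime(demo_tmp / name, (1000 + i, 1000 + i))  # force increasing timestamps

candidate = remove_latest_tmp_dir(demo_tmp)               # dry run: only reports
removed = remove_latest_tmp_dir(demo_tmp, dry_run=False)  # actually deletes
remaining = sorted(p.name for p in demo_tmp.iterdir())
print(candidate.name, remaining)
```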

Command Line Options (`phylophlan2.py`)

This is the main PhyloPhlAn2 script, other information available here

usage: phylophlan2.py [-h] [-i PROJECT_NAME | -c CLEAN] [-o OUTPUT]
                      [-d DATABASE] [-t {n,a}] [-f CONFIG_FILE] --diversity
                      {low,medium,high} [--accurate | --fast] [--clean_all]
                      [--database_list] [-s SUBMAT] [--submat_list]
                      [--submod_list] [--nproc NPROC]
                      [--min_num_proteins MIN_NUM_PROTEINS]
                      [--min_len_protein MIN_LEN_PROTEIN]
                      [--min_num_markers MIN_NUM_MARKERS]
                      [--trim {gappy,not_variant,greedy}]
                      [--not_variant_threshold NOT_VARIANT_THRESHOLD]
                      [--subsample {phylophlan,onethousand,sevenhundred,\
                      fivehundred,threehundred,onehundred,fifty,twentyfive,\
                      tenpercent,twentyfivepercent,fiftypercent}]
                      [--unknown_fraction UNKNOWN_FRACTION]
                      [--scoring_function {trident,muscle,random}] [--sort]
                      [--remove_fragmentary_entries]
                      [--fragmentary_threshold FRAGMENTARY_THRESHOLD]
                      [--min_num_entries MIN_NUM_ENTRIES] [--maas MAAS]
                      [--remove_only_gaps_entries] [--mutation_rates]
                      [--force_nucleotides] [--input_folder INPUT_FOLDER]
                      [--data_folder DATA_FOLDER]
                      [--databases_folder DATABASES_FOLDER]
                      [--submat_folder SUBMAT_FOLDER]
                      [--submod_folder SUBMOD_FOLDER]
                      [--configs_folder CONFIGS_FOLDER]
                      [--output_folder OUTPUT_FOLDER]
                      [--genome_extension GENOME_EXTENSION]
                      [--proteome_extension PROTEOME_EXTENSION] [--verbose]
                      [-v]

optional arguments:
  -h, --help            show this help message and exit
  -i PROJECT_NAME, --input PROJECT_NAME
  -c CLEAN, --clean CLEAN
                        Clean the output and partial data produced for the
                        specified project (default: None)
  -o OUTPUT, --output OUTPUT
                        Output folder name, otherwise it will be the name of
                        the input folder concatenated with the name of the
                        database used (default: None)
  -d DATABASE, --database DATABASE
                        The name of the database of markers to use. (default:
                        None)
  -t {n,a}, --db_type {n,a}
                        Specify the type of the database of markers, where "n"
                        stands for nucleotides and "a" for amino acids. If not
                        specified, PhyloPhlAn2 will automatically detect the
                        type of database (default: None)
  -f CONFIG_FILE, --config_file CONFIG_FILE
                        The configuration file to load, four ready-to-use
                        configuration files can be generated using the
                        "write_default_configs.sh" script present in the
                        "configs" folder (default: None)
  --diversity {low,medium,high}
                        Specify the "diversity" of phylogeny to build in order
                        to automatically adjust some parameters. Values can
                        be: "low": for genus-/species-/strain-level
                        phylogenies; "medium": for class-/order-level
                        phylogenies; "high": for tree-of-life size phylogenies
                        (default: None)
  --accurate            If specified will set some parameters that should
                        provide a more accurate phylogeny reconstruction.
                        Affected parameters vary depending also on the "--
                        diversity" parameter (default: False)
  --fast                If specified will set some parameters that should
                        provide a faster phylogeny reconstruction. Affected
                        parameters vary depending also on the "--diversity"
                        parameter (default: False)
  --clean_all           Remove all installation and database files that are
                        automatically generated at the first run of PhyloPhlAn
                        (default: False)
  --database_list       List of all the available databases that can be
                        specified with the -d (or --database) option (default:
                        False)
  -s SUBMAT, --submat SUBMAT
                        Specify the substitution matrix to use, the available
                        substitution matrices can be listed using the "--
                        submat_list" parameter (default: None)
  --submat_list         List of all the available substitution matrices that
                        can be specified with the --submat option (default:
                        False)
  --submod_list         List of all the available substitution models that can
                        be specified with the --maas option (default: False)
  --nproc NPROC         The number of CPUs to use (default: 1)
  --min_num_proteins MIN_NUM_PROTEINS
                        Proteomes (.faa) with less than this number of
                        proteins will be discarded (default: 1)
  --min_len_protein MIN_LEN_PROTEIN
                        Proteins in proteomes (.faa) shorter than this value
                        will be discarded (default: 50)
  --min_num_markers MIN_NUM_MARKERS
                        Input genomes or proteomes that map to less than the
                        specified number of markers will be discarded
                        (default: 0)
  --trim {gappy,not_variant,greedy}
                        Specify which type of trimming to perform: "gappy"
                        will perform what specified in the "trim" section of
                        the configuration file to remove gappy columns
                        (suggested, trimal --gappyout); "not_variant" will
                        remove columns that have at least one nucleotide/amino
                        acid appearing above a certain threshold (see "--
                        not_variant_threshold" parameter); "greedy" performs
                        both "gappy" and "not_variant"; "None", no trimming
                        will be performed (default: None)
  --not_variant_threshold NOT_VARIANT_THRESHOLD
                        Specify the value used to consider a column not
                        variant when "--trim not_variant" is specified
                        (default: 0.99)
  --subsample {phylophlan,onethousand,sevenhundred,\
               fivehundred,threehundred,onehundred,fifty,twentyfive,\
               tenpercent,twentyfivepercent,fiftypercent}
                        Specify which function to use to compute the number of
                        positions to retain from single marker MSAs for the
                        concatenated MSA. "phylophlan" compute the number of
                        position for each marker as in PhyloPhlAn (almost!)
                        (works only when --database phylophlan); "onethousand"
                        return the top 1000 positions; "sevenhundred" return
                        the top 700; "fivehundred" return the top 500;
                        "threehundred" return the top 300; "onehundred" return
                        the top 100 positions; "fifty" return the top 50
                        positions; "twentyfive" return the top 25 positions;
                        "None", the complete alignment will be used (default:
                        None)
  --unknown_fraction UNKNOWN_FRACTION
                        Define the amount of unknowns ("X" and "-") allowed in
                        each column of the MSA of the markers (default: 0.3)
  --scoring_function {trident,muscle,random}
                        Specify which scoring function to use to evaluate
                        columns in the MSA results (default: None)
  --sort                If specified, the markers will be ordered, when using
                        the PhyloPhlAn database, it will be automatically set
                        to "True" (default: False)
  --remove_fragmentary_entries
                        If specified the MSAs will be checked and cleaned from
                        fragmentary entries. See --fragmentary_threshold for
                        the threshold values above which an entry will be
                        considered fragmentary (default: False)
  --fragmentary_threshold FRAGMENTARY_THRESHOLD
                        The fraction of gaps in the MSA to be considered
                        fragmentary and hence discarded (default: 0.85)
  --min_num_entries MIN_NUM_ENTRIES
                        The minimum number of entries to be present for each
                        of the markers in the database (default: 4)
  --maas MAAS           Select a mapping file that specifies the substitution
                        model of amino acid to use for each of the markers for
                        the gene tree reconstruction. File must be tab-
                        separated (default: None)
  --remove_only_gaps_entries
                        If specified, entries in the MSAs composed only of
                        gaps ("-") will be removed. This is equivalent to
                        specifying: --remove_fragmentary_entries
                        --fragmentary_threshold 1 (default: False)
  --mutation_rates      If specified will produced a mutation rates table for
                        each of the aligned markers and a summary table for
                        the concatenated MSA. This operation can take a long
                        time to finish (default: False)
  --force_nucleotides   If specified force PhyloPhlAn to use nucleotide
                        sequences for the phylogenetic analysis, even in the
                        case of a database of amino acids (default: False)
  --verbose             Makes PhyloPhlAn verbose (default: False)
  -v, --version         Prints the current PhyloPhlAn version and exit

Folder paths:
  Parameters for setting the folder locations

  --input_folder INPUT_FOLDER
                        Path to the folder containing the input data (default:
                        input/)
  --data_folder DATA_FOLDER
                        Path to the folder where to store the intermediate
                        files, default is "tmp" inside the project's output
                        folder (default: None)
  --databases_folder DATABASES_FOLDER
                        Path to the folder that contains the database files
                        (default: phylophlan_databases/)
  --submat_folder SUBMAT_FOLDER
                        Path to the folder containing the substitution
                        matrices to use to compute the column score for the
                        subsampling step (default:
                        phylophlan_substitution_matrices/)
  --submod_folder SUBMOD_FOLDER
                        Path to the folder containing the substitution models
                        mapping file for building the gene trees (default:
                        phylophlan_substitution_models/)
  --configs_folder CONFIGS_FOLDER
                        Path to the folder containing the configuration files
                        that contains the software to use for the phylogenetic
                        analysis (default: phylophlan_configs/)
  --output_folder OUTPUT_FOLDER
                        Path to the output folder where to save the results
                        (default: )

Filename extensions:
  Parameters for setting the extensions of the input files

  --genome_extension GENOME_EXTENSION
                        Set the extension for the genomes in your inputs
                        (default: .fna)
  --proteome_extension PROTEOME_EXTENSION
                        Set the extension for the proteomes in your inputs
                        (default: .faa)
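
To illustrate what the --trim not_variant option described above does conceptually, here is a minimal sketch (not PhyloPhlAn2's own code) that drops alignment columns in which a single character has a frequency above --not_variant_threshold. How gaps and unknowns are counted, and whether the comparison is strict, are assumptions made here; the toy alignment is made up.

```python
from collections import Counter


def trim_not_variant(msa, threshold=0.99):
    """Drop columns whose most frequent character has a frequency above `threshold`."""
    names = list(msa)
    n_seqs = len(names)
    columns = list(zip(*(msa[name] for name in names)))
    keep = [
        i for i, column in enumerate(columns)
        if Counter(column).most_common(1)[0][1] / n_seqs <= threshold
    ]
    return {name: "".join(msa[name][i] for i in keep) for name in names}


# Toy alignment: columns 0, 2 and 3 are identical across all sequences.
toy_msa = {"g1": "ACGTA", "g2": "ACGTC", "g3": "ATGTG", "g4": "ACGTT"}
trimmed = trim_not_variant(toy_msa, threshold=0.99)
print(trimmed)
```

With the default threshold only the fully conserved columns are removed; lowering the threshold also removes columns that are merely dominated by one character.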

Command Line Options (`phylophlan2_setup_database.py`)

This script is used to build a custom database and it should be used if the user decides not to use one of the two databases provided. The output is a folder containing the markers ready to be used in phylophlan2.py through the option -d followed by the name of the said folder. Other information here

usage: phylophlan2_setup_database.py [-h] (-i INPUT | -g GET_CORE_PROTEINS)
                                     [-o OUTPUT] [-d DB_NAME]
                                     [-e INPUT_EXTENSION] [-t {n,a}]
                                     [-x OUTPUT_EXTENSION] [--overwrite]
                                     [--verbose] [-v]

optional arguments:
  -h, --help            show this help message and exit
  -i INPUT, --input INPUT
                        Specify the path to either the folder containing the
                        marker files or the file of markers, in (multi-)fasta
                        format (default: None)
  -g GET_CORE_PROTEINS, --get_core_proteins GET_CORE_PROTEINS
                        Specify the taxonomic label for which download the set
                        of core proteins. The label must represent a species:
                        "--get_core_proteins s__Escherichia_coli" (default:
                        None)
  -o OUTPUT, --output OUTPUT
                        Specify path to the output folder where to save the
                        database (default: None)
  -d DB_NAME, --db_name DB_NAME
                        Specify the name of the output database (default:
                        None)
  -e INPUT_EXTENSION, --input_extension INPUT_EXTENSION
                        Specify the extension of the input file(s) specified
                        via -i/--input (default: None)
  -t {n,a}, --db_type {n,a}
                        Specify the type of the database, where "n" stands for
                        nucleotides and "a" for amino acids (default: None)
  -x OUTPUT_EXTENSION, --output_extension OUTPUT_EXTENSION
                        Set the database output extension (default: None)
  --overwrite           If specified and the output file exists it will be
                        overwritten (default: False)
  --verbose             Prints more stuff (default: False)
  -v, --version         Prints the current phylophlan2_setup_database.py
                        version and exit

Command Line Options (`phylophlan2_write_config_file.py`)

This script allows the user to customize the phylogenetic analysis by creating a personalized configuration file, choosing which software to use for every mandatory section among the available ones, as seen above. The output is a text file: to customize the parameters of the selected software according to the needs and type of the analysis, simply open the generated configuration file with a text editor and add or remove the specific options. Other information here

usage: phylophlan2_write_config_file.py [-h] -o OUTPUT -d {n,a}
                                        (--db_dna {makeblastdb} | --db_aa {usearch,diamond})
                                        [--map_dna {blastn,tblastn,diamond}]
                                        [--map_aa {usearch,diamond}] --msa
                                        {muscle,mafft,opal,upp}
                                        [--trim {trimal}]
                                        [--gene_tree1 {fasttree,raxml,iqtree}]
                                        [--gene_tree2 {raxml}] --tree1
                                        {fasttree,raxml,iqtree,astral,astrid}
                                        [--tree2 {raxml}] [-a]
                                        [--force_nucleotides] [--overwrite]
                                        [--verbose] [-v]

optional arguments:
  -h, --help            show this help message and exit
  -o OUTPUT, --output OUTPUT
                        Specify the output file where to write the
                        configurations (default: None)
  -d {n,a}, --db_type {n,a}
                        Specify the type of the database, where "n" stands for
                        nucleotides and "a" for amino acids (default: None)
  --db_dna {makeblastdb}
                        Add the "db_dna" section of the selected software that
                        will be used for building the indexed database
                        (default: None)
  --db_aa {usearch,diamond}
                        Add the "db_aa" section of the selected software that
                        will be used for building the indexed database
                        (default: None)
  --map_dna {blastn,tblastn,diamond}
                        Add the "map_dna" section of the selected software
                        that will be used for mapping the database against the
                        input genomes (default: None)
  --map_aa {usearch,diamond}
                        Add the "map_aa" section of the selected software that
                        will be used for mapping the database against the
                        input proteomes (default: None)
  --msa {muscle,mafft,opal,upp}
                        Add the "msa" section of the selected software that
                        will be used for producing the MSAs (default: None)
  --trim {trimal}       Add the "trim" section of the selected software that
                        will be used for the gappy regions removal of the MSAs
                        (default: None)
  --gene_tree1 {fasttree,raxml,iqtree}
                        Add the "gene_tree1" section of the selected software
                        that will be used for building the phylogenies for the
                        markers in the database (default: None)
  --gene_tree2 {raxml}  Add the "gene_tree2" section of the selected software
                        that will be used for refining the phylogenies
                        previously built with what specified in the
                        "gene_tree1" section (default: None)
  --tree1 {fasttree,raxml,iqtree,astral,astrid}
                        Add the "tree1" section of the selected software that
                        will be used for building the first phylogeny
                        (default: None)
  --tree2 {raxml}       Add the "tree2" section of the selected software that
                        will be used for refining the phylogeny previously
                        built with what specified in the "tree1" section
                        (default: None)
  -a, --absolute_path   Write the absolute path to the executable instead of
                        the executable name as found in the system path
                        environment (default: False)
  --force_nucleotides   If specified sets parameters for phylogenetic analysis
                        software so that they use nucleotide sequences, even
                        in the case of a database of amino acids (default:
                        None)
  --overwrite           Overwrite output file if it exists (default: False)
  --verbose             Prints more stuff (default: False)
  -v, --version         Prints the current phylophlan2_write_config_file.py
                        version and exit
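
The constraints visible in the usage line above (required -o, -d, --msa and --tree1, exactly one of --db_dna and --db_aa, and a fixed set of choices per section) can be checked before launching the script. The helper below is a hypothetical sketch that only assembles and validates the argument list; its name, the output file name and the chosen tool combination are assumptions for illustration.

```python
# Allowed choices per section, copied from the usage line of the script.
CHOICES = {
    "db_dna": {"makeblastdb"},
    "db_aa": {"usearch", "diamond"},
    "map_dna": {"blastn", "tblastn", "diamond"},
    "map_aa": {"usearch", "diamond"},
    "msa": {"muscle", "mafft", "opal", "upp"},
    "trim": {"trimal"},
    "gene_tree1": {"fasttree", "raxml", "iqtree"},
    "gene_tree2": {"raxml"},
    "tree1": {"fasttree", "raxml", "iqtree", "astral", "astrid"},
    "tree2": {"raxml"},
}


def write_config_cmd(output, db_type, **sections):
    """Validate the sections and return the phylophlan2_write_config_file.py argument list."""
    if db_type not in ("n", "a"):
        raise ValueError("db_type must be 'n' or 'a'")
    if ("db_dna" in sections) == ("db_aa" in sections):
        raise ValueError("specify exactly one of db_dna and db_aa")
    for required in ("msa", "tree1"):
        if required not in sections:
            raise ValueError(f"--{required} is required")
    cmd = ["phylophlan2_write_config_file.py", "-o", output, "-d", db_type]
    for section, tool in sections.items():
        if tool not in CHOICES.get(section, ()):
            raise ValueError(f"{tool!r} is not a valid choice for --{section}")
        cmd += [f"--{section}", tool]
    return cmd


# Example combination (an assumption, pick the tools you actually have installed).
cmd = write_config_cmd(
    "custom_config.cfg", "a",
    db_aa="diamond", map_dna="diamond", map_aa="diamond",
    msa="mafft", trim="trimal", tree1="fasttree",
)
print(" ".join(cmd))
```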

Command Line Options (`phylophlan2_metagenomic.py`)

This script assigns each bin obtained from a metagenomic assembly to its closest species-level genome bins (SGBs). This is particularly useful when the user needs to identify bins assembled from metagenomes. The main output is a tsv file that reports, for each input bin, information about the SGBs it has been assigned to. Other information here

usage: phylophlan2_metagenomic.py [-h] -i INPUT [-o OUTPUT_PREFIX]
                                  [-d DATABASE] [-m MAPPING]
                                  [-e INPUT_EXTENSION] [-n HOW_MANY]
                                  [--nproc NPROC]
                                  [--database_folder DATABASE_FOLDER]
                                  [--only_input] [--add_ggb] [--add_fgb]
                                  [--overwrite] [--verbose] [-v]

optional arguments:
  -h, --help            show this help message and exit
  -i INPUT, --input INPUT
                        Input folder containing the metagenomic bins to be
                        indexed (default: None)
  -o OUTPUT_PREFIX, --output_prefix OUTPUT_PREFIX
                        Prefix used for the output folders: indexed bins,
                        distance estimations. If not specified, the input
                        folder will be used (default: None)
  -d DATABASE, --database DATABASE
                        Specify the name of the database, if not found locally
                        will be automatically downloaded (default: SGB.Jan19)
  -m MAPPING, --mapping MAPPING
                        Specify the name of the mapping file, if not found
                        locally will be automatically downloaded (default:
                        SGB.Jan19)
  -e INPUT_EXTENSION, --input_extension INPUT_EXTENSION
                        Specify the extension of the input file(s) specified
                        via -i/--input. If not specified will try to infer it
                        from the input files (default: None)
  -n HOW_MANY, --how_many HOW_MANY
                        Specify the number of SGBs to report in the output;
                        "all" is a special value to report all the SGBs; this
                        param is not used when "--only_input" is specified
                        (default: 10)
  --nproc NPROC         The number of CPUs to use (default: 1)
  --database_folder DATABASE_FOLDER
                        Path to the folder that contains the database file
                        (default: phylophlan2_databases/)
  --only_input          If specified provides a distance matrix between only
                        the input genomes provided (default: False)
  --add_ggb             If specified adds GGB assignments. If specified with
                        --add_fgb, then -n/--how_many will be set to 1 and
                        will be adding a column that reports the closest
                        reference genome (default: False)
  --add_fgb             If specified adds FGB assignments. If specified with
                        --add_ggb, then -n/--how_many will be set to 1 and
                        will be adding a column that reports the closest
                        reference genome (default: False)
  --overwrite           If specified overwrite the output file if exists
                        (default: False)
  --verbose             Prints more stuff (default: False)
  -v, --version         Prints the current phylophlan2_metagenomic.py version
                        and exit

Command Line Options (`phylophlan2_get_reference.py`)

This script is used to get reference genomes of a specified species. This is particularly useful when the user needs to build a tree to compare samples with existing genomes. When using the -g parameter the output will be a directory with the requested genomes. Other information here

usage: phylophlan2_get_reference.py [-h] (-g GET | -l)
                                    [-e OUTPUT_FILE_EXTENSION] [-o OUTPUT]
                                    [-n HOW_MANY] [-m GENBANK_MAPPING]
                                    [--verbose] [-v]

optional arguments:
  -h, --help            show this help message and exit
  -g GET, --get GET     Specify the taxonomic label for which download the set
                        of reference genomes. The label must represents a
                        valid taxonomic level or the special case "all"
                        (default: None)
  -l, --list_clades     Print for all taxa the total number of species and
                        reference genomes available (default: False)
  -e OUTPUT_FILE_EXTENSION, --output_file_extension OUTPUT_FILE_EXTENSION
                        Specify path to the extension of the output files
                        (default: .fna.gz)
  -o OUTPUT, --output OUTPUT
                        Specify path to the output folder where to save the
                        files, required when -g/--get is specified (default:
                        None)
  -n HOW_MANY, --how_many HOW_MANY
                        Specify how many reference genomes to download for each species,
                        where -1 stands for "all available" (default: 4)
  -m GENBANK_MAPPING, --genbank_mapping GENBANK_MAPPING
                        The local GenBank mapping file, if not found it will
                        be automatically downloaded (default:
                        assembly_summary_genbank.txt)
  --verbose             Prints more stuff (default: False)
  -v, --version         Prints the current phylophlan2_get_reference.py version
                        and exit

Command Line Options (`phylophlan2_strain_finder.py`)

This script can be used to perform analysis on trees built with phylophlan2.py. The output is a table that contains the subtrees and information about the minimum, mean, and maximum distance between nodes in the subtree, the minimum, mean and maximum mutation rate between nodes in the subtree, the distance and mutation rate between each node in the subtree. Other information here

usage: phylophlan2_strain_finder.py [-h] -i INPUT -m MUTATION_RATES
                                    [--p_threshold P_THRESHOLD]
                                    [--m_threshold M_THRESHOLD]
                                    [--tree_format {newick,nexus,phyloxml,cdao,nexml}]
                                    [-o OUTPUT] [--overwrite] [-s {,,	,;}]
                                    [--verbose] [-v]

optional arguments:
  -h, --help            show this help message and exit
  -i INPUT, --input INPUT
                        Specify the file of the phylogenetic tree as generated from
                        phylophlan2.py (default: None)

  -m MUTATION_RATES, --mutation_rates MUTATION_RATES
                        Specify the file of the mutation rates as generated from
                        phylophlan2.py (default: None)
  --p_threshold P_THRESHOLD
                        Maximum phylogenetic distance threshold for every pair
                        of nodes in the same subtree (inclusive) (default:
                        0.05)
  --m_threshold M_THRESHOLD
                        Maximum mutation rate ratio for every pair of nodes in
                        the same subtree (inclusive) (default: 0.05)
  --tree_format {newick,nexus,nexml,phyloxml,cdao}
                        Specify the format of the input tree. (default:
                        NEWICK)
  -o OUTPUT, --output OUTPUT
                        Specify the name of the output filename, if not specified
                        default is stdout (default: None)
  --overwrite           If specified, will overwrite the output file if it
                        exists (default: False)
  -s {",","\t",";"}, --separator {",","\t",";"}
                        Specify the separator you want in the output (default:
                        "\t")
  --verbose             Write more stuff (default: False)
  -v, --version         Prints the current phylophlan2_strain_finder.py version
                        and exit

Command Line Options (`phylophlan2_draw_metagenomic.py`)

This script can be used to visualize the results obtained with phylophlan2_metagenomic.py. The outputs are two heatmaps, one showing the presence/absence of the top SGBs (customizable through --top) in the metagenomes, the other showing the number of kSGBs and uSGBs in each metagenome, and two corresponding output files containing the data used to build them. Other information here

usage: phylophlan2_draw_metagenomic.py [-h] -i INPUT --map MAP [--top TOP]
                                       [-o OUTPUT] [-s SEPARATOR] [--dpi DPI]
                                       [-f F] [--verbose] [-v]

optional arguments:
  -h, --help            show this help message and exit
  -i INPUT, --input INPUT
                        Specify the input tsv file generated from
                        ‘phylophlan2_metagenomic.py’ (default: None)
  --map MAP             Specify a mapping tsv file that maps for each bin its
                        metagenome (default: None)
  --top TOP             Specify the number of SGBs to display in the figure,
                        if not specified is set to 20 (default: 20)
  -o OUTPUT, --output OUTPUT
                        Specify the prefix of the output file and image,
                        otherwise it will be set to default to output_heatmap 
                        (default: output_heatmap)
  -s SEPARATOR, --separator SEPARATOR
                        Specify the separator used in the mapping file,
                        default is tab (default: '\t')
  --dpi DPI             Specify the dpi of the images. Default is 200
                        (default: 200)
  -f F                  Specify deisired format for images. Default is svg
                        (default: svg)
  --verbose             Prints more stuff (default: False)
  -v, --version         Prints the current phylophlan2_draw_metagenomic.py
                        version and exit


Command Line Options (`phylophlan2_metagenomic.py`)

Command Line Options (`phylophlan2_get_reference.py`)

Command Line Options (`phylophlan2_strain_finder.py`)

Command Line Options (`phylophlan2_draw_metagenomic.py`)