-
Notifications
You must be signed in to change notification settings - Fork 86
MetaPhlAn 4
MetaPhlAn is a computational tool for profiling the composition of microbial communities (Bacteria, Archaea and Eukaryotes) from metagenomic shotgun sequencing data (i.e. not 16S) with species-level. With StrainPhlAn, it is possible to perform accurate strain-level microbial profiling.
MetaPhlAn 4 relies on ~5.1M unique clade-specific marker genes identified from ~1M microbial genomes (~236,600 references and 771,500 metagenomic assembled genomes) spanning 26,970 species-level genome bins (SGBs, http://segatalab.cibio.unitn.it/data/Pasolli_et_al.html), 4,992 of them taxonomically unidentified at the species level (the latest marker information file can be found here), allowing:
- unambiguous taxonomic assignments;
- an accurate estimation of organismal relative abundance;
- SGB-level resolution for bacteria, archaea and eukaryotes;
- strain identification and tracking
- orders of magnitude speedups compared to existing methods.
- metagenomic strain-level population genomics
MetaPhlAn requires python 3 or newer with numpy, and Biopython libraries installed. Python libraries are automatically installed by pip
.
MetaPhlAn relies on BowTie2 (version 2.3 or higher) to map reads against marker genes. Check that bowtie2
is present in the system path with execute and read permissions.
If MetaPhlAn is installed using conda, no pre-requisites are needed.
MetaPhlAn is integrated with advanced heatmap plotting with hclust2 and cladogram visualization with GraPhlAn. If you use such visualization tools please refer to their prerequisites.
The best way to install MetaPhlAn is through conda via the Bioconda channel. If you have not configured your Anaconda installation to fetch packages from Bioconda, please follow these steps to set up the channels.
You can install MetaPhlAn by running
$ conda install -c bioconda metaphlan
It is recommended to create an isolated conda environment and install MetaPhlAn into it.
$ conda create --name mpa -c bioconda python=3.7 metaphlan
This allows having the correct version of all the dependencies isolated from the system's python installation.
Before using MetaPhlAn, you should activate the mpa
environment:
$ conda activate mpa
MetaPhlAn is also available in PyPi
$ pip install metaphlan
Alternatively, you can manually download from GitHub or clone the repository using the following command
$ git clone https://github.com/biobakery/MetaPhlAn.git
and install MetaPhlAn by running
$ pip install .
If you choose this way, you'll need to install manually some dependencies!
MetaPhlAn needs the clade markers and the database to be downloaded locally. To obtain them:
$ metaphlan --install
If you have installed MetaPhlAn using Anaconda, it is advised to install the database in a folder outside the Conda environment. To do this, run
$ metaphlan --install --bowtie2db <database folder>
If you install the database in a different location, remember to run MetaPhlAn using --bowtie2db <database folder>
!
By default, the latest MetaPhlAn database is downloaded and built. You can download a specific version with the --index
parameter
$ metaphlan --install --index mpa_vJan21_CHOCOPhlAnSGB_202103 --bowtie2db <database folder>
When --index
is specified, MetaPhlAn skips the check for the latest database version and run the analysis using the database version provided by --index
located in --bowtie2db
.
This option is recommended when MetaPhlAn is run on HPC clusters or containerized
If you have issues in downloading the database, you can get it from:
Just download the .tar, .md5, and the mpa_latest files and place them in the metaphlan_databases folder.
$ metaphlan metagenome.fastq --input_type fastq -o profiled_metagenome.txt
It is highly recommended to save the intermediate BowTie2 output for re-running MetaPhlAn extremely quickly (--bowtie2out
), and use multiple CPUs (--nproc
) if available:
$ metaphlan metagenome.fastq --bowtie2out metagenome.bowtie2.bz2 --nproc 5 --input_type fastq -o profiled_metagenome.txt
If you already mapped your metagenome against the marker DB (using a previous MetaPhlAn run), you can obtain the results in few seconds by using the previously saved --bowtie2out
file and specifying the input (--input_type bowtie2out
):
$ metaphlan metagenome.bowtie2.bz2 --nproc 5 --input_type bowtie2out -o profiled_metagenome.txt
bowtie2out
files generated with MetaPhlAn versions below 3.0 are not compatible. Starting from MetaPhlAn 3, the BowTie2 output now includes the size of the profiled metagenome.
You can also provide an externally BowTie2-mapped SAM if you specify this format with --input_type
.
Two steps here: first map your metagenome with BowTie2 and then feed MetaPhlAn with the obtained SAM:
$ bowtie2 --sam-no-hd --sam-no-sq --no-unal --very-sensitive -S metagenome.sam -x metaphlan_databases/mpa_vJan21_CHOCOPhlAnSGB_202103 -U metagenome.fastq
$ metaphlan metagenome.sam --input_type sam -o profiled_metagenome.txt
MetaPhlAn can also natively handle paired-end metagenomes (but does not use the paired-end information), and, more generally, metagenomes stored in multiple files (but you need to specify the --bowtie2out parameter):
$ metaphlan metagenome_1.fastq,metagenome_2.fastq --bowtie2out metagenome.bowtie2.bz2 --nproc 5 --input_type fastq -o profiled_metagenome.txt
Starting from version 3, MetaPhlAn can estimate the unclassified fraction of the metagenome. The relative abundance profile is scaled according to the percentage of reads mapping to a clade in the database.
$ metaphlan metagenome.fastq --bowtie2out metagenome.bowtie2.bz2 --nproc 5 --input_type fastq --unclassified_estimation -o profiled_metagenome.txt
If you want to estimate the unknown fraction of a metagenome and your input file is a SAM file, remember to specify the metagenome size using --nreads
. You can get easily get the metagenome size from a SAM file if you have run bowtie2 without the --no-unal
parameter by running
$ wc -l metagenome.sam
Otherwise, read_fastx.py
is your choice: this will print on the standard error the metagenome size.
$ read_fastx.py metagenome.fastq > /dev/null
You can provide the specific database version with --index
.
By default MetaPhlAn is run with --index latest
: the latest version of the database is used; if it is not available, MetaPhlAn will try to download it.
When --index
is specified, MetaPhlAn skips the check for the latest database version and run the analysis using the database version provided by --index
located in --bowtie2db
.
For advanced options and other analysis types (such as strain tracking) please refer to the full command-line options metaphlan --help
.
MetaPhlAn's repository features a few utility scripts to aid in the manipulation of sample output and its visualization. These scripts can be found under the utils
folder in the MetaPhlAn directory.
The script merge_metaphlan_tables.py allows to combine MetaPhlAn output from several samples to be merged into one table Bugs (rows) vs Samples (columns) with the table enlisting the relative normalized abundances per sample per bug.
To merge multiple output files, run the script as below
$ merge_metaphlan_tables.py metaphlan_output1.txt metaphlan_output2.txt > metaphlan_output3.txt output/merged_abundance_table.txt
Wildcards can be used as needed:
$ merge_metaphlan_tables.py metaphlan_output*.txt > output/merged_abundance_table.txt
Output files can be merged only if the profiling was performed with the same version of the MetaPhlAn database.
There is no limit to how many files you can merge.
The script sgb_to_gtdb_profile.py allows to convert a SGB-based MetaPhlAn 4 output into a GTDB-taxonomy-based profile.
To do so, run the script as below
$ sgb_to_gtdb_profile.py -i metaphlan_output.txt -o metaphlan_output_gtdb.txt
The script calculate_unifrac.R allows to compute weighted or unweighted UniFrac starting from the merged MetaPhlAn outputs using the merge_metaphlan_tables.py script. We also provide a Newick tree relating all the SGBs included in MetaPhlAn 4 (metaphlan/utils/mpa_v4beta_Jan21_SGB_tree.nwk
).
In order to use calculate_unifrac.R, you need to have installed R and the vegan
, rbiom
, and ape
packages. The first execution of the script will install the dependencies, if not already satisfied.
To generate the UniFrac distance matrix, you need to run the script as below:
Rscript calculate_unifrac.R merged_mpa4_profiles.tsv mpa_v4beta_Jan21_SGB_tree.nwk unifrac_merged_mpa4_profiles.tsv
For the full list of options, please run
Rscript calculate_unifrac.R
The hclust2 script generates a hierarchically-clustered heatmap from MetaPhlAn abundance profiles. To generate the heatmap for a merged MetaPhlAn output table (as described above), you need to run the script as below:
hclust2.py \
-i HMP.species.txt \
-o HMP.sqrt_scale.png \
--skip_rows 1 \
--ftop 50 \
--f_dist_f correlation \
--s_dist_f braycurtis \
--cell_aspect_ratio 9 \
-s --fperc 99 \
--flabel_size 4 \
--metadata_rows 2,3,4 \
--legend_file HMP.sqrt_scale.legend.png \
--max_flabel_len 100 \
--metadata_height 0.075 \
--minv 0.01 \
--no_slabels \
--dpi 300 \
--slinkage complete
The tutorial of using GraPhlAn can be found from the GraPhlAn wiki.
In order to add a marker to the database, the user needs the following steps:
- Reconstruct the marker sequences (in fasta format) from the MetaPhlAn BowTie2 database by:
bowtie2-inspect metaphlan_databases/mpa_vJan21_CHOCOPhlAnSGB_202103 > metaphlan_databases/mpa_vJan21_CHOCOPhlAnSGB_202103_markers.fasta
- Add the marker sequence stored in a file new_marker.fasta to the marker set:
cat new_marker.fasta >> metaphlan_databases/mpa_vJan21_CHOCOPhlAnSGB_202103_markers.fasta
- Rebuild the bowtie2 database:
bowtie2-build metaphlan_databases/mpa_vJan21_CHOCOPhlAnSGB_202103_markers.fasta metaphlan_databases/mpa_vJan21_CHOCOPhlAnSGB_NEW
- Assume that the new marker was extracted from genome1, genome2. Update the taxonomy file from the Python console as follows:
import pickle
import bz2
db = pickle.load(bz2.open('metaphlan_databases/mpa_vJan21_CHOCOPhlAnSGB_202103.pkl', 'r'))
# Add the taxonomy of the new genomes
db['taxonomy']['7-levels taxonomy with clade names of genome1'] = ('7-levels NCBI taxonomy id of genome1', length of genome1)
db['taxonomy']['7-levels taxonomy with clade names of genome2'] = ('7-levels NCBI taxonomy id of genome1', length of genome2)
# Add the information of the new marker as the other markers
db['markers'][new_marker_name] = {
'clade': the clade that the marker belongs to,
'ext': {the GCA of the first external genome where the marker appears,
the GCA of the second external genome where the marker appears,
},
'len': length of the marker,
'taxon': the taxon of the marker
}
To see an example, try to print the first marker information:
print list(db['markers'].items())[0]
# Save the new mpa_pkl file
with bz2.BZ2File('metaphlan_databases/mpa_vJan21_CHOCOPhlAnSGB_NEW.pkl', 'w') as ofile:
pickle.dump(db, ofile, pickle.HIGHEST_PROTOCOL)
- To use the new database, remember to run metaphlan.py with the "--index mpa_vJan21_CHOCOPhlAnSGB_NEW" parameter.