-
Notifications
You must be signed in to change notification settings - Fork 0
Home
Meteor2 is a plateform for quantitative metagenomics profiling of complex ecosystems. Meteor2 relies on genes catalogue to perform species-level taxonomic profiling, functional analysis and strain-level population structure inference of metagenomic shotgun sequencing data.
Besides python packages dependencies, Meteor requires:
- python 3.10+
- bowtie2 2.3.5+
- freebayes 1.3.6+
Meteor is available with conda which includes all its dependencies:
conda create --name meteor -c conda-forge -c bioconda meteor
Or with pip with a recent Python 3.10+:
pip install meteor
You can test the installation of meteor with:
meteor test
Typically, the process will follow these steps:
- Download or build a reference catalogue
- Organize the fastq into a specific directory structure
- Generate the gene count table
- Compute taxonomical and/or functional abundances
- Merge sample profiles
- Generate strain profiles
- Build phylogenetic trees
To run following R plots, you might need to install the R dependencies in your conda environment:
conda install -c conda-forge -c bioconda -c r R r-dplyr r-ggplot2 r-ggpubr bioconductor-complexheatmap bioconductor-ggtree
Meteor2 requires to download locally a microbial gene catalogue. There are currently 10 catalogues available through Meteor2, each one corresponding to a specific ecosystem. Type meteor download --help
to get the list of available ecosystems. All catalogues are structured in exactly the same way and contain everything you need to get the whole process done: fasta files of the genes, Metagenomic Species (MSP) definition, functional and taxonomic annotation, etc. You can learn more about catalogue files and structure here.
For a given ecosystem, two versions of the catalogue coexist: full and fast. The fast version is a streamlined version of the full version, reduced to the 100 core genes of each MSP. It uses less time and memory to process samples, with a slight loss of specificity. Of course, functional profiling, which demands accessory genes detection, estimation and annotation, cannot be perform in fast mode.
Here, for demonstration purpose, we will take advantage of the fast mode, and download the catalogue associated with the human gut ecosytem (800 MB, ~2 minutes to get the .tar.gz file and extract it). Don't worry about the output directories, Meteor2 will create them if they do not exist:
meteor download -i human_gut --fast -o catalogue/
Meteor2 needs fastq files to be in a specific architecture (basically, one directory for one sample containing all its fastq, associated with specific json files). You will find comparable json files in all Meteor2 output. They are easily recognized thanks to their intruiguing suffix: _census_stage_N.json. In fastq directories, N = 0; in mapping directories, N = 1, etc. Don't overlook them, they are full of highly valuable information, such as "when? what parameters? which version?" or metrics directly computed on samples (examples will come later).
In this tutorial, we will work on samples from PRJEB42906, where healthy volunteers consumed a resistant dextrine or a control product and collected their feces at baseline and X weeks later. For saving time, we have selected 8 individuals from the resistant dextrine groups, and rarefied their samples (n = 16) to 100,000 reads each (200M in total). To download them and extract them in fastq/
directory, type:
wget https://zenodo.org/records/14197452/files/fastq.tar.gz; tar -xzf fastq.tar.gz
All the fastq files are in the same directory. Fortunately, Meteor2 will help you in creating the correct directory structure (here in sample/
):
meteor fastq -i fastq/ -o sample/
Note that you do not need to create sample/
; Meteor2 will create it if it does not exist (as all Meteor2 output directories).
In this case, we had only one fastq per sample. If you have multiple fastq for one file, use the option -m
so that Meteor2 knows they come from the same sample and should be treated as one unique fastq. For example, if filenames contain the string 'SAMPLE_01' or 'SAMPLE_02', you can specify to Meteor2 to create two directories SAMPLE_01/ and SAMPLE_02/ with the command -m SAMPLE_\\d+
.
Take a few seconds to navigate through sample/
. You will notice that Meteor2 did not delete the files from fastq/
nor did it copied them: only symbolic links were created. So don't remove or delete the initial files! Have also a look at the json file that was created. It contains only minimal set of information needed for Meteor2, but if you need to keep more staff about your sample (for example, the sequencing perform), you can add a field 'additional_info' and put anything you want in it.
Since you have your catalogue and your samples in the correct structure, you can finally start the actual sample processing! Mapping the fastq files against the gene catalogue is a crucial step, as the mapped files (in cram format) really are the central piece of the whole Meteor2 machinery. They will be used to generate the gene count table from which species and functions abundance arise, but also to call SNPs and profile your sample at strain level.
In short, Meteor2 calls bowtie2 to map reads against the catalogue genes; aligned reads are then filtered to keep only those above a customizable identity threshold (--id
). When using a full catalogue, we will generally choose 95% identity. With the fast version, to decrease false positives number, we recommend to use 97% identity. These thresholds are used by Meteor2, which automatically adapts its default parameter to the kind of catalogue you specified. Bowtie2 output is a cram file: based on this, Meteor2 will count how many reads mapped on each gene and generate the gene count table. Different counting strategies are currently implemented (total, unique, or smart shared - the default one). Go to this page if you want to find more about counting stategies.
Note: Meteor2 doesn't filter reads before mapping on catalogue. We thus recommend to first filter out reads with low-quality, length < 60nt or belonging to the host. If you want to perform rarefaction, you should also consider doing it prior to Meteor2.
To map reads from the sample ERS12377136 on your catalogue and keep the filtered cram file for further strain analysis, simply run the following command:
meteor mapping -i sample/ERS12377136 -r catalogue/hs_10_4_gut_taxo/ -o mapping/ --kf -t 8
ATTENTION: As you noticed, meteor mapping
only accepts one sample directory as input! To run it on several samples, we have to make a loop (~ 6 minutes to run, 2.6G)...
for SUBDIR in sample/*/; do [ -d "$SUBDIR" ] && meteor mapping -i "$SUBDIR" -o mapping/ -r catalogue/hs_10_4_gut_taxo/ --kf -t 8; done
...or create a workflow!
By default, the cram file generated during the mapping step is deleted at the end of the task. If you are only interested in profiling at species and functional level, that's not an issue. However, if you want to perform strain analysis, you need to specify Meteor2 to keep this file with the option --kf
. More precisely, this option ensures that you keep only the part of the cram fime required for strain profiling - i.e., reads mapping on the 100 core genes of an MSP.
You can now inspect the contents of mapping
. Even if you specified the same output directory for all samples, Meteor2 created for each sample a specific subfolder, containing several files:
- the gene count table;
- the filtered cram file (and its index);
- a census json file.
Check the json file to know which identity threshold was used by Meteor2! Most importantly, in the json file you will find information about mapping rate, i.e., the percentage of reads that found a match in the gene catalogue. You absolutely need this metrics to know if the catalogue is representative enough of your sample. A correct mapping rate shoud be above 60% (at minimum). If not, it means that your sample contains a significant amount of species that are not in the catalogue, and that a large part of the fastq information was missed. In such a case, a solution is to use another catalogue or create one, more adapted to your project.
If you take a look at the json file, you will see two different fields: the 'overall_alignment_rate' (number of reads that pass bowtie2 threshold divided by total read count) and 'counted_reads' (the actual number of reads that pass the identity threshold filter). In our example, these figures are very low: since Meteor2 fast mode uses only core genes, mapping rate metrics are not relevant to determine if the catalogue is appropriate.
Now that we have the gene count table, we can compute species abundances. As a reminder, genes from the catalogue are clustered into Metagenomic Species Pangenomes based on co-abundance thanks to MSPminer. Into a single cluster, genes are classified as 'core' or 'accessory'. To correctly assess an MSP abundance, we simply need to compute the mean abundance over 100 core genes of the MSP - that's why we can use 'fast' catalogues.
To reduce false positive species that could arise from mismapped reads, abundance of the species without enough core genes detected is forced to 0. With the full catalogue, 'enough' core genes usually means 10% of the 100 core genes used for species profiling. In fast mode, the default threshold is set to 20%.
However, to be comparable between genes, counts must first be normalized for gene length. Two normalization methods are implemented in Meteor2: coverage or FPKM. For more details about the calculation, see here. Meteor2 also enable to perform rarefaction directly on the gene count table, to ensure same sequencing depth between samples. This technique is limited to single-read samples since it could bias paired-end samples. Anyway, if you want to rarefy your samples to a common sequencing depth, we recommend you to do prior to running Meteor2.
As for meteor mapping
, meteor profile
takes a single sample as input:
meteor profile -i mapping/ERS12377136 -r catalogue/hs_10_4_gut_taxo/ -o profiles/ -n coverage
To run meteor profile
on all samples, run the following command (~ 4 minutes, 2.6G):
for SUBDIR in mapping/*/; do [ -d "$SUBDIR" ] && meteor profile -i "$SUBDIR" -o profiles/ -r catalogue/hs_10_4_gut_taxo/ -n coverage; done
Have a look at the profiles/
directory. Again, Meteor2 created the same architecture, with all results of a sample in one subdirectory. In each subdirectory, there are four outputs:
- the census_stage_2.json file (extended from the census_stage_1.json);
- the gene raw count table (a symbolic link towards mapping output);
- the normalized gene count table;
- the MSP table.
We can now check that Meteor2 forced MSP with less than 20% of core genes detected at 0 by inspecting the census_stage_2.json file (field 'msp_filter').
In the case of full catalogue, the same command would have created all functional tables along with the species tables, without any further parameter. More detais about how functions are computed can be found here.
As you have noticed, taxonomic profiles of different samples lie in different subdirectories. To gather them into a single file that will be more convenient for downstream analysis, you can run:
meteor merge -i profiles/ -r catalogue/hs_10_4_gut_taxo/ -o merging/ -p demo
By default, Meteor2 doesn't merge gene tables - to save time and memory. If you need them, for example to inspect accessory genes content of an MSP, add the parameter -g
.
Meteor2 has created the directory merging/
(if it did not exist) and merged all sample profiles into a single tsv file. Of note, MSP with 0 abundance in all merged samples are removed from the merged tsv file. In the same time, Meteor2 also generated a taxonomy file (containing taxonomy of the detected MSP) ad a report file, regrouping all information from the census_state_2.json files. All files are prefixed with 'demo' as we specified with the parameter -p
. Thus, a single merging folder can contain merging output from several runs, if you specify a different prefix. If not, the previous output will simply be deleted.
We now have the MSP abundance table that can be analyzed in Rserver or directly uploaded in shaman for example. According the the PRJEB42906 paper, Parabacteroides distasonis was increased by the resistant dextrin. Is it still the case with this sample selection? To find out, first download metadata:
wget https://zenodo.org/records/14197452/files/metadata.tar.gz; tar -xzf metadata.tar.gz
library(dplyr)
library(ggplot2)
library(ggpubr)
demo_metadata = read.delim("metadata/demo_metadata.tsv", stringsAsFactors = FALSE)
demo_msp = read.delim("merging/demo_msp.tsv",
stringsAsFactors = FALSE, row.names = 1)
all_metadata = demo_metadata %>%
left_join(tibble::rownames_to_column(as.data.frame(t(demo_msp)), var = "sample"),
by = join_by(sample))
ggplot(all_metadata, aes(timepoint, msp_0012)) +
geom_boxplot() +
geom_line(aes(group = individual_id), colour = "lightgrey", linetype = "dashed") +
geom_point() +
stat_compare_means(paired = T) +
theme_pubr()
We observed that P. distasonis is strongly increased by resistant dextrin consumption - but not in all individuals! Is there possibly a strain effect?
We can answer that question by profiling strains in each sample. From the filtered cram file that we kept at the mapping step, Meteor2 will run freebayes to perform SNP calling and generate a consensus fasta for each MSP - at least, for each MSP with sufficient signal. A consensus fasta consists in the concatenated sequences of the MSP 100 core genes, where the original bases been replaced by alternative bases according to freebayes output. Positions where information is lacking (i.e., not enough reads mapped) are simply replaced by N. We usually consider MSP with at least 80% of the core genes covered at a minimum of 50%.
For the purpose of demonstration, since sample depth was rarefied to 100,000 reads, we can lower these threshold to get some signal. WARNING: with such low thresholds results may not be meaningful, this is for demonstration only!
To run meteor strain
on one sample:
meteor strain -i mapping/ERS12377136 -o strain/ -r catalogue/hs_10_4_gut_taxo/ -m 10 -c 0.1 --kc
To run meteor strain
on all samples (takes ~32 minutes to run, 3.4G):
for SUBDIR in mapping/*/; do [ -d "$SUBDIR" ] && meteor strain -i "$SUBDIR" -o strain/ -r catalogue/hs_10_4_gut_taxo/ -m 10 -c 0.1 --kc; done
In each sample directory you will now find a fasta file per MSP that was covered enough (according to -m
and -c
criteria). This fasta contains a single entry: the version of the MSP in the sample.
Using the different MSP consensus fasta generated for each sample where signal was sufficient, Meteor2 computes mutation rates and build trees between strains from samples using a GTR+GAMMA model with the following command:
meteor tree -i strain/ -o tree/ -g 1 -t 10
Like meteor merge
, meteor tree
will gather information from all samples to create an output. For each MSP with more than 4 samples to follow, Meteor2 generated:
- a multifasta file (.fasta) grouping available sequences from all individuals;
- a tree file (.tree) that can be visualized with iTOL for example;
- a mutation rate matrix file (.tsv), giving the mutation rates between each pair of indivudual.
With such a low sequencing depth, meteor strain
and meteor tree
output are not meaningful (don't forget you used -m 10
and -c 0.1
as parameters to follow some strains!). For visualization purposes, we have run the whole process (meteor mapping
, meteor strain
and meteor tree
) on the raw samples (without rarefaction). You can download the tree and mutation rate files of Parabacteroides distasonis that resulted from that process here:
wget https://zenodo.org/records/14197452/files/tree_raw.tar.gz; tar -xzf tree_raw.tar.gz
Now you can display the mutation matrix as a heatmap:
library(ComplexHeatmap)
library(ggplot2)
library(ggpubr)
demo_metadata = read.delim("metadata/demo_metadata.tsv", stringsAsFactors = FALSE)
rownames(demo_metadata) = demo_metadata$sample
mutation_matrix = as.matrix(read.delim("tree_raw/msp_0012.tsv", stringsAsFactors = FALSE, row.names = 1))
Heatmap(mutation_matrix,
name = "Dist",
border = TRUE,
col = circlize::colorRamp2(c(0, 0.007), c("red", "white")),
show_column_names = FALSE,
show_row_names = FALSE,
show_heatmap_legend = TRUE,
top_annotation = HeatmapAnnotation(ind = demo_metadata[colnames(mutation_matrix), c("individual_id")],
logFC = demo_metadata[colnames(mutation_matrix), c("logFC")],
col = list(logFC = circlize::colorRamp2(c(-5, 0, 5), c("#D55E00", "white", "#0072B2")),
ind = setNames(rainbow(8), unique(demo_metadata$individual_id))),
annotation_name_side = "left",
show_legend = TRUE))
Or visualize directly the tree in R:
library(ggplot2)
library(ggpubr)
library(ggtree)
demo_metadata = read.delim("metadata/demo_metadata.tsv", stringsAsFactors = FALSE)
rownames(demo_metadata) = demo_metadata$sample
tree = ape::read.tree("tree_raw/msp_0012.tree")
p = ggtree(tree) %<+% demo_metadata
p + geom_tiplab(offset = 0.001, hjust = 1, size = 3) +
geom_tippoint(aes(shape = logFC > 0, color = individual_id), size = 3)