Thanks to the increased cost-effectiveness of high-throughput technologies, the number of studies focusing on microorganisms (bacteria, archaea, microbial eukaryotes, fungi, and viruses) and on their connections with human health and disease has surged. Consequently, a plethora of approaches and software has become available for their study, making it difficult to select the best methods and tools.
Here we present Yet Another Metagenomic Pipeline (YAMP), which starts from the raw sequencing data and, with a strong focus on quality control, processes them up to functional annotation within hours (please refer to the YAMP wiki for more information).
YAMP is built on Nextflow, a framework based on the dataflow programming model, which allows writing workflows that are highly parallel, easily portable (including on distributed systems), and very flexible and customisable; YAMP inherits all of these characteristics. New modules can be added easily and the existing ones can be customised, although we already provide default parameters derived from our own experience.
YAMP is accompanied by a Docker container that saves users the hassle of installing the required software and, at the same time, increases the reproducibility of the YAMP results (see the Using Docker section).
- Nextflow (https://github.com/nextflow-io/nextflow)
- fastQC v0.11.2+ (http://www.bioinformatics.babraham.ac.uk/projects/fastqc)
- BBMap v36.92+ (https://sourceforge.net/projects/bbmap)
- Samtools v1.3.1 (http://samtools.sourceforge.net)
- MetaPhlAn2 v2.0+ (https://bitbucket.org/biobakery/metaphlan2)
- QIIME v1.9.1+ (http://qiime.org)
- HUMAnN2 v0.9.9+ (https://bitbucket.org/biobakery/humann2)
These tools need to be in the system path with execute and read permissions. Notably, MetaPhlAn2, QIIME, and HUMAnN2 are also available through Bioconda.
The required tools (except Nextflow) are also included in a Docker container (please refer to the Using Docker section). If using the container, Docker (https://www.docker.com) should be installed.
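For an installation that does not use Docker, a quick way to confirm that the tools listed above are reachable is to look each one up on the system path. The snippet below is a minimal sketch: the `check_tools` helper is hypothetical, and the executable names passed to it (for example `metaphlan2.py`, or `alpha_diversity.py` as a stand-in for QIIME) are assumptions that may differ in your installation.

```shell
# Print "missing: NAME" for every executable that is not on the PATH;
# return 1 if at least one is missing, 0 otherwise.
check_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# The executable names below are assumptions; adjust them to your installation.
check_tools nextflow fastqc bbmap.sh samtools metaphlan2.py alpha_diversity.py humann2 \
    || echo "Some tools are missing: install them or use the Docker container."
```

The check only verifies that the executables can be found; it does not verify their versions.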
Clone the YAMP repository into a directory of your choice:
git clone https://github.com/alesssia/YAMP.git
The repository includes:
- the Nextflow script, `YAMP.nf`,
- the configuration file, `nextflow.config`,
- a folder (`bin`) containing two helper scripts (`fastQC.sh` and `logQC.sh`),
- a folder (`yampdocker`) containing the Docker file used to build the Docker image (`Dockerfile`).
Note: the `nextflow.config` file includes the parameters that are used in our tutorials (check the YAMP wiki!).
YAMP requires a set of databases that are queried during its execution. Some of them should be downloaded automatically when installing the tools listed in the dependencies (or by using specialised scripts, such as those available with HUMAnN2), whilst others must be created by the user. Specifically, you will need:
- a FASTA file listing the adapter sequences to remove in the trimming step. This file should be available within the BBMap installation. If not, please download it from here;
- two FASTA files describing synthetic contaminants. These files (`sequencing_artifacts.fa.gz` and `phix174_ill.ref.fa.gz`) should be available within the BBMap installation. If not, please download them from here;
- a FASTA file describing the contaminating genome(s). This file should be created by the user according to the contaminants present in their dataset. When analysing human metagenomes, we suggest always including the human genome. Please note that this file must be indexed beforehand. This can be done with BBMap, using the following command: `bbmap.sh -Xmx24G ref=my_contaminants_genomes.fa.gz`. We suggest downloading the FASTA file provided by Brian Bushnell for removing human contamination, following the instructions available here;
- the Bowtie2 database file for MetaPhlAn2. This file should be available within the MetaPhlAn2 installation. If not, please download it from here;
- the ChocoPhlAn and UniRef databases, which can be downloaded directly by HUMAnN2, as explained here;
- [optional] a phylogenetic tree used by QIIME to compute a set of alpha-diversity measures (see here for details).
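When more than one contaminating genome is involved, the genomes need to end up in a single FASTA file before indexing. The following is a minimal sketch of one way to do this, not part of YAMP itself: the `merge_fasta` helper and the input file names (`human.fa.gz`, `mouse.fa.gz`) are hypothetical, while the `bbmap.sh` call is the indexing command given above. The indexing step is guarded so that it only runs when the inputs and BBMap are actually present.

```shell
# Merge several gzip-compressed FASTA files into one compressed FASTA.
# Usage: merge_fasta OUTPUT INPUT...
merge_fasta() {
    out=$1
    shift
    # Decompress all inputs in order and recompress them as a single stream.
    gzip -dc "$@" | gzip > "$out"
}

# human.fa.gz and mouse.fa.gz are hypothetical file names.
if [ -f human.fa.gz ] && [ -f mouse.fa.gz ] && command -v bbmap.sh >/dev/null 2>&1; then
    merge_fasta my_contaminants_genomes.fa.gz human.fa.gz mouse.fa.gz
    # Index the merged reference, as described in the list above.
    bbmap.sh -Xmx24G ref=my_contaminants_genomes.fa.gz
fi
```

The memory setting (`-Xmx24G`) is the one used in the command above and may need adjusting to the size of the merged reference and to the available RAM.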
- Modify the `nextflow.config` file, specifying the necessary parameters, such as the path to the aforementioned databases.
- From a terminal window run the `YAMP.nf` script using the following command (when the library layout is 'paired'):
  nextflow run YAMP.nf --reads1 R1 --reads2 R2 --prefix mysample --outdir outputdir --mode MODE
  where `R1` and `R2` represent the path to the raw data (two compressed paired-end FASTQ files), `mysample` is a prefix that will be used to label all the resulting files, `outputdir` is the directory where the results will be stored, and `MODE` is any of the following: < QC, characterisation, complete >; or the following command (when the library layout is 'single'):
  nextflow run YAMP.nf --reads1 R --prefix mysample --outdir outputdir --mode MODE --librarylayout single
  where `R` represents the path to the raw data (a compressed single-end FASTQ file), `librarylayout single` specifies that single-end reads are at hand, and the other parameters are as above.
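The two invocations above differ only in the read files and in the `--librarylayout` flag, so a small wrapper can assemble the right one. The sketch below is illustrative only: the `yamp_cmd` helper and the FASTQ file names are hypothetical, and the function merely prints the command (a dry run) rather than executing it. It uses only the options shown above and assumes paths without spaces.

```shell
# Print the YAMP command line for one sample (dry run; nothing is executed).
# Usage: yamp_cmd MODE PREFIX OUTDIR R1 [R2]
yamp_cmd() {
    mode=$1; prefix=$2; outdir=$3; r1=$4; r2=${5:-}
    # Accept only the three modes listed above.
    case "$mode" in
        QC|characterisation|complete) ;;
        *) echo "unknown mode: $mode" >&2; return 1 ;;
    esac
    if [ -n "$r2" ]; then
        # Two FASTQ files given: paired-end layout.
        echo "nextflow run YAMP.nf --reads1 $r1 --reads2 $r2 --prefix $prefix --outdir $outdir --mode $mode"
    else
        # One FASTQ file given: single-end layout.
        echo "nextflow run YAMP.nf --reads1 $r1 --prefix $prefix --outdir $outdir --mode $mode --librarylayout single"
    fi
}

# Hypothetical file names, for illustration only.
yamp_cmd QC mysample outputdir sample_R1.fastq.gz sample_R2.fastq.gz
yamp_cmd complete mysample outputdir sample.fastq.gz
```

Once the printed command looks right, it can be pasted into the terminal and run from the directory containing `YAMP.nf`.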
Does it seem complicated? The YAMP wiki has some tutorials!
To use the tools made available through the Docker container, one could either pull the pre-built image from DockerHub, using the following command:
docker pull alesssia/yampdocker
or build a local image using the file `Dockerfile` in the `yampdocker` folder. To build a local image, first enter the `yampdocker` folder and then run the following command (be careful to include the dot!):
docker build -t yampdocker .
In both cases, the image can be used by YAMP by running the command presented above and adding `-with-docker` followed by the image name (`yampdocker` for a locally built image; `alesssia/yampdocker` if the image was pulled from DockerHub):
nextflow run YAMP.nf --reads1 R1 --reads2 R2 --prefix mysample --outdir outputdir --mode MODE -with-docker yampdocker
where `R1` and `R2` represent the path to the raw data (two compressed FASTQ files), `mysample` is a prefix that will be used to label all the resulting files, `outputdir` is the directory where the results will be stored, and `MODE` is any of the following: < QC, characterisation, complete >.
Please note that Nextflow is not included in the Docker container and should be installed as explained here.
Enhancements:
- YAMP can now handle both paired-end and single-end reads
- The de-duplication step is now optional and can be skipped (default: true)
Enhancements:
- YAMP can now be run in three "modes": < QC, characterisation, complete >.
YAMP is licensed under GNU GPL v3.
Alessia would like to thank Brian Bushnell for his helpful suggestions on how to use the BBMap suite successfully in a metagenomics context and for providing several useful resources, and Paolo Di Tommaso for helping her use Nextflow properly!