Yet Another Metagenomic Pipeline (YAMP)

Thanks to the increased cost-effectiveness of high-throughput technologies, the number of studies focusing on microorganisms (bacteria, archaea, microbial eukaryotes, fungi, and viruses) and on their connections with human health and diseases has surged. Consequently, a plethora of approaches and software has become available for their study, making it difficult to select the best methods and tools.

Here we present Yet Another Metagenomic Pipeline (YAMP), which starts from the raw sequencing data and, with a strong focus on quality control, processes the data up to the functional annotation within hours (please refer to the YAMP wiki for more information).

YAMP is built on Nextflow, a framework based on the dataflow programming model that allows writing workflows that are highly parallel, easily portable (including on distributed systems), and very flexible and customisable. YAMP inherits all of these characteristics. New modules can be added easily and the existing ones can be customised, although we already provide default parameters derived from our own experience.

YAMP is accompanied by a Docker container that saves users the hassle of installing the required software and, at the same time, increases the reproducibility of the YAMP results (see the Using Docker section).

Dependencies

The required tools need to be in the system path, with read and execute permissions. Notably, MetaPhlAn2, QIIME, and HUMAnN2 are also available in bioconda.

The required tools (except Nextflow) are also included in a Docker container (please refer to the Using Docker section). If using the container, Docker (https://www.docker.com) should be installed.
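
A quick way to confirm that the tools are reachable from the system path is sketched below. The function name missing_tools is hypothetical, and the executable names in the example comment are assumptions that may differ in your installation.

```shell
# Print every command from the argument list that cannot be found on the PATH.
# Returns 0 when all commands are found, 1 otherwise.
missing_tools() {
    local rc=0 tool
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            printf '%s\n' "$tool"
            rc=1
        fi
    done
    return "$rc"
}

# Example (executable names are assumptions, adjust them to your installation):
#   missing_tools nextflow bbduk.sh metaphlan2.py humann2 \
#       || echo "Install the tools listed above before running YAMP." >&2
```

The check only tests that a command resolves on the PATH. It does not verify tool versions.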

Installation

Clone the YAMP repository in a directory of your choice:

git clone https://github.com/alesssia/YAMP.git

The repository includes:

  • the Nextflow script, YAMP.nf,
  • the configuration file, nextflow.config,
  • a folder (bin) containing two helper scripts (fastQC.sh and logQC.sh),
  • a folder (yampdocker) containing the Docker file used to build the Docker image (Dockerfile).

Note: the nextflow.config file includes the parameters that are used in our tutorials (check the YAMP wiki!).

Other requirements

YAMP requires a set of databases that are queried during its execution. Some of them should be downloaded automatically when installing the tools listed in the dependencies (or by using specialised scripts, such as those available with HUMAnN2), whilst others should be created by the user. Specifically, you will need:

  • a FASTA file listing the adapter sequences to remove in the trimming step. This file should be available within the BBmap installation. If not, please download it from here;
  • two FASTA files describing synthetic contaminants. These files (sequencing_artifacts.fa.gz and phix174_ill.ref.fa.gz) should be available within the BBmap installation. If not, please download them from here;
  • a FASTA file describing the contaminating genome(s). This file should be created by the users according to the contaminants present in their dataset. When analysing human metagenomes, we suggest that users always include the human genome. Please note that this file should be indexed beforehand. This can be done with BBMap, using the following command: bbmap.sh -Xmx24G ref=my_contaminants_genomes.fa.gz. We suggest downloading the FASTA file provided by Brian Bushnell for removing human contamination, using the instructions available here;
  • the BowTie2 database file for MetaPhlAn2. This file should be available within the MetaPhlAn2 installation. If not, please download it from here;
  • the ChocoPhlAn and UniRef databases, which can be downloaded directly by HUMAnN2, as explained here;
  • [optional] a phylogenetic tree used by QIIME to compute a set of alpha-diversity measures (see here for details).
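
When the contaminants come from more than one genome, the individual FASTA files first have to be combined into the single file described above. A minimal sketch of that preparation follows. The function names and the input file names in the example comment are hypothetical, and it assumes every input is gzip-compressed and ends with a newline. Only the bbmap.sh call is taken from the list above.

```shell
# Merge several gzip-compressed FASTA files into a single compressed file.
# Usage: merge_contaminants OUTPUT.fa.gz INPUT1.fa.gz [INPUT2.fa.gz ...]
merge_contaminants() {
    local out="$1"
    shift
    if [ "$#" -eq 0 ]; then
        echo "merge_contaminants: no input files given" >&2
        return 2
    fi
    # Decompress every input into one stream, then recompress it.
    gzip -dc "$@" | gzip > "$out"
}

# Index the merged file with the BBMap command given in the list above.
# Warns and returns 1 when bbmap.sh is not on the PATH.
index_contaminants() {
    if command -v bbmap.sh >/dev/null 2>&1; then
        bbmap.sh -Xmx24G ref="$1"
    else
        echo "bbmap.sh not found on the PATH; skipping indexing" >&2
        return 1
    fi
}

# Example (hypothetical input names):
#   merge_contaminants my_contaminants_genomes.fa.gz human.fa.gz phage.fa.gz
#   index_contaminants my_contaminants_genomes.fa.gz
```

The output file must not also be one of the inputs, because it is truncated before the inputs are read.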

Usage

  1. Modify the nextflow.config file, specifying the necessary parameters, such as the path to the aforementioned databases.
  2. From a terminal window run the YAMP.nf script using the following command (when the library layout is 'paired'):
    nextflow run YAMP.nf --reads1 R1 --reads2 R2 --prefix mysample --outdir outputdir --mode MODE
    
    where R1 and R2 are the paths to the raw data (two compressed paired-end FASTQ files), mysample is a prefix that will be used to label all the resulting files, outputdir is the directory where the results will be stored, and MODE is one of the following: < QC, characterisation, complete >; or the following command (when the library layout is 'single'):
    nextflow run YAMP.nf --reads1 R --prefix mysample --outdir outputdir --mode MODE --librarylayout single
    
    where R is the path to the raw data (a compressed single-end FASTQ file), --librarylayout single specifies that the reads are single-end, and the other parameters are as above.
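
The two invocations above differ only in the --reads2 and --librarylayout options, so they can be wrapped in one helper. The sketch below is not part of YAMP: the function name run_yamp and the NEXTFLOW override variable are hypothetical, and it assumes it is run from the directory containing YAMP.nf.

```shell
# Run YAMP on paired-end reads (two read files) or single-end reads (one file).
# Usage: run_yamp MODE PREFIX OUTDIR READS1 [READS2]
# NEXTFLOW may be set to use a different nextflow executable.
run_yamp() {
    local mode="$1" prefix="$2" outdir="$3" reads1="$4" reads2="${5:-}"

    # Accept only the three modes listed above.
    case "$mode" in
        QC|characterisation|complete) ;;
        *) echo "run_yamp: unknown mode: $mode" >&2; return 2 ;;
    esac

    # Build the argument list step by step so that paths keep their quoting.
    set -- run YAMP.nf --reads1 "$reads1"
    if [ -n "$reads2" ]; then
        set -- "$@" --reads2 "$reads2"
    fi
    set -- "$@" --prefix "$prefix" --outdir "$outdir" --mode "$mode"
    if [ -z "$reads2" ]; then
        set -- "$@" --librarylayout single
    fi

    "${NEXTFLOW:-nextflow}" "$@"
}

# Examples:
#   run_yamp complete mysample outputdir R1 R2    # paired-end
#   run_yamp QC mysample outputdir R              # single-end
```

Setting NEXTFLOW=echo prints the arguments instead of starting the pipeline, which is a convenient way to inspect the command before running it.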

Does it seem complicated? The YAMP wiki has some tutorials!

Using Docker

To use the tools made available through the Docker container, one could either pull the pre-built image from DockerHub, using the following command:

docker pull alesssia/yampdocker

or build a local image using the file Dockerfile in the yampdocker folder. To build a local image, first change into the yampdocker folder and then run the following command (be careful to add the dot!):

docker build -t yampdocker .

In both cases, the image can be used by YAMP by running the command presented above with the -with-docker option followed by the image name (yampdocker for a locally built image, alesssia/yampdocker for the image pulled from DockerHub):

nextflow run YAMP.nf --reads1 R1 --reads2 R2 --prefix mysample --outdir outputdir --mode MODE -with-docker yampdocker

where R1 and R2 are the paths to the raw data (two compressed FASTQ files), mysample is a prefix that will be used to label all the resulting files, outputdir is the directory where the results will be stored, and MODE is one of the following: < QC, characterisation, complete >.

Please note that Nextflow is not included in the Docker container and should be installed as explained here.
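
The image name to pass to -with-docker depends on how the image was obtained: docker pull stores it as alesssia/yampdocker, whereas the build command above tags it yampdocker. The helper below is a hypothetical sketch that picks whichever applies. The function name yamp_image and the DOCKER override variable are assumptions, not part of YAMP.

```shell
# Print the name of the YAMP image to pass to -with-docker:
# the locally built image if it exists, otherwise the DockerHub one.
# DOCKER may be set to use a different docker executable.
yamp_image() {
    if "${DOCKER:-docker}" image inspect yampdocker >/dev/null 2>&1; then
        echo "yampdocker"
    else
        echo "alesssia/yampdocker"
    fi
}

# Example, using the paired-end command shown above:
#   nextflow run YAMP.nf --reads1 R1 --reads2 R2 --prefix mysample \
#       --outdir outputdir --mode MODE -with-docker "$(yamp_image)"
```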

Changelog

0.9.3 / 2017-08-30

Enhancements:

  • YAMP can now handle both paired-end and single-end reads
  • The de-duplication step is now optional and can be skipped (default: true)

0.9.2 / 2017-07-10

Enhancements:

  • YAMP can now be run in three "modes": < QC, characterisation, complete >.

License

YAMP is licensed under GNU GPL v3.

Acknowledgements

Alessia would like to thank Brian Bushnell for his helpful suggestions about how to successfully use the BBmap suite in a metagenomics context and for providing several useful resources, and Paolo Di Tommaso, for helping her in using Nextflow properly!
