diff --git a/README.md b/README.md index 49fa7fb..0afa267 100644 --- a/README.md +++ b/README.md @@ -1,164 +1,31 @@ # ![NGI-MethylSeq](docs/images/NGI-MethylSeq_logo.png) -> # UNDER DEVELOPMENT -> This pipeline is currently under active development and has very little in the way of testing. You may well have problems running it. Any help debugging is very welcome! Please fork, make changes and submit a pull request. +[![Build Status](https://travis-ci.org/SciLifeLab/NGI-MethylSeq.svg?branch=master)](https://travis-ci.org/SciLifeLab/NGI-MethylSeq) +[![Nextflow](https://img.shields.io/badge/nextflow-%E2%89%A50.22.2-brightgreen.svg)](https://www.nextflow.io/) -**NGI-MethylSeq** is a bioinformatics best-practice analysis pipeline used for Methylation (BS-Seq) data analysis at the [National Genomics Infastructure](https://ngisweden.scilifelab.se/) at [SciLifeLab Stockholm](https://www.scilifelab.se/platforms/ngi/), Sweden. +### Introduction + +NGI-MethylSeq is a bioinformatics best-practice analysis pipeline used for Methylation (BS-Seq) data analysis at the [National Genomics Infastructure](https://ngisweden.scilifelab.se/) at [SciLifeLab Stockholm](https://www.scilifelab.se/platforms/ngi/), Sweden. The pipeline uses [Nextflow](https://www.nextflow.io), a bioinformatics workflow tool. It pre-processes raw data from FastQ inputs, aligns the reads and performs extensive quality-control on the results. This pipeline is primarily used with a SLURM cluster on the Swedish [UPPMAX systems](https://www.uppmax.uu.se). However, the pipeline should be able to run on any system that Nextflow supports. We have done some limited testing using Docker and AWS, and the pipeline comes with some configuration for these systems. See the [installation docs](docs/installation.md) for more information. -## Pipeline Results -See the [pipeline documentation](docs/README.md) for explanations of -the results files. - -## Installation -### NextFlow installation -See https://github.com/SciLifeLab/NGI-NextflowDocs for instructions on how to install and configure -Nextflow. - -### Pipeline installation -This pipeline itself needs no installation - NextFlow will automatically fetch it from GitHub when run if -`SciLifeLab/NGI-MethylSeq` is specified as the pipeline name. - -If you prefer, you can download the files yourself from GitHub and run them directly: -``` -git clone https://github.com/SciLifeLab/NGI-MethylSeq.git -nextflow NGI-MethylSeq/main.nf -``` - -## Configuration -By default, the pipeline is configured to run on the Swedish UPPMAX cluster (milou / irma). - -You will need to specify your UPPMAX project ID when running a pipeline. To do this, use -the command line flag `--project `. - -To avoid having to specify this every time you run Nextflow, you can add it to your -personal Nextflow config file instead. Add this line to `~/.nextflow/config`: - -```groovy -params.project = 'project_ID' -``` - -The pipeline will exit with an error message if you try to run it pipeline with the default -UPPMAX config profile and don't set project. - - -### Running on other clusters -It is entirely possible to run this pipeline on other clusters, though you will need to set up -your own config file so that the script knows where to find your reference files and how your -cluster works. - -Copy the contents of [`conf/uppmax.config`](conf/uppmax.config) to your own config file somewhere -and then reference it with `-c` when running the pipeline. - -If you think that there are other people using the pipeline who would benefit from your configuration -(eg. other common cluster setups), please let us know. It should be easy to create a new config file -in `conf` and reference this as a named profile in [`nextflow.config`](nextflow.config). Then these -configuration options can be used by specifying `-profile ` when running the pipeline. - - -## Running the pipeline -The typical command for running the pipeline is as follows: -``` -nextflow SciLifeLab/NGI-MethylSeq --reads '*_R{1,2}.fastq.gz' --genome GRCh37 -``` - -Note that the pipeline will create files in your working directory: -```bash -work # Directory containing the nextflow working files -results # Finished results (configurable, see below) -.nextflow_log # Log file from Nextflow -# Other nextflow hidden files, eg. history of pipeline runs and old logs. -``` - -### `--reads` -Location of the input FastQ files: -``` - --reads 'path/to/data/sample_*_{1,2}.fastq' -``` - -**NB: Must be enclosed in quotes!** - -Note that the `{1,2}` parentheses are required to specify paired end data. Running `--reads '*.fastq'` will treat -all files as single end. Also, note that the file path should be in quotation marks to prevent shell glob expansion. - -If left unspecified, the pipeline will assume that the data is in a directory called `data` in the working directory. - -### `--genome` -The reference genome to use of the analysis, needs to be one of the genome specified in the config file. - -See [`conf/uppmax.config`](conf/uppmax.config) for a list of the supported reference genomes -and their keys. Common genomes that are supported are: - -* Human - * `--genome GRCh37` -* Mouse - * `--genome GRCm38` -* Drosophila - * `--genome BDGP6` -* _S. cerevisiae_ - * `--genome 'R64-1-1'` - -> There are numerous others - check the config file for more. - -If you usually want to work with a single species, you can set a default in your user config file. -For example, add this line to `~/.nextflow/config`: -``` -params.genome = 'GRCh37' -``` - -### Trimming Parameters -The pipeline accepts a number of parameters to change how the trimming is done, according to your data type. -The following settings are set for these command line flags: - -| Parameter | 5' R1 Trim | 5' R2 Trim | 3' R1 Trim | 3' R2 Trim | -|-----------------|------------|------------|------------|------------| -| `--pbat` | 6 | 9 | 6 | 9 | -| `--single_cell` | 6 | 6 | 6 | 6 | -| `--epignome` | 8 | 8 | 8 | 8 | -| `--accel` | 10 | 15 | 10 | 10 | -| `--zymo` | 10 | 15 | 10 | 10 | -| `--cegx` | 6 | 6 | 2 | 2 | - -You can specify custom trimming parameters as follows: - -* `--clip_r1 ` -* `--clip_r2 ` -* `--three_prime_clip_r1 ` -* `--three_prime_clip_r2 ` - -Finally, specifying `--rrbs` will pass on the `--rrbs` parameter to TrimGalore! - -### Bismark Parameters -Using the `--pbat` parameter will affect the trimming (see above) and also set the `--pbat` flag when -aligning with Bismark. - -Using the `--single_cell` or `--zymo` parameters will set the `--non_directional` flag when aligning with Bismark. -This can also be set with `--non_directional` (doesn't affect trimming). - -Use the `--unmapped` flag to set the `--unmapped` flag with Bismark align and save the unmapped reads. +### Documentation +The NGI-MethylSeq pipeline comes with documentation about the pipeline, found in the `docs/` directory: -### Deduplication -By default, the pipeline includes a deduplication step after alignment. This is skipped if `--rrbs` or `--nodedup` are specified on the command line. +1. [Installation and configuration](docs/installation.md) +2. [Running the pipeline](docs/usage.md) +3. [Output and how to interpret the results](docs/output.md) -### `--bismark_index` -If you prefer, you can specify the full path to your reference genome when you run the pipeline: -``` ---bismark_index [path to Bismark index] -``` +If you're interested in running the pipeline in the cloud, please read the docs about using our pipeline with Amazon Web Services on the [NGI-RNAseq pipeline](https://github.com/SciLifeLab/NGI-RNAseq/blob/master/docs/amazon_web_services.md) (the instructions should work with this pipeline as well). -### `--outdir` -The output directory where the results will be saved. +### Credits +These scripts were written for use at the [National Genomics Infrastructure](https://portal.scilifelab.se/genomics/) at [SciLifeLab](http://www.scilifelab.se/) in Stockholm, Sweden. Written by Phil Ewels (@ewels) and Rickard Hammarén (@Hammarn). -### `-c` -Specify the path to a specific config file (this is a core NextFlow command). Useful if using different UPPMAX -projects or different sets of reference genomes. +--- -## Credits -These scripts were written for use at the [National Genomics Infrastructure](https://portal.scilifelab.se/genomics/) -at [SciLifeLab](http://www.scilifelab.se/) in Stockholm, Sweden. -Written by Phil Ewels (@ewels) and Rickard Hammarén (@Hammarn). +[![SciLifeLab](https://raw.githubusercontent.com/SciLifeLab/NGI-MethylSeq/master/docs/images/SciLifeLab_logo.png)](http://www.scilifelab.se/) +[![National Genomics Infrastructure](https://raw.githubusercontent.com/SciLifeLab/NGI-MethylSeq/master/docs/images/NGI_logo.png)](https://ngisweden.scilifelab.se/) -

+--- diff --git a/bismark.nf b/bismark.nf index 460df4c..28154f3 100644 --- a/bismark.nf +++ b/bismark.nf @@ -27,6 +27,8 @@ params.email = false params.genome = false params.bismark_index = params.genome ? params.genomes[ params.genome ].bismark ?: false : false params.saveReference = false +params.saveTrimmed = false +params.saveAlignedIntermediates = false params.reads = "data/*_R{1,2}.fastq.gz" params.outdir = './results' params.multiqc_config = "$baseDir/conf/multiqc_config.yaml" @@ -39,7 +41,7 @@ params.relaxMismatches = false params.numMismatches = 0.6 // 0.6 will allow a penalty of bp * -0.6 // For 100bp reads, this is -60. Mismatches cost -6, gap opening -5 and gap extension -2 -// Sp -60 would allow 10 mismatches or ~ 8 x 1-2bp indels +// So -60 would allow 10 mismatches or ~ 8 x 1-2bp indels // Bismark default is 0.2 (L,0,-0.2), Bowtie2 default is 0.6 (L,0,-0.6) // Validate inputs @@ -114,7 +116,9 @@ summary['Working dir'] = workflow.workDir summary['Output dir'] = params.outdir summary['Script dir'] = workflow.projectDir summary['Deduplication'] = params.nodedup ? 'No' : 'Yes' +summary['Save Trimmed'] = params.saveTrimmed summary['Save Unmapped'] = params.unmapped ? 'Yes' : 'No' +summary['Save Intermeds'] = params.saveAlignedIntermediates summary['Directional Mode'] = params.non_directional ? 'No' : 'Yes' summary['All C Contexts'] = params.comprehensive ? 'Yes' : 'No' if(params.rrbs) summary['RRBS Mode'] = 'On' @@ -144,7 +148,8 @@ if( workflow.profile == 'standard' && !params.project ) exit 1, "No UPPMAX proje */ process fastqc { tag "$name" - publishDir "${params.outdir}/fastqc", mode: 'copy' + publishDir "${params.outdir}/fastqc", mode: 'copy', + saveAs: {filename -> filename.indexOf(".zip") > 0 ? "zips/$filename" : "$filename"} input: set val(name), file(reads) from read_files_fastqc @@ -167,14 +172,20 @@ if(params.notrim){ } else { process trim_galore { tag "$name" - publishDir "${params.outdir}/trim_galore", mode: 'copy' + publishDir "${params.outdir}/trim_galore", mode: 'copy', + saveAs: {filename -> + if (filename.indexOf("_fastqc") > 0) "FastQC/$filename" + else if (filename.indexOf("trimming_report.txt") > 0) "logs/$filename" + else params.saveTrimmed ? filename : null + } input: set val(name), file(reads) from read_files_trimming output: set val(name), file('*fq.gz') into trimmed_reads - file '*trimming_report.txt' into trimgalore_results + file "*trimming_report.txt" into trimgalore_results + file "*_fastqc.{zip,html}" into trimgalore_fastqc_reports script: c_r1 = params.clip_r1 > 0 ? "--clip_r1 ${params.clip_r1}" : '' @@ -182,13 +193,14 @@ if(params.notrim){ tpc_r1 = params.three_prime_clip_r1 > 0 ? "--three_prime_clip_r1 ${params.three_prime_clip_r1}" : '' tpc_r2 = params.three_prime_clip_r2 > 0 ? "--three_prime_clip_r2 ${params.three_prime_clip_r2}" : '' rrbs = params.rrbs ? "--rrbs" : '' + non_directional = params.rrbs && params.non_directional ? "--non_directional" : '' if (params.singleEnd) { """ - trim_galore --gzip $rrbs $c_r1 $tpc_r1 $reads + trim_galore --fastqc --gzip $rrbs $c_r1 $tpc_r1 $reads """ } else { """ - trim_galore --paired --gzip $rrbs $c_r1 $c_r2 $tpc_r1 $tpc_r2 $reads + trim_galore --paired --fastqc --gzip $rrbs $c_r1 $c_r2 $tpc_r1 $tpc_r2 $reads """ } } @@ -199,7 +211,12 @@ if(params.notrim){ */ process bismark_align { tag "$name" - publishDir "${params.outdir}/bismark_alignments", mode: 'copy' + publishDir "${params.outdir}/bismark_alignments", mode: 'copy', + saveAs: {filename -> + if (filename.indexOf(".fq.gz") > 0) "unmapped/$filename" + else if (filename.indexOf(".bam") == -1) "logs/$filename" + else params.saveAlignedIntermediates || params.nodedup || params.rrbs ? filename : null + } input: file index from bismark_index @@ -240,7 +257,8 @@ if (params.nodedup || params.rrbs) { } else { process bismark_deduplicate { tag "${bam.baseName}" - publishDir "${params.outdir}/bismark_deduplicated", mode: 'copy' + publishDir "${params.outdir}/bismark_deduplicated", mode: 'copy', + saveAs: {filename -> filename.indexOf(".bam") == -1 ? "logs/$filename" : "$filename"} input: file bam @@ -267,7 +285,14 @@ if (params.nodedup || params.rrbs) { */ process bismark_methXtract { tag "${bam.baseName}" - publishDir "${params.outdir}/bismark_methylation_calls", mode: 'copy' + publishDir "${params.outdir}/bismark_methylation_calls", mode: 'copy', + saveAs: {filename -> + if (filename.indexOf("splitting_report.txt") > 0) "logs/$filename" + else if (filename.indexOf("M-bias") > 0) "m-bias/$filename" + else if (filename.indexOf(".cov") > 0) "methylation_coverage/$filename" + else if (filename.indexOf("bedGraph") > 0) "bedGraph/$filename" + else "methylation_calls/$filename" + } input: file bam from bam_dedup @@ -366,7 +391,7 @@ process bismark_summary { */ process qualimap { tag "${bam.baseName}" - publishDir "${params.outdir}/Qualimap", mode: 'copy' + publishDir "${params.outdir}/qualimap", mode: 'copy' input: file bam from bam_dedup_qualimap diff --git a/docs/images/NGI_logo.png b/docs/images/NGI_logo.png new file mode 100644 index 0000000..6ef69f8 Binary files /dev/null and b/docs/images/NGI_logo.png differ diff --git a/docs/installation.md b/docs/installation.md new file mode 100644 index 0000000..d21ed56 --- /dev/null +++ b/docs/installation.md @@ -0,0 +1,174 @@ +# NGI-MethylSeq Installation + +To start using the NGI-MethylSeq pipeline, there are three steps described below: + +1. [Install Nextflow](#1-install-nextflow) +2. [Install the pipeline](#2-install-the-pipeline) +3. Configure the pipeline + * [Swedish UPPMAX System](#31-configuration-uppmax) + * [Other Clusters](#32-configuration-other-clusters) + * [Docker](#33-configuration-docker) + * [Amazon AWS](#34-configuration-amazon-ec2) + +## 1) Install NextFlow +Nextflow runs on most POSIX systems (Linux, Mac OSX etc). It can be installed by running the following commands: + +```bash +# Make sure that Java v7+ is installed: +java -version + +# Install Nextflow +curl -fsSL get.nextflow.io | bash + +# Add Nextflow binary to your PATH: +mv nextflow ~/bin +# OR system-wide installation: +# sudo mv nextflow /usr/local/bin +``` + +See [nextflow.io](https://www.nextflow.io/) and [NGI-NextflowDocs](https://github.com/SciLifeLab/NGI-NextflowDocs) for further instructions on how to install and configure Nextflow. + +## 2) Install the Pipeline +This pipeline itself needs no installation - NextFlow will automatically fetch it from GitHub if `SciLifeLab/NGI-MethylSeq` is specified as the pipeline name. + +If you prefer, you can download the files yourself from GitHub and run them directly: + +```bash +git clone https://github.com/SciLifeLab/NGI-MethylSeq.git +nextflow run NGI-MethylSeq/bismark.nf +nextflow run NGI-MethylSeq/bwa-meth.nf # Alternative bwa-meth pipeline +``` + +## 3.1) Configuration: UPPMAX +By default, the pipeline is configured to run on the [Swedish UPPMAX](https://www.uppmax.uu.se/) cluster (`milou` / `irma`). As such, you shouldn't need to add any custom configuration - everything _should_ work out of the box. + +Note that you will need to specify your UPPMAX project ID when running a pipeline. To do this, use the command line flag `--project `. The pipeline will exit with an error message if you try to run it pipeline with the default UPPMAX config profile without a project. + +**Optional Extra:** To avoid having to specify your project every time you run Nextflow, you can add it to your personal Nextflow config file instead. Add this line to `~/.nextflow/config`: + +```groovy +params.project = 'project_ID' // eg. b2017123 +``` + +## 3.2) Configuration: Other clusters +It is entirely possible to run this pipeline on other clusters, though you will need to set up your own config file so that the script knows where to find your reference files and how your cluster works. + +> If you think that there are other people using the pipeline who would benefit from your configuration (eg. other common cluster setups), please let us know. We can add a new configuration and profile which can used by specifying `-profile ` when running the pipeline. + +If you are the only person to be running this pipeline, you can create your config file as `~/.nextflow/config` and it will be applied every time you run Nextflow. Alternatively, save the file anywhere and reference it when running the pipeline with `-c path/to/config`. + +An empty configuration comes with the pipeline, which should be applied by using the command line flag `-profile none`. This prevents the UPPMAX defaults (above) from being applied and means that you only need to configure the specifics for your system. + +### Cluster Environment +By default, Nextflow uses the `local` executor - in other words, all jobs are run in the login session. If you're using a simple server, this may be fine. If you're using a compute cluster, this is bad as all jobs will run on the head node. + +To specify your cluster environment, add the following line to your config file: + +```groovy +process { + executor = 'YOUR_SYSTEM_TYPE' +} +``` + +Many different cluster types are supported by Nextflow. For more information, please see the [Nextflow documentation](https://www.nextflow.io/docs/latest/executor.html). + +Note that you may need to specify cluster options, such as a project or queue. To do so, use the `clusterOptions` config option: + +```groovy +process { + executor = 'SLURM' + clusterOptions = '-A myproject' +} +``` + +### Reference Genomes +The NGI-MethylSeq pipeline needs a reference genome for alignment and annotation. If not already available, start by downloading the relevant reference, for example from [illumina iGenomes](https://support.illumina.com/sequencing/sequencing_software/igenome.html). + +> NB: The below paragraph is a lie. You currently need a Bismark reference. Integrated builds from Fasta files coming soon... + +The minimal requirements are a FASTA file. If a Bismark reference is specified, the pipeline won't have to generate it and will be finished quite a bit faster. Use the command line option `--saveReference` to keep the generated references so that they can be added to your config and used again in the future. + +A reference genome path can be specified on the command line each time you run with `--bismark_index` or `--fasta`. Alternatively, add the paths to the config under a relevant id and just specify this id with `--genome ID` when you run the pipeline _(this can also be set as a default in your config)_: + +```groovy +params { + genomes { + 'YOUR-ID' { + bismar = '/BismarkIndex' + fasta = '/genome.fa' + } + 'OTHER-GENOME' { + // [..] + } + } + // Optional - default genome. Ignored if --genome 'OTHER-GENOME' specified on command line + genome = 'YOUR-ID' +} +``` + + +### Software Requirements +To run the pipeline, several software packages are required. How you satisfy these requirements is essentially up to you and depends on your system. + +#### Environment Modules +If your cluster uses environment modules, the software may already be available. If so, just add lines to your custom config file as follows _(customise module names and versions as appropriate)_: + +```groovy +process { + // Main Bismark Pipeline + $fastqc.module = ['FastQC'] + $trim_galore.module = ['TrimGalore'] + $bismark_align.module = ['samtools/1.3', 'bismark'] + $bismark_deduplicate.module = ['samtools/1.3', 'bismark'] + $bismark_methXtract.module = ['samtools/1.3', 'bismark'] + $bismark_report.module = ['bismark'] + $bismark_summary.module = ['bismark'] + $qualimap.module = ['samtools/1.3', 'QualiMap'] + $multiqc.module = ['MultiQC'] + + // Extras for BWA-meth Pipeline + $makeBwaMemIndex.module = ['bwa', 'bwa-meth', 'samtools/1.3'] + $makeFastaIndex.module = ['samtools/1.3'] + $bwamem_align.module = ['bwa', 'bwa-meth', 'samtools/1.3'] + $samtools_flagstat.module = ['samtools/1.3'] + $samtools_sort.module = ['samtools/1.3'] + $samtools_index.module = ['samtools/1.3'] + $markDuplicates.module = ['picard/2.0.1'] + $methyldackel.module = ['MethylDackel'] +} +``` + +#### Manual Installation +If the software is not already available, you will need to install it. + +If you are able to use [Docker](https://www.docker.com/), you can use the [sclifelab/ngi-methylseq](https://hub.docker.com/r/scilifelab/ngi-methylseq/) image which comes with all requirements. This is pulled by Nextflow automatically if you use `-profile docker` (see below for [further instructions](#33-configuration-docker)). + +We recommend using [Bioconda](https://bioconda.github.io/) to install the required software as the process is quite easy in our experience. + +## 3.3) Configuration: Docker +Docker is a great way to run NGI-MethylSeq, as it manages all software installations and allows the pipeline to be run in an identical software environment across a range of systems. + +Nextflow has [excellent integration](https://www.nextflow.io/docs/latest/docker.html) with Docker, and beyond installing the two tools, not much else is required. + +First, install docker on your system : [Docker Installation Instructions](https://docs.docker.com/engine/installation/) + +Then, simply run the analysis pipeline: +```bash +nextflow run SciLifeLab/NGI-MethylSeq -profile docker # rest of normal launch command +``` + +Nextflow will recognise `SciLifeLab/NGI-MethylSeq` and download the pipeline from GitHub. The `-profile docker` configuration lists the [sclifelab/ngi-methylseq](https://hub.docker.com/r/scilifelab/ngi-methylseq/) image that we have created and is hosted at dockerhub, and this is downloaded. + +A reference genome is still required by the pipeline. Specifying a path to a FASTA file is the minimum requirement, a Bismark reference will automatically be generated. See the above [Reference Genomes](#reference-genomes) documentation for instructions on how to configure Nextflow with preset paths to make this easier. + +A test suite for docker comes with the pipeline, and can be run by moving to the [`tests` directory](https://github.com/ewels/NGI-MethylSeq/tree/master/tests) and running `./docker_test.sh`. This will download a small lambda genome and some data, and attempt to run the pipeline through docker on that small dataset. This is automatically run using [Travis](https://travis-ci.org/SciLifeLab/NGI-MethylSeq/) whenever changes are made to the pipeline. + +## 3.4) Configuration: Amazon EC2 +There are multiple ways of running this pipeline over Amazon's EC2 service. Please see the [NGI-RNAseq pipeline docs](https://github.com/SciLifeLab/NGI-RNAseq/blob/master/docs/amazon_web_services.md) for more information. + +--- + +[![SciLifeLab](images/SciLifeLab_logo.png)](http://www.scilifelab.se/) +[![National Genomics Infrastructure](images/NGI_logo.png)](https://ngisweden.scilifelab.se/) + +--- \ No newline at end of file diff --git a/docs/output.md b/docs/output.md new file mode 100644 index 0000000..51fe86f --- /dev/null +++ b/docs/output.md @@ -0,0 +1,150 @@ +# NGI-MethylSeq Output + +NGI-MethylSeq is the new RNA-seq Best Practice pipeline used by the [National Genomics Infrastructure](https://ngisweden.scilifelab.se/) at [SciLifeLab](https://www.scilifelab.se/platforms/ngi/) in Stockholm, Sweden. + +This document describes the output produced by the pipeline. Most of the plots are taken from the MultiQC report, which summarises results at the end of the pipeline. + +Note that NGI-MethylSeq contains two workflows - one for Bismark, one for bwa-meth. This document describes the output from the default Bismark workflow. + +## Pipeline overview +The pipeline is built using [Nextflow](https://www.nextflow.io/) +and processes data using the following steps: + +* [FastQC](#fastqc) - read quality control +* [TrimGalore](#trimgalore) - adapter trimming +* [Bismark](#bismark) + * [Alignment](#alignment) - aligning reads to reference genome + * [Deduplication](#deduplication) - deduplicating reads + * [Methylation Extraction](#methylation-extraction) - calling cytosine methylation steps + * [Report](#report) - single-sample analysis report + * [Summary](#summary) - multi-sample analysis summary report +* [Qualimap](#qualimap) - QC package for genome alignments +* [MultiQC](#multiqc) - aggregate report, describing results of the whole pipeline + +## FastQC +[FastQC](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/) gives general quality metrics about your reads. It provides information about the quality score distribution across your reads, the per base sequence content (%T/A/G/C). You get information about adapter contamination and other overrepresented sequences. + +For further reading and documentation see the [FastQC help](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/). + +> **NB:** The FastQC plots displayed in the MultiQC report shows _untrimmed_ reads. They may contain adapter sequence and potentially regions with low quality. To see how your reads look after trimming, look at the FastQC reports in the `trim_galore` directory. + +**Output directory: `results/fastqc`** + +* `sample_fastqc.html` + * FastQC report, containing quality metrics for your untrimmed raw fastq files +* `sample_fastqc.zip` + * zip file containing the FastQC report, tab-delimited data file and plot images + +## TrimGalore +The NGI-MethylSeq BP 2.0 pipeline uses [TrimGalore](http://www.bioinformatics.babraham.ac.uk/projects/trim_galore/) for removal of adapter contamination and trimming of low quality regions. TrimGalore uses [Cutadapt](https://github.com/marcelm/cutadapt) for adapter trimming and runs FastQC after it finishes. + +MultiQC reports the percentage of bases removed by TrimGalore in the _General Statistics_ table, along with a line plot showing where reads were trimmed. + +**Output directory: `results/trim_galore`** + +Contains FastQ files with quality and adapter trimmed reads for each sample, along with a log file describing the trimming. + +* `sample_val_1.fq.gz`, `sample_val_2.fq.gz` + * Trimmed FastQ data, reads 1 and 2. + * NB: Only saved if `--saveTrimmed` has been specified. +* `logs/sample_val_1.fq.gz_trimming_report.txt` + * Trimming report (describes which parameters that were used) +* `FastQC/sample_val_1_fastqc.zip` + * FastQC report for trimmed reads + +Single-end data will have slightly different file names and only one FastQ file per sample. + +## Bismark +The default NGI-MethylSeq workflow uses [Bismark](http://www.bioinformatics.babraham.ac.uk/projects/bismark/) to process raw sequencing reads into cytosine methylation calls. + +### Alignment +Bismark converts all Cytosines to Thymine _in-silico_ and then aligns against a three-letter reference genome. It produces a BAM file of genomic alignments. + +**Output directory: `results/bismark_alignments/`** + +* `sample.bam` + * Aligned reads in BAM format. + * Only saved if `--saveAlignedIntermediates`, `--nodedup` or `--rrbs` is specified when running the pipeline. +* `logs/sample_PE_report.txt` + * Log file giving summary statistics about alignment. +* `unmapped/unmapped_reads_1.fq.gz`, `unmapped/unmapped_reads_2.fq.gz` + * Unmapped reads in FastQ format. + * Only saved if `--unmapped` specified when running the pipeline. + +### Deduplication +This step removes alignments with identical mapping position to avoid technical duplication in the results. Note that it is skipped if `--saveAlignedIntermediates`, `--nodedup` or `--rrbs` is specified when running the pipeline. + +**Output directory: `results/bismark_deduplicated/`** + +* `deduplicated.bam` + * BAM file with only unique alignments. +* `logs/deduplication_report.txt` + * Log file giving summary statistics about deduplication. + +### Methylation Extraction +The bismark methylation extractor tool takes a BAM file aligned with Bismark and generates files containing methylation information about cytosines. It produces a few different output formats, described below. + +Note that the output may vary a little depending on whether you specify `--comprehensive` or `--non_directional` when running the pipeline. + +Filename abbreviations stand for the following reference alignment strands: + +* `OT` - original top strand +* `OB` - original bottom strand +* `CTOT` - complementary to original top strand +* `CTOB` - complementary to original bottom strand + +(`CTOT` and `CTOB` are not aligned unless `--non_directional` specified). + +**Output directory: `results/bismark_methylation_calls/`** + +* `methylation_calls/XXX_context_sample.txt.gz` + * Individual methylation calls, sorted into files according to cytosine context. +* `methylation_coverage/sample.bismark.cov.gz` + * Coverage text file summarising cytosine methylation values. +* `bedGraph/sample.bedGraph.gz` + * Methylation statuses in [bedGraph](http://genome.ucsc.edu/goldenPath/help/bedgraph.html) format, with 0-based genomic start and 1- based end coordinates. +* `m-bias/sample.M-bias.txt` + * QC data showing methylation bias across read lengths. See the [bismark documentation](https://rawgit.com/FelixKrueger/Bismark/master/Docs/Bismark_User_Guide.html#m-bias-plot) for more information. +* `logs/sample_splitting_report.txt` + * Log file giving summary statistics about methylation extraction. + +### Report +Bismark generates a HTML report describing results for each sample. + +**Output directory: `results/bismark_reports`** + +### Summary +Bismark generates summary text and HTML reports giving an overview of results for all samples in a project. + +**Output directory: `results/bismark_summary`** + +## Qualimap + +[Qualimap BamQC](http://qualimap.bioinfo.cipf.es/doc_html/analysis.html#bam-qc) is a general-use quality-control tool that generates a number of statistics about aligned BAM files. It's not specific to bisulfite data, but it produces several useful stats - for example, insert size and coverage statistics. + +**Output directory: `results/qualimap`** + +* `sample/qualimapReport.html` + * Qualimap HTML report +* `sample/genome_results.txt`, `sample/raw_data_qualimapReport/*.txt` + * Text-based statistics that can be loaded into downstream programs + + +## MultiQC +[MultiQC](http://multiqc.info) is a visualisation tool that generates a single HTML report summarising all samples in your project. Most of the pipeline QC results are visualised in the report and further statistics are available in within the report data directory. + +**Output directory: `results/multiqc`** + +* `Project_multiqc_report.html` + * MultiQC report - a standalone HTML file that can be viewed in your web browser +* `Project_multiqc_data/` + * Directory containing parsed statistics from the different tools used in the pipeline + +For more information about how to use MultiQC reports, see http://multiqc.info + +--- + +[![SciLifeLab](images/SciLifeLab_logo.png)](http://www.scilifelab.se/) +[![National Genomics Infrastructure](images/NGI_logo.png)](https://ngisweden.scilifelab.se/) + +--- diff --git a/docs/usage.md b/docs/usage.md new file mode 100644 index 0000000..b200f96 --- /dev/null +++ b/docs/usage.md @@ -0,0 +1,198 @@ +# NGI-MethylSeq Usage + +## Bismark and bwa-meth workflow + +The NGI-MethylSeq package is actually two pipelines in one. The default (and most polished) workflow uses [Bismark](http://www.bioinformatics.babraham.ac.uk/projects/bismark/) for processing and can be found in `bismark.nf`. Unless specified otherwise, NGI-MethylSeq will run this pipeline. + +The second included workflow uses [BWA-Meth](https://github.com/brentp/bwa-meth) and [MethylDackyl](https://github.com/dpryan79/methyldackel) instead of Bismark and can be found in `bwa-meth.nf`. To run this workflow, you must download the repository files and explicitly call `nextflow run /path/to/files/bwa-meth.nf`. Note that this workflow has not been tested to the same extent and may have bugs. + +All of the documentation refers to the Bismark workflow at this stage, though most of it also applies to the bwa-meth workflow. + +## Running the pipeline +The typical command for running the pipeline is as follows: + +```bash +nextflow run SciLifeLab/NGI-MethylSeq --genome GRCh37 --reads '*_R{1,2}.fastq.gz' +``` + +Note that the pipeline will create files in your working directory: + +```bash +work # Directory containing the nextflow working files +results # Finished results (configurable, see below) +.nextflow_log # Log file from Nextflow +# Other nextflow hidden files, eg. history of pipeline runs and old logs. +``` + +## Input Data + +### `--reads` +Location of the input FastQ files: + +```bash + --reads 'path/to/data/sample_*_{1,2}.fastq' +``` + +**NB: Must be enclosed in quotes!** + +Note that the `{1,2}` parentheses are required to specify paired end data. The file path should be in quotation marks to prevent shell glob expansion. + +If left unspecified, the pipeline will assume that the data is in a directory called `data` in the working directory (`data/*{1,2}.fastq.gz`). + +### `--singleEnd` +By default, the pipeline expects paired-end data. If you have single-end data, specify `--singleEnd` on the command line when you launch the pipeline. A normal glob pattern, enclosed in quotation marks, can then be used for `--reads`. For example: `--singleEnd --reads '*.fastq'` + +It is not possible to run a mixture of single-end and paired-end files in one run. + +## Reference Genomes + +### `--genome` +The reference genome to use for the analysis, needs to be one of the genome specified in the config file. This is `False` by default and needs to be specified (unless index files are supplied, see below). + +See [`conf/uppmax.config`](conf/uppmax.config) for a list of the supported reference genomes and their keys. Common genomes that are supported are: + +* Human + * `--genome GRCh37` +* Mouse + * `--genome GRCm38` +* _Drosophila_ + * `--genome BDGP6` +* _S. cerevisiae_ + * `--genome 'R64-1-1'` + +> There are numerous others - check the config file for more. + +If you're not running on UPPMAX (the default profile), you can create your own config file with paths to your reference genomes. See the [Nextflow documentation](https://www.nextflow.io/docs/latest/config.html) for instructions on where to add this. + +The syntax for this reference configuration is as follows: + +```groovy +params { + genomes { + 'GRCh37' { + bismark = '' + fasta = '' // Used if no bismark index given + } + // Any number of additional genomes, key is used with --genome + } +} +``` + +### `--bismark_index`, `--fasta` +If you prefer, you can specify the full path to your reference genome when you run the pipeline: + +```bash +--bismark_index '[path to Bismark index]' +--fasta '[path to Fasta reference]' +``` + +### `--saveReference` +Supply this parameter to save any generated reference genome files to your results folder. These can then be used for future pipeline runs, reducing processing times. + +## Adapter Trimming +The pipeline accepts a number of parameters to change how the trimming is done, according to your data type. +The following settings are set for these command line flags: + +| Parameter | 5' R1 Trim | 5' R2 Trim | 3' R1 Trim | 3' R2 Trim | +|-----------------|------------|------------|------------|------------| +| `--pbat` | 6 | 9 | 6 | 9 | +| `--single_cell` | 6 | 6 | 6 | 6 | +| `--epignome` | 8 | 8 | 8 | 8 | +| `--accel` | 10 | 15 | 10 | 10 | +| `--zymo` | 10 | 15 | 10 | 10 | +| `--cegx` | 6 | 6 | 2 | 2 | + +You can specify custom trimming parameters as follows: + +* `--clip_r1 ` + * Instructs Trim Galore to remove bp from the 5' end of read 1 (or single-end reads). +* `--clip_r2 ` + * Instructs Trim Galore to remove bp from the 5' end of read 2 (paired-end reads only). +* `--three_prime_clip_r1 ` + * Instructs Trim Galore to remove bp from the 3' end of read 1 _AFTER_ adapter/quality trimming has been +* `--three_prime_clip_r2 ` + * Instructs Trim Galore to re move bp from the 3' end of read 2 _AFTER_ adapter/quality trimming has been performed. + +### `--rrbs` +Specifying `--rrbs` will pass on the `--rrbs` parameter to TrimGalore! See the [TrimGalore! documentation](https://github.com/FelixKrueger/TrimGalore/blob/master/Docs/Trim_Galore_User_Guide.md#rrbs-specific-options-mspi-digested-material) to read more about the effects of this option. + +### `--notrim` +Specifying `--notrim` will skip the adapter trimming step. Use this if your input FastQ files have already been trimmed outside of the workflow. + +### `--saveTrimmed` +By default, trimmed FastQ files will not be saved to the results directory. Specify this flag (or set to true in your config file) to copy these files to the results directory when complete. + + +## Bismark Parameters + +### `--pbat` +Using the `--pbat` parameter will affect the trimming (see above) and also set the `--pbat` flag when aligning with Bismark. It tells Bismark to align complementary strands (the opposite of `--directional`). + +### `--non_directional` +By default, Bismark assumes that libraries are directional and does not align against complementary strands. If your library prep was not direcional, use `--non_directional` to align against all four possible strands. + +Note that the `--single_cell` and `--zymo` parameters both set the `--non_directional` workflow flag automatically. + +### `--comprehensive` +This flag instructs the Bismark methylation extractor to use the `--comprehensive` and `--merge_non_CpG` flags. This produces coverage files with information from about all strands and cytosine contexts merged into two files - one for CpG context and one for non-CpG context. Note that for large genomes (eg. Human), these can be massive files. This is only recommended for small genomes (especially those that don't exhibit strong CpG context methylation specificity). + +### `--relaxMismatches` and `--numMismatches` + +By default, Bismark is pretty strict about which alignments it accepts as valid. If you have good reason to believe that your reads will contain more mismatches than normal, these flags can be used to relax the stringency that Bismark uses when accepting alignments. This can greatly improve the number of aligned reads you get back, but may negatively impact the quality of your data. + +`--numMismatches` is `0.2` by default in Bismark, or `0.6` if `--relaxMismatches` is specified. `0.6` will allow a penalty of `bp * -0.6` - for 100bp reads, this is `-60`. Mismatches cost `-6`, gap opening `-5` and gap extension `-2`. So, `-60` would allow 10 mismatches or ~ 8 x 1-2bp indels. + +### `--unmapped` +Use the `--unmapped` flag to set the `--unmapped` flag with Bismark align and save the unmapped reads to FastQ files. + +### `--nodedup` +By default, the pipeline includes a deduplication step after alignment. Use `--nodedup` on the command line to skip this step. This is automatically set if using `--rrbs` for the workflow. + +### `--saveAlignedIntermediates` +By default intermediate BAM files will not be saved. The final BAM files created after the Bismark deduplication step are always saved. Set to true to also copy out BAM files from the initial Bismark alignment step. If `--nodedup` or `--rrbs` is specified then BAMs from the initial alignment will always be saved. + +## Job Resources +### Automatic resubmission +Each step in the pipeline has a default set of requirements for number of CPUs, memory and time. For most of the steps in the pipeline, if the job exits on UPPMAX with an error code of `143` (exceeded requested resources) it will automatically resubmit with higher requests (2 x original, then 3 x original). If it still fails after three times then the pipeline is stopped. + +### Custom resource requests +Wherever process-specific requirements are set in the pipeline, the default value can be changed by creating a custom config file. See the files in [`conf`](../conf) for examples. + +## Other command line parameters +### `--outdir` +The output directory where the results will be saved. + +### `--email` +Set this parameter to your e-mail address to get a summary e-mail with details of the run sent to you when the workflow exits. If set in your user config file (`~/.nextflow/config`) then you don't need to speicfy this on the command line for every run. + +### `--clusterOptions` +Submit arbitrary SLURM options (UPPMAX profile only). For instance, you could use `--clusterOptions '-p devcore'` +to run on the development node (though won't work with default process time requests). + +### `--multiqc_config` +If you would like to supply a custom config file to MultiQC, you can specify a path with `--multiqc_config`. This is used instead of the config file specific to the pipeline. + +### `-name` +Name for the pipeline run. If not specified, Nextflow will automatically generate a random mnemonic. + +**NB:** Single hyphen (core Nextflow option) + +### `-c` +Specify the path to a specific config file (this is a core NextFlow command). Useful if using different UPPMAX +projects or different sets of reference genomes. + +**NB:** Single hyphen (core Nextflow option) + +Note - you can use this to override defaults. For example, we run on UPPMAX but don't want to use the MultiQC +environment module as is the default. So we specify a config file using `-c` that contains the following: + +```groovy +process.$multiqc.module = [] +``` + +--- + +[![SciLifeLab](images/SciLifeLab_logo.png)](http://www.scilifelab.se/) +[![National Genomics Infrastructure](images/NGI_logo.png)](https://ngisweden.scilifelab.se/) + +---