rnaSeq-byBABS

A Nextflow pipeline script for processing RNA-Seq data.

by BABS - The Bioinformatics & Biostatistics Group @ The Francis Crick Institute

nextflow: http://www.nextflow.io nextflow-quickstart: http://www.nextflow.io/docs/latest/getstarted.html#get-started

Overview

To run an rnaSeq-byBABS analysis you need to complete the following steps. These are explained in more detail further down.

Load Nextflow module.
Obtain the rnaSeq-byBABS files you need.
Specify analysis parameters by completing the parameters.yml file.
Create a sample design file and comparisons contrasts file.
Run nextflow pipeline.

The test data

You can test the pipeline by using the provided simulated test dataset. Running the pipeline on it should take no more than 15 minutes. In the following, this dataset will be used as an example to illustrate how you can parametrise the pipeline for a real experiment.

Loading the Nextflow Module

To run Nextflow you will need to first load the Nextflow module. This can be done using the following command:

$ module load nextflow/0.27.2

Obtaining the Files You Need

$ git clone https://github.com/crickbabs/rnaSeq_byBABS
$ cd rnaSeq_byBABS

These files should be in the current directory:

File	Description
`README.md`	this file
`rnaSeq-byBABS.nf`	the nextflow script itself
`fastq/single_end/`	the directory containing simulated single-end `fastq.gz` samples
`fastq/paired_end/`	the directory containing simulated paired-end `fastq.gz` samples
`yml/single_end.yml`	the example parameter file for the simulated single-end samples
`yml/paired_end.yml`	the example parameter file for the simulated paired-end samples
`yml/modules.yml`	the configuration file containing the slurm modules to load
`yml/multiqc_config.yml`	the example MultiQC configuration file
`r/analysis.r`	the example R script which runs the statistical analysis
`r/analysis_example.r`	the example R script which runs the statistical analysis for the simulated samples
`csv/single_end.csv`	the example design file for the simulated single-end samples
`csv/paired_end.csv`	the example design file for the simulated single-end samples
`conf/fastq_screen.conf`	the example FastQ Screen configuration file
`png/`	the directory containing the illustrations for this page

This is currently configured to run the rnaSeq-byBABS RNA-Seq analysis pipeline on The Crick CAMP HPC system. If you want to test the pipeline immediately, the fastest way is to run it on the test data with:

$ sh sh/single_end.sh

, or with:

$ sh sh/paired_end.sh

Specifying Samples via a Design File

The design file has to be a csv (not tsv) file, because in the script there are regular expressions which need the comma to match. The file must contain a column named sample which associates a name of your choice to each sample. The file must also contain the location of the fastq file for each sample. These columns must be named file for a single-end experiment, and file1 and file2 for a paired-end experiment. All the other columns contain the different conditions of the experiment (treatment, time...). If you are not providing experimental condition columns, the R script running the differential gene expression analysis should detect it and stop. For example, the design file for the single end test data contains:

sample,file,treatment
control1,fastq/single_end/sample1.fastq.gz,untreated
control2,fastq/single_end/sample2.fastq.gz,untreated
treatment2,fastq/single_end/sample5.fastq.gz,treated

And the one for the paired end test data contains:

sample,file1,file2,treatment
control2,fastq/paired_end/sample2_r1.fastq.gz,fastq/paired_end/sample2_r2.fastq.gz,untreated
control3,fastq/paired_end/sample3_r1.fastq.gz,fastq/paired_end/sample3_r2.fastq.gz,untreated
treatment3,fastq/paired_end/sample6_r1.fastq.gz,fastq/paired_end/sample6_r2.fastq.gz,treated

Specifying Analysis Parameters

All the parameters except multiqc_config are mandatory.

Parameter	Description	Required by
`design`	the path of the `csv` design file	every processes because the sample column values will be used to name the output files, the experimental condition columns values will be used by DESeq2
`output_directory`	the path to the directory that will contain all the files produced by the workflow	every processes
`binomial_nomenclature`	the organism in which the experiment has been done	STAR genome alignment, DESeq2
`strandedness`	the strandedness of the reads (`none`, `forward`, `reverse`)	STAR
`read_length`	the read length	STAR, RSeQC mismatch profile
`single_end`	the experiment is considered as single end if this flag is present and as paired end if this flag is absent	cutadapt, STAR, RNA-SeQC
`modules`	the path of the `yml` module file that contains the list of the SLURM modules	every processes (each process needs to load its own modules)
`genome_type`	the genome type of the organism in which the experiment has been done (`ensembl` most of time). This parameter is required to build the path of the STAR genome index	STAR
`genome_version`	the genome version of the organism in which the experiment has been done (`36` or `38` most of time). This parameter is required to build the path of the STAR genome index	STAR
`genome_release`	the genome release of the organism in which the experiment has been done (`86` or `89` most of time). This parameter is required to build the path of the STAR genome index	STAR
`genome_sequence_extension`	the genome sequence extension of the organism in which the experiment has been done (`primary_assembly.fa` or `toplevel.fa` most of time). This parameter is required to build the path of the raw sequence of the genome of the organism	Picard multimetric, RNA-SeQC
`fastq_screen_conf`	the path to the FastQ Screen configuration file which contains the paths of the bowtie2 indexes. The example file points to Harshil's folders at the moment	FastQ Screen
`publish_directory_mode`	define if Nextflow will link, copy or move the output files (`symlink`, `link`, `copy`, `move`)	Nextflow
`publish_directory_overwrite`	define if Nexflow will overwrite the output files or not (`true`, `false`)	Nextflow
`r_script`	the path to the R script that will perform the statistical analysis	DESeq2
`multiqc_config`	the path to the MultiQC configuration file	MultiQC

There are two ways to specify the analysis parameters : with a parameter yml file or via command line arguments.

With a parameter file

If you want to run the workflow with a parameter file the command is:

$ nextflow run main.nf -params-file yml/single_end.yml

or:

$ nextflow run main.nf -params-file yml/paired_end.yml

With command line arguments

The native nextflow's command line arguments are preceded by -, for example -params-file. However, the nextflow command accepts any parameter if it is preceded by --. For example, the read_length parameters can be passed to nextflow like this:

$ nextflow run main.nf --read_length 75

Thus, in this way, all the parameters of the pipeline can be passed via a shell script. You can find an example of this for the test data : sh/single_end.sh and sh/paired_end.sh.

WARNING!

Theoretically, the two methods can be mixed in order to pass the arguments in both ways at the same time, or in order to override the value of some arguments. However, this has not been tested yet and it is recommended to use either the parameter file or the command line.

The Analysis Pipeline Flow

cutadapt

Cutadapt removes adapter and poor quality sequence from the 3' end of reads. The maximum read error rate is set to 10% (-e 0.1), the quality cutoff is set at 10 (-q 10), the minimum acceptable read length after trimming is 20 (-m 20) and the minimum adapter overlap is 1 (-O 1).

STAR

The STAR aligner is run via the rsem-calculate-expression command of the RSEM package.

Genome alignment

The reads are aligned on the reference genome of the organism specified in the parameters.

Contaminants alignment

The reads are aligned on the reference sequences of well known molecular biology contaminants.

WARNING: this process is as ignore which means that the pipeline will continue even if it fails.

DESeq2

This process run a statistical analysis for the differentially expressed genes with DESeq2 package.

The process will call the R script specified by the r_script parameter. Feel to write your own script, but two scripts are provided with the pipeline in the r directory.

The r/analysis_example.r script is provided as a basic example for the test dataset. In addition to do a differential expression analysis with DESeq2, this script will create a PCA plot:

,and will do a quick ontology analysis:

The r/analysis.r script is a far more general one, which is aimed to be applied on any kind of experiment as a first quality control. Guys, your opinion about it will be very much appreciated.

WARNING: this process is as ignore which means that the pipeline will continue even if it fails.

SAMtools

In this workflow, SAMtools is only used in order to sort and index BAM files.

FastQ Screen

Assess the library composition by aligning the reads on various reference genomes specified in the FastQ Screen configuration file.

Picard

The workflow is using five tools of the Picard suite in order to collect several metrics.

Complexity

Call the EstimateLibraryComplexity Picard tool.

Duplicates

Call the MarkDuplicates Picard tool.

Group

Call the AddOrReplaceReadGroups Picard tool.

Multimetrics

Call the CollectMultipleMetrics Picard tool.

RNA-Seq metrics

Call the CollectRnaSeqMetrics Picard tool.

RseQC

The workflow is using 6 scripts the RSeQC package.

Infer experiment

Call the infer_experiment.py RSeQC script.

Junction annotation

Call the junction_annotation.py RSeQC script.

Junction saturation

Call the [junction_saturation.py][url_rseqc_junction_saturation] RSeQC script.

Mismatch profile

Call the mismatch_profile.py RSeQC script.

Read distribution

Call the read_distribution.py RSeQC script.

Transcript integrity number

Call the tin.py RSeQC script.

RNA-SeQC

RNA-SeQC computes various metrics among which the most interesting are the gap length and the mean coverage.

MultiQC

Summary of all the quality control metrics with MultiQC.

Troubleshooting

Sometimes there is an error of the type:

Cannot access folder: '/camp/stp/babs/working/<USER>/.../work/f8/810a2...'

I absolutely don't know why this happens. I just need to run nextflow again with the -resume option and everything is fine, so I suspect a connection problem. I do not have this kind of error with snakemake though. Need to find out what's happening.

Name		Name	Last commit message	Last commit date
Latest commit History 23 Commits
conf		conf
csv		csv
fastq		fastq
png		png
r		r
sh		sh
yml		yml
.gitattributes		.gitattributes
README.md		README.md
main.nf		main.nf

crickbabs/rnaSeq_byBABS

Folders and files

Latest commit

History

Repository files navigation