Skip to content

Bulk RNA Seq tutorial

Laurent Jourdren edited this page Oct 8, 2019 · 4 revisions

This tutorial will go through an end to end analysis for RNA-Seq analysis using Eoulsan. Before you start, you will need an installed copy of Eoulsan (see the installation guide).

This tutorial will use an example mouse RNA-Seq dataset generated by an Illumina NextSeq 500 sequencer.

Warning: You need about 48 GB of RAM to perform this tutorial as we use STAR to map read on mouse genome. The analysis need about 2 hours to perform. The mouse STAR index creation, is the longuest step of the analysis but thanks to Eoulsan repository feature, this index will be only computed once.

Outline of the process:

Step Module used Inputs Output
Create a genome index for STAR starindexgenerator FASTA ZIP
Quality control of the raw FASTQ files with FastQC fastqc FASTQ HTML
Filter raw FASTQ files filterreads FASTQ FASTQ
Map the reads using STAR mapreads FASTQ SAM
Remove unmap and multimatches SAM entries filtersam SAM SAM
Convert SAM files into sorted and indexed BAM files sam2bam SAM BAM
Convert BAM files int BedGraph files bam2bedgraph BAM BedGraph
Convert BAM files int BED files splice2bed BAM BED
Count alignments with HTSeq-count expression SAM TSV
Convert expression files into XLSX files expressionresultsannotation TSV XLSX
Quality control of the analysis with MultiQC multiqc FASTQ HTML
Normalisation and differential analysis with DESeq 2 normalization TSV TSV + graphics
Convert differential analysis output files into XLSX files expressionresultsannotation TSV XLSX

TL;DR

# Step 1: get data (Raw data, genome sequence and annotations)
$ wget http://outils.genomique.biologie.ens.fr/leburon/downloads/eoulsan-samples/rnaseq-illumina/2013_0267_S1_L001_R1_001.fastq.bz2 && \
  wget http://outils.genomique.biologie.ens.fr/leburon/downloads/eoulsan-samples/rnaseq-illumina/2013_0268_S2_L001_R1_001.fastq.bz2 && \
  wget http://outils.genomique.biologie.ens.fr/leburon/downloads/eoulsan-samples/rnaseq-illumina/2013_0269_S3_L001_R1_001.fastq.bz2 && \
  wget http://outils.genomique.biologie.ens.fr/leburon/downloads/eoulsan-samples/rnaseq-illumina/2013_0270_S4_L001_R1_001.fastq.bz2 && \
  wget http://outils.genomique.biologie.ens.fr/leburon/downloads/eoulsan-samples/rnaseq-illumina/2013_0271_S5_L001_R1_001.fastq.bz2 && \
  wget http://outils.genomique.biologie.ens.fr/leburon/downloads/eoulsan-samples/rnaseq-illumina/2013_0272_S6_L001_R1_001.fastq.bz2 && \
  wget http://outils.genomique.biologie.ens.fr/leburon/downloads/eoulsan-samples/genomes/mm10ens91.fasta.bz2 && \
  wget http://outils.genomique.biologie.ens.fr/leburon/downloads/eoulsan-samples/genomes/mm10ens91.gff.bz2 && \
  wget http://outils.genomique.biologie.ens.fr/leburon/downloads/eoulsan-samples/genomes/mm10ens91.gtf.bz2 && \
  wget http://outils.genomique.biologie.ens.fr/leburon/downloads/eoulsan-samples/genomes/mm10ens91.tsv.bz2

# Step 2: create a mockup of the design file
$ eoulsan.sh createdesign *.fq.bz2 mm10ens91.fasta.bz2 mm10ens91.gff.bz2 mm10ens91.gtf.bz2 mm10ens91.tsv.bz2

# Step 3: edit the design file using a text editor or a spreadsheet
$ vim design.txt

# Step 3 bis: download the ready to use design file
$ wget https://github.com/GenomicParisCentre/eoulsan/wiki/files/design-rnaseq.txt && mv design-rnaseq.txt design.txt

# Step 4: download workflow file
$ wget https://github.com/GenomicParisCentre/eoulsan/wiki/files/workflow-rnaseq.xml

# Step 4: launch the analysis
$ eoulsan.sh exec workflow-rnaseq.xml design.txt

Step 1: Obtaining the data

We provide in this section some samples files to test Eoulsan. These files have been produced during a mouse RNA-Seq experiment sequenced on an Illumina GAIIx.

Step 2: create the design file

In an empty directory, copy the reads, genome and annotation files, then you can create a mockup of the design file with the next command:

$ eoulsan.sh createdesign *.fq.bz2 mm10ens91.fasta.bz2 mm10ens91.gff.bz2 mm10ens91.gtf.bz2 mm10ens91.tsv.bz2

You can now modify the design file to add additional information using a text editor (e.g. vim, nano, emacs...) or a spreadsheet (MS Excel, Libreoffice...). Note that Eoulsan handle automatically compressed files.

For this tutorial, we will use the following design file that can be downloaded:

[Header]
DesignFormatVersion=2
GenomeFile=mm10ens91.fasta.bz2
GffFile=mm10ens91.gff.bz2
GtfFile=mm10ens91.gtf.bz2
AdditionalAnnotationFile=mm10ens91.tsv.bz2

[Experiments]
Exp.exp1.name=demo-mouse-rnaseq

[Columns]
SampleId	SampleName	Reads	Date	FastqFormat	RepTechGroup	Exp.exp1.Condition	Exp.exp1.Reference	UUID
20130267	KO1	2013_0267_S1_L001_R1_001.fastq.bz2	2015-10-04	fastq-sanger	KO1	KO	0	7a392c56-3018-471f-8d39-8f1b03669c35
20130268	KO2	2013_0268_S2_L001_R1_001.fastq.bz2	2015-10-04	fastq-sanger	KO2	KO	0	68a87e5b-1022-4288-914e-eb3f13985a7f
20130269	KO3	2013_0269_S3_L001_R1_001.fastq.bz2	2015-10-04	fastq-sanger	KO3	KO	0	1dd4f0ab-6b31-41d5-93f6-69fd2a474697
20130270	WT1	2013_0270_S4_L001_R1_001.fastq.bz2	2015-10-04	fastq-sanger	WT1	WT	1	36995c69-2685-4f29-822c-b3ee5bd423b4
20130271	WT2	2013_0271_S5_L001_R1_001.fastq.bz2	2015-10-04	fastq-sanger	WT2	WT	1	7355ddc7-2426-48fc-aa1d-a88eefc8e5c2
20130272	WT3	2013_0272_S6_L001_R1_001.fastq.bz2	2015-10-04	fastq-sanger	WT3	WT	1	004b53eb-e77f-4267-bf1d-98ef21cd5119

The experiment defined in this design file use a simple design. The Eoulsan design file allow users to define complex models for DESeq2 and choose the comparisons to perform. Technical replicates (e.g. FASTQ files for the same sample in different lines of a run) must have the same name in the "RepTechGroup" column.

More information about the design file is available on the reference site of Eoulsan.

Step 3: create the workflow file

To create a workflow file, the best solution is to reuse an existing workflow file and adapt it to your needs. You can download an existing workflow to analyse this data with the following command:

$ wget http://outils.genomique.biologie.ens.fr/eoulsan2/workflow-local.xml

The workflow file contains the list of the steps that will be executed by Eoulsan. Each step have parameters and it is related to a module to execute. Some step modules are only available in local mode (like differential analysis) or in distributed mode, see the modules page for more details. For each step you can change, add or remove parameters. Parameters are specific to a module, consult the documentation of the built-in steps for the list of available parameters of each step.

At least, there is a global section in the workflow file that override the values of Eoulsan configuration file. This section is useful to set for example the path of the temporary directory to use.

Workflow step: Create a STAR index

Unlike other mapper (e.g. Bowtie/BWA), STAR creation index can be customized using parameters. This step allow to create a custom index for STAR. If you don't need a custom STAR index, you can remove this step from your workflow. The following configuration example will use a GFF3 annotation file to get the position of the exons of the transcripts of the reference genome.

        <!-- Create a custom STAR index using feature annotation -->
        <step skip="false">
            <module>starindexgenerator</module>
            <parameters>

                <!-- The overhang value must be greater than the length of the reads -->
                <parameter>
                    <name>overhang</name>
                    <value>100</value>
                </parameter>

                <!-- Use a feature annotation file -->
                <parameter>
                    <name>use.gtf.file</name>
                    <value>true</value>
                </parameter>

                <!-- The format of feature file is GFF -->
                <parameter>
                    <name>features.file.format</name>
                    <value>gff</value>
                </parameter>

                <!-- The name of the feature for exon is "exon" in the feature annotation file -->
                <parameter>
                    <name>gtf.feature.exon</name>
                    <value>exon</value>
                </parameter>

                <!-- For each exon feature we can get the transcript Id using the "Parent" tag -->
                <parameter>
                    <name>gtf.tag.exon.parent.transcript</name>
                    <value>Parent</value>
                </parameter>
            </parameters>
        </step>

Workflow step: launching FastQC on raw data

This step allow to launch FastQC on the raw data. To enable this step, just add the following lines in your workflow file. More parameters are available in the FastQC module, please see the FastQC module page in the reference manual for more informations.

        <!-- FastQC of raw reads -->
        <step id="rawfastqc" skip="false">
            <module>fastqc</module>
        </step>

Workflow step: remove adapters with Trim Galore!/cutadapt

This step allow to remove adapters from raw reads. This step is often useless for RNA-Seq, so we recommend to add this step to your worklfow only if FastQC has shown an adapter contamination as it takes a lot of time to perform.

        <!-- Remove adapters with Trim Galore!/cutadapt -->
        <step skip="false" discardoutput="asap">
            <module>trimadapt</module>
            <parameters>

                <!-- Trim low-quality ends from reads  -->
                <parameter>
                    <name>quality</name>
                    <value>20</value>
                </parameter>

                <!-- Maximum allowed error rate -->
                <parameter>
                    <name>error</name>
                    <value>0.1</value>
                </parameter>

                <!-- Overlap with adapter sequence required to trim a sequence -->
                <parameter>
                    <name>stringency</name>
                    <value>1</value>
                </parameter>

                <!-- Discard reads that became shorter than parameter value because of either quality or adapter trimming -->
                <parameter>
                    <name>length</name>
                    <value>20</value>
                </parameter>

                <!-- Paired-end data  -->
                <parameter>
                    <name>is.paired</name>
                    <value>no</value>
                </parameter>

            </parameters>
        </step>

Workflow step: filter the reads

This step allow to remove bad quality reads. The following configuration example will remove reads that do not pass Illumina filter, trim polyN reads and remove all the reads with a length lower than 40 and with a mean quality lower to 30. For more information about the filterreads module see the reference documentation.

        <!-- Filter reads -->
        <step skip="false" discardoutput="asap">
            <module>filterreads</module>
            <parameters>

                <!-- Remove all the read that do not pass Illumina QC -->
                <parameter>
                    <name>illuminaid</name>
                    <value></value>
                </parameter>

                <!-- Remove polyN tails of the reads -->
                <parameter>
                    <name>trimpolynend</name>
                    <value></value>
                </parameter>

                <!-- Remove reads with length lower than 40 -->
                <parameter>
                    <name>length.minimal.length.threshold</name>
                    <value>40</value>
                </parameter>

                <!-- Remove reads with mean QScore < 30 -->
                <parameter>
                    <name>quality.threshold</name>
                    <value>30</value>
                </parameter>
            </parameters>
        </step>

Workflow step: map the reads

This step allow to map FASTQ read. The following configuration example will use STAR with default parameters. STAR is only parameted to keep unmapped read in output SAM files. Eoulsan is bundled with many mappers (Bowtie, Bowtie2, BWA, Minimap, Gmap/Gsnap, STAR...). For more information about the mapreads module see the reference documentation. Note that with STAR or Minimap2, custom indexes can used. To create this custom indexes, you need to add a dedicated step (the starindexgenerator module for STAR and the minimap2indexgenerator module for Minimap2.

        <!-- Mapping of the reads -->
        <step skip="false" discardoutput="asap">
            <module>mapreads</module>
            <parameters>

                <!-- Use use STAR as mapper -->
                <parameter>
                    <name>mapper</name>
                    <value>star</value>
                </parameter>

                <!-- The version of STAR to use is 2.6.1b -->
                <parameter>
                    <name>mapper.version</name>
                    <value>2.6.1b</value>
                </parameter>

                <!-- We define here the additional STAR parameter (keep unmap read in SAM output) -->
                <parameter>
                    <name>mapper.arguments</name>
                    <value>--outSAMunmapped Within</value>
                </parameter>
            </parameters>
        </step>

Workflow step: filter the alignments

In this step, we will remove all SAM entries that are unmapped or multimatches. Like the filterreads module, this step is highly configurable with additional plugins. For more information about the filtersam module see the reference documentation.

        <!-- Filtering of the SAM alignments -->
        <step skip="false" discardoutput="asap">
            <module>filtersam</module>
            <parameters>

                <!-- Remove the unmap entries in the files -->
                <parameter>
                    <name>removeunmapped</name>
                    <value>true</value>
                </parameter>

                <!-- Remove the multimaches alignments -->
                <parameter>
                    <name>removemultimatches</name>
                    <value>true</value>
                </parameter>

            </parameters>
        </step>

Workflow (optional) step: convert the SAM files in BAM files

With this optional step, we will convert SAM files converted in the previous step in BAM files. The generated BAM files are sorted by coordinates. For each created BAM, an index file (BAI file) is also created. For more information about the sam2bam module see the reference documentation.

        <!-- Convert SAM to BAM file and sort the BAM file by coordinate -->
        <step skip="false">
            <module>sam2bam</module>
            <parameters>

                <!-- Compression level of the BAM file -->
                <parameter>
                    <name>compression.level</name>
                    <value>5</value>
                </parameter>

            </parameters>
        </step>

Workflow (optional) step: convert the BAM files in BedGraph

In this optional step, BAM files will be converted in BedGraph files. Note that this step require the bam-to-bedgraph.xml Galaxy tool wrapper that is available on the eoulsan-tool GitHub repository.

        <!-- Create Bedgraph files from the BAM files -->
        <step dataproduct="match" skip="false">
            <module>bam2bedgraph</module>
            <parameters/>
        </step>

Workflow (optional) step: convert the SAM files in BED files

With this optional step, positions of the alignments will be converted in BED 12 files. For more information about the splice2bed module see the reference documentation.

        <!-- Create Bed files from the SAM files -->
        <step dataproduct="match" skip="false">
            <module>splice2bed</module>
            <parameters/>
        </step>

Workflow step: compute the expression

This step allow to compute the expression of the gene/transcripts. Currently only the htseq-count counter is implemented in Eoulsan. Most of the option of this module are the standard options of HTSeq-count. For more information about this expressionmodule, see the HTSeq-count and module documentation.

        <!-- Compute expression -->
        <step skip="false">
            <module>expression</module>
            <parameters>

                <!-- We use the fast Java implementation of htseq-count to perform the counting -->
                <parameter>
                    <name>counter</name>
                    <value>htseq-count</value>
                </parameter>

                <!-- The output file file will be in TSV format -->
                <parameter>
                    <name>output.file.format</name>
                    <value>tsv</value>
                </parameter>

                <!-- The input feature annotation file is in GTF format -->
                <parameter>
                    <name>features.file.format</name>
                    <value>gtf</value>
                </parameter>

                <!-- The feature type to use is "exon", we count by exon -->
                <parameter>
                    <name>genomic.type</name>
                    <value>exon</value>
                </parameter>

                <!-- We aggregate the counts by genes -->
                <parameter>
                    <name>attribute.id</name>
                    <value>gene_id</value>
                </parameter>

                <!-- The library was oriented -->
                <parameter>
                    <name>stranded</name>
                    <value>yes</value>
                </parameter>

                <!-- The use the union mode as overlap mode -->
                <parameter>
                    <name>overlap.mode</name>
                    <value>union</value>
                </parameter>

                <!-- We do not remove ambiguous cases -->
                <parameter>
                    <name>remove.ambiguous.cases</name>
                    <value>false</value>
                </parameter>

            </parameters>
        </step>

Workflow step: convert expression files in XLSX or ODS

The expression files generated in the previous step are simple tabulated text file with only two columns: the identifiers of the feature and the counts. This files are often opened with spreadsheet software like Microsoft Excel. Unfortunately this tools can misinterpret the content of the file. That's why with this step the TSV files are converted in XLSX files (or ODS for LibreOffice). Moreover, this step add several column with useful information (e.g. gene name, gene description...) that is coming from the additional annotation file. For more information about this expressionresultsannotation module, see the module documentation.

        <!-- Convert expression files to XLSX files -->
        <step skip="false">
            <module>expressionresultsannotation</module>
            <parameters>

                <!-- Select the output of the step, ODS and TSV format are also supported -->
                <parameter>
                    <name>output.format</name>
                    <value>xlsx</value>
                </parameter>

            </parameters>
        </step>

Workflow step: MultiQC

MultiQC is tool that has been developed to aggregate all the FastQC reports of an analysis in one report. Then, it has been enhanced to support many other tools. With this step, we will generate a MultiQC report with data that come from the FastQC, mapping and expression steps. For more information about this multiqc module, see the module documentation.

        <!-- MultiQC -->
        <step skip="false">
            <module>multiqc</module>
            <parameters>

                <!-- MultiQC will contain reports from FastQC, STAR and HTSeq-count -->
                <parameter>
                    <name>reports</name>
                    <value>fastqc,mapreads,expression</value>
                </parameter>

                <!-- Docker is required to launch this step -->
                <parameter>
                    <name>use.docker</name>
                    <value>true</value>
                </parameter>

            </parameters>
        </step>

Workflow step: Normalization and differential analysis using DESeq 2

With this step, the expression data will be normalized and then the differential analysis will be performed. This module allow to use both standard and complex experimental design. The design file documentation, explain how to define the experimental design of your experiment. This module can only handle experimental design with biological replicates. If you don't have any biological replicates, please use instead the DESeq1 module. The r.execution.mode parameter allow to choose the method to use for launching R (local, RServe or Docker). We recommend you to use docker. For more information about this deseq2 module, see the module documentation.

Note: You must define the main.docker.uri parameter in your global section of your workflow file (or in your configuration file) to enable docker. Usually the value of this parameter id unix:///var/run/docker.sock.

        <!-- Normalization and differential analysis with DESeq 2 -->
        <step skip="false">
            <module>deseq2</module>
            <parameters>

                <!-- Generate normalization plots -->
                <parameter>
                    <name>norm.fig</name>
                    <value>true</value>
                    </parameter>

                <!-- Genearte differential analysis plots -->
                <parameter>
                    <name>diffana.fig</name>
                    <value>true</value>
                </parameter>

                <!-- Enable Normalization -->
                <parameter>
                    <name>norm.diffana</name>
                    <value>true</value>
                </parameter>

                <!-- Enable differential analysis -->
                <parameter>
                    <name>diffana</name>
                    <value>true</value>
                </parameter>

                <!-- Set the size factor type to "ratio" -->
                <parameter>
                    <name>size.factors.type</name>
                    <value>ratio</value>
                </parameter>

                <!-- Set the fit type to "parametric" -->
                <parameter>
                    <name>fit.type</name>
                    <value>parametric</value>
                </parameter>

                <!-- Use Wald as statistical test -->
                <parameter>
                    <name>statistical.test</name>
                    <value>Wald</value>
                </parameter>

                <!-- Use Docker to execute this step -->
                <parameter>
                    <name>r.execution.mode</name>
                    <value>docker</value>
                </parameter>

                <!-- Define the Docker image to use (optional) -->
                <parameter>
                    <name>docker.image</name>
                    <value>bioconductor/release_sequencing:3.1</value>
                </parameter>

            </parameters>
        </step>

Workflow step: convert files by DESeq2 in XLSX or ODS formats

The text files generated by the previous step are simple tabulated text files. This files are often opened with spreadsheet software like Microsoft Excel. Unfortunately this tools can misinterpret the content of the text file. That's why with this step the TSV files are converted in XLSX files (or ODS for LibreOffice). Moreover, this step add several column with useful information (e.g. gene name, gene description...) that is coming from the additional annotation file. For more information about this diffanaresultsannotation module, see the module documentation.

        <!-- Convert ourput of the differential analysis step to XLSX files -->
        <step skip="false">
            <module>diffanaresultsannotation</module>
            <parameters>

                <!-- Select the output of the step, ODS and TSV format are also supported -->
                <parameter>
                    <name>output.format</name>
                    <value>xlsx</value>
                </parameter>

            </parameters>
        </step>

Setting global configuration

This section of the workflow file define the global configuration of the analysis. For this analysis, path to additional data format, Galaxy tools and Docker connection URI are set.

Note: You can also define this global parameters in a configuration file like the default Eoulsan configuration file ~/.eoulsan, see the configuration file documentation for more information.

    <!-- This section contains globals parameters that can also be define in a configuration file. -->
    <globals>

        <!-- Define the location of the Galaxy tools/modules to use -->
        <parameter>
            <name>main.galaxy.tool.path</name>
            <value>https://raw.githubusercontent.com/GenomicParisCentre/eoulsan-tools/master/galaxytools</value>
        </parameter>

        <!-- Define the location of additional formats to use -->
        <parameter>
            <name>main.format.path</name>
            <value>https://raw.githubusercontent.com/GenomicParisCentre/eoulsan-tools/master/formats</value>
        </parameter>

        <!-- Define the location of the Docker connection -->
        <parameter>
            <name>main.docker.uri</name>
            <value>unix:///var/run/docker.sock</value>
        </parameter>

    </globals>

Step 4: launch the analysis

Once your design file and workflow file are ready, you can launch Eoulsan analysis with the following command:

$ eoulsan.sh exec workflow-local.xml design.txt

Warning: To perform the normalization and differential analysis steps of this workflow, this demo requires Docker. If you want to run this demo without Docker, you must must install R (or a RServe server) and the related packages (See differential analysis module documentation for more information) and change the execution.mode parameter of the normalization and diffana steps.

Once started and before starting analysis, Eoulsan will check if:

  • The design file and workflow file are valid
  • All the modules related to the steps exist
  • The workflow of all the steps are valid
  • The order of the steps is correct (a step can not use data generated after its end)
  • The input files are valid (fasta and annotation files)
  • A genome mapper index is needed. In this case, a step to generate this index is automatically added

If successful, you will obtain a new directory name like eoulsan-20190910-101235 with log files about the analysis. Results files are stored in the current directory of the user.

Clone this wiki locally