Introduction
Welcome to the Eoulsan instructions for new users!
First, if Eoulsan is not installed, please follow the instructions on the Eoulsan reference website. Once it is installed, check that you have write and execute permissions on the Eoulsan script (check with ls -l).
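The permission check can also be scripted. The following is only a minimal sketch: it assumes the launcher is called eoulsan.sh (as in the examples further down this page), and both the helper name check_eoulsan_script and the default path ./eoulsan.sh are made up for illustration.

```shell
#!/bin/sh
# check_eoulsan_script is a hypothetical helper, not part of Eoulsan.
# It reports whether the current user can write to and execute the given script.
check_eoulsan_script() {
    script=$1
    if [ ! -f "$script" ]; then
        echo "not found: $script"
        return 1
    fi
    if [ -w "$script" ] && [ -x "$script" ]; then
        echo "ok: $script is writable and executable"
    else
        echo "missing rights on $script (possible fix: chmod u+wx $script)"
        return 1
    fi
}

# Assumed location of the launcher; adjust to your installation.
check_eoulsan_script "${EOULSAN_SCRIPT:-./eoulsan.sh}" || true
```

Set the EOULSAN_SCRIPT variable (an assumption of this sketch, not an Eoulsan setting) to point the check at your own installation.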
You need two input files to launch Eoulsan:
-
Workflow file: an XML file which defines all the steps (and their parameters) of your analysis. More info here.
-
Design file: a text file which contains information on the experiments and on the samples to be analyzed. There are 2 versions of the design file. In scRNA-seq, for example, we only use version 2. More info here. This file is divided into three sections:
-
[Header] section contains (at least) the paths to:
- a reference genome (FASTA file)
- an annotation file GTF / GFF
- [Experiments] section: information for each experiment.
-
[Columns] section: information (ID, name, path, etc.) for each sample. A line might correspond to different things:
- for Smart-seq2: each line is a cell
- for 10x Genomics or bulk RNA-seq: each line is a dataset (e.g. 1000 cells)
The design file can be built automatically with the eoulsan createdesign command.
E.g:
# Define paths
reads='*.fq.bz2'
ref='genome.fasta.bz2'
annotation='annotation.gff.bz2'
# Command to build the design file
eoulsan.sh createdesign $reads $ref $annotation
There is a third file you may need to launch your analysis: a configuration file. It is not compulsory, as all of this information can be specified in the globals section of the workflow file. However, a configuration file may be easier to work with, as the information is more concise.
E.g:
$ head configuration.txt
# Paths
main.tmp.dir=/tmp
main.gff.storage.path=/path/to/genome_annotations
main.genome.desc.storage.path=/path/to/genome_descriptions
main.genome.mapper.index.storage.path=/path/to/genome_indexes
main.genome.storage.path=/path/to/genomes
main.docker.uri=unix:///var/run/docker.sock
main.docker.mount.nfs.roots=true
# Adapt this parameter wisely according to your needs
main.local.threads=4
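To check what a configuration file actually sets before launching a run, its lines can be read with standard tools. This is a small sketch that assumes the format shown above (one key=value pair per line, comment lines starting with #); conf_get is a hypothetical helper, not an Eoulsan command.

```shell
#!/bin/sh
# conf_get KEY FILE: print the value of KEY from a key=value configuration file.
conf_get() {
    # Skip comment lines, match the exact key, print everything after the first "=".
    awk -F= -v key="$1" '!/^#/ && $1 == key { sub(/^[^=]*=/, ""); print; exit }' "$2"
}

# Example use, only if configuration.txt exists in the current directory.
if [ -f configuration.txt ]; then
    echo "threads: $(conf_get main.local.threads configuration.txt)"
fi
```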
We call a module a tool that can be used in Eoulsan to perform a specific step (e.g. there is a module to control the quality of the reads, a module to align the reads to a reference genome...). Each module has its own options and parameters, which you can change as you wish in the workflow file (e.g. for the alignment step you can choose between different aligners such as bowtie, STAR, BWA...). You can keep as many workflow files as you want, which makes it easy to run a series of analyses with different parameters and keep an exact record of each one.
More info on the available modules:
- Modules integrated in core Eoulsan (bulk RNA-seq): http://www.outils.genomique.biologie.ens.fr/eoulsan/modules.html
- Modules in development (scRNA-seq): https://github.com/ComputationalSystemsBiology/Single-cell-RNA-seq/wiki
==== Example (unitary test) ====
Running just one module (a "unitary test") is routine when developing new modules, and it is also a good way to make a first trial with Eoulsan.
$ head -n 50 run.sh
#!/bin/sh
# INPUT
annotation=ensembl_Homo_sapiens.GRCh38.84.gtf
genomeRef=hg38.fasta.bz2
data=T1_test.fastq
configuration=configuration.txt
workflow=workflow-test.xml
# CREATE DESIGN FILE
# If this does not work, check that none of the links are symbolic links (readlink -f [path])
eoulsan.sh createdesign $data $genomeRef $annotation
# RUN (you may start a 'screen' before)
#screen
eoulsan.sh -conf $configuration exec $workflow design.txt
This is an //all-in-one// demo, built so that you can test easily how Eoulsan works. This means that all the files you need are already stored in a folder called ''Eoulsan_demo''. If you want to run your own analysis, please make sure to have:
- your configuration file
- your workflow file
- path to your data
- paths to the annotation and to the reference genome (these may be the same as in the demo if you are working on human data)
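Before launching your own run, it can help to verify that every input listed above exists and, following the comment in run.sh, is not a symbolic link. The sketch below is one possible way to do this; check_inputs is a hypothetical helper, and the file names passed to it are the demo names, used here only as placeholders.

```shell
#!/bin/sh
# check_inputs FILE...: hypothetical pre-flight helper, not part of Eoulsan.
# Returns non-zero if a file is missing or is a symbolic link.
check_inputs() {
    rc=0
    for f in "$@"; do
        if [ ! -e "$f" ]; then
            echo "missing: $f"
            rc=1
        elif [ -L "$f" ]; then
            # readlink -f prints the real path behind the link
            echo "symlink: $f -> $(readlink -f "$f")"
            rc=1
        fi
    done
    return $rc
}

# Placeholder names taken from the demo; replace them with your own paths.
check_inputs configuration.txt workflow-test.xml T1_test.fastq \
    || echo "fix the paths listed above before running Eoulsan"
```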
== Input data ==
For this demo, we took data from human pDC (plasmacytoid dendritic cells). Altogether, there are 7 datasets of ~1000 cells each, each corresponding to a different time-point after activation with a virus. More info on the data [HERE].
The ''T1_test.fastq'' dataset used here is a small fraction of one dataset (time-point 1 or ''T1.fastq''), which originally weighs 16 GB. ''T1_test.fastq'' is only 2.5 MB and has no real biological meaning (it exists only for the purpose of this demo).
This command will create a file called ''design.txt''. The design file obtained should be similar to this:
<file: design.txt>
[Header]
DesignFormatVersion=2
GenomeFile=hg38.fasta.bz2
GtfFile=ensembl_Homo_sapiens.GRCh38.84.gtf

[Experiments]
Exp.exp1.name=Experiment1

[Columns]
SampleId SampleName Reads Date FastqFormat RepTechGroup Exp.exp1.Condition Exp.exp1.Reference UUID
T1test T1_test T1_test.fastq 2017-11-14 fastq-sanger T1_test T1_test false a2cbcf6a-bdc3-466f-8381-30f802250177
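To see quickly which samples a design file declares, the [Columns] section can be read with awk. This is an illustrative sketch: it assumes the columns are separated by whitespace (tabs or spaces), that no value contains a space, and that the SampleId and Reads column names are present as in the example; list_samples is a hypothetical helper.

```shell
#!/bin/sh
# list_samples FILE: print the SampleId and Reads values of each sample row.
list_samples() {
    awk '
        /^\[Columns\]/ { in_cols = 1; have_header = 0; next }  # start of the sample table
        /^\[/          { in_cols = 0 }                         # any other section ends it
        in_cols && NF {
            if (!have_header) {
                # First non-empty row: remember the position of each column name
                for (i = 1; i <= NF; i++) col[$i] = i
                have_header = 1
                next
            }
            print $(col["SampleId"]), $(col["Reads"])
        }
    ' "$1"
}

# Example use, only if design.txt exists in the current directory.
if [ -f design.txt ]; then
    list_samples design.txt
fi
```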
If you run the ''createdesign'' command multiple times, you will get an error message such as the following one. However, Eoulsan will keep running normally.
=== Eoulsan Error ===
File not found: Output design file design.txt already exists
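One way to avoid this message in a script is to build the design file only when it does not exist yet. The sketch below reuses the variable names of run.sh; design_needed is a hypothetical helper, and the test on eoulsan.sh being on the PATH is only there so that the snippet does nothing on a machine without Eoulsan.

```shell
#!/bin/sh
# Same inputs as in run.sh
annotation=ensembl_Homo_sapiens.GRCh38.84.gtf
genomeRef=hg38.fasta.bz2
data=T1_test.fastq

# design_needed FILE: succeed (exit status 0) when FILE does not exist yet.
design_needed() {
    [ ! -e "$1" ]
}

# Create design.txt only if it is absent and eoulsan.sh can be found.
if design_needed design.txt && command -v eoulsan.sh >/dev/null 2>&1; then
    eoulsan.sh createdesign "$data" "$genomeRef" "$annotation"
fi
```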
== Complete workflow file ==
Here is a minimal version of a workflow file. If you want to try other analyses and modify the workflow file, keep in mind that:
- All the tags must be in lower case: '''' and not ''''
- Step ID can be changed as you wish: '''' (this will name your output folder ''hello_output''[LINK])
- Module ID cannot be changed as you wish: it must be the module's internal name, e.g. ''umi_whitelist'' or ''mapreads'' (see the //Internal name// part of http://www.outils.genomique.biologie.ens.fr/eoulsan/module-mapreads.html).
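The lower-case rule is easy to check mechanically. The following sketch looks for tag names that contain an upper-case letter; it is a simple text search (not an XML parser), and find_uppercase_tags is a hypothetical helper.

```shell
#!/bin/sh
# find_uppercase_tags FILE: print (with line numbers) the lines holding a tag
# whose name contains an upper-case letter. Exit status is 1 when none is found.
find_uppercase_tags() {
    grep -nE '</?[A-Za-z_]*[A-Z][A-Za-z_]*[ >/]' "$1"
}

# Example use, only if the workflow file exists in the current directory.
if [ -f workflow-test.xml ] && find_uppercase_tags workflow-test.xml; then
    echo "the tags listed above should be written in lower case"
fi
```

Upper-case text inside a value (for instance an aligner name) is not reported, because only the characters directly after < or </ are inspected.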
<file: workflow-test.xml>
<analysis>
  <formatversion>1.0</formatversion>
  <name>Drop-seq demo</name>
  <description>Demo of Eoulsan for Drop-seq data (single-cell RNA seq)</description>
  <author>Lehmann</author>
  <steps>
    <step id="step1whitelist" skip="false">
      <module>umi_whitelist</module>
      <parameters></parameters>
    </step>
  </steps>
</analysis>
Not sure which parameters you can add? It depends on the type of module you want to use:
- Core Eoulsan modules: check the dedicated page http://www.outils.genomique.biologie.ens.fr/eoulsan/modules.html. For example, the alignment module http://www.outils.genomique.biologie.ens.fr/eoulsan/module-mapreads.html has several parameters available, such as the aligner to use, its version, its arguments, etc.
- Development Eoulsan modules (specific to scRNA-seq): check the wiki, e.g. https://github.com/ComputationalSystemsBiology/Single-cell-RNA-seq/wiki/Module:-Umi-whitelist for the //umi_whitelist// module, which identifies a list of "true" cells among the data (~0.1% of the data). If you need options or parameters that are not available, or if the module has no wiki page yet, you will probably want to look at the module source code: see the Eoulsan development FAQ [LINK].
== Run ==
The output should look like this:
Eoulsan version 2.0-beta6-SNAPSHOT (heads/master-0-g7b7bfe1, build1 build on csbpc12, 2017-10-20 17:33:31 CEST)
- Step checker DONE
- Step step1whitelist DONE
- Workflow DONE (Job done in 00:00:36.018 s.)
The ''checker'' step scans the input files to detect major problems (syntax, parsing...). It checks the ''fastq'' files (sequencing reads), the reference genome and the annotation file (gff/gtf).
== Output ==
When Eoulsan finishes running, you will find in your output directory:
- One output folder for each step (except for the ''checker'' step), named like: ''stepID_output''
- A main folder where the logs and error messages are kept, named like: ''eoulsan-20180201-144348'' (''eoulsan-YearMonthDay-time'')
- A folder called ''eoulsan-latest'' which is a symbolic link to the latest run
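The layout described above can be inspected from the shell. This is a minimal sketch that relies only on the naming conventions given in the list (''stepID_output'' folders and the ''eoulsan-latest'' symbolic link); list_step_outputs and latest_run are hypothetical helper names.

```shell
#!/bin/sh
# list_step_outputs DIR: print the step output folders (stepID_output) found in DIR.
list_step_outputs() {
    for d in "$1"/*_output; do
        [ -d "$d" ] && basename "$d"
    done
    return 0
}

# latest_run DIR: print the target of the eoulsan-latest symbolic link in DIR.
latest_run() {
    [ -L "$1/eoulsan-latest" ] && readlink "$1/eoulsan-latest"
}

# Example use on the current directory.
list_step_outputs .
latest_run . || echo "no eoulsan-latest link in this directory"
```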
==== Example (import data from previous steps) ====
In this example, you will see how to perform an analysis with input data other than raw sequencing data (''fastq'' files), e.g. from alignment data (''bam'' or ''sam'' files).
== Input files ==
The configuration and design files are unchanged: you can keep the ones provided in the above example. Only the workflow file changes:
<file: workflow-test2.xml>
<analysis>
  <formatversion>1.0</formatversion>
  <name>Drop-seq demo import</name>
  <description>Demo of Eoulsan for Drop-seq data (single-cell RNA seq): import data from previous step</description>
  <author>Lehmann</author>
  <steps>
    <step id="step1importsam" skip="false">
      <module>import</module>
      <parameters>
        <parameter>
          <name>files</name>
          <value>*.sam</value>
        </parameter>
      </parameters>
    </step>
    <step id="step2featurecounts" skip="false">
      <module>featurecounts</module>
      <parameters></parameters>
    </step>
    <step id="step3umicount" skip="false">
      <module>umi_count</module>
      <parameters></parameters>
    </step>
  </steps>
</analysis>
== Run ==
Your run should look like this:
Eoulsan version 2.0-beta6-SNAPSHOT (heads/master-0-g7b7bfe1, build1 build on csbpc12, 2017-10-20 17:33:31 CEST)
* Step checker DONE
* Step step1importsam DONE
* Step step2featurecounts DONE
* Step step3umicount DONE
* Workflow DONE (Job done in 00:28:54.167 s.)
As expected, you should obtain three output folders (one for each step).
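To confirm that each step produced its folder, the step IDs can be extracted from the workflow file and compared with the ''stepID_output'' folders. The sketch below assumes one <step ...> opening tag per line and that no step is skipped; step_ids and missing_outputs are hypothetical helpers.

```shell
#!/bin/sh
# step_ids FILE: print the id attribute of every <step> element, one per line.
step_ids() {
    sed -n 's/.*<step[^>]*id="\([^"]*\)".*/\1/p' "$1"
}

# missing_outputs WORKFLOW DIR: print the step IDs without a stepID_output folder in DIR.
missing_outputs() {
    step_ids "$1" | while read -r id; do
        [ -d "$2/${id}_output" ] || echo "$id"
    done
}

# Example use, only if the workflow file exists in the current directory.
if [ -f workflow-test2.xml ]; then
    missing_outputs workflow-test2.xml .
fi
```

An empty result means that every step listed in the workflow file has an output folder.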
==== Example (full analysis) ====
In this last example, you will see how to perform a full analysis of scRNA-seq data, from the quality control steps (''fastq'' files) to the generation of an expression matrix (where rows are genes and columns are cells).
== Design file ==
The design file has changed compared to the above examples, because in this case we have 2 datasets: each line corresponds to a dataset (from ''T1'' to ''T2''). This design file has been created with the Eoulsan ''createdesign'' command.
<file: design.txt>
[Header]
DesignFormatVersion=2
GenomeFile=fasta/hg38.fasta.bz2
GtfFile=annotation/ensembl_Homo_sapiens.GRCh38.84.gtf

[Experiments]
Exp.exp1.name=Experiment1

[Columns]
SampleId SampleName Reads Date FastqFormat RepTechGroup Exp.exp1.Condition Exp.exp1.Reference UUID
T1R2 T1_R2 T1_R2.fastq.gz 2017-11-08 fastq-sanger T1_R2 T1_R2 false 7f5eef1a-2999-492c-8d9e-2e07b6bd3adf
T2R2 T2_R2 T2_R2.fastq.gz 2017-11-03 fastq-sanger T2_R2 T2_R2 false 90b772d7-c4ea-4f5d-92e4-50b88e62ad77
== Workflow file ==
<file: workflow-test3.txt>
<analysis>
  <formatversion>1.0</formatversion>
  <name>Drop-seq demo</name>
  <description>Demo of Eoulsan for Drop-seq data (single-cell RNA seq)</description>
  <author>Lehmann</author>
  <steps>
    <step id="step1whitelist" skip="false">
      <module>umi_whitelist</module>
      <parameters></parameters>
    </step>
    <step id="step2extract" dataproduct="match" skip="false">
      <module>umi_extract</module>
      <parameters></parameters>
    </step>
    <step id="step3filterreads" skip="false">
      <module>filterreads</module>
      <parameters>
        <parameter>
          <name>trim.length.threshold</name>
          <value>11</value>
        </parameter>
        <parameter>
          <name>quality.threshold</name>
          <value>12</value>
        </parameter>
      </parameters>
    </step>
    <step id="step4mapreads" skip="false">
      <module>mapreads</module>
      <parameters>
        <parameter>
          <name>mapper</name>
          <value>star</value>
        </parameter>
        <parameter>
          <name>mapper.arguments</name>
          <value></value>
        </parameter>
      </parameters>
    </step>
    <step id="step5filtersam" skip="false">
      <module>filtersam</module>
      <parameters>
        <parameter>
          <name>removeunmapped</name>
          <value>true</value>
        </parameter>
        <parameter>
          <name>removemultimatches</name>
          <value>true</value>
        </parameter>
      </parameters>
    </step>
    <step id="step6featurecounts" skip="false">
      <module>featurecounts</module>
      <parameters></parameters>
    </step>
    <step id="step7umicount" skip="false">
      <module>umi_count</module>
      <parameters></parameters>
    </step>
  </steps>
</analysis>
== Run ==
Eoulsan version 2.0-beta6-SNAPSHOT (heads/master-0-g7b7bfe1, build1 build on csbpc12, 2017-10-20 17:33:31 CEST)
- Step checker DONE
- Step genomedescgenerator DONE
- Step genericindexgenerator DONE
- Step step1whitelist DONE
- Step step2extract DONE
- Step step3filterreads DONE
- Step step4mapreads DONE
- Step step5filtersam DONE
- Step step7umicount DONE
- Step step6featurecounts DONE
- Workflow DONE (Job done in 09:11:25.786 s.)
==== Build a module ====
If you're interested in Eoulsan development, please refer to the dedicated page [LINK].