LehmannN edited this page Jun 15, 2018 · 2 revisions

Welcome to the Eoulsan instructions for new users!

First, if Eoulsan is not installed, please follow the instructions on the Eoulsan reference website. Once it is installed, check that you have write and execute permissions on the Eoulsan script (check with ls -l).
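The permission check can also be scripted. This is a minimal sketch, assuming the launcher is the eoulsan.sh script used in the commands on this page; the helper name check_script and the example path are illustrative only:

```shell
#!/bin/sh
# check_script PATH: report whether PATH is a regular file that the
# current user can both write to and execute. Hypothetical helper.
check_script() {
    script=$1
    if [ ! -f "$script" ]; then
        echo "missing: $script"
        return 1
    fi
    if [ -w "$script" ] && [ -x "$script" ]; then
        echo "ok: $script"
    else
        echo "insufficient rights: $script"
        return 1
    fi
}

# Example (adapt the path to your installation):
# check_script /path/to/eoulsan.sh || chmod u+wx /path/to/eoulsan.sh
```

The function only reports the problem; whether to fix it with chmod is left to you.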

Before you start

You need 2 input files to launch Eoulsan:

  • Workflow file: an XML file which defines all the steps (and their parameters) of your analysis. More info here.

  • Design file: a text file which contains information on the experiments and on the samples to be analyzed. There are 2 versions of the design file. In scRNA-seq, for example, we only use version 2 of the design file. More info here. This file is divided into three sections:

    • [Header] section contains (at least) the paths to:
      • a reference genome (FASTA file)
      • an annotation file (GTF / GFF)
    • [Experiments] section: information for each experiment.
    • [Columns] section: information (ID, name, path, etc.) for each sample. A line might correspond to different things:
      • for Smart-seq2: each line is a cell
      • for 10x Genomics or bulk RNA-seq: each line is a dataset (e.g. 1000 cells)
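A quick way to confirm that a design file contains the three sections is to list its section headers. This is a minimal sketch using standard tools only; the function name list_design_sections is illustrative and not part of Eoulsan:

```shell
#!/bin/sh
# list_design_sections FILE: print each section header ([Header],
# [Experiments], [Columns], ...) found at the start of a line, in order.
list_design_sections() {
    sed -n 's/^\(\[[A-Za-z]*]\).*/\1/p' "$1"
}

# Example:
# list_design_sections design.txt
```

If one of the three headers is missing from the output, the design file is probably incomplete.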

The design file can be built automatically with the eoulsan.sh createdesign command.

E.g:

# Define paths
reads='*.fq.bz2'
ref='genome.fasta.bz2'
annotation='annotation.gff.bz2'

# Command to build the design file
eoulsan.sh createdesign $reads $ref $annotation

There is a third file you may need to launch your analysis: a configuration file. It is not compulsory, as all of this information can be specified in the globals section of the workflow file. However, a configuration file may be more convenient, as the information is more concise.

E.g:

$ head configuration.txt
# Paths
main.tmp.dir=/tmp
main.gff.storage.path=/path/to/genome_annotations
main.genome.desc.storage.path=/path/to/genome_descriptions
main.genome.mapper.index.storage.path=/path/to/genome_indexes
main.genome.storage.path=/path/to/genomes
main.docker.uri=unix:///var/run/docker.sock
main.docker.mount.nfs.roots=true
# Adapt this parameter wisely according to your needs
main.local.threads=4
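To check what a configuration file actually sets before a run (for instance, to compare main.local.threads with the number of cores on your machine), you can read a single key from it. This is a sketch that only assumes the key=value layout shown above; get_conf is a hypothetical helper, not an Eoulsan command:

```shell
#!/bin/sh
# get_conf KEY FILE: print the value of KEY from a key=value file,
# skipping comment lines. Prints nothing if the key is absent.
get_conf() {
    awk -F= -v key="$1" '
        /^[[:space:]]*#/ { next }
        $1 == key { sub(/^[^=]*=/, ""); print; exit }
    ' "$2"
}

# Example:
# get_conf main.local.threads configuration.txt
```

Everything after the first "=" is returned, so values that themselves contain "=" are kept whole.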

Modules

We call a module a tool that can be used in Eoulsan to perform a specific step (e.g. there is a module to control the quality of the reads, a module to align the reads to a reference genome...). Each module has its own options and parameters, which you can change as you wish in the workflow file (e.g. for the alignment step you can choose between different aligners such as bowtie, STAR, BWA...). You can keep as many workflow files as you want, so it is an easy way to run series of analyses with different parameters and keep an exact record of each analysis.
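Running a series of analyses from several workflow files could look like the sketch below. It reuses the "eoulsan.sh -conf ... exec ..." invocation shown elsewhere on this page; the function name run_series, the EOULSAN variable and the workflow file names in the example are assumptions made for illustration:

```shell
#!/bin/sh
# Launcher to call; override with EOULSAN=/path/to/eoulsan.sh if needed.
EOULSAN=${EOULSAN:-eoulsan.sh}

# run_series CONF DESIGN WORKFLOW...: run each workflow file in turn with
# the same configuration and design files, stopping at the first failure.
run_series() {
    conf=$1
    design=$2
    shift 2
    for workflow in "$@"; do
        echo "Running $workflow"
        "$EOULSAN" -conf "$conf" exec "$workflow" "$design" || return 1
    done
}

# Example (hypothetical workflow file names):
# run_series configuration.txt design.txt workflow-star.xml workflow-bowtie.xml
```

This sketch launches every run from the current directory; you may prefer one working directory per workflow file so that the outputs of each run stay separate.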

More info on the available modules:

==== Example (unitary test) ====

Running just one module ("unitary test") is routine when programming new modules, but it is also a good way to make a few first trials with Eoulsan.

$ head -n 50 run.sh
#!/bin/sh

# INPUT 
annotation=ensembl_Homo_sapiens.GRCh38.84.gtf
genomeRef=hg38.fasta.bz2
data=T1_test.fastq
configuration=configuration.txt
workflow=workflow-test.xml

# CREATE DESIGN FILE
# If this does not work, check that none of the paths are symbolic links (readlink -f [path])
eoulsan.sh createdesign $data $genomeRef $annotation

# RUN (you may start a 'screen' before)
#screen
eoulsan.sh -conf $configuration exec $workflow design.txt
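The comment in the script suggests checking for symbolic links with readlink -f. One way to do this for all inputs at once is sketched below; resolve_inputs is a hypothetical helper, and readlink -f is the GNU coreutils form already mentioned in the script:

```shell
#!/bin/sh
# resolve_inputs FILE...: print the absolute, symlink-free path of each
# file, one per line, using `readlink -f`.
resolve_inputs() {
    for f in "$@"; do
        readlink -f "$f"
    done
}

# Example: pass resolved paths instead of links
# data=$(readlink -f T1_test.fastq)
# resolve_inputs "$data" hg38.fasta.bz2 ensembl_Homo_sapiens.GRCh38.84.gtf
```

If a printed path differs from the one you passed in, that input is (or sits behind) a symbolic link.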

This is an //all-in-one// demo, built so that you can easily test how Eoulsan works. All the files you need are already stored in a folder called ''Eoulsan_demo''. If you want to run your own analysis, please make sure you have:

  • your configuration file
  • your workflow file
  • the path to your data
  • the paths to the annotation and the reference genome (though they might not differ from the demo if you are working on human data)

== Input data ==

For this demo, we took data from human pDC (plasmacytoid dendritic cells). Altogether, there are 7 datasets of ~1000 cells each, each corresponding to a different time point after activation with a virus. More info on the data [HERE].

The ''T1_test.fastq'' dataset used here is a small fraction of one dataset (time point 1, or ''T1.fastq''), which originally weighs 16 GB. ''T1_test.fastq'' is only 2.5 MB, but it has no real biological meaning (it exists only for the purpose of this demo).
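If you want to build a similar small test file from your own data, one simple option is to keep the first reads of a FASTQ file. This is not necessarily how ''T1_test.fastq'' was produced; it is a sketch that assumes an uncompressed FASTQ file with the usual four lines per read, and fastq_head is a hypothetical helper:

```shell
#!/bin/sh
# fastq_head N FILE: print the first N reads of an uncompressed FASTQ
# file. Assumes 4 lines per read (no wrapped sequence lines).
fastq_head() {
    head -n "$(( $1 * 4 ))" "$2"
}

# Example: keep the first 10000 reads
# fastq_head 10000 T1.fastq > T1_test.fastq
```

Like the demo file, a subset taken this way is only useful for testing that a workflow runs, not for biological conclusions.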

Create design file

This command will create a file called ''design.txt''. The design file obtained should be similar to this:

<file: design.txt>
[Header]
DesignFormatVersion=2
GenomeFile=hg38.fasta.bz2
GtfFile=ensembl_Homo_sapiens.GRCh38.84.gtf

[Experiments]
Exp.exp1.name=Experiment1

[Columns]
SampleId	SampleName	Reads	Date	FastqFormat	RepTechGroup	Exp.exp1.Condition	Exp.exp1.Reference	UUID
T1test	T1_test	T1_test.fastq	2017-11-14	fastq-sanger	T1_test	T1_test	false	a2cbcf6a-bdc3-466f-8381-30f802250177
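To see at a glance which samples a design file declares, you can print the sample identifier and the reads file of each line in the [Columns] section. This is a sketch that assumes whitespace-separated columns without spaces inside the values, and a column named Reads as in the example above; list_samples is a hypothetical helper:

```shell
#!/bin/sh
# list_samples FILE: for each sample line of the [Columns] section, print
# the first column (sample id) and the column named "Reads".
list_samples() {
    awk '
        /^\[/ { in_cols = ($1 == "[Columns]"); reads_col = 0; next }
        in_cols && NF > 0 {
            if (!reads_col) {
                # First non-empty line of the section: the column names.
                for (i = 1; i <= NF; i++) if ($i == "Reads") reads_col = i
                next
            }
            print $1, $reads_col
        }
    ' "$1"
}

# Example:
# list_samples design.txt
```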

If you run the ''createdesign'' command multiple times, you will get an error message such as the following one. However, Eoulsan will keep running normally.

=== Eoulsan Error ===
File not found: Output design file design.txt already exists
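To avoid this error message in a script that is run repeatedly, you can call ''createdesign'' only when ''design.txt'' is not there yet. This is a minimal sketch; create_design_once and the EOULSAN variable are names introduced here for illustration:

```shell
#!/bin/sh
# Launcher to call; override with EOULSAN=/path/to/eoulsan.sh if needed.
EOULSAN=${EOULSAN:-eoulsan.sh}

# create_design_once ARGS...: call `createdesign` only when design.txt is
# absent from the current directory; otherwise keep the existing file.
create_design_once() {
    if [ -e design.txt ]; then
        echo "design.txt already exists, keeping it"
    else
        "$EOULSAN" createdesign "$@"
    fi
}

# Example:
# create_design_once T1_test.fastq hg38.fasta.bz2 ensembl_Homo_sapiens.GRCh38.84.gtf
```

If your inputs have changed, delete or rename the old ''design.txt'' first so that a fresh one is generated.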

== Complete workflow file ==

Here is a minimal version of a workflow file. If you want to try other analyses and modify the workflow file, keep in mind that:

<file: workflow-test.xml>
<analysis>
    <formatversion>1.0</formatversion>
    <name>Drop-seq demo</name>
    <description>Demo of Eoulsan for Drop-seq data (single-cell RNA seq)</description>
    <author>Lehmann</author>

    <steps>
        <step id="step1whitelist" skip="false">
            <module>umi_whitelist</module>
            <parameters></parameters>
        </step>
    </steps>
</analysis>

Not sure which parameters you can add? It depends on the type of module you want to use:

== Run ==

The output should look like this:

Eoulsan version 2.0-beta6-SNAPSHOT (heads/master-0-g7b7bfe1, build1 build on csbpc12, 2017-10-20 17:33:31 CEST)

  • Step checker DONE
  • Step step1whitelist DONE
  • Workflow DONE (Job done in 00:00:36.018 s.)

The ''checker'' step scans the input files to detect major problems (syntax, parsing...). It checks the ''fastq'' files (sequencing reads), the reference genome and the annotation file (gff/gtf).
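If you saved the terminal output of a run to a file, you can extract the names of the steps reported as DONE and compare them with the step ids of your workflow file. This sketch assumes the "Step <id> DONE" line format shown above, one step per line; done_steps and the file name run.log are hypothetical:

```shell
#!/bin/sh
# done_steps: read run output on standard input and print the id of each
# step reported as DONE, one per line.
done_steps() {
    sed -n 's/.*Step \([^ ]*\) DONE.*/\1/p'
}

# Example (run.log is a file in which you captured the terminal output):
# done_steps < run.log
```

The final "Workflow DONE" line is not a step and is therefore not listed.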

== Output ==

When Eoulsan finishes running, you will find in your output directory:

  • One output folder for each step (except for the ''checker'' step), named like: ''stepID_output''
  • A main folder where the logs and error messages are kept, named like: ''eoulsan-20180201-144348'' (''eoulsan-YearMonthDay-time'')
  • A folder called ''eoulsan-latest'' which is a symbolic link to the latest run
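The layout described above can be inspected from the command line. This is a small sketch that relies only on the folder names given in the list (''eoulsan-latest'' and ''stepID_output''); show_latest_run is a hypothetical helper:

```shell
#!/bin/sh
# show_latest_run [DIR]: print the run folder that `eoulsan-latest` points
# to, then the name of each per-step output folder found in DIR.
show_latest_run() {
    dir=${1:-.}
    readlink "$dir/eoulsan-latest"
    for d in "$dir"/*_output; do
        if [ -d "$d" ]; then
            basename "$d"
        fi
    done
}

# Example, from the output directory:
# show_latest_run .
```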

==== Example (import data from previous steps) ====

In this example, you will see how to run an analysis with input data other than raw sequencing data (''fastq'' files), e.g. alignment data (''bam'' or ''sam'' files).

== Input files ==

The configuration and design files are unchanged: you can keep the ones provided in the example above. Only the workflow file changes:

<file: workflow-test2.xml>
<analysis>
    <formatversion>1.0</formatversion>
    <name>Drop-seq demo import</name>
    <description>Demo of Eoulsan for Drop-seq data (single-cell RNA seq): import data from previous step</description>
    <author>Lehmann</author>

    <steps>
        <step id="step1importsam" skip="false">
            <module>import</module>
            <parameters>
                <parameter>
                    <name>files</name>
                    <value>*.sam</value>
                </parameter>
            </parameters>
        </step>
        <step id="step2featurecounts" skip="false">
            <module>featurecounts</module>
            <parameters></parameters>
        </step>
        <step id="step3umicount" skip="false">
            <module>umi_count</module>
            <parameters></parameters>
        </step>
    </steps>
</analysis>
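Before launching this workflow, it can be useful to check how many files the ''*.sam'' pattern would match. The sketch below assumes that the pattern is resolved relative to the directory from which you launch Eoulsan, which is an assumption and not something stated on this page; count_sam is a hypothetical helper:

```shell
#!/bin/sh
# count_sam [DIR]: print the number of files in DIR whose name ends in
# .sam (0 if there are none).
count_sam() {
    dir=${1:-.}
    n=0
    for f in "$dir"/*.sam; do
        [ -e "$f" ] && n=$((n + 1))
    done
    echo "$n"
}

# Example, from the directory where you will launch the run:
# count_sam .
```

A count of 0 means the import step would have nothing to read from that directory.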

== Run ==

Your run should look like this:

Eoulsan version 2.0-beta6-SNAPSHOT (heads/master-0-g7b7bfe1, build1 build on csbpc12, 2017-10-20 17:33:31 CEST)

  • Step checker DONE
  • Step step1importsam DONE
  • Step step2featurecounts DONE
  • Step step3umicount DONE
  • Workflow DONE (Job done in 00:28:54.167 s.)

As expected, the output should contain three output folders (one for each step).

==== Example (full analysis) ====

In this last example, you will see how to run a full analysis of scRNA-seq data, from the quality control steps (''fastq'' files) to the generation of an expression matrix (where rows are genes and columns are cells).

== Design file ==

The design file differs from the one in the examples above because here we have 2 datasets: each line corresponds to a dataset (''T1'' and ''T2''). This design file was created with the Eoulsan ''createdesign'' command.

<file: design.txt>
[Header]
DesignFormatVersion=2
GenomeFile=fasta/hg38.fasta.bz2
GtfFile=annotation/ensembl_Homo_sapiens.GRCh38.84.gtf

[Experiments]
Exp.exp1.name=Experiment1

[Columns]
SampleId	SampleName	Reads	Date	FastqFormat	RepTechGroup	Exp.exp1.Condition	Exp.exp1.Reference	UUID
T1R2	T1_R2	T1_R2.fastq.gz	2017-11-08	fastq-sanger	T1_R2	T1_R2	false	7f5eef1a-2999-492c-8d9e-2e07b6bd3adf
T2R2	T2_R2	T2_R2.fastq.gz	2017-11-03	fastq-sanger	T2_R2	T2_R2	false	90b772d7-c4ea-4f5d-92e4-50b88e62ad77

== Workflow file ==

<file: workflow-test3.txt>
<analysis>
    <formatversion>1.0</formatversion>
    <name>Drop-seq demo</name>
    <description>Demo of Eoulsan for Drop-seq data (single-cell RNA seq)</description>
    <author>Lehmann</author>

    <steps>
        <step id="step1whitelist" skip="false">
            <module>umi_whitelist</module>
            <parameters></parameters>
        </step>
        <step id="step2extract" dataproduct="match" skip="false">
            <module>umi_extract</module>
            <parameters></parameters>
        </step>
        <step id="step3filterreads" skip="false">
            <module>filterreads</module>
            <parameters>
                <parameter>
                    <name>trim.length.threshold</name>
                    <value>11</value>
                </parameter>
                <parameter>
                    <name>quality.threshold</name>
                    <value>12</value>
                </parameter>
            </parameters>
        </step>
        <step id="step4mapreads" skip="false">
            <module>mapreads</module>
            <parameters>
                <parameter>
                    <name>mapper</name>
                    <value>star</value>
                </parameter>
                <parameter>
                    <name>mapper.arguments</name>
                    <value></value>
                </parameter>
            </parameters>
        </step>
        <step id="step5filtersam" skip="false">
            <module>filtersam</module>
            <parameters>
                <parameter>
                    <name>removeunmapped</name>
                    <value>true</value>
                </parameter>
                <parameter>
                    <name>removemultimatches</name>
                    <value>true</value>
                </parameter>
            </parameters>
        </step>
        <step id="step6featurecounts" skip="false">
            <module>featurecounts</module>
            <parameters></parameters>
        </step>
        <step id="step7umicount" skip="false">
            <module>umi_count</module>
            <parameters></parameters>
        </step>
    </steps>
</analysis>

== Run ==

Eoulsan version 2.0-beta6-SNAPSHOT (heads/master-0-g7b7bfe1, build1 build on csbpc12, 2017-10-20 17:33:31 CEST)

  • Step checker DONE
  • Step genomedescgenerator DONE
  • Step genericindexgenerator DONE
  • Step step1whitelist DONE
  • Step step2extract DONE
  • Step step3filterreads DONE
  • Step step4mapreads DONE
  • Step step5filtersam DONE
  • Step step7umicount DONE
  • Step step6featurecounts DONE
  • Workflow DONE (Job done in 09:11:25.786 s.)
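Once the run has finished, a quick sanity check on the expression matrix is to print its dimensions (genes in rows, cells in columns). The sketch below assumes that the matrix is a tab-separated text file with one header line, and that the first column holds the gene identifiers; both the layout and the file name in the example are assumptions, so adapt them to what your run actually produces:

```shell
#!/bin/sh
# matrix_dims FILE: print the number of genes (rows after the header) and
# cells (columns after the first one) of a tab-separated matrix.
matrix_dims() {
    awk -F'\t' '
        NR == 1 { cells = NF - 1 }
        END { print NR - 1 " genes x " cells " cells" }
    ' "$1"
}

# Example (hypothetical file name):
# matrix_dims expression_matrix.tsv
```

For a 10x Genomics or Drop-seq dataset of about 1000 cells, the cell count should be of that order of magnitude.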

==== Build a module ====

If you are interested in Eoulsan development, please refer to the dedicated page [LINK].