LehmannN edited this page Jun 12, 2018 · 3 revisions

This page presents some concepts that you may come across when using Eoulsan.

Step core concepts

Module

A module is a class that contains the code to execute in a step. Several steps can use the same module.

Step

A step contains an instance of a module, has a name and some parameters, and is linked to other steps.

Step parameters

Step parameters are the parameters used to configure a step. They are passed to an instance of a module through the Module.configure() method.

Step ports

Data are transmitted between steps through ports. There are two port types: input and output. Steps are usually linked through their input and output ports.

Input and output ports are not mandatory for a step (e.g. the shell module has no port).

Workflow

A workflow contains all the steps to execute.

Task

A task is the execution unit of Eoulsan. Usually a task processes one sample. For example, if your filterread step has 4 FASTQ files to process, 4 tasks will be created.

A task is executed in an instance of a module using the Module.execute() method.
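The relation between input files and tasks can be sketched as follows. This is only an illustration under simple assumptions: TaskSketch, Task and createTasks are hypothetical names that do not belong to the Eoulsan API, and the file names are made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: one task is created per input file of a step.
// None of these names belong to the Eoulsan API.
class TaskSketch {

  // A task pairs a step name with the single input it processes.
  static final class Task {
    final String stepName;
    final String input;

    Task(String stepName, String input) {
      this.stepName = stepName;
      this.input = input;
    }
  }

  // Create one task for each input file handed to the step.
  static List<Task> createTasks(String stepName, List<String> inputs) {
    List<Task> tasks = new ArrayList<>();
    for (String input : inputs) {
      tasks.add(new Task(stepName, input));
    }
    return tasks;
  }

  public static void main(String[] args) {
    List<Task> tasks = createTasks("filterread",
        Arrays.asList("s1.fq", "s2.fq", "s3.fq", "s4.fq"));
    System.out.println(tasks.size() + " tasks created");
  }
}
```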

Data concept

Data

A Data is an object that is transmitted between 2 steps (between an input port and an output port).

A Data contains a file, a format and metadata.

For some Data formats (e.g. FASTQ format), the data are divided into several files (e.g. there are 2 FASTQ files in paired-end mode).
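As a rough mental model, a Data can be pictured as a format, some metadata and one or more files. The sketch below is hypothetical: DataSketch, its fields and the metadata keys ("sample", "pairedEnd") are invented for illustration and are not the Eoulsan classes.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of a Data object; not the Eoulsan API.
class DataSketch {

  final String format; // e.g. "fastq"
  final Map<String, String> metadata = new LinkedHashMap<>();
  private final List<String> files = new ArrayList<>();

  DataSketch(String format) {
    this.format = format;
  }

  // Add one more file to this Data (paired-end reads need two files).
  DataSketch addFile(String path) {
    files.add(path);
    return this;
  }

  // Read-only view of the files that make up this Data.
  List<String> files() {
    return Collections.unmodifiableList(files);
  }

  // Build a paired-end FASTQ Data made of two files.
  static DataSketch pairedEnd(String sample, String read1, String read2) {
    DataSketch data = new DataSketch("fastq").addFile(read1).addFile(read2);
    data.metadata.put("sample", sample);     // made-up metadata key
    data.metadata.put("pairedEnd", "true");  // made-up metadata key
    return data;
  }
}
```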

DataMetadata

The metadata of a Data object contains information about the Data (e.g. the FASTQ format of a FASTQ file, or design information about the sample related to the data).

DataFile

There are many classes that handle files in the Java ecosystem: File (from Java 1.0), Path (from Hadoop) and Path (from Java 7).

As the File class can only deal with files on the local file system, and as the Java Path class is not used in Hadoop, an abstraction layer on top of these classes was necessary to easily handle all the file systems.

In addition to supporting many protocols (file://, ftp://, http://, https://, hdfs://) and Eoulsan-dedicated protocols (gff://, gtf://, genome://, additionalannotation://), the DataFile class comes with many useful features:

  • The open() method creates an InputStream that transparently decompresses the data if the filename contains a compression extension. To get the compressed stream, use the rawOpen() method.
  • The create() method creates an OutputStream that transparently compresses the data if the filename contains a compression extension. To write without this automatic compression, use the rawCreate() method.
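The idea of choosing a decompressing stream from the filename extension can be sketched with JDK classes only. This is not the DataFile implementation: CompressionSketch and its static open() and rawOpen() helpers are hypothetical, they work on java.nio.file.Path, and only the .gz extension is handled here.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical sketch of extension-based decompression; not Eoulsan code.
class CompressionSketch {

  // Return the bytes exactly as they are stored on disk.
  static InputStream rawOpen(Path file) throws IOException {
    return Files.newInputStream(file);
  }

  // Return decompressed content when the filename ends with ".gz".
  static InputStream open(Path file) throws IOException {
    InputStream raw = rawOpen(file);
    boolean gzipped = file.getFileName().toString().endsWith(".gz");
    return gzipped ? new GZIPInputStream(raw) : raw;
  }

  // Read a stream until its end.
  static byte[] readAll(InputStream in) throws IOException {
    ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    byte[] chunk = new byte[4096];
    int n;
    while ((n = in.read(chunk)) != -1) {
      buffer.write(chunk, 0, n);
    }
    return buffer.toByteArray();
  }

  // Write text to a temporary .gz file, then read it back through open().
  static String roundTrip(String text) {
    try {
      Path tmp = Files.createTempFile("sketch", ".fq.gz");
      try {
        try (OutputStream out = new GZIPOutputStream(Files.newOutputStream(tmp))) {
          out.write(text.getBytes(StandardCharsets.UTF_8));
        }
        try (InputStream in = open(tmp)) {
          return new String(readAll(in), StandardCharsets.UTF_8);
        }
      } finally {
        Files.deleteIfExists(tmp);
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

In this sketch, reading the same file through rawOpen() would return the gzip-compressed bytes unchanged.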

Protocol

A Protocol in Eoulsan is a class that defines how to read and write files on a file system. The currently supported protocols are:

  • file://
  • http://
  • https://
  • hdfs://
  • s3://
  • ftp:// (read-only)

In addition, Eoulsan provides some storage protocols that avoid data duplication by using a central repository for:

  • genomes: genome://
  • gff annotations: gff://
  • gtf annotations: gtf://
  • additional annotations: additionalannotations://

A protocol may or may not support each of the following features:

  • read data
  • write data
  • create symbolic link
  • list directory
  • create directory
  • delete file or directory
  • rename file or directory
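One simple way to picture per-protocol capabilities is a set of feature flags attached to each protocol. The sketch below is hypothetical: ProtocolSketch, Feature and Protocol are not Eoulsan types. The read-only FTP entry follows the list above, while giving file:// every feature is an assumption made for illustration.

```java
import java.util.Collections;
import java.util.EnumSet;
import java.util.Set;

// Hypothetical sketch of protocol capabilities; not the Eoulsan API.
class ProtocolSketch {

  enum Feature {
    READ, WRITE, SYMLINK, LIST_DIRECTORY, CREATE_DIRECTORY, DELETE, RENAME
  }

  static final class Protocol {
    final String scheme;
    final Set<Feature> features;

    Protocol(String scheme, Set<Feature> features) {
      this.scheme = scheme;
      this.features = Collections.unmodifiableSet(features);
    }

    boolean supports(Feature feature) {
      return features.contains(feature);
    }

    // Fail early when an operation is not available for this protocol.
    void require(Feature feature) {
      if (!supports(feature)) {
        throw new UnsupportedOperationException(
            scheme + ":// does not support " + feature);
      }
    }
  }

  // FTP is read-only, as stated in the protocol list.
  static final Protocol FTP = new Protocol("ftp", EnumSet.of(Feature.READ));

  // Assumption for illustration: the local file system supports everything.
  static final Protocol FILE = new Protocol("file", EnumSet.allOf(Feature.class));
}
```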

Other concepts

Requirement

A requirement is a feature that is necessary for a module to execute a task. A requirement can be, for example, a Docker image, an executable in the PATH or a working Rserve server. A requirement can be optional.

A module informs the workflow of its requirements by overriding the Module.getRequirements() method.

Parallelization mode

The parallelization mode is the method used by the workflow engine to schedule the execution of a module's tasks. There are 3 parallelization modes:

  • standard: the maximum number of tasks running at the same time equals the number of processors of the system.
  • own parallelization: in this mode, the workflow waits for all running tasks to end before executing a task. Only one task is executed at a time. This mode is usually used for read mappers (e.g. Bowtie) because these tools are multi-threaded and can use all the available cores of the system.
  • no parallelization: this internal mode allows a task to be executed immediately. This mode can be used for very short tasks (a few seconds).
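The three modes can be approximated with the JDK executor framework. This is a simplified sketch, not the Eoulsan scheduler: ParallelizationSketch and Mode are hypothetical names, each call handles the tasks of a single mode, and the interaction between steps that use different modes in one workflow is not modeled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of the three modes; not the Eoulsan scheduler.
class ParallelizationSketch {

  enum Mode { STANDARD, OWN_PARALLELIZATION, NO_PARALLELIZATION }

  // Run all tasks according to the mode and wait for them to finish.
  static void run(Mode mode, List<Runnable> tasks) {
    if (mode == Mode.NO_PARALLELIZATION) {
      // Executed immediately in the calling thread, without queuing.
      for (Runnable task : tasks) {
        task.run();
      }
      return;
    }
    // STANDARD: one thread per processor. OWN_PARALLELIZATION: one at a time.
    int threads = mode == Mode.STANDARD
        ? Runtime.getRuntime().availableProcessors()
        : 1;
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    for (Runnable task : tasks) {
      pool.execute(task);
    }
    pool.shutdown();
    try {
      if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
        pool.shutdownNow();
        throw new IllegalStateException("tasks did not finish in time");
      }
    } catch (InterruptedException e) {
      pool.shutdownNow();
      Thread.currentThread().interrupt();
      throw new IllegalStateException("interrupted while waiting", e);
    }
  }

  // Run short dummy tasks and report the highest number seen running at once.
  static int peakConcurrency(Mode mode, int taskCount) {
    AtomicInteger running = new AtomicInteger();
    AtomicInteger peak = new AtomicInteger();
    List<Runnable> tasks = new ArrayList<>();
    for (int i = 0; i < taskCount; i++) {
      tasks.add(() -> {
        peak.accumulateAndGet(running.incrementAndGet(), Math::max);
        try {
          Thread.sleep(10);
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
        }
        running.decrementAndGet();
      });
    }
    run(mode, tasks);
    return peak.get();
  }
}
```

With this sketch, peakConcurrency(Mode.OWN_PARALLELIZATION, n) never exceeds 1, while the standard mode is bounded by the number of processors.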

Step execution process

  1. First, a Step object is created for each step with the values of the workflow file
  2. An instance of the requested Module is created for each step
  3. The Module.configure() method is called for each step
  4. The workflow gets the list of its requirements by calling the Module.getRequirements() method of all the step modules
  5. For each step, the workflow reads the values of Module.getInputPorts() and Module.getOutputPorts(), and creates links between steps
  6. The workflow starts, creates tasks for each step and then calls Module.execute() for each task
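The order of these calls can be illustrated with a toy module. The interface below only borrows the method names mentioned on this page; its signatures (a Map of parameters, lists of strings, a String input) are assumptions made for the sketch and differ from the real Module interface. LifecycleSketch, SketchModule, UppercaseModule and the "suffix" parameter are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch of the call order for one step; not Eoulsan code.
class LifecycleSketch {

  // Simplified stand-in for a module; all signatures are assumptions.
  interface SketchModule {
    void configure(Map<String, String> parameters);
    List<String> getRequirements();
    List<String> getInputPorts();
    List<String> getOutputPorts();
    String execute(String input);
  }

  // Toy module that records every call it receives.
  static final class UppercaseModule implements SketchModule {
    final List<String> calls = new ArrayList<>();
    private String suffix = "";

    public void configure(Map<String, String> parameters) {
      calls.add("configure");
      suffix = parameters.getOrDefault("suffix", "");
    }

    public List<String> getRequirements() {
      calls.add("getRequirements");
      return Collections.emptyList();
    }

    public List<String> getInputPorts() {
      calls.add("getInputPorts");
      return Collections.singletonList("input");
    }

    public List<String> getOutputPorts() {
      calls.add("getOutputPorts");
      return Collections.singletonList("output");
    }

    public String execute(String input) {
      calls.add("execute");
      return input.toUpperCase(Locale.ROOT) + suffix;
    }
  }

  // Drive one step: configure, collect requirements and ports, then run tasks.
  static List<String> runStep(SketchModule module,
      Map<String, String> parameters, List<String> inputs) {
    module.configure(parameters);       // item 3 of the list
    module.getRequirements();           // item 4
    module.getInputPorts();             // item 5
    module.getOutputPorts();
    List<String> results = new ArrayList<>();
    for (String input : inputs) {       // item 6: one execute() call per task
      results.add(module.execute(input));
    }
    return results;
  }
}
```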

StepConfigurationContext

The StepConfigurationContext interface gives access to the step configuration inside a Module.configure() method.

TaskContext

The TaskContext contains all the information of a StepConfigurationContext object. It also contains all the methods required to get the input and output data of a task inside a Module.execute() method.

TaskStatus

The TaskStatus is an interface used to inform the workflow about the progress of a task inside a Module.execute() method. The TaskStatus also contains the methods to create TaskResult objects.

TaskResult

A TaskResult object contains the results of a task. These objects are created using methods from the TaskStatus interface. A TaskResult can be successful or not; in the latter case it can contain an exception and its stack trace.
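The relation between a status object and the result it creates can be sketched as follows. The names and signatures are hypothetical (TaskResultSketch, Status, Result, setProgress, createResult) and are not copied from the Eoulsan interfaces; the sketch only shows a status reporting progress and producing either a successful result or a failed one carrying the exception.

```java
import java.io.PrintWriter;
import java.io.StringWriter;

// Hypothetical sketch of status and result objects; not the Eoulsan API.
class TaskResultSketch {

  static final class Result {
    final boolean success;
    final Throwable error; // null when the task succeeded

    Result(boolean success, Throwable error) {
      this.success = success;
      this.error = error;
    }

    // Stack trace of the failure as text, or an empty string on success.
    String stackTrace() {
      if (error == null) {
        return "";
      }
      StringWriter out = new StringWriter();
      error.printStackTrace(new PrintWriter(out, true));
      return out.toString();
    }
  }

  static final class Status {
    private double progress;

    // Progress is a fraction between 0.0 and 1.0.
    void setProgress(double value) {
      if (value < 0.0 || value > 1.0) {
        throw new IllegalArgumentException("progress must be in [0, 1]");
      }
      progress = value;
    }

    double getProgress() {
      return progress;
    }

    Result createResult() {
      return new Result(true, null);
    }

    Result createResult(Throwable error) {
      return new Result(false, error);
    }
  }

  // Run some work and turn its outcome into a result.
  static Result execute(Status status, Runnable work) {
    try {
      status.setProgress(0.0);
      work.run();
      status.setProgress(1.0);
      return status.createResult();
    } catch (RuntimeException e) {
      return status.createResult(e);
    }
  }
}
```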

Special modules

Eoulsan comes with modules dedicated to special tasks.

Generators

A generator is a module that can automatically generate some data if other data exist in the workflow. For example, if there is a genome in FASTA format in your workflow and there is a mapping step, a generator module will be added to the workflow to create the genome index for the mapper.

In Eoulsan 2.0-beta1, Eoulsan is bundled with the following generators:

  • GenomeDescriptionGeneratorModule: Generate a genome description file from a genome in FASTA format
  • GenomeMapperIndexGeneratorModule: Generate a mapper index from a genome in FASTA format
  • GFFFastaGeneratorModule: Generate a genome in FASTA format from a GFF3 file if it contains the genome sequence

Checkers

A checker is a class that checks whether a file in a given data format is valid. The checkers are executed by the checker step at the beginning of a workflow.

In Eoulsan 2.0-beta1, Eoulsan is bundled with the following checkers:

  • GenomeChecker: Check if the genome sequence is valid
  • GFFChecker: Check if the GFF3 annotation file is valid
  • ReadsChecker: Check if read files in FASTQ format are valid
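To give an idea of what a format check can look like, here is a minimal FASTQ validation sketch. It is not the ReadsChecker implementation: FastqCheckSketch and isValidFastq are hypothetical names, and the check assumes simple 4-line records where the sequence and quality strings are not wrapped over several lines.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of a FASTQ check; not the Eoulsan ReadsChecker.
class FastqCheckSketch {

  // Return true if the lines form well-formed 4-line FASTQ records.
  static boolean isValidFastq(List<String> lines) {
    if (lines.size() % 4 != 0) {
      return false; // a record is always 4 lines in this sketch
    }
    for (int i = 0; i < lines.size(); i += 4) {
      String header = lines.get(i);
      String sequence = lines.get(i + 1);
      String separator = lines.get(i + 2);
      String quality = lines.get(i + 3);
      if (!header.startsWith("@") || !separator.startsWith("+")) {
        return false;
      }
      // Each base must have exactly one quality character.
      if (sequence.isEmpty() || sequence.length() != quality.length()) {
        return false;
      }
    }
    return true;
  }

  public static void main(String[] args) {
    System.out.println(isValidFastq(Arrays.asList("@r1", "ACGT", "+", "IIII")));
    System.out.println(isValidFastq(Arrays.asList("@r1", "ACGT", "+", "II")));
  }
}
```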

Splitter and mergers

Galaxy tool