Jaze8 edited this page Jun 15, 2018 · 3 revisions

This page reviews the terms and concepts that you may come across when using Eoulsan.

Step core concepts

Module

A module is a class that contains the code to execute in a step. Several steps can use the same module.

Step

A step contains an instance of a module, has a name and some parameters, and is linked to other steps.

Step parameters

The step parameters are the parameters used to configure a step. They are passed to an instance of a module through the Module.configure() method.

Step ports

Data are transmitted between steps using ports. There are two port types: input ports and output ports. Steps are usually linked through their input and output ports.

Input and output ports are not mandatory for a step (e.g. the shell module has no port).

Workflow

A workflow contains all the steps to execute.

Task

A task is the execution unit of Eoulsan. Usually a task corresponds to the processing of one sample. For example, if the filterread step has 4 FASTQ files to process, 4 tasks will be created.

A task is executed in an instance of a module using the Module.execute() method.
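The "one task per sample" idea can be sketched in plain Java. This is only an illustration under stated assumptions: `TaskSketch` and `runStep` are hypothetical names, and the module code is reduced to a simple function instead of a real Eoulsan Module.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

// Hypothetical stand-in for illustration; not part of the Eoulsan API.
final class TaskSketch {

  /** Creates one task per input file and runs each task with the same module code. */
  static List<String> runStep(List<String> inputFiles, Function<String, String> moduleCode) {
    List<String> results = new ArrayList<>();
    for (String file : inputFiles) {
      // One task = one call of the module code on one sample
      results.add(moduleCode.apply(file));
    }
    return results;
  }

  public static void main(String[] args) {
    List<String> fastqFiles = Arrays.asList("s1.fq", "s2.fq", "s3.fq", "s4.fq");
    List<String> results = runStep(fastqFiles, file -> "filtered_" + file);
    System.out.println(results.size() + " tasks executed");
  }
}
```

With 4 input files, the loop body runs 4 times, which mirrors the 4 tasks of the filterread example above.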

Data concept

Data

A Data is an object that is transmitted between two steps (from an output port of one step to an input port of another).

A Data contains a file, a format and metadata.

For some Data formats (e.g. the FASTQ format), the data are divided into several files (e.g. there are 2 FASTQ files for paired-end reads).

DataMetadata

The metadata of a Data object contains information about the Data (e.g. the FASTQ format for a FASTQ file, or design information about the sample related to the data).

DataFile

There are many classes that handle files in the Java ecosystem: File (from Java 1.0), Path (from Hadoop) and Path (from Java 7).

Because the File class can only deal with files on the local file system, and because the Java Path class is not used in Hadoop, an abstraction layer over these classes was necessary to handle all the file systems easily.

In addition to supporting many protocols (file://, ftp://, http://, https://, hdfs://) and Eoulsan-specific protocols (gff://, gtf://, genome://, additionalannotation://), the DataFile class comes with several useful features:

  • The open() method creates an uncompressed InputStream if the linked filename contains a compression extension. To get the compressed stream, use the rawOpen() method.
  • The create() method creates a compressed OutputStream if the linked filename contains a compression extension. To get the raw stream, without automatic compression, use the rawCreate() method.
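The extension-based behaviour of open() and create() can be illustrated with the standard java.util.zip classes. The sketch below is not the DataFile implementation: `CompressionSketch` and its methods are hypothetical, and only the ".gz" extension is handled, as an assumption made for brevity.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical stand-in for illustration; this is not the Eoulsan DataFile class.
final class CompressionSketch {

  /** Wraps a raw input stream so that callers read uncompressed bytes. */
  static InputStream open(String filename, InputStream raw) throws IOException {
    return filename.endsWith(".gz") ? new GZIPInputStream(raw) : raw;
  }

  /** Wraps a raw output stream so that written bytes are compressed on the fly. */
  static OutputStream create(String filename, OutputStream raw) throws IOException {
    return filename.endsWith(".gz") ? new GZIPOutputStream(raw) : raw;
  }

  /** Returns the bytes actually stored when the text is written through create(). */
  static byte[] storedBytes(String filename, String text) {
    ByteArrayOutputStream storage = new ByteArrayOutputStream();
    try (OutputStream out = create(filename, storage)) {
      out.write(text.getBytes(StandardCharsets.UTF_8));
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return storage.toByteArray();
  }

  /** Writes the text through create(), then reads it back through open(). */
  static String roundTrip(String filename, String text) {
    byte[] stored = storedBytes(filename, text);
    try (InputStream in = open(filename, new ByteArrayInputStream(stored))) {
      ByteArrayOutputStream buffer = new ByteArrayOutputStream();
      byte[] chunk = new byte[4096];
      int n;
      while ((n = in.read(chunk)) != -1) {
        buffer.write(chunk, 0, n);
      }
      return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

The caller never deals with compression explicitly: the filename alone decides whether the stream is wrapped, which is the convenience the two bullet points describe.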

Protocol

A Protocol in Eoulsan is a class that defines how to read and write files on a file system. The currently supported protocols are:

  • file://
  • http://
  • https://
  • hdfs://
  • s3://
  • ftp:// (read-only)

In addition, Eoulsan contains some storage protocols that avoid data duplication by using a central repository for:

  • genomes: genome://
  • gff annotations: gff://
  • gtf annotations: gtf://
  • additional annotations: additionalannotations://

A protocol may or may not support each of the following features:

  • read data
  • write data
  • create symbolic link
  • list directory
  • create directory
  • delete file or directory
  • rename file or directory
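One way to picture optional protocol features is a set of capability flags that is checked before an operation is attempted. The sketch below uses hypothetical names (`ProtocolFeatureSketch`, `Feature`, `requireFeature`) and assumes, for illustration only, that a read-only protocol supports nothing but reading; it does not describe the actual Eoulsan Protocol API.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical names for illustration; not the Eoulsan Protocol API.
final class ProtocolFeatureSketch {

  /** The optional features listed above. */
  enum Feature { READ, WRITE, SYMLINK, LIST, MKDIR, DELETE, RENAME }

  /** Assumed feature set of a read-only protocol. */
  static Set<Feature> readOnlyFeatures() {
    return EnumSet.of(Feature.READ);
  }

  /** Assumed feature set of a protocol that supports every feature. */
  static Set<Feature> allFeatures() {
    return EnumSet.allOf(Feature.class);
  }

  /** Fails early when a protocol does not support the requested feature. */
  static void requireFeature(String protocol, Set<Feature> supported, Feature needed) {
    if (!supported.contains(needed)) {
      throw new UnsupportedOperationException(
          protocol + " does not support " + needed);
    }
  }
}
```

Checking the flag first lets calling code report a clear error instead of failing halfway through an operation.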

Other concepts

Requirement

A requirement is a feature that is necessary for a module to execute a task. A requirement can be, for example, a Docker image, an executable in the PATH or a functional RServe server. A requirement can be optional.

A module informs the workflow of its requirements by overriding the Module.getRequirements() method.

Parallelization mode

The parallelization mode is the method used by the workflow engine to schedule the execution of the tasks of a module. There are 3 parallelization modes:

  • standard: the maximum number of tasks running at one time is equal to the number of processors of the system.
  • own parallelization: in this mode, the workflow waits for all running tasks to end before executing a task. Only one task is executed at a time. This mode is usually used for read mappers (e.g. Bowtie) because these tools are multi-threaded and can use all the available cores of the system.
  • no parallelization: this internal mode allows a task to be executed immediately. This mode can be used for very short tasks (a few seconds).
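The three modes can be summarised as a rule that decides whether a new task may start. This is a minimal sketch of the descriptions above, not the Eoulsan scheduler; `ParallelizationSketch`, `Mode` and `canStart` are hypothetical names.

```java
// Hypothetical names for illustration; not the Eoulsan scheduler.
final class ParallelizationSketch {

  enum Mode { STANDARD, OWN_PARALLELIZATION, NO_PARALLELIZATION }

  /**
   * Tells whether a new task may start, given the number of tasks
   * already running and the number of processors of the system.
   */
  static boolean canStart(Mode mode, int runningTasks, int processors) {
    switch (mode) {
      case STANDARD:
        // At most one running task per processor
        return runningTasks < processors;
      case OWN_PARALLELIZATION:
        // Wait until nothing else is running, then run alone
        return runningTasks == 0;
      default:
        // NO_PARALLELIZATION: the task is executed immediately
        return true;
    }
  }
}
```

For example, on an 8-processor system with 3 tasks running, a standard task may start, while an own-parallelization task such as a multi-threaded mapper has to wait.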

Step execution process

  1. A Step object is created for each step from the values in the workflow file
  2. An instance of the requested Module is created for each step
  3. The Module.configure() method is called for each step
  4. The workflow collects its list of requirements by calling the Module.getRequirements() method of the module of every step
  5. For each step, the workflow reads the values of Module.getInputPorts() and Module.getOutputPorts(), and creates the links between steps
  6. The workflow starts, creates the tasks of each step and calls Module.execute() for each task
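The order of the calls in steps 3 to 6 can be sketched with a simplified interface. The method names mirror those used in the list above, but the signatures, the `SketchModule` interface and the `PrefixModule` class are assumptions made for this example and do not reproduce the real Eoulsan Module interface.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Hypothetical types for illustration; signatures are assumed, not Eoulsan's.
final class StepLifecycleSketch {

  /** Simplified stand-in for a module. */
  interface SketchModule {
    void configure(Map<String, String> stepParameters);
    List<String> getRequirements();
    List<String> getInputPorts();
    List<String> getOutputPorts();
    String execute(String sample);
  }

  /** Example module that prepends a configurable prefix to the sample name. */
  static final class PrefixModule implements SketchModule {
    private String prefix = "";

    public void configure(Map<String, String> stepParameters) {
      prefix = stepParameters.getOrDefault("prefix", "");
    }
    public List<String> getRequirements() { return Collections.emptyList(); }
    public List<String> getInputPorts() { return Collections.singletonList("reads"); }
    public List<String> getOutputPorts() { return Collections.singletonList("filteredreads"); }
    public String execute(String sample) { return prefix + sample; }
  }

  /** Calls the module in the order described above and records each call. */
  static List<String> runStep(SketchModule module, Map<String, String> parameters,
      List<String> samples) {
    List<String> log = new ArrayList<>();
    module.configure(parameters);                          // step 3
    log.add("configure");
    log.add("requirements=" + module.getRequirements());   // step 4
    log.add("ports=" + module.getInputPorts() + "->" + module.getOutputPorts()); // step 5
    for (String sample : samples) {                        // step 6: one task per sample
      log.add("execute:" + module.execute(sample));
    }
    return log;
  }
}
```

The point of the sketch is the ordering: configuration always happens before requirements and ports are read, and execution only starts once the step is fully wired.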

StepConfigurationContext

The StepConfigurationContext interface allows access to the step configuration inside a Module.configure() method.

TaskContext

The TaskContext contains all the information of a StepConfigurationContext object. It also provides all the methods required to get the input and output data of a task inside a Module.execute() method.

TaskStatus

The TaskStatus is an interface used to inform the workflow about the progress of a task inside a Module.execute() method. The TaskStatus also contains the methods to create TaskResult objects.

TaskResult

The TaskResult object contains the results of a task. These objects are created using methods from the TaskStatus interface. A TaskResult can be successful or not; in the latter case it can contain an exception and its stack trace.

Special modules

Eoulsan comes with modules dedicated to special tasks.

Generators

A generator is a module that can automatically generate some data if other data exist in the workflow. For example, if there is a genome in FASTA format in your workflow and there is a mapping step, a generator module will be added to the workflow to create the genome index for the mapper.

In Eoulsan 2.0-beta1, Eoulsan is bundled with the following generators:

  • GenomeDescriptionGeneratorModule: Generate a genome description file from a genome in FASTA format
  • GenomeMapperIndexGeneratorModule: Generate a mapper index from a genome in FASTA format
  • GFFFastaGeneratorModule: Generate a genome in FASTA format from a GFF3 file if it contains the genome sequence

Checkers

A checker is a class that checks whether a file in a given data format is valid. The checkers are executed by the checker step at the beginning of a workflow.

In Eoulsan 2.0-beta1, Eoulsan is bundled with the following checkers:

  • GenomeChecker: Check if the genome sequence is valid
  • GFFChecker: Check if the GFF3 annotation file is valid
  • ReadsChecker: Check if reads files in FASTQ are valid
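To give an idea of what a format check can look like, here is a minimal FASTQ validity test in plain Java. It is not the logic of ReadsChecker: `FastqCheckSketch` is a hypothetical class that only verifies the basic four-line record structure, and it treats empty input as invalid, which is an assumption of this example.

```java
import java.util.List;

// Hypothetical class for illustration; not the Eoulsan ReadsChecker.
final class FastqCheckSketch {

  /** Checks that the lines form complete, well-structured FASTQ records. */
  static boolean isValid(List<String> lines) {
    // A FASTQ record is made of exactly 4 lines
    if (lines.isEmpty() || lines.size() % 4 != 0) {
      return false;
    }
    for (int i = 0; i < lines.size(); i += 4) {
      String header = lines.get(i);
      String sequence = lines.get(i + 1);
      String separator = lines.get(i + 2);
      String quality = lines.get(i + 3);
      // Header starts with '@', separator with '+', and there is
      // one quality character per base
      if (!header.startsWith("@")
          || !separator.startsWith("+")
          || sequence.length() != quality.length()) {
        return false;
      }
    }
    return true;
  }
}
```

A check of this kind is cheap compared with a mapping step, which is why running it at the beginning of a workflow can catch malformed input before expensive tasks start.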

Splitters and mergers

Galaxy tool