Writing Data Format

Laurent Jourdren edited this page Mar 26, 2015 · 2 revisions

How to write a DataFormat

WARNING: This documentation is outdated and will soon be updated.

Introduction

DataFormat is an interface that describes the data to process. A DataFormat is always related to a DataType. For example, a read mapper can write its output as a SAM or a BAM file. The content of the two files is exactly the same, but the format differs. In Eoulsan terms, the two outputs have the same DataType but not the same DataFormat.

Thanks to the distinction between DataType and DataFormat, an Eoulsan Step can accept as input several DataFormats that share the same DataType.
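The relationship between the two concepts can be illustrated with a minimal, self-contained sketch. The `Type` and `Format` enums below are hypothetical stand-ins and not Eoulsan's real DataType and DataFormat classes; they only show how a step that asks for one type can accept every format of that type.

```java
/** Illustrative sketch only: hypothetical stand-ins, not the Eoulsan API. */
final class TypeVersusFormat {

  /** Hypothetical stand-in for DataType: what the data is. */
  enum Type { READS, MAPPER_RESULTS }

  /** Hypothetical stand-in for DataFormat: how the data is stored. */
  enum Format {
    FASTQ(Type.READS),
    SAM(Type.MAPPER_RESULTS),
    BAM(Type.MAPPER_RESULTS);

    final Type type;

    Format(Type type) {
      this.type = type;
    }
  }

  private TypeVersusFormat() {}

  /** A step declares the type it needs; any format of that type is accepted. */
  static boolean accepts(Type required, Format candidate) {
    return candidate.type == required;
  }

  public static void main(String[] args) {
    // SAM and BAM share a type, so both satisfy a step that needs mapper results.
    for (Format format : Format.values()) {
      System.out.println(format + " accepted: " + accepts(Type.MAPPER_RESULTS, format));
    }
  }
}
```

In this sketch SAM and BAM are both accepted by a step requiring mapper results, while FASTQ is rejected because it belongs to a different type.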

Note: Currently DataType and DataFormat are implemented as handwritten Java classes, but in a future version of Eoulsan these objects will be defined using XML files.

The DataType object

A DataType object contains the following information:

  • A name
  • A description
  • A file prefix
  • A flag to indicate if the data is related to the whole analysis (e.g. genome sequence file) or if there is a file of the DataType for each sample (e.g. a file that contains reads of a sample)
  • A flag to indicate if the DataType can be provided by the design file
  • The name of the column in the design file that provides the DataType

The following code defines a DataType for reads:

/** Reads datatype. **/
  public static final DataType READS = new AbstractDataType() {

    @Override
    public String getName() {

      return "reads";
    }

    @Override
    public String getPrefix() {

      return "reads_";
    }

    @Override
    public String getDesignFieldName() {

      return SampleMetadata.READS_FIELD;
    }

    @Override
    public boolean isDataTypeFromDesignFile() {

      return true;
    }

  };

AbstractDataType provides default values for some fields and implements the equals(), hashCode() and toString() methods.

Warning: There must be only one instance of a given DataType in the JVM, otherwise the Eoulsan workflow may misbehave. That is why, in the previous sample code, the instance is created using an anonymous class. Using XML files to describe DataTypes in a future version of Eoulsan will solve this issue.

The DataFormat object

The DataFormat contains the following information:

  • A name
  • A DataType
  • A content type (a MIME value, optional)
  • A default extension
  • A list of the known extensions for the format
  • A flag to indicate if a Generator is available for this format. A Generator is a Step class that generates the file for the DataFormat if it does not exist. For example, genome indexes for mappers are created using a Generator.
  • A method that returns the Generator instance, if one exists
  • A flag to indicate if a Checker is available for this format. A Checker is a class that checks the content of a DataFormat file.
  • A method that returns the Checker instance, if one exists
  • The maximum number of files for the DataFormat. This value is always 1 except for the FASTQ format, because paired-end data is spread over two files, one for each end of the cluster.

The following code defines a DataFormat for FASTQ files:

public final class ReadsFastqDataFormat extends AbstractDataFormat {

  public static final String FORMAT_NAME = "reads_fastq";

  public DataType getType() {

    return DataTypes.READS;
  }

  @Override
  public String getDefaultExtention() {

    return ".fq";
  }

  @Override
  public String[] getExtensions() {

    return new String[] {".fq", ".fastq"};
  }

  @Override
  public String getFormatName() {

    return FORMAT_NAME;
  }

  @Override
  public boolean isChecker() {

    return true;
  }

  @Override
  public Checker getChecker() {

    return new ReadsChecker();
  }

  @Override
  public int getMaxFilesCount() {

    return 2;
  }

}

The AbstractDataFormat class only provides default implementations of the methods of the DataFormat interface.

Developers can define their own DataFormat in a plug-in for Eoulsan. Therefore, instances of DataFormat classes must be created using the DataFormatService. As with DataType, only one instance of each DataFormat class can exist without consequences for the Eoulsan workflow, so developers must use the following line to create a constant holding the instance of a DataFormat:

  DataFormat READS_FASTQ = registry
      .getDataFormatFromName("reads_fastq");

Note: Like DataType, DataFormat will be created from XML files in a future version of Eoulsan, which avoids the unique-instance issue.

Registering a DataFormat

As the use of the DataFormatService service class is mandatory with DataFormat, every DataFormat class must be registered by adding the fully qualified name of the class to the fr.ens.transcriptome.eoulsan.data.DataFormat text file in the META-INF/services directory. See the Writing Step Plugin page for more information:

fr.ens.transcriptome.eoulsan.data.ReadsFastqDataFormat
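The META-INF/services layout described above is the one read by the standard java.util.ServiceLoader class. Assuming Eoulsan relies on that mechanism (the page does not say so explicitly), discovery of registered implementations can be sketched as follows. The `Format` interface and the `ServiceDiscoverySketch` class are hypothetical stand-ins, not Eoulsan code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.ServiceLoader;

/** Illustrative sketch only: shows the generic JDK service lookup, not Eoulsan internals. */
final class ServiceDiscoverySketch {

  /** Hypothetical stand-in for the DataFormat interface. */
  interface Format {
    String getFormatName();
  }

  private ServiceDiscoverySketch() {}

  /**
   * Collect the names of every implementation listed in a
   * META-INF/services file named after the interface.
   */
  static List<String> discoverFormatNames() {
    List<String> names = new ArrayList<>();
    // ServiceLoader instantiates each class listed in the services file.
    for (Format format : ServiceLoader.load(Format.class)) {
      names.add(format.getFormatName());
    }
    return names;
  }

  public static void main(String[] args) {
    // The list is empty unless a services file registers an implementation.
    System.out.println(discoverFormatNames());
  }
}
```

A class that is missing from the services file is simply never found by this lookup, which is why the registration step cannot be skipped.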

Naming data files in an Eoulsan workflow

If the file path is not given in the design file, Eoulsan looks for a file whose name is built from three pieces of information taken from a DataFormat and a Sample object:

  • The prefix of the DataType
  • The sample number in the design file
  • The suffix of the DataFormat

For example, for a reads_fastq DataFormat and sample number 3 of the design file, Eoulsan will look for reads_3.fq or its compressed versions (reads_3.fq.gz, reads_3.fq.bz2).
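The naming rule can be sketched in a few lines of Java. This is not Eoulsan's actual implementation: the `DataFileNames` class and its `candidates` method are hypothetical, and the list of compression suffixes is limited to the two mentioned above.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch of the naming rule; not Eoulsan's actual code. */
final class DataFileNames {

  // Suffixes taken from the example in the text; "" stands for the uncompressed file.
  private static final String[] COMPRESSION_SUFFIXES = {"", ".gz", ".bz2"};

  private DataFileNames() {}

  /**
   * Build the file names to look for, from the DataType prefix,
   * the sample number and the DataFormat default extension.
   */
  static List<String> candidates(String typePrefix, int sampleNumber, String formatExtension) {
    String base = typePrefix + sampleNumber + formatExtension;
    List<String> result = new ArrayList<>();
    for (String suffix : COMPRESSION_SUFFIXES) {
      result.add(base + suffix);
    }
    return result;
  }

  public static void main(String[] args) {
    // Prints [reads_3.fq, reads_3.fq.gz, reads_3.fq.bz2]
    System.out.println(candidates("reads_", 3, ".fq"));
  }
}
```

The prefix "reads_" and the extension ".fq" are the values returned by getPrefix() and getDefaultExtention() in the sample classes earlier on this page.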