Writing Data Format
WARNING: This documentation is outdated and will soon be updated.
DataFormat is an interface that defines the data to process. A DataFormat is always related to a DataType. For example, a read mapper can generate either a SAM or a BAM file as output. The content of the file is exactly the same, but the format is different. In Eoulsan terms, the two outputs have the same DataType but not the same DataFormat.
Thanks to the distinction between DataType and DataFormat, an Eoulsan Step can accept as input several DataFormats that share the same DataType.
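The SAM/BAM relationship described above can be sketched with two small enums. This is only an illustration: SketchType, SketchFormat and TypeVersusFormatSketch are hypothetical names that do not exist in Eoulsan, and the real DataType and DataFormat interfaces carry more information than shown here.

```java
import java.util.EnumSet;
import java.util.Set;

/** Hypothetical stand-in for a DataType: what the data is. */
enum SketchType { READS, MAPPER_RESULTS }

/** Hypothetical stand-in for a DataFormat: how the data is stored. */
enum SketchFormat {
  MAPPER_RESULTS_SAM(SketchType.MAPPER_RESULTS),
  MAPPER_RESULTS_BAM(SketchType.MAPPER_RESULTS),
  READS_FASTQ(SketchType.READS);

  private final SketchType type;

  SketchFormat(SketchType type) {
    this.type = type;
  }

  SketchType getType() {
    return type;
  }
}

/** Shows how a step could accept every format that shares one type. */
final class TypeVersusFormatSketch {

  /** Returns all formats related to the given type. */
  static Set<SketchFormat> formatsOf(SketchType type) {
    Set<SketchFormat> result = EnumSet.noneOf(SketchFormat.class);
    for (SketchFormat format : SketchFormat.values()) {
      if (format.getType() == type) {
        result.add(format);
      }
    }
    return result;
  }

  public static void main(String[] args) {
    // A step that declares MAPPER_RESULTS as input can take SAM or BAM.
    System.out.println(formatsOf(SketchType.MAPPER_RESULTS));
  }
}
```

A step written against the type rather than a single format does not need to change when a new format of that type is added.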
Note: Currently, DataType and DataFormat are implemented as handwritten Java classes, but in a future version of Eoulsan these objects will be defined using XML files.
A DataType object contains the following information:
- A name
- A description
- A file prefix
- A flag that indicates whether the data relates to the whole analysis (e.g. a genome sequence file) or whether there is one file of the DataType for each sample (e.g. a file that contains the reads of a sample)
- A flag that indicates whether the DataType can be provided by the design file
- The name of the column in the design file that provides the DataType
The following code defines a DataType for reads:
/** Reads datatype. */
public static final DataType READS = new AbstractDataType() {

  @Override
  public String getName() {
    return "reads";
  }

  @Override
  public String getPrefix() {
    return "reads_";
  }

  @Override
  public String getDesignFieldName() {
    return SampleMetadata.READS_FIELD;
  }

  @Override
  public boolean isDataTypeFromDesignFile() {
    return true;
  }
};
AbstractDataType provides default values for some fields and implementations of the equals(), hashCode() and toString() methods.
Warning: There must be only one instance of a given DataType in the JVM; otherwise the Eoulsan workflow may behave incorrectly. That is why, in the previous sample code, the instance is created using an anonymous class. Using XML files to describe DataTypes in a future version of Eoulsan will solve this issue.
A DataFormat contains the following information:
- A name
- A DataType
- A content type (a MIME value, optional)
- A default extension
- A list of the known extensions for the format
- A flag that indicates whether a Generator is available for this format. A Generator is a Step class that generates the file for the DataFormat if it does not exist. For example, genome indexes for mappers are created using a Generator.
- A method that returns the Generator instance, if one exists
- A flag that indicates whether a Checker is available for this format. A Checker is a class that checks the content of a DataFormat file.
- A method that returns the Checker instance, if one exists
- The maximum number of files for the DataFormat. This value is always 1 except for the FASTQ format, because paired-end data is spread over two files, one for each end of the cluster.
The following code defines a DataFormat for FASTQ files:
public final class ReadsFastqDataFormat extends AbstractDataFormat {

  public static final String FORMAT_NAME = "reads_fastq";

  @Override
  public DataType getType() {
    return DataTypes.READS;
  }

  @Override
  public String getDefaultExtention() {
    return ".fq";
  }

  @Override
  public String[] getExtensions() {
    return new String[] {".fq", ".fastq"};
  }

  @Override
  public String getFormatName() {
    return FORMAT_NAME;
  }

  @Override
  public boolean isChecker() {
    return true;
  }

  @Override
  public Checker getChecker() {
    return new ReadsChecker();
  }

  // Paired-end data is stored in two files, one for each end.
  @Override
  public int getMaxFilesCount() {
    return 2;
  }
}
The AbstractDataFormat class only defines default methods for the implementation of the DataFormat interface.
Developers can define their own DataFormat in a plug-in for Eoulsan. Therefore, instances of DataFormat classes must be created using the DataFormatService. As with DataType, only one instance of each DataFormat class can be created without consequences for the Eoulsan workflow, so developers must use the following line to create a constant holding the instance of a DataFormat:
DataFormat READS_FASTQ = registry.getDataFormatFromName("reads_fastq");
Note: Like DataType, DataFormat will be created from XML files in a future version of Eoulsan, which will avoid the unique-instance issue.
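The idea of obtaining a single shared instance by name can be sketched as follows. This is a minimal illustration under stated assumptions, not the Eoulsan implementation: NamedFormat and FormatRegistrySketch are hypothetical classes, and only the method name getDataFormatFromName is borrowed from the line above.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical, minimal stand-in for a DataFormat. */
final class NamedFormat {

  private final String name;

  NamedFormat(String name) {
    this.name = name;
  }

  String getFormatName() {
    return name;
  }
}

/** Hypothetical registry; it is not the Eoulsan DataFormatService. */
final class FormatRegistrySketch {

  private final Map<String, NamedFormat> formats = new HashMap<>();

  /** Registers a format and refuses a second instance with the same name. */
  void register(NamedFormat format) {
    NamedFormat previous = formats.putIfAbsent(format.getFormatName(), format);
    if (previous != null) {
      throw new IllegalStateException(
          "Format already registered: " + format.getFormatName());
    }
  }

  /** Returns the unique instance for a name, or null if the name is unknown. */
  NamedFormat getDataFormatFromName(String name) {
    return formats.get(name);
  }

  public static void main(String[] args) {
    FormatRegistrySketch registry = new FormatRegistrySketch();
    registry.register(new NamedFormat("reads_fastq"));

    // Every lookup returns the same object, so identity comparisons are safe.
    NamedFormat first = registry.getDataFormatFromName("reads_fastq");
    NamedFormat second = registry.getDataFormatFromName("reads_fastq");
    System.out.println(first == second);
  }
}
```

Because callers never invoke the constructor themselves, every part of the program that asks for a given name shares one object.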
Because using the DataFormatService class is mandatory with DataFormat, all DataFormat classes must be registered by adding the fully qualified name of the class to the fr.ens.transcriptome.eoulsan.data.DataFormat text file in the META-INF/services directory, as in the following entry (see the Writing Step Plugin page for more information):
fr.ens.transcriptome.eoulsan.data.ReadsFastqDataFormat
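The META-INF/services layout is the convention read by the standard java.util.ServiceLoader class. Assuming Eoulsan relies on that mechanism (this page does not say so explicitly), discovery of registered classes could look like the sketch below. DiscoverableFormat and ServiceDiscoverySketch are hypothetical names used only for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.ServiceLoader;

/** Hypothetical service interface standing in for DataFormat. */
interface DiscoverableFormat {
  String getFormatName();
}

final class ServiceDiscoverySketch {

  /**
   * Lists the names of all implementations declared in a
   * META-INF/services file named after the interface.
   * The list is empty when no such file is on the classpath.
   */
  static List<String> discoverFormatNames() {
    List<String> names = new ArrayList<>();
    for (DiscoverableFormat format : ServiceLoader.load(DiscoverableFormat.class)) {
      names.add(format.getFormatName());
    }
    return names;
  }

  public static void main(String[] args) {
    System.out.println(discoverFormatNames());
  }
}
```

With this mechanism, a class that is missing from the services file is simply never instantiated, which is a common cause of a plug-in format not being found.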
If the file path is not given in the design file, Eoulsan looks for a file whose name is built from three pieces of information taken from a DataFormat and a Sample object:
- The prefix of the DataType
- The sample number in the design file
- The suffix of the DataFormat
For example, for the reads_fastq DataFormat and sample number 3 of the design file, Eoulsan will look for reads_3.fq or its compressed versions (reads_3.fq.gz, reads_3.fq.bz2).
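The naming rule above can be sketched as a small helper. DataFileNameSketch and candidateFileNames are hypothetical names, not part of Eoulsan, and the list of compression suffixes contains only the two mentioned in the example; the real lookup may support others.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper that builds the file names to look for. */
final class DataFileNameSketch {

  // Uncompressed first, then the compressed variants named in the example.
  private static final String[] COMPRESSION_SUFFIXES = {"", ".gz", ".bz2"};

  /**
   * Builds candidate names from the DataType prefix, the sample number
   * in the design file and the default extension of the DataFormat.
   */
  static List<String> candidateFileNames(
      String typePrefix, int sampleNumber, String formatExtension) {
    List<String> names = new ArrayList<>();
    for (String compression : COMPRESSION_SUFFIXES) {
      names.add(typePrefix + sampleNumber + formatExtension + compression);
    }
    return names;
  }

  public static void main(String[] args) {
    // Prefix "reads_", sample number 3, default extension ".fq".
    System.out.println(candidateFileNames("reads_", 3, ".fq"));
  }
}
```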