Laurent Jourdren edited this page Apr 11, 2017 · 3 revisions

The DataFile class

WARNING: This documentation is outdated and will soon be updated.

The DataFile class is one of the main classes to use when developing with Eoulsan. As the Hadoop Path class cannot be used in local mode, the DataFile class defines an abstraction layer over the File class (in local mode) and the Path class (in Hadoop mode) that can be used in any mode.

The DataFile class has several advantages:

  • Many protocols are supported
  • Easy creation of InputStream and OutputStream
  • Automatic decompression of data (with '.gz' or '.bz2' extension)
  • Many methods to manipulate the path (extension, compression extension...)

Protocols available in vanilla Eoulsan:

  • file:// (default protocol)
  • ftp://
  • http://
  • s3://
  • hdfs:// (only in Hadoop mode)
  • annotation://
  • genome://

The implementation used for a protocol is not always the same in local and Hadoop mode. For example, the s3 protocol is implemented using the Amazon Java SDK in local mode, while in Hadoop mode it uses the built-in support of the Hadoop Path class.
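To make the role of the protocol prefix concrete, here is a minimal sketch that uses only the Java standard library. The class ProtocolSketch and its method protocolOf are hypothetical names introduced for illustration and are not part of the Eoulsan API. The only behaviour taken from the list above is that file is the default protocol when no prefix is given.

```java
import java.util.Locale;

/** Hypothetical helper, not part of Eoulsan: extracts the protocol of a source string. */
final class ProtocolSketch {

  private ProtocolSketch() {}

  /** Returns the protocol prefix of the source, or "file" when no prefix is present. */
  static String protocolOf(final String source) {
    final int index = source.indexOf("://");
    if (index <= 0) {
      return "file"; // assumption from the text: file is the default protocol
    }
    return source.substring(0, index).toLowerCase(Locale.ROOT);
  }

  public static void main(final String[] args) {
    System.out.println(protocolOf("/home/jourdren/foo.txt")); // file
    System.out.println(protocolOf("file:///home/jourdren/foo.txt")); // file
    System.out.println(protocolOf("hdfs://localhost/dir/foo.txt")); // hdfs
  }
}
```

A real implementation would then look up a handler for the returned name, which is where the local and Hadoop modes could plug in different implementations for the same protocol.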

The Javadoc of the DataFile class is available here.

Creating DataFile object

A DataFile works like a File object and, like File, it is immutable.


DataFile f1 = new DataFile("foo.txt"); // a local file in current directory
DataFile f2 = new DataFile("/home/jourdren/foo.txt"); // Same file
DataFile f3 = new DataFile("file:///home/jourdren/foo.txt"); // Same file
DataFile f4 = new DataFile("hdfs://localhost/dir/foo.txt"); // a file on hdfs
DataFile f5 = new DataFile(new File("foo.txt")); // Create a DataFile object from a File object

You can also create a DataFile from another DataFile when the parent DataFile is a directory:

DataFile dir = new DataFile("/home/jourdren");
DataFile file = new DataFile(dir, "foo.txt");

A common way to create DataFile objects is to use the getInputDataFile and getOutputDataFile methods of the Context object. To use these methods, you must provide a DataFormat and a Sample object. Eoulsan will automatically find the file that matches the request:

DataFile fastqFile = context.getInputDataFile(DataFormats.READS_FASTQ, sample);

Operation on DataFile path

DataFile file = new DataFile("hdfs://localhost/dir/foo.txt.gz");
file.getParent(); // == new DataFile("hdfs://localhost/dir");
file.getName(); // "foo.txt.gz"
file.getBasename(); // "foo.txt"
file.getCompressionExtension(); // ".gz"
file.getCompressionType(); // CompressionType.GZIP
file.getExtension(); // ".txt"
file.getFullExtension(); // ".txt.gz"
file.getProtocolName(); // PathDataProtocol object
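
The sketch below shows one way the name and extension values above could be derived from a source string, using only the Java standard library. PathPartsSketch and its methods are hypothetical and do not reproduce the Eoulsan implementation. The set of compression extensions (".gz" and ".bz2") is taken from the list of advantages at the top of this page.

```java
/** Hypothetical helper, not part of Eoulsan: derives name parts from a source string. */
final class PathPartsSketch {

  // Compression extensions mentioned on this page
  private static final String[] COMPRESSION_EXTENSIONS = {".gz", ".bz2"};

  private PathPartsSketch() {}

  /** Returns the last path element, e.g. "foo.txt.gz". */
  static String name(final String source) {
    return source.substring(source.lastIndexOf('/') + 1);
  }

  /** Returns the compression extension, or an empty string if there is none. */
  static String compressionExtension(final String source) {
    final String name = name(source);
    for (final String ext : COMPRESSION_EXTENSIONS) {
      if (name.endsWith(ext)) {
        return ext;
      }
    }
    return "";
  }

  /** Returns the extension that precedes the compression extension, e.g. ".txt". */
  static String extension(final String source) {
    final String name = name(source);
    final String uncompressedName =
        name.substring(0, name.length() - compressionExtension(source).length());
    final int dot = uncompressedName.lastIndexOf('.');
    return dot == -1 ? "" : uncompressedName.substring(dot);
  }

  /** Returns the extension followed by the compression extension, e.g. ".txt.gz". */
  static String fullExtension(final String source) {
    return extension(source) + compressionExtension(source);
  }
}
```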

DataFile and streams

Like the standard Java API, Eoulsan mainly uses InputStream and OutputStream to read and write data. The DataFile object has methods to easily create these streams. Note that the open() method automatically uncompresses the data.


DataFile file1 = new DataFile("hdfs://localhost/dir/foo.txt");
OutputStream os = file1.create(); // Create an output stream from the DataFile

DataFile file2 = new DataFile("hdfs://localhost/dir/bar.txt.gz");
InputStream is1 = file2.open(); // Create an InputStream. The stream is already uncompressed
InputStream is2 = file2.rawOpen(); // Create an InputStream. The stream is compressed (must uncompress manually)
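
To illustrate the difference between a raw stream and an automatically uncompressed one, here is a minimal sketch built only on the Java standard library. OpenSketch, its open and rawOpen methods, and the roundTrip helper are hypothetical and work on local java.nio.file.Path objects rather than on DataFile. Only gzip is handled, because the JDK has no built-in bzip2 support.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Hypothetical local-file analogue of open() and rawOpen(), not the Eoulsan code. */
final class OpenSketch {

  private OpenSketch() {}

  /** Returns the bytes exactly as stored on disk. */
  static InputStream rawOpen(final Path path) throws IOException {
    return Files.newInputStream(path);
  }

  /** Returns the content, uncompressed when the file name ends with ".gz". */
  static InputStream open(final Path path) throws IOException {
    final InputStream raw = rawOpen(path);
    if (!path.getFileName().toString().endsWith(".gz")) {
      return raw;
    }
    try {
      return new GZIPInputStream(raw);
    } catch (IOException e) {
      raw.close(); // do not leak the raw stream if the gzip header is invalid
      throw e;
    }
  }

  /** Writes text to a temporary ".txt.gz" file and reads it back through open(). */
  static String roundTrip(final String text) {
    try {
      final Path tmp = Files.createTempFile("open-sketch", ".txt.gz");
      try {
        try (OutputStream os = new GZIPOutputStream(Files.newOutputStream(tmp))) {
          os.write(text.getBytes(StandardCharsets.UTF_8));
        }
        try (InputStream is = open(tmp)) {
          return new String(is.readAllBytes(), StandardCharsets.UTF_8);
        }
      } finally {
        Files.deleteIfExists(tmp);
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

Choosing the decompressor from the file extension keeps the caller's code identical for compressed and uncompressed inputs, which is the convenience described above.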

Other methods

If the DataFile uses a protocol that wraps a File object, you can convert the DataFile to a File object. This is useful when writing Steps dedicated to local mode.

dataFile.toFile(); // Get the underlying File object

The isLocalFile() method can also be used to test whether the DataFile uses the file protocol.

As with a File object, the file referenced by a DataFile may not exist on the file system. To test whether the file exists, use the exists() method:

dataFile.exists();

Handling metadata

Calling getMetadata() on a DataFile object is similar to calling the POSIX stat() function on a file. The DataFileMetaData object contains many methods to get information about the file:

  • getContentLength() get the length of the file
  • getContentEncoding() get the content encoding of the file
  • getContentType() get the content type
  • getContentMD5() get the MD5 sum of the file
  • getLastModified() get the time of the last modification, in seconds since the epoch (1 January 1970)
  • isDir() test if the DataFile is a directory
  • getDataFormat() get the DataFormat of the DataFile

Note that these methods can return -1 or null if the information is not available.
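
Because missing values are reported as -1 or null, calling code has to check each result before using it. The sketch below shows one possible way to make that check explicit with the standard Optional types. MetadataValuesSketch and its methods are hypothetical and are not part of Eoulsan. They take raw values as parameters, so the example makes no assumption about the metadata class itself.

```java
import java.util.Optional;
import java.util.OptionalLong;

/** Hypothetical helper, not part of Eoulsan: wraps "not available" metadata values. */
final class MetadataValuesSketch {

  private MetadataValuesSketch() {}

  /** Treats any negative value (such as -1) as "not available". */
  static OptionalLong numericValue(final long rawValue) {
    return rawValue < 0 ? OptionalLong.empty() : OptionalLong.of(rawValue);
  }

  /** Treats null as "not available". */
  static Optional<String> textValue(final String rawValue) {
    return Optional.ofNullable(rawValue);
  }
}
```

With such wrappers, a caller can write numericValue(length).orElse(0L) or textValue(type).orElse("unknown") instead of comparing against sentinel values at every call site.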