Data File Class
WARNING: This documentation is outdated and will soon be updated.
The DataFile class is one of the main classes used when developing with Eoulsan. As the Hadoop Path class cannot be used in local mode, the DataFile class defines an abstraction layer over the File class (in local mode) and the Path class (in Hadoop mode) that can be used in any mode.
The DataFile class has several advantages:
- Many protocols are supported
- Easy creation of InputStream and OutputStream objects
- Automatic decompression of data (with '.gz' or '.bz2' extension)
- Many methods to manipulate the path (extension, compression extension...)
Protocols available in vanilla Eoulsan:
- file:// (default protocol)
- ftp://
- http://
- s3://
- hdfs:// (only in Hadoop mode)
- annotation://
- genome://
The implementation used for a protocol is not always the same in local and Hadoop mode. For example, the s3 protocol is implemented using the Amazon Java SDK in local mode, while in Hadoop mode it uses the built-in Hadoop support through the Path class.
The Javadoc of the DataFile class is available here.
A DataFile works like a File object: it is immutable.
DataFile f1 = new DataFile("foo.txt"); // a local file in current directory
DataFile f2 = new DataFile("/home/jourdren/foo.txt"); // Same file
DataFile f3 = new DataFile("file:///home/jourdren/foo.txt"); // Same file
DataFile f4 = new DataFile("hdfs://localhost/dir/foo.txt"); // a file on hdfs
DataFile f5 = new DataFile(new File("foo.txt")); // Create a DataFile object from a File object
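The first three examples above all designate the same local file, which suggests that a source without a protocol prefix falls back to the default file protocol. The following is a minimal sketch of that kind of prefix detection, written with the JDK only. The SourceProtocol class and its protocolOf method are hypothetical names introduced for illustration; this is not the code Eoulsan uses.

```java
import java.util.Locale;

/** Hypothetical helper (not part of Eoulsan): finds the protocol of a source string. */
final class SourceProtocol {

  private SourceProtocol() {}

  /** Returns the protocol name of a source, or "file" when there is no "scheme://" prefix. */
  static String protocolOf(String source) {
    int idx = source.indexOf("://");
    if (idx <= 0) {
      // Assumption: a source without a prefix uses the default file protocol
      return "file";
    }
    String scheme = source.substring(0, idx);
    // A scheme starts with a letter, followed by letters, digits, '+', '.' or '-'
    if (!scheme.matches("[A-Za-z][A-Za-z0-9+.-]*")) {
      return "file";
    }
    return scheme.toLowerCase(Locale.ROOT);
  }
}
```

With this sketch, "foo.txt", "/home/jourdren/foo.txt" and "file:///home/jourdren/foo.txt" all map to the file protocol, while "hdfs://localhost/dir/foo.txt" maps to hdfs.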
You can also create a DataFile from another DataFile, which is useful when the parent DataFile is a directory:
DataFile dir = new DataFile("/home/jourdren");
DataFile file = new DataFile(dir, "foo.txt");
A common way to create DataFile objects is to use the getInputDataFile and getOutputDataFile methods of the Context object. To use these methods, you must provide a DataFormat and a Sample object. Eoulsan will automatically find the file that matches the request:
DataFile fastqFile = context.getInputDataFile(DataFormats.READS_FASTQ, sample);
The DataFile class also provides methods to inspect the parts of the path:
DataFile file = new DataFile("hdfs://localhost/dir/foo.txt.gz");
file.getParent(); // == new DataFile("hdfs://localhost/dir");
file.getName(); // "foo.txt.gz"
file.getBasename(); // "foo.txt"
file.getCompressionExtension(); // ".gz"
file.getCompressionType(); // CompressionType.GZIP
file.getExtension(); // ".txt"
file.getFullExtension(); // ".txt.gz"
file.getProtocolName(); // PathDataProtocol object
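To make the relationship between the extension accessors above more concrete, here is a minimal sketch that splits a file name such as "foo.txt.gz" into its compression extension, extension and full extension. The NameParts class is a hypothetical helper written for this page, and the set of compression extensions is an assumption limited to the two mentioned earlier ('.gz' and '.bz2'); Eoulsan may compute these values differently.

```java
import java.util.Set;

/** Hypothetical helper (not Eoulsan code): splits a file name into its extensions. */
final class NameParts {

  // Assumption: only the two compression extensions mentioned in this page
  private static final Set<String> COMPRESSION_EXTENSIONS = Set.of(".gz", ".bz2");

  private NameParts() {}

  /** Returns the last extension of the name (with its dot), or "" if there is none. */
  private static String lastExtension(String name) {
    int dot = name.lastIndexOf('.');
    return dot > 0 ? name.substring(dot) : "";
  }

  /** Returns ".gz" or ".bz2" when the name ends with one of them, "" otherwise. */
  static String compressionExtension(String name) {
    String ext = lastExtension(name);
    return COMPRESSION_EXTENSIONS.contains(ext) ? ext : "";
  }

  /** Returns the extension that remains once the compression extension is removed. */
  static String extension(String name) {
    String compression = compressionExtension(name);
    return lastExtension(name.substring(0, name.length() - compression.length()));
  }

  /** Returns the extension followed by the compression extension, if any. */
  static String fullExtension(String name) {
    return extension(name) + compressionExtension(name);
  }
}
```

For "foo.txt.gz" the three methods return ".gz", ".txt" and ".txt.gz", matching the comments in the listing above.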
Like the standard Java API, Eoulsan mainly uses InputStream and OutputStream to read and write data. The DataFile object has methods to easily create these streams. Note that the open() method automatically decompresses the data.
DataFile file1 = new DataFile("hdfs://localhost/dir/foo.txt");
OutputStream os = file1.create(); // Create an OutputStream from the DataFile
DataFile file2 = new DataFile("hdfs://localhost/dir/bar.txt.gz");
InputStream is1 = file2.open(); // Create an InputStream. The stream is already decompressed
InputStream is2 = file2.rawOpen(); // Create an InputStream. The stream is still compressed (must be decompressed manually)
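The difference between a raw stream and an automatically decompressed one can be illustrated with the JDK alone. The sketch below wraps a raw stream in a GZIPInputStream when the file name ends with ".gz". The AutoDecompress class and its methods are hypothetical and are not how Eoulsan implements open(); only gzip is handled here because the JDK has no built-in bzip2 support.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Hypothetical helper (not Eoulsan code): decompresses a stream based on the file name. */
final class AutoDecompress {

  private AutoDecompress() {}

  /** Wraps the raw stream in a GZIPInputStream when the name ends with ".gz". */
  static InputStream open(String name, InputStream raw) throws IOException {
    return name.endsWith(".gz") ? new GZIPInputStream(raw) : raw;
  }

  /** Demo: gzip-compresses the text in memory, reopens it through open() and decodes it. */
  static String roundTrip(String name, String text) {
    try {
      ByteArrayOutputStream buffer = new ByteArrayOutputStream();
      try (GZIPOutputStream gz = new GZIPOutputStream(buffer)) {
        gz.write(text.getBytes(StandardCharsets.UTF_8));
      }
      try (InputStream in = open(name, new ByteArrayInputStream(buffer.toByteArray()))) {
        return new String(in.readAllBytes(), StandardCharsets.UTF_8);
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

Calling roundTrip("bar.txt.gz", "hello") gives back the original text, whereas a name without the ".gz" extension leaves the bytes compressed, which is the situation a caller of rawOpen() has to handle manually.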
If the DataFile uses a protocol that wraps a File object, you can convert the DataFile to a File object. This is useful when writing Steps dedicated to local mode.
File localFile = dataFile.toFile(); // Get the underlying File object
The isLocalFile() method can also be used to test whether the DataFile uses the file protocol.
As with a File object, the file may not exist on the file system. To test whether the file exists, use the exists() method:
dataFile.exists();
Calling getMetadata() on a DataFile object is like calling the POSIX stat() function on a file. The DataFileMetaData object provides many methods to get information about the file:
- getContentLength(): get the length of the file
- getContentEncoding(): get the content encoding of the file
- getContentType(): get the content type
- getContentMD5(): get the MD5 sum of the file
- getLastModified(): get the time of the last modification, in seconds since the epoch (1 January 1970)
- isDir(): test if the DataFile is a directory
- getDataFormat(): get the DataFormat of the DataType
Note that these methods can return -1 or null if the information is not available.
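For a local file, comparable stat()-like information can be read with the JDK itself. The sketch below is only an analogy: LocalStat is a hypothetical class, it is not DataFileMetaData, and it uses java.nio.file.Path (not the Hadoop Path class). It follows the same convention of returning -1 when a value is not available, and it reports the modification time in seconds to match the unit stated above.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.BasicFileAttributes;

/** Hypothetical helper (not Eoulsan code): stat()-like queries on a local file. */
final class LocalStat {

  private LocalStat() {}

  /** Returns the length of the file in bytes, or -1 if it cannot be read. */
  static long contentLength(Path path) {
    try {
      return Files.readAttributes(path, BasicFileAttributes.class).size();
    } catch (IOException e) {
      return -1;
    }
  }

  /** Returns the last modification time in seconds since the epoch, or -1 if unavailable. */
  static long lastModifiedSeconds(Path path) {
    try {
      BasicFileAttributes attrs = Files.readAttributes(path, BasicFileAttributes.class);
      return attrs.lastModifiedTime().toMillis() / 1000;
    } catch (IOException e) {
      return -1;
    }
  }

  /** Tests whether the path is an existing directory. */
  static boolean isDir(Path path) {
    return Files.isDirectory(path);
  }
}
```

Because a missing file yields -1 rather than an exception, callers should check the returned value before using it, just as with the metadata methods listed above.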