Sketch of basic input io interface. #5

tfenne · 2018-07-02T20:56:09Z

No description provided.

tfenne · 2018-07-02T21:27:44Z

Should we make it a part of the basic contract of ByteSource that it buffers at least some minimum number of bytes? E.g. we could require all implementations to buffer at least 4kb o4 8kb if that's sufficient to do format detection for all likely formats. That would then guarantee that if you are positioned within the first, e.g., 4kb that you can always skip back to the start again.

Or should we have a specific mark() and reset() combination that requires buffering of all data between the mark() and reset() calls?

lbergelson

@tfenne It's not clear to me if we want to have 1 interface that supports both seekable things and non-seekable ones, or several layered ones that add functionality as you go down the hierachy. The first has the nice property that everything is the same type, but you have to do checks everywhere to test if things are seekable or not, and you may fail late when you end up actually needing to seek. The second makes it more clear which operations require seekability, but probably will end up with us having to cast down from the generic interface in some places which is a similar problem.

lbergelson · 2018-07-02T21:05:17Z

core/src/org/samtools/htsjdk/core/io/ByteSource.java

+ */
+public interface ByteSource extends AutoCloseable {
+    /** Returns a description of the underlying source of the bytes, e.g. a path, stdin, a URL etc. */
+    String getDescription();


lbergelson · 2018-07-02T21:14:46Z

core/src/org/samtools/htsjdk/core/io/ByteSource.java

+    default long length() { throw new UnsupportedOperationException(); }
+
+    /** Returns the offset of the next byte that will be read by any operation. */
+    long position();


Is this going to be a virtual file pointer in the case of a compressed file?

I won't recommend that: a compressed file should also be able to seek/getPosition from the disk-representation (I had some struggles about it with the bgzip implementation in the previous htsjdk)

I was thinking that yes, BlockCompressedInputStream would become a sub-class of this and the positions it deals in would be virtual file pointers. @magicDGS why wouldn't you recommend this?

I have two concerns about handling position in bgzip (and other compressed formats) with virtual file pointers with the same interface:

Position in compressed/uncompressed data mean different things depending on the implementation, which is error prone. The API definition should behave exactly the same in every implementation, at least at the lower-level.

In the case of uncompressed data, the returned value is the file-position; in bgzip, it will be a virtual file pointer (block number and position within block); and what's about other compressed formats?

The concrete use case that I have in mind is the bgzip FASTA reference, which I implemented in the lates HTSJDK release. It requires a .gzi index file to convert to the virtual file pointer the .fai coordinates. Nevertheless, an implementation to seek to a non-virtual file pointer is possible and it might support non-gzi-indexed files, and at the same time handle as a normal stream if interest to seek to a concrete file position is required. Separating both concepts is something benefitial in my opinion.

Methods for seeking to a virtual file-pointer in bgzip and/or uncompressed-data pointer in other compressed format can be supported with a different method where the value passed is interpreted in an implementation-dependent way. That could be another interface to represent data that is transformed (compressed, block-compressed, maybe encrypted...)

lbergelson · 2018-07-02T21:16:34Z

core/src/org/samtools/htsjdk/core/io/ByteSource.java

+    String readAsciiString(final int length);
+
+    /** Reads a null-terminated string, with one byte per character, of unknown length. */
+    String readNullTerminatedAsciiString();


should this take a max length for safety?

lbergelson · 2018-07-02T21:18:18Z

core/src/org/samtools/htsjdk/core/io/ByteSource.java

+    long position();
+
+    /** Returns true if the source can skip bytes without reading them. */
+    boolean canNavigateForward();


I think canSkip() and skip(long) might be more familiar. The meaning of canNavigateForward is sort of vague unless you read the comment.

I'm happy to name them whatever. Perhaps it could even be a single canX() method since I'm not sure when we'd have the ability to skip forwards (without reading) but not skip backwards or vice versa.

lbergelson · 2018-07-02T21:21:23Z

core/src/org/samtools/htsjdk/core/io/ByteSource.java

+
+    /** Returns true if the source can always be repositioned behind the current position. Note that a buffered
+     * source may allow limited backwards navigation within the buffered data, while returning false here. */
+    boolean canNavigateBackwards();


I might call this canSeek and navigateToPosition-> seek(long). Is there any input that supports reverse reading but not random access? Symmetry is nice, but maybe in this case it's not necessary?

I'm not sure what the behavior should be for buffered readers that have some amount of look behind. We could use mark/reset but that's always kind of awkward. We could have some method like Range getSeekableRange() that returns the range you're allowed to seek within.

I agree, I find mark() and reset() awkward. Perhaps just a canSeekTo(long position) would be easiest? Because I think, as a client, that's really what I want to know. I.e. can canSeekTo(0)?

lbergelson · 2018-07-02T21:38:34Z

core/src/org/samtools/htsjdk/core/io/ByteSource.java

+     * @param numBytes the number of bytes desired to be skipped.
+     * @return either numBytes or a smaller number if EOF is hit before numBytes is read
+     */
+    int navigateForward(final long numBytes); // TODO: implement default method that reads


I think we might want to split the navigation methods out into a separate interface? It might make it clearer to document which operations require seekable inputs and which ones don't.

lbergelson · 2018-07-02T21:41:54Z

core/src/org/samtools/htsjdk/core/io/ByteSource.java

+
+    // Methods related to buffering/repositioning/skipping etc.
+    /** Returns true if the source has a known finite length, false otherwise. */
+    boolean hasLength();


hasLength is a kind of funny name, maybe hasKnownLength? More wordy though.

Or lenght returning -1 if not known...

I'd prefer not to use special values when possible. I'm happy to change the name. How would you feel about Optional<Long> length()? Option is widely used in scala for this kind of thing, but I gather Java folks are not as keen on it?

I vote for use Optional more in the htsjdk-next library as a way of represent returned data that might not be present. Much better than magic numbers or a getter method failing if not present and a boolean method to check (I hate to read again an again things similar to hasLength ? length : 0; optional in that case is more idiomatic).

lbergelson · 2018-07-02T21:47:05Z

core/src/org/samtools/htsjdk/core/io/ByteSource.java

+    double readDouble();
+
+    /** Reads an ascii string, with one-byte per character, of the given length. */
+    String readAsciiString(final int length);


Do we also want a more generic readString(int, Charset) for dealing with UTF-8 files and others like it.

I was thinking that in this interface we'd have the necessary methods for reading ascii strings ini binary files as we have in BAM (and I presume BCF, CRAM etc.). But that we'd have a separate TextSource or similar interface for when you're ready a text file and want real decoding.

lbergelson · 2018-07-02T21:51:45Z

core/src/org/samtools/htsjdk/core/io/ByteSource.java

+ * into other datatypes.
+ *
+ * Methods should throw exceptions if:
+ * - The underlying data source/stream/channel is closed


or if you tried to seek / jump to a negative position or one that's > length/ eof.

lbergelson · 2018-07-02T21:58:19Z

core/src/org/samtools/htsjdk/core/io/ByteSource.java

+ *   - Should think through how to deal with unsigned types
+ *   - Need to think how this would work with BlockCompressed data
+ */
+public interface ByteSource extends AutoCloseable {


This looks a lot like the DataInput interface in java.

Funny that :) I looked at DataInput and our exiting code in htsjdk, and came to the opinion that it would be better to have our own interface that we can evolve as we see fit without being constrained to some existing type. Also every method in DataInput throws IOException which means try / catch all over the place. If we're going to use unchecked exceptions that would mean all clients of ByteSource have to do exception handling, as opposed to dealing with it at the lower level.

magicDGS

I am still not sure if combining seekable/non-seekable sources in the same interface is a good idea. I will rather separate the two concepts into two interfaces (ByteSource and SeekableByteSource extending the previous one or not) to distinguish between them in the cases that it is required. Then, we can bound the methods that require seek to one interface.

Anyway, here are the comments for the current implementation.

magicDGS · 2018-07-02T21:30:42Z

core/src/org/samtools/htsjdk/core/io/ByteSource.java

+    long position();
+
+    /** Returns true if the source can skip bytes without reading them. */
+    boolean canNavigateForward();


Is there any situation where the navigation forward won't be allowed? Skipping bytes could be always done by reading them...

What I was trying to get across is that I think clients may wish to know whether they can skip forward. E.g. if I'm reading a 100GB BAM from S3 or GCS and can't skip, maybe I'll find another way to get the unmapped reads at the end of the file.

I think this highlights though that we need to hash out really clear names and semantics for the functions that tell you whether you can re-position/skip in general and/or within some buffered region.

magicDGS · 2018-07-02T21:31:52Z

core/src/org/samtools/htsjdk/core/io/ByteSource.java

+    boolean hasLength();
+
+    /** Returns the length if {@link #hasLength()} is true, otherwise throws an exception. */
+    default long length() { throw new UnsupportedOperationException(); }


If this throws by default, the hasLength() mehod should return true by default.

Sorry, I meant false - it was too late for reviewing...

magicDGS · 2018-07-02T21:34:39Z

core/src/org/samtools/htsjdk/core/io/ByteSource.java

+    byte readByte();
+
+    /** Reads a boolean. */
+    boolean readBoolean();


Some of this methods looks like DataInput - should we extend directly that one instead? In that case, we can use our sources in methods with DataInput signature.

magicDGS · 2018-07-02T21:35:52Z

core/src/org/samtools/htsjdk/core/io/ByteSource.java

+     * @param numBytes the number of bytes desired to be skipped.
+     * @return either numBytes or a smaller number if EOF is hit before numBytes is read
+     */
+    int navigateForward(final long numBytes); // TODO: implement default method that reads


Maybe skipBytes (as in DataInput)?

magicDGS · 2018-07-02T21:39:55Z

core/src/org/samtools/htsjdk/core/io/ByteSource.java

+     * @param numBytes the number of bytes desired to be skipped.
+     * @return either numBytes or a smaller number if numBytes > position
+     */
+    default int navigateBackward(final long numBytes) {


Maybe something to match skipBytes? I'm not sure what...

magicDGS · 2018-07-02T21:57:53Z

core/src/org/samtools/htsjdk/core/io/ByteSource.java

+ * - There are insufficient bytes available (after blocking) to read a single item of the datatype requested
+ *
+ * NOTES:
+ *   - Almost all the methods here could have default implementations that read via one of the readBytes() methods.


Why not having a package protected abstract class with the default implementation? Do we really want this defaults to be part of the public API? If they change, then they mean a major bump!

We could. I just figure that for all the readInt etc. that can be very easily implemented in terms of readBytes() they could live here. It'd still be a major version bump if we changed the contract of an API method, even if the implementation was in an abstract class below it.

If we change the contract, the version is bump; but what if the implementation behaves badly with other methods, but the contract stays the same? Anyway, I am fine with the simple implementations using readBytes...

magicDGS · 2018-07-02T22:01:02Z

core/src/org/samtools/htsjdk/core/io/ByteSource.java

+
+    /** Returns true if the source can always be repositioned behind the current position. Note that a buffered
+     * source may allow limited backwards navigation within the buffered data, while returning false here. */
+    boolean canNavigateBackwards();


Instead of returning false here for limited backwards navigation, we can return number of bytes that can be "safely" navigate backwards:

Buffered data: buffer-size (it could be big for pipes to discover formats, for example)

Random access data: return the current position (in principle, it can be completely reset)

Unbuffered data: returns 0.

And related with this method, we can also include a canReset and reset to come back to the first position in the file (as a sugar syntax for position(0) - this will be just nice to have (I can imagine several parts of the code calling the method to return the number of bytes that can be navigate backwards, compare to the lenght and if they are equal position the stream in 0).

I did think about this @magicDGS I think the problem is that this doesn't work great for cases where you're at position(0) already, and want to ask "if I read N bytes to do format detection, can I go back to 0, or do I need to buffer those bytes myself".

True. Maybe for random access data it should return the maximum long value and for buffered data the buffer size. This would behave correctly in your case of format detection: even in position 0, the answer is true with that API definition.

magicDGS · 2018-07-02T22:04:52Z

core/src/org/samtools/htsjdk/core/io/ByteSource.java

+    /** Returns a description of the underlying source of the bytes, e.g. a path, stdin, a URL etc. */
+    String getDescription();
+
+    // Methods related to buffering/repositioning/skipping etc.


This two block separation suggest that this could be split into two interfaces...

My suggestion would be this:

Get this interface nailed down how we like it

Introduce a parallel interface for reading lines of text

Refactor the seekable part into a standalone interface that is used in both

All of which would be done long before even an alpha release. I just find it easier to keep it in one interface while we work things out.

Even in that case, when implementations start to get more complicated it will be difficult to uncouple the two of them. But I don't know what is the timeframe for splitting them that you are suggesting.

magicDGS · 2018-07-02T22:05:54Z

.gitignore

-*.rar
-
-# virtual machine crash logs, see http://www.java.com/en/download/help/error_hotspot.xml
-hs_err_pid*


I guess that this was unintentional..

Actually, since all the generated stuff lives in out it was intentional

Even the crash logs are placed in out for Mill? That's neat if so!

magicDGS · 2018-07-02T22:08:23Z

core/src/org/samtools/htsjdk/core/io/ByteSource.java

+    /** Reads a boolean. */
+    boolean readBoolean();
+
+    /** Reads a little endian signed short. */


Should we also support big-endian reading in case it is required in the future?

See above comment to @lbergelson.

Sketch of basic input io interface.

6f45466

lbergelson reviewed Jul 2, 2018

View reviewed changes

magicDGS reviewed Jul 2, 2018

View reviewed changes

magicDGS mentioned this pull request Oct 29, 2018

URI and Path specifier classes for input/output resources. #34

Merged

Sketch of basic input io interface. #5

Are you sure you want to change the base?

Sketch of basic input io interface. #5

Conversation

tfenne commented Jul 2, 2018

tfenne commented Jul 2, 2018

lbergelson left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

magicDGS left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment