Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sketch of basic input io interface. #5

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

tfenne
Copy link
Member

@tfenne tfenne commented Jul 2, 2018

No description provided.

@tfenne
Copy link
Member Author

tfenne commented Jul 2, 2018

Should we make it a part of the basic contract of ByteSource that it buffers at least some minimum number of bytes? E.g. we could require all implementations to buffer at least 4kb o4 8kb if that's sufficient to do format detection for all likely formats. That would then guarantee that if you are positioned within the first, e.g., 4kb that you can always skip back to the start again.

Or should we have a specific mark() and reset() combination that requires buffering of all data between the mark() and reset() calls?

Copy link
Member

@lbergelson lbergelson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tfenne It's not clear to me if we want to have 1 interface that supports both seekable things and non-seekable ones, or several layered ones that add functionality as you go down the hierachy. The first has the nice property that everything is the same type, but you have to do checks everywhere to test if things are seekable or not, and you may fail late when you end up actually needing to seek. The second makes it more clear which operations require seekability, but probably will end up with us having to cast down from the generic interface in some places which is a similar problem.

*/
public interface ByteSource extends AutoCloseable {
/** Returns a description of the underlying source of the bytes, e.g. a path, stdin, a URL etc. */
String getDescription();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes please

default long length() { throw new UnsupportedOperationException(); }

/** Returns the offset of the next byte that will be read by any operation. */
long position();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this going to be a virtual file pointer in the case of a compressed file?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I won't recommend that: a compressed file should also be able to seek/getPosition from the disk-representation (I had some struggles about it with the bgzip implementation in the previous htsjdk)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking that yes, BlockCompressedInputStream would become a sub-class of this and the positions it deals in would be virtual file pointers. @magicDGS why wouldn't you recommend this?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have two concerns about handling position in bgzip (and other compressed formats) with virtual file pointers with the same interface:

  • Position in compressed/uncompressed data mean different things depending on the implementation, which is error prone. The API definition should behave exactly the same in every implementation, at least at the lower-level.
  • In the case of uncompressed data, the returned value is the file-position; in bgzip, it will be a virtual file pointer (block number and position within block); and what's about other compressed formats?

The concrete use case that I have in mind is the bgzip FASTA reference, which I implemented in the lates HTSJDK release. It requires a .gzi index file to convert to the virtual file pointer the .fai coordinates. Nevertheless, an implementation to seek to a non-virtual file pointer is possible and it might support non-gzi-indexed files, and at the same time handle as a normal stream if interest to seek to a concrete file position is required. Separating both concepts is something benefitial in my opinion.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Methods for seeking to a virtual file-pointer in bgzip and/or uncompressed-data pointer in other compressed format can be supported with a different method where the value passed is interpreted in an implementation-dependent way. That could be another interface to represent data that is transformed (compressed, block-compressed, maybe encrypted...)

String readAsciiString(final int length);

/** Reads a null-terminated string, with one byte per character, of unknown length. */
String readNullTerminatedAsciiString();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this take a max length for safety?

long position();

/** Returns true if the source can skip bytes without reading them. */
boolean canNavigateForward();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think canSkip() and skip(long) might be more familiar. The meaning of canNavigateForward is sort of vague unless you read the comment.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm happy to name them whatever. Perhaps it could even be a single canX() method since I'm not sure when we'd have the ability to skip forwards (without reading) but not skip backwards or vice versa.


/** Returns true if the source can always be repositioned behind the current position. Note that a buffered
* source may allow limited backwards navigation within the buffered data, while returning false here. */
boolean canNavigateBackwards();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might call this canSeek and navigateToPosition-> seek(long). Is there any input that supports reverse reading but not random access? Symmetry is nice, but maybe in this case it's not necessary?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure what the behavior should be for buffered readers that have some amount of look behind. We could use mark/reset but that's always kind of awkward. We could have some method like Range getSeekableRange() that returns the range you're allowed to seek within.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, I find mark() and reset() awkward. Perhaps just a canSeekTo(long position) would be easiest? Because I think, as a client, that's really what I want to know. I.e. can canSeekTo(0)?

* @param numBytes the number of bytes desired to be skipped.
* @return either numBytes or a smaller number if EOF is hit before numBytes is read
*/
int navigateForward(final long numBytes); // TODO: implement default method that reads
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we might want to split the navigation methods out into a separate interface? It might make it clearer to document which operations require seekable inputs and which ones don't.


// Methods related to buffering/repositioning/skipping etc.
/** Returns true if the source has a known finite length, false otherwise. */
boolean hasLength();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hasLength is a kind of funny name, maybe hasKnownLength? More wordy though.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or lenght returning -1 if not known...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer not to use special values when possible. I'm happy to change the name. How would you feel about Optional<Long> length()? Option is widely used in scala for this kind of thing, but I gather Java folks are not as keen on it?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I vote for use Optional more in the htsjdk-next library as a way of represent returned data that might not be present. Much better than magic numbers or a getter method failing if not present and a boolean method to check (I hate to read again an again things similar to hasLength ? length : 0; optional in that case is more idiomatic).

double readDouble();

/** Reads an ascii string, with one-byte per character, of the given length. */
String readAsciiString(final int length);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we also want a more generic readString(int, Charset) for dealing with UTF-8 files and others like it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking that in this interface we'd have the necessary methods for reading ascii strings ini binary files as we have in BAM (and I presume BCF, CRAM etc.). But that we'd have a separate TextSource or similar interface for when you're ready a text file and want real decoding.

* into other datatypes.
*
* Methods should throw exceptions if:
* - The underlying data source/stream/channel is closed
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or if you tried to seek / jump to a negative position or one that's > length/ eof.

* - Should think through how to deal with unsigned types
* - Need to think how this would work with BlockCompressed data
*/
public interface ByteSource extends AutoCloseable {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks a lot like the DataInput interface in java.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Funny that :) I looked at DataInput and our exiting code in htsjdk, and came to the opinion that it would be better to have our own interface that we can evolve as we see fit without being constrained to some existing type. Also every method in DataInput throws IOException which means try / catch all over the place. If we're going to use unchecked exceptions that would mean all clients of ByteSource have to do exception handling, as opposed to dealing with it at the lower level.

Copy link
Member

@magicDGS magicDGS left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am still not sure if combining seekable/non-seekable sources in the same interface is a good idea. I will rather separate the two concepts into two interfaces (ByteSource and SeekableByteSource extending the previous one or not) to distinguish between them in the cases that it is required. Then, we can bound the methods that require seek to one interface.

Anyway, here are the comments for the current implementation.

long position();

/** Returns true if the source can skip bytes without reading them. */
boolean canNavigateForward();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any situation where the navigation forward won't be allowed? Skipping bytes could be always done by reading them...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I was trying to get across is that I think clients may wish to know whether they can skip forward. E.g. if I'm reading a 100GB BAM from S3 or GCS and can't skip, maybe I'll find another way to get the unmapped reads at the end of the file.

I think this highlights though that we need to hash out really clear names and semantics for the functions that tell you whether you can re-position/skip in general and/or within some buffered region.

boolean hasLength();

/** Returns the length if {@link #hasLength()} is true, otherwise throws an exception. */
default long length() { throw new UnsupportedOperationException(); }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this throws by default, the hasLength() mehod should return true by default.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I meant false - it was too late for reviewing...

byte readByte();

/** Reads a boolean. */
boolean readBoolean();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some of this methods looks like DataInput - should we extend directly that one instead? In that case, we can use our sources in methods with DataInput signature.

* @param numBytes the number of bytes desired to be skipped.
* @return either numBytes or a smaller number if EOF is hit before numBytes is read
*/
int navigateForward(final long numBytes); // TODO: implement default method that reads
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe skipBytes (as in DataInput)?

* @param numBytes the number of bytes desired to be skipped.
* @return either numBytes or a smaller number if numBytes > position
*/
default int navigateBackward(final long numBytes) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe something to match skipBytes? I'm not sure what...

* - There are insufficient bytes available (after blocking) to read a single item of the datatype requested
*
* NOTES:
* - Almost all the methods here could have default implementations that read via one of the readBytes() methods.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not having a package protected abstract class with the default implementation? Do we really want this defaults to be part of the public API? If they change, then they mean a major bump!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could. I just figure that for all the readInt etc. that can be very easily implemented in terms of readBytes() they could live here. It'd still be a major version bump if we changed the contract of an API method, even if the implementation was in an abstract class below it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we change the contract, the version is bump; but what if the implementation behaves badly with other methods, but the contract stays the same? Anyway, I am fine with the simple implementations using readBytes...


/** Returns true if the source can always be repositioned behind the current position. Note that a buffered
* source may allow limited backwards navigation within the buffered data, while returning false here. */
boolean canNavigateBackwards();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of returning false here for limited backwards navigation, we can return number of bytes that can be "safely" navigate backwards:

  • Buffered data: buffer-size (it could be big for pipes to discover formats, for example)
  • Random access data: return the current position (in principle, it can be completely reset)
  • Unbuffered data: returns 0.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And related with this method, we can also include a canReset and reset to come back to the first position in the file (as a sugar syntax for position(0) - this will be just nice to have (I can imagine several parts of the code calling the method to return the number of bytes that can be navigate backwards, compare to the lenght and if they are equal position the stream in 0).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did think about this @magicDGS I think the problem is that this doesn't work great for cases where you're at position(0) already, and want to ask "if I read N bytes to do format detection, can I go back to 0, or do I need to buffer those bytes myself".

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True. Maybe for random access data it should return the maximum long value and for buffered data the buffer size. This would behave correctly in your case of format detection: even in position 0, the answer is true with that API definition.

/** Returns a description of the underlying source of the bytes, e.g. a path, stdin, a URL etc. */
String getDescription();

// Methods related to buffering/repositioning/skipping etc.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This two block separation suggest that this could be split into two interfaces...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My suggestion would be this:

  1. Get this interface nailed down how we like it
  2. Introduce a parallel interface for reading lines of text
  3. Refactor the seekable part into a standalone interface that is used in both

All of which would be done long before even an alpha release. I just find it easier to keep it in one interface while we work things out.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even in that case, when implementations start to get more complicated it will be difficult to uncouple the two of them. But I don't know what is the timeframe for splitting them that you are suggesting.

*.rar

# virtual machine crash logs, see http://www.java.com/en/download/help/error_hotspot.xml
hs_err_pid*
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess that this was unintentional..

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, since all the generated stuff lives in out it was intentional

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even the crash logs are placed in out for Mill? That's neat if so!

/** Reads a boolean. */
boolean readBoolean();

/** Reads a little endian signed short. */
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we also support big-endian reading in case it is required in the future?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See above comment to @lbergelson.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants