-
Notifications
You must be signed in to change notification settings - Fork 3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Sketch of basic input io interface. #5
base: master
Are you sure you want to change the base?
Conversation
Should we make it a part of the basic contract of Or should we have a specific |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tfenne It's not clear to me if we want to have 1 interface that supports both seekable things and non-seekable ones, or several layered ones that add functionality as you go down the hierachy. The first has the nice property that everything is the same type, but you have to do checks everywhere to test if things are seekable or not, and you may fail late when you end up actually needing to seek. The second makes it more clear which operations require seekability, but probably will end up with us having to cast down from the generic interface in some places which is a similar problem.
*/ | ||
public interface ByteSource extends AutoCloseable { | ||
/** Returns a description of the underlying source of the bytes, e.g. a path, stdin, a URL etc. */ | ||
String getDescription(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes please
default long length() { throw new UnsupportedOperationException(); } | ||
|
||
/** Returns the offset of the next byte that will be read by any operation. */ | ||
long position(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this going to be a virtual file pointer in the case of a compressed file?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I won't recommend that: a compressed file should also be able to seek/getPosition from the disk-representation (I had some struggles about it with the bgzip implementation in the previous htsjdk)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking that yes, BlockCompressedInputStream
would become a sub-class of this and the positions it deals in would be virtual file pointers. @magicDGS why wouldn't you recommend this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have two concerns about handling position in bgzip (and other compressed formats) with virtual file pointers with the same interface:
- Position in compressed/uncompressed data mean different things depending on the implementation, which is error prone. The API definition should behave exactly the same in every implementation, at least at the lower-level.
- In the case of uncompressed data, the returned value is the file-position; in bgzip, it will be a virtual file pointer (block number and position within block); and what's about other compressed formats?
The concrete use case that I have in mind is the bgzip FASTA reference, which I implemented in the lates HTSJDK release. It requires a .gzi index file to convert to the virtual file pointer the .fai coordinates. Nevertheless, an implementation to seek to a non-virtual file pointer is possible and it might support non-gzi-indexed files, and at the same time handle as a normal stream if interest to seek to a concrete file position is required. Separating both concepts is something benefitial in my opinion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Methods for seeking to a virtual file-pointer in bgzip and/or uncompressed-data pointer in other compressed format can be supported with a different method where the value passed is interpreted in an implementation-dependent way. That could be another interface to represent data that is transformed (compressed, block-compressed, maybe encrypted...)
String readAsciiString(final int length); | ||
|
||
/** Reads a null-terminated string, with one byte per character, of unknown length. */ | ||
String readNullTerminatedAsciiString(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should this take a max length for safety?
long position(); | ||
|
||
/** Returns true if the source can skip bytes without reading them. */ | ||
boolean canNavigateForward(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think canSkip()
and skip(long)
might be more familiar. The meaning of canNavigateForward
is sort of vague unless you read the comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm happy to name them whatever. Perhaps it could even be a single canX()
method since I'm not sure when we'd have the ability to skip forwards (without reading) but not skip backwards or vice versa.
|
||
/** Returns true if the source can always be repositioned behind the current position. Note that a buffered | ||
* source may allow limited backwards navigation within the buffered data, while returning false here. */ | ||
boolean canNavigateBackwards(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I might call this canSeek
and navigateToPosition
-> seek(long)
. Is there any input that supports reverse reading but not random access? Symmetry is nice, but maybe in this case it's not necessary?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure what the behavior should be for buffered readers that have some amount of look behind. We could use mark/reset but that's always kind of awkward. We could have some method like Range getSeekableRange()
that returns the range you're allowed to seek within.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, I find mark()
and reset()
awkward. Perhaps just a canSeekTo(long position)
would be easiest? Because I think, as a client, that's really what I want to know. I.e. can canSeekTo(0)
?
* @param numBytes the number of bytes desired to be skipped. | ||
* @return either numBytes or a smaller number if EOF is hit before numBytes is read | ||
*/ | ||
int navigateForward(final long numBytes); // TODO: implement default method that reads |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we might want to split the navigation methods out into a separate interface? It might make it clearer to document which operations require seekable inputs and which ones don't.
|
||
// Methods related to buffering/repositioning/skipping etc. | ||
/** Returns true if the source has a known finite length, false otherwise. */ | ||
boolean hasLength(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hasLength
is a kind of funny name, maybe hasKnownLength
? More wordy though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or lenght
returning -1 if not known...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd prefer not to use special values when possible. I'm happy to change the name. How would you feel about Optional<Long> length()
? Option
is widely used in scala for this kind of thing, but I gather Java folks are not as keen on it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I vote for use Optional
more in the htsjdk-next library as a way of represent returned data that might not be present. Much better than magic numbers or a getter method failing if not present and a boolean method to check (I hate to read again an again things similar to hasLength ? length : 0
; optional in that case is more idiomatic).
double readDouble(); | ||
|
||
/** Reads an ascii string, with one-byte per character, of the given length. */ | ||
String readAsciiString(final int length); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we also want a more generic readString(int, Charset) for dealing with UTF-8 files and others like it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking that in this interface we'd have the necessary methods for reading ascii strings ini binary files as we have in BAM (and I presume BCF, CRAM etc.). But that we'd have a separate TextSource
or similar interface for when you're ready a text file and want real decoding.
* into other datatypes. | ||
* | ||
* Methods should throw exceptions if: | ||
* - The underlying data source/stream/channel is closed |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
or if you tried to seek / jump to a negative position or one that's > length/ eof.
* - Should think through how to deal with unsigned types | ||
* - Need to think how this would work with BlockCompressed data | ||
*/ | ||
public interface ByteSource extends AutoCloseable { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks a lot like the DataInput
interface in java.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Funny that :) I looked at DataInput
and our exiting code in htsjdk, and came to the opinion that it would be better to have our own interface that we can evolve as we see fit without being constrained to some existing type. Also every method in DataInput
throws IOException
which means try / catch
all over the place. If we're going to use unchecked exceptions that would mean all clients of ByteSource
have to do exception handling, as opposed to dealing with it at the lower level.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am still not sure if combining seekable/non-seekable sources in the same interface is a good idea. I will rather separate the two concepts into two interfaces (ByteSource
and SeekableByteSource
extending the previous one or not) to distinguish between them in the cases that it is required. Then, we can bound the methods that require seek to one interface.
Anyway, here are the comments for the current implementation.
long position(); | ||
|
||
/** Returns true if the source can skip bytes without reading them. */ | ||
boolean canNavigateForward(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there any situation where the navigation forward won't be allowed? Skipping bytes could be always done by reading them...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What I was trying to get across is that I think clients may wish to know whether they can skip forward. E.g. if I'm reading a 100GB BAM from S3 or GCS and can't skip, maybe I'll find another way to get the unmapped reads at the end of the file.
I think this highlights though that we need to hash out really clear names and semantics for the functions that tell you whether you can re-position/skip in general and/or within some buffered region.
boolean hasLength(); | ||
|
||
/** Returns the length if {@link #hasLength()} is true, otherwise throws an exception. */ | ||
default long length() { throw new UnsupportedOperationException(); } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this throws by default, the hasLength()
mehod should return true
by default.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I meant false
- it was too late for reviewing...
byte readByte(); | ||
|
||
/** Reads a boolean. */ | ||
boolean readBoolean(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some of this methods looks like DataInput
- should we extend directly that one instead? In that case, we can use our sources in methods with DataInput
signature.
* @param numBytes the number of bytes desired to be skipped. | ||
* @return either numBytes or a smaller number if EOF is hit before numBytes is read | ||
*/ | ||
int navigateForward(final long numBytes); // TODO: implement default method that reads |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe skipBytes
(as in DataInput
)?
* @param numBytes the number of bytes desired to be skipped. | ||
* @return either numBytes or a smaller number if numBytes > position | ||
*/ | ||
default int navigateBackward(final long numBytes) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe something to match skipBytes
? I'm not sure what...
* - There are insufficient bytes available (after blocking) to read a single item of the datatype requested | ||
* | ||
* NOTES: | ||
* - Almost all the methods here could have default implementations that read via one of the readBytes() methods. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not having a package protected abstract class with the default implementation? Do we really want this defaults to be part of the public API? If they change, then they mean a major bump!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could. I just figure that for all the readInt
etc. that can be very easily implemented in terms of readBytes()
they could live here. It'd still be a major version bump if we changed the contract of an API method, even if the implementation was in an abstract class below it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we change the contract, the version is bump; but what if the implementation behaves badly with other methods, but the contract stays the same? Anyway, I am fine with the simple implementations using readBytes
...
|
||
/** Returns true if the source can always be repositioned behind the current position. Note that a buffered | ||
* source may allow limited backwards navigation within the buffered data, while returning false here. */ | ||
boolean canNavigateBackwards(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of returning false here for limited backwards navigation, we can return number of bytes that can be "safely" navigate backwards:
- Buffered data: buffer-size (it could be big for pipes to discover formats, for example)
- Random access data: return the current position (in principle, it can be completely reset)
- Unbuffered data: returns 0.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And related with this method, we can also include a canReset
and reset
to come back to the first position in the file (as a sugar syntax for position(0)
- this will be just nice to have (I can imagine several parts of the code calling the method to return the number of bytes that can be navigate backwards, compare to the lenght and if they are equal position the stream in 0).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did think about this @magicDGS I think the problem is that this doesn't work great for cases where you're at position(0)
already, and want to ask "if I read N bytes to do format detection, can I go back to 0, or do I need to buffer those bytes myself".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
True. Maybe for random access data it should return the maximum long value and for buffered data the buffer size. This would behave correctly in your case of format detection: even in position 0, the answer is true with that API definition.
/** Returns a description of the underlying source of the bytes, e.g. a path, stdin, a URL etc. */ | ||
String getDescription(); | ||
|
||
// Methods related to buffering/repositioning/skipping etc. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This two block separation suggest that this could be split into two interfaces...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My suggestion would be this:
- Get this interface nailed down how we like it
- Introduce a parallel interface for reading lines of text
- Refactor the seekable part into a standalone interface that is used in both
All of which would be done long before even an alpha release. I just find it easier to keep it in one interface while we work things out.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Even in that case, when implementations start to get more complicated it will be difficult to uncouple the two of them. But I don't know what is the timeframe for splitting them that you are suggesting.
*.rar | ||
|
||
# virtual machine crash logs, see http://www.java.com/en/download/help/error_hotspot.xml | ||
hs_err_pid* |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess that this was unintentional..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, since all the generated stuff lives in out
it was intentional
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Even the crash logs are placed in out
for Mill? That's neat if so!
/** Reads a boolean. */ | ||
boolean readBoolean(); | ||
|
||
/** Reads a little endian signed short. */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we also support big-endian reading in case it is required in the future?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See above comment to @lbergelson.
No description provided.