[DISQ-10] Initial Disq code contribution. #14
Corresponds to tomwhite@d3d76ec.
Fixes #10.
Review of the classes in the org.disq_bio.disq package. I did pass through htsjdk.samtools too, but without getting into details, because that will be ported to htsjdk.
import java.util.NoSuchElementException;

/** Class for reading and querying BAM files. */
public class BAMFileReader2 extends SamReader.ReaderImplementation {
This looks like a huge class to be used for only one method (BAMFileReader2.findVirtualOffsetOfFirstRecord(seekableStream)) in BAMSBIIndexer.findVirtualOffsetOfFirstRecord.
It is, however it's a temporary workaround until the upstream PR is available.
* SBI is an index into BGZF-compressed data files, which has an entry for the file position of the
* start of every <i>n</i>th record. Reads files that were created by {@link BAMSBIIndexer}.
*/
public final class SBIIndex implements Serializable {
If this will be in htsjdk, add a TODO pointing to the PR to be removed once htsjdk is upgraded with that change.
Good idea - I'll do that.
Done
* in an index every <i>n</i>th record. When there are no records left call {@link #finish} to
* complete writing the index.
*/
public final class SBIIndexWriter {
If this will be in htsjdk, add a TODO pointing to the PR to be removed once htsjdk is upgraded with that change.
*
* @see HtsjdkReadsRddStorage
*/
public class HtsjdkReadsRdd {
Why does it require the Htsjdk prefix (here and in other code in this package)? If this library targets the htsjdk library interfaces (and, in the future, I hope, the proper interfaces in htsjdk-next), we do not need this prefix.
I added it because of a discussion we had a while back about using different data models (i.e. non-htsjdk) for reads and variants. Interested to see what others think too.
If changed to RecordsRdd<Metadata, Record>, the implementation can either maintain the prefix or be called SAMRecordRdd.
I'd prefer to see the Htsjdk prefixes not be in the top level of the API as well. The metadata and record types can be generic, as discussed here and in some other comments. Not that it should block the initial commit, but I do like the idea. Also, is it worth abstracting RDD to a type parameter at the top level to ease migration to Dataset? It would help with unifying the interface.
Please keep the Htsjdk prefixes for now.
public class HtsjdkReadsRdd {

private final SAMFileHeader header;
private final JavaRDD<SAMRecord> reads;
Are the reads header-less? If so, add to the top-class javadoc.
I added some more javadoc here.
*
* @see HtsjdkVariantsRddStorage
*/
public class HtsjdkVariantsRdd {
This could be a common interface for reads/variants and any other kind of records (it would be nice to have FASTQ support eventually, and other tribble features). What about using a RecordsRdd<Metadata, Record>? For reads it would be RecordsRdd<SAMFileHeader, SAMRecord> and for variants RecordsRdd<VCFHeader, VariantContext> - that is more extensible for other formats (and for moving to a different library, or htsjdk-next, in the future).
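For illustration, a minimal sketch of what such a generic interface could look like (the RecordsRdd name comes from this thread; the method names are assumptions, not code from this PR):

```java
import org.apache.spark.api.java.JavaRDD;

/** Sketch only: a generic metadata-plus-records pair backed by a Spark RDD. */
public interface RecordsRdd<Metadata, Record> {
  /** The file-level metadata, e.g. a SAMFileHeader or a VCFHeader. */
  Metadata getMetadata();

  /** The records, e.g. SAMRecords or VariantContexts. */
  JavaRDD<Record> getRecords();
}
```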
Yes, that's a good idea.
This probably needs more discussion. Should HtsjdkReadsRdd still exist and implement the interface? Should its only methods be getMetadata() and getRecords(), or should it also have getHeader() and getReads()? I'd be open to evolving this later - I think it's OK for the API to be unstable for a while.
I think that for now the best option is to keep HtsjdkReadsRdd extending the interface, with the more explicit methods (getHeader/getReads), and then decide whether the implementation is really required. This is in line with some discussion we had in htsjdk-next-beta, where we are planning to add a high-level abstraction for every record reader/iterator with common methods. Maybe at some point disq can extend those interfaces directly to make processing on Spark/Hadoop, when possible, more similar to processing on a local machine.
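A sketch of that arrangement, under the same assumptions as the interface sketch above (not the code in this PR):

```java
import htsjdk.samtools.SAMFileHeader;
import htsjdk.samtools.SAMRecord;
import org.apache.spark.api.java.JavaRDD;

/** Sketch only: HtsjdkReadsRdd keeps explicit accessors while implementing the generic interface. */
public class HtsjdkReadsRdd implements RecordsRdd<SAMFileHeader, SAMRecord> {
  private final SAMFileHeader header;
  private final JavaRDD<SAMRecord> reads;

  public HtsjdkReadsRdd(SAMFileHeader header, JavaRDD<SAMRecord> reads) {
    this.header = header;
    this.reads = reads;
  }

  @Override
  public SAMFileHeader getMetadata() {
    return header;
  }

  @Override
  public JavaRDD<SAMRecord> getRecords() {
    return reads;
  }

  /** Convenience accessor equivalent to {@link #getMetadata()}. */
  public SAMFileHeader getHeader() {
    return header;
  }

  /** Convenience accessor equivalent to {@link #getRecords()}. */
  public JavaRDD<SAMRecord> getReads() {
    return reads;
  }
}
```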
Ideally I would like to make distributed processing as similar as possible to local processing, although that's not always possible. Do you have a link to the discussion in htsjdk-next-beta?
We are discussing this design of readers in Slack. @lbergelson, you were the one proposing a RecordIterator<Record, Metadata> and a RecordReader<Record, Metadata, QueryKey>. @tomwhite - if you want, I can ask the rest of the maintainers to add you to the Slack chat (I think that will be fine).
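For illustration only, one possible shape of those interfaces (a guess based on the names mentioned here, not the actual htsjdk-next design):

```java
import java.util.Iterator;

// Speculative sketch; the real interfaces in the Slack discussion may differ.
interface RecordIterator<Record, Metadata> extends Iterator<Record>, AutoCloseable {
  /** File-level metadata, e.g. a SAMFileHeader. */
  Metadata getMetadata();
}

interface RecordReader<Record, Metadata, QueryKey> extends AutoCloseable {
  Metadata getMetadata();

  /** Iterate over all records. */
  RecordIterator<Record, Metadata> iterator();

  /** Iterate over the records matching a query, e.g. a genomic interval. */
  RecordIterator<Record, Metadata> query(QueryKey key);
}
```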
I like the idea of having both a super interface and also a specialized class with convenience methods for reads.
public class HtsjdkVariantsRddStorage {

/** An option for configuring how to write a {@link HtsjdkVariantsRdd}. */
public interface WriteOption {}
Same as before - extract as VariantsWriteOption instead.
}

/** An option for configuring the number of files to write a {@link HtsjdkVariantsRdd} as. */
public enum FileCardinalityWriteOption implements WriteOption {
This could be shared by reads/variants/other - maybe have a common WriteOption for all kinds of writing, and for the format include a different class FormatOption (extending WriteOption or not).
* An option for controlling which directory to write temporary part files to when writing a
* {@link HtsjdkVariantsRdd} as a single file.
*/
public static class TempPartsDirectoryWriteOption implements HtsjdkReadsRddStorage.WriteOption {
Same as cardinality option
import org.disq_bio.disq.impl.formats.vcf.VcfSource;

/** The entry point for reading or writing a {@link HtsjdkVariantsRdd}. */
public class HtsjdkVariantsRddStorage {
There is a lot of parallelism between this class and the reads counterpart - maybe an abstract class would keep both in sync (I saw that there is no NIO support in variants yet, but it could throw for now).
That makes sense too.
Looking again, there's some commonality, but I'm not sure how useful it is to have an abstract base class at this point. Would it be a distraction to the user looking at javadoc if some methods are in a base class?
I believe that the javadoc can show the base class methods, and the javadoc can always be overridden even if the implementation is to call the super method.
Thanks for all the feedback @magicDGS. I've gone through it and made changes in a new commit.
Here is a proposal for the enums, as I think that I did not explain how I was imagining it.
/** The entry point for reading or writing a {@link HtsjdkReadsRdd}. */
public class HtsjdkReadsRddStorage {

/** An option for configuring how to write a {@link HtsjdkReadsRdd}. */
public interface WriteOption {
static AbstractSamSink getSink(
I thought of this more as a method on FormatWriteOption rather than here, and as a non-static method (see the proposal below).
"Unrecognized cardinality: " + fileCardinalityWriteOption); | ||
} | ||
} | ||
} | ||
|
||
/** An option for configuring which format to write a {@link HtsjdkReadsRdd} as. */ | ||
public enum FormatWriteOption implements WriteOption { |
This could be implemented as (I skipped some stuff):

public enum FormatWriteOption implements WriteOption {
  BAM(() -> new BamSink(), SamFormat.BAM)
  ...

  // initialized in constructor
  private final Supplier<AbstractSamSink> supplier;
  private final SamFormat format;
  ...

  public AbstractSamSink getSink() {
    return supplier.get();
  }

  public SamFormat getSamFormat() {
    return format;
  }
}
}

/** An option for configuring the number of files to write a {@link HtsjdkReadsRdd} as. */
public enum FileCardinalityWriteOption implements WriteOption {
/** Write a single file specified by the path. */
SINGLE,
/** Write multiple files in a directory specified by the path. */
MULTIPLE;
The same here; it could be refactored to be:

public enum FileCardinalityWriteOption implements WriteOption {
  SINGLE(opt -> opt.getSink()),
  MULTIPLE(opt -> new AnySamSinkMultiple(opt.getSamFormat()));

  // initialized on construction
  private Function<FormatWriteOption, AbstractSamSink> sinkProvider;
  ...

  public AbstractSamSink getSink(final FormatWriteOption option) {
    return sinkProvider.apply(option);
  }
}
}
return null;
}

public AbstractSamSource createAbstractSamSource(FileSystemWrapper fileSystemWrapper) {
switch (this) {
The same as before: instead of a switch on this, the way would be to have a Function<FileSystemWrapper, AbstractSamSource> instead, which makes it cleaner (and does not require checking for equality).
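For illustration, a sketch of the suggested function-based approach applied to SamFormat (only the constants quoted elsewhere in this PR are shown; the extension fields of the real enum are omitted, so treat this as an assumption rather than the actual code):

```java
import java.util.function.Function;

/** Sketch only: each constant supplies its own source, replacing the switch on this. */
public enum SamFormat {
  // BAM constant omitted here because its constructor is not quoted in this PR.
  CRAM(CramSource::new),
  SAM(fileSystemWrapper -> new SamSource());

  private final Function<FileSystemWrapper, AbstractSamSource> sourceProvider;

  SamFormat(Function<FileSystemWrapper, AbstractSamSource> sourceProvider) {
    this.sourceProvider = sourceProvider;
  }

  public AbstractSamSource createAbstractSamSource(FileSystemWrapper fileSystemWrapper) {
    return sourceProvider.apply(fileSystemWrapper);
  }
}
```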
tempPartsDirectory = path + ".parts";
}

WriteOption.getSink(formatWriteOption, fileCardinalityWriteOption)
In the case that the refactoring is done, this line would read fileCardinalityWriteOption.getSink(formatWriteOption), which is cleaner from my point of view. And it could be understood as: create a sink with a concrete file cardinality using the requested format.
I also think that storage classes and related enums/options should not be referenced in the other classes (e.g. …). Should we maybe discuss the design somewhere (Slack?) and try to deconvolute the classes to make the format more independent from the storage classes?
@magicDGS I've moved the WriteOption interface up to the top level as you suggested, and removed the switch statements. Please let me know what you think.
@lbergelson you might want to have a look at this too.
@tomwhite - I did a quick pass over it and it looks good to me (maybe the …). I am planning to review the implementation package whenever I have a bit more time.
@magicDGS Thanks for taking another look. Has anyone else got any comments before we commit this?
String tempPartsDirectory = null;
if (tempPartsDirectoryWriteOption != null) {
tempPartsDirectory = tempPartsDirectoryWriteOption.getTempPartsDirectory();
} else if (fileCardinalityWriteOption == FileCardinalityWriteOption.SINGLE) {
There was a PR in Hadoop-BAM for independently specifying where the temp parts directory lives. Not being able to configure it to something other than outputPath.parts can be restrictive with large files on restricted disks.
You can do this in Disq by setting a TempPartsDirectoryWriteOption.
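For example, a usage sketch; jsc (a JavaSparkContext) and htsjdkReadsRdd are assumed to already exist, and the String constructor argument of TempPartsDirectoryWriteOption is an assumption about its API:

```java
// Write a single BAM file, but keep the temporary part files on a scratch volume
// instead of next to the output path.
HtsjdkReadsRddStorage readsRddStorage = HtsjdkReadsRddStorage.makeDefault(jsc);
readsRddStorage.write(
    htsjdkReadsRdd,
    "/data/out.bam",
    FileCardinalityWriteOption.SINGLE,
    new TempPartsDirectoryWriteOption("/scratch/out.bam.parts")); // constructor arg assumed
```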
SerializableHadoopConfiguration confSer =
new SerializableHadoopConfiguration(jsc.hadoopConfiguration());

return pathSplitSource
Clarifying question: when you split CRAMs into partitions, are they split so that each CRAM container is held in its entirety within a single partition? And what happens in degenerate cases with very large CRAM containers that number fewer than the target number of partitions for splitting?
CRAM containers are not split, so each one is read entirely by exactly one partition. In the case of a container that is bigger than the file split size some partitions will be empty.
I tried to stay pretty high level. A few minor comments/preferences inline, and:

- I'd propose the use of "disq" as a prefix at the top level. So disqReadsRDD and disqReadsRDDStorage, or in my generified world (see inline comments), disqReads, disqReadsStorage.
- I'd prefer gradle and TestNG over maven and JUnit
- What's the state of 2bit ref support?
- Would be good to get a couple of hg38 test files in early on
- Are we going to support BCF?
* ReadsFormatWriteOption} and {@link FileCardinalityWriteOption}
* @throws IOException if an IO error occurs while writing
*/
public void write(HtsjdkReadsRdd htsjdkReadsRdd, String path, WriteOption... writeOptions)
I know the Java APIs use varargs like this in a few places, but I always find it awkward, especially in library APIs. Except in a few rare cases like String formatting. Perhaps I'm the only one.
Thanks for the comments @cmnbroad. I've responded to them below. I don't think any of them should hold up committing this as they can all be addressed separately.

> I'd propose the use of "disq" as a prefix at the top level.

So that would be DisqReads etc? That would duplicate the project/package name in the class, which is a bit redundant. I looked at generifying briefly but didn't see a huge gain. We can revisit that though.

> I'd prefer gradle and TestNG over maven and JUnit

I'm more familiar with Maven and JUnit, but I don't have an issue with the others.

> What's the state of 2bit ref support?

It's not supported. Ideally we'd have 2bit support in htsjdk, then we can use it here.

> Would be good to get a couple of hg38 test files in early on

Agreed

> Are we going to support BCF?

I wasn't planning to, unless there's a lot of demand for it.

> I know the Java APIs use varargs like this in a few places, but I always find it awkward especially in library APIs.

Do you have a preferred alternative?
This code passes the Travis build now. I've also opened #19 to address one of the issues identified by @magicDGS and @cmnbroad (we can open more if there are any more comments, e.g. from @lbergelson). I'd like to merge this in the next day or so unless there are any objections.

@tomwhite As discussed in person, my main comment on this PR is on the lack of checked-in realistic test data. I'd like to see some medium-sized (~a couple hundred MB) BAMs and CRAMs added, and tests run continuously on them. However, given that some decisions will have to be made first on how to host and version this large test data (git lfs? something else?), I'd be satisfied for now if you created a separate ticket for this task and addressed it in a future PR. Otherwise, 👍 from me.
Made a review pass through the public APIs.
pom.xml:
<maven.compiler.source>1.8</maven.compiler.source>
</properties>

<dependencies>
Dependency versions are better specified in a separate <dependencyManagement> section. I also like to define version properties so that dependencies from the same groupId with the same version can be updated in one place.
Is a <dependencyManagement> section useful if there are no child POMs? For the version numbers I've pulled out version properties for htsjdk, Hadoop, and Spark.
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
Spark support for Scala version 2.12 is coming soon, and there is a chance for backwards-incompatible changes in the Spark 2.x series. Do we need to consider releasing for a matrix of Spark & Spark+Scala versions?
Yes, that's a good idea. It would simplify our lives if we can avoid using Spark features that have backwards-incompatible changes.
pom.xml:
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>2.0.2</version>
Spark version 2.4.0 is in RC phase; 2.0.x is rather old at this point. What should our policy be for keeping up with Spark releases?
Yes, that is pretty old - I've updated to 2.2.2. Spark and Hadoop are provided dependencies, so it's possible to run against later versions of course. My thinking was to use slightly older versions of Spark and Hadoop to allow for a range; however, it would be good to test various versions.
* @param sparkContext the Spark context to use
* @return a {@link HtsjdkReadsRddStorage}
*/
public static HtsjdkReadsRddStorage makeDefault(JavaSparkContext sparkContext) {
Similar thing here: perhaps public API methods should accept SparkContext instead of JavaSparkContext.
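For illustration, a hypothetical overload (not in the PR) that would accept a SparkContext and delegate to the existing JavaSparkContext-based entry point:

```java
import org.apache.spark.SparkContext;
import org.apache.spark.api.java.JavaSparkContext;

// Hypothetical sketch: JavaSparkContext.fromSparkContext wraps a Scala SparkContext,
// so a SparkContext-based overload can simply delegate to the existing method.
public static HtsjdkReadsRddStorage makeDefault(SparkContext sparkContext) {
  return makeDefault(JavaSparkContext.fromSparkContext(sparkContext));
}
```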
}

/**
* @param splitSize the requested size of file splits when reading
What are the units for splitSize, number of records?
Bytes. Updated doc comment.
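For example, a usage sketch assuming a builder-style splitSize method matching the javadoc quoted above (jsc is an existing JavaSparkContext):

```java
// Request file splits of roughly 128 MB (the value is in bytes) when reading.
HtsjdkReadsRddStorage storage =
    HtsjdkReadsRddStorage.makeDefault(jsc)
        .splitSize(128 * 1024 * 1024);
```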
!(FilenameUtils.getBaseName(f).startsWith(".")
|| FilenameUtils.getBaseName(f).startsWith("_")))
.collect(Collectors.toList());
fileSystemWrapper.concat(conf, filteredParts, outputFile);
As far as I know, concat fails on encrypted HDFS file systems. We have a workaround in ADAM, mentioned previously.
Ah good to know. There's a fallback in HadoopFileSystemWrapper.
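For illustration, a hedged sketch of the try-concat-then-copy pattern being discussed (this is an assumption about the approach, not the actual HadoopFileSystemWrapper code):

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

// Hypothetical helper: try the filesystem-level concat, and fall back to a streaming copy
// when it is not supported (e.g. encrypted HDFS zones).
public final class ConcatWithFallback {
  public static void concat(Configuration conf, List<String> parts, String target)
      throws IOException {
    FileSystem fs = new Path(target).getFileSystem(conf);
    try {
      Path[] sources = parts.stream().map(Path::new).toArray(Path[]::new);
      fs.concat(new Path(target), sources); // fast path
    } catch (UnsupportedOperationException | IOException e) {
      // Fallback: rewrite the target by stream-copying each part in order.
      // (A real implementation would distinguish "unsupported" from other IO errors.)
      try (OutputStream out = fs.create(new Path(target), true)) {
        for (String part : parts) {
          try (InputStream in = fs.open(new Path(part))) {
            IOUtils.copyBytes(in, out, conf, false);
          }
        }
      }
    }
  }
}
```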
@heuermh thanks for the review! I've responded to your comments - please let me know what you think.
Opened #23 to capture my review comment.
@tomwhite I found it odd that this pull request was from a branch on the main repository rather than from a fork. Perhaps now would be a good time to discuss git workflow (issue #11) and related topics? Specifically, I'm wondering if we should squash commits in this pull request and use the Rebase and merge button (which is what we do in BDG projects, to keep commit history clean), or use the Squash and merge button (which lists individual commit description lines in the merge commit message), or merge as-is, retaining all the intermediate commits.
I'd vote for "Squash and merge" -- it's possible to edit the final squashed commit message from the GitHub UI when using that option, and it doesn't require the PR author to manually squash as a final step.
@tomwhite I have some comments. They're a bit randomly distributed since I didn't get through the whole PR.
* @param inflaterFactory InflaterFactory used by BlockCompressedInputStream
* @throws IOException
*/
BAMFileReader2(
We should do an htsjdk release soon...
👍
import org.disq_bio.disq.impl.formats.vcf.VcfSinkMultiple;

/** An option for configuring whether to write output in a single file, or multiple files. */
public enum FileCardinalityWriteOption implements WriteOption {
FileCardinalityWriteOption seems more confusing and hard to discover than it needs to be. Maybe FileSplittingWriteOption would be clearer?
The word "splitting" is a bit overloaded with (Hadoop) file splits. Maybe FileNumberWriteOption?
FileCardinalityWriteOption(
Function<ReadsFormatWriteOption, AbstractSamSink> samSinkProvider,
Function<VariantsFormatWriteOption, AbstractVcfSink> vcfSinkProvider) {
It seems a bit weird to have the enum know about the writers vs the writers knowing which options they support. Maybe this could be solved if the BAM/VCF writers shared some common ancestor?
Yes, it would be nice to invert this. Note that it doesn't leak into the public API at all, so not an issue from that point of view. (#26)
*/
public interface FileSystemWrapper extends Serializable {

boolean usesNio();
I think documenting this interface more completely would be useful; there are a lot of details to each of these operations and it's not clear what each relies on. E.g. how does delete handle directories, what does normalize guarantee, and what order does listDirectory specify?
I've added javadoc for all the methods in this class.
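For illustration, a documentation sketch for a few of the methods mentioned in this thread (the signatures and wording here are assumptions, not the actual interface):

```java
import java.io.IOException;
import java.io.Serializable;
import java.util.List;
import org.apache.hadoop.conf.Configuration;

/** Sketch only: illustrates the kind of javadoc being asked for, not the real FileSystemWrapper. */
public interface FileSystemWrapperSketch extends Serializable {

  /** @return true if this implementation is backed by java.nio rather than Hadoop filesystem APIs */
  boolean usesNio();

  /**
   * Returns an absolute, scheme-qualified form of the given path so that equivalent paths
   * compare equal.
   */
  String normalize(Configuration conf, String path) throws IOException;

  /**
   * Lists the files directly contained in the given directory (non-recursive), in lexicographic
   * order of their normalized paths.
   */
  List<String> listDirectory(Configuration conf, String path) throws IOException;

  /** Deletes a file, or a directory and all of its contents (recursive). */
  void delete(Configuration conf, String path) throws IOException;
}
```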
variantsFormatWriteOption ->
new VcfSinkMultiple(VcfFormat.fromFormatWriteOption(variantsFormatWriteOption)));

private final transient Function<ReadsFormatWriteOption, AbstractSamSink> samSinkProvider;
Why do these need to be transient? It seems weird.
Yes, it does indicate that it's not right. Opened #26
* computation and communication for you. (Of course this is only worthwhile if the underlying
* SeekableByteChannel doesn't already implement prefetching).
*/
public final class SeekableByteChannelPrefetcher implements SeekableByteChannel {
This can be replaced now that we've resolved broadinstitute/gatk#3500!
Yes, we can.
### Ordering Guarantees

This library does not do any sorting, so it is up to the user to understand what is being read or written. Furthermore,
no checks are carried out to ensure that the records being read or written are consistent with the header. E.g. it
Maybe we should consider adding a check like this. It tends to find bugs.
We could add it as an option (default off).
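For illustration, a sketch of what such an opt-in check could look like (the helper class and the broadcast variable are hypothetical, not part of this PR):

```java
import htsjdk.samtools.SAMFileHeader;
import htsjdk.samtools.SAMRecord;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.broadcast.Broadcast;

// Hypothetical helper: fails the Spark job lazily (when the RDD is evaluated, e.g. on write)
// if a mapped read references a contig that is missing from the header.
public final class HeaderConsistencyCheck {
  public static JavaRDD<SAMRecord> checkReferences(
      JavaRDD<SAMRecord> reads, Broadcast<SAMFileHeader> headerBroadcast) {
    return reads.map(
        read -> {
          if (!read.getReadUnmappedFlag()) {
            String ref = read.getReferenceName();
            if (headerBroadcast.getValue().getSequenceDictionary().getSequence(ref) == null) {
              throw new IllegalArgumentException(
                  "Read " + read.getReadName() + " references " + ref
                      + ", which is not in the header's sequence dictionary");
            }
          }
          return read;
        });
  }
}
```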
CRAM(".cram", ".crai", CramSource::new), | ||
SAM(".sam", null, fileSystemWrapper -> new SamSource()); | ||
|
||
private String extension; |
I'd make these final, mutable anything in an enum is always weird.
Done
SamFormat(
String extension,
String indexExtension,
Will there be problems when we support multiple index types for a format, e.g. bam.csi?
Yes, this may need changing at that point.
I agree it's probably more conventional for PRs to come from the contributor's fork - I'll do that in future. In terms of workflow, I'm fine with "Squash and merge".
Woot! Thank you for the initial contribution, @tomwhite, and to everyone for review.
Yay! |
Yes - many thanks for all the reviews!