[DISQ-10] Initial Disq code contribution. #14

Merged
merged 6 commits into from Sep 4, 2018

Conversation

tomwhite
Member

Corresponds to tomwhite@d3d76ec.

@heuermh heuermh changed the title Initial Disq code contribution. [DISQ-10] Initial Disq code contribution. Jul 19, 2018
@heuermh
Contributor

heuermh commented Jul 19, 2018

Fixes #10.

Contributor

@magicDGS magicDGS left a comment

Reviewed the classes in the org.disq_bio.disq package. I did a pass through htsjdk.samtools too, but without getting into details, because that will be ported to htsjdk.

import java.util.NoSuchElementException;

/** Class for reading and querying BAM files. */
public class BAMFileReader2 extends SamReader.ReaderImplementation {
Contributor

This looks like a huge class to be used for only one method (BAMFileReader2.findVirtualOffsetOfFirstRecord(seekableStream) in BAMSBIIndexer.findVirtualOffsetOfFirstRecord).

Member Author

It is; however, it's a temporary workaround until the upstream PR is available.

* SBI is an index into BGZF-compressed data files, which has an entry for the file position of the
* start of every <i>n</i>th record. Reads files that were created by {@link BAMSBIIndexer}.
*/
public final class SBIIndex implements Serializable {
Contributor

If this will be in htsjdk, add a TODO pointing to the upstream PR, to be removed once htsjdk is upgraded with that change.

Member Author

Good idea - I'll do that.

Member Author

Done

* in an index every <i>n</i>th record. When there are no records left call {@link #finish} to
* complete writing the index.
*/
public final class SBIIndexWriter {
Contributor

If this will be in htsjdk, add a TODO pointing to the upstream PR, to be removed once htsjdk is upgraded with that change.

*
* @see HtsjdkReadsRddStorage
*/
public class HtsjdkReadsRdd {
Contributor

Why does it require the Htsjdk prefix (here and in other code in this package)? If we said that this library targets the htsjdk library interfaces (in the future, I hope, the proper interfaces in htsjdk-next), we do not need this prefix.

Member Author

I added it because of a discussion we had a while back about using different data models (i.e. non-htsjdk) for reads and variants. Interested to see what others think too.

Contributor

If changed to RecordsRdd<Metadata, Record>, the implementation can either maintain the prefix or be called SAMRecordRdd.

Contributor

I'd prefer to see the Htsjdk prefixes not be in the top level of the API as well. The metadata and record types can be generic as discussed here and in some other comments. Not that it should block the initial commit, but I do like the idea. Also, is it worth abstracting RDD to a type parameter at top level to ease migration to DataSet ? It would help with unifying interface.

Contributor

Please keep the Htsjdk prefixes for now.

public class HtsjdkReadsRdd {

private final SAMFileHeader header;
private final JavaRDD<SAMRecord> reads;
Contributor

Are the reads header-less? If so, add that to the top-level class javadoc.

Member Author

I added some more javadoc here.

*
* @see HtsjdkVariantsRddStorage
*/
public class HtsjdkVariantsRdd {
Contributor

This could be a common interface for reads/variants and any other kind of record (it would be nice to have FASTQ support eventually, and other tribble features). What about using a RecordsRdd<Metadata, Record>? For reads it would be RecordsRdd<SAMFileHeader, SAMRecord> and for variants RecordsRdd<VCFHeader, VariantContext> - that is more extensible for other formats (and for moving to a different library, or htsjdk-next, in the future).
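
A possible sketch of that idea (the interface and method names here are illustrative, not part of this PR):

import org.apache.spark.api.java.JavaRDD;

/** Hypothetical common interface, parameterized by the metadata (header) type and the record type. */
public interface RecordsRdd<Metadata, Record> {

  /** The file-level metadata, e.g. a SAMFileHeader or a VCFHeader. */
  Metadata getMetadata();

  /** The distributed records, e.g. SAMRecord or VariantContext objects. */
  JavaRDD<Record> getRecords();
}

Reads would then be a RecordsRdd<SAMFileHeader, SAMRecord>, and variants a RecordsRdd<VCFHeader, VariantContext>.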

Member Author

Yes, that's a good idea.

Member Author

This probably needs more discussion. Should HtsjdkReadsRdd still exist and implement the interface? Should its only methods be getMetadata() and getRecords(), or should it also have getHeader() and getReads()? I'd be open to evolving this later - I think it's OK for the API to be unstable for a while.

Contributor

I think that for now the best option is to keep HtsjdkReadsRdd extending the interface, with the more explicit methods (getHeader/getReads), and then decide whether the implementation is really required. This is in line with some discussion we had in htsjdk-next-beta, where we are planning to add a high-level abstraction for every record reader/iterator with common methods. Maybe at some point Disq can directly extend those interfaces to make processing on Spark/Hadoop, when possible, more similar to processing on a local machine.
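
A rough sketch of that arrangement, reusing the hypothetical RecordsRdd interface sketched above (the constructor and exact method set here are assumptions, not taken from this PR):

import htsjdk.samtools.SAMFileHeader;
import htsjdk.samtools.SAMRecord;
import org.apache.spark.api.java.JavaRDD;

public class HtsjdkReadsRdd implements RecordsRdd<SAMFileHeader, SAMRecord> {

  private final SAMFileHeader header;
  private final JavaRDD<SAMRecord> reads;

  public HtsjdkReadsRdd(SAMFileHeader header, JavaRDD<SAMRecord> reads) {
    this.header = header;
    this.reads = reads;
  }

  /** Explicit, reads-specific accessor. */
  public SAMFileHeader getHeader() {
    return header;
  }

  /** Explicit, reads-specific accessor. */
  public JavaRDD<SAMRecord> getReads() {
    return reads;
  }

  /** Generic accessor from the hypothetical RecordsRdd interface. */
  @Override
  public SAMFileHeader getMetadata() {
    return header;
  }

  /** Generic accessor from the hypothetical RecordsRdd interface. */
  @Override
  public JavaRDD<SAMRecord> getRecords() {
    return reads;
  }
}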

Member Author

Ideally I would like to make distributed processing as similar as possible to local processing, although that's not always possible. Do you have a link to the discussion in htsjdk-next-beta?

Contributor

We are discussing this reader design in Slack.
@lbergelson, you were the one proposing a RecordIterator<Record, Metadata> and a RecordReader<Record, Metadata, QueryKey>.
@tomwhite - if you want, I can ask the rest of the maintainers to add you to the Slack chat (I think that will be fine).

Contributor

I like the idea of having both a super interface and also a specialized class with convenience methods for reads.

public class HtsjdkVariantsRddStorage {

/** An option for configuring how to write a {@link HtsjdkVariantsRdd}. */
public interface WriteOption {}
Contributor

Same as before - extract as VariantsWriteOption instead

}

/** An option for configuring the number of files to write a {@link HtsjdkVariantsRdd} as. */
public enum FileCardinalityWriteOption implements WriteOption {
Contributor

This could be shared by reads/variants/other - maybe have a common WriteOption for all kinds of writing, and for the format a separate FormatOption class (whether or not it extends WriteOption).

* An option for controlling which directory to write temporary part files to when writing a
* {@link HtsjdkVariantsRdd} as a single file.
*/
public static class TempPartsDirectoryWriteOption implements HtsjdkReadsRddStorage.WriteOption {
Contributor

Same as cardinality option

import org.disq_bio.disq.impl.formats.vcf.VcfSource;

/** The entry point for reading or writing a {@link HtsjdkVariantsRdd}. */
public class HtsjdkVariantsRddStorage {
Contributor

There is a lot of parallelism between this class and its reads counterpart - maybe an abstract class would keep both in sync (I saw that there is no NIO support in variants yet, but it could throw for now).

Member Author

That makes sense too.

Member Author

Looking again, there's some commonality, but I'm not sure how useful it is to have an abstract base class at this point. Would it be a distraction to the user looking at javadoc if some methods are in a base class?

Contributor

I believe that the javadoc can show the base class methods, and the javadoc can always be overridden even if the implementation just calls the super method.

@tomwhite
Member Author

Thanks for all the feedback @magicDGS. I've gone through them and made changes in a new commit.

Contributor

@magicDGS magicDGS left a comment

Here is a proposal for the enums, as I don't think I explained how I was imagining it.


/** The entry point for reading or writing a {@link HtsjdkReadsRdd}. */
public class HtsjdkReadsRddStorage {

/** An option for configuring how to write a {@link HtsjdkReadsRdd}. */
public interface WriteOption {}
public interface WriteOption {
static AbstractSamSink getSink(
Contributor

I thought of this more as a method on FormatWriteOption instead of here, and a non-static one (see the proposal below).

"Unrecognized cardinality: " + fileCardinalityWriteOption);
}
}
}

/** An option for configuring which format to write a {@link HtsjdkReadsRdd} as. */
public enum FormatWriteOption implements WriteOption {
Contributor

This could be implemented as (I skipped some stuff):

public enum FormatWriteOption implements WriteOption {
    BAM(() -> new BamSink(), SamFormat.BAM)

    ...

    // initialized in constructor
    private final Supplier<AbstractSamSink> supplier;
    private final SamFormat format;

    ...

    public AbstractSamSink  getSink() {
        return supplier.get();
    }

   public SamFormat getSamFormat() {
       return format;
   }
}

}

/** An option for configuring the number of files to write a {@link HtsjdkReadsRdd} as. */
public enum FileCardinalityWriteOption implements WriteOption {
/** Write a single file specified by the path. */
SINGLE,
/** Write multiple files in a directory specified by the path. */
MULTIPLE
MULTIPLE;
Contributor

Same here - it could be refactored to be:

public enum FileCardinalityWriteOption implements WriteOption {
    SINGLE(opt -> opt.getSink()),
    MULTIPLE(opt -> new AnySamSinkMultiple(opt.getSamFormat()));

    // initialized on construction
    private Function<FormatWriteOption, AbstractSamSink> sinkProvider;

    ...

    public AbstractSamSink  getSink(final FormatWriteOption option) {
        return sinkProvider.apply(option);
    }
}

@@ -54,4 +57,17 @@ public static SamFormat fromPath(String path) {
}
return null;
}

public AbstractSamSource createAbstractSamSource(FileSystemWrapper fileSystemWrapper) {
switch (this) {
Contributor

The same as before: instead of a switch on this, the way to go is to have a Function<FileSystemWrapper, AbstractSamSource> instead, which makes it cleaner (and does not require checking for equality).
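
As an illustration, the Function-based SamFormat might look roughly like this (the per-format factory expressions are assumptions about the source constructors):

import java.util.function.Function;

public enum SamFormat {
  BAM(".bam", ".bai", BamSource::new), // factory references are illustrative; actual constructors may differ
  CRAM(".cram", ".crai", CramSource::new),
  SAM(".sam", null, fileSystemWrapper -> new SamSource());

  private final String extension;
  private final String indexExtension;
  // Each constant carries its own source factory, so no switch on `this` is needed.
  private final Function<FileSystemWrapper, AbstractSamSource> sourceProvider;

  SamFormat(
      String extension,
      String indexExtension,
      Function<FileSystemWrapper, AbstractSamSource> sourceProvider) {
    this.extension = extension;
    this.indexExtension = indexExtension;
    this.sourceProvider = sourceProvider;
  }

  public AbstractSamSource createAbstractSamSource(FileSystemWrapper fileSystemWrapper) {
    return sourceProvider.apply(fileSystemWrapper);
  }
}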

@@ -155,37 +178,12 @@ public void write(HtsjdkVariantsRdd htsjdkVariantsRdd, String path, WriteOption.
tempPartsDirectory = path + ".parts";
}

getSink(formatWriteOption, fileCardinalityWriteOption)
WriteOption.getSink(formatWriteOption, fileCardinalityWriteOption)
Contributor

If the refactoring is done, this line will read fileCardinalityWriteOption.getSink(formatWriteOption), which is cleaner from my point of view. It can be understood as: create a sink with a concrete file cardinality using the requested format.

@magicDGS
Contributor

I also think that storage classes and related enums/options should not be referenced in the other classes (e.g. from* methods in SamFormat). That's why I propose to have the suppliers in the enums instead, and now I understand why separating the PR into several bits was a bit difficult.

Should we maybe discuss the design somewhere (Slack?) and try to disentangle the classes to make the format more independent of the storage classes?

@tomwhite
Member Author

@magicDGS I've moved the WriteOptions interface up to the top-level as you suggested, and removed the switch statements. Please let me know what you think.

@tomwhite
Member Author

@lbergelson you might want to have a look at this too.

@magicDGS
Contributor

magicDGS commented Aug 3, 2018

@tomwhite - I did a quick pass over it and it looks good to me. Maybe the SamFormat and ReadsFormatWriteOption converters should be linked in the reverse direction, but that could wait and/or we can discuss it later as the code evolves once it is in.

I am planning to review the implementation package whenever I have a bit more time.

@tomwhite
Member Author

tomwhite commented Aug 6, 2018

@magicDGS Thanks for taking another look.

Has anyone else got any comments before we commit this?

@lbergelson lbergelson self-requested a review August 13, 2018 15:45
String tempPartsDirectory = null;
if (tempPartsDirectoryWriteOption != null) {
tempPartsDirectory = tempPartsDirectoryWriteOption.getTempPartsDirectory();
} else if (fileCardinalityWriteOption == FileCardinalityWriteOption.SINGLE) {

There was a PR in Hadoop-BAM for independently specifying where the temp parts directory lives. Not being able to configure it to something other than outputPath.parts can be restrictive with large files on restricted disks.

Member Author

You can do this in Disq by setting a TempPartsDirectoryWriteOption.
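
Roughly like the following sketch, assuming the option's constructor takes the temporary directory path (the constructor shape and class nesting are not shown in this diff):

import java.io.IOException;
import org.apache.spark.api.java.JavaSparkContext;

public class TempPartsExample {
  public static void writeWithCustomTempDir(JavaSparkContext jsc, HtsjdkReadsRdd reads)
      throws IOException {
    HtsjdkReadsRddStorage.makeDefault(jsc)
        .write(
            reads,
            "hdfs:///data/out.bam",
            FileCardinalityWriteOption.SINGLE,
            new TempPartsDirectoryWriteOption("hdfs:///scratch/out.parts")); // temp parts go here
  }
}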

SerializableHadoopConfiguration confSer =
new SerializableHadoopConfiguration(jsc.hadoopConfiguration());

return pathSplitSource

Clarifying question: when you split CRAMs into partitions, are they split so that each CRAM container is held in its entirety within a partition? And what happens in degenerate cases with very large CRAM containers that number fewer than the target number of partitions for splitting?

Member Author

CRAM containers are not split, so each one is read entirely by exactly one partition. In the case of a container that is bigger than the file split size some partitions will be empty.

Contributor

@cmnbroad cmnbroad left a comment

I tried to stay pretty high level. A few minor comments/preferences inline, and:

  • I'd propose the use of "disq" as a prefix at the top level. So disqReadsRDD and disqReadsRDDStorage, or in my generified world (see inline comments), disqReads, disqReadsStorage.
  • I'd prefer Gradle and TestNG over Maven and JUnit
  • What's the state of 2bit ref support?
  • Would be good to get a couple of hg38 test files in early on
  • Are we going to support BCF?

* ReadsFormatWriteOption} and {@link FileCardinalityWriteOption}
* @throws IOException if an IO error occurs while writing
*/
public void write(HtsjdkReadsRdd htsjdkReadsRdd, String path, WriteOption... writeOptions)
Contributor

I know the Java APIs use varargs like this in a few places, but I always find it awkward, especially in library APIs, except in a few rare cases like String formatting. Perhaps I'm the only one.

Member Author

Thanks for the comments @cmnbroad. I've responded to them below. I don't think any of them should hold up committing this as they can all be addressed separately.

I'd propose the use of "disq" as a prefix at the top level.

So that would be DisqReads etc? That would duplicate the project/package name in the class, which is a bit redundant. I looked at generifying briefly but didn't see a huge gain. We can revisit that though.

I'd prefer Gradle and TestNG over Maven and JUnit

I'm more familiar with Maven and JUnit, but I don't have an issue with the others.

What's the state of 2bit ref support?

It's not supported. Ideally we'd have 2bit support in htsjdk, then we can use it here.

Would be good to get a couple of hg38 test files in early on

Agreed

Are we going to support BCF?

I wasn't planning to, unless there's a lot of demand for it.

I know the Java APIs use varargs like this in a few places, but I always find it awkward especially in library APIs.

Do you have a preferred alternative?

@tomwhite
Member Author

This code passes the Travis build now. I've also opened #19 to address one of the issues identified by @magicDGS and @cmnbroad (we can open more if there are any more comments, e.g. from @lbergelson).

I'd like to merge this in the next day or so unless there are any objections.

@heuermh heuermh self-requested a review August 28, 2018 15:53
@droazen

droazen commented Aug 28, 2018

@tomwhite As discussed in person, my main comment on this PR is on the lack of checked-in realistic test data. I'd like to see some medium-sized (~a couple hundred MB) bams and crams added, and tests run continuously on them. However, given that some decisions will have to be made first on how to host and version this large test data (git lfs? something else?), I'd be satisfied for now if you created a separate ticket for this task and addressed it in a future PR. Otherwise, 👍 from me.

Contributor

@heuermh heuermh left a comment

Made a review pass through the public APIs.

pom.xml
<maven.compiler.source>1.8</maven.compiler.source>
</properties>

<dependencies>
Contributor

Dependency versions are better specified in a separate <dependencyManagement> section. I also like to define version properties so that dependencies from the same groupId with the same version can be updated in one place.

Member Author

Is a <dependencyManagement> section useful if there are no child POMs? For the version numbers I've pulled out version properties for htsjdk, Hadoop, and Spark.

</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
Contributor

Spark support for Scala version 2.12 is coming soon, and there is a chance for backwards-incompatible changes in the Spark 2.x series. Do we need to consider releasing for a matrix of Spark & Spark+Scala versions?

Member Author

Yes, that's a good idea. It would simplify our lives if we can avoid using Spark features that have backwards-incompatible changes.

pom.xml
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>2.0.2</version>
Contributor

Spark version 2.4.0 is in RC phase; 2.0.x is rather old at this point. What should our policy be for keeping up with Spark releases?

Member Author

Yes, that is pretty old - I've updated to 2.2.2.

Spark and Hadoop are provided dependencies, so it's possible to run against later versions, of course. My thinking was to use slightly older versions of Spark and Hadoop to allow for a range; however, it would be good to test various versions.

* @param sparkContext the Spark context to use
* @return a {@link HtsjdkReadsRddStorage}
*/
public static HtsjdkReadsRddStorage makeDefault(JavaSparkContext sparkContext) {
Contributor

Similar thing here, perhaps public API methods should accept SparkContext instead of JavaSparkContext.
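
For example, a hypothetical overload (not in this PR) could delegate to the existing JavaSparkContext-based factory:

/** Hypothetical overload accepting a Scala SparkContext. */
public static HtsjdkReadsRddStorage makeDefault(org.apache.spark.SparkContext sparkContext) {
  // JavaSparkContext.fromSparkContext wraps the Scala context for the Java API.
  return makeDefault(org.apache.spark.api.java.JavaSparkContext.fromSparkContext(sparkContext));
}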

}

/**
* @param splitSize the requested size of file splits when reading
Contributor

What are the units for splitSize? Number of records?

Member Author

Bytes. Updated doc comment.

!(FilenameUtils.getBaseName(f).startsWith(".")
|| FilenameUtils.getBaseName(f).startsWith("_")))
.collect(Collectors.toList());
fileSystemWrapper.concat(conf, filteredParts, outputFile);
Contributor

As far as I know, concat fails on encrypted HDFS file systems. We have a workaround in ADAM, mentioned previously.

Member Author

Ah good to know. There's a fallback in HadoopFileSystemWrapper.
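
As a generic illustration (not the actual HadoopFileSystemWrapper code), such a fallback typically stream-copies the part files when FileSystem#concat is unavailable:

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public final class CopyConcatFallback {
  /** Concatenate part files by streaming bytes when FileSystem#concat is not supported. */
  public static void copyConcat(Configuration conf, List<String> parts, String target)
      throws IOException {
    Path targetPath = new Path(target);
    FileSystem fs = targetPath.getFileSystem(conf);
    try (OutputStream out = fs.create(targetPath)) {
      for (String part : parts) {
        Path partPath = new Path(part);
        try (InputStream in = partPath.getFileSystem(conf).open(partPath)) {
          IOUtils.copyBytes(in, out, conf, false); // false: keep the output stream open between parts
        }
      }
    }
  }
}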

@tomwhite
Member Author

@heuermh thanks for the review! I've responded to your comments - please let me know what you think.

@droazen

droazen commented Aug 31, 2018

Opened #23 to capture my review comment.

@heuermh
Contributor

heuermh commented Aug 31, 2018

@tomwhite I found it odd that this pull request was from a branch on the disq-bio/disq repo instead of one on your fork. I guess there isn't anything wrong with that, just not what I'm used to in other GitHub-based projects.

Perhaps now would be a good time to discuss git workflow (issue #11) and related topics?

Specifically, I'm wondering if we should squash commits in this pull request and use the Rebase and merge button (which is what we do in BDG projects, to keep commit history clean) or use the Squash and merge button (which lists individual commit description lines in the merge commit message) or merge as-is, retaining all the intermediate commits.

@droazen

droazen commented Aug 31, 2018

I'd vote for "Squash and merge" -- it's possible to edit the final squashed commit message from the github UI when using that option, and it doesn't require the PR author to manually squash as a final step.

Contributor

@lbergelson lbergelson left a comment

@tomwhite I have some comments. They're a bit randomly distributed since I didn't get through the whole PR.

* @param inflaterFactory InflaterFactory used by BlockCompressedInputStream
* @throws IOException
*/
BAMFileReader2(
Contributor

We should do an htsjdk release soon...

Member Author

👍

import org.disq_bio.disq.impl.formats.vcf.VcfSinkMultiple;

/** An option for configuring whether to write output in a single file, or multiple files. */
public enum FileCardinalityWriteOption implements WriteOption {
Contributor

FileCardinalityWriteOption seems more confusing and hard to discover than it needs to be. Maybe FileSplittingWriteOption would be clearer?

Member Author

The word "splitting" is a bit overloaded with (Hadoop) file splits. Maybe FileNumberWriteOption?


FileCardinalityWriteOption(
Function<ReadsFormatWriteOption, AbstractSamSink> samSinkProvider,
Function<VariantsFormatWriteOption, AbstractVcfSink> vcfSinkProvider) {
Contributor

It seems a bit weird to have the enum know about the writers vs. the writers knowing which options they support. Maybe this could be solved if the BAM/VCF writers shared some common ancestor?

Member Author

Yes, it would be nice to invert this. Note that it doesn't leak into the public API at all, so not an issue from that point of view. (#26)

*/
public interface FileSystemWrapper extends Serializable {

boolean usesNio();
Contributor

I think documenting this interface more completely would be useful; there are a lot of details to each of these operations, and it's not clear what can be relied on.

E.g. how does delete handle directories, what does normalize guarantee, and what order does listDirectory return entries in?

Member Author

I've added javadoc for all the methods in this class.

variantsFormatWriteOption ->
new VcfSinkMultiple(VcfFormat.fromFormatWriteOption(variantsFormatWriteOption)));

private final transient Function<ReadsFormatWriteOption, AbstractSamSink> samSinkProvider;
Contributor

Why do these need to be transient? It seems weird.

Member Author

Yes, it does indicate that it's not right. Opened #26

* computation and communication for you. (Of course this is only worthwhile if the underlying
* SeekableByteChannel doesn't already implement prefetching).
*/
public final class SeekableByteChannelPrefetcher implements SeekableByteChannel {
Contributor

This can be replaced now that we've resolved broadinstitute/gatk#3500!

Member Author

Yes, we can.

### Ordering Guarantees

This library does not do any sorting, so it is up to the user to understand what is being read or written. Furthermore,
no checks are carried out to ensure that the records being read or written are consistent with the header. E.g. it
Contributor

Maybe we should consider adding a check like this. It tends to find bugs.

Member Author

We could add it as an option (default off).
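
As a sketch of what such an opt-in check could do (illustrative only, not part of this PR), each record's reference could be validated against the header's sequence dictionary:

import htsjdk.samtools.SAMFileHeader;
import htsjdk.samtools.SAMRecord;

public final class HeaderConsistencyCheck {
  /** Throws if an aligned read refers to a reference sequence that is not in the header. */
  public static void validate(SAMFileHeader header, SAMRecord read) {
    String ref = read.getReferenceName();
    if (!SAMRecord.NO_ALIGNMENT_REFERENCE_NAME.equals(ref)
        && header.getSequenceDictionary().getSequence(ref) == null) {
      throw new IllegalArgumentException(
          "Read " + read.getReadName() + " refers to " + ref + ", which is not in the header");
    }
  }
}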

CRAM(".cram", ".crai", CramSource::new),
SAM(".sam", null, fileSystemWrapper -> new SamSource());

private String extension;
Contributor

I'd make these final; anything mutable in an enum is always weird.

Member Author

Done


SamFormat(
String extension,
String indexExtension,
Contributor

Will there be problems when we support multiple index types for a format? E.g. bam.csi.

Member Author

Yes, this may need changing at that point.

@tomwhite
Member Author

tomwhite commented Sep 4, 2018

I found it odd that this pull request was from a branch on the disq-bio/disq repo instead of one on your fork

I agree it's probably more conventional for PRs to come from the contributor's fork - I'll do that in future.

In terms of workflow, I'm fine with "Squash and merge".

@tomwhite tomwhite merged commit 11938a5 into master Sep 4, 2018
@heuermh
Contributor

heuermh commented Sep 4, 2018

Woot! Thank you for the initial contribution, @tomwhite, and to everyone for review.

@lbergelson
Contributor

Yay!

@tomwhite
Member Author

tomwhite commented Sep 4, 2018

Yes - many thanks for all the reviews!

@tomwhite tomwhite deleted the tw_disq_initial branch September 4, 2018 16:29
@heuermh heuermh added this to the 0.1.0 milestone Oct 24, 2018