Update to handle API reading and optimize sharded writing and indexing #165

iliat · 2016-02-04T15:00:43Z

This PR implements sharded index writing and removes the GroupByKey (GBK) stage from writing, relying on read-side sharding instead.

It exports a full genome BAM file of ~60GB in ~25min.

pgrosu · 2016-02-04T17:29:39Z

Hi Ilia,

Looks really nice, but I don't have time to test it now - maybe a little later. Just a few things in the meantime, in order to get the dependencies to the latest versions:

https://github.com/iliat/dataflow-java/blob/master/pom.xml#L72 -> s/1.1.0/1.4.0

https://github.com/iliat/dataflow-java/blob/master/pom.xml#L117 -> s/v1beta2-rev25-1.19.1/v1-rev56-1.21.0

https://github.com/iliat/dataflow-java/blob/master/pom.xml#L130 -> s/v1beta2-0.36/v1beta2-0.39

https://github.com/iliat/dataflow-java/blob/master/pom.xml#L194 -> s/1.128/2.1.0

https://github.com/iliat/dataflow-java/blob/master/pom.xml#L210 -> s/3.0.0-beta-1/3.0.0-beta-2

A re-test after these changes might not be a bad thing to perform, just to be sure everything passes.

Let me know what you think.

Thanks,
~p

deflaux · 2016-02-11T17:31:25Z

src/main/java/com/google/cloud/genomics/dataflow/utils/BreakFusionTransform.java

+   static class DummyMapFn<T> extends DoFn<T, KV<T, Integer>> {  
+    @Override
+    public void processElement(DoFn<T, KV<T, Integer>>.ProcessContext c) throws Exception {
+      c.output( KV.of(c.element(), 42));


just a nit, perhaps move 42 to a DUMMY_VALUE constant?

add a link to https://cloud.google.com/dataflow/service/dataflow-service-desc#Optimization

Great idea, DONE.
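For readers following the nit above, here is a minimal sketch of the named-constant version. It uses plain-Java stand-ins (`Map.Entry` in place of the SDK's `KV`, a static method in place of `DoFn.processElement`) so that it runs without the Dataflow SDK. The class name `DummyPairing` and the method `pairWithDummy` are hypothetical; only `DUMMY_VALUE` comes from the review comment, and this is not the code that was actually committed.

```java
import java.util.AbstractMap;
import java.util.Map;

public class DummyPairing {
  // Placeholder attached to every element; the specific number carries no meaning.
  static final int DUMMY_VALUE = 42;

  // Stand-in for DummyMapFn.processElement: pairs an element with the placeholder.
  static <T> Map.Entry<T, Integer> pairWithDummy(T element) {
    return new AbstractMap.SimpleImmutableEntry<>(element, DUMMY_VALUE);
  }

  public static void main(String[] args) {
    Map.Entry<String, Integer> kv = pairWithDummy("read-1");
    System.out.println(kv.getKey() + " -> " + kv.getValue()); // prints "read-1 -> 42"
  }
}
```

Naming the constant makes it obvious at the call site that the value exists only to give the element a key/value shape.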

deflaux · 2016-02-11T17:34:20Z

@iliat this looks really awesome. LGTM

Great speed up! I assume mvn verify passed all integration tests.

Nice cleanup on the file names too. Do you think some of this code is useful outside of the context of dataflow? If so, some other time it would be nice to move it elsewhere.

pgrosu · 2016-02-11T18:55:43Z

src/main/java/com/google/cloud/genomics/dataflow/utils/BreakFusionTransform.java

+import com.google.cloud.dataflow.sdk.values.PCollection;
+
+/*
+ * Breaks DataFlow fusion by doing GroubByKey/Ungroup that forces materialization of the data,


s/GroubByKey/GroupByKey

pgrosu · 2016-02-12T00:53:30Z

src/main/java/htsjdk/samtools/BinaryBAMShardIndexWriter.java

+  }
+
+  private void writeNullContent() {
+      codec.writeLong(0);  // 0 bins , 0 intv


Does intv mean interval? It might be nice to expand the comment here to explain why this is necessary, even if folks reading through the code will eventually figure it out.
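As background for the question above, here is a small sketch of what the single `writeLong(0)` appears to encode. It assumes the standard BAI layout from the SAM/BAM specification, in which each reference's section starts with a 32-bit bin count (`n_bin`) and carries a 32-bit linear-index interval count (`n_intv`). The class and method names are hypothetical, and `ByteBuffer` stands in for the htsjdk codec.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;

public class NullIndexContent {
  // Encodes an empty reference as two explicit 32-bit counts.
  static byte[] asTwoInts() {
    ByteBuffer buf = ByteBuffer.allocate(8).order(ByteOrder.LITTLE_ENDIAN);
    buf.putInt(0); // n_bin: no bins for this reference
    buf.putInt(0); // n_intv: no linear-index intervals for this reference
    return buf.array();
  }

  // Encodes the same eight zero bytes as one 64-bit zero, like the shorthand in the diff.
  static byte[] asOneLong() {
    ByteBuffer buf = ByteBuffer.allocate(8).order(ByteOrder.LITTLE_ENDIAN);
    buf.putLong(0L);
    return buf.array();
  }

  public static void main(String[] args) {
    System.out.println(Arrays.equals(asTwoInts(), asOneLong())); // prints "true"
  }
}
```

Under that assumption, a reference with no reads still needs its two zero counts so that a reader stepping through the index stays aligned with the next reference.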

jakeakopp · 2016-02-12T15:35:15Z

@iliat Everything I can understand looks good ;-)

iliat · 2016-02-15T18:36:03Z

@deflaux @pgrosu @jakeakopp Thanks for the review, I addressed most of the comments and will upload the version with these changes once I do a bit more validation beyond basic checks.

iliat · 2016-02-21T22:52:00Z

@deflaux - I can finally update the PR with a version that:

Addresses the comments above
Fixes an index generation bug and a long-lurking bug in GCSSeekableStream
Adds a BAMDiff tool that I used to compare BAM files exported with the API against those exported with this method, to ensure no reads are missed (it ignores unmapped reads for now).

pgrosu · 2016-02-21T23:10:08Z

Dude, there's some minor error in your JavaDoc - below is the link to the log for JDK8:

https://s3.amazonaws.com/archive.travis-ci.org/jobs/110838148/log.txt

Here's the start of the errors:

[ERROR] Failed to execute goal org.apache.maven.plugins:maven-javadoc-plugin:2.10.3:javadoc (default-cli) on project google-genomics-dataflow: An error has occurred in JavaDocs report generation:
[ERROR] Exit code: 1 - /home/travis/build/googlegenomics/dataflow-java/src/main/java/htsjdk/samtools/BAMShardIndexer.java:180: warning: no @param for log
[ERROR] public static void createIndex(SamReader reader, File output, Log log) {
[ERROR] ^
[ERROR] /home/travis/build/googlegenomics/dataflow-java/src/main/java/htsjdk/samtools/BAMShardIndexer.java:10: error: unexpected text
[ERROR] * @see https://github.com/samtools/htsjdk/blob/master/src/java/htsjdk/samtools/BAMIndexer.java
[ERROR] ^
[ERROR] /home/travis/build/googlegenomics/dataflow-java/src/main/java/htsjdk/samtools/BAMShardIndexer.java:14: warning - Tag @see:illegal character: "58" in "https://github.com/samtools/htsjdk/blob/master/src/java/htsjdk/samtools/BAMIndexer.java
[ERROR] and modified to support sharded index writing, where index for each reference is generated
[ERROR] separately and then the index shards are combined."
[ERROR] /home/travis/build/googlegenomics/dataflow-java/src/main/java/htsjdk/samtools/BAMShardIndexer.java:14: warning - Tag @see:illegal character: "47" in "https://github.com/samtools/htsjdk/blob/master/src/java/htsjdk/samtools/BAMIndexer.java
[ERROR] and modified to support sharded index writing, where index for each reference is generated
[ERROR] separately and then the index shards are combined."
[ERROR] /home/travis/build/googlegenomics/dataflow-java/src/main/java/htsjdk/samtools/BAMShardIndexer.java:14: warning - Tag @see:illegal character: "47" in "https://github.com/samtools/htsjdk/blob/master/src/java/htsjdk/samtools/BAMIndexer.java
[ERROR] and modified to support sharded index writing, where index for each reference is generated
[ERROR] separately and then the index shards are combined."
[ERROR] /home/travis/build/googlegenomics/dataflow-java/src/main/java/htsjdk/samtools
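The two kinds of complaint in the log above (a missing `@param` and a bare URL after `@see`) can usually be resolved as in the sketch below. JDK 8's Javadoc accepts an `@see` target that is a quoted string, an HTML anchor, or a program element reference, so the URL is wrapped in an `<a href>` element. The class `JavadocExample`, the method `describeIndex`, and the use of `java.util.logging.Logger` are hypothetical stand-ins so the example compiles without htsjdk; this is not the actual fix that was pushed.

```java
import java.io.File;
import java.util.logging.Logger;

/**
 * Hypothetical stand-in used only to show Javadoc that passes JDK 8 doclint.
 *
 * @see <a href="https://github.com/samtools/htsjdk/blob/master/src/java/htsjdk/samtools/BAMIndexer.java">BAMIndexer.java</a>
 */
public class JavadocExample {
  /**
   * Describes the index that would be written for an input file.
   *
   * @param input  the file to index
   * @param output the file the index would be written to
   * @param log    receives a progress message
   * @return a short description of the planned work
   */
  public static String describeIndex(File input, File output, Logger log) {
    String message = "index " + input.getName() + " -> " + output.getName();
    log.fine(message);
    return message;
  }

  public static void main(String[] args) {
    System.out.println(describeIndex(new File("in.bam"), new File("out.bai"),
        Logger.getLogger("example"))); // prints "index in.bam -> out.bai"
  }
}
```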

iliat · 2016-02-21T23:10:53Z

@pgrosu Yeah, on it :)

pgrosu · 2016-02-21T23:11:24Z

Ah, cool man :)

pgrosu · 2016-02-22T04:35:38Z

src/main/java/com/google/cloud/genomics/dataflow/utils/BAMDiff.java

+    ret = compareReadGroups(h1, h2) && ret;
+    ret = compareProgramRecords(h1, h2) && ret;
+    return ret;
+}


Way too cryptic! Try the following instead free of charge :)

private boolean compareHeaders(SAMFileHeader h1, SAMFileHeader h2) throws Exception {
  if ( !compareSequenceDictionaries(h1, h2) ) {
    return false;
  } else if ( compareValues(h1.getCreator(), h2.getCreator(), "File creator") &&
              compareValues(h1.getAttribute("SO"), h2.getAttribute("SO"), "Sort order") &&
              compareReadGroups(h1, h2) &&
              compareProgramRecords(h1, h2) ) {
    if ( !options.ignoreFileFormatVersion ) {
      return compareValues(h1.getVersion(), h2.getVersion(), "File format version");
    } else {
      return true;
    }
  } else {
    return false;
  }
}
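One thing worth weighing between the two styles: `ret = check() && ret` always evaluates `check()`, whereas a chained `&&` stops at the first failure. The sketch below illustrates the difference, assuming the compare helpers report mismatches as a side effect (the hunk does not show whether they do). `CompareStyles`, its `reported` list, and this simplified `compareValues` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class CompareStyles {
  final List<String> reported = new ArrayList<>();

  // Hypothetical check: records the label when the two values differ.
  boolean compareValues(String a, String b, String label) {
    boolean same = a.equals(b);
    if (!same) {
      reported.add(label);
    }
    return same;
  }

  // Accumulating style from the diff: every check runs, so every mismatch is reported.
  boolean accumulate() {
    boolean ret = true;
    ret = compareValues("a", "b", "creator") && ret;
    ret = compareValues("x", "y", "sort order") && ret;
    return ret;
  }

  // Short-circuit style: evaluation stops at the first failing check.
  boolean shortCircuit() {
    return compareValues("a", "b", "creator")
        && compareValues("x", "y", "sort order");
  }

  public static void main(String[] args) {
    CompareStyles first = new CompareStyles();
    first.accumulate();
    System.out.println(first.reported); // prints "[creator, sort order]"

    CompareStyles second = new CompareStyles();
    second.shortCircuit();
    System.out.println(second.reported); // prints "[creator]"
  }
}
```

If a diff tool is meant to list every difference in one run, the accumulating form may be intentional; if only a pass/fail answer matters, the short-circuit form is easier to read.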

iliat · 2016-02-29T18:51:18Z

PTAL @deflaux

deflaux · 2016-03-02T17:32:53Z

LGTM @iliat Again, really nice work here.

Please file issues for stuff to be done in future PRs. Thanks!!!

iliat assigned jakeakopp and deflaux and unassigned jakeakopp Feb 4, 2016

iliat force-pushed the sharded-export branch from ad4ceb2 to 9fa7583 Compare February 5, 2016 16:18

deflaux reviewed Feb 11, 2016

pgrosu reviewed Feb 11, 2016

pgrosu reviewed Feb 12, 2016

iliat mentioned this pull request Feb 15, 2016

Push HeaderInfo down into utils-java #168

Open

iliat force-pushed the sharded-export branch from 9fa7583 to 239b709 Compare February 21, 2016 22:49

iliat force-pushed the sharded-export branch from 239b709 to e09a973 Compare February 21, 2016 23:10

iliat force-pushed the sharded-export branch from e09a973 to 2878b9e Compare February 21, 2016 23:19

pgrosu reviewed Feb 22, 2016

iliat force-pushed the sharded-export branch from 2878b9e to 9df5a43 Compare February 22, 2016 18:58

Update to handle API reading and optimize

8ca7898

iliat force-pushed the sharded-export branch from 9df5a43 to 8ca7898 Compare March 2, 2016 03:23

iliat added a commit that referenced this pull request Mar 2, 2016

Merge pull request #165 from iliat/sharded-export

3a3e17c

Update to handle API reading and optimize sharded writing and indexing

iliat merged commit 3a3e17c into googlegenomics:master Mar 2, 2016
