Skip to content
This repository has been archived by the owner on Oct 29, 2023. It is now read-only.

Update to handle API reading and optimize sharded writing and indexing #165

Merged
merged 1 commit into from
Mar 2, 2016

Conversation

iliat
Copy link
Member

@iliat iliat commented Feb 4, 2016

It implements sharded index writing and removes GBK stage in writing, relying on the read side sharding.

It exports a full genome BAM file of ~60GB in ~25min.

@iliat iliat assigned jakeakopp and deflaux and unassigned jakeakopp Feb 4, 2016
@pgrosu
Copy link

pgrosu commented Feb 4, 2016

Hi Ilia,

Looks really nice, but I don't have time to test it now - maybe a little later. Just a few things in the meantime, in order to get the dependencies to the latest versions:

https://github.com/iliat/dataflow-java/blob/master/pom.xml#L72 -> s/1.1.0/1.4.0

https://github.com/iliat/dataflow-java/blob/master/pom.xml#L117 -> s/v1beta2-rev25-1.19.1/v1-rev56-1.21.0

https://github.com/iliat/dataflow-java/blob/master/pom.xml#L130 -> s/v1beta2-0.36/v1beta2-0.39

https://github.com/iliat/dataflow-java/blob/master/pom.xml#L194 -> s/1.128/2.1.0

https://github.com/iliat/dataflow-java/blob/master/pom.xml#L210 -> s/3.0.0-beta-1/3.0.0-beta-2

A re-test after these changes might not be a bad thing to perform, just to be sure everything passes.

Let me know what you think.

Thanks,
~p

static class DummyMapFn<T> extends DoFn<T, KV<T, Integer>> {
@Override
public void processElement(DoFn<T, KV<T, Integer>>.ProcessContext c) throws Exception {
c.output( KV.of(c.element(), 42));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just a nit, perhaps move 42 to a DUMMY_VALUE constant?

add a link to https://cloud.google.com/dataflow/service/dataflow-service-desc#Optimization

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great idea, DONE.

@deflaux
Copy link
Contributor

deflaux commented Feb 11, 2016

@iliat this looks really awesome. LGTM

Great speed up! I assume mvn verify passed all integration tests.

Nice cleanup on the file names too. Do you think some of this code is useful outside of the context of dataflow? If so, some other time it would be nice to move it elsewhere.

import com.google.cloud.dataflow.sdk.values.PCollection;

/*
* Breaks DataFlow fusion by doing GroubByKey/Ungroup that forces materialization of the data,
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/GroubByKey/GroupByKey

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

}

private void writeNullContent() {
codec.writeLong(0); // 0 bins , 0 intv
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does intv mean interval? It might be nice to expand on the comment here and why this is necessary, even if folks reading through the code will eventually figure it out.

@jakeakopp
Copy link
Contributor

@iliat Everything I can understand looks good ;-)

@iliat
Copy link
Member Author

iliat commented Feb 15, 2016

@deflaux @pgrosu @jakeakopp Thanks for the review, I addressed most of the comments and will upload the version with these changes once I do a bit more validation beyond basic checks.

@iliat
Copy link
Member Author

iliat commented Feb 21, 2016

@deflaux - I can finally update the PR with a version that:

  1. Addresses comment above
  2. Fixes index generation bug and fixes a long lurking bug GCSSeekableStream
  3. Adds a BAMDiff tool I used to compare BAM files exported with API and this method to ensure no reads are missed (it ignores unmapped ones for now).

@pgrosu
Copy link

pgrosu commented Feb 21, 2016

Dude, there's some minor error in your JavaDoc - below is the link to the log for JDK8:

https://s3.amazonaws.com/archive.travis-ci.org/jobs/110838148/log.txt

Here's the start of the errors:

[ERROR] Failed to execute goal org.apache.maven.plugins:maven-javadoc-plugin:2.10.3:javadoc (default-cli) on project google-genomics-dataflow: An error has occurred in JavaDocs report generation:
[ERROR] Exit code: 1 - /home/travis/build/googlegenomics/dataflow-java/src/main/java/htsjdk/samtools/BAMShardIndexer.java:180: warning: no @param for log
[ERROR] public static void createIndex(SamReader reader, File output, Log log) {
[ERROR] ^
[ERROR] /home/travis/build/googlegenomics/dataflow-java/src/main/java/htsjdk/samtools/BAMShardIndexer.java:10: error: unexpected text
[ERROR] * @see https://github.com/samtools/htsjdk/blob/master/src/java/htsjdk/samtools/BAMIndexer.java
[ERROR] ^
[ERROR] /home/travis/build/googlegenomics/dataflow-java/src/main/java/htsjdk/samtools/BAMShardIndexer.java:14: warning - Tag @see:illegal character: "58" in "https://github.com/samtools/htsjdk/blob/master/src/java/htsjdk/samtools/BAMIndexer.java
[ERROR] and modified to support sharded index writing, where index for each reference is generated
[ERROR] separately and then the index shards are combined."
[ERROR] /home/travis/build/googlegenomics/dataflow-java/src/main/java/htsjdk/samtools/BAMShardIndexer.java:14: warning - Tag @see:illegal character: "47" in "https://github.com/samtools/htsjdk/blob/master/src/java/htsjdk/samtools/BAMIndexer.java
[ERROR] and modified to support sharded index writing, where index for each reference is generated
[ERROR] separately and then the index shards are combined."
[ERROR] /home/travis/build/googlegenomics/dataflow-java/src/main/java/htsjdk/samtools/BAMShardIndexer.java:14: warning - Tag @see:illegal character: "47" in "https://github.com/samtools/htsjdk/blob/master/src/java/htsjdk/samtools/BAMIndexer.java
[ERROR] and modified to support sharded index writing, where index for each reference is generated
[ERROR] separately and then the index shards are combined."
[ERROR] /home/travis/build/googlegenomics/dataflow-java/src/main/java/htsjdk/samtools

@iliat
Copy link
Member Author

iliat commented Feb 21, 2016

@pgrosu Yeah, on it :)

@pgrosu
Copy link

pgrosu commented Feb 21, 2016

Ah, cool man :)

ret = compareReadGroups(h1, h2) && ret;
ret = compareProgramRecords(h1, h2) && ret;
return ret;
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Way too cryptic! Try the following instead free of charge :)

private boolean compareHeaders(SAMFileHeader h1, SAMFileHeader h2) throws Exception {

    if ( !compareSequenceDictionaries(h1, h2) ) {

      return false;

    } else if (

      compareValues(h1.getCreator(), h2.getCreator(), "File creator") &&
      compareValues(h1.getAttribute("SO"), h2.getAttribute("SO"), "Sort order") &&
      compareReadGroups(h1, h2) &&
      compareProgramRecords(h1, h2) ) {

        if ( !options.ignoreFileFormatVersion ) {
          return compareValues(h1.getVersion(), h2.getVersion(), "File format version");
        } else {
          return true;
        }

    } else {
      return false;
    }  
}

@iliat
Copy link
Member Author

iliat commented Feb 29, 2016

PTAL @deflaux

@deflaux
Copy link
Contributor

deflaux commented Mar 2, 2016

LGTM @iliat Again, really nice work here.

Please file issues for stuff to be done in future PRs. Thanks!!!

iliat added a commit that referenced this pull request Mar 2, 2016
Update to handle API reading and optimize sharded writing and indexing
@iliat iliat merged commit 3a3e17c into googlegenomics:master Mar 2, 2016
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants