First, follow the Google Cloud Dataflow getting started instructions to set up your environment for Dataflow. You will need your Project ID and Google Cloud Storage bucket in the following steps.
To use this code, build the client using Apache Maven:
    cd dataflow-java
    mvn compile
    mvn bundle:bundle
Then, follow the Google Genomics sign up instructions to generate a valid client_secrets.json file.
Move the client_secrets.json file into the dataflow-java directory. (Authentication will take place the first time you run a pipeline.)
Then you can run a pipeline locally with the command line, passing in the Project ID and Google Cloud Storage bucket you made in the first step. This command runs the VariantSimilarity pipeline (which runs PCoA on a dataset):
    java -cp target/google-genomics-dataflow-v1beta2-0.1-SNAPSHOT.jar \
      com.google.cloud.genomics.dataflow.pipelines.VariantSimilarity \
      --project=my-project-id \
      --output=gs://my-bucket/output/localtest.txt \
      --genomicsSecretsFile=client_secrets.json
Note: when running locally, you may run into memory issues depending on the capacity of your local machine.
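For readers new to PCoA, the following is a minimal, self-contained sketch of the double-centering step that classical principal coordinates analysis applies to a distance matrix before eigendecomposition. It is not taken from the VariantSimilarity source; the class name `PcoaCenteringSketch`, the method `doubleCenter`, and the toy distance matrix are all hypothetical, and the pipeline's real implementation may differ.

```java
public class PcoaCenteringSketch {

    /**
     * Double-centers a symmetric distance matrix: B = -1/2 * J * D^2 * J,
     * where D^2 holds squared distances and J is the centering matrix.
     * Classical PCoA then takes the top eigenvectors of B (not shown here).
     */
    public static double[][] doubleCenter(double[][] dist) {
        int n = dist.length;
        double[][] squared = new double[n][n];
        double[] rowMean = new double[n];
        double grandMean = 0.0;

        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                squared[i][j] = dist[i][j] * dist[i][j];
                rowMean[i] += squared[i][j] / n;
                grandMean += squared[i][j] / ((double) n * n);
            }
        }

        double[][] b = new double[n][n];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                // The matrix is assumed symmetric, so the column mean of j
                // equals the row mean of j.
                b[i][j] = -0.5 * (squared[i][j] - rowMean[i] - rowMean[j] + grandMean);
            }
        }
        return b;
    }

    public static void main(String[] args) {
        // Toy input: pairwise distances between three points at 0, 1 and 3 on a line.
        double[][] dist = {
            {0, 1, 3},
            {1, 0, 2},
            {3, 2, 0}
        };
        double[][] b = doubleCenter(dist);
        for (double[] row : b) {
            System.out.println(java.util.Arrays.toString(row));
        }
    }
}
```

Each row of the centered matrix sums to zero (up to rounding), which is a quick sanity check when experimenting with your own distance matrices.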
To deploy your pipeline (which runs on Google Compute Engine), some additional command line arguments are required:
    java -cp target/google-genomics-dataflow-v1beta2-0.1-SNAPSHOT.jar \
      com.google.cloud.genomics.dataflow.pipelines.VariantSimilarity \
      --runner=BlockingDataflowPipelineRunner \
      --project=my-project-id \
      --stagingLocation=gs://my-bucket/staging \
      --output=gs://my-bucket/output/test.txt \
      --genomicsSecretsFile=client_secrets.json \
      --numWorkers=10
Note: By default, you can run at most 16 workers without requesting more GCE quota, because 16 is the default limit on VMs.
In addition to variant similarity, you can run other pipelines by changing the first argument provided in the above command lines. For example, to run Identity by State, change VariantSimilarity to IdentityByState:

    java -cp target/google-genomics-dataflow-v1beta2-0.1-SNAPSHOT.jar \
      com.google.cloud.genomics.dataflow.pipelines.IdentityByState \
      --project=my-project-id \
      --output=gs://my-bucket/localtest.txt \
      --genomicsSecretsFile=client_secrets.json
The Main code directory contains several useful utilities:
- coders: includes Coder classes that are useful for Genomics pipelines. GenericJsonCoder can be used with any of the Java client library classes (like Read, Variant, etc.).
- functions: contains common DoFns that can be reused as part of any pipeline. OutputPCoAFile is an example of a complex PTransform that provides a useful common analysis.
- pipelines: contains example pipelines which demonstrate how Google Cloud Dataflow can work with Google Genomics.
  - VariantSimilarity runs a principal coordinates analysis over a dataset containing variants, and writes a file of graph results that can be easily displayed by Google Sheets.
  - IdentityByState runs IBS over a dataset containing variants. See the results/ibs directory for more information on how to use the pipeline's results.
- readers: contains functions that perform API calls to read data from the genomics API.
- utils: contains utilities for running dataflow workflows against the genomics API.
  - DataflowWorkarounds contains workarounds needed to use the Google Cloud Dataflow APIs.
  - GenomicsOptions.java and GenomicsDatasetOptions: extend these classes for your command line options to take advantage of common command line functionality.
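To give a feel for what an identity-by-state (IBS) comparison computes, here is a minimal, self-contained sketch of one common textbook definition: the fraction of alleles two samples share across the sites where both have a genotype call. This README does not specify how the IdentityByState pipeline scores a pair of samples, so treat the scoring rule, the class name `IbsSketch`, the 0/1/2 alternate-allele encoding, and the `MISSING` marker as assumptions made purely for illustration.

```java
public class IbsSketch {

    /** Assumed marker for a site where a sample has no genotype call. */
    public static final int MISSING = -1;

    /**
     * Computes the mean fraction of alleles shared identical-by-state.
     * Each genotype is the count of alternate alleles at a site (0, 1 or 2),
     * so two samples share 2 - |a - b| of their 2 alleles at that site.
     * Sites where either sample is MISSING are skipped.
     */
    public static double ibsSimilarity(int[] a, int[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("Samples must cover the same sites");
        }
        int sharedAlleles = 0;
        int comparedSites = 0;
        for (int i = 0; i < a.length; i++) {
            if (a[i] == MISSING || b[i] == MISSING) {
                continue;
            }
            sharedAlleles += 2 - Math.abs(a[i] - b[i]);
            comparedSites++;
        }
        if (comparedSites == 0) {
            return Double.NaN; // No overlapping calls, so the score is undefined.
        }
        return sharedAlleles / (2.0 * comparedSites);
    }

    public static void main(String[] args) {
        int[] sampleA = {0, 1, 2, 2, MISSING};
        int[] sampleB = {0, 2, 0, 2, 1};
        // Four sites are compared and 5 of 8 alleles are shared.
        System.out.println(ibsSimilarity(sampleA, sampleB)); // prints 0.625
    }
}
```

A score of 1.0 means the two samples have identical genotypes at every compared site. Computing this score for every pair of samples produces the kind of similarity matrix that an IBS analysis reports.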
This code is also deployed as a Maven artifact through Sonatype. The utils-java readme has detailed instructions on how to deploy new versions.
To depend on this code, add the following to your pom.xml file:

    <project>
      <dependencies>
        <dependency>
          <groupId>com.google.cloud.genomics</groupId>
          <artifactId>google-genomics-dataflow</artifactId>
          <version>v1beta2-0.1</version>
        </dependency>
      </dependencies>
    </project>
You can find the latest version in Maven's central repository.
We'll soon include an example pipeline that depends on this code in another GitHub repository.
- Provide a Maven artifact which makes it easier to use Google Genomics within Google Cloud Dataflow.
- Provide some example pipelines which demonstrate how Dataflow can be used to analyze Genomics data.
This code is in active development and will be deployed to Maven soon.
- TODO: Explain all the possible command line args: zone, allContigs, etc.
- TODO: Set up Travis integration once this repo is public
- TODO: Refine the transmission probability pipeline
- TODO: Add more tests