
Tutorial for Google Cloud Platform

All test samples and genome data are shared on our public Google Cloud buckets, so you don't need to download any data to test our pipeline on Google Cloud.

  1. Sign up for a Google account.

  2. Go to the Google Project page, click "SIGN UP FOR FREE TRIAL" on the top left, and agree to the terms.

  3. Set up a payment method and click "START MY FREE TRIAL".

  4. Create a Google Project [YOUR_PROJECT_NAME] and select it at the top of the page.

  5. Create a Google Cloud Storage bucket gs://[YOUR_BUCKET_NAME] by clicking "CREATE BUCKET"; it will store pipeline outputs.
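
    If you prefer the command line, the same bucket can be created with gsutil once the SDK from step 7 is installed (a sketch; gsutil mb is the standard "make bucket" command and -p selects the owning project):

    $ gsutil mb -p [YOUR_PROJECT_NAME] gs://[YOUR_BUCKET_NAME]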

  6. Find and enable the following APIs in your API Manager. Click the back button in your web browser after enabling each (a command-line alternative is sketched after this list).

    • Compute Engine API
    • Google Cloud Storage (DO NOT click on "Create credentials")
    • Google Cloud Storage JSON API
    • Genomics API
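
    Alternatively, the same APIs can be enabled from the command line once the SDK from step 7 is installed (a sketch; the service IDs below are our best mapping of the API names above and may need adjusting):

    $ gcloud services enable compute.googleapis.com storage-component.googleapis.com \
        storage-api.googleapis.com genomics.googleapis.com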
  7. Install the Google Cloud Platform SDK and authenticate through it. You will be asked to enter verification keys; get them from the URLs that these commands print.

    $ gcloud auth login --no-launch-browser
    $ gcloud auth application-default login --no-launch-browser
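
    You can confirm which accounts are authenticated with a standard SDK command:

    $ gcloud auth list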
  8. If you see permission errors at runtime, unset the environment variable GOOGLE_APPLICATION_CREDENTIALS, or add the line below to your BASH startup scripts ($HOME/.bashrc or $HOME/.bash_profile) so it is unset in every new shell.

      unset GOOGLE_APPLICATION_CREDENTIALS
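
    To make the fix permanent, append the same line to your startup script (a sketch; pick the file that matches your shell setup):

    $ echo 'unset GOOGLE_APPLICATION_CREDENTIALS' >> $HOME/.bashrc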
  9. Set your default Google Cloud Project. The pipeline will provision instances under this project.

    $ gcloud config set project [YOUR_PROJECT_NAME]
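
    To confirm the setting took effect:

    $ gcloud config get-value project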
  10. Download Cromwell.

    $ cd
    $ wget https://github.com/broadinstitute/cromwell/releases/download/34/cromwell-34.jar
    $ chmod +rx cromwell-34.jar
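
    Cromwell is a Java application, so a Java runtime (Java 8 for releases of this vintage) must be on your PATH; you can check with:

    $ java -version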
  11. Git clone this pipeline and move into its directory.

    $ cd
    $ git clone https://github.com/ENCODE-DCC/atac-seq-pipeline
    $ cd atac-seq-pipeline
  12. Run the pipeline on a SUBSAMPLED (1/400) paired-end sample of ENCSR356KRQ.

    $ PROJECT=[YOUR_PROJECT_NAME]
    $ BUCKET=gs://[YOUR_BUCKET_NAME]/ENCSR356KRQ_subsampled
    $ INPUT=examples/google/ENCSR356KRQ_subsampled.json
    $ PIPELINE_METADATA=metadata.json
    
    $ java -jar -Dconfig.file=backends/backend.conf -Dbackend.default=google \
        -Dbackend.providers.google.config.project=${PROJECT} \
        -Dbackend.providers.google.config.root=${BUCKET} \
        cromwell-34.jar run atac.wdl -i ${INPUT} -o workflow_opts/docker.json -m ${PIPELINE_METADATA}
  13. It will take about an hour. All outputs will be on your Google Cloud bucket. The final QC report/JSON will be written to gs://[YOUR_BUCKET_NAME]/ENCSR356KRQ_subsampled/atac/[SOME_HASH_STRING]/call-qc_report/execution/glob*/qc.html (or qc.json). See the output directory structure for details.
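
    Since [SOME_HASH_STRING] is generated per run, a wildcard listing is a convenient way to locate the report (a sketch using standard gsutil wildcards):

    $ gsutil ls gs://[YOUR_BUCKET_NAME]/ENCSR356KRQ_subsampled/atac/*/call-qc_report/execution/glob*/qc.html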

  14. See the full specification for the input JSON file.
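
    A working example to start from is the file used in step 12; pretty-printing it is an easy way to inspect its keys (python -m json.tool ships with Python):

    $ python -m json.tool examples/google/ENCSR356KRQ_subsampled.json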

  15. You can resume a failed pipeline from where it left off by using the PIPELINE_METADATA (metadata.json) file, which is created for each pipeline run. See here for details. Once you get a new input JSON file from the resumer, use INPUT=resume.[FAILED_WORKFLOW_ID].json instead of INPUT=examples/google/ENCSR356KRQ_subsampled.json, as shown below.
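
    The rerun uses the same command as step 12 with only INPUT changed (a sketch assuming the resumer wrote the new JSON into the pipeline directory):

    $ INPUT=resume.[FAILED_WORKFLOW_ID].json
    $ java -jar -Dconfig.file=backends/backend.conf -Dbackend.default=google \
        -Dbackend.providers.google.config.project=${PROJECT} \
        -Dbackend.providers.google.config.root=${BUCKET} \
        cromwell-34.jar run atac.wdl -i ${INPUT} -o workflow_opts/docker.json -m ${PIPELINE_METADATA}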

Extras for advanced users

  1. Set quotas for the Google Compute Engine API per region. Increase quotas for SSD/HDD storage and the number of vCPUs to process more samples simultaneously (a command-line check is sketched after this list).

    • CPUs
    • Persistent Disk Standard (GB)
    • Persistent Disk SSD (GB)
    • In-use IP addresses
    • Networks
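
    Current usage and limits for a region can also be checked from the SDK (gcloud compute regions describe is a standard command; us-west1 is just an example region):

    $ gcloud compute regions describe us-west1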
  2. Set default_runtime_attributes.zones in workflow_opts/docker.json to your preferred Google Cloud zones.

    {
      "default_runtime_attributes" : {
        ...
        "zones": "us-west1-a us-west1-b us-west1-c",
        ...
      }
    }
  3. Set default_runtime_attributes.preemptible to "0" to disable preemptible instances. This value is the number of retries allowed after failures on a preemptible instance; if all retries fail, the instance is upgraded to a regular one. The pipeline does not use preemptible instances by default. Disabling them costs significantly more, but your samples get processed faster and more reliably. Long-running tasks like bowtie2, bwa and spp are never executed on preemptible instances since they can exceed the 24-hour limit of preemptible instances.

    {
      "default_runtime_attributes" : {
        ...
        "preemptible": "0",
        ...
      }
    }