Snakemake workflow to call germline variants using GATK, VarScan, and Pindel. The Snakefile is designed to run on Google Cloud, the Ding Lab cluster, and the MGI cluster.
- `local_test`: Settings for runs on the Ding Lab cluster Denali.
- `google_api`: Settings for runs on Google Cloud.
- `mgi`: Settings for runs on the MGI cluster.
- `scripts`: Custom scripts for germline variant calling.
- `files`: Required files such as chromosome intervals (check whether your reads start with `chr` or not, and change the chromosome prefix accordingly).
- `env`: Conda environment YAML files for GATK 3.8. This tool needs to be set up separately because it is not compatible with the other tools.
Since GATK requires BAMs to be formatted in a specific way, Snakemake will complain if GATK cannot run smoothly (most errors come from this step). We highly recommend running GATK HaplotypeCaller locally first to make sure it works on your BAMs. If there is an error, the GATK blog is a good resource for troubleshooting. Below are some common errors and solutions:
- Check whether the chromosome names start with `chr` or not, and change the chromosome interval files accordingly.
- Make sure the read group is correct. If not, use `gatk AddOrReplaceReadGroups` to fix it (see the sketch after this list).
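A minimal sketch of the read-group fix, assuming GATK4 syntax and a BAM named `sample.bam`; all read-group values below are placeholders to replace with your own metadata:

```
# Inspect the current read group(s) in the BAM header.
samtools view -H sample.bam | grep '^@RG'

# Rewrite the read group if it is missing or wrong (placeholder values).
gatk AddOrReplaceReadGroups \
    -I sample.bam \
    -O sample.rg.bam \
    --RGID flowcell.lane1 \
    --RGLB lib1 \
    --RGPL ILLUMINA \
    --RGPU flowcell.lane1.unit1 \
    --RGSM TCGA-XX-XXXX

# Re-index so HaplotypeCaller can use the rewritten BAM.
samtools index sample.rg.bam
```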
- Clone the repository: `git clone https://github.com/ding-lab/germline_variant_snakemake.git`
- Change the priority of conda channels:

```
conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge
```
- Create a conda environment: `conda create -n snakemake python=3.6 snakemake pindel varscan gatk4 samtools pandas bcftools vcf2maf ensembl-vep`
- Create a separate conda environment for GATK 3.8: `conda env create -f germline_variant_snakemake/env/gatk38.yml`, then `conda activate gatk38` and follow the note here. Briefly, download the archived GATK 3.8 package into a folder and run `gatk3-register /path/to/GenomeAnalysisTK[-$PKG_VERSION.tar.bz2|.jar]`.
- Activate the environment: `conda activate snakemake`. A quick sanity check is sketched after this list.
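As an optional sanity check (a suggestion, not part of the original instructions), confirm that the main tools resolve from the `snakemake` environment before moving on:

```
conda activate snakemake

snakemake --version            # workflow engine
which pindel                   # pindel2vcf lives in the same directory
samtools --version | head -n 1
gatk --version                 # GATK4; GATK 3.8 lives in the separate gatk38 environment
```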
- Go to the folder: `cd local_test`
- Change the paths to `pindel2vcf` and the GATK 3.8 jar file in `config.yaml` accordingly (see the configuration section below).
- Dry run: `snakemake -n -p all_tools`
- Run a task: `snakemake -j ${number of CPUs to use} -p all_tools`. Note that all intermediate files are kept; remember to delete temporary files to save space. A full example run is sketched after this list.
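A hedged example of a complete local run; the core count of 8 and the log file name are only placeholders to adjust for your machine:

```
cd local_test

# Dry run first to verify the DAG and the paths in config.yaml.
snakemake -n -p all_tools

# Real run: --keep-going lets independent jobs continue if one sample fails;
# tee keeps a copy of the log for troubleshooting.
snakemake -j 8 --keep-going -p all_tools 2>&1 | tee snakemake_local.log
```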
- Go to the folder: `cd mgi`
- Change the paths to `pindel2vcf` and the GATK 3.8 jar file in `config.yaml` accordingly (see the configuration section below).
- Make a folder for LSF logs: `mkdir lsf_logs`
- Check whether `/bin/bash` works. If it is masked by another bash path such as `/gsc/bin/bash`, follow the Confluence page here, or set `LSF_DOCKER_PRESERVE_ENVIRONMENT=true` (see the sketch after this list).
- Run a task: `bash run.sh`
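A minimal sketch of the `/bin/bash` check above; `/gsc/bin/bash` is the example path from this README, and your masking path may differ:

```
command -v bash    # should print /bin/bash, not e.g. /gsc/bin/bash

# If another bash masks /bin/bash, preserving the submission environment usually helps.
export LSF_DOCKER_PRESERVE_ENVIRONMENT=true
bash run.sh
```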
- Generate the required command for Google Cloud using the script: `/diskmnt/Projects/Users/wliang/Germline_Noncoding/06_Cloud_Variant_Calling/bampath/generate_command.snakemake.sh`
- `bash generate_command.snakemake.sh TCGA_WGS_gspath_WWL_Mar2018.LowPass.normal.txt LowPass`
- Create a VM in the project.
- Get sufficient authentication scopes: `gcloud auth login`
- Clone the repository: `git clone https://github.com/ding-lab/germline_variant_snakemake.git`
- Run a Google Pipelines API command:

```
gcloud alpha genomics pipelines run \
--pipeline-file ~/germline_variant_snakemake/google_api/germline_snakemake.yaml \
--inputs fafile=gs://dinglab/reference/Homo_sapiens_assembly19.fasta,\
faifile=gs://dinglab/reference/Homo_sapiens_assembly19.fasta.fai,\
dictfile=gs://dinglab/reference/Homo_sapiens_assembly19.dict,\
bamfile=gs://5aa919de-0aa0-43ec-9ec3-288481102b6d/tcga/LUAD/DNA/WGS/HMS-RK/ILLUMINA/TCGA-44-4112-11A-01D-1103_120318_SN1120_0124_AC0HNPACXX_s_2_rg.sorted.bam,\
baifile=gs://5aa919de-0aa0-43ec-9ec3-288481102b6d/tcga/LUAD/DNA/WGS/HMS-RK/ILLUMINA/TCGA-44-4112-11A-01D-1103_120318_SN1120_0124_AC0HNPACXX_s_2_rg.sorted.bam.bai,\
sample=TCGA-44-4112-11A-01D-1103-02 \
--outputs outputPath=gs://wliang/germline_snakemake/output/LowPass/TCGA-44-4112-11A-01D-1103-02/ \
--logging gs://wliang/germline_snakemake/logging/LowPass/ \
--project washu-medicine-pancan \
--disk-size datadisk:50 \
--preemptible
```
- Since preemptible machines might be shut down at any time, it is helpful to launch batch jobs with the script `submit_google_api.py`. This script reads the manifest, launches jobs for the first 30 samples, checks the status of launched jobs every minute, and keeps a certain number of VMs running. The output `{filename_of_manifest}.result.tsv` gives a snapshot of `case_full_barcode cmd status operation_id num_of_repeats`. Note that if a job for a sample has been launched more than 16 times, the script stops launching it; check that specific job or sample, make sure there is no problem, and relaunch it manually. One sample can usually be completed within 16 attempts. A sketch for inspecting a single job is given after this list.
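To inspect a single job by hand, the `operation_id` column of `{filename_of_manifest}.result.tsv` can be passed to the operations subcommand in the same `gcloud alpha genomics` group; this is an assumption about your gcloud version, since the alpha surface changes over time, so confirm the subcommand is available first:

```
# "done: true" with no error block means the pipeline finished successfully.
gcloud alpha genomics operations describe <operation-id> --project washu-medicine-pancan
```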
- Find the path to the cloned repository.
- Run `conda activate snakemake` and find the path to `pindel2vcf` by typing `which pindel`; `pindel` and `pindel2vcf` are in the same folder.
- Edit the configuration: `vi config.yaml`

```
samples: {Your file with header and sample lines}
# sample lines should follow the format: ID\tPath2Ref\tPath2BAM
interval_prefix: "{Path to cloned repo}/germline_variant_snakemake/files/interval_chr"
path_to_pindel2vcf: "{Path to pindel}2vcf"
path_to_gatk_jar: "{Path to GenomeAnalysisTK.jar}"
```
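A sketch of building the samples file in the `ID\tPath2Ref\tPath2BAM` format expected above; the file name `samples.tsv`, the header text, and the paths are placeholders:

```
# Header line first, then one tab-separated line per sample (placeholder paths).
printf 'ID\tPath2Ref\tPath2BAM\n' > samples.tsv
printf 'SAMPLE-01\t/path/to/Homo_sapiens_assembly19.fasta\t/path/to/SAMPLE-01.sorted.bam\n' >> samples.tsv
```

Then point `samples:` in `config.yaml` at this file.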
Change the input of rule `all_tools` in the `Snakefile` to the result files you would like. The default is one multi-sample merged VCF and one MAF file.