In this case study, we show an example of running DeepSomatic on Illumina NovaSeq WGS data. We use HCC1395 as an example for this case study.
For this case-study, we use HCC1395 as an example. We run the analysis on chr1
that we hold out during training.
Docker will be used to run DeepVariant and hap.py,
We will be using GRCh38 for this case study.
BASE="${HOME}/deepsomatic-illumina-case-study"
# Set up input and output directory data
INPUT_DIR="${BASE}/input/data"
OUTPUT_DIR="${BASE}/output"
## Create local directory structure
mkdir -p "${INPUT_DIR}"
mkdir -p "${OUTPUT_DIR}"
mkdir -p "${OUTPUT_DIR}/sompy_output"
# Download reference to input directory
FTPDIR=ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids
curl ${FTPDIR}/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz | gunzip > ${INPUT_DIR}/GRCh38_no_alt_analysis_set.fasta
curl ${FTPDIR}/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.fai > ${INPUT_DIR}/GRCh38_no_alt_analysis_set.fasta.fai
# Download bam files to input directory
HTTPDIR=https://storage.googleapis.com/deepvariant/deepsomatic-case-studies/deepsomatic-wgs-case-study
curl ${HTTPDIR}/S1395_WGS_ilm_normal.bwa.dedup.chr1.bam > ${INPUT_DIR}/S1395_WGS_ilm_normal.bwa.dedup.chr1.bam
curl ${HTTPDIR}/S1395_WGS_ilm_normal.bwa.dedup.chr1.bam.bai > ${INPUT_DIR}/S1395_WGS_ilm_normal.bwa.dedup.chr1.bam.bai
curl ${HTTPDIR}/S1395_WGS_ilm_tumor.bwa.dedup.chr1.bam > ${INPUT_DIR}/S1395_WGS_ilm_tumor.bwa.dedup.chr1.bam
curl ${HTTPDIR}/S1395_WGS_ilm_tumor.bwa.dedup.chr1.bam.bai > ${INPUT_DIR}/S1395_WGS_ilm_tumor.bwa.dedup.chr1.bam.bai
# Download truth VCF
DATA_HTTP_DIR=https://storage.googleapis.com/deepvariant/deepsomatic-case-studies/SEQC2-S1395-truth
wget -P ${INPUT_DIR} "${DATA_HTTP_DIR}"/High-Confidence_Regions_v1.2.bed
wget -P ${INPUT_DIR} "${DATA_HTTP_DIR}"/high-confidence_sINDEL_sSNV_in_HC_regions_v1.2.1.merged.vcf.gz
wget -P ${INPUT_DIR} "${DATA_HTTP_DIR}"/high-confidence_sINDEL_sSNV_in_HC_regions_v1.2.1.merged.vcf.gz.tbi
DeepVariant pipeline consists of 3 steps: make_examples_somatic
, call_variants
, and
postprocess_variants
. You can run DeepSomatic with one command using the
run_deepvariant
script.
BIN_VERSION="1.6.1"
sudo docker pull google/deepsomatic:"${BIN_VERSION}"
sudo docker run \
-v ${INPUT_DIR}:${INPUT_DIR} \
-v ${OUTPUT_DIR}:${OUTPUT_DIR} \
google/deepsomatic:"${BIN_VERSION}" \
run_deepsomatic \
--model_type=WGS \
--ref=${INPUT_DIR}/GRCh38_no_alt_analysis_set.fasta \
--reads_normal=${INPUT_DIR}/S1395_WGS_ilm_normal.bwa.dedup.chr1.bam \
--reads_tumor=${INPUT_DIR}/S1395_WGS_ilm_tumor.bwa.dedup.chr1.bam \
--output_vcf=${OUTPUT_DIR}/HCC1395_deepsomatic_output.vcf.gz \
--sample_name_tumor="HCC1395Tumor" \
--sample_name_normal="HCC1395Normal" \
--num_shards=$(nproc) \
--logging_dir=${OUTPUT_DIR}/logs \
--intermediate_results_dir=${OUTPUT_DIR}/intermediate_results_dir \
--regions=chr1
By specifying --model_type WGS
, you'll be using a model that is best suited for Illumina Whole Genome Sequencing data.
NOTE: If you want to run each of the steps separately, add --dry_run=true
to the command above to figure out what flags you need in each step. Based on
the different model types, different flags are needed in the make_examples
step.
--intermediate_results_dir
flag is optional. By specifying it, the
intermediate outputs of make_examples_somatic
and call_variants
stages can be found in the directory.
sudo docker pull pkrusche/hap.py:latest
# Run hap.py
sudo docker run \
-v ${INPUT_DIR}:${INPUT_DIR} -v ${OUTPUT_DIR}:${OUTPUT_DIR} \
pkrusche/hap.py:latest \
/opt/hap.py/bin/som.py \
-N ${INPUT_DIR}/high-confidence_sINDEL_sSNV_in_HC_regions_v1.2.1.merged.vcf.gz \
${OUTPUT_DIR}/HCC1395_deepsomatic_output.vcf.gz \
-r ${INPUT_DIR}/GRCh38_no_alt_analysis_set.fasta \
-o ${OUTPUT_DIR}/sompy_output/deepsomatic.chr1.sompy.output \
--feature-table generic \
-R ${INPUT_DIR}/High-Confidence_Regions_v1.2.bed \
-l chr1
The output:
type total.truth total.query tp fp fn unk ambi recall recall_lower recall_upper recall2 precision precision_lower precision_upper na ambiguous fp.region.size fp.rate
0 indels 133 156 125 31 8 0 0 0.939850 0.889721 0.971155 0.939850 0.801282 0.733470 0.858037 0 0 248956422 0.124520
1 SNVs 3440 3319 3290 29 150 0 0 0.956395 0.949183 0.962840 0.956395 0.991262 0.987653 0.994017 0 0 248956422 0.116486
5 records 3573 3475 3415 60 158 0 0 0.955779 0.948666 0.962155 0.955779 0.982734 0.977992 0.986673 0 0 248956422 0.241006