umccrise post-processess outputs of cancer variant calling analysis pipelines from Illumina DRAGEN and generates reports for researchers and curators at UMCCR.
It takes as input results from the UMCCR DRAGEN Tumor/Normal and DRAGEN Germline variant calling workflows:
- BAM files from both samples
- somatic small variant calls
- germline small variant calls
- somatic structural variant calls
This part of the workflow filters and prioritises small somatic variant calls. The idea of filtering is to remove most of the artefacts and germline leakage, but at the same time to be permissive towards known clinically important sites even if the variants are of low quality.
- Call candidate variants using DRAGEN.
- Keep variants that pass DRAGEN's default quality control thresholds.
- Keep variants that are in the auto/sex/mito chromosomes (1-22, X, Y, M).
- Rescue variants that are in hotspot sites according to SAGE.
- Annotate variants with info from databases/files.
- If the sample is hypermutated (i.e. has > 500K variants):
- remove non-hotspot variants with
gnomad_AF >= 0.01
- remove variants not overlapping gene regions (specified in a BED file)
- remove non-hotspot variants with
- Filter variants based on certain thresholds, for example:
- Keep those with PCGR Tiers 1/2, known hotspots
- Dump those with low AF (
TUMOR_AF < 0.1
)
Steps are:
- Extract passing calls (with
PASS
in FILTER) - Extract calls on main chromosomes (chr1-chr22, chrX, chrY, chrM)
- Run
SAGE v1.0
and add the result to the VCF. SAGE is a low-frequency variant caller with a
high precision, created by Hartwig Medical Foundation. Instead of the whole
genome, it targets only coding regions for inframe indels, and
known hotspot sites
from the following list:
- Cancer Genome Interpreter
- CIViC - Clinical interpretations of variants in cancer
- OncoKB - Precision Oncology Knowledge Base
- Annotate the VCF against the reference sources:
- High-confidence regions from the Genome in a Bottle benchmark;
- GnomAD whole genome "common" variants (max population frequency > 1%);
- Low complexity regions (LCR):
- Homopolymers, STRs, VNTRs and other repetitive sequences, compiled from TRDB,
- Regions compiled by Heng Li from 3 separate masks: low-complexity regions by mDUST and from UCSC repeatMasker plus flanking regions, structural mask (HWE+depth mask from 1000g plus flanking regions), and 75bp mappability mask.
- Low and high-GC regions (regions < 30% or > 65% GC content), compiled by GA4GH
- Bad promoter regions (compiled by GA4GH: "Anecdotal results suggested that many transcription start sites or first exons in the human genome tend to have poor coverage. By a systematic analysis of these regions we defined the 1,000 with the lowest relative coverage based on low coverage by an Illumina data set, which we term the 'bad promoters' list. The bad promoters are, like many exons, GC-rich, averaging 79% GC composition");
- ENCODE blacklist;
- Segmental duplication regions (UCSC);
- UMCCR panel of normals, build from tumor-only mutect2 calls from ~200 normal samples.
- If after removing non-hotspot GnomAD variants there are still > 500k somatic variants left flag the sample as highly mutated (or FFPE) and limit all calls to to cancer genes only (to avoid downstream R performance problems).
- Standardise the VCF fields: add new
INFO
fieldsTUMOR_AF
,NORMAL_AF
,TUMOR_DP
,NORMAL_DP
,TUMOR_VD
,NORMAL_VD
(for use with PCGR), andAD FORMAT
field (for use with PURPLE). - Run PCGR to annotate VCF against the
external sources, and classify by tiers based on annotations and functional
impact. At the end, this step adds
INFO
fields into the VCF:TIER
,SYMBOL
,CONSEQUENCE
,MUTATION_HOTSPOT
,INTOGEN_DRIVER_MUT
,TCGA_PANCANCER_COUNT
,CLINVAR_CLNSIG
,ICGC_PCAWG_HITS
,COSMIC_CNT
. The list of external sources used at this step is:- VEP to infer the functional impact
- TCGA and ICGC recurrent variants
- Open Targets Platform
- ClinVar - Database of variants with clinical significance
- CancerMine - Literature-derived database of tumor suppressor genes/proto-oncogenes
- DoCM - Database of curated mutations
- CBMDB - Cancer Biomarkers database
- DisGeNET - Database of gene-tumor type associations
- Cancer Hotspots - Resource for statistically significant mutations in cancer
- dBNSFP - Database of non-synonymous functional predictions
- UniProt/SwissProt KnowledgeBase - Resource on protein sequence and functional information
- Pfam - Database of protein families and domains
- DGIdb - Database of targeted cancer drugs
- ChEMBL - Manually curated database of bioactive molecules
- Filter variants
to remove putative germline variants and artefacts, but make sure to keep
known hotspots/actionable variants:
- Keep variants called by SAGE in known hotspots (CGI, CiViC, OncoKB) regardless of other evidence;
- Keep variants PCGR TIER 1 and 2 (strong and potential clinical significance, according to ACMG standard guidelines) regardless of other evidence;
- Keep all driver mutations (Intogen); mutation hotspots; ClinVar pathogenic or uncertain (to include the mixed evidence); COSMIC count >=10; TCGA pancancer count >=5; ICGC PCAWG count >= 3 (all annotated by PCGR), regardless of other evidence;
- For all other variants, apply the following LCR, PoN, depth and AF filters.
Remove variants for which one or more of the following conditions apply:
AF<10%
,- Common variant in GnomAD (max
population AF>=1%
), add into the germline set (see below); - Present in >=5 samples of the Panel of Normal set;
- InDel in a "bad promoter" regions (GA4GH: "Anecdotal results suggested that many transcription start sites or first exons in the human genome tend to have poor coverage. By a systematic analysis of these regions we defined the 1,000 with the lowest relative coverage based on low coverage by an Illumina data set, which we term the 'bad promoters' list. The bad promoters are, like many exons, GC-rich, averaging 79% GC composition);
- Overlapping the ENCODE blacklist,
- Variant depth
VD<4
; - Variant depth
VD<6
, and the variant overlaps a low complexity region (see step 4 above); - VarDict strand-biased variants (single strand support for ALT, while REF has both; or REF and ALT have opposite supporting strands), unless supported by all other callers.
- Report passing variants using PCGR, classified by the ACMG tier system.
The idea is to simply bring germline variants in cancer predisposition genes:
- Take passing "ensemble" germline VCF from bcbio. "Ensemble" has variants supported by at least 2 of 3 callers (we use strelka2, vardict, and GATK Haplotype Caller).
- Add back variants from somatic calling filtered as common GnomAD.
- Subset variants to a list of ~200 cancer predisposition genes, which is build by CPSR from 3 curated sources: TCGA pan-cancer study, COSMIC CGC, and Norwegian Cancer Genomics Consortium.
- Report variants using CPSR, which
classifies variants in the context of cancer predisposition by overlapping
with ClinVar pathogenic and VUS
variants and GnomAD rare variants. It also ranks variants according
pathogenicity score by ACMG and cancer-specific criteria. When a variant is
annotated as having multiple functional depending on a context of a gene and
a trascript, a higher impact events are prioritised, and if all things equal,
APPRIS principal transcripts are
preferred. In any case, one event per variant is reported. Events are
reported in 4 tiers according to their significance:
- ClinVar variants: pre-classified variants according to a five-level tier scheme (Pathogenic to Benign):
- Non-ClinVar variants: classified by CPSR according to a five-level tier scheme (Pathogenic to Benign)
- Secondary findings (optional) - pathogenic ClinVar variants in the ACMG recommended list for reporting of incidental findings,
- GWAS hits (optional) - variants overlapping with previously identified hits in genome-wide association studies (GWAS) of cancer phenotypes (i.e. low to moderate risk conferring alleles), using NHGRI-EBI Catalog of published genome-wide association studies as the underlying source. The unclassified non-ClinVar variants are assigned a pathogenicity level based on the aggregation of scores according to previously established ACMG criteria. The ACMG criteria includes cancer-specific criteria, as outlined and specified in several previous studies.
The idea is to report gene fusions, exon deletions, high impact and LoF events in tumor suppressors, and prioritise events in cancer genes.
- Start with the somatic SV VCF from DRAGEN (called internally by a private Manta version). Keep only PASS variants.
- Annotate variants:
- with SnpEff.
- with VEP
- Use the Ensembl gene model and Sequence ontology terminology.
- Subset annotations to APPRIS principal transcripts, keeping one main isoform per gene.
- Prioritise variants with
https://github.com/umccr/vcf_stuff/blob/master/scripts/prioritize_sv. on a
4 tier system - 1 (high) - 2 (moderate) - 3 (low) - 4 (no interest):
- exon loss
- on cancer gene list (1)
- other (2)
- gene fusion
- paired (hits two genes)
- on list of known pairs (1) (curated by HMF)
- one gene is a known promiscuous fusion gene (1) (curated by HMF)
- on list of FusionCatcher known pairs (2)
- other:
- one or two genes on cancer gene list (2)
- neither gene on cancer gene list (3)
- unpaired (hits one gene)
- on cancer gene list (2)
- others (3)
- paired (hits two genes)
- upstream or downstream (a specific type of fusion, e.g. one gene is got
into control of another gene's promoter and get over-expressed (oncogene)
or underexpressed (tsgene))
- on cancer gene list genes (2)
- LoF or HIGH impact in a tumor suppressor
- on cancer gene list (2)
- other TS gene (3)
- other (4)
- exon loss
- If the number of events is over 50K (e.g. FFPE), progressively remove TIER 4/3/2 SVs (in that order).
- Refine SVs using Hartwig's break-point-inspector, which locally re-assembles SV loci to get more accurate breakpoint positions and AF estimates.
- Filter low-quality calls:
- keep PASS FILTER
- dump TIER 3/4 where
SR < 5 & PR < 5
- dump TIER 3/4 where
SR < 10 & PR < 10 & (AF0 < 0.1 | AF1 < 0.1)
- Use filtered variants as a guidance for PURPLE CNV calling (see below). PURPLE will adjust and recover breakpoints at copy number transitions, and adjust AF based on copy number, purity and ploidy.
- Run https://github.com/umccr/vcf_stuff/blob/master/scripts/prioritize_sv on the PURPLE output SVs.
- Report tiered variants in the UMCCR cancer report.
The idea is to report significant CNV changes in key cancer genes and disruptions in tumor suppressors. And also calculate sample purity and ploidy profile.
We almost entirely rely on Hartwig's PURPLE workflow in this step. The PURPLE pipeline outlines as follows:
- Calculate B-allele frequencies (BAF) using AMBER subworkflow,
- Calculate read depth ratios using COBALT subworkflow,
- Perform segmentation (uses structural variant breakpoints for better guidance),
- Estimate the purity and copy number profile (uses somatic variants for better fitting),
- Plot a circos plot that visualises the CN/ploidy profile, as well as somatic variants, SVs, and BAFs,
- Rescue structural variants in copy number transitions and filter single breakends,
- Estimate overall tumor samples purity range,
- Determine gender,
- Report QC status of the sample, that will fail if the structural variants do not correspond to CN transitions, and gender is inconsistently called from BAFs and from the coverage.
From the PURPLE output, we report in the cancer report:
- Circos plot
- Minimal and maximal copy numbers in key cancer genes, that indicate amplifications/deletions as well as CN transitions that should match SVs
- QC status
- We also use Purity to adjust coverage reporting thresholds.
- Genome-wide CNV segments with breakpoint information. Includes segment CN, minor/major allele ploidy, type of SV support at start/end of segment, CN determination method, BAF/BAF count, GC%, Cobalt window coverage
MultiQC aggregates QC from different tools. We report the following:
- Sample contamination level (for both tumor and normal) and T/N concordance (by Conpair),
- Ancestry and sex (by Peddy),
- Mapping QC: the number of mapped reads, paired reads, secondary or duplicated alignments, average coverage (using samtools stats and mosdepth in bcbio),
- Viral content and integration sites (oncoviruses),
- PURPLE stats: QC, sex, purity/ploidy, TMB and MSI statuses
- Number of pre- and post-filtered SNPs and indels (by umccrise) which indicates germline leakage,
- Coverage profile by goleft,
- Variants QC for filtered germline and somatic variants (by bcftools),
- Reads QC (by FastQC).
We also include reference "good" samples in the background for comparison.
The idea is to see if we can miss variants due to abnormal coverage (e.g. because of copy numbers or abnormal GC).
- Run
goleft indexcov
to plot fast global coverage overview from a BAM or CRAM index. - Run
mosdepth
to calculate quantized coverage in exons of cancer genes if interest, using 4 groups for quantization: NO_COVERAGE, LOW_COVERAGE, CALLABLE, HIGH_COVERAGE. For tumor, the thresholds are adjusted by average purity (as reported by PURPLE). The low coverage threshold is 12x divided by purity (the minimal coverage to call a pure heterozugous variant), the high coverage threshold is 100 divided by purity (suspiciously high coverage, might indicate an LCR, a repeat, or a CN). - Run CACAO using the same thresholds to calculate coverage in loci of interest. For germline variants, use ClinVar predisposition variants. For somatic variants, use CiViC actionable variants and cancerhotspots.org somatic hotspots. This step generates two HTML reports (one for somatic, one for germline variants).
umccrise produces 6 reports:
- PCGR containing small somatic variants (SNPs and indels) classified according to ACMG guidelines, and MSI status of the sample.
- CPSR containing small germline variants (SNPs and indels) in cancer predisposition genes, ranked by ACMG guidelines and cancer-specific criteria.
- CACAO for tumor sample, reporting coverage for clinically actionable and pathogenic loci in cancer
- CACAO for normal sample, reporting coverage in likely pathogenic variants cancer predisposition protein-coding genes
- MultiQC report with QC stats and plots
- UMCCR cancer report containing:
- Somatic mutation profile (global and in cancer genes),
- Mutational signatures (by the MutationalPatterns R package),
- Structural variants,
- Copy number variants,
- PURPLE QC status,
- Circos plot,
- Oncoviral content and integration sites.
For reporting and variant prioritization, we prepared a UMCCR cancer key genes set. It has been built off several sources:
- Cancermine with at least 2 publication with at least 3 citations,
- NCG known cancer genes,
- Tier 1 COSMIC Cancer Gene Census (CGC),
- CACAO hotspot genes (curated from ClinVar, CiViC, cancerhotspots),
- At least 2 matches in the following 5 sources and 8 clinical panels:
- Cancer predisposition genes (CPSR list)
- COSMIC Cancer Gene Census (tier 2)
- AZ300
- Familial Cancer
- OncoKB annotated
- MSKC-IMPACT
- MSKC-Heme
- PMCC-CCP
- Illumina-TS500
- TEMPUS
- Foundation One
- Foundation Heme
- Vogelstein
The result is a list of 1248 genes.
-
run_sage
- Runs
sage-2.2.jar
(jar included). Don't actually think this gets run by default, probably was used in a project-specific context. - input: BAMs (T/N),
hotspots_vcf
,coding_bed
,high_conf_bed
- output:
work/{batch}/small_variants/sage2/{batch}.vcf.gz
- Runs
-
somatic_vcf_pass_sort
- Filters to only PASS and sorts
- input:
batch_by_name[wc.batch].somatic_vcf
or above output VCF - output:
work/{batch}/small_variants/pass_sort/{batch}-somatic.vcf.gz
-
somatic_vcf_select_noalt
- Filters to variants within noalt regions (chr1-chr22, chrX, chrY, chrM)
- input: above output VCF,
noalts_bed
- output:
work/{batch}/small_variants/noalt/{batch}-somatic.vcf.gz
-
somatic_vcf_sage1
- Runs
sage v1.0
- input: above output (noalt) VCF, BAMs (T/N)
- output:
work/{batch}/small_variants/sage1/{batch}-somatic.vcf.gz
{batch}/small_variants/sage1/{batch}-sage.vcf.gz
- Snakemake subworkflow:
run_sage
- sage v1.0 with T/N BAMs, hotspots, coding regions,
- output:
work/call/{SAMPLE}-sage.vcf.gz
sage_rename_anno
- sed
HOTSPOT -> SAGE_HOTSPOT
- output:
work/rename_anno/{SAMPLE}-sage.vcf.gz
- sed
sage_reorder_samples
- reorder samples, tumor first
- output:
work/sage_reorder_samples/{SAMPLE}-sage.vcf.gz
(final Sage VCF)
sage_pass
- keep PASS
- output:
work/sage_pass/{SAMPLE}-sage.vcf.gz
sage_pass_novel
- intersect Sage PASS VCF with noalt VCF, keep only those 'novel' variants found in Sage
- output:
work/sage_pass_novel/{SAMPLE}-sage.vcf.gz
add_novel_sage_calls
- concatenate above 'novel' with noalt VCF
- output:
work/add_novel_sage_calls/{SAMPLE}.vcf.gz
sort_saged
- sort above
- output:
work/sort_saged/{SAMPLE}.vcf.gz
annotate_from_sage
- input: above output VCF 'vcf' and final Sage VCF ('sage')
- output:
work/annotate_from_sage/{SAMPLE}.vcf.gz
- iterate through 'sage' variants and create a
sage_calls
dict with chr/pos/ref/alt keys and the cyvcf2 record as values, then for 'vcf' variants that are PASSed, annotate 'SAGE_HOTSPOT' and use a 'PASS' FILTER, else use the 'SAGE_lowconf' FILTER tag. Then set theFORMAT/DP
andFORMAT/AD
based on the 'sage' call. Something like that.
copy_result
- copy above output VCF to
work/{batch}/small_variants/sage1/{batch}-somatic.vcf.gz
- copy above output VCF to
sage
- copy
work/call/{SAMPLE}-sage.vcf.gz
to{batch}/small_variants/sage1/{batch}-sage.vcf.gz
- copy
- Runs
-
somatic_vcf_annotate
-
input: above (work) output file if
batch_by_name[wc.batch].somatic_vcf
, else thenoalt
-
output:
work/{batch}/small_variants/annotate/{batch}-somatic.vcf.gz
work/{batch}/small_variants/somatic_anno/subset_highly_mutated_stats.yaml
-
Runs
anno_somatic_vcf
fromvcf_stuff
- https://github.com/umccr/vcf_stuff/blob/master/scripts/anno_somatic_vcf
- https://github.com/umccr/vcf_stuff/blob/master/vcf_stuff/filtering/annotate_somatic_vcf.smk
- Snakemake subworkflow:
prep_hmf_hotspots
- Filters HMF hotspots file from 17,875 in total down to 10,209 HMF variants
- input:
hotspots
reference file - output:
somatic_anno/hmf_hotspot.vcf.gz
prep_anno_toml
- Prepares TOML file for use with vcfanno
- input: lots of reference files
- output:
somatic_anno/tricky_vcfanno.toml
somatic_vcf_regions_anno
- Runs vcfanno
- input: TOML from above and the original (
sage1
) input VCF - output:
somatic_anno/vcfanno/{SAMPLE}-somatic.vcf.gz
maybe_subset_highly_mutated
- input: above output vcfanno'd VCF
- output:
somatic_anno/subset/{SAMPLE}-somatic.vcf.gz
,stats.yaml
- Count total vars
- If that's
<= 500K
, all good, just copy that to the next step - If that's
> 500K
:- grab the
INFO/gnomad_AF
field- if that exists, is
>= 0.01
, and is not a HMF/SAGE HOTSPOT, discard it
- if that exists, is
- grab the
- If that's
- Write the count stats into the yaml output
somatic_vcf_clean_info
- input: above output subset VCF
- output:
somatic_anno/clean_info/{SAMPLE}-somatic.vcf.gz
- process:
- Add a
TRICKY
INFO field in the VCF metadata (proc_hdr
func) - Remove
TRICKY_*
andANN
fields from VCF metadata (postproc_hdr
func) - Remove
ANN
andTRICKY_*
INFO fields from variants, but join theTRICKY_*
fields into a single pipe separated field underINFO/TRICKY
. - This basically reduces the VCF size substantially.
- Add a
somatic_vcf_prep
- input: above output cleaned VCF
- output:
somatic_anno/prep/{SAMPLE}-somatic.vcf.gz
- annotate
TUMOR/NORMAL_
-AF/DP/VD
fields for PCGR
sage_pon
- input: above output VCF,
hmf_pon
VCF withPON_COUNT
/PON_MAX
/PON_TOTAL
columns for annotation - output:
somatic_anno/sage_pon/{SAMPLE}-somatic.vcf.gz
- annotates VCF with
SageGermlinePon.hg38.98x.vcf.gz
HMF PoN counts
- input: above output VCF,
somatic_vcf_pon_anno
- input: above output VCF, and
panel_of_normals_dir
(withpon.snps.vcf.gz
andpon.indels.vcf.gz
) - output:
somatic_anno/pon/{SAMPLE}-somatic.vcf.gz
- annotates with
INFO/PoN_CNT
, optionallyINFO/PoN_CNT>=filter_hits with FILTER=PoN
- input: above output VCF, and
somatic_vcf_pcgr_round1
- input: above output VCF, and PCGR reference data (
pcgr_data
) - output:
somatic_anno/pcgr_run/{SAMPLE}-somatic.pcgr_ready.vep.vcf.gz
somatic_anno/pcgr_run/{SAMPLE}-somatic.pcgr.snvs_indels.tiers.tsv
- Runs PCGR (first time) via https://github.com/umccr/umccrise/blob/master/scripts/pcgr_wrap
- input: above output VCF, and PCGR reference data (
somatic_vcf_pcgr_anno
- input:
- above output VCF + tiers files
- annotated VCF from
somatic_vcf_pon_anno
step
- output:
somatic_anno/pcgr_ann/{SAMPLE}-somatic.vcf.gz
- annotate with
PCGR_
-SYMBOL
,TIER
,CONSEQUENCE
,MUTATION_HOTSPOT
,PUTATIVE_DRIVER_MUTATION
,TCGA_PANCANCER_COUNT
,CLINVAR_CLNSIG
, andCOSMIC_CNT
,ICGC_PCAWG_HITS
,CSQ
- input:
annotate
- input: above output VCF
- output:
work/{batch}/small_variants/annotate/{batch}-somatic.vcf.gz
- simply copies input to output
-
-
somatic_vcf_filter
- Runs
filter_somatic_vcf
fromvcf_stuff
- input: above final output VCF
(
work/{batch}/small_variants/annotate/{batch}-somatic.vcf.gz
) - output:
{batch}/small_variants/{batch}-somatic.vcf.gz
- steps:
- Keep where
INFO/PCGR_TIER in [1, 2]
- Keep where
INFO/SAGE_HOTSPOT == 'known'
- Keep where
INFO/TIER == 'HOTSPOT'
- Dump where
INFO/TUMOR_VD < 4
(too few reads support variant) - Dump where
INFO/PoN_CNT >= 5
(in PoN, likely germline or artefact)- Report as Germline if is otherwise PASSed and
INFO/TUMOR_AF >= 0.2
- See the
germline_leakage
rule ingermline.smk
- See the
- Report as Germline if is otherwise PASSed and
- Keep potential hotspots:
HMF_HOTSPOT
PCGR_INTOGEN_DRIVER_MUT
PCGR_MUTATION_HOTSPOT
PCGR_CLINVAR_CLNSIG
is 'pathogenic' or 'uncertain'COSMIC_CNT >= 10
PCGR_TCGA_PANCANCER_COUNT >= 5
ICGC_PCAWG_HITS >= 5
- Dump where
INFO/TUMOR_AF < 0.1
- Dump where has
INFO/ENCODE
flag (hits ENCODE blocklist) - Dump indels in homopolymers (
INFO/MSILEN
andINFO/MSI
plus formula) - Dump where
INFO/TUMOR_VD < 6
and:- variant in low complexity region (LCR)
- no
INFO/HMF_GIAB_CONF
flag INFO/HMF_MAPPABILITY < 0.9
- Dump indels in bad promoters
- Dump strand biased variants based on Vardict
- (DRAGEN) Dump where
INFO/TLOD < 15
- Dump where
INFO/gnomAD_AF >= 0.01
- Report as Germline if is otherwise PASSed
- See the
germline_leakage
rule ingermline.smk
- See the
- Report as Germline if is otherwise PASSed
- Keep where
- Runs
-
somatic_vcf_filter_pass
-
Keep only PASS from above filtered VCF
-
input: above output VCF
-
output:
{batch}/small_variants/{batch}-somatic-PASS.vcf.gz
-
The
{batch}/small_variants/{batch}-somatic-PASS.vcf.gz
VCF is also passed as input to:Pierian
vcf2maf.pl
bcftools stats
PCGR
-
-
run_pcgr
- run PCGR with purity and ploidy as inferred by PURPLE
- uses the
scripts/pcgr
wrapper - raw PCGR outputs get renamed to remove the
_acmg.hg38
suffix
- uses the
- input:
{batch}/small_variants/{batch}-somatic-PASS.vcf.gz
- PCGR reference data (
pcgr_data
) work/{batch}/purple/{batch}.purple.purity.tsv
- output:
- HTML:
work/{batch}/pcgr/{batch}-somatic.pcgr.html
- VCF:
work/{batch}/pcgr/{batch}-somatic.pcgr.pass.vcf.gz
- TSV:
work/{batch}/pcgr/{batch}-somatic.pcgr.snvs_indels.tiers.tsv
- HTML:
- run PCGR with purity and ploidy as inferred by PURPLE
-
pcgr_copy_report
- copy HTML and TSV (not VCF) into final
umccrised/{batch}/
directory - input: above HTML and TSV outputs
- output:
{batch}/{batch}-somatic.pcgr.html
{batch}/small_variants/{batch}-somatic.pcgr.snvs_indels.tiers.tsv
- copy HTML and TSV (not VCF) into final
-
PCGR notes
- Step 0: Validate input data and options
- Runs
scripts/pcgr_validate_input.py
- input: raw VCF input to PCGR
- output:
pcgr_ready.vcf.gz
decomposed with no FORMAT or sample columns
- Runs
- Step 1: VEP annotation
- Runs
vep
with lots of options - input:
pcgr_ready.vcf.gz
- output:
pcgr_ready.vep.vcf.gz
- Runs
- Step 2: vcfanno precision oncology
- Runs
pcgr_vcfanno.py
- input:
pcgr_ready.vep.vcf.gz
- output:
pcgr_ready.vep.vcfanno.vcf.gz
- Runs
- Step 3: Cancer gene annotations with pcgr-summarise
pcgr_summarise.py
vcf2tsv.py
- Step 4: Generation of outputs/reports
- Uses
pcgrr
functions via thepcgrr.R
CLI
- Uses
-
sv_keep_pass
- Reset INFO (
SIMPLE_ANN
,SV_HIGHEST_TIER
) and FILTER (Intergenic
,MissingAnn
,REJECT
) fields. Keep only PASS variants. - input: final Manta VCF from DRAGEN tumor/normal workflow
- output:
work/{batch}/structural/keep_pass/{batch}-manta.vcf.gz
- Reset INFO (
-
sv_snpeff_maybe
- Run snpEff if there is not an
INFO/ANN
annotation.- use
-hgvs
,-cancer
,-csvStats
,-s
,-dataDir
options - else, just copy input to output
- use
- input:
- above output VCF
- snpEff database
- output:
work/{batch}/structural/snpeff/{batch}-sv-snpeff.vcf.gz
- Run snpEff if there is not an
-
sv_vep
- Run VEP on final Manta VCF
- input:
- final Manta VCF from DRAGEN tumor/normal workflow
- PCGR data directory (for the VEP reference data)
- FASTA file:
grch38/.vep/homo_sapiens/105_GRCh38/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
- FASTA file:
- output:
work/{batch}/structural/vep/{batch}-sv-vep.vcf.gz
-
fix_snpeff
- Replace the wrongly capitalised ALT fields with lowercase to avoid
downstream bugs
- TODO: turn the multiple
sed
commands into a single one with-e
.
- TODO: turn the multiple
- input: snpEff output VCF
- output:
work/{batch}/structural/snpeff/{batch}-sv-snpeff-fix.vcf.gz
- Replace the wrongly capitalised ALT fields with lowercase to avoid
downstream bugs
-
sv_prioritize
- Run https://github.com/umccr/vcf_stuff/blob/master/scripts/prioritize_sv
- input: above fixed snpEff VCF
- output:
work/{batch}/structural/prioritize/{batch}-sv-eff-prio.vcf.gz
-
sv_subsample_if_too_many
- If >= 50,000 SVs, progressively remove TIER 4/3/2 SVs (in that order).
- TODO: compress output VCF
- input: above sv prioritised VCF
- output:
work/{batch}/structural/sv_subsample_if_too_many/{batch}-manta.vcf
- If >= 50,000 SVs, progressively remove TIER 4/3/2 SVs (in that order).
-
sv_bpi_maybe
- Run BPI on above output VCF
- input: above output VCF, T/N BAMs
- output:
work/{batch}/structural/maybe_bpi/{batch}-manta.vcf
-
filter_sv_vcf
- Filters in sequence:
- keep PASS FILTER
- dump TIER 3/4 where
SR < 5 & PR < 5
- dump TIER 3/4 where
SR < 10 & PR < 10 & (AF0 < 0.1 | AF1 < 0.1)
- TODO: reorg this
- input: above BPI output VCF
- output:
work/{batch}/structural/filt/{batch}-manta.vcf.gz
- This is fed to PURPLE.
- Filters in sequence:
-
reprioritize_rescued_svs
- Run https://github.com/umccr/vcf_stuff/blob/master/scripts/prioritize_sv
after PURPLE. Also remove
INFO/ANN
annotation. - input:
- purple SVs:
work/{batch}/purple/{batch}.purple.sv.vcf.gz
- purple SVs:
- output:
work/{batch}/structural/sv_after_purple/{batch}-manta.vcf.gz
- Run https://github.com/umccr/vcf_stuff/blob/master/scripts/prioritize_sv
after PURPLE. Also remove
-
copy_sv_vcf_ffpe_mode
- Copy either the filtered or the purple prioritised to the final output
- input: filtered or purple prioritised VCF
- output:
{batch}/structural/{batch}-manta.vcf.gz
-
prep_sv_tsv
- input: above final VCF
- output:
{batch}/structural/{batch}-manta.tsv
- Parse info from the VCF and export to TSV
- give non-tiered variants an
INFO/SV_TOP_TIER
of4
. - set
simple_ann = INFO/SIMPLE_ANN
, andPURPLE_status = ''
- for purple inferred variants:
simple_ann = {INFO/SVTYPE}||||From_CNV|{tier}
PURPLE_status = 'INFERRED'
- for purple recovered variants:
PURPLE_status = 'RECOVERED'
- for purple inferred variants:
- there are 23 columns:
caller = 'manta'
,sample = tumor name
,chrom = CHROM
,start = POS
,end = INFO/END
svtype = INFO/SVTYPE
split_read_support, paired_support_PE/PR = FORMAT/SR-PE-PR for tumor
AF_BPI = INFO/BPI_AF
,somaticscore = INFO/SOMATICSCORE
tier = INFO/SV_TOP_TIER
(or4
if missing)annotation = INFO/SIMPLE_ANN
PURPLE_
:AF
,CN
,CN_CHANGE
,PLOIDY
,STATUS
(INFERRED
orRECOVERED
)START/END_BPI = BPI_START/END
ID = ID
MATEID = INFO/MATEID
ALT = ALT[0]
- give non-tiered variants an
viral_content
- input:
- tumor BAM
- refdata
- input:
- output1:
prioritized_oncoviruses.tsv
## Viral sequences (from https://gdc.cancer.gov/about-data/data-harmonization-and-generation/gdc-reference-files) found in unmapped reads
## Sample: MDX230148
## Minimal completeness: 50.0% at 1x or 1000bp at 5x
#virus size depth 1x 5x 25x significance
HPV45 7858 67.1 0.768 0.767 0.767 significant
HCV-2 9711 33.0 0.012 0.011 0.010 .
HPV71 8037 20.8 0.011 0.011 0.009 .
HCV-1 9646 6.3 0.011 0.010 0.010 .
HPV19 7685 1.9 0.010 0.009 0.009 .
[...]
- output2:
present_viruses.txt
- grab the
significant
viruses from above
- grab the
HPV45
- command:
oviraptor \
{tumor_bam} \
-o {work_dir} \
-s {tumor_name} \
--genomes-dir {refdata} \
--only-detect # just detect without finding integration sites, do that in next step
viral_integration_sites
- input:
- tumor BAM
present_viruses.txt
output from above
- input:
- output1:
{batch}/oncoviruses/oncoviral_breakpoints.vcf
chr7 19105901 3_1 N N[HPV45:3062[ . PASS SVTYPE=BND;STRANDS=+-:44;EVENT=3;MATEID=3_2;CIPOS=-3,2;CIEND=-10,9;CIPOS95=0,0;CIEND95=0,0;SU=44;PE=34;SR=10;DisruptedGenes=TWIST1;GenesWithin100kb=AC003986.2,AC003986.3,TWIST1,AC003986.1 GT:SU:PE:SR ./.:44:34:10
chr7 19105936 4_1 N ]HPV45:1236]N . PASS SVTYPE=BND;STRANDS=-+:91;EVENT=4;MATEID=4_2;CIPOS=0,0;CIEND=-10,3;CIPOS95=0,1;CIEND95=0,0;IMPRECISE;SU=91;PE=64;SR=27;DisruptedGenes=TWIST1;GenesWithin100kb=AC003986.2,AC003986.3,TWIST1,AC003986.1 GT:SU:PE:SR ./.:91:64:27
chr13 98758355 5_1 N N]HPV45:5033] . PASS SVTYPE=BND;STRANDS=++:5;EVENT=5;MATEID=5_2;CIPOS=-10,9;CIEND=-10,9;CIPOS95=0,0;CIEND95=0,0;SU=5;PE=0;SR=5;GenesWithin100kb=SLC15A1,DOCK9-AS1 GT:SU:PE:SR ./.:5:0:5
HPV45 1237 4_2 N N[chr7:19105935[ . PASS SVTYPE=BND;STRANDS=+-:91;SECONDARY;EVENT=4;MATEID=4_1;CIPOS=-10,3;CIEND=0,0;CIPOS95=0,0;CIEND95=0,1;IMPRECISE;SU=91;PE=64;SR=27;DisruptedGenes=TWIST1;GenesWithin100kb=AC003986.2,AC003986.3,TWIST1,AC003986.1 GT:SU:PE:SR ./.:91:64:27
HPV45 3063 3_2 N ]chr7:19105900]N . PASS SVTYPE=BND;STRANDS=-+:44;SECONDARY;EVENT=3;MATEID=3_1;CIPOS=-10,9;CIEND=-3,2;CIPOS95=0,0;CIEND95=0,0;SU=44;PE=34;SR=10;DisruptedGenes=TWIST1;GenesWithin100kb=AC003986.2,AC003986.3,TWIST1,AC003986.1 GT:SU:PE:SR ./.:44:34:10
HPV45 5034 5_2 N N]chr13:98758354] . PASS SVTYPE=BND;STRANDS=++:5;SECONDARY;EVENT=5;MATEID=5_1;CIPOS=-10,9;CIEND=-10,9;CIPOS95=0,0;CIEND95=0,0;SU=5;PE=0;SR=5;GenesWithin100kb=SLC15A1,DOCK9-AS1 GT:SU:PE:SR ./.:5:0:5
- command:
oviraptor \
{tumor_bam} \
-o {work_dir} \
-s {tumor_name} \
--genomes-dir {genomes_dir} \
-v $(cat present_viruses.txt)
oncoviruses_breakpoints_tsv
- input:
{batch}/oncoviruses/oncoviral_breakpoints.vcf
output from abovework/{batch}/oncoviruses/present_viruses.txt
output from above-above
- input:
- command:
- Parse VCF from above, grab FORMAT PE/SR fields and add to
PAIR_COUNT
- Parse VCF from above, grab FORMAT PE/SR fields and add to
- output1:
work/{batch}/oncoviruses/oncoviral_breakpoints.tsv
# A tibble: 6 × 10
sample contig start end svtype PAIR_COUNT DisruptedGenes UpstreamGenes ID MATEID
<chr> <chr> <dbl> <lgl> <chr> <dbl> <chr> <chr> <chr> <chr>
1 MDX230148 chr7 19105901 NA BND 44 TWIST1 AC003986.2, AC003986.3, AC003986.1 3_1 3_2
2 MDX230148 chr7 19105936 NA BND 91 TWIST1 AC003986.2, AC003986.3, AC003986.1 4_1 4_2
3 MDX230148 chr13 98758355 NA BND 5 NA SLC15A1, DOCK9-AS1 5_1 5_2
4 MDX230148 HPV45 1237 NA BND 91 NA . 4_2 4_1
5 MDX230148 HPV45 3063 NA BND 44 NA . 3_2 3_1
6 MDX230148 HPV45 5034 NA BND 5 NA . 5_2 5_1
oviraptor - https://github.com/umccr/oviraptor
-
extract_unmapped_and_mate_unmapped_reads
- input: tumor BAM
- output:
step1_host_unmapped_or_mate_unmapped.namesorted.bam
- command:
sambamba view -F 'unmapped or mate_is_unmapped' ... | samtools sort -n ...
-
unmapped_and_mate_unmapped_reads_to_fastq
- input: output BAM from above rule
- output:
step2_host_unmapped_or_mate_unmapped.R1.fq
step2_host_unmapped_or_mate_unmapped.R2.fq
step2_host_unmapped_or_mate_unmapped.single.fq
- command:
samtools fastq input_bam ...
-
bwa_unmapped_and_mate_unmapped_reads_to_gdc
- input:
- paired FASTQs from above
- gds viral FASTA (this is where EBV gets lost; see #143)
- output:
detect_viral_reference/host_unmapped_or_mate_unmapped_to_gdc.bam
- command:
- input:
minimap2 \
-ax sr \
-Y \
-t{threads} \
-R '{params.rg}' \
{input.gdc_fa} {input.fq1} {input.fq2} \
| samtools sort -@{threads} -Obam -o {output.gdc_bam}
-
index_virus_bam
- Just
samtools index
above output BAM
- Just
-
mosdepth
- Runs mosdepth using the GDS viral FASTA index
- output1:
thresholds.bed.gz
#chrom start end region 1X 5X 25X
CMV 0 235646 unknown 27 0 0
HBV 0 3215 unknown 0 0 0
HCV-1 0 9646 unknown 104 98 94
HCV-2 0 9711 unknown 116 104 98
HIV-1 0 9181 unknown 0 0 0
HIV-2 0 10359 unknown 0 0 0
KSHV 0 137969 unknown 28 27 0
HTLV-1 0 8507 unknown 0 0 0
MCV 0 5387 unknown 0 0 0
[...]
- output2:
regions.bed.gz
CMV 0 235646 0.00
HBV 0 3215 0.00
HCV-1 0 9646 6.28
HCV-2 0 9711 33.05
HIV-1 0 9181 0.00
HIV-2 0 10359 0.00
KSHV 0 137969 0.00
HTLV-1 0 8507 0.00
MCV 0 5387 0.00
SV40 0 5243 0.00
[...]
prioritize_viruses
- input: above mosdepth output files
- output:
prioritized_oncoviruses.tsv