- Qian Zeng (Labcorp)
- Ayan Malakar (Columbia University)
- Colin Diesh (University of California, Berkeley)
- Peiming (Peter) Huang (Baylor College of Medicine)
- Seung Jae Lee (University of Southern California)
- Sagayamary Sagayaradj (BASF)
Structural variants (SVs) are widely present in human genomes, but accurately detecting them in next-generation sequencing (NGS) data remains challenging. A number of SV calling tools have been developed in the past few years, yet it remains unclear what an ideal SV calling protocol should look like. The goal of this project is to develop a generalized framework for evaluating the performance of SV calling tools on NGS short-read datasets and to formulate an optimized SV calling protocol. Because much current clinical testing is still short-read based, we aim to define an SV calling protocol suited to analyzing large-scale short-read datasets.
Input alignment indexes (GIAB HG002):
- https://github.com/genome-in-a-bottle/giab_data_indexes/blob/master/AshkenazimTrio/alignment.index.AJtrio_Illumina_2x250bps_novoalign_GRCh37_GRCh38_NHGRI_06062016.HG002
- https://github.com/genome-in-a-bottle/giab_data_indexes/blob/master/AshkenazimTrio/alignment.index.AJtrio_Illumina300X_wgs_novoalign_GRCh37_GRCh38_NHGRI_07282015.HG002

The source location of the NIST HG002 benchmark: https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/release/AshkenazimTrio/HG002_NA24385_son/NIST_SV_v0.6/
SV calling methods generally fall into two groups: mapping-based and assembly-based. In mapping-based methods, short reads are first aligned to a reference genome and SVs are then called from the read alignments. In assembly-based methods, reads are either assembled directly into contigs (“global assembly”), or first aligned to a reference genome and then assembled region by region into contigs (“local assembly”); in both cases the contigs (and the corresponding reads) are then aligned to the reference genome for SV calling. Once the SV calls are available, the performance of the SV calling protocol can be evaluated by comparison with an independently developed high-confidence truth set. For this purpose, we use the GIAB HG002 (Ashkenazim trio son) SV dataset as the truth. The findings will help us assess the pros and cons of each method and recommend an optimized SV calling protocol for NGS short reads.
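As a rough illustration of what "comparison with a truth set" boils down to, the sketch below derives precision, recall and F1 from counts of true positives, false positives and false negatives. The function name `benchmark_metrics` and the example counts are hypothetical and are not results from this project; benchmarking tools report their own summaries and may count matches differently.

```python
def benchmark_metrics(tp: int, fp: int, fn: int) -> dict:
    """Summarise a call set against a truth set.

    tp: calls that match a truth-set SV
    fp: calls with no matching truth-set SV
    fn: truth-set SVs that no call matched
    """
    # Guard every division so empty call sets or truth sets return 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    # Hypothetical counts, for illustration only
    print(benchmark_metrics(tp=80, fp=20, fn=20))
```

Precision penalises spurious calls and recall penalises missed truth-set SVs, so reporting both (plus F1) makes callers with different trade-offs easier to compare.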
We selected 2x250bp BAM files (80X coverage) from HG002 (Ashkenazim Trio Son, NA24385) as the input for mapping-based SV calling with Lumpy, Delly, Manta, Tardis, dysgu, and cue, followed by Parliament (https://github.com/fritzsedlazeck/parliament2) and SURVIVOR (https://github.com/fritzsedlazeck/SURVIVOR). The BAM files are also converted to FASTQ files and used as input for assembly-based SV calling with SVABA, NucDiff, and SVanalyzer, followed by SURVIVOR. The individual SV calls and the consolidated SV calls are then compared with the HG002 SV truth dataset using Truvari to assess performance.
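To make the idea of "consolidated SV calls" concrete, here is a minimal sketch of consensus merging across callers: calls of the same type on the same chromosome are grouped when their breakpoints lie close together, and a group is kept only if enough distinct callers support it. This is a simplified illustration, not SURVIVOR's actual algorithm; the `Call` type, the `consensus_calls` function, the `max_dist` and `min_callers` defaults, and the coordinates are all assumptions made for this example.

```python
from collections import defaultdict
from typing import NamedTuple


class Call(NamedTuple):
    caller: str   # tool that produced the call
    chrom: str
    start: int
    end: int
    svtype: str   # e.g. "DEL", "DUP"


def consensus_calls(calls, max_dist=1000, min_callers=2):
    """Greedily cluster nearby calls and keep clusters with enough support."""
    # Only calls on the same chromosome and of the same type may be merged
    by_key = defaultdict(list)
    for call in calls:
        by_key[(call.chrom, call.svtype)].append(call)

    consensus = []
    for (chrom, svtype), group in sorted(by_key.items()):
        group.sort(key=lambda c: (c.start, c.end))
        clusters = [[group[0]]]
        for call in group[1:]:
            last = clusters[-1][-1]
            # Both breakpoints must be within max_dist of the previous call
            close = (call.start - last.start <= max_dist
                     and abs(call.end - last.end) <= max_dist)
            if close:
                clusters[-1].append(call)
            else:
                clusters.append([call])
        for cluster in clusters:
            callers = sorted({c.caller for c in cluster})
            if len(callers) >= min_callers:
                consensus.append({
                    "chrom": chrom,
                    "svtype": svtype,
                    "start": min(c.start for c in cluster),
                    "end": max(c.end for c in cluster),
                    "callers": callers,
                })
    return consensus


if __name__ == "__main__":
    # Made-up coordinates, for illustration only
    example = [
        Call("manta", "22", 1000, 2000, "DEL"),
        Call("delly", "22", 1050, 2010, "DEL"),
        Call("lumpy", "22", 50000, 51000, "DEL"),
    ]
    print(consensus_calls(example))
```

Requiring support from more than one caller tends to trade recall for precision, which is one reason to benchmark both the individual and the consolidated call sets.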
## Quick Start:
## SV evaluation:
path/truvari bench -b ../HG002_SVs_Tier1_v0.6.DEL.vcf.gz -c parliament2_HG002.merge.22/HG002.merge.22.survivor_sorted.DEL.vcf.gz -o HG002.merge.22.survivor_sorted.DEL.vcf.gz_vs_HG002_SVs_Tier1_v0.6_0_0_0.5_3000_3000 --passonly --includebed ../HG002_SVs_Tier1_v0.6.bed -p 0 --pctovl 0 --pctsize 0.5 --refdist 3000 -C 3000
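The thresholds in the command above can be read as a matching rule between a truth-set deletion and a candidate call. The sketch below is a simplified approximation of that rule, not Truvari's implementation: it checks breakpoint distance (`--refdist 3000`), size similarity (`--pctsize 0.5`) and reciprocal overlap (`--pctovl 0`), and it does not model sequence similarity or any of Truvari's other logic. The `Deletion` class and the helper names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Deletion:
    chrom: str
    start: int
    end: int

    @property
    def size(self) -> int:
        return self.end - self.start


def size_similarity(a: Deletion, b: Deletion) -> float:
    """Ratio of the smaller to the larger SV size (1.0 means identical sizes)."""
    largest = max(a.size, b.size)
    return min(a.size, b.size) / largest if largest else 1.0


def reciprocal_overlap(a: Deletion, b: Deletion) -> float:
    """Overlap as a fraction of the longer of the two intervals."""
    overlap = max(0, min(a.end, b.end) - max(a.start, b.start))
    longest = max(a.size, b.size)
    return overlap / longest if longest else 0.0


def is_match(truth: Deletion, call: Deletion,
             refdist: int = 3000, pctsize: float = 0.5,
             pctovl: float = 0.0) -> bool:
    """Approximate match test using the thresholds from the command above."""
    if truth.chrom != call.chrom:
        return False
    # Both breakpoints must lie within refdist of the truth breakpoints
    if abs(truth.start - call.start) > refdist:
        return False
    if abs(truth.end - call.end) > refdist:
        return False
    return (size_similarity(truth, call) >= pctsize
            and reciprocal_overlap(truth, call) >= pctovl)


if __name__ == "__main__":
    # Made-up coordinates, for illustration only
    truth = Deletion("22", 1000, 2000)
    print(is_match(truth, Deletion("22", 1200, 2100)))  # sizes 1000 vs 900
    print(is_match(truth, Deletion("22", 1000, 1400)))  # sizes 1000 vs 400
```

With `pctovl` at 0, two calls can match without overlapping at all, as long as their breakpoints are within 3000 bp and their sizes agree to within a factor of two. That tolerance is a pragmatic choice for short-read breakpoints, which are often imprecise.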
## Assembly based:
svaba run -t HG002.hs37d5.2x250.bam -r all -p 80 -G /home/dnanexus/svaba/reference/GRCh37_hs37d5_1kGenomes/hs37d5.fa
Colin’s test is a novel, generic computational test for evaluating errors and false positives from variant callers, including SV callers. It uses the recently available T2T genomes as the reference for read alignment and identifies artifacts as a performance metric for variant callers, complementing the output of Truvari. Specifically, we mapped GIAB HG002 reads back to the HG002 T2T phased genome assembly (both maternal and paternal haplotypes) to assess how this "self-alignment" performs. We also aligned the HG002 T2T phased genome assembly to hg19 chr22, and the HG002 T2T maternal assembly to the paternal assembly, so that we could inspect the NGS alignments at the matched locations and look for the source of errors in the reference-based SV calls.
To demonstrate some of the results of our analysis, we created a genome browser instance containing the alignments we produced, along with an automated script that prepares the alignments and the genome browser setup.
We also created a web-based portal for browsing SVs, with links to the JBrowse 2 genome browser for viewing the supporting evidence.
Screenshot showing a heterozygous deletion
The repo for this web application is at https://github.com/cmdcolin/svxplorer, with a live demo at https://cmdcolin.github.io/svxplorer