SV simulation for rapid benchmarking
SVTeaser is a tool that (A) simulates SVs and reads to create inputs for benchmarking an SV caller and (B) evaluates and reports the SV caller's performance. Users supply SVTeaser with a reference sequence file (.fasta) and, optionally, a set of SVs (.vcf). SVTeaser outputs assorted statistical metrics across a range of read lengths and depths. It achieves rapid assessment by downsampling the full reference to a subset of numerous 10kb regions, to which it adds SVs.
- Build the SVTeaser pip install-able tarball
- Download and install SURVIVOR
- Put the SURVIVOR executable into your environment's PATH

The three steps above are handled by bash install.sh

- Install vcftools
- Ensure vcftools (e.g. vcf-sort) is in your environment's PATH
- Put the ART read simulator executable into your environment's PATH
- Install truvari
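Before running SVTeaser, it can help to confirm that the external tools are reachable. The following is a minimal sketch and not part of SVTeaser: the helper name missing_tools is hypothetical, and art_illumina is an assumption about which ART executable is expected (ART ships a separate executable per sequencing platform).

```shell
# Hypothetical helper (not part of SVTeaser): print each named executable
# that cannot be found on PATH, one per line.
missing_tools() {
    for _tool in "$@"; do
        # command -v exits non-zero when the name does not resolve
        if ! command -v "$_tool" >/dev/null 2>&1; then
            printf '%s\n' "$_tool"
        fi
    done
}

# The executable names below are assumptions; adjust them to match your install.
missing="$(missing_tools SURVIVOR vcf-sort art_illumina truvari)"
if [ -n "$missing" ]; then
    printf 'Not found on PATH:\n%s\n' "$missing" >&2
fi
```

An empty result means every listed tool resolved on PATH; anything printed still needs to be installed or added to PATH.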
usage: svteaser [-h] CMD ...
SVTeaser v0.0.1 - SV simulation for rapid benchmarking
CMDs:
sim_sv Simulate SVs
surv_sim Simulate SVs with SURVIVOR
surv_vcf_fmt Correct a SURVIVOR simSV vcf
sim_reads Run read simulators
positional arguments:
CMD Command to execute
OPTIONS Options to pass to the command
optional arguments:
-h, --help show this help message and exit
Workflow:
- Create an SVTeaser working directory (output.svt) by simulating SVs over a reference
  svteaser surv_sim reference.fasta workdir
- (in progress) Simulate reads over the altered reference and place them in the output.svt directory
  svteaser sim_reads workdir.svt
- Call SVs over the reads (output.svt/read1.fastq output.svt/read2.fastq) with your favorite SV caller
- Run truvari bench with --base output.svt/simulated.sv.vcf.gz and --comp your_calls.vcf.gz
- Open notebooks/SVTeaser.ipynb and point it to your output.svt directory

See test/workflow_test.sh for an example.
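The workflow steps can be collected into one script. The sketch below only prints the commands instead of running them, so they can be reviewed first. The function name print_workflow and its arguments are hypothetical. It assumes the working directory is created as the given name plus a .svt suffix, which the steps above suggest but do not state, and the truvari output directory (-o bench_out) is likewise an assumption. The SV calling step is left as a comment because it depends on the caller.

```shell
# Hypothetical helper: print the SVTeaser workflow commands for review.
#   $1 = reference FASTA, $2 = working directory name, $3 = SV caller's VCF
# Assumption: surv_sim creates "<workdir>.svt"; change the paths if it does not.
print_workflow() {
    ref="$1"; workdir="$2"; calls="$3"
    echo "svteaser surv_sim $ref $workdir"
    echo "svteaser sim_reads $workdir.svt"
    echo "# run your SV caller on $workdir.svt/read1.fastq $workdir.svt/read2.fastq"
    # -o names truvari's output directory; "bench_out" is an arbitrary choice
    echo "truvari bench --base $workdir.svt/simulated.sv.vcf.gz --comp $calls -o bench_out"
}

print_workflow reference.fasta workdir your_calls.vcf.gz
```

Printing the commands first makes it easy to correct paths before anything is run; piping the output to a shell would execute the steps in order.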
Two methods for SV simulation are supported in SVTeaser:
- (done) simulation of SVs with SURVIVOR
- (in progress) simulation of SVs from VCFs

Running simulation in either mode results in an output directory with the following structure:
$ svteaser surv_sim reference.fasta workdir
$ ll -h workdir
total 2.3M
drwxr-xr-x 2 user hardware 4.0K Oct 12 15:38 ./
drwxr-xr-x 13 user hardware 4.0K Oct 12 15:38 ../
-rw-r--r-- 1 user hardware 1.1M Oct 12 15:38 svteaser.altered.fa # <---- Multi-FASTA with all altered region sequences
-rw-r--r-- 1 user hardware 980K Oct 12 15:38 svteaser.ref.fa # <---- Multi-FASTA with all unaltered region sequences
-rw-r--r-- 1 user hardware 228K Oct 12 15:38 svteaser.sim.vcf # <---- Combined VCF with variants from each region
-rw-r--r-- 1 user hardware 34K Oct 12 15:38 svteaser.sim.vcf.gz
-rw-r--r-- 1 user hardware 121 Oct 12 15:38 svteaser.sim.vcf.gz.tbi
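
A quick way to confirm that a simulation run completed is to check that the files in the listing above exist. This is a minimal sketch: the function name check_sim_output is hypothetical, and the expected file names are taken directly from the listing.

```shell
# Hypothetical helper: report any expected simulation output missing from
# the directory given as $1. Returns 0 only when every file is present.
check_sim_output() {
    dir="$1"
    status=0
    for f in svteaser.altered.fa svteaser.ref.fa svteaser.sim.vcf \
             svteaser.sim.vcf.gz svteaser.sim.vcf.gz.tbi; do
        if [ ! -f "$dir/$f" ]; then
            echo "missing: $dir/$f" >&2
            status=1
        fi
    done
    return "$status"
}
```

For example, check_sim_output workdir returns 0 when all five files are present and otherwise lists the missing ones on stderr.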