Toolkit to download datasets, generate virtual polyploids, phase them, and assess phasing quality.
The Phasing Toolkit is an open source collection of scripts which automate the steps of alignment-based phasing operations.
Currently, the focus is on making it easy to benchmark phasing algorithms with a focus on polyploid tools. Ultimately the goal is to develop a general purpose phasing toolkit. Everyone is welcome and encouraged to contribute to this project.
Install bioconda and set the correct channels by following the first two steps here: https://bioconda.github.io/user/install.html
Then you can create a new environment and install polyploidBenchmarking with the following commands
conda create -n polyploidBenchmarking python=3.8
conda activate polyploidBenchmarking
conda install -c oakheart nphase
conda install -c bioconda -c bioinf-mcb sra-downloader
conda install whatshap
conda install -c bioconda ncbi-genome-download
conda install psutil
If you wish to test flopp you must download it manually (any help automating this step would be appreciated), instructions are available here: https://github.com/bluenote-1577/flopp
Make sure "flopp" is in your $PATH (such that entering "flopp" into the command line launches it)
git clone https://github.com/OmarOakheart/Phasing-Toolkit.git
There are currently five scripts which all require a parameter file (benchmarkingParameters.tsv) as input. In its current state, the Phasing Toolkit is optimized to easily benchmark polyploid phasing algorithms on virtual polyploid datasets. A virtual polyploid is created by taking the sequencing reads of genetically distinct haploid or homozygous diploid organisms of the same species and merging them, thus using real reads to generate simulated polyploids. We can then run phasing algorithms on these simulated or virtual polyploids and compare the genotypes of the haploids to the predictions to calculate phasing accuracy metrics. This toolkit provides a way to perform all of these steps, from data download to accuracy metric calculation, with minimal effort.
Note: Virtual polyploids don't perfectly approximate real polyploid data and we welcome any contribution to this project for alternate ways of generating polyploid phasing benchmarking datasets.
benchmarkingParameters.tsv
mainPath /path/to/outputFolder/
longReads ACA:ERR4352153 BMB:ERR4352154 CCN:ERR4352155 CRL:ERR4352156
longReadMethod ont
shortReads ACA:ERR1309429 BMB:ERR1308675 CCN:ERR1308732 CRL:ERR1308952
coverages 20 10
heterozygosityRates 0.05 0.1 0.5 1
reference taxID:559292 group:fungi speciesName:sCerevisiae
strainLists BMB,CCN ACA,BMB,CCN ACA,BMB,CCN,CRL
genomeSize 12500000
threads 8
The parameter file contains all of the information required by the various scripts of the phasing toolkit. It is tab-separated and contains multiple distinct lines. Line order is not important. We provide an example file, benchmarkingParameters.tsv, which can be used as a template. A detailed description of this file is available at the end of the page.
Usage: python3.8 downloadData.py benchmarkingParameters.tsv
- Downloads reads using sra-downloader (https://github.com/s-andrews/sradownloader). Input: SRA Accession numbers associated with strain names
- Downloads and indexes a reference sequence using ncbi-genome-download (https://github.com/kblin/ncbi-genome-download). Input: taxID of the species, group according to the ncbi (fungi, viral, etc.), and species name.
Usage: python3.8 processReads.py benchmarkingParameters.tsv groundTruth
- Generates the ground truth dataset. Maps short reads and long reads, variants calls short reads
Usage: python3.8 hybridGenerator.py benchmarkingParameters.tsv
- Generates virtual hybrids by merging the ground truth dataset reads at specified coverage levels.
Usage: python3.8 processReads.py benchmarkingParameters.tsv virtualHybrids
- Runs on the virtual polyploids generated by hybridGenerator.py. Maps short reads and long reads, variants calls and subsets short read VCFs, with and without indels
Usage: python3.8 phaseToolRunner.py benchmarkingParameters.tsv flopp nphase whatshap-polyphase [...]
- Phases the virtual polyploids generated by hybridGenerator.py and processed by processReads.py [virtualHybrids]. Will run every phasing algorithm specified in the command line. This script also outputs performance metrics (runtime and memory usage). Currently only flopp, nphase and whatshap polyphase are implemented. Please feel free to contribute to the project by adding an implementation of another phasing algorithm you would like to see included.
Usage: python3.8 accuracyCalculator.py benchmarkingParameters.tsv flopp nphase whatshap-polyphase [...]
- Calculates accuracy metrics of phasing results obtained by running phaseToolRunner.py and outputs them to a file and to stdout.
With the example benchmarkingParameters.tsv file, you can run a full benchmarking pipeline using 4 strains to generate a virtual 2n, 3n and 4n, at coverage levels 10X/haplotype and 20X/haplotype, and randomly subset their variants to obtain VCF files at 1%, 0.5%, 0.1% and 0.05% heterozygosity, with and without indels. In total it generates 3 ploidies at 2 coverage levels and 4 heterozygosity levels each, with and without indels, for a total of 3*2*4*2=48 virtual polyploids which are all phased by flopp, nphase and whatshap polyphase and their results analyzed in comparison with the known ground truth.
Please note that this can take a few days and over 100GB of hard drive space to run fully. You can modify the benchmarkingParameters.tsv file to reduce the number of tests to perform to test run the phasing toolkit faster and more efficiently. We provide a very complete benchmarking test here as an example to show the range of available options.
python3.8 downloadData.py benchmarkingParameters.tsv
python3.8 processReads.py benchmarkingParameters.tsv groundTruth
python3.8 hybridGenerator.py benchmarkingParameters.tsv
python3.8 processReads.py benchmarkingParameters.tsv virtualHybrids
python3.8 phaseToolRunner.py benchmarkingParameters.tsv flopp nphase whatshap-polyphase
python3.8 accuracyCalculator.py benchmarkingParameters.tsv flopp nphase whatshap-polyphase
benchmarkingParameters.tsv
This file contains all of the information needed to share a full benchmark with anyone else. Here we describe it line by line:
Output folder which will be used to store all of the data we're going to download and generate
mainPath /path/to/outputFolder/
Names of the strains we will phase and the accessions of their long reads, format is strainName:SRA_Accession
longReads ACA:ERR4352153 BMB:ERR4352154 CCN:ERR4352155 CRL:ERR4352156
Define the type of long read (ont or pacbio)
longReadMethod ont
Names of the strains we will phase and the accessions of their long reads, format is strainName:SRA_Accession
. The names must be the same as for the long reads.
shortReads ACA:ERR1309429 BMB:ERR1308675 CCN:ERR1308732 CRL:ERR1308952
Coverage levels to test (10 means 10X per haplotype, for a diploid that means a total of 20X, for a tetraploid it means a total of 40X)
coverages 20 10
Heterozygosity levels to test (0.5 means 0.5%, i.e. on average 1 heterosygous SNP every 2000 bp)
heterozygosityRates 0.05 0.1 0.5 1
Reference information, must include the taxID of the species according to the ncbi, the ground name according to the ncbi, and a name for the reference sequence (speciesName). More information can be obtained from https://github.com/kblin/ncbi-genome-download.
reference taxID:559292 group:fungi speciesName:sCerevisiae
Define the virtual polyploids to generate. ACA,BMB,CCN
means it should generate a virtual triploid with reads from the strains ACA, BMB and CCN.
strainLists BMB,CCN ACA,BMB,CCN ACA,BMB,CCN,CRL
Reference genome size, needed to calculate coverage
genomeSize 12500000
Number of threads to use when applicable
threads 8
If you're interested in helping with the development of this toolkit, please feel free to create issues on this page or email me at [email protected] to discuss ways in which we can improve this project and make it more flexible and useful for the community. I'm very interested in extending this to (in no particular order):
- Improve the flexibility of tools provided to make the phasing toolkit more versatile (for example reusing the VCF subsetting method to make it easy for anyone to subset their own VCF)
- Include more default benchmarking datasets which push the limits of phasing methods in different ways (highly similar haplotypes, aneuploidy, LOH tracts, SVs...)
- Develop tools which analyze SVs and phasing results in order to generate accurate haplotig sequences from a reference sequence
- Develop tools to take phasing results and generate high-quality de novo phased sequences
- Host a simple website which displays the results of different tools on different datasets
- Test how different parameters influence accuracy results
- Provide more insight into which variants are well-phased
- Investigate the effect of read and mapping quality on results
- Provide tools to analyze and visualize the results of phased data
- Discuss and develop accuracy metrics
- Discuss and develop the ideal input and output formats of polyploid phasing algorithms