Nivretta Thatra October 15, 2016
-
When tackling optimization of parameters, the process is manual and tedious: submitting jobs to a scheduler, rerunning failed jobs, inspecting outputs, tweaking parameters, and repeating. In genome sequence assembly, for example, there are a variety of parameters related to expected coverage of the reads and heuristics to remove read errors and collapse heterozygous variation.
-
BlackBox parameter optimization tools do exist, but their usibility and speed need to be compared & evaluated.
-
Approaches of optimimation tools:
- Exploitation Approach: Start halfway and try a point to the left and to the right, find next best, iterate
- Exploration Approach: Try a bunch of random ranges
- Grid Search Approach
-
Here we evaluate 3 optimization tools - Opal, SpearMint, and ParOpt - on a dataset of a human bacterial artificial chromosome (BAC), using the assembler ABySS.
-
We find that
-
Divvy up 3 tools for testing
-
Download ABySS
-
Access dataset from ORCA computing machine
-
Backend
- What types of input parameters (discrete with large/small ranges, continuous, binary)
- Make it portable to other commandline tools so optimizer can be told how to launch the tool
-
Results output
- Generate target metrics vs parameters plot
- Generate Pareto frontier of the target metric and a second metric of interest (contiguity and correctness) likely in R using ggplot
- Generate a report of the results of the optimization
-
Write a short report of our experience
-
Post on GitHub pages
-
Possibly submit to a preprint server (bioRxiv, PeerJ, Figshare)
-
Possibly submit for peer review, such as F1000Research Hackathons
-
Dataset
-
a human bacterial artificial chromosome (BAC), using assembler ABySS
-
Metrics
-
The key metrics are contiguity (a.) and correctness (b. through d.).
- contiguity (NG50, N50) and aligned contiguity (NGA50, NA50)
- number of breakpoints when aligned to the reference as a proxy for misassemblies
- number of mismatched nucleotides when aligned to the reference, Q = -10*log(mismatches / total_aligned)
- completeness, number of reference bases covered by aligned contigs divided by number of reference bases
-
We'll be optimizing the NG50 metric (or NGA50 with a reference genome) and reporting (but probably not optimizing) the correctness metrics.
-
The primary parameter we'll be optimizing is k (a fundamental parameter of nearly all de Bruijn graph assemblers), and there's a bunch other parameters that we can play with (typically thresholds related to expected coverage).
-
Optimizers being evaluated
-
OPAL by @dpo — Optimization of algorithms with OPAL
-
'nondifferentiable optimization tools'
-
'runs on most platforms supporting the Python language and possessing a C++ compiler'
-
'an optimization method supported by a strong convergence theory, yielding solutions that are local minimizers in a meaningful sense'
-
SpearMint by @mgelbart — Predictive Entropy Search for Multi-objective Bayesian Optimization
-
Bayesian method for identifying the Pareto set of multi-objective optimization problems, when the functions are expensive to evaluate
-
a sum of objective specific acquisition functions, which enables the algorithm to be used in decoupled scenarios in which the objectives can be evaluated separately and perhaps with different costs
-
ParOpt by @sseemayer
-
Possibly R packages, a long list
-
Possibly Python packages, like scikit-optimize
-