ABySS Blackbox Optimization

Nivretta Thatra October 15, 2016

Abstract

When tackling optimization of parameters, the process is manual and tedious: submitting jobs to a scheduler, rerunning failed jobs, inspecting outputs, tweaking parameters, and repeating. In genome sequence assembly, for example, there are a variety of parameters related to expected coverage of the reads and heuristics to remove read errors and collapse heterozygous variation.
BlackBox parameter optimization tools do exist, but their usibility and speed need to be compared & evaluated.
Approaches of optimimation tools:
- Exploitation Approach: Start halfway and try a point to the left and to the right, find next best, iterate
- Exploration Approach: Try a bunch of random ranges
- Grid Search Approach
Here we evaluate 3 optimization tools - Opal, SpearMint, and ParOpt - on a dataset of a human bacterial artificial chromosome (BAC), using the assembler ABySS.
We find that

Goals/Methods

Evaluate 3 parameter optimization tools for usability and speed

Divvy up 3 tools for testing
Download ABySS
Access dataset from ORCA computing machine
Backend
- What types of input parameters (discrete with large/small ranges, continuous, binary)
- Make it portable to other commandline tools so optimizer can be told how to launch the tool
Results output
- Generate target metrics vs parameters plot
- Generate Pareto frontier of the target metric and a second metric of interest (contiguity and correctness) likely in R using ggplot
- Generate a report of the results of the optimization
Write a short report of our experience
Post on GitHub pages
Possibly submit to a preprint server (bioRxiv, PeerJ, Figshare)
Possibly submit for peer review, such as F1000Research Hackathons

Dataset(s) and Optimizers

Dataset
a human bacterial artificial chromosome (BAC), using assembler ABySS
Metrics
The key metrics are contiguity (a.) and correctness (b. through d.).
1. contiguity (NG50, N50) and aligned contiguity (NGA50, NA50)
2. number of breakpoints when aligned to the reference as a proxy for misassemblies
3. number of mismatched nucleotides when aligned to the reference, Q = -10*log(mismatches / total_aligned)
4. completeness, number of reference bases covered by aligned contigs divided by number of reference bases
We'll be optimizing the NG50 metric (or NGA50 with a reference genome) and reporting (but probably not optimizing) the correctness metrics.
The primary parameter we'll be optimizing is k (a fundamental parameter of nearly all de Bruijn graph assemblers), and there's a bunch other parameters that we can play with (typically thresholds related to expected coverage).
Optimizers being evaluated
- OPAL by @dpo — Optimization of algorithms with OPAL
- 'nondifferentiable optimization tools'
- 'runs on most platforms supporting the Python language and possessing a C++ compiler'
- 'an optimization method supported by a strong convergence theory, yielding solutions that are local minimizers in a meaningful sense'
- SpearMint by @mgelbart — Predictive Entropy Search for Multi-objective Bayesian Optimization
- Bayesian method for identifying the Pareto set of multi-objective optimization problems, when the functions are expensive to evaluate
- a sum of objective specific acquisition functions, which enables the algorithm to be used in decoupled scenarios in which the objectives can be evaluated separately and perhaps with different costs
- ParOpt by @sseemayer
- scipy.optimize
- Nelder-Mead
- Possibly R packages, a long list
- Possibly Python packages, like scikit-optimize

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Abyss-Blackbox-optimization.md

Abyss-Blackbox-optimization.md

ABySS Blackbox Optimization

Abstract

Goals/Methods

Evaluate 3 parameter optimization tools for usability and speed

Dataset(s) and Optimizers

Results

Files

Abyss-Blackbox-optimization.md

Latest commit

History

Abyss-Blackbox-optimization.md

File metadata and controls

ABySS Blackbox Optimization

Abstract

Goals/Methods

Evaluate 3 parameter optimization tools for usability and speed

Dataset(s) and Optimizers

Results