# Performance of Silicon on long-running examples (e.g. Nagini)
The benchmarks were performed because of Silicon's poor performance on examples generated by Nagini. They compare several versions of Silicon that try different optimisations/heuristics, in different combinations and with different settings, e.g. timeout values.
- Test files were taken from the Viper test suite, from examples provided by Marco (Nagini, SCION) and from examples provided by Vytautas (Chalice obligations, Rust)
- Only tests for which the current Silicon configuration (i.e. the baseline configuration) needs at least one second to verify were considered; 213 files in total. The test files were selected based on a single run of Silicon on my office computer; the benchmarks were afterwards performed on a different computer (Vytautas prepared a "benchmarking laptop" for me).
- Each file was verified five times and the average runtime was recorded (see the sketch after this list)
- Silicon was run in sequential mode, i.e. without any kind of internal parallelisation
- All Silicon versions were run via Nailgun
- All benchmarks were performed using the Viper Runner tool
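
For concreteness, the per-file measurement can be pictured as follows. This is a minimal sketch, not the actual Viper Runner code; `verify` is a hypothetical stand-in for one Silicon run via Nailgun.

```scala
// Minimal measurement sketch: verify each file `runs` times and record
// the average wall-clock runtime in seconds. `verify` is hypothetical.
def benchmark(files: Seq[String], runs: Int = 5)(verify: String => Unit): Map[String, Double] =
  files.map { file =>
    val times = Seq.fill(runs) {
      val start = System.nanoTime()
      verify(file)
      (System.nanoTime() - start) / 1e9
    }
    file -> times.sum / runs
  }.toMap
```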
The overall benchmark consists of three groups:
In group A I experimented with the path feasibility checks that Silicon makes. That is, before branching over a condition `b`, Silicon queries Z3 to find out whether `b` is already known to be true or false; if so, the corresponding infeasible path isn't explored. Group A contains the following five Silicon configurations (in the terminology of the Viper Runner tool):
| Configuration | Description |
| ------------- | ----------- |
| `base_cT250` | The baseline configuration of Silicon. Path feasibility checks are always made, with a timeout of 250ms. |
| `base_cT10` | Baseline configuration, but with a path feasibility check timeout of 10ms |
| `skip_1` | Don't check path feasibility if the branching is due to short-circuiting evaluation of conjunctions (`b1 && b2` is treated as if it were `b1 && (b1 ==> b2)`) |
| `skip_2` | Don't check path feasibility if the condition contains a quantified variable (affects, e.g., `forall x :: x in xs ==> ...`) |
| `skip_3` | Combination of `skip_1` and `skip_2` |
Note: I observed early on that reducing the path feasibility check timeout to 10ms (`base_cT10`) always performs better than a timeout of 250ms (`base_cT250`). Thus, all non-base configurations, e.g. `skip_1/2/3` and those of groups B and C, had a path feasibility check timeout of 10ms.
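
To illustrate what the skip heuristics change, here is a rough sketch of branching with path feasibility checks. The expression types and the `Decider` interface are hypothetical simplifications, not Silicon's actual internals.

```scala
// Hypothetical, simplified expression type and Z3 interface.
sealed trait Exp
case class Var(name: String, isQuantified: Boolean = false) extends Exp
case class Implies(lhs: Exp, rhs: Exp) extends Exp

trait Decider {
  /** Asks Z3, within `timeoutMs`, whether `e` is unsatisfiable. */
  def isDefinitelyFalse(e: Exp, timeoutMs: Int): Boolean
}

/** Decides whether to explore the branch guarded by `cond`.
  * skip_1: omit the check if the branch stems from short-circuiting
  *         evaluation of a conjunction (b1 && b2 ~> b1 && (b1 ==> b2)).
  * skip_2: omit the check if `cond` mentions a quantified variable.
  * If the check is skipped, the branch is conservatively explored. */
def exploreBranch(cond: Exp,
                  fromConjunction: Boolean,
                  decider: Decider,
                  timeoutMs: Int = 10): Boolean = {
  def hasQuantifiedVar(e: Exp): Boolean = e match {
    case Var(_, q) => q
    case Implies(l, r) => hasQuantifiedVar(l) || hasQuantifiedVar(r)
  }
  if (fromConjunction || hasQuantifiedVar(cond)) true
  else !decider.isDefinitelyFalse(cond, timeoutMs)
}
```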
In group B I experimented with performing additional Z3 check-sats in order to saturate Z3's state, i.e. to get Z3 to learn additional lemmas. The underlying idea is that Z3 would learn lemmas early, e.g. because they follow from domain definitions and are independent of any Viper state, which would avoid situations in which Z3 relearns lemmas again and again in subsequent push-pop scopes. Group B contains the following 10 Silicon configurations:
| Configuration | Description |
| ------------- | ----------- |
| `base_cT250` | As before |
| `base_cT10` | As before |
| `pre500` | Perform a saturating check-sat after the program preamble (Silicon's internal preamble plus the domain definitions from the program to verify) has been emitted to Z3. The check-sat is made with a timeout of 500ms. |
| `pre250` | Analogous, but with a timeout of 250ms |
| `pre100` | Analogous, but with a timeout of 100ms |
| `pre10` | Analogous, but with a timeout of 10ms |
| `many500` | Perform saturating check-sats at the following places: (1) after the preamble, (2) after inhaling a contract, (3) after an unfold, (4) after an inhale, (5) before Silicon repeatedly queries Z3 in an iteration, e.g. as done in the QP consume algorithm or as part of a state consolidation. Each check is made with a different timeout: the base timeout of 500ms is divided by the reducers 1, 2, 2.5, 5 and 50, respectively. For example, with a base timeout of 500ms, the after-preamble check (reducer 1) has an effective timeout of 500ms, and the after-unfold check (reducer 2.5) one of 200ms. A minimal effective timeout of 10ms is enforced. |
| `many250` | Analogous, but with a base timeout of 250ms |
| `many100` | Analogous, but with a base timeout of 100ms |
| `many10` | Analogous, but with a base timeout of 10ms |
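
As a worked example of the `many*` timeout scheme described above (a sketch; the saturation points and reducers are those listed in the table, the code itself is illustrative):

```scala
// Saturation points and their timeout reducers, as listed above.
val reducers: Seq[(String, Double)] = Seq(
  "after preamble"             -> 1.0,
  "after inhaling a contract"  -> 2.0,
  "after unfold"               -> 2.5,
  "after inhale"               -> 5.0,
  "before repeated Z3 queries" -> 50.0
)

// Effective timeout = base timeout / reducer, floored at 10ms.
def effectiveTimeout(baseMs: Int, reducer: Double): Int =
  math.max(10, math.round(baseMs / reducer).toInt)

// For many500 this yields 500, 250, 200, 100 and 10ms; for many10 every
// saturation check runs with the minimal timeout of 10ms.
for ((point, r) <- reducers)
  println(f"$point%-27s ${effectiveTimeout(500, r)}%4d ms")
```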
In group C I combined the best configurations from groups A and B. Group C thus represents the final run of the benchmark and its results determine which configuration will be the next default configuration of Silicon. Group C contains the following 9 Silicon configurations:
| Configuration | Description |
| ------------- | ----------- |
| `base_cT250` | As before |
| `base_cT10` | As before |
| `skip3` | As before |
| `pre10` | As before |
| `many100` | As before |
| `many10` | As before |
| `skip3+pre10` | Combination of `skip3` and `pre10` |
| `skip3+many100` | Combination of `skip3` and `many100` |
| `skip3+many10` | Combination of `skip3` and `many10` |
The results of group A can be found in this Excel file.
The first block of data in the Excel file is the average runtime per Silicon configuration and test file, ordered by the runtime of the baseline configuration.
The second block of data, columns `Min [s]` and `Config`, lists the minimum verification time per test file and the Silicon configuration that yielded the minimum, respectively.
The third block of data compares "new" configurations (e.g. optimised ones) to "old" configurations (typically the baseline configuration). Each comparison consists of two columns: the absolute difference (e.g. `cT10-cT250 [s]`) and the relative difference (e.g. `cT10/cT250 [%diff]`). Negative values (green) mean that the new configuration is faster, positive (red) ones mean the opposite.
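
The two comparison columns can be recomputed as follows. A sketch; the `%diff` convention, relative change of the new runtime with respect to the old one, is an assumption that matches "negative means faster".

```scala
// Absolute difference, e.g. the column cT10-cT250 [s].
def absDiff(newSecs: Double, oldSecs: Double): Double = newSecs - oldSecs

// Relative difference in percent, e.g. the column cT10/cT250 [%diff].
// Assumed convention: (new - old) / old * 100, so negative means faster.
def relDiff(newSecs: Double, oldSecs: Double): Double =
  (newSecs - oldSecs) / oldSecs * 100

// E.g. absDiff(1.5, 2.0) == -0.5 and relDiff(1.5, 2.0) == -25.0.
```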
The Excel file also contains a strip of summaries below the main data (scroll down to row 213) whose structure is analogous to the previously described structure, i.e. it is also divided into three blocks.
The first block comprises, per configuration, the sum, min, max, median and average of the recorded runtimes.
The second block lists how often (column `Wins [#]`) a particular configuration (column `Config`) was the fastest configuration (i.e. the sum of all wins equals the number of test files, modulo tests that timed out).
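
Counting wins amounts to taking, per test file, the configuration with the minimal average runtime. A sketch, assuming `runtimes` maps each test file to its per-configuration average runtimes and that timed-out runs are simply absent:

```scala
// Per test file, the configuration with the minimal average runtime
// scores one win; files whose runs timed out are assumed absent, which
// is why the win totals may sum to fewer than the 213 test files.
def wins(runtimes: Map[String, Map[String, Double]]): Map[String, Int] =
  runtimes.values
    .map(perConfig => perConfig.minBy(_._2)._1) // winning configuration
    .groupBy(identity)
    .map { case (config, ws) => config -> ws.size }
```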
The third block is similar to the first, but computes sum, min, etc. for the absolute and relative differences between configurations.
The results of group B can be found in this Excel file. The file has the same structure as that of group A.
The results of group C can be found in this Excel file. The file has the same structure as that of group A, except that the block that lists which configuration was fastest for how many files has been extended to provide the overview necessary to finally decide which configuration should become the next default Silicon configuration.
The block has been extended as follows: in addition to the total number of wins per configuration (column `Total [#]`), it also contains wins per duration bin, e.g. how often a configuration was fastest for files that verified in between two and five seconds (column `Wins 2-5s [#]`).
Moreover, the block also contains totals, averages, maxima, etc. per configuration across all test files.
The main observations are:

- Any new configuration is better than the current Silicon configuration (`cT250`)
- `skip3+many100` is overall the fastest configuration: 610s (`cT250`: 850s)
- `skip3` wins for "fast" tests (up to five seconds)
- `skip3+many100` and `skip3+many10` win for "slow" tests (of which we don't have that many, though)
- The lead of `skip3` over `skip3+many100`/`skip3+many10` is always negligible and less than 500ms, whereas the lead in the opposite direction can be very substantial
In conclusion, it seems that

- `skip3+many100` should be implemented as the next default Silicon configuration
- the default path feasibility check timeout should be 10ms
- it is worthwhile to allow users (e.g. via command-line options) to adjust the timeouts used for the various Z3 saturation calls
- it is not necessary to allow users to select when to skip path feasibility checks, i.e. to choose between `skip_1/2/3`