# Performance of Silicon on long-running examples (e.g. Nagini)
The benchmarks were performed because of Silicon's poor performance on examples generated by Nagini. They compare several versions of Silicon that try different optimisations/heuristics, in different combinations and with different settings, e.g. timeout values.
- Test files were taken from the Viper test suite, from examples provided by Marco (Nagini, SCION) and from examples provided by Vytautas (Chalice obligations, Rust)
- Only tests for which the current Silicon configuration (i.e. the baseline configuration) needs at least one second to verify were considered; 213 files in total. The test files were selected based on a single run of Silicon on my office computer; the benchmarks were afterwards performed on a different computer (Vytautas prepared a "benchmarking laptop" for me).
- Each file was verified five times and the average runtime was recorded (see the sketch after this list)
- Silicon was run in sequential mode, i.e. without any kind of internal parallelisation
- All Silicon versions were run via Nailgun
- All benchmarks were performed using the Viper Runner tool
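
For concreteness, the per-file measurement can be pictured as follows. This is a minimal sketch, not the actual Viper Runner code; `verify` is a hypothetical stand-in for one Silicon run via Nailgun.

```scala
// Minimal measurement sketch: verify each file `runs` times and record
// the average wall-clock runtime in seconds. `verify` is hypothetical.
def benchmark(files: Seq[String], runs: Int = 5)(verify: String => Unit): Map[String, Double] =
  files.map { file =>
    val times = Seq.fill(runs) {
      val start = System.nanoTime()
      verify(file)
      (System.nanoTime() - start) / 1e9
    }
    file -> times.sum / runs
  }.toMap
```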
The overall benchmark consists of three groups:
In group A I experimented with the path feasibility checks that Silicon makes. That is, before branching over a condition `b`, Silicon queries Z3 to find out whether `b` is already known to be true or false; if so, the corresponding infeasible path isn't explored. Group A contains the following five Silicon configurations (in the terminology of the Viper Runner tool):
| Configuration | Description |
| ------------- | ----------- |
| `base_cT250` | The baseline configuration of Silicon. Path feasibility checks are always made, with a timeout of 250ms. |
| `base_cT10` | Baseline configuration, but with a path feasibility check timeout of 10ms |
| `skip_1` | Don't check path feasibility if the branching is due to short-circuiting evaluation of conjunctions (`b1 && b2` is treated as if it were `b1 && (b1 ==> b2)`) |
| `skip_2` | Don't check path feasibility if the condition contains a quantified variable (affects, e.g., `forall x :: x in xs ==> ...`) |
| `skip_3` | Combination of `skip_1` and `skip_2` |
Note: I observed early on that reducing the path feasibility check timeout to 10ms (`base_cT10`) always performs better than a timeout of 250ms (`base_cT250`). Thus, all non-base configurations, e.g. `skip_1/2/3` and those of groups B and C, had a path feasibility check timeout of 10ms.
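
To illustrate what the skip heuristics change, here is a rough sketch of branching with path feasibility checks. The expression types and the `Decider` interface are hypothetical simplifications, not Silicon's actual internals.

```scala
// Hypothetical, simplified expression type and Z3 interface.
sealed trait Exp
case class Var(name: String, isQuantified: Boolean = false) extends Exp
case class Implies(lhs: Exp, rhs: Exp) extends Exp

trait Decider {
  /** Asks Z3, within `timeoutMs`, whether `e` is unsatisfiable. */
  def isDefinitelyFalse(e: Exp, timeoutMs: Int): Boolean
}

/** Decides whether to explore the branch guarded by `cond`.
  * skip_1: omit the check if the branch stems from short-circuiting
  *         evaluation of a conjunction (b1 && b2 ~> b1 && (b1 ==> b2)).
  * skip_2: omit the check if `cond` mentions a quantified variable.
  * If the check is skipped, the branch is conservatively explored. */
def exploreBranch(cond: Exp,
                  fromConjunction: Boolean,
                  decider: Decider,
                  timeoutMs: Int = 10): Boolean = {
  def hasQuantifiedVar(e: Exp): Boolean = e match {
    case Var(_, q) => q
    case Implies(l, r) => hasQuantifiedVar(l) || hasQuantifiedVar(r)
  }
  if (fromConjunction || hasQuantifiedVar(cond)) true
  else !decider.isDefinitelyFalse(cond, timeoutMs)
}
```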
In group B I experimented with performing additional Z3 check-sats in order to saturate Z3's state, i.e. to get Z3 to learn additional lemmas. The underlying idea is that Z3 would learn lemmas early, e.g. because they follow from domain definitions and are independent of any Viper state, which would avoid situations in which Z3 relearns lemmas again and again in subsequent push-pop scopes. Group B contains the following 10 Silicon configurations:
| Configuration | Description |
| ------------- | ----------- |
| `base_cT250` | As before |
| `base_cT10` | As before |
| `pre500` | Perform a saturating check-sat after the program preamble (Silicon's internal preamble plus the domain definitions from the program to verify) has been emitted to Z3. The check-sat is made with a timeout of 500ms. |
| `pre250` | Analogous, but with a timeout of 250ms |
| `pre100` | Analogous, but with a timeout of 100ms |
| `pre10` | Analogous, but with a timeout of 10ms |
| `many500` | Perform saturating check-sats at the following places: (1) after the preamble, (2) after inhaling a contract, (3) after an unfold, (4) after an inhale, (5) before Silicon repeatedly queries Z3 in an iteration, e.g. as done in the QP consume algorithm or as part of a state consolidation. Each check is made with a different timeout: the base timeout of 500ms is divided by the reducers 1, 2, 2.5, 5 and 50, respectively. For example, with a base timeout of 500ms, the after-preamble check (reducer 1) has an effective timeout of 500ms, and the after-unfold check (reducer 2.5) one of 200ms. A minimal effective timeout of 10ms is enforced. |
| `many250` | Analogous, but with a base timeout of 250ms |
| `many100` | Analogous, but with a base timeout of 100ms |
| `many10` | Analogous, but with a base timeout of 10ms |
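
As a worked example of the `many*` timeout scheme described above (a sketch; the saturation points and reducers are those listed in the table, the code itself is illustrative):

```scala
// Saturation points and their timeout reducers, as listed above.
val reducers: Seq[(String, Double)] = Seq(
  "after preamble"             -> 1.0,
  "after inhaling a contract"  -> 2.0,
  "after unfold"               -> 2.5,
  "after inhale"               -> 5.0,
  "before repeated Z3 queries" -> 50.0
)

// Effective timeout = base timeout / reducer, floored at 10ms.
def effectiveTimeout(baseMs: Int, reducer: Double): Int =
  math.max(10, math.round(baseMs / reducer).toInt)

// For many500 this yields 500, 250, 200, 100 and 10ms; for many10 every
// saturation check runs with the minimal timeout of 10ms.
for ((point, r) <- reducers)
  println(f"$point%-27s ${effectiveTimeout(500, r)}%4d ms")
```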
In group C I combined the best configurations from groups A and B. Group C thus represents the final run of the benchmark and its results determine which configuration will be the next default configuration of Silicon. Group C contains the following 9 Silicon configurations:
| Configuration | Description |
| ------------- | ----------- |
| `base_cT250` | As before |
| `base_cT10` | As before |
| `skip3` | As before |
| `pre10` | As before |
| `many100` | As before |
| `many10` | As before |
| `skip3+pre10` | Combination of `skip3` and `pre10` |
| `skip3+many100` | Combination of `skip3` and `many100` |
| `skip3+many10` | Combination of `skip3` and `many10` |
The results of group A can be found in this Excel file.
The first block of data in the Excel file is the average runtime per Silicon configuration and test file, ordered by the runtime of the baseline configuration.
The second block of data, columns `Min [s]` and `Config`, lists the minimum verification time per test file and the Silicon configuration that yielded the minimum, respectively.
The third block of data compares "new" configurations (e.g. optimised ones) to "old" configurations (typically the baseline configuration). Each comparison consists of two columns: the absolute difference (e.g. `cT10-cT250 [s]`) and the relative difference (e.g. `cT10/cT250 [%diff]`). Negative values (green) mean that the new configuration is faster, positive (red) ones mean the opposite.
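
The two comparison columns can be recomputed as follows. A sketch; the `%diff` convention, relative change of the new runtime with respect to the old one, is an assumption that matches "negative means faster".

```scala
// Absolute difference, e.g. the column cT10-cT250 [s].
def absDiff(newSecs: Double, oldSecs: Double): Double = newSecs - oldSecs

// Relative difference in percent, e.g. the column cT10/cT250 [%diff].
// Assumed convention: (new - old) / old * 100, so negative means faster.
def relDiff(newSecs: Double, oldSecs: Double): Double =
  (newSecs - oldSecs) / oldSecs * 100

// E.g. absDiff(1.5, 2.0) == -0.5 and relDiff(1.5, 2.0) == -25.0.
```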
The Excel file also contains a strip of summaries below the main data (scroll down to row 213) whose structure is analogous to the previously described structure, i.e. it is also divided into three blocks.
The first block comprises, per configuration, the sum, min, max, median and average of the recorded runtimes.
The second block lists how often (column `Wins [#]`) a particular configuration (column `Config`) was the fastest configuration (i.e. the sum of all wins equals the number of test files, modulo tests that timed out).
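
Counting wins amounts to taking, per test file, the configuration with the minimal average runtime. A sketch, assuming `runtimes` maps each test file to its per-configuration average runtimes and that timed-out runs are simply absent:

```scala
// Per test file, the configuration with the minimal average runtime
// scores one win; files whose runs timed out are assumed absent, which
// is why the win totals may sum to fewer than the 213 test files.
def wins(runtimes: Map[String, Map[String, Double]]): Map[String, Int] =
  runtimes.values
    .map(perConfig => perConfig.minBy(_._2)._1) // winning configuration
    .groupBy(identity)
    .map { case (config, ws) => config -> ws.size }
```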
The third block is similar to the first, but computes sum, min, etc. for the absolute and relative differences between configurations.
The results of group B can be found in this Excel file. The file has the same structure as that of group A.
The results of group C can be found in this Excel file. The file has the same structure as that of group A, except that the block that lists which configuration was fastest for how many files has been extended to provide the overview necessary to finally decide which configuration should become the next default Silicon configuration.
The block has been extended as follows: in addition to the total number of wins per configuration (column `Total [#]`), it also contains wins per duration bin, e.g. how often a configuration was fastest for files that verified in between two and five seconds (column `Wins 2-5s [#]`).
Moreover, the block also contains totals, averages, maxima, etc. per configuration across all test files.
The main observations are:

- Any new configuration is better than the current Silicon configuration (`cT250`)
- `skip3+many100` is overall the fastest configuration: 610s (`cT250`: 850s)
- `skip3` wins for "fast" tests (up to five seconds)
- `skip3+many100` and `skip3+many10` win for "slow" tests (of which we don't have that many, though)
- The lead of `skip3` over `skip3+many100`/`skip3+many10` is always negligible and less than 500ms, whereas the lead in the opposite direction can be very substantial
In conclusion, it seems that

- `skip3+many100` should be implemented as the next default Silicon configuration
- the default path feasibility check timeout should be 10ms
- it is worthwhile to allow users (e.g. via command-line options) to adjust the timeouts used for the various Z3 saturation calls
- it is not necessary to allow users to select when to skip path feasibility checks, i.e. to choose between `skip_1/2/3`