This repository provides the code and replication instructions for our paper BeDivFuzz: Integrating Behaviorial Diversity into Generator-based Fuzzing (ICSE'2022).
BeDivFuzz is implemented as an extension of JQF.
First, build the image:
docker build -t bedivfuzz .
Then, run the container (with the current directory mounted to /workspace
inside the container):
docker run -it --rm -v ${PWD}:/workspace bedivfuzz
To build BeDivFuzz and Zest run:
mvn package
RLCheck needs to be build separately:
cd RLCheck/jqf/
mvn package
cd ../..
We can now perform a test run of BeDivFuzz (e.g., on Rhino) as follows:
bin/jqf-bediv -c $(scripts/examples_classpath.sh) edu.berkeley.cs.jqf.examples.rhino.CompilerTest testWithSplitGenerator
After a while, you should see a status screen similar to this:
BeDivFuzz: Behavioral Diversity Fuzzing.
Test name: edu.berkeley.cs.jqf.examples.rhino.CompilerTest#testWithSplitGenerator
Results directory: /Users/lam/_projects/clustering-guided-fuzzing/code/BeDivFuzz/experiments/fuzz-results
Elapsed time: 30s (no time limit)
Number of executions: 4,577
Valid inputs: 3,607 (78.81%)
Cycles completed: 0
Unique failures: 1
Queue size: 153 (0 favored last cycle)
Current parent input: 14 (favored) {229/360 mutations}
Execution speed: 215/sec now | 150/sec overall
Valid coverage: 5,236 branches (7.99% of map)
Behavioral Diversity: (B(0): 5367 | B(1): 3001 | B(2): 2563)
Unique valid inputs: 1,588 (34.70%)
Unique valid paths: 3,607
Structure-changing mutations (exploration): 216,776
Overall exploration score: 0.295
Structure-preserving mutations (exploitation): 611,515
Overall exploitation score: 0.990
Since this process runs without a timeout, you have to manually abort it with Ctrl+C
.
The evaluation script can be executed as follows:
scripts/run_parallel_experiments.sh -o out_dir -t timeout -n repetitions -p parallel_workers [-r]
out_dir
is the folder where the results should be savedtimeout
is the timeout (in seconds) per trialrepetitions
is the number of repetitions to performparallel_workers
is the number of parallel trials to perform and must be a factor ofrepetitions
to evenly distribute the workload (e.g., with repetitions=30 and parallel_workers=10, each instance will perform 3 repetitions)- The
-r
flag enables coverage replay and is required to collect coverage data for RLCheck and QuickCheck
In our original evaluation, we first performed experiments with 1 hour timeout and 30 repetitions to answer RQ1 (input diversity) and RQ2 (behavioral diversity). The command for this setup (with 15 parallel instances) is:
scripts/run_parallel_experiments.sh -o coverage-results -t 3600 -n 30 -p 15 -r
To answer RQ3 (fault finding capabilities), we extended the timeout to 24 hours, but did not measure any coverage (i.e., no -r
flag):
scripts/run_parallel_experiments.sh -o crash-results -t 86400 -n 30 -p 15
For this step, we assume that the results are stored under coverage-results
(RQ1/RQ2) and crash-results
(RQ3).
To generate the plots for Figure 3 (diverse valid inputs) and Figure 4 (behavioral diversity), use:
python3 scripts/gen_figures.py coverage-results
The plots will be produced in the subdirectory coverage-results/figs
.
To generate the crash table (Table 1) from crash-results
, the following command can be used:
python3 scripts/gen_crash_table.py crash-results
The table will be printed on the terminal, but also saved as crash-results/crash_table.txt
.