AFL filename formats #11

Merged
merged 9 commits into from
Oct 1, 2020

cponcelets
Contributor

Goal:
Fix filename formats between SAVIOR and AFL.

Issue:

AFL can use two kinds of filename formats:

  • a simple one consisting of the string "id_" and six digits (id_[0-9]{6}),
  • and the standard one starting with the string "id:" and six digits (id:[0-9]{6}), followed by AFL information.

These two formats are exclusive, i.e. choosing the simple one with the flag SIMPLE_FILES will prevent AFL from identifying files following the standard format (id:[0-9]{6}).
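
As a rough illustration of the two formats described above, the sketch below classifies a queue filename with two regular expressions. The helper name classify and both patterns are assumptions written for this example from the description in this PR; they are not taken from the AFL or SAVIOR sources.

```python
import re

# Patterns written from the description above (assumptions, not AFL source).
SIMPLE_RE = re.compile(r"^id_([0-9]{6})$")          # e.g. id_000001
STANDARD_RE = re.compile(r"^id:([0-9]{6})(,.*)?$")  # e.g. id:000001,src:000000,+cov


def classify(name):
    """Return (format, numeric id), where format is "simple", "standard" or None."""
    m = SIMPLE_RE.match(name)
    if m:
        return "simple", int(m.group(1))
    m = STANDARD_RE.match(name)
    if m:
        return "standard", int(m.group(1))
    return None, None
```

A name can match only one of the two patterns, because the seventh character is either "_" or ":".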

However, SAVIOR:

  • compiles AFL with the flag SIMPLE_FILES,
  • creates files in SAVIOR queues using the standard format.

As a consequence AFL does not read SAVIOR testcases because of a format mismatch.


A first solution has been committed (commit:e2c18d9bf) that removes the SIMPLE_FILES flag.
However, there is still a mix of simple and standard formats in SAVIOR.

For example:

  1. SAVIOR extends AFL to output statistics into the files coverage.csv and edge_sanitizer.csv.
    These files use the simple format to specify testcases.
  2. Whenever an edge oracle reads AFL queues, standard filename formats are imported.

Problems:

Solution: Use only the simple filename format inside SAVIOR.

The idea is to:

  1. convert the AFL filename into simple format whenever SAVIOR imports AFL testcases.
  2. match the standard filename back whenever SAVIOR uses or accesses a file.

This way, a testcase has a unique internal name in SAVIOR (the simple-formatted one).
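
The two conversion steps above could look roughly like the following. This is only a sketch under stated assumptions: the function names to_simple and find_standard are hypothetical and are not the names used in the PR, and the lookup assumes that at most one queue file starts with a given id:<num> prefix.

```python
import glob
import os
import re

STANDARD_PREFIX = re.compile(r"^id:([0-9]{6})")


def to_simple(name):
    """Map 'id:000002,src:...' to 'id_000002'; other names are returned unchanged."""
    m = STANDARD_PREFIX.match(name)
    return "id_" + m.group(1) if m else name


def find_standard(queue_dir, simple_name):
    """Return the path of the standard-format file behind a simple name, or ''."""
    num = simple_name[len("id_"):]
    # glob.escape protects special characters in the directory part only.
    pattern = os.path.join(glob.escape(queue_dir), "id:" + num + "*")
    matches = glob.glob(pattern)
    return matches[0] if matches else ""
```

Returning an empty string when nothing matches lets the caller decide how to handle a file that no longer exists.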


Example:

Before:

# ls output_folder/master/queue/
id_000000  id_000001
# ls output_folder/slave_000001/queue/
id_000000  id_000001

# ls output_folder/klee_instance_conc_000001/queue/
id:000001  id:000002  id:000003

# od output_folder/master/.synced/klee_instance_conc_000001
0000000 000000 000000
0000004

Comments:

  • id_000000 and id_000001 do not pass the check, and nothing new has been imported from savior.
  • You can check the latter statement by looking at the .synced value, which stores the id of the next testcase to check, with od (which dumps the file in octal). The first data word (000000, right after the 0000000 offset) shows that none of the testcases have been checked.

After:

  • Running the same example with the fixes and with SIMPLE_FILES removed outputs:
# ls output_folder/master/queue/
id:000000,orig:seed1.txt                   id:000002,sync:klee_instance_conc_000001,src:000002,+cov
id:000001,src:000000,op:havoc,rep:16,+cov  id:000003,src:000002,op:int32,pos:8,val:-2147483648,+cov

# ls output_folder/klee_instance_conc_000001/queue/
id:000001  id:000002  id:000003

# od output_folder/master/.synced/klee_instance_conc_000001
0000000 000004 000000
0000004

Comment:

  • AFL imported the second testcase of klee.
  • The .synced value is 4, so all the testcases have been checked.
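
The od dumps above show the file as 16-bit octal words. As a minimal sketch, assuming the .synced file holds a single 4-byte unsigned integer in little-endian byte order (consistent with the 4-byte length and word order in the dumps), the value can also be decoded directly. The helper name read_synced is hypothetical.

```python
import struct


def read_synced(path):
    """Decode a .synced file as one little-endian unsigned 32-bit integer."""
    with open(path, "rb") as f:
        data = f.read(4)
    if len(data) != 4:
        raise ValueError("expected 4 bytes, got %d" % len(data))
    return struct.unpack("<I", data)[0]
```

For the "After" dump, the bytes 04 00 00 00 decode to 4 under this assumption.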

Just to be sure and check the AFL testcases:

# ./savior-example < output_folder/master/queue/id:000000,orig:seed1.txt
# ./savior-example < output_folder/master/queue/id:000001,src:000000,op:havoc,rep:32,+cov
# ./savior-example < output_folder/master/queue/id:000002,sync:klee_instance_conc_000001,src:000002,+cov
Magic number passed


  • Please check if I haven't missed a conversion.
  • I noticed a bug with the random oracle while testing the changes. The crash also happens before this PR.

@cponcelets
Contributor Author

Note that only the first two commits directly concern the PR.
I added the example and a release flag to build llvm faster (in case you are interested).

@evanmak
Owner

evanmak commented Sep 28, 2020

Thanks for the PR @cponcelets! I will take a closer look at the code sometime soon, sorry about that as I am quite busy these days.
Meanwhile, can you please provide a bit more context on why you would like to convert the file name format? Is using the simplified name across AFL, KLEE and coordinator making fuzzing more difficult? While it is less informative, it makes the implementation much easier and less room for (inconsistency) bugs.

A bit more context on why I used the simplified format: KLEE needs to convert the synthesized input back to something AFL can recognize, and it does not have the mutation context for this seed, so I chose the simplified format in the first place.

BTW, I am not opinionated towards either direction though, just curious about what you think the pros and cons are : )

@cponcelets
Contributor Author

A) Meanwhile, can you please provide a bit more context on why you would like to convert the file name format?

Sure, let me be clearer. As you said, there are three modules within SAVIOR: the fuzzer (AFL), the coordinator and the concolic engine (klee). There are also several data flows between these modules:

  1. AFL -> Coordinator (reading afl queues)
  2. AFL -> Coordinator (coverage/score statistics)
  3. Coordinator -> Klee (running a concolic execution)
  4. Klee -> AFL (outputting back new testcases)

As shown in the example, the current version is using:

  • the simple format for 1. 2. 3.
  • the standard format for 4. (because klee-converter prints testcases starting with id: which is the standard format).

In order for AFL to understand klee outputs:

  • either 4. has to print simple formats,
  • or 1. 2. 3. have to deal with the standard format.

B) Is using the simplified name across AFL, KLEE and coordinator making fuzzing more difficult?
While it is less informative, it makes the implementation much easier and less room for (inconsistency) bugs.

True, the problem here is that I cannot change klee converter outputs. It should also work if klee outputs id_<num> format files.
Besides, I think the standard format is more understandable for users, but it is a personal preference.

C) ... as KLEE needs to convert the synthesized input back to something AFL can recognize, it does not have the mutation context on this seed, thus I chose the simplified format in the first place.

Yes, but this is not a problem: a name consisting only of id:000000 works. You can also follow the qsym way, which only adds a src to keep track of the seed a testcase has been generated from.

The problem is the first_seen values in AFL. You cannot easily retrieve filenames from ids as it is currently implemented in AFL. This is the reason why I kept the simple file format inside the coordinator, which requires the format conversions.

To give you an overview after the PR, the modules use:

  • the simple format for 2. (answer C)
  • the standard format for 1. 3. 4. (answers A and B)

Correct me if I am wrong, but the coordinator uses the filename as a testcase id in its seed lists. In order to map the scores/first_seen values to a file, it needs to unify the names coming from A. and B. It is thus necessary to convert standard names into the simple format and avoid confusion between, for example, id:000001:orig and id_000001, which both point to the same file.

@evanmak
Owner

evanmak commented Sep 30, 2020

Thanks for the detailed clarification. Indeed, KLEE outputting the standard format will prevent AFL from benefiting from the generated seeds. I was scratching my head to recall what happened, because when we experimented before we saw AFL import KLEE's generated inputs. @junxzm1990 please hold me accountable here.

Now we have two options here: make the rest of the modules understand the standard format, which gives us more insightful names, maybe for further analysis, or ask KLEE to output simplified names to keep the change minimal.

I am happy to accept the PR btw, but would like to flag it to @DanielGuoVT for awareness as he is working to open source KLEE, and AFAIK there is another version of savior in Baidu's internal repo.

Owner

@evanmak evanmak left a comment

can we remove the binary files incl. *.o and *.o.bc?

Owner

@evanmak evanmak left a comment

The problem is the first_seen values in AFL. You cannot easily retrieve filenames from ids as it is currently implemented in AFL. This is the reason why I kept the simple file format inside the coordinator, which requires the format conversions.

It's been a few years since I worked on this code. Can you elaborate a bit more on why we can't keep the naming scheme consistent (i.e., use the standard naming across all modules)? The first_seen values in AFL can be modified here:

fprintf(f, "%u\t%s/queue/id_%06u\n", i, out_dir, edge_san_first_seen[i]);

Would it help make things cleaner?

    if not tmp:
        return ""
    else:
        return tmp[0]
Owner

Why is there a case where multiple entries are returned by glob given a unique name?

Owner

ah never mind, glob.glob returns a list,
can we add an assert here to ensure the list len is 1?

Contributor Author

I think an assert is too strong since a file can be removed by AFL on the fly.
From time to time AFL calls a routine to polish the queue (a cmin-like function, if you want); this is briefly mentioned here as a part of the afl-fuzz algorithm.
Unfortunately, the assertion may fire if the file SAVIOR wants to read has been removed by AFL.
I preferred the way you chose here, which simply continues if a problem occurs.
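
To make the trade-off concrete, here is a hedged sketch of the "skip instead of assert" behaviour. The function name read_testcases and its arguments are hypothetical and are not code from this PR; the sketch assumes a queue file may disappear either before the lookup or between the lookup and the open.

```python
import glob
import os


def read_testcases(queue_dir, simple_names):
    """Yield (simple_name, content) for every testcase that can still be read."""
    for name in simple_names:
        num = name[len("id_"):]
        pattern = os.path.join(glob.escape(queue_dir), "id:" + num + "*")
        matches = glob.glob(pattern)
        if not matches:
            continue  # file is not there (any more): skip instead of asserting
        try:
            with open(matches[0], "rb") as f:
                yield name, f.read()
        except FileNotFoundError:
            continue  # removed between glob() and open(): also skip
```

With an assert on the length of the glob result, the first branch would abort the coordinator instead of moving on to the next testcase.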

Contributor Author

The problem is the first_seen values in AFL. You cannot easily retrieve filenames from ids as it is currently implemented in AFL. This is the reason why I kept the simple file format inside the coordinator, which requires the format conversions.

It's been a few years since I worked on this code. Can you elaborate a bit more on why we can't keep the naming scheme consistent (i.e., use the standard naming across all modules)? The first_seen values in AFL can be modified here:

fprintf(f, "%u\t%s/queue/id_%06u\n", i, out_dir, edge_san_first_seen[i]);

Would it help make things cleaner?

The problem here is the use of edge_san_first_seen[i] storing only the id of the first testcase covering a branch. I have not seen a simple way to print back the full filename in the standard format. A solution would be to store the full name but it does not sound like a simpler way.

Owner

Hmmm, I got your point. Reading the code again, we use the seed names in the input_id_map for the SE converter, so it needs to be a full match.

Thanks for the discussion btw. My concern was that keeping a mixed scheme will make the code logic more convoluted. Being able to modify KLEE seems like a more straightforward approach, but we don't have the source.

@DanielGuoVT maybe you could consider releasing another KLEE version before fully open-sourcing it. But until then we can use the solution in this PR.

@cponcelets
Contributor Author

cponcelets commented Sep 30, 2020 via email

@evanmak evanmak merged commit a7b4810 into evanmak:master Oct 1, 2020