
Future of fuzzing #9827

Closed
gmilescu opened this issue Dec 2, 2022 · 7 comments
Labels: C-housekeeping (Category: Refactoring, cleanups, code quality), Groomed, Node (Node team), Task



gmilescu commented Dec 2, 2022

Recently, we [have been seeing fuzzer crashes that are found but then somehow disappear](Near-One/nayduck#36). This is a bug that makes our fuzzing infra much less useful if it keeps happening.

We also [don’t support cargo-bolero yet](Near-One/nayduck#33), which makes it harder than necessary to add a new fuzz target.

This issue is for investigating the potential ways forward and choosing what to do. My personal preference leans towards using ClusterFuzz now and making nayduck interrupt its workers later.

The current situation is:

  • Nayduck runs on 25 VMs (~$2–3k/mo)
  • Nayduck actually uses these VMs only ~1 hour/day on average
  • During the rest of the time, our homegrown fuzzer runner runs fuzzing
  • Our homegrown fuzzer runner pauses and resumes fuzzing whenever a nayduck test wants to run
  • We only support cargo-fuzz fuzz targets, which makes adding a new fuzz target messier than it needs to be (sketched just below)
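
For context, a rough sketch of the difference (illustrative only: `parse_and_check` is a made-up stand-in, not an actual nearcore target). A cargo-fuzz target today looks roughly like this:

```rust
// cargo-fuzz style: lives in a separate fuzz/fuzz_targets/ crate and is run
// with `cargo fuzz run <target>` on a nightly toolchain.
#![no_main]
use libfuzzer_sys::fuzz_target;

// Made-up stand-in for the real code under test.
fn parse_and_check(data: &[u8]) {
    if data.first() == Some(&0x42) {
        // exercise some interesting code path
    }
}

fuzz_target!(|data: &[u8]| {
    parse_and_check(data);
});
```

The same check written as a cargo-bolero target would be an ordinary test, runnable as a unit test under `cargo test` and as a fuzzer under `cargo bolero test`:

```rust
// cargo-bolero style: a regular #[test] living next to the code it exercises.
#[test]
fn fuzz_parse_and_check() {
    bolero::check!().for_each(|data: &[u8]| {
        parse_and_check(data); // same made-up stand-in as above
    });
}
```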
The options I can see are:

  1. Keep the status quo
      • *Pro:* Least amount of work
      • *Con:* We keep losing some fuzzer artifacts, which means missing potentially-S0 issues
      • *Con:* We keep not supporting cargo-bolero fuzz targets (though we could implement support with some work)
      • *Con:* When we replace nayduck with something less in-house, we’ll have the fuzzer keeping us on the old infrastructure until we switch to ClusterFuzz
  2. Keep the status quo but fix the disappearing-reproducers issue
      • *Pro:* Least changes to infrastructure
      • *Pro/Con:* It is hard to estimate the amount of work needed to fix it; this could end up being either a pro or a con
      • *Con:* Supporting cargo-bolero fuzz targets is a ~2-week additional project on top of the fix
      • *Con:* When we replace nayduck with something less in-house, we’ll have the fuzzer keeping us on the old infrastructure until we switch to ClusterFuzz
  3. Rewrite the current fuzzer infra to be more resilient
      • *Pro:* We can keep using the same machines as nayduck
      • *Pro:* We take advantage of this change to start supporting cargo-bolero fuzz targets
      • *Con:* 1–2 months of engineering time to implement it in Rust, based on the experience of writing the current Python runner
      • *Con:* We don’t know yet how much of the disappearing-artifacts issue is due to the infra runner vs. interactions with nayduck going wrong
      • *Con:* When we replace nayduck with something less in-house, we’ll have the fuzzer keeping us on the old infrastructure until we switch to ClusterFuzz
  4. Use ClusterFuzz
      • *Pro:* Supported by Google, so we’re pretty sure it’d work well
      • *Pro:* We take advantage of this change to start supporting cargo-bolero fuzz targets (camshaft/bolero#98, “Running cargo-bolero jobs on ClusterFuzz”, describes a way to do that)
      • *Con:* Around 1 month of engineering time to deploy it with a proper build pipeline
      • *Con:* Additional infra expenses, as it could not run alongside nayduck (around $2k/mo to get the same amount of fuzzing as nayduck)
  5. Use ClusterFuzz and make nayduck interrupt its workers when not actually using them
      • *Pro:* ClusterFuzz is supported by Google, so we’re pretty sure it’d work well
      • *Pro:* We take advantage of this change to start supporting cargo-bolero fuzz targets
      • *Pro:* It’d probably become even less expensive than the current situation, as ClusterFuzz uses pre-emptible machines
      • *Con:* The most engineering effort, as we don’t have (m)any people who know nayduck well enough to actually implement the interruption
  6. Run both nayduck and ClusterFuzz on top of nested-virtualization VMs
      • *Pro:* Same cost as today, plus the fuzzer would be supported by Google
      • *Pro:* We take advantage of this change to start supporting cargo-bolero fuzz targets
      • *Con:* The amount of work is hard to guess, as ClusterFuzz seems to attempt to create its own GCP VMs, so running it on top of a VM that is not directly a GCP VM might be hard
      • *Con:* It’s unknown how well merely setting CPU priorities etc. would work for both nayduck and the fuzzer, as today the fuzzer gets a full SIGSTOP when nayduck is running a test (that pause/resume mechanism is sketched just after this list)
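
As an aside on the pause/resume mechanism mentioned above: the current runner is written in Python, but a minimal sketch of the same idea, assuming a hypothetical Rust reimplementation (as in option 3) on top of the `nix` crate, would look roughly like this:

```rust
// Minimal sketch of the pause/resume mechanism, assuming a hypothetical Rust
// reimplementation of the runner using the `nix` crate (the current runner is
// Python; names here are illustrative).
use nix::sys::signal::{kill, Signal};
use nix::unistd::Pid;

/// Freeze the fuzzer so a nayduck test gets the machine to itself.
/// Passing a negative pid would signal the whole process group instead,
/// which is what you'd want if the fuzzer spawns worker processes.
fn pause_fuzzer(fuzzer_pid: i32) -> nix::Result<()> {
    kill(Pid::from_raw(fuzzer_pid), Signal::SIGSTOP)
}

/// Let the fuzzer continue once the nayduck test is done.
fn resume_fuzzer(fuzzer_pid: i32) -> nix::Result<()> {
    kill(Pid::from_raw(fuzzer_pid), Signal::SIGCONT)
}
```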
Current status: implementing the “Use ClusterFuzz” solution (option 4).

Still missing:

  • [x] need to build fuzzers on each new commit rather than once a day
  • [ ] need to integrate the nayduck fuzzers’ corpus into ClusterFuzz
  • [ ] verify that the release process does document building new on-demand fuzzers
  • [ ] can we move the workflows from GitHub Actions to Buildkite?

(Migrated from ND-265.)


gmilescu commented Dec 6, 2022

jakmeier commented: @Ekleog-NEAR that’s a super useful and very clear description of the status quo and the possible options!

Quick question: could there be a small-effort fix for the status quo that a time-boxed investigation of Near-One/nayduck#36 might uncover? If yes, this would give us better S0 detection immediately and more time to consider the best long-term options for fuzzing.

Second question: would migrating to ClusterFuzz change anything regarding the cargo-bolero situation? Making it easier to add new fuzzing targets seems very valuable to me and would make it easier to justify spending a month or two on it.



gmilescu commented Dec 6, 2022

Ekleog-NEAR commented: For #36, there’s nothing in any of the logs I could find that would explain why the reproducer is not showing up, so I’m not expecting anything good to come out of investigating, especially as the issue seems to be intermittent. Now, it’s not impossible that it’d have a good result, but my rough guess would be a <= 40% likelihood of success. (Which is also my expectation for the likelihood of the next crash being successfully reported.)

Migrating to ClusterFuzz would require making a build pipeline that generates a libfuzzer binary. I have looked into it, and it seems cargo-bolero would be reasonably easy to integrate with it, though the niceties brought by bolero mean it’s not “just” generating a libfuzzer binary, so there’d probably be some work around that (I just investigated possible ways forward and opened camshaft/bolero#98 to discuss it with camshaft). Still, I don’t expect the total time, including the ClusterFuzz deployment and the cargo-bolero patching, to take more than 2 months.
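
To make the “not just generating a libfuzzer binary” point concrete, here is a minimal sketch (made-up type and function names, not actual nearcore or bolero code) of the kind of per-target libfuzzer entry point the build pipeline would ultimately need to produce for ClusterFuzz to drive:

```rust
// Hypothetical bridging harness: ClusterFuzz just needs a binary speaking the
// libfuzzer interface, so each bolero property ends up behind an entry point
// roughly like this one (names are illustrative).
#![no_main]

use arbitrary::{Arbitrary, Unstructured};
use libfuzzer_sys::fuzz_target;

// Made-up stand-in for the typed input a bolero target would generate.
#[derive(Debug, Arbitrary)]
struct Input {
    block_height: u64,
    payload: Vec<u8>,
}

// Made-up stand-in for the property the bolero target asserts.
fn check_property(input: &Input) {
    // Call the code under test here and assert its invariants.
    let _ = (input.block_height, &input.payload);
}

fuzz_target!(|data: &[u8]| {
    // Re-create by hand bolero's "typed value from raw bytes" step.
    let mut unstructured = Unstructured::new(data);
    if let Ok(input) = Input::arbitrary(&mut unstructured) {
        check_property(&input);
    }
});
```

libfuzzer-sys can also accept the typed input directly (`fuzz_target!(|input: Input| { … })`), but either way the niceties bolero layers on top still have to be reproduced or exported per target, which is what camshaft/bolero#98 is about.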



gmilescu commented Dec 6, 2022

Ekleog-NEAR commented: I’ve just added cargo-bolero support to the first post above



Ekleog-NEAR commented: @andrei-near is working on productionizing ClusterFuzz, which should fix this once all the features are complete.



Ekleog-NEAR commented: Current status before this is completely done:

  • [ ] need to build fuzzers on each new commit rather than once a day
  • [ ] need to integrate the nayduck fuzzers’ corpus into ClusterFuzz
  • [ ] verify that the release process does document building new on-demand fuzzers



Closing this issue as we moved to ClusterFuzz.



Ekleog-NEAR commented: @gmilescu I don’t think the two points I listed in my last comment have been completed yet? If I’m correct, should we track them here or in some other issue tracker? (Reopening in the meantime.)


gmilescu added the Node (Node team) label on Oct 18, 2023