
Open-Sourcing ClusterFuzz - markoa
https://opensource.googleblog.com/2019/02/open-sourcing-clusterfuzz.html
======
metzmanj
I work on this. Happy to answer questions if people have any.

~~~
devy
Hi, good stuff! I am curious to know why it was primarily written in Python
(according to GitHub 83.3%)?

~~~
tracker1
Generally speaking, code that does orchestration and testing is often easier
to write in a dynamic scripting language than in something that is built and
compiled, even if it winds up as a custom DSL. I think Python is one of the
better options here because of its broad community support and tooling.

Aside: I tend to reach for node/js often for similar reasons (despite
detractors) mostly because I'm more comfortable with it over Python or Ruby,
but also because it's already integrated into most of the build/test
environments I'm working on anyway.

------
boulos
Disclosure: I work on Google Cloud.

I'm super pleased to see this! Abhishek and the ClusterFuzz team were one of
our initial customers for Preemptible VMs, still are, and make for a great
example. Congrats to the team!

------
guidovranken
I don't want to hijack the thread subject but here are my thoughts on the
usefulness of fuzzing of safe languages.

Even in the absence of memory corruption bugs there is a subclass of bugs that
can emerge in any general-purpose language, like slowness/hangs, assert
failures, panics and excessive resource consumption.
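
A minimal sketch of how a harness might surface these bug classes in a memory-safe language. It is an illustration only, not how ClusterFuzz does it: `run_one` and `parse_header` are hypothetical names, and the time budget is an arbitrary assumption.

```python
import time


def run_one(target, data, budget_s=1.0):
    """Run target(data) once and classify the outcome.

    Returns "crash" if the target raised (AssertionError included),
    "slow" if it finished but exceeded the time budget, else "ok".
    """
    start = time.perf_counter()
    try:
        target(data)
    except Exception:
        return "crash"
    elapsed = time.perf_counter() - start
    return "slow" if elapsed > budget_s else "ok"


def parse_header(data):
    # Hypothetical target: asserts an invariant that a fuzzer can violate.
    assert len(data) < 8, "header too long"
    return data[:4]
```

Note that this only measures slowness after the call returns. Catching a true hang would need the target to run in a separate process that can be killed.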

Beyond those, you can detect invariant violations, (de)serialization
inconsistencies (e.g. deserialize(serialize(input)) != input; see [1]), and
different behavior across multiple libraries whose semantics must be identical
(cryptocurrency implementations are notable in this regard, as deviation from
the spec or canonical implementation in the execution of scripts or smart
contracts can lead to chain splits).
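
The round-trip check can be sketched with Python's standard `json` module standing in for the library under test. The value generator and function names here are hypothetical, and a real fuzzer would supply the inputs instead of `random`.

```python
import json
import random


def roundtrips(value):
    """True if deserialize(serialize(value)) gives back the same value."""
    return json.loads(json.dumps(value)) == value


def random_value(rng, depth=0):
    """Generate a small random JSON-like value (hypothetical generator)."""
    kinds = ["int", "str", "list", "dict"] if depth < 3 else ["int", "str"]
    kind = rng.choice(kinds)
    if kind == "int":
        return rng.randint(-2**70, 2**70)
    if kind == "str":
        return "".join(rng.choice('ab\u00e9\n"') for _ in range(rng.randint(0, 5)))
    if kind == "list":
        return [random_value(rng, depth + 1) for _ in range(rng.randint(0, 3))]
    # Only string keys: JSON objects cannot represent any other key type.
    return {str(i): random_value(rng, depth + 1) for i in range(rng.randint(0, 3))}


def find_violations(rng, rounds=1000):
    """Return every generated value that does not survive a round trip."""
    values = (random_value(rng) for _ in range(rounds))
    return [v for v in values if not roundtrips(v)]
```

The property holds for the values generated above, but it is easy to leave the safe subset: `roundtrips({1: "a"})` is false because the integer key comes back as the string `"1"`.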

With some effort you can do differential 64 bit/32 bit fuzzing on the same
machine, and I've found interesting discrepancies between the interpretation
of numeric values in JSON parsers, which makes sense if you think about it
(types such as size_t and long differ in size between the two architectures, causing the 32
bit parser to truncate values). This might be applicable to every language
that does not guarantee type sizes across architectures like Go (not sure?),
but I haven't tested that yet.
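
To illustrate the kind of discrepancy meant here, the sketch below models the two builds in one process by masking to 32 and 64 bits. This is a simulation under stated assumptions, not the actual setup: a real differential run would execute two separately compiled binaries, and `parse_u64`, `parse_u32` and `first_mismatch` are hypothetical names.

```python
def parse_u64(text):
    """Model of a parser that stores the number in a 64-bit unsigned integer."""
    return int(text) & 0xFFFFFFFFFFFFFFFF


def parse_u32(text):
    """Model of the same parser on a build whose integer type is 32 bits wide."""
    return int(text) & 0xFFFFFFFF


def first_mismatch(inputs):
    """Return the first input on which the two models disagree, or None."""
    for text in inputs:
        if parse_u64(text) != parse_u32(text):
            return text
    return None
```

Any input of 2**32 or more exposes the truncation, which is exactly the sort of value a coverage-guided fuzzer would need to reach.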

You can detect path escape/traversal (which is entirely language-agnostic but
potentially severe) by asserting that every absolute path the app ever
accesses is a legal one, or by fuzzing a path sanitizer specifically.
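
A rough sketch of both ideas, assuming "legal" means "resolves to somewhere under a fixed base directory". `is_within` is the assertion a harness could apply to every accessed path, and `naive_sanitize` is a hypothetical, deliberately flawed sanitizer of the kind a fuzzer tends to break.

```python
import os


def is_within(base, candidate):
    """True if candidate, resolved relative to base, stays inside base."""
    base_real = os.path.realpath(base)
    cand_real = os.path.realpath(os.path.join(base, candidate))
    try:
        return os.path.commonpath([base_real, cand_real]) == base_real
    except ValueError:
        # Raised e.g. for paths on different drives; treat as outside.
        return False


def naive_sanitize(path):
    """Hypothetical sanitizer: strips "../" in a single pass (insufficient)."""
    return path.replace("../", "")
```

A fuzz target would then assert `is_within(base, naive_sanitize(data))` for every input. The input `"....//....//etc/passwd"` violates it, because removing the inner `../` sequences leaves `../../etc/passwd` behind.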

And so on.

Code coverage is the primary metric used in fuzzing, but other metrics can be
useful as well. I've experimented extensively with metrics such as allocation,
code intensity (number of basic blocks executed) (which helped me prove that
V8's WASM JIT compiler can be subjected to inputs of average size that take
>20 seconds to compile), and stack depth, see also [2].

Any quantifier can be used as a fuzzing metric, for example the largest
difference between two variables in your program.

Let's say you have a decompression algorithm that takes C as an input and
outputs D. Calculate R = len(D) / len(C), so that R is the ratio of
decompressed output size to compressed input size. Use R as a fuzzing metric
and the fuzzer will tend to generate inputs that have a high
decompressed/compressed size ratio, possibly leading to the discovery of
decompression bombs [3].
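
As a toy illustration of using R as the feedback signal, the sketch below hill-climbs on zlib streams. It is not libFuzzer and makes no claim about how libFuzzer schedules inputs: the mutator, the `MAX_OUTPUT` cap and all function names are assumptions made for the example.

```python
import random
import zlib

MAX_OUTPUT = 1 << 24  # cap output so measuring R never becomes a bomb itself


def expansion_ratio(compressed):
    """R = len(D) / len(C); 0.0 for empty or invalid input."""
    if not compressed:
        return 0.0
    try:
        decompressed = zlib.decompressobj().decompress(compressed, MAX_OUTPUT)
    except zlib.error:
        return 0.0
    return len(decompressed) / len(compressed)


def mutate(data, rng):
    """Overwrite one random byte (hypothetical, deliberately naive mutator)."""
    if not data:
        return data
    buf = bytearray(data)
    buf[rng.randrange(len(buf))] = rng.randrange(256)
    return bytes(buf)


def climb(seed, rng, rounds=2000):
    """Keep a mutated input only when it raises the metric R."""
    best, best_r = seed, expansion_ratio(seed)
    for _ in range(rounds):
        candidate = mutate(best, rng)
        r = expansion_ratio(candidate)
        if r > best_r:
            best, best_r = candidate, r
    return best, best_r
```

Most single-byte mutations fail the stream's checksum and score 0.0, which is why real fuzzers combine a metric like this with coverage feedback and smarter mutators.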

On that note, I believe libFuzzer now also natively supports custom counters [4].

Based on Rody Kersten's work I implemented libFuzzer-based fuzzing of Java
applications supporting code coverage, intensity and allocation metrics [5],
and it should not be difficult to plug this into ClusterFuzz/oss-fuzz.

Feel free to get in touch if you have any questions or need help.

[1]
[https://github.com/nlohmann/json/blob/develop/test/src/fuzze...](https://github.com/nlohmann/json/blob/develop/test/src/fuzzer-parse_json.cpp)

[2]
[https://github.com/guidovranken/libfuzzer-gv](https://github.com/guidovranken/libfuzzer-gv)

[3]
[https://en.wikipedia.org/wiki/Zip_bomb](https://en.wikipedia.org/wiki/Zip_bomb)

[4]
[https://llvm.org/docs/doxygen/FuzzerExtraCounters_8cpp_sourc...](https://llvm.org/docs/doxygen/FuzzerExtraCounters_8cpp_source.html)

[5]
[https://github.com/guidovranken/libfuzzer-java](https://github.com/guidovranken/libfuzzer-java)

~~~
metzmanj
Great post Guido!

Guido's bignum fuzzer which tests the correctness of math operations in crypto
libraries is one of the most interesting fuzzers we run on ClusterFuzz.

------
rarecoil
Thank you for open sourcing this. For those interested in trying multiple
cluster-based fuzzing solutions, I'd also like to point at yahoo/yfuzz[1],
which is k8s-backed.

[1] [https://github.com/yahoo/yfuzz](https://github.com/yahoo/yfuzz)

------
bobwaycott
For those interested in the repo:
[https://github.com/google/clusterfuzz](https://github.com/google/clusterfuzz)

------
polskibus
Is there a fuzzing tool oriented towards web applications? Something that
could generate loads of Selenium cases automatically and verify whether the
application crashes, logs an exception or continues to work smoothly?

~~~
zdragnar
There are a boatload of pentesting (i.e. penetration testing) tools that use
fuzzing. Just be sure that your sysadmin and / or your cloud provider are
aware that you intend on running such a test, as that's a really quick way to
accidentally bring down servers you weren't anticipating would be connected,
or get your IP address banned for DDoS attacks (looking at you, junior QA guy
who had good intentions but caused all sorts of havoc).

Edit: Just realized I didn't quite address your question fully.
[https://pentest-tools.com/home](https://pentest-tools.com/home) is an online
service that will run tests, including URL fuzzing and what not. All of the
features they offer can also be found in open source and proprietary software.
Not sure about saving failed tests as selenium tests for re-running in the
future, though I imagine that you'd just re-run the same tool in the first
place.

~~~
polskibus
What open source did you mean? I'd like to run such tests on the intranet,
without access to any SaaS.

~~~
babayega2
Last time I tested
[https://github.com/s0md3v/XSStrike](https://github.com/s0md3v/XSStrike) with
some quite interesting results. It's important to note that it's not a
fuzzing tool, but just a pentest one.

------
Insanity
Perhaps a noobie question, but it mentions C/C++ specifically. How does this
hold up for Go, where you have pointers but no pointer arithmetic?

~~~
guidovranken
Go software can exhibit a variety of denial-of-service bugs such as slice out-
of-bounds access (since there is no try/catch mechanism, this leads to a
panic), excessive allocations, excessive computation/timeout (consider "for i
:= 0; i < N; i++" where N is untrusted), stack overflow due to unbounded
recursion (rare because Go has a custom, large stack).

My bignum-fuzzer project [1] runs on oss-fuzz and tries to find mismatches
between bignum computations across different libraries (OpenSSL, Go, Rust,
etc). This is one example of how fuzzing can be useful even if the underlying
language is "safe".
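
The idea of cross-checking bignum computations can be sketched in a few lines. This is not the bignum-fuzzer code: it compares Python's built-in `pow` against a hand-written square-and-multiply routine, with `random` standing in for fuzzer-provided operands, and both function names are hypothetical.

```python
import random


def modexp_square_multiply(base, exp, mod):
    """Hand-written modular exponentiation (the implementation under test)."""
    result = 1 % mod  # handles mod == 1, where the answer is 0
    base %= mod
    while exp > 0:
        if exp & 1:
            result = (result * base) % mod
        base = (base * base) % mod
        exp >>= 1
    return result


def differential(rng, rounds=500, bits=512):
    """Return every (base, exp, mod) on which the two implementations differ."""
    mismatches = []
    for _ in range(rounds):
        b = rng.getrandbits(bits)
        e = rng.getrandbits(bits)
        m = rng.getrandbits(bits) + 1  # avoid a zero modulus
        if pow(b, e, m) != modexp_square_multiply(b, e, m):
            mismatches.append((b, e, m))
    return mismatches
```

An empty result only means no mismatch was found in these rounds. Any non-empty result is a bug in at least one of the two implementations, with no memory corruption required.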

With some small hacks you can also have Go code coverage instrumentation as a
libFuzzer counter.

[1]
[https://github.com/guidovranken/bignum-fuzzer/blob/master/mo...](https://github.com/guidovranken/bignum-fuzzer/blob/master/modules/go/lib.go)

~~~
staticassertion
And Go isn't memory safe given race conditions.

If you're using goroutines you may want to consider fuzzing with the race
detector.

------
painful
How about people stop using unsafe languages such as C and C++?

~~~
stormbeard
What planet do you live on? What do you think embedded/realtime systems,
signal processing, graphics, and kernel developers are supposed to use? Also,
what do you think these memory-safe, garbage collected, runtime environments
are written in?

~~~
saagarjha
> What do you think embedded/realtime systems, signal processing, graphics,
> and kernel developers are supposed to use?

My guess is that they'd use Rust for new code.

~~~
adrianN
Rust doesn't yet support every platform that C supports and training staff to
use a new language and adding tooling to integrate a new language into an
existing codebase is extremely expensive.

~~~
saagarjha
I don't even know Rust, so there's a reason I write low-level stuff like that
in C/C++ ;)

------
syastrov
Makes you think about choosing to write software in C / C++ / other non-
memory-safe languages when you need 25000 cores churning away to ensure you
don’t make mistakes that could cause serious security issues.

It makes me wonder why Google wouldn’t put their efforts into using Rust, for
example.

Of course, server power is cheap, but not for our planet.

~~~
twblalock
I know some embedded/kernel devs and they don't take Rust very seriously. I
don't think it has a lot of mindshare in industry among the kind of people who
currently write C.

Even if all new code were written in safe languages we would still need to do
fuzzing until all legacy code was rewritten -- operating systems, SSL
libraries, browsers, load balancers, etc.

~~~
master-litty
I truly believe that's the only reason people aren't using Rust as a
replacement for C / C++ immediately. The adoption isn't widespread, but I'm
rather optimistic it will be.

It's getting there.

~~~
twblalock
I don't think it's the only reason.

My C programmer friends like the dangerous features of the language and want
to use them. Their programs are designed around the assumption that they can
mutate anything in memory whenever they want to. They use mutable global
variables. They would strongly dislike Rust's memory protections and ownership
concept, and its other safety features.

~~~
feanaro
To what purpose? It sounds like they are being reckless just for the heck of
it, because that's how the big boys do it and they can't be bothered to learn
a new skill.

