
Evaluating Fuzz Testing - DyslexicAtheist
https://arxiv.org/abs/1808.09700
======
mwhicks1
I'm one of the authors of the paper. Two quick things I want to say:

I've written a blog post that is a short but rigorous description of the
paper's results: [http://www.pl-enthusiast.net/2018/08/23/evaluating-
empirical...](http://www.pl-enthusiast.net/2018/08/23/evaluating-empirical-
evaluations-for-fuzz-testing/)

The paper does not make strong statements of the form, "fuzzers are/do X" or
"all prior papers' claims are bogus." Rather, we say that the standard of
evidence should be higher, and we demonstrate why failing to reach that
standard _could_ result in bogus claims. It's quite possible that a paper's
idea really is an improvement. But it's also possible that additional evidence
would cast doubt on that, or nuance it. For example, our experiments show that
AFLFast probably does improve on AFL, though perhaps not as much as that paper
made out (note we didn't do enough experiments ourselves to make that
definitive statement).

~~~
bluGill
It was a very interesting paper, thanks for writing it.

Now to figure out how to fuzz my program...

------
tyoma
This paper is absolutely fascinating.

Fuzzers have some inherent nondeterminism, meaning that some runs agains the
same program will differ in how many bugs they find.

It turns out pretty much every fuzzer evaluation to date has not been
statistically rigorous. We don't actually know whether a lot of published
results are, well, really an improvement.
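To make that concrete: because runs are nondeterministic, the paper argues you need many trials per fuzzer and a statistical test over the results rather than a single head-to-head run. Here's a minimal stdlib-only sketch of the kind of comparison it recommends (a Mann-Whitney U statistic over per-run bug counts); the bug counts below are made-up illustrative numbers, not data from the paper:

```python
def mann_whitney_u(xs, ys):
    """U statistic for sample xs vs. ys (ties get the average rank)."""
    combined = sorted((v, i) for i, v in enumerate(xs + ys))
    ranks = [0.0] * len(combined)
    i = 0
    while i < len(combined):
        # find the run of tied values starting at i
        j = i
        while j + 1 < len(combined) and combined[j + 1][0] == combined[i][0]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[combined[k][1]] = avg_rank
        i = j + 1
    # rank sum of the first sample, converted to the U statistic
    r1 = sum(ranks[i] for i in range(len(xs)))
    return r1 - len(xs) * (len(xs) + 1) / 2

# Hypothetical bug counts from 5 independent runs of each fuzzer:
afl = [3, 5, 4, 6, 5]
aflfast = [7, 6, 8, 5, 9]
u = mann_whitney_u(aflfast, afl)
# u / (len(aflfast) * len(afl)) is the Vargha-Delaney A12 effect size:
# the probability a random AFLFast run beats a random AFL run.
```

With 5 runs each, U ranges from 0 to 25; values near 25 mean the first fuzzer almost always finds more bugs. A real evaluation would use many more trials and an off-the-shelf test (e.g. scipy's `mannwhitneyu`) to get a p-value.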

There is also a fascinating bit towards the end. It turns out doing the dumb
thing and _not_ tracking code coverage finds more crashes faster than
intelligent coverage tracking, but only with good seed inputs (see page 7).

~~~
hannob
> It turns out pretty much every fuzzer evaluation to date has not been
> statistically rigorous.

Shall I tell you a not so well hidden secret? Almost nothing published in
computer science, let alone in IT security, is statistically rigorous.

~~~
Ar-Curunir
That's not true; plenty of papers in the systems field perform careful and
rigorous experiments. This doesn't even include the theory side of stuff,
where you only have proofs, and no statistics to get wrong.

------
fulafel
Choice bit from abstract: "We surveyed the recent research literature and
assessed the experimental evaluations carried out by 32 fuzzing papers. We
found problems in every evaluation we considered. We then performed our own
extensive experimental evaluation using an existing fuzzer."

~~~
mwhicks1
Note that some papers do come close to meeting what we view as the right
standard. Table 1 shows that T-Fuzz does pretty well, for example.

