
Fuzzing Perl: A Tale of Two American Fuzzy Lops - geeknik
http://www.geeknik.net/71nvhf1fp
======
moyix
The paper [1] on AFLFast is, IMO, a great example of where academia shines:
carefully looking at how and why something works, developing some theory and a
working model, and then using that to get a substantial improvement on the
state of the art (and doing a nice evaluation to show that it really works).

[1]
[https://www.comp.nus.edu.sg/~mboehme/paper/CCS16.pdf](https://www.comp.nus.edu.sg/~mboehme/paper/CCS16.pdf)

~~~
nickpsecurity
The abstract sounds like it. They said with no program analysis, though. I
thought program analysis was good enough that it could probably auto-generate
tests for every branch in a program, possibly in less time or with more
assurance. Was I wrong, or is this a parallel subfield?

~~~
moyix
The question of whether randomized testing or program analysis gives you more
coverage of a program is a really interesting one. Böhme actually has an
earlier paper that addresses this question:
[https://www.comp.nus.edu.sg/~mboehme/paper/FSE14.pdf](https://www.comp.nus.edu.sg/~mboehme/paper/FSE14.pdf)

> We study the relative efficiencies of the random and systematic approaches
> to automated software testing. Using a simple but realistic set of
> assumptions, we propose a general model for software testing and define
> sampling strategies for random (R) and systematic (S₀) testing, where each
> sampling is associated with a sampling cost: 1 and c units of time,
> respectively. The two most important goals of software testing are: (i)
> achieving in minimal time a given degree of confidence x in a program's
> correctness and (ii) discovering a maximal number of errors within a given
> time bound n̂. For both (i) and (ii), we show that there exists a bound on c
> beyond which R performs better than S₀ on the average. Moreover for (i),
> this bound depends asymptotically only on x. We also show that the
> efficiency of R can be fitted to the exponential curve. Using these results
> we design a hybrid strategy H that starts with R and switches to S₀ when S₀
> is expected to discover more errors per unit time. In our experiments we
> find that H performs similarly or better than the most efficient of both and
> that S₀ may need to be significantly faster than our bounds suggest to
> retain efficiency over R.

Also, for (say) C programs of moderate size, I believe current program
analysis techniques do not achieve anything near 100% branch coverage. In the
KLEE paper they manage to get ~90% on average for the GNU coreutils using
concolic execution – which is really impressive! But the coreutils are tiny
programs. For larger programs the cost of symbolic execution is high and
random testing can often get you more bugs faster.
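As a toy illustration of that cost model (not the paper's actual model; the domain size, failure count, and cost ratio c below are all made up), here is a small simulation comparing R, which pays 1 unit per test, against S₀, which never repeats an input but pays c units per test:

```c
#include <stdio.h>
#include <stdlib.h>

#define DOMAIN  10000   /* number of distinct inputs (made up)        */
#define FAILING 2       /* how many of them trigger the bug           */
#define COST_S0 20.0    /* one systematic test costs c = 20 units     */
#define TRIALS  1000

/* cost (in units) until random testing R first hits a failing input */
static double run_random(void) {
    double t = 0;
    for (;;) {
        t += 1.0;                       /* each random test costs 1   */
        if (rand() % DOMAIN < FAILING)  /* uniform pick hits a bug?   */
            return t;
    }
}

/* S0 enumerates distinct inputs in a fixed order; on average the first
 * failing input sits near position DOMAIN / (FAILING + 1).            */
static double run_systematic(void) {
    return COST_S0 * (double)DOMAIN / (FAILING + 1);
}

int main(void) {
    srand(12345);
    double sum = 0;
    for (int i = 0; i < TRIALS; i++)
        sum += run_random();
    printf("R  mean cost to first bug: %.0f units\n", sum / TRIALS);
    printf("S0 cost to first bug:      %.0f units\n", run_systematic());
    return 0;
}
```

With these made-up numbers R finds the first bug far sooner; shrink COST_S0 and S₀ wins, which is the trade-off the paper formalizes.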

~~~
nickpsecurity
Forgot to say earlier: very insightful comment. I'm not sure I trust the
2014 paper's claim at first glance; I'm going to have to think on it.
Interestingly, the hybrid strategy in their abstract is one of the two
approaches I was going to ask you about after your comment. I always think
about combining very different methods whenever they're in competition with
different strengths, including program analysis and probabilistic testing.

The KLEE result was interesting. I'm wondering about combining program
analysis with semi-random testing that generates its values within the ranges,
etc., that come out of the analysis. It might also spread the effort along the
control flow graph and such. Alternatively, as in dynamic analysis, add the
kinds of instrumentation (in software, via annotations, or at the compiler
level) that catch probably-risky situations, combined with random-testing
input likely to set them off. That might speed up the process, but I imagine
someone is already doing one or both of these.
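As a rough sketch of that first idea (everything here is invented: the input format, the field names, and the ranges supposedly reported by the analysis), the analysis prunes the input space so the random budget lands on inputs that survive the early checks:

```c
#include <stdio.h>
#include <stdlib.h>

struct range { int lo, hi; };      /* [lo, hi] as reported by the analysis */

static int pick(struct range r) {
    return r.lo + rand() % (r.hi - r.lo + 1);
}

int main(void) {
    srand(1);
    struct range type_r = {1, 3};  /* the parser rejects other type codes  */
    struct range len_r  = {0, 64}; /* lengths above 64 bail out early      */

    for (int i = 0; i < 5; i++) {
        unsigned char msg[2 + 64];
        int len = pick(len_r);
        msg[0] = (unsigned char)pick(type_r);
        msg[1] = (unsigned char)len;
        for (int j = 0; j < len; j++)      /* the body stays fully random  */
            msg[2 + j] = (unsigned char)(rand() & 0xff);
        printf("input %d: type=%d len=%d\n", i, msg[0], msg[1]);
        /* hand msg (2 + len bytes) to the target here */
    }
    return 0;
}
```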

------
jjnoakes
In the first pass, 6 bugs were found and reported: 3 heap-use-after-free and
3 heap-buffer-overflow. Similar numbers in the second.

I'm so glad new programming languages are making strides which prevent this
sort of thing outright. They don't prevent all bugs, but they sure prevent
some of the most damaging ones.
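For concreteness, minimal C versions of the two bug classes named above (not taken from the Perl source):

```c
#include <stdlib.h>
#include <string.h>

int main(void) {
    /* heap-buffer-overflow: write past the end of an allocation */
    char *buf = malloc(8);
    memset(buf, 'A', 9);   /* the 9th byte lands outside the block */

    /* heap-use-after-free: keep using memory after releasing it */
    free(buf);
    buf[0] = 'B';          /* write through a dangling pointer */
    return 0;
}
```

A memory-safe language either rejects these outright or turns them into a clean runtime error; in C they only get caught if something like ASan happens to be watching.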

~~~
theonemind
This seems like a fairly dubious claim without something to back it. What
makes implementations of newer languages more immune to heap-use-after-free
and heap-buffer-overflow bugs?

My first guess is that the two main things causing this are the number of
lines of code (attack surface) and being implemented in C. The reference
implementation of Python is also in C. GHC is partially implemented in C.
Being implemented in C still seems common, and C is prone to these kinds of
bugs. Perl is a complex language, giving the code base a large attack surface.
If anything, I'd guess languages with smaller implementations have fewer bugs,
since one of the most robust findings on these sorts of matters is that the
number of bugs is proportional to the number of lines of code independently of
other factors.

~~~
SwellJoe
I believe it'll be interesting to see how things shake out when we start
seeing languages implemented in Rust. We already have a wide variety of
languages for the JVM, which I believe most everyone would agree is a safer
environment for a language than C, and I'd bet the theory above holds up. I
don't know enough to know how to test that theory, however.

Basically, I think the "newer languages" comment refers to the implementation
language (C being the most common today, but probably not for more than a few
more years) rather than the language itself (Perl in this case). Though I
think you're entirely right that surface area plays a big role. I would be
willing to bet that Perl 6 will be more secure (by some definition of "more"
and "secure") than Perl 5, because the surface area of Perl 6 that is in C is
much smaller (the VM, with most of the language itself being written in a
subset of Perl called NQP, or Not Quite Perl).

~~~
rurban
Not because the surface area is smaller in perl6, but because with a GC those
use-after-free bugs do not happen. They happen with refcounted VMs, where it's
easy to access a value whose refcount is already 0. With a GC, other memory
bugs are common: forgetting write barriers, forgetting roots, external data.
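As a made-up miniature of that refcounting pattern (not actual perl5 internals):

```c
#include <stdio.h>
#include <stdlib.h>

typedef struct { int refcnt; int payload; } value;

static value *new_value(int p) {
    value *v = malloc(sizeof *v);
    v->refcnt  = 1;
    v->payload = p;
    return v;
}

static void dec_ref(value *v) {
    if (--v->refcnt == 0)
        free(v);                        /* last counted reference gone */
}

int main(void) {
    value *v = new_value(42);
    value *borrowed = v;                /* a second, uncounted pointer */
    dec_ref(v);                         /* refcount hits 0, v is freed */
    printf("%d\n", borrowed->payload);  /* heap-use-after-free         */
    return 0;
}
```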

And since most of the old dynamic languages (and also the new conservative
ones) are refcounted, the most common error is use-after-free, followed by
buffer overflows (heap or stack).

Off-by-one errors are common to all systems.

Since I know the perl5 and perl6 internals inside out, I wouldn't bet that
perl6 is more secure than perl5 at all. perl5 is hairier and has eroded over
time, but the perl6 VM is not really battle-proven and is way too slow to be
taken seriously.

------
rurban
The published SEGVs are not security relevant. They only happen in DEBUGGING
output, which is not compiled into production perls. Unless you use an old
Red Hat system, where they shipped a 10x slower debugging perl.
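Schematically (this is not the actual perl source, just the shape of the situation):

```c
#include <stdio.h>

#ifdef DEBUGGING
/* the fuzzed crashes live in diagnostic paths like this one */
static void debug_dump(const char *op_name) {
    fprintf(stderr, "executing %s\n", op_name);
}
#  define TRACE(op) debug_dump(op)
#else
#  define TRACE(op) ((void)0)   /* compiled out: the SEGV is unreachable */
#endif

int main(void) {
    TRACE("padsv");             /* only does anything with -DDEBUGGING */
    return 0;
}
```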

I fixed the publicly reported bugs in 2 minutes. I cannot fix the other bugs
since they were not reported to cperl (the perl5 fork which is doing the
actual development of perl5). The perl5 security team is doing horrible work,
so I would prefer to get the reports also, for independent and usually better
fixes.

Brian Carpenter and Dan Collins provided excellent afl work lately for perl5.

~~~
mst
While your assertion about your fork being "the actual development" is
unfortunate, I would encourage anybody submitting security issues to the
actual perl5 to also consider sending them to cperl and MLEHMANN's stableperl,
since the users of minority forks deserve security too.

~~~
rurban
No worries, found all of them in public tickets already.

2 of them were actually security relevant (one stack overflow, one heap
overflow); all 6 issues are already fixed in git.

Regarding your comment about "deserve security": yes, perl5 would deserve a
bit of security, but all they do is theatre. Dozens of known security fixes
are ignored.

------
thyrsus
My understanding is that fuzz testing uses pseudo-random variation of seed
inputs; given a different seed to the PRNG, how common is it for the same fuzz
test to identify different flaws?

~~~
hannob
In my experience very rare. It's so rare I'd say it basically doesn't happen.

You'll get some fluctuation; a bug may come up half an hour earlier or later.
But results tend to be pretty reproducible. If you find a bug with a specific
test and tool in x hours then the next time you try for at least x+1 hours
you'll find it again.

~~~
moyix
I think that's right for just picking a different PRNG seed. When you start
looking at modifying the search heuristics, mutation operators, or other parts
of the "strategy", you definitely start finding different bugs, though.

~~~
hannob
Yes, absolutely. Different strategies can lead to vastly different results,
and often it comes down to subtle things.

Good example: There was a bug I found in openssl that Libfuzzer was unable to
find. The Libfuzzer developer was quite interested in this and has now adopted
new mutation strategies:
[https://github.com/google/sanitizers/issues/710](https://github.com/google/sanitizers/issues/710)

