

Fuzzing at scale - abraham
http://googleonlinesecurity.blogspot.com/2011/08/fuzzing-at-scale.html

======
tptacek
Gulp.

A team at Google pulled 20tb(!) of SWF files out of their crawl and fed them
through a simple algorithm that determined the subset of 20,000 SWF files that
exercised the maximum number of basic blocks in Adobe's Flash Player.

Then, using 2000 CPU cores at Google for 3 weeks, they flipped random bits in
those 20,000 SWF files and fed them through an instrumented Flash Player.

Result: 80 code changes in Flash Player to fix security bugs from the
resulting crashes.
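The two steps described above can be sketched in C. This is a hypothetical simplification, not Google's actual code: the function names are invented, the coverage maps are shrunk to single `uint32_t` bitmasks (real basic-block coverage maps are vastly larger), and the distillation step is modeled as the greedy set-cover heuristic such corpus minimizers typically use.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Step 1: corpus distillation. Given one coverage bitmask per input
   file (bit i set = basic block i was exercised), greedily keep
   picking the file that adds the most new blocks until no candidate
   adds anything -- a standard greedy set-cover heuristic. */
static size_t distill(const uint32_t *cov, size_t n, size_t *picked)
{
    uint32_t seen = 0;
    size_t count = 0;
    for (;;) {
        size_t best = n;
        int best_gain = 0;
        for (size_t i = 0; i < n; i++) {
            int gain = __builtin_popcount(cov[i] & ~seen);
            if (gain > best_gain) { best_gain = gain; best = i; }
        }
        if (best == n)
            break;                 /* no file adds new coverage */
        seen |= cov[best];
        picked[count++] = best;
    }
    return count;
}

/* Step 2: mutation. Flip one randomly chosen bit in the input,
   then feed the result to the instrumented target. */
static void flip_random_bit(uint8_t *buf, size_t len)
{
    size_t bit = (size_t)rand() % (len * 8);
    buf[bit / 8] ^= (uint8_t)(1u << (bit % 8));
}
```

The expensive part is neither function: it is running the instrumented Flash Player on every distilled input to get the coverage masks in the first place, which is what the 2000 cores were for.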

This is great stuff; you can imagine a _very_ well organized adversary
spending the money on comparable compute resources, and even (if you stretch)
obtaining the non-recoverable engineering time to build a comparably
sophisticated fuzzing farm. But no other entity excepting perhaps Microsoft
can generate the optimal corpus of SWF files to fuzz from.

DO PDF NEXT, GOOGLE.

You've got to ask yourself: in a year or so, if there are still regular
updates for exploitable zero-day memory corruption flaws in Flash, even after
Google exhaustively tests the input to every basic block in the player with
the union of all file format features observed on the entire Internet, what
does that say about the hardness of making software resilient to attack?

~~~
jrockway
_what does that say about the hardness of making software resilient to attack_

Well, we already know the answer to that question. The long-term solution is
to build software out of safer building blocks. A good example is high-level
programming languages. When you write C and use cstrings (or other
unstructured "blocks-o-ram" data structures), you have to "get it right" every
single time you touch a string. In a codebase the size of Flash's, this
probably amounts to tens of thousands of possible bugs. But if you write it in
a high-level language, the runtime implementor only has to get it right once
-- and there's no way you can get it wrong, even if you want to. (The middle
ground is something like better string handling functions; OpenBSD tried to do
this with strl*, and there are libraries like bstrings that represent strings
properly. But you can still do unsafe pointer math on these strings with not-
much-benefit.)
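The contrast jrockway draws can be made concrete. A minimal sketch: `strcpy` gets no size information, so every call site must independently get the bound right, while an OpenBSD-style `strlcpy` carries the bound with the call and reports truncation. (`my_strlcpy` is an illustrative reimplementation, not the real OpenBSD function, which ships in OpenBSD's libc and in libbsd.)

```c
#include <assert.h>
#include <string.h>

/* The classic cstring pitfall: the destination size is invisible to
   strcpy, so this overflows buf whenever input is 8 bytes or more.
   Each of the thousands of such call sites in a large C codebase is
   a separate chance to get it wrong. */
void copy_unsafe(char *input)
{
    char buf[8];
    strcpy(buf, input);           /* no bound: potential overflow */
    (void)buf;
}

/* OpenBSD-style bounded copy: the size travels with the call, the
   result is always NUL-terminated, and the return value (length of
   src) lets the caller detect truncation. */
size_t my_strlcpy(char *dst, const char *src, size_t size)
{
    size_t srclen = strlen(src);
    if (size > 0) {
        size_t n = srclen < size - 1 ? srclen : size - 1;
        memcpy(dst, src, n);
        dst[n] = '\0';
    }
    return srclen;                /* >= size means truncated */
}
```

As the comment notes, this only shrinks the problem: the bound at each call site can still be wrong, which is why moving the check into a language runtime, written once, is the stronger fix.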

The way it stands right now, it's cheaper to write software with the approach
of "throw some trained monkeys at it and hope for the best". But in the
future, we're going to need to do a lot more thinking and a lot less typing if
we want to write secure software without performance penalties.

~~~
yuhong
Binary file formats make it worse, because you are dealing with untrusted data
disguised as a C struct. For example, even multiplying by an integer that is
too big can result in an integer overflow, which C will silently truncate.
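A hedged illustration of the overflow yuhong describes, as it typically appears when parsing a binary format (the function names and the 16-byte record size are invented for the example): a length field read from the untrusted file is multiplied by an element size, the product wraps around, and the allocation ends up far smaller than the copy loop that follows assumes.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* count comes straight from an untrusted binary file. In C, the
   uint32_t multiplication wraps silently, so a huge count yields a
   tiny (or zero-byte) allocation -- and a later loop writing count
   records corrupts the heap. */
void *alloc_records_unsafe(uint32_t count)
{
    return malloc(count * 16u);   /* wraps when count > UINT32_MAX/16 */
}

/* Safe variant: check the multiplication before performing it and
   reject the malformed file instead of truncating. */
void *alloc_records(uint32_t count)
{
    if (count > UINT32_MAX / 16u)
        return NULL;              /* would overflow: reject input */
    return malloc((size_t)count * 16u);
}
```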

------
nbpoole
That blog post seems to contradict what Tavis Ormandy claimed on Twitter a few
days ago, when the patch was released:

> _Adobe patched around 400 unique vulnerabilities I had sent them in
> APSB11-21 as part of an ongoing security audit. Not a typo._

<https://twitter.com/#!/taviso/status/101046246277521409>

> _Apparently that number was embarrassingly high, and they're trying to bury
> the results, so I'll publish my own advisory later today._

<https://twitter.com/#!/taviso/status/101046396790128640>

Whereas the blog post cites 400 unique crashes, 106 security bugs, and 80 code
changes (the same numbers that Adobe used:
<http://blogs.adobe.com/asset/2011/08/how-did-you-get-to-that-number.html>).

---

Regardless of the exact numbers though, this is a supremely awesome feat of
security engineering. It's very impressive.

~~~
tptacek
Code changes feels like the best count, unless you believe Adobe's letting
crashers slip past this release.

------
wglb
Ok, I am going to say that this is just a little scary, scalewise. And I am
thinking that the 2000 cores they used was some teeny fraction of what they
might have deployed if they really needed it.

~~~
nitrogen
On the subject of raw computing power, if you live in the US you've probably
heard about some NSA or CIA data facility being installed in your general
region, and how the local power company built new infrastructure just to power
the building. If Google can throw 2000 cores at securing software, how many
can a government throw at breaking it, e.g. in preparation for the next
iteration of Stuxnet?

~~~
tptacek
The cluster is interesting, but not as interesting as the giant corpus of SWF
files Google got to use. Do you think the government has a crawl as complete
as Google's under its hat? How? People notice when the Googlebot does new
things. Wouldn't we have noticed the Fedbot?

~~~
wisty
Perhaps Fedbot crawls in a less deterministic manner, uses a lot of different
IPs, and sets its user agent to IE?

~~~
bigiain
I suspect "fedbot" works by calling up google and saying "Hi, it's us again,
we've got another white van on the way to the googleplex, have a petabyte or
two of the Internet ready for us to collect in 20 minutes. thanks"

------
SoftwareMaven
This is really awesome. The one question I had: are there copyright issues
associated with Google using its index this way?

~~~
mbrubeck
Copyright law generally allows copying for research purposes that don't
compete commercially with the author's use of the work. For example, see
<https://w2.eff.org/IP/DMCA/Felten_v_RIAA/> (And Google, unlike some of us,
can afford lawyers to press that point in court if anyone tries to sue them.)

