
Vectorized Emulation: fuzzing at 2 trillion instructions per second - muricula
https://gamozolabs.github.io/fuzzing/2018/10/14/vectorized_emulation.html
======
munin
It would be cool to see an evaluation of this against baseline. Software is a
giant warehouse filled with soggy pinatas. Swing any kind of bat in there and
you'll get some candy.

~~~
gamozolabs
I'll try to do a blog post on this specific aspect. It can be hard to do an
apples-to-apples comparison with other fuzzers and tooling, but I can give it
a try.

This tooling is specifically designed for "hard" targets. While "hard" is
subjective, think targets with fewer than 2 CVEs a year, where getting even a
null deref is difficult.

I have used this on some soft targets and it's just as if you ran AFL against
it: candy everywhere. The upside is that this tool usually "finishes" in an
hour (no more coverage, no more crashes), which makes it a bit easier to
develop mutators/generators, as you can run them to completion faster and
have a more effective development cycle.

------
TickleSteve
I may be getting this wrong... but how can this possibly work?

Correct me if I'm wrong, but he's trying to emulate 16 systems in parallel by
vectorizing the instructions.

Ok, but this assumes that all paths are identical. Once you start fuzzing by
varying their inputs, the paths will all diverge, at which point you're down
to 16 non-identical paths again.

The whole point of fuzzing is to expose the different paths and this would
fail horribly for that, surely?

~~~
gamozolabs
trishume hits the nail on the head here. The mutation strategy is well aware
of how this system works.

Each core gets a completely unique fuzz case, where each lane of the vector
gets a small mutation. In _many_ cases this mutation doesn't even affect flow
(eg, the mutated parts are skipped over or never parsed due to errors).
Meaning all 16 run to completion. What's really important here is that when
the small modification you made to an individual lane does actually cause it
to diverge, you now know where and when that part of the input is used in the
program. This information is huge and can be used to tweak weights and other
parameters of mutators/generators, making them learn which fields to use when
and how often.
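The feedback loop described above can be modeled in scalar code. This is a toy sketch, not the actual implementation: the `parse` target, the one-byte-per-lane mutation policy, and all names are made up, and the real system tracks divergence per AVX-512 lane rather than by comparing traces in Python.

```python
LANES = 16

def parse(data: bytes) -> tuple:
    """Stand-in target: records which 'blocks' it visits."""
    trace = ["start"]
    if data[:2] == b"MZ":
        trace.append("valid_magic")
        if data[2] > 0x7F:
            trace.append("big_field")
    return tuple(trace)

def run_lanes(base: bytes) -> list:
    """Each lane flips one byte of the input; lanes whose trace diverges
    from the unmutated baseline reveal that the byte influences control
    flow, so future mutators can weight that offset more heavily."""
    weights = [0] * len(base)
    baseline = parse(base)
    for lane in range(LANES):
        offset = lane % len(base)       # toy policy: lane i mutates byte i
        mutated = bytearray(base)
        mutated[offset] ^= 0xFF
        if parse(bytes(mutated)) != baseline:
            weights[offset] += 1        # learn: this field is actually used
    return weights
```

For `b"MZ\x00\x00"` the first three bytes all affect the trace, so they accumulate weight, while the unparsed fourth byte stays at zero.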

With some better logic (covered in a later blog) I'll talk more about
handling fully divergent cases by using graph analysis to find
post-dominators in functions and running VMs until they can sync up. Rather
than the current model of "sync if you can", this will be a smart
forward-looking sync that ensures that by the end of every function all VMs
are running again (even if that means I have to insert artificial
post-dominator nodes into graphs).
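For illustration, post-dominators can be computed with the standard iterative dataflow approach. This is a textbook sketch, not the author's implementation, and the example CFG is made up; it assumes every node reaches the exit.

```python
def post_dominators(succs: dict, exit_node: str) -> dict:
    """pdom[n] = set of nodes on every path from n to the exit.
    Diverged VMs must all eventually reach a shared post-dominator,
    which is where their lanes can be re-merged."""
    nodes = set(succs)
    pdom = {n: set(nodes) for n in nodes}
    pdom[exit_node] = {exit_node}
    changed = True
    while changed:
        changed = False
        for n in nodes - {exit_node}:
            # A node post-dominates n iff it post-dominates every successor.
            new = {n} | set.intersection(*(pdom[s] for s in succs[n]))
            if new != pdom[n]:
                pdom[n] = new
                changed = True
    return pdom

# Diamond: entry branches to a/b, both rejoin at "merge", then exit.
cfg = {
    "entry": ["a", "b"],
    "a": ["merge"],
    "b": ["merge"],
    "merge": ["exit"],
    "exit": [],
}
```

Here "merge" post-dominates "entry", so a VM that takes the "a" side and one that takes the "b" side are both guaranteed to arrive at "merge", the natural sync point.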

~~~
TickleSteve
What kind of application are you running such that you don't get large path
differences based on the input?

In all the fuzzing I have done (with AFL and the like), the paths vary wildly
and end up in completely different parts of the stack.

Surely the advantage you're gaining by parallelising is completely wiped out
when you lose sync (which to me must be most of the time).

I just can't see how you can possibly keep sync between parallel runs in
anything but the most trivial application-under-test.

To me, it seems that this will necessarily degrade to a single path being
active. Have you done any analysis on how many paths are active simultaneously
over a non-trivial run?

~~~
derefr
Crypto code, maybe?

------
aaaaaaaaaab
>The goal is to take standard applications and JIT them to their AVX-512
equivalent such that we can fuzz 16 VMs at a time per thread.

What happens if the executable already contains AVX512 (or other SIMD)
instructions?

Though I guess you could rewrite them into their scalar equivalents first,
and then convert those into AVX-512.

~~~
vnorilo
The real problem is when different VMs branch differently. You essentially
have to run both sides of a branch in serial with masked lanes. The author
mentions this and states that the method works only because the VMs are very
deterministic. I don't understand how he deals with indirect jumps.
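The "both sides of a branch in serial with masked lanes" idea can be modeled in scalar code. This is a toy sketch of kmask semantics only, with all names made up; real AVX-512 would use predicated vector instructions rather than Python loops.

```python
def masked_branch(values, cond, then_fn, else_fn):
    """Apply then_fn to lanes where cond holds and else_fn to the rest.
    Both sides execute serially; the mask decides which lanes each
    side is allowed to write, mirroring AVX-512 kmask predication."""
    then_mask = [cond(v) for v in values]
    out = list(values)
    # "Then" side: only lanes with a set mask bit are updated.
    for i, on in enumerate(then_mask):
        if on:
            out[i] = then_fn(out[i])
    # "Else" side runs under the inverted mask.
    for i, on in enumerate(then_mask):
        if not on:
            out[i] = else_fn(out[i])
    return out
```

With four lanes holding 0–3, a branch on "even?" adds 10 to the even lanes and subtracts 1 from the odd ones, leaving each lane with the result of the side it actually took.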

~~~
DSingularity
You mean the possibility that the branch target is different on the different
VMs? He can fork the vectorization if the targets diverge. I’m not sure what
he does though.

~~~
vnorilo
Sure, but how does that not utterly destroy performance, as the fork would
probably lead to multiple sub-forks down the line... or lane? :)

~~~
DSingularity
When you are fuzzing, the ultimate goal is to explore as much of the entire
search space as you can. Forks can't hurt performance. They are just another
potential execution path to be vectorized.

------
mpartel
This is supremely cool!

The masking stuff reminds me of how branching works (or at least used to work)
on GPUs, so I wonder, could all this be made to work on a GPU too?

------
jacobush
I don't know what this is except that it's something about optimization or
bug finding or both, and that it's _wonderfully_ weird. How often do you see
someone use up 96 gigs of RAM, rewrite instructions to AVX-512, and use a
virtual MMU?

~~~
Double_a_92
Same. I have no idea what I just read. It's probably one of those things that
only make sense if you already understand them. _cough_ math topics on
Wikipedia _cough_

~~~
pjc50
"Fuzzing" is the technique for finding bugs by varying the inputs of a program
and looking at how this affects code paths. A simpler place to start would be
[http://lcamtuf.coredump.cx/afl/](http://lcamtuf.coredump.cx/afl/)

------
rbanffy
> if you cheap out on RAM and buy used Phis you could probably get the same
> setup for $1k USD.

I'd _love_ to lay my hands on a cheap 72x5 Phi-based workstation...

But since Intel only sells those by the tray, I guess I'll have to wait until
someone decommissions a large supercomputer.

~~~
crazysim
[https://www.ebay.com/itm/INTEL-ES-Xeon-Phi-7210-ES-CPU-1-30G...](https://www.ebay.com/itm/INTEL-ES-Xeon-Phi-7210-ES-CPU-1-30GHz-64-core-LGA-3647-QNVN/192653335449?hash=item2cdb089b99:g:~rAAAOSw3G9blno~:rk:2:pf:0)

?

~~~
rbanffy
The ones that end in "5" have virtualization.

------
brian_herman__
Cool! Can this be in the next version of
[http://lcamtuf.coredump.cx/afl/](http://lcamtuf.coredump.cx/afl/)?

------
mckirk
Okay, I get the idea with kmasks for branching. But what about loops, where
one VM would still have to continue the loop and the others would have to
leave it?

~~~
sanxiyn
You loop until all masks are zero.
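That mechanic can be sketched as a scalar model: lanes that satisfy the loop exit have their mask bit cleared, and the vector loop body keeps executing until the mask is all zeros. A toy illustration of the kmask bookkeeping, not real vectorized code:

```python
def vector_loop(counters):
    """Each lane counts its value down to zero. Lanes finish at
    different times, but the shared loop only ends once every
    lane's mask bit has been cleared."""
    mask = [c > 0 for c in counters]
    iterations = 0
    while any(mask):                    # loop until all masks are zero
        for i, active in enumerate(mask):
            if active:                  # body runs only for live lanes
                counters[i] -= 1
                if counters[i] == 0:
                    mask[i] = False     # this lane leaves the loop
        iterations += 1
    return counters, iterations
```

Lanes starting at 1, 3, and 2 drop out after 1, 3, and 2 iterations respectively, so the loop as a whole runs for 3 iterations, the maximum over the lanes.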

~~~
TickleSteve
...and different branches? I.e., based on the outcome of that loop?

------
vectorEQ
This is really interesting. I'm looking forward to more blogs on the last few
topics/chapters!

------
lachlan-sneff
This is insane. And awesome.

------
rbanffy
I love to see the creative misuse of technology. We need more of that.

