
Neural Nets Can Learn Function Type Signatures from Binaries [pdf] - bmc7505
https://www.usenix.org/system/files/conference/usenixsecurity17/sec17-chua.pdf
======
dmix
TLDR:

> In this paper, we present a new system called EKLAVYA which trains a
> recurrent neural network to recover function type signatures from
> disassembled binary code. EKLAVYA assumes no knowledge of the target
> instruction set semantics to make such inference.

> [..] we find by analyzing its model that it auto-learns relationships
> between instructions, compiler conventions, stack frame setup instructions,
> use-before-write patterns, and operations relevant to identifying types
> directly from binaries.

> In our evaluation on Linux binaries compiled with clang and gcc, for two
> different architectures (x86 and x64), _EKLAVYA exhibits accuracy of around
> 84% and 81%_ for function argument count and type recovery tasks
> respectively.

~~~
self_awareness
> "On our x86 and x64 datasets, EKLAVYA exhibits comparable accuracy with
> traditional heuristics-based methods"

------
ma2rten
I've long had this idea to build a decompiler (a program that maps binaries
back to source code) using machine learning. The problem in decompilation is
that you lose information when you compile source code. Machine learning could
even help recover things like the most likely variable names. There are also
tons of training data that can be easily generated.

~~~
noonespecial
I've had the same idea, but instead it could "decompile" spaghetti code with
horrible variable names into something readable.

It seems like a reasonable (almost believable even) first stage to software
finally coming to eat the jobs of its creators.

~~~
Swizec
Our jobs will just move an abstraction level higher. We used to have to worry
about registers and stuff, then we automated that. We used to worry about
memory, then we automated that. We used to worry about CPU stuff, then we
automated that for most applications.

Eventually we're going to become technical PMs. As long as there will be fuzzy
unclear problem descriptions, there will be jobs for people who can codify and
standardize processes.

~~~
haikuginger
"First they came for the assembly programmers..."

~~~
Filligree
"And then they ate the universe, and there was no-one left to care."

Plenty of paperclips, though.

------
placebo
I've often wondered about the theoretical limit of a neural net to learn from
examples - seems like a fascinating subject with lots of implications. From a
quick search, I found this paper:
[https://experts.illinois.edu/en/publications/computational-l...](https://experts.illinois.edu/en/publications/computational-limitations-on-learning-from-examples)
which is already very interesting. Are there any other good pointers on this?

~~~
bmc7505
According to the universal approximation theorem [1], a multilayer feedforward
net with a single hidden layer can approximate any continuous function on R^n.
There's also a nice visual proof. [2]

[1]:
[http://www.sciencedirect.com/science/article/pii/08936080899...](http://www.sciencedirect.com/science/article/pii/0893608089900208)

[2]:
[http://neuralnetworksanddeeplearning.com/chap4.html](http://neuralnetworksanddeeplearning.com/chap4.html)
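As a toy demonstration of the theorem (my own sketch, not from the linked proof): a single-hidden-layer tanh net trained with plain backprop can drive the error on a smooth 1-D target like sin(x) close to zero. All sizes and the learning rate here are arbitrary choices.

```python
import numpy as np

# Toy sketch: fit sin(x) with one hidden tanh layer via full-batch
# gradient descent on mean squared error.
rng = np.random.default_rng(0)
x = np.linspace(-np.pi, np.pi, 200).reshape(-1, 1)
y = np.sin(x)

hidden = 30
W1 = rng.normal(0.0, 1.0, (1, hidden))
b1 = np.zeros(hidden)
W2 = rng.normal(0.0, 1.0, (hidden, 1))
b2 = np.zeros(1)

lr = 0.05
for _ in range(20000):
    h = np.tanh(x @ W1 + b1)        # hidden activations
    err = (h @ W2 + b2) - y         # prediction error
    gW2 = h.T @ err / len(x)        # backprop: output layer...
    gb2 = err.mean(axis=0)
    gh = (err @ W2.T) * (1 - h**2)  # ...then through the tanh
    gW1 = x.T @ gh / len(x)
    gb1 = gh.mean(axis=0)
    W2 -= lr * gW2; b2 -= lr * gb2
    W1 -= lr * gW1; b1 -= lr * gb1

mse = float(((np.tanh(x @ W1 + b1) @ W2 + b2 - y) ** 2).mean())
print(f"final MSE: {mse:.5f}")
```

Of course, the theorem only says such an approximation *exists*; that gradient descent actually finds it, as here, is a separate (and much harder) question.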

~~~
drdeca
Are there also theorems about what functions can be learned, or how quickly
they can be learned, by a multilayer feed forward neural net using
backpropagation?

~~~
jackpirate
Yes, they use what's called the VC-dimension. Chapter 20 of the book
Understanding Machine Learning (available for free online
[http://www.cs.huji.ac.il/~shais/UnderstandingMachineLearning...](http://www.cs.huji.ac.il/~shais/UnderstandingMachineLearning/))
has a good introduction to the basics of this theory.

The short story is that VC dimension describes "worst case" behavior of a
learning algorithm. This is a good predictor of performance on really hard
problems, but lots of modern problems turn out to be theoretically easy. For
example, neural networks trained on images work much better than we would
guess based on the VC dimension of neural networks. This is because the
distribution of images is closer to a best case scenario than a worst case
scenario.

Lots of research is being done to determine when we are in a best case
scenario. The "simplest" is called the Rademacher complexity. (See Chapter 26
of the UML book.) Simplest is in quotes because most PhD students in machine
learning that I've met don't even understand what the Rademacher complexity is
unless they are specializing in theory. And the Rademacher complexity doesn't
really even come close to capturing why neural networks work well on images
either.
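To make "VC dimension" concrete, here's a toy example of my own (not from the book): the class of threshold intervals on the real line has VC dimension 2. A brute-force check confirms that every labeling of 2 points is realizable by some interval, while the labeling (1, 0, 1) of 3 points is not.

```python
from itertools import product

# Hypothesis class: intervals on the line, h_{a,b}(x) = 1 if a <= x <= b
# else 0. (An "empty" interval with a > b realizes the all-zeros labeling.)
def shatters(points: list[float]) -> bool:
    pts = sorted(points)
    # Candidate endpoints just before/after each sample point suffice.
    eps = 0.5 * min(b - a for a, b in zip(pts, pts[1:]))
    ends = [p + d for p in pts for d in (-eps, eps)]
    for labels in product([0, 1], repeat=len(pts)):
        realized = any(
            all((1 if a <= x <= b else 0) == want
                for x, want in zip(pts, labels))
            for a in ends for b in ends
        )
        if not realized:
            return False
    return True

print(shatters([1.0, 2.0]))        # True  -> 2 points can be shattered
print(shatters([1.0, 2.0, 3.0]))   # False -> labeling (1,0,1) is impossible
```

So the VC dimension of intervals is 2, which via the fundamental theorem of PAC learning bounds the sample complexity of learning them.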

------
incompatible
COTS = "Common off the shelf", in case anyone is equally mystified. Google
wasn't immediately helpful.

~~~
w_t_payne
Commercial Off The Shelf, I thought.

~~~
vog
Indeed, it is "Commercial Off The Shelf". I've sometimes seen it in marketing
speak, too.

~~~
incompatible
Apparently "Consumer Off-The-Shelf" is another variant.

------
therandomoracle
The authors have put out the datasets:
[https://github.com/shensq04/EKLAVYA](https://github.com/shensq04/EKLAVYA)

------
haberman
This would only be useful for binaries where you don't have corresponding
source, but the binary does still have symbols. Is this a common situation?

~~~
jdblair
It's usually all you get if you are reverse engineering someone else's
executable or firmware image.

~~~
haberman
But they leave symbols in? They don't strip those also?

~~~
sillysaurus3
Indeed. If you start Emacs on OSX and run `symbols Emacs`, you get a lot of
info:

      100139190 (    0x90) verror [FUNC, EXT...
      100139220 (   0x210) Fcommandp [FUNC, ...
      100139430 (    0xf0) Fautoload [FUNC, ...
      100139520 (    0x90) un_autoload [FUNC...
      1001395b0 (    0x80) Feval [FUNC, EXT,...
      100139630 (    0x50) record_in_backtra...
      100139680 (   0x1d0) apply_lambda [FUN...
      100139850 (   0x390) Fapply [FUNC, EXT...
      100139be0 (   0x5c0) Ffuncall [FUNC, E...
      10013a1a0 (    0x90) Frun_hooks [FUNC,...
      10013a230 (   0x1d0) run_hook_with_arg...
      10013a400 (    0x20) funcall_nil [FUNC...
      10013a420 (    0x20) Frun_hook_with_ar...
      10013a440 (    0x20) Frun_hook_with_ar...
      10013a460 (    0x30) Frun_hook_with_ar...
      10013a490 (    0x30) funcall_not [FUNC...
    

So why not strip these? Well, you can't. But the reason is interesting. Most
programs run by using shared libraries, and these libraries need to provide a
standard API that other programs can use. Hence the function names. Which of
course is exactly the info that a hypothetical bad guy needs to make sense of
the program.

But it goes beyond this. You could imagine statically linking a program
against all of its dependencies, then stripping all possible symbols. The
thing is, there is plenty of info in the binary to make sense of the program.
Strings, for example. Error messages. Every piece of data you want to show to
the user is an indication of what the surrounding function is doing.
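The classic `strings` utility does exactly this, and it's trivial to sketch (my own minimal version; the blob below is a made-up example, not a real binary):

```python
import re

# Minimal strings(1): pull runs of 4+ printable ASCII bytes out of an
# arbitrary binary blob. Even a fully stripped executable leaks its
# error messages and file paths this way.
def extract_strings(blob: bytes, min_len: int = 4) -> list[str]:
    pattern = rb"[\x20-\x7e]{%d,}" % min_len
    return [m.decode("ascii") for m in re.findall(pattern, blob)]

blob = (b"\x7fELF\x02\x01\x00\x00"
        b"error: config not found\x00\x90\x90"
        b"/usr/lib/libfoo.so\x00")
print(extract_strings(blob))
# ['error: config not found', '/usr/lib/libfoo.so']
```

Those two recovered strings are exactly the kind of foothold an analyst uses to start naming functions.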

IDA Pro is pretty incredible in that situation. Each time you figure out what
a function is doing, you just give it whatever name you want. IDA updates the
whole interface so that it uses that name everywhere, instead of the hex
address. You end up with a neat, perfectly sensible program. Entire game
engines have been meticulously reverse engineered this way.

However, viruses use a crafty technique to prevent analysis. You encrypt your
program like an onion. Your outer program runs, and everyone can see that. But
it's carrying around a blob of functionality that was encrypted _using the
target's information_. E.g. their MAC address, or a subset of their list of
installed programs, or anything that would uniquely identify your target
separately from everyone else. Any time the virus is installed anywhere, it
tries to decrypt itself using this information, which only succeeds when it's
installed on the target machine.

This defeats all attempts at analysis. You can't analyze what you can't
decrypt.

~~~
heavenlyblue
This is an interesting concept. Can you provide any links to the practical
info re. this?

It seems to be quite easy to do something similar; especially if you're
specifically targeting a single PC with a very specific configuration.

How do you ensure that the amount of entropy in the key is enough to stop a
person from finding the key? Assuming the person reverse-engineering the virus
already knows the key generation routine, since he has access to the binary of
the virus.

Hostnames and lists of installed programs should be prone to a dictionary
attack, and MAC addresses are not even close to having enough bytes.

~~~
sillysaurus3
Yeah! I have to run for a bit but I'll be back. There's an incredible article
about Flame I detailed a bit here:

[https://news.ycombinator.com/item?id=15046089](https://news.ycombinator.com/item?id=15046089)

It depends entirely whether you have persistent access to the target. If your
target is airgapped, your only option is to know something about the target
machine (i.e. have a spy on the inside) that gives you enough info to encrypt
the virus. For example, they could install a special program so that it's
listed in C:\Program Files, then the virus decrypts using the string
"${mac_addr}${x}" for x in [list of installed programs]. So as an analyst, you
won't have any idea what the magic program name was.

That technique was likely Stuxnet, not Flame, so that article might not
contain any info about it. But it's amazing in its own way. If you have any
other questions too, I love chatting about this stuff.

