
A Neural-Based Program Decompiler (2019) - globuous
https://arxiv.org/abs/1906.12029
======
ckastner
> _Empowered by the two-phase design, Coda achieves an average program
> accuracy of 82% on various benchmarks. While the Seq2Seq model with attention
> and the commercial decompilers yield 12% and 0% accuracy, respectively._

A commercial decompiler yielding 0% accuracy sounds odd.

~~~
lunixbochs
I was thrown by that at first, but it sounds about right if accuracy means any
of these things (a rough token-level sketch follows at the end of this
comment):

\- The same AST graph.

\- The same types.

\- The same text tokens.

I assume they’re talking about Hex Rays. Optimizing compilers produce code
structured differently from your input, and Hex Rays takes many liberties with
its output. It’s not _trying_ to match the input perfectly; it’s trying to
emit valid C that a human can understand. It’s full of casts and weird control
flow. A break might turn into a goto. Functions will be inlined. Structure and
class information is lost. A switch might turn into a few if statements, in
the wrong order.

Main caveats:

\- Hex Rays output is best when massaged by an experienced user (which is the
primary mode of use; it’s an interactive tool, not a one-way transform). I
assume that didn’t happen here.

\- The accuracy is probably based on literal structure or tokens, and Hex Rays
doesn’t (and probably shouldn’t) try to guess the original structure to that
degree. Decompiler output is noisy.
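
For what it’s worth, here is a minimal sketch of what a token-level “accuracy”
metric might look like. The tokenizer and the Hex Rays-style sample output are
my own illustrations, not from the paper; the point is that a semantically
fine decompilation can still score terribly on exact matching:

    import re

    def tokenize_c(src):
        # Crude C tokenizer: identifiers, numbers, multi-char operators, punctuation.
        return re.findall(r"[A-Za-z_]\w*|\d+|==|!=|<=|>=|&&|\|\||[^\s\w]", src)

    def token_accuracy(original, decompiled):
        # Fraction of positions where the two token streams agree exactly.
        a, b = tokenize_c(original), tokenize_c(decompiled)
        matches = sum(x == y for x, y in zip(a, b))
        return matches / max(len(a), len(b), 1)

    original = "int add(int a, int b) { return a + b; }"
    decompiled = "__int64 add(int a1, int a2) { return (unsigned int)(a1 + a2); }"
    print(token_accuracy(original, decompiled))  # ~0.36 despite identical behavior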

~~~
rfoo
Even worse, they are talking about RetDec, which is by no means a commercial
decompiler. I mean, it's not even Ghidra.

I guess their accuracy means "the same AST", because according to Appendix F
of this paper, they showed at least one case where RetDec did produce a
correct decompilation, but with optimization quirks, so it doesn't look
exactly the same as the original.

~~~
lunixbochs
Well yeah, RetDec, Hopper, Binary Ninja HLIL, Snowman, Ghidra... none of these
“direct” decompilers are going to match the input token for token or AST node
for node, even without symbol names. To do so you’d need a lot of heuristics
about the specific compiler, many of which are probably accidental parameters
(branch layout, register picking, order of operations, vectorization,
inlining). That’s the sort of task that’s perfect for a neural network (lots
of hidden parameters to learn about each compiler version) and hard for a
hand-written program.

Now that I’ve said it out loud I bet accuracy goes way down if you train or
evaluate against a wide spectrum of compiler versions, as these hidden
parameters will change.

...but those parameters also don’t entirely matter. So I think they probably
need a better evaluation method.
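
One possibility, sketched under the assumption that you have compilable
sources on both sides (everything here, including the plain `cc` invocation,
is illustrative rather than anything from the paper): score decompilations by
behavior instead of text, compiling both versions and comparing their outputs
on random inputs.

    import os, random, subprocess, tempfile

    def compile_c(src):
        # Write a C source string to a temp file and compile it with the system cc.
        fd, c_path = tempfile.mkstemp(suffix=".c")
        with os.fdopen(fd, "w") as f:
            f.write(src)
        exe_path = c_path[:-2]
        subprocess.run(["cc", c_path, "-o", exe_path], check=True)
        return exe_path

    def io_equivalent(original_src, decompiled_src, trials=100):
        # Call the two programs equivalent if they agree on random command-line inputs.
        exe_a, exe_b = compile_c(original_src), compile_c(decompiled_src)
        for _ in range(trials):
            arg = str(random.randint(-1000, 1000))
            out_a = subprocess.run([exe_a, arg], capture_output=True).stdout
            out_b = subprocess.run([exe_b, arg], capture_output=True).stdout
            if out_a != out_b:
                return False
        return True

Sampling can’t prove equivalence, but it wouldn’t punish a correct
decompilation for cosmetic differences the way token or AST matching does.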

------
kamocyc
A curated list of awesome decompilation resources and projects.

[https://github.com/nforest/awesome-decompilation](https://github.com/nforest/awesome-decompilation)

------
ackbar03
I've always wondered whether ML / deep learning can be used efficiently for
code deobfuscation

------
seek3r
The model gives you approximate code based on a set of inputs and outputs you
feed it; each input/output pair sheds light on a different execution path and
improves the approximation.

Still, not bad.
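
In that spirit, a toy sketch of the loop being described: each input/output
pair rules out candidates that don't reproduce it. (The candidate pool and
examples are made up; the paper generates and repairs candidates with neural
models rather than enumerating them.)

    # Observed input/output pairs from running the original program.
    examples = [(0, 1), (1, 2), (5, 6)]

    # Toy stand-ins for generated candidate decompilations.
    candidates = [
        lambda x: x,        # ruled out by (0, 1)
        lambda x: 2 * x,    # ruled out by (0, 1) as well
        lambda x: x + 1,    # consistent with every example
    ]

    def first_consistent(pool, ios):
        # Return the first candidate that reproduces all observed I/O pairs.
        for f in pool:
            if all(f(x) == y for x, y in ios):
                return f
        return None

    best = first_consistent(candidates, examples)
    print(best(10))  # 11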

------
sdmike1
Do they have a release of this? Github or otherwise?

------
latenightcoding
"Neural-based", what a clickbaity way to say ml-assisted.

~~~
carlmr
If it's a neural network, it's kind of more accurate, although the wording is
weird. I would find ML-assisted just as attractive to click on though, maybe
even more, since it sounds more professional, but that's subjective.

