A Neural-Based Program Decompiler (2019) (arxiv.org)
51 points by globuous 28 days ago | 12 comments

> Empowered by the two-phase design, Coda achieves an average program accuracy of 82% on various benchmarks. While the Seq2Seq model with attention and the commercial decompilers yield 12% and 0% accuracy, respectively.

A commercial decompiler yielding 0% accuracy sounds odd.

I was thrown by that at first but it sounds about right if accuracy means any of these things:

The same AST graph.

The same types.

The same text tokens.
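To make those metrics concrete, here's a minimal Python sketch (a hypothetical example, with Python standing in for C): a decompilation that merely renames one variable is semantically correct, yet fails both exact-token and exact-AST comparison.

```python
import ast
import io
import tokenize

# Hypothetical original vs. decompiler output: identical semantics,
# but the decompiler invented the name "v1" instead of "y".
original   = "def f(x):\n    y = x + 1\n    return y * 2"
decompiled = "def f(x):\n    v1 = x + 1\n    return v1 * 2"

def tokens(src):
    """Token stream as (type, string) pairs, ignoring positions."""
    return [(t.type, t.string)
            for t in tokenize.generate_tokens(io.StringIO(src).readline)]

# Exact-match metrics score this as a complete miss...
assert tokens(original) != tokens(decompiled)
assert ast.dump(ast.parse(original)) != ast.dump(ast.parse(decompiled))

# ...even though both programs compute the same function.
ns1, ns2 = {}, {}
exec(original, ns1)
exec(decompiled, ns2)
assert ns1["f"](3) == ns2["f"](3) == 8
```

Under a metric like this, any decompiler that doesn't recover the original identifiers scores near zero, no matter how faithful its output is.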

I assume they’re talking about Hex Rays. Optimizing compilers result in different code structure than your input, and Hex Rays takes many liberties on output. It’s not _trying_ to match the input perfectly, it’s trying to emit valid C that a human can understand. It’s full of casts and weird control flow. A break might turn into a goto. Functions will be inlined. Structure and class information is lost. A switch might turn into a few if statements, in the wrong order.
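For the control-flow point specifically, here's a small illustration (a Python sketch of the idea, not actual Hex Rays output): two functions with identical behaviour whose structure a syntactic comparison would treat as completely different.

```python
# Hypothetical "original": early return from a for loop.
def first_negative(xs):
    for i, x in enumerate(xs):
        if x < 0:
            return i
    return -1

# Decompiler-style restructuring: same behaviour, but the early
# return has become a flag variable plus a break (analogous to a
# break turning into a goto, or a switch into if-chains).
def first_negative_restructured(xs):
    result = -1
    i = 0
    while i < len(xs):
        if xs[i] < 0:
            result = i
            break
        i += 1
    return result

# Behaviourally identical on every input...
for xs in ([], [3, 1, 4], [3, -1, 4], [-5]):
    assert first_negative(xs) == first_negative_restructured(xs)
# ...but structurally (token-for-token, node-for-node) unrelated.
```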

Main caveats:

- Hex Rays output is best when massaged by an experienced user (which is the primary mode of use. It’s an interactive tool, not a one way transform). I assume that didn’t happen here.

- The accuracy is probably based on literal structure or tokens, and Hex Rays doesn’t (and probably shouldn’t) try to guess the original structure to that degree. Decompiler output is noisy.

Even worse, they are talking about RetDec, which is by no means a commercial decompiler. They didn't even use Ghidra.

I guess their accuracy means "the same AST", because according to Appendix F of this paper they showed at least one case where RetDec did produce a correct decompilation, but with optimization quirks, so it doesn't look exactly the same as the original.

Well yeah RetDec, Hopper, Binary Ninja HLIL, Snowman, Ghidra... none of these “direct” decompilers are going to match the input token for token or AST node for node, even without symbol names. To do so you’d need a lot of heuristics about the specific compiler, many of which are probably accidental parameters (branch layout, register picking, order of operations, vectorization, inlining). The sort of task that’s perfect for a neural network (lots of hidden parameters to learn about each compiler version) and hard for a hand written program.

Now that I’ve said it out loud I bet accuracy goes way down if you train or evaluate against a wide spectrum of compiler versions, as these hidden parameters will change.

...but those parameters also don’t entirely matter. So I think they probably need a better evaluation method.

The commercial decompiler they compared against is RetDec, which is not really a competent commercial decompiler. By "the commercial decompiler" I was expecting to see IDA Pro, the de facto standard of this industry, or at least Ghidra if they needed an open-source, free one.

Also, I'd argue their benchmark is unfair. Check out Appendix F: in the second example RetDec actually produced a correct result (though its output is not the most readable, and it didn't de-optimize enough to remove the noise introduced by compiler optimization), but they dismissed it as "difficult for human understanding". And in the first example I suspect they hand-picked a case where RetDec failed to perform function signature recovery because the function argument is floating point.

Edit: also, they ran their benchmark on MIPS, because apparently their NN performs worse on x86 than on MIPS. But no traditional decompiler was heavily optimized for MIPS.

There seem to be some copy&paste errors in their NN-decompiled version in appendix F as well, which strikes me as odd.

Their version (ii) cannot compile, since the variable names don't match (e.g. they return "c" instead of "v3" and multiply by "b3" instead of "v3").

They did use x86-64 ISA as well, though without any optimisation enabled, i.e. "-O0" (see Table 5 in appendix D). I doubt those results are very useful in practice due to the complete lack of optimisation.

A curated list of awesome decompilation resources and projects.


I've always wondered whether ML / deep learning can be used efficiently for code deobfuscation.

The model gives you approximate code based on a set of inputs and outputs you feed to it; each input and output helps the model shed light on a different execution path and improves the approximate code.
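As a toy illustration of that loop (purely hypothetical, not the paper's actual error-correction machinery): take a draft program that is nearly right and search its neighbourhood until every input/output example passes.

```python
# I/O examples for the unknown target function (here, 2*x + 3).
io_pairs = [(1, 5), (2, 7), (3, 9)]

def repair(io_pairs, draft=(2, 1)):
    """Start from a draft (a, b) for 'a*x + b' and search nearby
    constants until all I/O examples are satisfied."""
    a0, b0 = draft
    for a in range(a0 - 2, a0 + 3):
        for b in range(b0 - 2, b0 + 3):
            if all(a * x + b == y for x, y in io_pairs):
                return a, b
    return None  # no fix within the search radius

assert repair(io_pairs) == (2, 3)
```

Each additional I/O pair prunes candidates that happened to agree on the earlier examples, which is the "shedding light on a different execution path" part.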

Still, not bad.

Do they have a release of this? Github or otherwise?

"Neural-based", what a clickbaity way to say ml-assisted.

If it's a neural network, "neural-based" is actually the more accurate term, though the wording is odd. I'd find "ML-assisted" just as attractive to click on, maybe even more so, since it sounds more professional; but that's subjective.
