
Decompiler Design - monocasa
http://www.backerstreet.com/decompiler/introduction.htm
======
jcranmer
Unfortunately, this presents only rather old, naive notes on decompilers, with
little discussion on even the advances made by Mike Van Emmerik's decompiler
work, let alone newer developments like Phoenix, DIVINE, or TIE. There's
actually been a lot of work on things like variable, type, and structure
recovery in the past decade, but it's usually presented not as decompiler work
but as static binary analysis.

~~~
monocasa
I linked to this with the idea that it would just be an intro. I really hoped
that someone with more recent knowledge would chime in. : )

Any chance you could recommend a few good recent papers that I could kick off
a good citation graph binge on?

~~~
jcranmer
Probably the easiest way to find relevant papers is to do reverse-citation on
Cifuentes's work ("Reverse Compilation Techniques" and "Structuring Decompiled
Graphs").

The newest paper I've found (I was specifically looking at control flow graph
structuring, which ends up being nasty to search for) is
https://net.cs.uni-bonn.de/fileadmin/ag/martini/Staff/yakdan/dream_ndss2015.pdf.

Edward Schwartz's PhD thesis
(https://users.ece.cmu.edu/~ejschwar/papers/arthesis14.pdf)
covers in detail the Phoenix decompiler, which is another good jumping-off
point.

All of the papers I've mentioned can be found in citations from those two
sources.

As for the state-of-the-art, the basic disassembly algorithm has been well-
known for a while: recursive disassembly, which is the sort of obvious
algorithm you'd think of yourself if left to your own devices. The practical
answer is generally "use IDA," because indirect control flow boils down to
"know every single pattern that compilers generate," and IDA already knows
every single pattern. Structuring control flow graphs is more or less complete
in the same sense as parsing is, which is to say that we know a few different
ways to do it, each of which sucks in its own special way, but you're not going
to find fertile ground for new work, as the existing solutions are generally
good enough.

The real hard part of decompilation remaining, then, is variable and type
recovery. We sort of know how to do this if you assume structs, unions, and
arrays don't exist (which is to say, it's more or less solved if you were to
try this on CIL or JVM bytecode), but those are a massive stumbling block in
native executable decompilation. I rather suspect this is going to remain
unsolved until we get a good pointer and dataflow analysis on binaries that
can handle integer/pointer duality and type unsafety, which puts it at least a
decade away in my estimation.

------
paulfr
There is some very shady SEO going on here. The French translation link goes
to a Wordpress blog with a very bad translation of the introduction. The root
of the translation website is a completely unrelated Dutch used car website.

Other pages on the translation blog follow the same pattern: an old webpage
(this HN entry is from 2009, but other pages go back to 2000) is edited years
later to link back to a bad translation in some random language.

This is very far from my area of expertise, so I'm curious to know if this is
well-known, and if those pages were hacked or if this was inserted by the
owner himself.

EDIT: This is fascinating. I'm finding dozens of similar blogs, with links
spanning at least a thousand hosts, which I'm assuming were hacked. This is a
big network, with sophisticated wording designed to look as innocent as
possible.

~~~
jcranmer
The content seems to be written by a hobbyist decompiler developer.
Unfortunately, the state of the art that it presents seems to rely heavily on
the Program Transformation Wiki pages, which appear to have become rather
defunct (on further research, those pages appear to have been primarily
maintained by Mike Van Emmerik, and both Van Emmerik and Cristina Cifuentes
appear to have moved on from binary decompilation a while ago).

As a result, the academic state of the art that it relies on is a decade out
of date. While decompilation is nowhere near a solved problem, there have been
advances in variable and type recovery in the academic literature that the
author of these pages doesn't seem to be aware of.

The content of the pages mostly seems to be the sort of basic material that I
wouldn't even bother citing in a research paper. I'm not sure there's any
information in there that you wouldn't find in, say, the dragon book.

------
mdaniel
I'll bite: is there anything novel presented over these book-like pages, or is
it just an introductory text?

The lack of navigation syncing up with the page I'm on makes it hard to tell
if I'm 10% or 90% through the content. If there was a printer-friendly view,
it would be much easier to evaluate whether this content is relevant to me or
not.

------
tropo
The hex-rays info is obsolete/wrong. Corrections...

Platform: Windows/Linux/Mac

Binary Format Support: ELF, PE-COFF exe, DOS exe, a.out, MachO, raw binary,
etc.

Interactive/Batch: both

Recompilable Output: yes, of course

Structuring: very good

Variables: very good

Types: very good (after you mark parameters/globals)

~~~
jcranmer
My recollection of IDA is that it completely keels over if you give it
structures or arrays (unless you take the time to manually teach it about
every little thing), and a colleague of ours has confirmed that Hex-Rays does
a horrible job here as well in a similar situation. The exact example was:
    typedef struct { int x, y; } Point;

    int main() {
        int i;
        Point p[5];
        for (i = 0; i < 5; ++i) {
            p[i].x = 1;
            p[i].y = 2;
        }
        return p[0].y;
    }

which Hex-Rays turned into:

    int __cdecl main(int argc, const char **argv, const char **envp)
    {
      int v3; // ebp@0
      signed int v4; // ecx@1
      int v5; // eax@1

      v4 = 0;
      v5 = v3 - 40;
      do
      {
        *(_DWORD *)(unsigned int)v5 = 1;
        *(_DWORD *)(unsigned int)(v5 + 4) = 2;
        v5 += 8;
        ++v4;
      }
      while ( v4 < 5 );
      return *(_DWORD *)(unsigned int)(v3 - 36);
    }

(Apologies for any typos, I'm transcribing this from a second computer).

To be fair, most decompilers or binary analysis tools tend to have massive
problems in this regard. It's only really in the past decade or so that the
research community has started to shift from heuristic-heavy, pattern-matching
based approaches to more principled, general-purpose algorithms, and IDA Pro
is best characterized as "every heuristic ever developed."

~~~
burfog
Well yes and no. You normally need to teach IDA once, and sometimes it needs a
reminder. This means setting a proper prototype on the function before you
attempt to decompile it.

------
kungfooman
Actually release real source code to work with, instead of this useless pseudo
shit.

------
tux
Interesting, can we please have a PDF version of this?

~~~
iuyoynp
Dear hacker.

Save as PDF.

