

How decompilers work - drx
http://archfinch.com/item/21ace/i-wonder-how-decompilers-work#topcomment

======
demallien
The cynical part of me wants to respond with one word: badly!

But I've spent enough time hand-reversing code to know that this doesn't do
justice to the work done by decompiler writers. It's just that doing it
properly is a problem that requires strong AI.

~~~
mjb
I don't think strong AI would solve the problem, because one crucial piece of
data is lost when the code is compiled: the intention of the programmer. Good
code clearly indicates its intended function through structure, most of which
does not translate into assembly. When decompiled, this structure needs to be
inferred from the assembly, which contains an incomplete view of the required
information.

This missing information includes, but is not limited to, the extra
information that the programmer put in deliberately: class names, variable
names, comments, etc. They also encoded a large number of assumptions about
typical program inputs, the execution environment, expected and unexpected
branches, chunks of logic (classes, functions, etc) into the code. With all
this lost, even a strong AI would not be able to piece together the original
program.

Instead, I think, the best that could be hoped for is a sort of uncanny-valley
zombie of the original code. Certainly better than nothing, but totally not a
replacement for your dead wife.

~~~
demallien
No, strong AI will solve the problem - just ask geohots... An uncrackable
system has not yet been invented; the best we've been able to do is to make
the process slower.

~~~
mjb
I think it comes down to which problem you are trying to solve:

1) Recover enough to understand the program and make small changes (this is
the cracking case).

2) Recover the original code of the program, or something close to it, that
can be 'stolen' and maintained without much more overhead than if you had the
original code.

(1) is not easy, but is becoming easier. (2) is still a long way off, and
isn't solvable without a good understanding of the domain the original code
was written for.

~~~
spc476
There are legitimate cases for (2). I've worked in companies that have lost
the source code to programs they developed. If you're wondering how source
code can be lost: think of no revision control, a poor backup policy, and a
program that goes a long time between modifications.

------
xcallemjudasx
As a college student I found this article very easy to read and understand. It
gave a simple enough overview with enough detail to explain but not confuse.

------
T-R
Have you tried applying any of this to (segments of) 68k binaries?

You mentioned that a decompiler would have been only slightly useful for older
games written in C/C++; is flow analysis not helpful in practice for
simplifying code that was originally hand-coded in assembly?

Edit1: Thanks for the link to the paper[1], by the way. Do you know of any
other good resources?

Edit2: Never mind, I see you have them on GitHub[2]. Thanks again.

[1] <http://www.ci.tuwien.ac.at/~grill/decompilation_thesis.pdf>
[2] <https://github.com/drx/ocd>

~~~
drx
I started writing a 68k ASM module a while ago, but I never finished it.

Hand-coded assembly code is better in that it's less optimized and mangled,
but worse in that it has much less structure, so I can't convert ASM code
into C. Each programmer has their own way of implementing certain programming
patterns, etc.

IDA Pro is actually quite good at flow analysis of ASM code.
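
For readers wondering what "flow analysis" means here, below is a minimal
sketch of the textbook first step: splitting a flat instruction listing into
basic blocks and recording which block can follow which. This is not IDA Pro's
algorithm and not the code from the article. The toy instruction set (MOV, CMP,
SUB, JZ, JMP, RET), the tuple format, and the name split_basic_blocks are all
made up for illustration, and jump targets are assumed to be valid instruction
indices.

```python
def split_basic_blocks(instrs):
    """Split a toy instruction list into basic blocks.

    instrs is a list of (opcode, operand) tuples. For the hypothetical jump
    opcodes "JMP" (unconditional) and "JZ" (conditional), the operand is the
    index of the target instruction. Returns {start: (end, successors)},
    where end is exclusive and successors are block start indices.
    """
    if not instrs:
        return {}

    # A "leader" is the first instruction of a basic block.
    leaders = {0}
    for i, (op, arg) in enumerate(instrs):
        if op in ("JMP", "JZ"):
            leaders.add(arg)              # a jump target starts a block
            if i + 1 < len(instrs):
                leaders.add(i + 1)        # so does the instruction after a jump

    starts = sorted(leaders)
    blocks = {}
    for start, end in zip(starts, starts[1:] + [len(instrs)]):
        last_op, last_arg = instrs[end - 1]
        succs = []
        if last_op in ("JMP", "JZ"):
            succs.append(last_arg)        # edge to the jump target
        if last_op not in ("JMP", "RET") and end < len(instrs):
            succs.append(end)             # fall-through edge
        blocks[start] = (end, succs)
    return blocks


# A toy countdown loop: instruction 4 jumps back to instruction 1.
COUNTDOWN = [
    ("MOV", None),   # 0: initialise counter
    ("CMP", None),   # 1: loop head, test counter
    ("JZ", 5),       # 2: exit the loop when the counter is zero
    ("SUB", None),   # 3: loop body
    ("JMP", 1),      # 4: back edge to the loop head
    ("RET", None),   # 5: exit
]

for start, (end, succs) in split_basic_blocks(COUNTDOWN).items():
    print(f"block {start}..{end - 1} -> {succs}")
```

The edge from the block starting at 3 back to the block starting at 1 is what
a decompiler would later try to recognise as a loop. With hand-written
assembly, nothing guarantees the graph falls into such tidy shapes.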

------
shasta
Decompilation is undecidable?

    "you cannot write an algorithm that would decompile every possible piece of code"

I'm not sure what claim is actually being made here, but decompilation for any
particular compiler is trivially decidable: try every possible input program
until you find one that compiles to the observed output.
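
To make the brute-force idea concrete, here is a minimal sketch under heavy
assumptions. The "compiler" is a made-up toy (toy_compile) that turns
arithmetic over x, 1, 2, + and * into stack-machine ops and folds constants;
it borrows Python's ast module only as a parser. Nothing here reflects a real
compiler or the decompiler from the article.

```python
import ast
from itertools import count, product

ALPHABET = "x12+*"  # every character a toy source program may contain


def _emit(node):
    """Translate one parsed expression node into a tuple of stack ops."""
    if isinstance(node, ast.Constant) and type(node.value) is int:
        return (("PUSH", node.value),)
    if isinstance(node, ast.Name) and node.id == "x":
        return (("LOAD", "x"),)
    if isinstance(node, ast.BinOp) and isinstance(node.op, (ast.Add, ast.Mult)):
        left, right = _emit(node.left), _emit(node.right)
        if left is None or right is None:
            return None
        op = "ADD" if isinstance(node.op, ast.Add) else "MUL"
        # Constant folding: this is where source information gets lost.
        if (len(left) == 1 and len(right) == 1
                and left[0][0] == "PUSH" and right[0][0] == "PUSH"):
            a, b = left[0][1], right[0][1]
            return (("PUSH", a + b if op == "ADD" else a * b),)
        return left + right + ((op,),)
    return None  # anything else is not part of the toy language


def toy_compile(src):
    """Return the compiled ops, or None if src is not a valid toy program."""
    try:
        tree = ast.parse(src, mode="eval").body
    except SyntaxError:
        return None
    return _emit(tree)


def brute_force_decompile(target):
    """Enumerate sources shortest-first until one compiles to target.

    Terminates only if target really is an output of toy_compile.
    """
    for length in count(1):
        for chars in product(ALPHABET, repeat=length):
            src = "".join(chars)
            if toy_compile(src) == target:
                return src


binary = toy_compile("1+1+x*2")
print(brute_force_decompile(binary))  # prints 2+x*2
```

The search does find a program that compiles to the same output, but not the
one that was written: the 1+1 was folded away, so no enumeration can tell it
apart from a literal 2. It also only halts when the binary is known to come
from this exact compiler.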

~~~
drx
The compiler isn't part of the input. You could get code that was compiled by
a compiler not available to you.

When the compiler is part of the input, then yes, the problem is decidable.

~~~
demallien
I don't think it is decidable even when you have the compiler available to
you. What happens if I decide to insert a bunch of assembly code in my
gcc-compiled program using the gcc embedded assembly extensions? How do I
decide that a particular sequence of generated op-codes is actually impossible
from pure C code, and hence must have been assembly in the original program?

~~~
peti
In that case, the GCC extension is part of the input language (in a broader
sense). Still, your example clearly shows the difficulties involved in
practice.

