How decompilers work (archfinch.com)
82 points by drx on May 5, 2011 | 59 comments



The cynical part of me wants to respond with one word: badly!

But I've spent enough time hand-reversing code to know that that doesn't do justice to the work done by decompiler writers; it's just that decompilation is a problem that requires strong AI to do properly.


I don't think strong AI would solve the problem, because one crucial piece of data is lost when the code is compiled: the intention of the programmer. Good code clearly indicates its intended function through structure, most of which does not translate into assembly. When decompiled, this structure needs to be inferred from the assembly, which contains an incomplete view of the required information.

This missing information includes, but is not limited to, the extra information that the programmer put in deliberately: class names, variable names, comments, etc. The programmer also encoded into the code a large number of assumptions about typical program inputs, the execution environment, expected and unexpected branches, and chunks of logic (classes, functions, etc.). With all this lost, even a strong AI would not be able to piece together the original program.

Instead, I think, the best that could be hoped for is a sort of uncanny-valley zombie of the original code. Certainly better than nothing, but totally not a replacement for your dead wife.


"because one crucial piece of data is lost when the code is compiled: the intention of the programmer"

I've worked on many projects where that's lost when the code is written.


No, strong AI will solve the problem - just ask geohots... The system that is uncrackable has not yet been invented; the best we've been able to do is to make the process slower.


I think it comes down to which problem you are trying to solve:

1) Recover enough to understand the program and make small changes (this is the cracking case).

2) Recover the original code of the program, or something close to it, that can be 'stolen' and maintained without much more overhead than if you had the original code.

(1) is not easy, but is becoming easier. (2) is still a long way off, and isn't solvable without a good understanding of the domain the original code was written for.


There are legitimate cases for (2). I've worked in companies that have lost the source code to programs they developed. If you're wondering how source code can be lost, think of no revision control, a poor backup policy, and a program that goes a long time between modifications.


Can I ask what the deal is with the downvotes on my parent post? I made the claim that strong AI can solve the decompilation problem, noting that humans - an example of strong AI (well, except we're not artificial...) - can and do decompile successfully, giving geohots (a pre-eminent cracker) as an example. What more do people want???


I don't think you connected your arguments very well. Just because a human can do something doesn't mean that AIs are the best solution to the task. I'm not sure how your specific example plays to the general case.

AIs and humans both attempt to ape some fundamental truism or formalism about logic to compile and decompile programs. To me, we're better off knowing what those formalisms are than aping them with intelligent behavior, though finding formalisms is hard, so sometimes we make do.


> The cynical part of me wants to respond with one word: badly!

You are right.

Decompiling seems hopeless when you look at the problem. Which is one of the reasons why I wrote my decompiler -- to see what can be done.

That also depends on what problem you think decompilers are to solve. Decompilers are somewhat good at telling you additional information about the original structure of the program. They suck at recovering the original source code (which, with enough optimization and mangling, is impossible).


Applying AI to a decompilation problem is just like applying AI to a compilation problem. You're taking a potentially nondeterministic approach to a problem that should be formalized and deterministic. I think AI applied to compilers is something to avoid, in general.


>Applying AI to a decompilation problem is just like applying AI to a compilation problem

I think they're very different. Compilation strips semantic information from code. Decompilation attempts to recover semantic information from program structure and, in the AI case, domain knowledge. They are fundamentally different processes.


That's fine that they're different. My point is that they're both formal processes. Adding AI to the mix will not help you arrive at formal conclusions. The conclusions will be probabilistic, instead. If an AI could use formal routines to show you formal conclusions, then you wouldn't need an AI, you'd just need your routines.


And those formal processes will tell you, in the most formal of manners, that when information is actually removed from a system, you are by the definition of information incapable of formally reconstructing it. If you can reconstruct it, by definition it was not removed in the first place.

Intent is gone from compiled code. You can not formally reconstruct it. All you could formally get are reasonable guesses, and actually that's what the decompiler is giving you; it is not hard to interpret its output as very precisely vague on exactly the points it does not know about.


Decompiled code can end up looking like high-level assembly code, like a programmer just learning a new language who carries over all the habits of their old one.

You have to call decompiling a success if the result executes the same algorithm. Much more than that is very hard.

I wrote a similar conversion tool - PL/M to C - years ago. It did quite a good job, because it had semantic clues (not to mention the original variable names). And even that can be done badly: folks called the first-pass result "CL/M" because it was obviously not written by a C programmer.


Why do you think that? I would actually think that just statistics would also work.


Because for two years it was my job to make sure that decompilers didn't work on code generated by my company... It's so easy to throw automatic analysis tools off the track. It's easy to throw off humans too, and we're way better at this game than automated tools, at least for the time being.

To give you an idea of how easy it is, if you want the decompiler to drop the ball on a for-loop, all you need to do is something simple like increment by two and then decrement by one at the start of the loop body... The tool might be looking for the use of an INC instruction, but because you increment by two, the compiler generates an ADD... and now your tool doesn't know how to recognise the loop.
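A minimal sketch of that trick in C (the names and the exact source transformation are illustrative, and an aggressive optimizer may of course undo it):

    #include <stdio.h>

    static void process(int i) { printf("%d\n", i); }

    int main(void) {
        int n = 5;

        /* Original loop: for i++, compilers often emit a plain INC, which is
           exactly what a naive pattern-matching tool looks for. */
        for (int i = 0; i < n; i++) {
            process(i);
        }

        /* Obfuscated variant described above: bump by two in the loop header,
           drop back by one at the top of the body.  The body still sees
           i = 0..n-1, but the generated code now uses ADD/SUB instead of INC,
           so a tool keyed on INC may no longer recognise the counted loop. */
        for (int i = 1; i <= n; i += 2) {
            i -= 1;
            process(i);
        }

        return 0;
    }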


What was the reason to do that? Apart from rendering your own code completely unreadable?


That is exactly what I thought. I don't know enough about decompilation to argue with OP, but it is my impression that people are too focused on making something that is always guaranteed to work. If you would use some kind of statistical machine learning instead and learn from data (like machine translation for human languages), I would think that you can come a long way, or am I mistaken?


I don't think anyone uses decompilers to analyse machine code.


Err, what do you think decompilers are for then?


Well in reversing applications, most people use disassemblers and debuggers to step/trace/analyse the machine code.

Just because they exist doesn't mean what they provide is of any practical value; as you said yourself, they decompile badly.


The Hex-Rays decompiler is very good, and I've used it for reversing several times. In particular, it was extremely useful for our analysis of the Green Dam censorware system (http://www.cse.umich.edu/~jhalderm/pub/gd/).


Thanks for exposing that shipwreck of government-mandated Horrorware.

From the Wikipedia article: "As of 30 June 2009, the mandatory pre-installation of the Green Dam software on new computers has been delayed to an undetermined date."

Source: http://en.wikipedia.org/wiki/Green_Dam


The decompilation quality really depends on how obfuscated the original code was, and how much time you care to invest. For example, I'm using Hex-Rays to decompile Command & Conquer: Red Alert 2 and extend it, and now its output looks like [1], which is a lot better than raw assembly mnemonics. Though I've been digging into this game since late 2008, which is a long time.

[1] http://img824.imageshack.us/img824/1188/hexrayscpp.png


After playing around with the Hex-Rays C decompiler we got, I decided to stay with analyzing disassemblies.


As a college student I found this article very easy to read and understand. It gave a simple enough overview with enough detail to explain but not confuse.


Have you tried applying any of this to (segments of) 68k binaries?

You mentioned that a decompiler would have been only slightly useful for older games written in C/C++; is flow analysis not helpful in practice for simplifying code that was originally hand-coded in assembly?

Edit1: Thanks for the link to the paper[1], by the way. Do you know of any other good resources?

Edit2: Nevermind, I see you have them in github[2]. Thanks again.

[1] http://www.ci.tuwien.ac.at/~grill/decompilation_thesis.pdf

[2] https://github.com/drx/ocd


I started writing a 68k ASM module a while ago, but I never finished.

Hand-coded assembly code is better in that it's less optimized and mangled, but worse in that it has much less structure -- I can't convert that ASM code into C. Each programmer has their own way of implementing certain programming patterns, etc.

IDA Pro is actually quite good at flow analysis of ASM code.


As I can't edit my post any more, and you don't have an email address in your profile, I will double post.

> Do you know of any other good resources?

The thesis has many references, I would start with that.


Decompilation is undecidable?

   "you cannot write an algorithm that would decompile every possible piece of code"
I'm not sure what claim is actually being made here, but decompilation of any particular compiler is trivially decidable: try every input program until you find the one that produces the observed output.
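As a sketch, and only a sketch: the search could look something like this in C, where next_candidate and compile are hypothetical helpers (an enumerator over all finite source texts, and a wrapper around the fixed compiler), not real APIs:

    #include <stddef.h>
    #include <string.h>

    /* Hypothetical helpers, assumed for the sketch (not real APIs):
       next_candidate(i) returns the i-th string in some enumeration of all
       finite source texts; compile(src, out, cap) runs the fixed compiler on
       src, writes the object code into out, and returns its length (or -1 on
       a compile error). */
    const char *next_candidate(size_t i);
    long compile(const char *src, unsigned char *out, size_t cap);

    /* "Decompile" by exhaustive search: try every possible input program until
       one compiles to exactly the observed output.  This only terminates if
       the target binary really was produced by this compiler (and if every
       candidate compilation itself terminates -- see the discussion below). */
    const char *decompile_by_search(const unsigned char *target, size_t target_len) {
        static unsigned char buf[1 << 20];   /* arbitrary output buffer */
        for (size_t i = 0; ; i++) {
            const char *src = next_candidate(i);
            long len = compile(src, buf, sizeof buf);
            if (len == (long)target_len && memcmp(buf, target, target_len) == 0)
                return src;                  /* found a matching source */
        }
    }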


> I'm not sure what claim is actually being made here, but decompilation of any particular compiler is trivially decidable: try every input program until you find the one that produces the observed output.

How do you know that you'll find a program that works? What if the program wasn't generated by the compiler and that compiler can't generate that particular sequence of instructions? At what point do you know, in general, "this instruction sequence isn't derivable from this compiler!"...?


You're right. I initially took the problem to be "given this (specific) compiler output, determine the corresponding compiler input." That seems like a reasonable interpretation of "decompiling". If the problem is instead "determine whether this chunk of machine code could have come from this compiler", then my proposed algorithm obviously doesn't work.


Obviously?

Iterating over all possible input 'programs' (including buggy ones) is easy, even if the language allows for multiple input files of different types.

=> If you can give an upper bound for the length of the source code of a binary of size N, you are all set. I do not see how to prove that such an upper bound exists, but it seems fairly reasonable that one must exist. I cannot really think of any real-world compiler that would need, say, more than a GB of source code for every bit of output (it is easy to design such a language, but it would be quite esoteric).


Well, I think "obviously" was right, since my algorithm would never terminate if no source is ever found that matches the generated binary (there are infinitely many possible programs for most idealized languages).

For your idea of modifying my algorithm to bail after some number of inputs have been checked, clearly there does exist a bound on the source length necessary to generate all possible binaries of size N, since there are only finitely many such binaries (and thus they can be produced by finitely many sources). But to turn this into an algorithm, you'd need a computable bound. For a language like C, I think such a bound probably exists and is a reasonable function of the input size. For a language like Coq, though, I doubt the bound would be computable. Basically, I think that there are programs that work for reasons that are very difficult to prove.

Also, I think we're far enough from practical that appeals to the real world are somewhat silly. Either we're interested in understanding the theoretical situation, in which case lack of real world examples doesn't help, or we're interested in practical decompilation, in which case this whole conversation is stupid.


It depends on what you want from decompilation. Decompilation by its basic nature is just as decidable as compilation: it's a translation from one language to another, plus some semantics-preserving transformations. Now, what a lot of people seem to want from a decompiler, and what makes the problem seem intractable, is a guarantee that the output source will look just like the source that was originally compiled to produce the decompiler's input.

Program1 -> Compiler -> BinProg1 (compilation)

BinProg1 -> Decompiler -> Program2 (decompilation)

Even if the source language of the compiler and the target language of the decompiler are the same, you're usually still screwed, because the compiler is going to manipulate your program to optimize it or fit it onto your target architecture. The result is going to be a program that looks far more generic than your original program.

Is this undecidable? No way.


We typically reserve the term "undecidable" for the case where there is a mathematical function we are interested in, but it isn't computable by any algorithm. Inverting a many-to-one function is just impossible.


That's not what decidable means.

The code could be self-modifying. Figuring out what the code turns into, to generate a meaningful translation of it, turns into the Halting problem pretty quickly. At the limit, the best you can do is produce assembly output, and even that may be problematic if there are ambiguities in the encoding but the program uses code as data or vice versa.


Self-modifying code doesn't change my argument (see my response to kenjackson for my understanding of the problem) so long as it was produced by the compiler.


Well, you have no guarantee that a sane compiler was used, obfuscation and encryption may also play a part. Things worth reverse engineering are often convoluted in such ways.


The set of all possible inputs is infinite: Halting Problem.


The set of all possible inputs for a compiler is infinite, too. Does that mean that compilers are all harangued by the halting problem as well? Nope. Having an infinite number of possible inputs (infinite input set size) does not prevent you from showing that a program can actually halt. I think most compilers are written in a way that can be shown to halt, as well.

The Halting Problem does apply in the general case, but if you carve up your programs and reason about them, you can still show that you can have a halter. The Halting Problem just states that there does not exist a method that will take an arbitrary program and show it to be a halter.


"The set of all possible inputs for a compiler is infinite, too. Does that mean that compilers are all harangued by the halting problem as well? "

No, because unlike this "try all possible inputs" plan for decompilers, a compiler only operates on the one particular input it is given.

"I think most compilers are written in a way that can be shown to halt, as well."

Not C++ compilers (Turing Complete macros) or Lisp dialects (same issue, but even more so).


The point about the macros being Turing Complete is trivial in this instance, because if you wrote a macro that never terminated, you'd never compile anything to decompile ;-).


No, you are incorrect about the point being trivial. First off, the macros used in the original program will of course terminate, but if you simply "try every input to the compiler" as is being suggested, you will be attempting to compile any number of non-terminating macros (and you will of course not know which terminate and which do not (Halting Problem)).

The point is that "compilers terminate for every input" is trivially false.


Non-termination of some compiler inputs is a red herring. The only modification required to the original algorithm is that the search proceed in parallel (such that every input eventually makes arbitrary progress).
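A sketch of that modification in the same spirit as the earlier one, with a hypothetical compile_step helper (not a real API) that advances one candidate's compilation by a bounded amount of work:

    #include <stddef.h>
    #include <string.h>

    /* Hypothetical helper, assumed for the sketch: advance the compilation of
       candidate i by one bounded step of work.  A candidate that has already
       finished or failed just keeps reporting its final status. */
    typedef enum { STEP_RUNNING, STEP_FAILED, STEP_DONE } step_result;
    step_result compile_step(size_t i, const unsigned char **out, size_t *out_len);

    /* Dovetailed search: in round k, give one step of work to each of the
       first k candidates.  Every candidate eventually receives arbitrarily
       many steps, so a single non-terminating compilation can no longer block
       the search as a whole. */
    size_t decompile_dovetailed(const unsigned char *target, size_t target_len) {
        for (size_t k = 1; ; k++) {
            for (size_t i = 0; i < k; i++) {
                const unsigned char *out;
                size_t out_len;
                if (compile_step(i, &out, &out_len) == STEP_DONE &&
                    out_len == target_len &&
                    memcmp(out, target, target_len) == 0)
                    return i;   /* index of a source that compiles to target */
            }
        }
    }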

I agree with your point in a sister thread that decidable doesn't imply practical, but the claim of the original article was undecidable, which has a technical meaning.


I would qualify that statement better by saying that "compilers with Turing Complete macros or type systems terminate for any input" is trivially false. The point may not be trivial (which I'm not ready to concede), but it still doesn't matter. The author's approach implicitly relies on the evasion that programs aren't written in this (rather strange) way. If it's safe to assume that you can eliminate programs written in such a way that they will not compile, then the Halting Problem as you've stated it doesn't apply.

I think you may be right about the point you're trying to make, but the point I'm making is that your point covers a negligible fraction of the programs that the author would hope to analyze, so it doesn't matter.


My point is not about checking if the given program will halt or not.

The parent comment's suggestion was: (1) for each possible input program, (1.a) compile it, and (1.b) check if the result equals the given compiled code.

Agreed, steps (1.a) and (1.b) terminate deterministically (for a given compiler).

However, the search space for this search procedure is infinite.

Similarly, it would be impossible, in general, to exhaustively test every possible input of a compiler.


And yet, if the output you are decompiling is known to come from a particular compiler, the search will terminate.


If you know the compiler (and every other tool involved in the process), then yes, there exists at least one solution (the original input). The search will eventually find it.


Assuming all possible inputs to the compiler cause it to halt. Not true for many high-profile languages.

And while not strictly related to "is it decidable", in the real world we have to keep in mind the complexity of the solution. The presence of the naive solution for a particular language doesn't say much about how well we can actually do it. Is the problem actually decidable within the confines of the physical observable universe for real-world inputs?


"Assuming all possible inputs to the compiler cause it to halt."

If the compiler did not halt, then we don't have anything to disassemble in the first place.

Formally, it is obvious that a diagonalization-based search will eventually find a correct input, regardless of the halting status of any given input. In practice, none of this matters very much.


As I explained in my other comment, the compiler certainly does halt for the correct input, but that does not mean it will halt for all inputs.

You are however correct that this can be circumvented with diagonalization.


The compiler isn't part of the input. You could get code that was compiled by a compiler not available to you.

When the compiler is part of the input, then yes, the problem is decidable.


I'm still doubtful. I think you're trying to invoke Rice's theorem here, but I don't think it really applies. Compilers don't apply arbitrary transformations that preserve behavior - they only employ transformations that are understood to preserve behavior.


I don't think it is decidable even when you have the compiler available to you - what happens if I decide to insert a bunch of assembly code in my gcc-compiled program using the gcc embedded assembly extensions? How do I decide that a particular sequence of generated op-codes is actually impossible from pure C code, and hence must have been assembly in the original program?
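For instance (an illustrative x86 snippet using gcc's extended inline asm syntax, not anything specific from the thread):

    #include <stdio.h>

    /* Add two ints via gcc's extended inline assembly (x86).  The instruction
       here is chosen by the programmer, not by the compiler's code generator,
       so a decompiler that only models "what gcc emits from pure C" has no
       source-level construct to map it back to. */
    static int add_asm(int a, int b) {
        int result;
        __asm__ ("addl %2, %0"
                 : "=r"(result)
                 : "0"(a), "r"(b));
        return result;
    }

    int main(void) {
        printf("%d\n", add_asm(2, 3));
        return 0;
    }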


In that case, the GCC extension is part of the input language (in a broader sense). Still, your example clearly shows the difficulties involved in practice.


The problem only makes sense if the compiler is fixed. Otherwise, it's trivial: for any input x I can say the source code is x. It's source code written in a programming language called MachineLanguage. The compiler is called cat.


I think he had in mind that the language is fixed, but not the compiler.


In this context, "programming language" and compiler are really the same thing: a mapping from source code to object code.

In any case, the problem with your original argument is not the choice of compiler; it's that your algorithm might not halt, as explained by kenjackson.



