
A native code to C/C++ decompiler - fla
http://derevenets.com/index.html
======
userbinator
No source? The first thing I did was try it on itself... which I suppose is
somewhat of an "acid test". It took a few minutes and an enormous amount of
memory, but finally it told me that the function at the very beginning of the
executable, definitely a nontrivial one, decompiles to...

    
    
        void fun_401000() {
        }
    

I'm sure I hit upon some edge case and much better output can be had from this
tool if I play with it some more, but for a first impression, not so good. But
I'm definitely going to keep this one around, it looks promising.

~~~
webkike
For all we know, that result is a complete facsimile of the original source.
But then again, we don't have the source.

------
os_
The original and the most powerful disassembler is IDA Pro. The project was
started in the 90s and has been used for security analysis, antivirus work,
protection analysis/research, hacks as well as normal dev work in the closed-
source ecosystems.

[https://www.hex-rays.com/products/ida/index.shtml](https://www.hex-
rays.com/products/ida/index.shtml)

The author has implemented a decompiler plugin on top of IDA, and it works on
real-world code. The point here is to annotate the disassembly bottom-up and
then decompile.

[https://www.hex-rays.com/products/decompiler/index.shtml](https://www.hex-
rays.com/products/decompiler/index.shtml)

I don't want to bash the author of Snowman - this kind of research is serious
fun. Yet, IDA has an insane lead.

~~~
DigitalJack
It's also expensive. I'm sure it's worth it from all I've heard, but it's
unlikely I'll ever have the money for hobby work.

~~~
tptacek
For what IDA does, it's incredibly _inexpensive_, so much so that it's
distorted the market for reverse engineering tools. Consider that people who
use IDA on a day-to-day basis have $250/hr+ bill rates, and if they use IDA,
they rely on it. Meanwhile, the set of people who use IDA on a day-to-day
basis is very small relative to the whole industry.

I'm not saying you should buy IDA, just that I think IDA is severely
mispriced.

~~~
os_
I hear you. For the sake of completeness, here is the other side of the
argument, from students who use IDA for reverse engineering. Those activities
are really about cracking freemium/shareware apps, and the associated
subculture is a little... well... special. I have heard countless times that
IDA should cost $200 and that the author should work with the community to
improve the tool...

My stance is that the tool is very specialized and unique, and the cost is
reasonable for professional use cases.

~~~
tptacek
I know I'm repeating myself here, but I want to make sure I communicate this:

IDA's price is so low that it actually harms the market for professional
reverse engineering tools. Most useful products you can build --- tracers,
visualizers, emulators, pattern matchers, debuggers --- fit into IDA's orbit.
As products, as "feature/function/benefit" statements, they are subsets of
IDA. But they're chained down by IDA's price. Just like IDA, they have to
serve a small market of users who make tens of thousands of dollars _per week_
using the tools, but the market optics make it hard to charge even a
significant fraction of the (low) total cost of IDA.

It's sort of hilarious to me to see what Hopper is doing to the market.
"Ruining it entirely" wouldn't be far from the truth. I'm only sort of
complaining. Viscerally, I'm thrilled that Hopper exists.

~~~
duckingtest
Who is making tens of thousands per week? Outside of selling exploits to
government agencies or cybercrime. Or is that what you meant?

AV jobs pay shit.

~~~
tptacek
The software security bill rate for people who are competent with IDA and can
find bugs black-box with it exceeds $3k/day. Source: until Friday, I'm a
principal at a very large software security consulting firm.

That's for projects denominated in billable days. Talented specialists do even
better, on fixed-price projects for specific targets. Rates get higher for
cryptographic work, as well.

I am _not_ talking about selling vulnerabilities. I have never sold a
vulnerability to anyone, nor have I (to my knowledge) done any software
security work for any division of the USG or any other government, nor would
I.

Don't work in AV. There are worse things about AV than the pay scale.

------
jordigh
C and C++ are different languages. I only saw C examples. How would you even
decompile to C++? The only C++ information you have in the object code is the
mangled names. How do you use that to get C++ code?

~~~
_wmd
And vtables, rtti data, systematic sequences of ops for invoking base classes,
intrinsic/library functions emitted by the compiler in specific situations,
exception handling tables, ...

~~~
jordigh
But none of those is something only a C++ compiler would specifically do,
right? How would you distinguish a vtable emitted by a C++ compiler from one
hand-rolled in C? I suppose you could just offer a reasonable guess. I wonder
if decompiled GTK+ code would end up turning into C++.

~~~
mieko
Most C++ platform ABIs are pretty trivial to recognize. A tool like this could
distinguish a C++ binary by looking at its initialization/housekeeping
sections, static constructors, __cxa_atexit, exception handling tables, linked
libraries, etc.

For example, the Itanium C++ ABI ([http://mentorembedded.github.io/cxx-
abi/](http://mentorembedded.github.io/cxx-abi/), perhaps Itanium's only real
legacy) adopted by Linux/ELF and other platforms, leaves a huge amount of
fingerprints on a binary. It'd take a very conscious effort, including hand-
crafting linker scripts, to generate a C binary that a tool would incorrectly
think was C++.

------
barrkel
Interesting that gcc removes the \n from the string and calls puts() directly
- this avoids the overhead of parsing the string for non-existent format
specifiers.

The decompiler could do with a bit of work making dynamic library imports more
symbolic. Following the puts call chain quickly disappears into a non-local
jump to an address with no further references.

------
nes350
Another native code decompiler, although apparently abandoned long ago:
[http://boomerang.sourceforge.net](http://boomerang.sourceforge.net)

~~~
RDeckard
Thanks for sharing, I did not know about it. Of course, aware'ing everyone on
this thread of IDA Hex-Rays too: [http://www.hex-
rays.com/products/ida/](http://www.hex-rays.com/products/ida/)

------
72deluxe
That "hello world" decompilation is complex!

[EDIT: Very informative replies below, thanks!]

~~~
qzc4
It's because of the #include <stdio.h>, isn't it?

~~~
ctz
No, it's because it's decompiling from _start downwards. From main downwards
it's actually very straightforward.

You can also see that GCC did strength reduction of printf("thing\n") to
puts("thing").

------
stinos
Pretty impressive! I haven't used any disassembly tools in years and only
remember last time I did I found it useless. Not sure if that was due to my
lack of understanding or the generated output or rather a combination of both.
This thing however: I fed it with an OpenGL test app which doesn't do much but
still has hundreds of lines of modern C++ spread over different libraries and
could clearly recognize lots of my functions in it and follow some program
flow starting from main. Still hard but at least I didn't feel completely lost
like years ago.

------
Someone1234
This is extremely useful for analysis. Even if you understand x86 ASM, it
allows you to quickly jump around a lot more efficiently than you otherwise
would.

It won't, for me, recompile back into the source application. So that is a
limitation, but even with that limitation it is extremely useful (and the fact
it links the C/C++ back to the ASM makes altering the ASM directly trivial).

~~~
fla
This and also the fact it's available as an IDA plugin.

------
3rd3
I’m wondering whether one could use machine learning and C/C++ code from
GitHub to find reasonable variable names automatically.

------
ntoshev
I wonder if statistical machine translation approach could be usefully applied
here. Get tons of source from github, compile with every compiler available,
train on the result. It would be challenging to compile automatically at
scale, to align the code with the source, or to get a source representation
invariant to identifiers, but should be doable.

~~~
fiatmoney
It's both harder and easier. The transformation is mechanical, without the
untranslatable or only approximately translatable idioms of natural language.
On the other hand, the dependency chain is much more complex - with something
like link-time optimization, a change to one part of the code can completely
change the result (for instance, if it suddenly allows inlining of a function
everywhere). There is also the problem of, if not "idiom translation", "idiom
generation" - people write code in a particular style that may not be
captured by the generated output, even if it compiles the same.

Targeting something like Clang specifically, where you have access not only to
the assembler & a potential source, but also a whole AST & intermediate data
structures, would be pretty interesting.

------
daguu
I'm brand new to C, but wouldn't this from the hello world example always
evaluate as true?

    if (__JCR_END__ == 0 || 1) { return;

~~~
fnordfnordfnord
If __JCR_END__ were always a boolean, yes.

~~~
tpush
Am I missing something? '==' has a higher precedence than '||', so it should
always evaluate as true.

~~~
DSMan195276
I think the catch he's getting at is that || imposes an ordering that the left
side is checked before the right side. This also implies that any side-effects
of the left side have to happen before the right-side is evaluated.

That said, I still don't know how you could get this code generated. If you
make an equivalent piece of code with __JCR_END__ as a volatile int, you still
get an infinite loop which has the mov op for reading the __JCR_END__ value
but doesn't bother to test it. I.e. gcc still reads the variable but optimizes
the loop to a while (1). I can't think of any way to trick gcc into generating
asm like this.

------
J_Darnley
I'm kind of disappointed that there isn't a version available for IDA 5.0.
Yeah, I'm cheap.

------
TickleSteve
This will totally fail for optimised code if it is just using object code
without debug information. There is no information in the resulting machine
code that can indicate whether some code has been inlined or not. Basically
any optimisations performed by the compiler will throw this decompilation off.

I question whether you can get any real use out of this...

~~~
anemic
I don't think the market for this is to get the _actual_ original code. It's
more like understanding what a particular program does: when you see it on a
higher level it's much easier to understand the code than reading raw
assembly.

~~~
TickleSteve
Absolutely... I get that, but it's only functional for non-optimised code. My
point is that for anything non-trivial, it's not going to be terribly useful.
You're still going to need to understand what really is going on; optimisers
mangle the code out of all recognisability for this decompiler.

~~~
pjmlp
Unless the Assembly is using clever tricks like code rewriting, it is always
possible to at the very least decompile into some form of pseudo-code.

Just giving symbolic names to memory addresses and replacing Assembly opcodes
with more meaningful instructions can work wonders when trying to understand
some code.

~~~
TickleSteve
I disagree... inlining will remove all evidence that a function call existed.
Loop unrolling will remove all evidence that a loop existed (potentially).
Those are basic optimisations; compilers will transform the code out of all
recognisability, to the point that a decompiler isn't worthwhile.

------
m00dy
Why only Windows? I don't get it.

~~~
schoen
There's an enormous community of people who spend all their time worrying
about the contents or behavior of Windows binaries. I've met some of them
through my work, like malware analysts who deal with malware that's part of
phishing attacks. The phishers will often prefer to create Windows-only
attacks because Windows has such a commanding market share lead among most
populations of phishing targets; in turn, that's what the people trying to
defend against or mitigate those attacks will study. To folks in that sort of
field, "binary" is virtually synonymous with "Windows binary"!

I guess also historically most of the tools for creating, modifying, and
examining binaries for a given platform have been native to that platform,
rather than cross tools. That's surely because most people (with the exception
of embedded developers) do much more native development than cross
development. I can get a small number of packages on my Linux machine that
will deal with Windows executables in some relatively shallow way, but I have
_tons_ of programs already installed that do complicated and specific things
to Linux ELF binaries even though I don't typically use those programs on a
day-to-day basis.

