
Show HN: Assembly to C code Decompiler - zandorg
http://www.decompiler.org/
======
emcq
Just this week I was exploring C decompilers and stumbled upon the open-source
Snowman[0], which worked well for my purposes and can run in a self-contained
mode with a dependency on Qt5.

[0] https://github.com/yegord/snowman

~~~
ultramancool
Interesting, Snowman is also integrated into x64dbg.

------
openasocket
How do decompilers work in general? I'm imagining the normal compiler pipeline
in reverse: convert the machine code into some intermediate representation,
add some 'de-optimization' passes to make the control flow more clear, then a
back end which converts that into a C AST, which is then printed out into
valid C code.

~~~
zzzcpan
I don't think this can get you meaningful code, just assembly-looking C. It
would require some sort of machine learning to guess code fragments with
proper variable names, properly nested loops, etc.

~~~
openasocket
Variable names are almost certainly a lost cause, but IDA Pro is pretty good
at recovering control flow, which could be turned into C constructs (didn't
Dijkstra prove something about expressing any control flow using structured
programming constructs?). You'd probably need a bunch of heuristics to tell
the difference between a while loop and a for loop. More ambitious would be
trying to un-inline functions, which would require liberal use of the de-
optimizer and recognizing common sub-structures.
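
To make the control-flow point concrete, here is a minimal sketch (the function
names are made up for illustration, not output from any real tool) of the same
loop twice: once as naive "assembly-looking" C with one label per basic block
and one goto per branch, and once after the back edge has been recognized as a
while loop. Whether a tool prints `while` or `for` for this shape is a
heuristic choice; both forms compute the same result.

```c
#include <stddef.h>

/* Hypothetical naive output: labels mirror basic blocks,
   gotos mirror the conditional and unconditional branches. */
int sum_goto(const int *a, size_t n)
{
    int total = 0;
    size_t i = 0;
loop_head:
    if (i >= n)
        goto loop_exit;
    total += a[i];
    i++;
    goto loop_head;
loop_exit:
    return total;
}

/* The same control flow after loop recovery: the jump back to
   loop_head becomes a while loop with the exit test negated. */
int sum_while(const int *a, size_t n)
{
    int total = 0;
    size_t i = 0;
    while (i < n) {
        total += a[i];
        i++;
    }
    return total;
}
```

A for loop would be the natural choice here if the tool also notices that `i`
is initialized just before the loop and incremented once per iteration.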

------
umanwizard
Anyone know how this compares to the industry-standard Hex-Rays Decompiler?
(Sold as an add-on to IDA Pro)

~~~
CraigJPerry
Can you actually buy it though? I have it in the back of my mind that they
restrict sales of IDA Pro and the decompiler as an anti-piracy measure.

~~~
khedoros
They certainly don't have a process where you can just "add to cart" on their
website and hand them a couple thousand dollars. I'm not sure what extra
checks they do through their sales team; maybe they're just trying to make
sure that large businesses aren't buying up single-dev "Named" licenses and
installing it across their organization, or maybe they're trying to verify
that buyers are known security researchers, or something.

~~~
cynix
Umm, I bought my license by doing exactly that — add to cart and fork over a
few thousand dollars. I don't remember them doing any verification other than
requiring a business email address.

------
zxv
Closed source, Windows only, disassembles 6502 only, and expires next month.
No thanks.

~~~
zandorg
I don't like to reply to my own posts, but here goes.

It's closed source just at the moment. HexRays is closed too.

It can run on Linux or Mac OS X, as CLISP runs on those platforms. I just
haven't started work on porting to new platforms.

It decompiles 6502 as a proof of concept. It can decompile other CPUs, but not
fully.

The expiration is temporary. I intend to eliminate this.

Thanks for your comments, it's great to get feedback.

Just another thing: Check out the 'Samples' page to see what it can do with
different CPUs.

~~~
khedoros
I have a few technical questions, as well as a few about your intentions with
the software.

How does it handle interrupt calls to the OS? It's not an issue for Windows
(because it's all done through library calls, right?), but what about DOS
int 21h and Linux int 80h, for example?

With the x86 work, is the logic all built around protected mode? I've been
using IDA to examine/document the assembly of a DOS game, so I'd be interested
in the behavior if it's fed real mode code. Further (and tying in to my
previous question), the game uses Borland overlays through int 3Fh (it seeks in
the binary itself and loads new sections of code into memory, while running,
before jumping into the newly-retrieved code). Would that kind of thing be
possible to handle automatically? IDA seems to be hard-coded to look for the
offset+length tables that are used, and finds the function entry points that
way.

More on the business side, you've got a way to request a quote, and the
impression I get is that your aim is to run a decompilation business. Where
does that leave the software itself? As a proprietary technology that lets you
differentiate your business? Or is it your plan to sell the software, release
binaries, release code, or some combination? My perspective is that of a
hobbyist with a curiosity for reverse engineering and a (strictly non-
commercial) project to apply it to, and I'm trying to figure out where this
software fits into my world.

~~~
zandorg
About interrupt calls... If it finds such a call, all it has to do is look at
the call's input registers (assuming they are in registers) and output the
call with the 'logical' contents of the registers.
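
As a rough illustration of that idea (not the decompiler's actual code), the
sketch below takes the register values tracked up to an int 21h instruction
and prints a C-like call. The `struct regs` layout and the names
`render_int21`, `dos_print_string`, `dos_exit` and `int21` are all
hypothetical; only two well-known DOS services are covered (AH=09h prints the
'$'-terminated string at DS:DX, AH=4Ch exits with the code in AL).

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical snapshot of the register values known at the INT. */
struct regs {
    uint8_t  ah, al;
    uint16_t ds, dx;
};

/* Render a DOS int 21h call as a C-like statement in out.
   Returns what snprintf returns: the length of the rendered text. */
int render_int21(const struct regs *r, char *out, size_t out_len)
{
    switch (r->ah) {
    case 0x09: /* print '$'-terminated string at DS:DX */
        return snprintf(out, out_len, "dos_print_string(0x%04X:0x%04X);",
                        (unsigned)r->ds, (unsigned)r->dx);
    case 0x4C: /* terminate process, return code in AL */
        return snprintf(out, out_len, "dos_exit(%u);", (unsigned)r->al);
    default:   /* unknown service: fall back to a raw call */
        return snprintf(out, out_len, "int21(0x%02X);", (unsigned)r->ah);
    }
}
```

The hard part in practice is the tracking, not the printing: this only works
when the values in AH and friends are known constants at the call site.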

The x86 code is basically 32-bit Windows. I haven't got a 16-bit compiler.
However, it should work for 16-bit. The key thing is that you can write
specialised modules for each CPU, and the rest (loops, variables) are
standardised.

On the business side... Basically, my 6502 decompiler works the best, so I
thought I could sell that. As for x86, there are issues with structs and
arrays that I haven't had time to figure out.

As for the decompiler software, I plan to finish the x86 and ARM decompilers
and sell them for a reasonable price (say, $150 for all CPUs, not just one).
As noted, arrays & structs are a problem.

As for 'release code', anyone can write a CPU module, but I need to document
it first.

Thanks for your comments.

------
Frondo
I'd be curious to see what C it produced if you fed it hand-written assembly.

~~~
khedoros
The game that I mentioned elsewhere in this thread seems to be a mix of C and
assembly (as one would expect from an early-90s game), so I'd be curious about
the same thing.

~~~
kw71
Many instructions of machine code can be written as one C statement. There are
some cases of things the cpu can do that are not standardized in C, like
"arithmetic shift vs. logical shift." Maybe these instructions would be
inlined.
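
For the shift case specifically, a small sketch of how the two instructions
can still be expressed in portable C. In C, `>>` on an unsigned value is a
logical shift, while `>>` on a negative signed value is implementation-defined,
so the arithmetic version below avoids shifting a negative number at all. The
function names `lsr32` and `asr32` are made up, and both assume a shift count
below 32.

```c
#include <stdint.h>

/* Logical shift right (like x86 SHR): high bits are filled with 0. */
uint32_t lsr32(uint32_t x, unsigned n)
{
    return x >> n;
}

/* Arithmetic shift right (like x86 SAR): the sign bit is replicated,
   which is the same as dividing by 2^n and rounding toward -infinity. */
int32_t asr32(int32_t x, unsigned n)
{
    if (x >= 0)
        return x >> n;
    /* For negative x: floor(x / 2^n) == -((-(x + 1)) >> n) - 1.
       -(x + 1) is non-negative, so this shift is well defined. */
    return -((-(x + 1)) >> n) - 1;
}
```

For example, `asr32(-8, 1)` gives -4, while `lsr32(0xFFFFFFF8u, 1)` gives
0x7FFFFFFC for the same bit pattern.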

~~~
khedoros
> Many instructions of machine code can be written as one C statement.

I'm quite aware =) Nested loops, some pointer calculations (dealing with real
mode pointers), conditionals with more than one condition, and multi-step math
statements seem particularly verbose, compared to their higher-level
equivalents. If a piece of code gets kind of complicated, I usually hand-
decompile it, and the result is usually much shorter, even in fairly naive C.
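
As an example of the real-mode pointer math, this is roughly the calculation
that shows up spread over several instructions in the disassembly: a
segment:offset pair maps to a linear address as segment * 16 + offset. The
helper name `real_mode_linear` is made up for illustration, and A20
wrap-around is ignored.

```c
#include <stdint.h>

/* Real-mode address translation: linear = segment * 16 + offset.
   The sum can go slightly past 1 MiB (up to 0x10FFEF); whether that
   wraps on real hardware depends on the A20 line, not modeled here. */
uint32_t real_mode_linear(uint16_t segment, uint16_t offset)
{
    return ((uint32_t)segment << 4) + offset;
}
```

Note that different pairs can alias the same byte: 0x1234:0x0010 and
0x1235:0x0000 both resolve to 0x12350, which is part of what makes pointer
comparisons in real-mode code so verbose.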

