Hacker News
Decompiler Explorer (dogbolt.org)
376 points by todsacerdoti on July 13, 2022 | 82 comments



Sorry for the outages, friends. We're actively working on getting it to handle higher load, but we knew that if we hit HN we'd be swamped no matter what we did. We're spinning up more workers and fixing obvious perf issues as we see them, but if it's not available when you try it, make sure to check back later!


I really appreciate these kinds of websites.

But I wonder if we'll eventually come full circle and it will become easier and cheaper to ship a wasm Linux kernel with virtual disk access over WebSockets instead of processing things server-side.


IDA Pro and Binary Ninja, listed here, are proprietary and require expensive licenses to run.


And both companies behind those licenses (hi, I'm one of them!) donated licenses to support this.

That said, you're right. It's unlikely we'd ship our entire binaries plus code to live in-browser, though the amount of wasm stuff people are doing lately is fascinating.


I feel like the decompiler space is a little stuck? I mostly go with Hex-Rays out of habit and because I'm used to IDA, but I haven't really seen x64 decompiler output noticeably improve in recent releases.

A lot of my colleagues use Ghidra a lot now and complain about its decompiler regularly.

Is there any new approach in the works? Maybe something ML-based for optimization? Would be sad if Hex-Rays output is "as good as it's gonna get".


> A lot of my colleagues use Ghidra a lot now and complain about its decompiler regularly.

Are your colleagues decompiling obfuscated code (for example, malware)? Publicly available decompilers don't work well for that, but I assume that many specialists have their own little improvements and plugins that they don't share with others because it's their core business.

For non-obfuscated code, Ghidra has served me very well, even for entire applications. Often it has to be pushed in the right direction (for example, by manually specifying the type of a variable), and it sometimes misses obvious simplifications, especially when arrays are involved, but I think those issues could be solved relatively easily by polishing/extending its heuristics. Nothing where I would say ML is needed, although it would be possible. In the end, most programs contain the same patterns, and an ML-based system could help identify them.

But yeah, obfuscated code, that's something else. There are some academic publications about the usage of ML for that. No idea what's happening inside the company labs, though.


I haven't used Ghidra "seriously", but I fed it some non-trivial programs I wrote in Free Pascal and was very surprised to see it recreate a C++ program that was incredibly similar to what the Free Pascal program looked like.

Of course, it wasn't obfuscated and there were a couple of mistakes here and there, but overall it would work perfectly fine for someone to understand what the program was doing if they didn't have access to the source code.


From my small experience with Ghidra, it didn't do great once the code wasn't using standard calling conventions (i.e., it was probably compiled with optimization flags).

Sometimes it would just straight up ignore (functional) assembly for apparently no reason. Or it would turn simple code into a myriad of nested conditionals and loops that achieved the same goal but looked nothing like what a human would write.

It was still very helpful in understanding blocks of assembly much faster than I otherwise would, and it's possible I was lacking some configuration that a more experienced user could do to help the decompiler out.


>Sometimes it would just straight up ignore (functional) assembly for apparently no reason.

Probably code it thinks is unreachable: a jmp or ret right in front of it and no jmp/call into that address, probably a computed jump/call.

> Or it would turn simple code into a myriad of nested conditionals and loops

Ran into that myself; usually it's a switch case. (Dunno how to get Ghidra to deal with that properly myself.)

The biggest help you can give ghidra is defining structs, naming the fields, and setting the right types.


> probably code it thinks is unreachable.

Or dead assignments. I've seen it with Hex-Rays: if you don't tell it that e.g. var_16 is actually 32 bytes long, not 4, it will completely ignore any code that reads/writes the stack between var_16 + 4 and var_48 (which is at var_16 + 32). It's quite an amusing sight: you have an 8-line decompiled function from 300 lines of assembly, you edit a variable's annotation, and boom, the decompiled function is now 40 lines long, with all kinds of interesting computations in its body.


Rellic [1] implements an algorithm that generates goto-free control flow (citation in the README), which would be a significant improvement over what Ghidra/IDA currently generate.

Unfortunately, the maintenance state of the pieces around Rellic isn't very good, and getting it to build is practically rocket science. It doesn't have as much UI/GUI as Ghidra either, so it's a bit far from accessible right now.

[1]: https://github.com/lifting-bits/rellic


> that generates goto-free control flows

...note: from LLVM bitcode.


https://github.com/lifting-bits/remill is linked from there, which I guess is where you get your bitcode from.


Oh that's cool, thanks!


What happens with code that uses lots of gotos (incl. computed gotos)?


From reading the paper, it essentially does jump unthreading. Basically, if you imagine code like this:

  bool found = false;
  for (...) {
    if (...) {
      found = true;
      break;
    }
  }
  if (found) {
    // A
  } else {
    // B
  }
Jump threading is an optimization pass that replaces the break statement with a goto A. After that replacement, found is always false, so the boolean variable and the if statement are deleted. The resulting code would look something like this [1]:

  for (...) {
    if (...) {
      // A
      goto end;
    }
  }
  // B
  end:;
What the lifting is doing here is essentially running this pass in reverse. If you find a branch pattern that doesn't meet any preordained schema (such as a loop with multiple exits), just synthesize a variable that tells you which target you're going to jump to. Were the compiler to optimize the resulting code, jump threading would convert it back into the gotos present in the compiled binary.

[1] This kind of optimization pass runs at a stage when the code is basically treated entirely as a CFG and there's no such thing as if statements or jumps or gotos, just conditional and unconditional branches terminating basic blocks. Any reflection of the code outside of this form is therefore somewhat imprecise.
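The synthesized-variable trick can be sketched in runnable form; Python here, with a made-up `find` function as the subject, mirroring the loop-with-two-exits pattern above:

```python
def find(items, key):
    # Loop with two distinct exits ("found" vs. "exhausted").
    # Instead of a goto out of the loop, synthesize a selector
    # variable that records which exit was taken -- the reverse
    # of the jump-threading pass described above.
    exit_taken = "exhausted"
    index = -1
    for i, item in enumerate(items):
        if item == key:
            exit_taken = "found"
            index = i
            break
    if exit_taken == "found":
        return index   # branch A
    return -1          # branch B
```

An optimizing compiler performing jump threading on the equivalent C would fold the selector variable away again, reintroducing the direct jump.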


Can you think of an example of "goto"-based code that cannot be translated into conventionally structured code?


[EDITED to say explicitly:] You can translate any goto-laden code into "conventionally structured" code mechanically, if you don't care about having the structure of the resulting code actually indicate what it does. Here's an example of the sort of code for which that might be the best you can do.

Suppose you implement a state machine with gotos. So, for a simple (and contrived) example, suppose you have something that absorbs the decimal digits of a number and keeps track of the value of the number modulo 3 by having three states. Something like this (pseudocode):

    def mod3():
        state0:
            d = getdigit()
            if d == FINISHED: return 0
            if d is 0, 3, 6, 9: goto state0
            if d is 1, 4, 7: goto state1
            if d is 2, 5, 8: goto state2
            return ERROR
        state1: 
            d = getdigit()
        if d == FINISHED: return 1
            if d is 0, 3, 6, 9: goto state1
(etc.) You've got three stateN labels each of which can jump to any of the stateN labels (as well as being able to return from the function).

If you have tail-call optimization you can turn this into conventionally structured code, more or less:

    def state0():
        d = getdigit()
        if d == FINISHED: return 0
        if d is 0, 3, 6, 9: return state0()
        if d is 1, 4, 7: return state1()
        if d is 2, 5, 8: return state2()
        return ERROR
with similar definitions for state1() and state2(), and then the top-level function just calls state0. But this depends on knowing that all those tail calls will get optimized, or else on never having enough digits to overflow your stack.

Or else you can have an explicit state variable:

    def mod3():
        state = 0
        loop:
            if state == 0:
                d = getdigit()
                if d == FINISHED: return 0
                if d is 0, 3, 6, 9: state = 0
                else if d is 1, 4, 7: state = 1
                else if d is 2, 5, 8: state = 2
                else: return ERROR
            else if state == 1:
                ...
            else:
                ...
which works pretty well for the special case of state machines but badly for most other things a goto might be used for. (Though obviously you can translate literally any goto-using code into this sort of thing. You might want to call the analogue of the "state" variable here "program_counter" in that case.)
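For concreteness, the explicit-state-variable version can be filled in as runnable Python; here a digit list and its exhaustion stand in for getdigit() and FINISHED (those harness details are my own, not from the pseudocode above):

```python
def mod3(digits):
    state = 0  # invariant: state == (number read so far) % 3
    for d in digits:
        if d in (0, 3, 6, 9):
            pass                      # digit ≡ 0 (mod 3): stay in this state
        elif d in (1, 4, 7):
            state = (state + 1) % 3   # digit ≡ 1 (mod 3)
        elif d in (2, 5, 8):
            state = (state + 2) % 3   # digit ≡ 2 (mod 3)
        else:
            return "ERROR"
    return state
```

The transitions work because 10 ≡ 1 (mod 3), so appending a digit just adds its residue; the three if-arms correspond exactly to the three goto targets in the original.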


The translation can always be done, but for dense spaghetti control structures duplication of code may be required. One can construct artificial cases where the size increase is exponential, but that's unlikely to be an issue in even the worst real code.


This is actually why we chose _not_ to implement no-more-gotos for Binary Ninja's HLIL! Code is actually more readable with gotos in some situations and trying to force their elimination hurts readability.


> Rellic [1] implements an algorithm that generates goto-free control flows

Doesn't WebAssembly implement that already, via Relooper?


> Is there any new approach in the works? Maybe something ML-based for optimization?

I'm doing a PhD on this.

My goal is to detect known functions from obfuscated binaries.

The biggest challenge by far is building a good dataset. Unlike computer vision (millions of pictures with the label "dog") the number of training examples for a typical function is one. For now I'm focusing on C standard libraries, since there are a handful of real-world implementations plus some FOSS or students samples available for things like strlen and atoi.

If anyone wants to collaborate, feel free to message me.


I'm not sure I follow - wouldn't many statically linked programs have much of some version of libc within them? So you could take any program, change it to be statically linked and use that for training?

That said I assume I'm missing something here.


Could a best guess + fuzzing + compiling the decompiled code work toward a heuristic?


Not sure exactly what you mean by "best guess + fuzzing", but I have compiled code that was first decompiled by Ghidra. The problem is there are lots of invalid identifiers in the decompiled output.

The worst are symbols that are used inconsistently within the same function, like a parameter which is passed in as a long and then used as a pointer to a struct or even as a function.

The Ghidra community basically says you should not expect the exported decompiled code to be valid [1,2]. Which is fine, since round-trip compile-decompile-compile is not exactly Ghidra's purpose.

Maybe there's a setting to make Ghidra export asm literals when it can't figure out a valid decompilation, but I'm pretty new to Ghidra, so it could just be my own ignorance.

[1]: https://github.com/NationalSecurityAgency/ghidra/issues/236

[2]: https://github.com/NationalSecurityAgency/ghidra/issues/3553


> The worst are symbols that are used inconsistently within the same function, like a parameter which is passed in as a long and then used as a pointer to a struct or even as a function.

Split it into a new variable. Sounds like Ghidra has trouble telling whether it's a reused storage location or actually the same variable.

Best guess = something that looks approximately fitting for the relevant assembly.

Fuzzing = tweaking the source code to get what it compiles to closer to the actual assembly.

As in: generate a function, see how closely its compilation resembles the assembly, and tweak until you find a match.
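That tweak-until-it-matches loop might look something like this toy sketch (the candidate set, the similarity metric, and `compile_fn` are all stand-ins I've invented; a real version would invoke an actual compiler and disassembler):

```python
import difflib

def closeness(candidate_asm, target_asm):
    # Toy similarity score between two instruction sequences.
    return difflib.SequenceMatcher(None, candidate_asm, target_asm).ratio()

def search(candidates, compile_fn, target_asm):
    # "Best guess + fuzzing": compile every candidate source snippet
    # and keep the one whose output most resembles the assembly
    # we're trying to reproduce.
    return max(candidates, key=lambda src: closeness(compile_fn(src), target_asm))
```

A real search would mutate the best candidate and iterate rather than exhaustively scoring a fixed set.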


Yeah, I ended up creating new variables to get the compile to succeed.

As for generating functions, I'll have to think about what that loss function would look like. I've been looking at asm2vec[1] and structure2vec[2] for inspiration. I'm currently looking at different kinds of graph embeddings, because even answering the basic question of "are these N bytes of assembly semantically similar to these other N bytes" is a challenge.

[1]: https://ieeexplore.ieee.org/document/8835340

[2]: https://arxiv.org/abs/1603.05629


Maybe start with a simple fixed-size instruction set, to get some methodology down that can later be refined? Something like the early 8-bit micros.


That's not a bad idea. My first crack at this has been with a Linux x64 target, but I have the infrastructure in place for MIPS, ARMv7, Thumb, etc. I haven't tried compiling to very old/simple targets, but I was considering using the MOVfuscator as one of the compilers.

Or maybe I can figure out how to tell LLVM to do some extreme strength reduction and target an ultra reduced subset of some ISA. Great food for thought, thanks!


I hear good things about Binary Ninja!


I always found it odd that IDA Pro was such a pile of poop when it probably made sooo much money.


The decompiler space probably has a few tens of millions in yearly revenue, yet writing a good decompiler takes a lot of engineering effort, and you're not going to spend tons of money and effort to capture a measly $10M market; you'd rather build the next Uber-type thing that targets a much bigger one.

Hence Hex-Rays can get away with not doing much and just collecting yearly license fees from existing customers, as there isn't a better alternative anyway.


One thing Hex-Rays has that Ghidra doesn't (and cannot) is amazing support. Back when I had a license at work, I could report bugs and literally get a fixed binary back a couple of hours later.

They're both amazing, they're both quirky, and they're both buggy. But one is free and the other has its support. Pick which one matters to you :-)


It's been a while since I needed to do any serious reverse engineering work, but I do remember IDA Pro having support for a range of obscure CPUs; not sure how Ghidra compares on that front these days.


Ghidra still mostly relies on community plugins for obscure platform support. It's ~OK but less than ideal.


Ghidra outstrips IDA Pro in the number of architectures it supports.


One of the major categories of users is people in the warez scene, all of whom are pirating it. The only other one is security researchers, which is a pretty small market.


And backward engineers, including porters and students, and code recoverers.

And especially, software tweakers and improvers. Not all software is open source.


IDA Pro was an amazing disassembler and accompanying set of tools - top of the pack for quite a while.


Love the joke in the URL. :)

(For anyone who doesn't get it: it's a play on Godbolt.)


What's really funny about this is that Godbolt is the last name of Compiler Explorer's author, but it now seems to have become a brand, a word in its own right.

Being able to swap two letters in a name and get something as nice as this is lucky.

Godbolt is quite a name.


Not just any swap: it's a mirroring, fitting for the reverse direction.


It could be argued that Matt Godbolt is a good boy, further complicating the Dogbolt issue.


Thanks! We debated it some internally and I'm glad it won out, I think it's worth it. Plus, it has a nice logo that goes with it.


Can any of these decompilers make effective use of a Microsoft PDB file, if I have one, to include original symbols in the decompiled output? What I'd really like to do with a decompiler is feed it a final compiled EXE or DLL of my own code and see what it looks like after it's been run through whole-program optimization. In that case, of course, I have a PDB file.



IDA does.


Binary Ninja can as well (sorry for the delay, been on vacation this week), though in the configuration we're using on dogbolt, none of the tools will download and use PDBs that might be available via public symbol servers or elsewhere. It would potentially be possible, but our goal isn't to provide a test of all tools in all possible configurations so much as to give a good overview. Once you start tweaking each tool differently, you're better off running that sort of analysis locally.


Ha! That was funny. I wonder, though: getting fed tons of code, couldn't Godbolt leverage code -> compiled object -> assembly as a means to train an AI decompiler? Food for thought.


I've always wondered about this. Compilers do a LOT of irreversible stuff. For example, symbol names usually aren't needed (unless you have a reflective language).

Where AI would really shine is reversing the (only seemingly reversible) optimizations. For example, GCC converts "x * 14" into "(x << 4) - x - x". Of course, you can never be 100% sure the programmer didn't actually want "shift left by four followed by two subtractions", but I'm convinced that 99% of the code I write is fairly predictable and statistically similar to whatever giant codebase you train it on.
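The arithmetic checks out: 14x = 16x - 2x, so the un-optimizer's job is to recognize the shifted form as plain multiplication. A quick sanity check (my own illustration, not from the comment):

```python
def mul14_reduced(x):
    # GCC-style strength reduction of x * 14:
    # (x << 4) is 16x; subtracting x twice leaves 14x.
    return (x << 4) - x - x

# The decompiler's (or model's) task is the inverse mapping,
# back to the semantically identical x * 14.
```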


Symbol names could be inferred from context


Throwing AI at the problem might not actually be the worst suggestion. I wonder how the likes of copilot model the AST. Heh, you might even be able to build an approximation of a compiler using AI.


I think it would be easier and faster to just take the millions of open source projects on github for that :)


...which don't have binaries. It's easier for Godbolt, since the whole purpose of the website is to compile and show output. If you crawl GitHub you need to compile the projects yourself, much more difficult.


Binaries are freely available from package management repos, with the benefit of having a known toolchain you can tag your ML inputs with. All the package managers I've worked with have a strongly structured "upstream" or "repo" field or similar that you can use to get to the source.


Fair enough!


Some projects do publish binaries with releases.


Just take all of Debian packages, or something like that.


The maximum executable size is 2 MB, so it's not possible to torture it with a ghc-compiled Haskell program.


IDA license sponsored by "Yiang Ling Personal License"?

EDIT: The site has changed in multiple ways in the 30 minutes I've been trying to submit my sample. Best of luck keeping up with demand.


Nope, Ilfak gave us a license for it, and as Binary Ninja devs we're using a legitimately licensed copy of Binary Ninja as well. All above board, and we're hoping to add more commercial decompilers in the future, as we can integrate them and as the companies behind them are willing.

RE: demand. We just got 2x the workers, but as the east coast wakes up I'm not confident it'll hold up too well; several of the decompilers are... VERY resource intensive, so there's really no good way, short of an exorbitant amount of compute, to scale to heavy demand.

Eventually a better queue system, with better pre-processing to filter out invalid inputs, is on our todo list.



Unrelated, but it's amazing that over the years I have seen all kinds of misspellings of "Jiang Ying", and the "ang ing" part is always right. :P


HN crowd decompiled the website


Yeah, sorry about that. We're working on getting it up again but no promises. I'm on vacation in Europe while the rest of the team is about to head to sleep so might be a bit before we have it more stable.


Long, long ago a friend lost the source to a CP/M program, and wrote ReSource to help re-create the 8080 assembler source from the executable. I ported it as Com2Asm, back in the MS-DOS days... I wonder how good things are now.

How long should I give this thing to run? My upload was 250k.


Some nice symmetry there:

Decompiler Explorer: dogbolt.org

Compiler Explorer: godbolt.org


A few years ago I tried Hex-Rays/IDA, and it gave me reasonable information about program control flow and helped me reverse-engineer without source code. A few years later, Hex-Rays/IDA still seems to be the one that gives the most useful information out there, even for hello-world examples.

I remember one of these projects coming up on my GitHub homepage, but I never tried it. This is probably the only space where I don't feel left out without constantly following updates, compared to the JS space, etc.


Incredible name


Hugged to death...


Doesn't this violate (at least) the Hex-Rays license? Fun project, but how is this legal?


Nope, not when you ask them and they provide the license.

This is being run with the permission of all the commercial products. In fact, we (Binary Ninja) and Hex-Rays (once I figure out the exact mechanism with Ilfak) are the ones actually paying hosting costs! It's both good for the community and hopefully shows off the value in commercial decompilers. :-)


That probably means they received permission ;)


Hilarious name and fascinating concept haha I like it


Are there any decompilers for old x86 DOS com files?


The commercial versions of IDA still handle COM and MZ executables as far as I'm aware, although support has been dropped from the free versions since 5. (Which is still available from ScummVM's reverse engineering page, but it's only a disassembler and doesn't come with the decompiler.)

Ghidra does a vaguely OK job with MZ executables, provided they've been unpacked first. It really struggles to represent DOS function calls properly; you'll find arguments go missing from the decompiled code. There are some third-party plugins that improve things a bit. And it doesn't have signatures for any of the libraries, so the output will just be a lot of `if ((var26 & 0xFEEE) && var42 > 0xE0) { ... }`, and it's up to you to work out when one of the variables is actually a pointer to video memory or whatever.

Reko can also decompile this era of code. It does have a tendency to crash on more complex programs but will be fine for simple files. Similar problem: the decompiled pseudo-C doesn't really illuminate what's going on any more than just reading the disassembled x86 assembly and walking through any tricky sections in the DOSBox debugger does. Without all the Win32 API calls modern programs make, there's a lot more work needed to figure out what's going on.

Personally I find I also end up needing to use a vintage tool like Sourcer alongside the modern ones, because the newer stuff doesn't annotate things which were common in the era like directly referencing the BIOS data area or reading the interrupt table from memory rather than using the DOS calls for it. It's that or spending a lot of your reverse-engineering time discovering how things were done in the DOS days.


Thanks. I had tried Reko and your description matches my impression.

I was hoping to hunt down the DOS Tetris easter egg described on the original author's website: https://vadim.oversigma.com/Tetris.htm

The executable is linked in the image.

I managed to decipher some functions (video memory pointers were actually kind of helpful) but didn't get far.


You can use old versions of IDA, IDA for DOS, or, I think, Sourcer.


Sourcer is pretty amazing if you need to patch an old BIOS; nothing can touch it in that niche: https://corexor.wordpress.com/2015/12/09/sourcer-and-windows...


BinaryNinja: Error decompiling:

    Traceback (most recent call last):
      File "decompile_bn.py", line 66, in <module>
        main()
      File "decompile_bn.py", line 13, in main
        t = tempfile.NamedTemporaryFile()
      File "/usr/local/lib/python3.8/tempfile.py", line 531, in NamedTemporaryFile
        prefix, suffix, dir, output_type = _sanitize_params(prefix, suffix, dir)
      File "/usr/local/lib/python3.8/tempfile.py", line 117, in _sanitize_params
        dir = gettempdir()
      File "/usr/local/lib/python3.8/tempfile.py", line 286, in gettempdir
        tempdir = _get_default_tempdir()
      File "/usr/local/lib/python3.8/tempfile.py", line 218, in _get_default_tempdir
        raise FileNotFoundError(_errno.ENOENT,
    FileNotFoundError: [Errno 2] No usable temporary directory found in ['/tmp', '/var/tmp', '/usr/tmp', '/home/decompiler_user']

Hex-Rays: Error decompiling: /tmp/tmpanbyzjw9/tmpqx8sjhpv: is not decompilable

angr and Ghidra still waiting at 150 seconds and counting....

320 seconds and counting....

Boomerang, RecStudio, and Reko: Error decompiling: all three fail with the same FileNotFoundError ("No usable temporary directory found in ['/tmp', '/var/tmp', '/usr/tmp', '/home/decompiler_user']"), raised from tempfile.TemporaryDirectory() in their decompile scripts (the Reko traceback also points at decompile_recstudio.py).

RetDec and Snowman are the only ones that worked on the sample app I supplied.

If I get time, I'll upload another app to test it, which introduces a new technique.

I know for a fact Ghidra should work because I've used it myself.


Was this due to load or server restarts or are you still seeing errors? Pass me a GUID either publicly or privately (my handle on twitter accepts DMs or an email address at my handle.com as a domain) if you don't mind and I can take a closer look.


Nice phish! That will come in handy.



