Hacker News
Avast open-sources its machine-code decompiler (avast.com)
354 points by matt_d on Dec 13, 2017 | 60 comments



Is there any way to use this on a DOS EXE? This would be a lovely tool to port/remaster old games (i.e. decompile, replace rendering methods with modern equivalents, compile again -- an "offline" version of the "online" idea I was going for with http://www.gabrielgambetta.com/remakes.html)


Note that it was not rare for some old DOS games to have bits of assembly in them.

This was usually done to overcome some performance bottleneck but with today's hardware you might not need that at all.


Yesteryear's C compilers weren't great, optimization-wise. But anything remotely modern would just brute force its way through a DOS-era program, well optimized or not.


Also yesterday's instruction sets were a bit more limited.


It probably needs a profile for the MZ format. And probably for 16-bit x86. I'm considering trying it on some .com binaries, to see what it does (since it said it can handle raw binaries). I doubt it'll work immediately, because I doubt that their code handles memory segmentation properly, 16 bit pointers and data, etc.
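
As a rough illustration of the format question: an MZ executable starts with the two-byte signature "MZ" and a small header, while a .COM file is a headerless image that DOS loads at offset 0x100 of a single segment. A minimal sketch, assuming the standard MZ header layout; the function names are made up for this example and have nothing to do with RetDec's code:

```python
import struct

def classify_dos_binary(data: bytes) -> str:
    """Guess whether a DOS program is an MZ .EXE or a headerless .COM image."""
    # MZ executables begin with the signature bytes 0x4D 0x5A ("MZ").
    if data[:2] == b"MZ":
        return "MZ executable"
    # A .COM file has no header; it is loaded at offset 0x100 of one 64 KiB
    # segment, so it has to fit in what is left of that segment.
    if len(data) <= 0x10000 - 0x100:
        return "raw image (possibly .COM)"
    return "unknown"

def read_mz_entry(data: bytes):
    """Return (header size in bytes, initial CS, initial IP) from an MZ header."""
    # Offset 0x08: header size in 16-byte paragraphs.
    (header_paragraphs,) = struct.unpack_from("<H", data, 0x08)
    # Offset 0x14: initial IP, offset 0x16: initial CS (relative to load segment).
    ip, cs = struct.unpack_from("<HH", data, 0x14)
    return header_paragraphs * 16, cs, ip
```

The segment-relative CS:IP pair is exactly the kind of 16-bit detail a decompiler built around flat 32-bit addresses would have to model explicitly.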


Cool. Finally a free alternative to the IDA decompiler plugin. IDA is still better thanks to its interactive nature, i.e. you can explore the code and rename variables/functions as you go. I hope this evolves into something like that.


You might be interested in the free decompiler Radare2 https://en.wikipedia.org/wiki/Radare2


radare2 is a disassembler, not a decompiler.

Disassembly is a much easier task than decompilation, since it's a mostly mechanical process. Decompilation requires you to undo the optimizations/transformations the compiler did as it generated the binary, which is much harder.
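
To illustrate the "mostly mechanical" part, here is a toy table-driven disassembler for a handful of 32-bit x86 opcodes. It is only a sketch: real disassemblers handle prefixes, ModRM bytes and variable-length encodings, and the `disassemble` helper is made up for this example.

```python
REG32 = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]

def disassemble(code: bytes) -> list:
    """Decode a tiny subset of 32-bit x86 by direct opcode lookup."""
    listing = []
    i = 0
    while i < len(code):
        op = code[i]
        if op == 0x90:
            listing.append("nop")
            i += 1
        elif op == 0xC3:
            listing.append("ret")
            i += 1
        elif 0x50 <= op <= 0x57:
            # push r32 is encoded as 0x50 + register number
            listing.append(f"push {REG32[op - 0x50]}")
            i += 1
        elif 0x58 <= op <= 0x5F:
            # pop r32 is encoded as 0x58 + register number
            listing.append(f"pop {REG32[op - 0x58]}")
            i += 1
        elif 0xB8 <= op <= 0xBF and i + 5 <= len(code):
            # mov r32, imm32: opcode 0xB8 + register, then a little-endian dword
            imm = int.from_bytes(code[i + 1:i + 5], "little")
            listing.append(f"mov {REG32[op - 0xB8]}, {imm:#x}")
            i += 5
        else:
            # Anything outside the subset is emitted as a raw data byte.
            listing.append(f"db {op:#04x}")
            i += 1
    return listing
```

Every step is a lookup on the current byte. Nothing here needs to know what the original source looked like, which is the part a decompiler has to reconstruct.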

That said radare2 is still cool, and a GUI (Cutter) is in the works.


Exactly what I was thinking. I remember my sophomore year of CS undergrad I emailed IDA's support asking if there was a student license or some equivalent to learn how to use it. Props to them for at least replying, but their answer was a firm no and was disappointing to say the least.


When I was a student, I once tried to buy a copy of IDA Pro Advanced, as the Standard version didn't have the features I needed.

They refused to sell it to me, unless I bought the standard version and used that for a year (which in retrospect, I perhaps should have done)


It's crazy that companies don't fully recognize this. Any revenue lost on students would be more than made up for when they join the workforce.


Backward R in the logo: this drives me nuts. To any readers of the Cyrillic alphabet it says "Ya". So this says "Yaetargetable Decompiler"

I can't be the only one!


I'm not even Russian and I find this "funny" usage of "Я" and sometimes "И" rather grating.


It's not written in a Cyrillic language. It's written in English.

Nobody comes along and points out that in Spanish the "g" works differently, and it bugs them to see words with "g" in them.

It's one thing to do faux-Cyrillic and get the letters wrong. It's quite another to do something silly to a Latin letter, and get complaints that it resembles a non-Latin letter.


It doesn't resemble the Cyrillic letter---it is it. "R" is an English letter but not a Cyrillic one, and "Я" is a Cyrillic letter but not an English one, and by flipping them horizontally you transform them into each other. I imagine for many of the 7,574,303 people in Russia who speak English and probably also a fair chunk of the 854,955 Americans who speak Russian (and presumably have mastered both alphabets), it's annoying. Not a huge deal, just annoying.

https://en.wikipedia.org/wiki/List_of_countries_by_English-s...

https://en.wikipedia.org/wiki/Russian_language_in_the_United...


There are so many symbols from different languages that resemble each other. If I use a smiley face, that doesn't mean I used a "ü" or a "ツ" from another alphabet just because it looks similar. A backwards R is visually the same as a Cyrillic character, but that doesn't mean I'm writing in Cyrillic, just like a "P" is visually the same as a Cyrillic character but doesn't mean I'm writing in Cyrillic.


Does anyone know why Intel discontinued their Tamper Protection Toolkit? They had an obfuscating compiler that would turn compiled C code into self-encrypting/decrypting code. The idea was that if you disassembled the code at any point you would get mostly garbage instructions. I always wondered how a decompiler could get around that.

https://software.intel.com/en-us/articles/intel-tamper-prote...


Google for anything Rolf Rolles has published on the topic; believe it or not, there are general approaches to solving this. Someone already mentioned dumping the text segment, but that only works for silly 90s-era obfuscators.

Contemporary obfuscators _rewrite_ the protected code as a series of instructions executed on a virtual machine whose bytecode (and bytecode semantics!) are randomly generated at build time. The solution (AIUI) is symbolic execution of the instructions to determine their underlying architectural effect, synthesize some compiler IR that is equivalent to those effects, run an optimization pass (like a regular compiler) over that IR, and finally generate x86 from the result.

The optimization passes are necessary to remove side effects that do not impact the state of the program ("noise"), which modern obfuscators like Themida insert into the instruction stream in large quantities.

In other words, rather than attempt to dump some particular part of the program, the binary as a whole is statically analysed to determine, regardless of the indirections inserted by any obfuscation pass, what machine instructions are ultimately executed for a given program input. The abstract representation is then compiled to an equivalent new program which is much easier to read, because all of the indirections and noise have been optimized away.
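
The "optimize the noise away" step can be sketched with a dead-code elimination pass over a toy straight-line IR. This is a simplified stand-in, not Rolles' actual tooling: the tuple format `(dest, op, args)` and the names below are assumptions made for this example, and the pass assumes instructions have no effect other than writing `dest`.

```python
def eliminate_dead_code(instructions, live_out):
    """Drop instructions whose results never reach a live-out value.

    instructions: list of (dest, op, args) tuples in execution order.
    live_out: names whose final values matter to the rest of the program.
    """
    live = set(live_out)
    kept = []
    # Walk backwards: an instruction matters only if its result is still needed.
    for dest, op, args in reversed(instructions):
        if dest in live:
            live.discard(dest)  # this instruction defines dest
            live.update(a for a in args if isinstance(a, str))  # operands become live
            kept.append((dest, op, args))
    kept.reverse()
    return kept

# Hypothetical lifted code: the "junk" writes are obfuscator noise.
lifted = [
    ("t0", "mov", ("input",)),
    ("junk", "xor", ("t0", 0x5A)),
    ("junk", "add", ("junk", 7)),
    ("t1", "add", ("t0", 1)),
    ("junk", "mov", ("t1",)),
    ("result", "mul", ("t1", 2)),
]
cleaned = eliminate_dead_code(lifted, live_out={"result"})
```

With only `result` live at the end, the three `junk` writes disappear and the remaining three instructions are the computation the obfuscator was hiding.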

When I was reading about Rolles' work initially, I couldn't help but imagine this is the kind of approach Geordi La Forge would have come up with if cracking an encrypted binary were ever the plot for an episode of Star Trek :)


> instructions executed on a virtual machine whose bytecode (and bytecode semantics!) are randomly generated at build time

Like the one built into Windows: https://github.com/airbus-seclab/warbirdvm


iirc it wasn't very advanced. They would 'mov' the decrypted instructions to a region in memory (always the same one), execute them, then save register state and go on to decrypt the next set of instructions.

Breaking it involved monitoring the memory for the decrypted instructions, and dumping them right before they were executed. I don't remember if there were any additional complications with stuff like conditional jumps.
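
A toy simulation of that attack, to make the idea concrete. Everything here is an assumption for illustration: the single-byte XOR "encryption", the `run_protected` loop and the hook are invented and say nothing about how Intel's toolkit actually worked.

```python
KEY = 0x5A  # hypothetical single-byte XOR key

def xor_bytes(data: bytes, key: int = KEY) -> bytes:
    return bytes(b ^ key for b in data)

def run_protected(encrypted_blocks, on_decrypt):
    """Model a protector that decrypts one block at a time into one fixed region."""
    scratch = bytearray()
    for index, block in enumerate(encrypted_blocks):
        scratch[:] = xor_bytes(block)      # the same region is overwritten each time
        on_decrypt(index, bytes(scratch))  # monitor hook fires just before "execution"
        # A real protector would jump into the scratch region at this point.

def dump_program(encrypted_blocks) -> bytes:
    """Collect every decrypted block as it appears and stitch them together."""
    dumped = {}
    run_protected(encrypted_blocks, lambda i, code: dumped.__setitem__(i, code))
    return b"".join(dumped[i] for i in sorted(dumped))
```

Because the plaintext always shows up in the same place right before it runs, the monitor never has to understand the encryption at all.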


This is something that code protectors and viruses have been doing for ages. Not surprised Intel couldn't make this into a sellable product.


How would this work if you were to run it under QEMU and just dump the code segment after decryption?


The point is that there is no "after decryption", at any given moment in time only a small portion of the code is decrypted.


There has been some malware that did just that - it was still possible to record the trace of instructions being executed along with the current instruction pointer to be able to reconstruct the binary quite well.
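
A minimal sketch of that reconstruction, assuming the trace is available as `(address, instruction_bytes)` pairs from some tracer. The trace format and the `rebuild_from_trace` name are made up for this example.

```python
def rebuild_from_trace(trace, base: int, size: int, filler: int = 0xCC) -> bytes:
    """Rebuild a code image from (address, instruction_bytes) trace records.

    Bytes that were never executed stay as the filler value.
    """
    image = bytearray([filler]) * size
    for address, insn_bytes in trace:
        offset = address - base
        # Ignore records that fall outside the region being rebuilt.
        if 0 <= offset and offset + len(insn_bytes) <= size:
            image[offset:offset + len(insn_bytes)] = insn_bytes
    return bytes(image)

# Hypothetical trace: instructions arrive in execution order, with repeats.
trace = [
    (0x1000, b"\x55"),
    (0x1003, b"\x5d"),
    (0x1001, b"\x89\xe5"),
    (0x1004, b"\xc3"),
    (0x1000, b"\x55"),
]
image = rebuild_from_trace(trace, base=0x1000, size=6)
```

The obvious limitation is coverage: code paths that never ran during tracing are simply missing from the rebuilt image.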


>"As we announced in our Botconf 2017 presentation at the beginning of December (slides), RetDec, our machine-code decompiler, is now open, which means anyone can freely use it, study its source code, modify it, and redistribute it."

The slides linked above make it look like this was a really fascinating talk.

Does anybody know whether this presentation was recorded, or if it will be made available? I would love to watch this.


There was a live stream, but nothing appears to be available yet.

https://twitter.com/Botconf


Yeah I also checked youtube with various search terms and came up empty. Hopefully someone from Avast or Botconf will read this and post them :)

It looks like some conference presentations from years past have made it to youtube.


Really cool stuff. I don't like being negative when it comes to fantastic moves like this, but I'm still really disappointed that it doesn't support 64-bit executables.


A reddit comment had a pretty good explanation.

Most malicious code is still written in 32-bit since 64-bit Windows supports running 32-bit code.

Write something in 32-bit - target 100% of devices. Write something in 64-bit - target ~50% of devices.


But if security software only treats 32-bit executables as suspicious, wouldn't it make sense for malware creators to switch to 64-bit?


Is any security software that easily bypassed?


Reddit implies that it's on their roadmap.

https://www.reddit.com/r/programming/comments/7jhk6p/avast_o...


Maybe that's why they're open sourcing this one? They have their own internal new one that does support AMD64.


Agreed.

x86, ARM, MIPS, PIC32, PowerPC, but not x86-64. Impressive list, but an odd choice.


IDA gives away x86 and charges for x86-64, maybe they're going for a similar freemium model?


I think the problem is technical rather than commercial: https://github.com/avast-tl/retdec/issues/9.

x86_64 has calling conventions (namely __fastcall) that are more inconvenient to decode than x86 _cdecl or __stdcall, where all arguments are passed on the stack. Most symbolic engines work only on x86 for the same reason.
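
To make the difference concrete, here is a small sketch of where the first few integer arguments live under each convention. The register lists are the documented ones for the Microsoft x64 and System V AMD64 ABIs; the table layout and the `stack[k]` notation (the k-th stack-passed argument, not a byte offset) are simplifications made up for this example, and floating-point or struct arguments are ignored.

```python
INTEGER_ARG_REGISTERS = {
    "cdecl": [],    # 32-bit x86: everything on the stack
    "stdcall": [],  # 32-bit x86: everything on the stack, callee cleans up
    "ms_x64": ["rcx", "rdx", "r8", "r9"],
    "sysv_amd64": ["rdi", "rsi", "rdx", "rcx", "r8", "r9"],
}

def integer_arg_locations(convention: str, count: int) -> list:
    """List where each of `count` integer arguments is passed."""
    registers = INTEGER_ARG_REGISTERS[convention]
    locations = []
    for i in range(count):
        if i < len(registers):
            locations.append(registers[i])
        else:
            # Remaining arguments spill to the stack, in order.
            locations.append(f"stack[{i - len(registers)}]")
    return locations
```

For the stack-only conventions a decompiler can count pushes before a call. With register passing it has to work out which register writes are arguments and which are just leftovers from earlier code.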


I've used retdec before, its output is quite nice. I even had some problems with it (doing dumb stuff like putting in executables that were beyond the limits they imposed on their website) and whoever they had supporting it were quite friendly in helping me anyway.


Looks like it's also relying on LLVM for disassembly? Ouch; that's an incredibly bad idea if you're trying to analyze malicious or unusual code (it's not designed for that), but I guess it's the easiest for a proof of concept like this.

Although, there's no way an AV company doesn't have its own disassembler, but those are almost always treated as trade secrets (especially the stuff that isn't in the spec / the spec is wrong). They'll probably hook it up to that before doing any real work with it themselves.


> proof of concept

They've been working on this for 7 years they said so I don't think it counts as just a PoC.


> Looks like it's also relying on LLVM for disassembly?

[wild speculation here] I suspect they're using llvm to go from an ast to c(++) code since they have tooling for stuff like that.

Now I have to find me a binary-blob kernel module that manufacturers like to put out and see what the C code it spits out looks like -- another wasted day methinks...


You could just ..uh.. acquire a copy of IDA that has the AMD64 decompiler. It's more mature and spits out C code of wildly varying readability, though only for one function at a time.


Isn't it using capstone?


Doesn't look that way? https://github.com/avast-tl/retdec/blob/master/src/bin2llvmi...

Capstone would probably be the best open-source choice for something like this though.


There is also capstone2llvmir in https://github.com/avast-tl/retdec/tree/master/deps


Super cool move. Always interesting to look at techniques for this type of stuff. It always feels like black magic.


This is awesome! I had been using retdec.com since before Avast bought AVG (where RetDec was originally developed). I'm very excited to do away with the limits the website imposed.


Why would they do it? This is their competitive advantage.


Modern AV software uses a virtual machine and a decompiler in tandem. This is not their only competitive advantage.


Now the usual for large projects like this one (this is just for the main repo):

  $ loc .
  --------------------------------------------------------------------------------
   Language             Files        Lines        Blank      Comment         Code
  --------------------------------------------------------------------------------
   C++                    587       202592        23441        43727       135424
   C/C++ Header           450        34934         6371        11733        16830
   Bourne Shell            10         2363          247          518         1598
   Plain Text              16          827           46            0          781
   Autoconf                 1         2507          551         1635          321
   Python                   1          195           32           22          141
   Markdown                 2          162           45            0          117
   ASP.NET                  2            2            0            0            2
  --------------------------------------------------------------------------------
   Total                 1069       243582        30733        57635       155214
  --------------------------------------------------------------------------------
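
For anyone wondering how the columns relate: in each row, Lines is the sum of Blank, Comment and Code (for C++, 23441 + 43727 + 135424 = 202592). A rough sketch of that classification, assuming only `//` line comments; this is not how `loc` itself is implemented, and block comments or strings containing `//` are not handled.

```python
def count_lines(source: str, line_comment: str = "//") -> dict:
    """Classify each line of `source` as blank, comment or code."""
    counts = {"lines": 0, "blank": 0, "comment": 0, "code": 0}
    for line in source.splitlines():
        counts["lines"] += 1
        stripped = line.strip()
        if not stripped:
            counts["blank"] += 1
        elif stripped.startswith(line_comment):
            counts["comment"] += 1
        else:
            # A line with code and a trailing comment counts as code.
            counts["code"] += 1
    return counts

sample = "// add two ints\nint add(int a, int b) {\n\n    return a + b; // sum\n}\n"
result = count_lines(sample)
```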


> Now the usual for large projects like this one (this is just for the main repo):

I don't understand what you're trying to say. The usual... what?


Oh, I thought that we do this sort of thing whenever there's a new open source project posted here. People seem to want to know what it's made of and potentially its (S)LOC.


I've never seen anyone do that on here.


But I like it. GitHub is all like "98.6% C++, 1.4% Other", which is... an understandable heuristic, but very meh.


I wish GitHub let me customise which directories to include. My project shows up as 74% Ruby, but a lot of that is some specs that aren't really part of the project, and in reality it's 60% Ruby.


To some extent, you can influence GitHub's classification via gitattributes:

https://github.com/github/linguist
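
For the specs case above, something along these lines in a `.gitattributes` file should work. The `linguist-vendored` and `linguist-documentation` attributes are documented in the linguist README; the `spec/` and `docs/` paths are just assumptions about the project layout.

```
# .gitattributes
# Exclude the specs from the language statistics
spec/** linguist-vendored
# Treat a docs directory as documentation rather than code
docs/** linguist-documentation
```

The statistics are recomputed on the next push, so the change does not show up until then.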


Didn't know about loc or this thing, thanks for bringing it up.


I wonder if there is a website/api that will give you loc of pypi or npm packages.

Maybe in the form of a forum badge or certificate like pdf.


So there are two lines of ASP.NET code? Do I read this correctly?


Two single-line files that have a file extension making them look like ASP.NET code is how I read that.

It could be placeholder scripts that redirect to the correct location, for instance a default.asp/index.asp file that does nothing but redirect to index.html or some other default that IIS doesn't recognise out of the box. This would catch cases where someone has just dumped the web assets on an IIS share (IIS doesn't, or at least didn't use to, consider index.html as a potential default document). In classic ASP this would be something like the line:

    <% response.redirect "index.whatever-extension" %>

It could also be false positives in whatever test is being performed. For a project like this, I suspect that is more likely to be the case.


Probably a misclassification. It's not 100% accurate but it gives a good idea of the size and languages of the code base.



