Hacker News
BARF – A Binary Analysis and Reverse Engineering Framework (github.com)
91 points by galapago on Apr 8, 2015 | 27 comments

Neat! I started with radare2 recently. It has quite a steep learning curve, but once you understand that it works like Vim (: commands etc.), you are flying! It's worth spending time learning it. Just make sure you pull from git daily, as they make several commits a day. I did a few CTFs with it (I'm trying to learn one tool, but know it well). It can binary patch and detect variables/arguments to functions, and you can rename both args and function names, which makes the code even easier to read. The only problem is that the documentation sucks. You have "cursor mode", so you can essentially move line by line in your asm view, hit return on a jump, and it moves you to the jump location (I like that). It also has an ASCII-based graph: I do VV (that's the command) and it prints the graph. From there you can move between jumps/calls by pressing tab, but even better, if you press t (for true) it moves you to the block that would be followed if the jump/call was taken, and f (for false) if it wasn't. Also very neat. And that's probably like 1% of what it can do. Amazing project.

"Not in here mister! This is a Mercedes!"

Can anyone recommend a good tool for exploring and decoding other binary formats? I am interested in analysis of non-code binary data.

Maybe binglide (https://github.com/wapiflapi/binglide). A few alternatives are also detailed in its README.

What a great project. Thanks for sharing.

Sweetscape's 010 Editor [1] is very nice, but it's not free.

[1] http://www.sweetscape.com/010editor/

What binary formats are you interested in? For my graduate thesis I had to read a lot about the JPG, PNG, SQLite3 and MS-OLE file formats, and I could give you some references to read and shamelessly link to my GitHub project, where you can find some tools related to these (and other) file formats.

How would you diagnose an unknown compression format? That's a problem I've encountered recently and I'm hoping there is an easier way than stepping through the reference compressor in a debugger.

If you're looking to find out what it is and what's in it, there are some very flexible unpackers out there. Titanium Core from Reversing Labs is excellent.

I already know what it is, and it isn't malware or a packer. What I'm looking for is something that can analyze a bit of data and say what compression scheme is likely used.
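One cheap first pass, before reaching for a debugger, is to try the common codecs and see which ones accept the data. This is only a sketch: it covers just the codecs in the Python standard library, the function name guess_compression is made up for illustration, and a custom or proprietary scheme will simply match nothing.

```python
import bz2
import gzip
import lzma
import zlib


def guess_compression(data):
    """Return the names of stdlib codecs that decompress `data` without error."""
    candidates = {
        "zlib": zlib.decompress,
        "gzip": gzip.decompress,
        "bz2": bz2.decompress,
        "lzma/xz": lzma.decompress,
    }
    matches = []
    for name, decompress in candidates.items():
        try:
            decompress(data)
        except Exception:
            # Any failure just means "probably not this codec".
            continue
        matches.append(name)
    return matches


# Demo on data whose format we know, since no real sample is available here.
sample = bz2.compress(b"hello hello hello hello")
print(guess_compression(sample))
```

If the compressed stream sits somewhere inside a larger file, the same function can be run on slices starting at each candidate offset.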

Ah. Well, the product I mentioned can usually identify packing schemes and unpack them. It's commercial and closed-source, though.

I work in and around embedded software. Custom APIs and internal dumps often contain custom binary formats.

A tool that interactively generates a parser for these would be great.

For "free exploring" of formats I usually dive in with the IPython interpreter and a bunch of modules; struct and array are helpful for packing/unpacking values.

With struct you write a string that describes the data and the module packs/unpacks the bytes. For example you can do: struct.unpack(">4sBhLq", data) and you are asking struct to parse a 19 byte long string, data (it has to be strictly 19 bytes), because you expect the following:

* ">" big endian data
* "4s" a 4 byte field to read as a string
* "B" one unsigned byte
* "h" one signed short (2 bytes)
* "L" one unsigned long (4 bytes)
* "q" one signed quad (8 bytes)

I haven't read of anything that's automatic; if you have the API or some docs, this is the closest I can think of. Of course, the online help of Python covers this module in greater detail: https://docs.python.org/2/library/struct.html
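A minimal sketch of that format string in action, written for Python 3 (where data is a bytes object rather than a str). The record contents and the field names are invented purely for illustration.

```python
import struct

fmt = ">4sBhLq"
print(struct.calcsize(fmt))  # 19

# Hypothetical 19-byte record, packed here only so there is something to parse.
data = struct.pack(fmt, b"DEMO", 7, -2, 4096, -1)

# Field names are made up; a real format would get names from its docs.
tag, flags, offset, length, timestamp = struct.unpack(fmt, data)
print(tag, flags, offset, length, timestamp)

# Any other length is rejected outright.
try:
    struct.unpack(fmt, data[:-1])
except struct.error as exc:
    print("rejected:", exc)
```

With the ">" prefix struct uses standard sizes and no padding, which is why the total is exactly 4 + 1 + 2 + 4 + 8 = 19 bytes.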

https://github.com/construct/construct This is the best thing I have ever used for parsing binary file formats.

Construct is lovely. I've used it on everything from MS debug symbols [1] to talking with USB devices [2].

[1] https://github.com/moyix/pdbparse

[2] https://github.com/moyix/fbtools

That looks very interesting!

The examples are really cool; check out the ELF executable binary parser https://github.com/construct/construct/blob/master/construct... or the PNG parser https://github.com/construct/construct/blob/master/construct....

"Hachoir" is also very good. It allows you to "browse" and edit any binary stream just like you browse directories and files. A file is split into a tree of fields, where the smallest field is just one bit. It has parsers for JPEG, PNG and many more. Written in Python:


Interesting. A few questions:

Is it possible to backtrack using this framework? (I.e., execute code, stop it, go back a bit, continue, etc.)

Is it possible to save/restore the machine state to a file?

And can it deterministically (re)run multi-threaded code? And other (normally non-deterministic) OS functions?

> Is it possible to backtrack using this framework? (I.e., execute code, stop it, go back a bit, continue, etc.)

In the develop branch we have a fully integrated debugger (using python-ptrace), so this should be doable.

> Is it possible to save/restore the machine state to a file?

Nope. If you want to do this, you should try to find a more VM-oriented framework (pyqemu maybe?)

> And can it deterministically (re)run multi-threaded code? And other (normally non-deterministic) OS functions?

Not yet, but if you instrument some functions using the debugger in the develop branch, you could do it.

edit: PANDA is also a very good option (https://github.com/moyix/panda)

I myself reverse engineered two file formats: the Movie Magic Scheduling database files (it's an application-specific project management tool for motion picture and TV production), as well as the Zeni 4 Electronic Design Automation (i.e. Electronic CAD) Physical Design documents.

I've been thinking just in the last few days that it would be cool to write up how I did it.

The basic way is to create a document with nothing in it, then a second document with a very small difference. For example in Movie Magic I had one new document that I saved without putting anything in it, then a second where I entered the letter "A" into a certain field.

I made hex dumps of them both, then compared the hex.

Then I would make some modest changes and additions, such as the letter "B" in a different field, or I would change the "A" in that first field to "AB".
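The compare step can be scripted rather than eyeballed. A rough sketch: the two byte strings below are invented stand-ins for the "empty" and "one letter A" documents, and in practice they would come from open(path, "rb").read().

```python
def diff_bytes(old, new):
    """Yield (offset, old_byte, new_byte) wherever the two inputs differ.

    A byte missing from the shorter input is reported as None.
    """
    for offset in range(max(len(old), len(new))):
        a = old[offset] if offset < len(old) else None
        b = new[offset] if offset < len(new) else None
        if a != b:
            yield offset, a, b


# Hypothetical file contents, not taken from any real format.
empty_doc = bytes.fromhex("4d4d5301 00000000 00000000")
with_a = bytes.fromhex("4d4d5301 00000001 41000000")

for offset, a, b in diff_bytes(empty_doc, with_a):
    print(f"offset {offset:#06x}: {a} -> {b}")
```

In this made-up pair the change shows up in two places, a count field and the letter itself, which is the kind of pattern that hints at a length-prefixed string.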

This approach only really works if you've had experience implementing file format code; it's particularly helpful to have designed original formats completely from scratch. For these specific reasons I am better than most at file format reverse-engineering; however, I don't really have a clue about doing that for network protocols, as the Samba folks did.

Once I have some guess about the file format, I start writing a file interpreter, typically to dump the binary format into human-readable text. "Actor's Name: A", "Movie Title: B".

More importantly, that binary file dumper is chock-full of assertions. Getting the assertions right matters more than displaying the actual file payload values.

Loosely speaking, a set of correctly-implemented assertions is, in itself, the documentation for the file format you just reverse-engineered.
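To make the idea concrete, here is a sketch of such a dumper for a made-up record layout. The magic value, the version numbers, the "reserved" byte and the function name dump_record are all hypothetical; the point is only that each guess about the format becomes an assertion that fails loudly on the first file that contradicts it.

```python
import struct

HEADER = ">4sHBB"  # magic, version, reserved, name length (all assumed)
HEADER_SIZE = struct.calcsize(HEADER)  # 8 bytes


def dump_record(blob):
    """Parse one record of a made-up format, asserting every assumption."""
    assert len(blob) >= HEADER_SIZE, "record shorter than the fixed header"
    magic, version, reserved, name_len = struct.unpack_from(HEADER, blob, 0)
    assert magic == b"DEMO", f"unexpected magic {magic!r}"
    assert version in (1, 2), f"unknown version {version}"
    assert reserved == 0, f"reserved byte is {reserved:#x}, guess was 'always 0'"
    assert len(blob) == HEADER_SIZE + name_len, "bytes missing or left over"
    name = blob[HEADER_SIZE:HEADER_SIZE + name_len].decode("ascii")
    return {"version": version, "name": name}


record = b"DEMO" + struct.pack(">HBB", 1, 0, 1) + b"A"
print("Actor's Name:", dump_record(record)["name"])
```

Read top to bottom, the assertions double as a description of the layout, which is the sense in which they document the format.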

It's also important to round-trip. So I come up with some very simple text input format, then a filter that creates documents that are readable by Zeni or Movie Magic, then I try to open the files. If that doesn't puke on my shoes, I edit the file a bit, then run that edited file through my human-readable dumping program.
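A small sketch of the round-trip check, using an invented length-prefixed layout and hypothetical encode/decode helpers. It can only show that the writer and the dumper agree with each other; whether the original application accepts the file still has to be tested by opening it there.

```python
import struct

FIELDS = ("actor", "title")  # field order is an assumption of this sketch


def encode(fields):
    """Serialize the fields as 2-byte big-endian length + UTF-8 bytes, in order."""
    out = b""
    for key in FIELDS:
        raw = fields[key].encode("utf-8")
        out += struct.pack(">H", len(raw)) + raw
    return out


def decode(blob):
    """Inverse of encode(); asserts that every byte was consumed."""
    fields, pos = {}, 0
    for key in FIELDS:
        (length,) = struct.unpack_from(">H", blob, pos)
        pos += 2
        fields[key] = blob[pos:pos + length].decode("utf-8")
        pos += length
    assert pos == len(blob), "unparsed trailing bytes"
    return fields


original = {"actor": "A", "title": "B"}
blob = encode(original)
assert decode(blob) == original      # text -> binary -> text
assert encode(decode(blob)) == blob  # binary -> text -> binary
```

Checking both directions matters: the second assertion catches cases where the dumper silently drops or normalizes something the writer would need to reproduce.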

Lather, rinse and repeat. It is very tedious and slow but it is quite cool when you discover something that works.

I've been puzzling over how I could offer a consulting service where I would reverse-engineer documents as a service to other companies. No doubt there is lots of need for this but I am unclear as to how to market it.

The DMCA only covers Digital Restrictions Management. If the DMCA doesn't apply, as it did not in the case of files that you create yourself as with Movie Magic Scheduling or Zeni 4, while there are some procedures one must take care to follow, reverse-engineering is completely legal.

Another way to put it, is that among the reasons we have patents, is so that reverse-engineering won't be necessary.

That tip about assertions is great! It would tell you immediately when one of your assumptions is wrong...

One tip I have, which might be obvious: Write docs as you work. While I was working on an app that reads MS Access files, I simultaneously worked on a guide that documented everything I found out: http://jabakobob.net/mdb/

That guide will be useful when I need to fix a bug in the future, and I hope it will also be useful for other people.

Indeed. I didn't mention that, but that's something I do myself.

Really the written documentation is far more important than any code. Consider someone who wanted to write a file filter in some other programming language. I use C or C++, they don't look a whole lot like Python or Haskell.

Really the written document should be regarded as authoritative, and not any of the code.

However, it's not clear how to go from a written specification to a working file filter.

One way that would work well but that would be laborious, would be to give the doc to someone who had no prior knowledge of the format in question, so they could write their own filter.

If I gave such a doc to you, then you and I could round-trip our files back and forth between my implementation and yours, then we could be quite confident that the document was correct.

To this very day, OpenOffice will not round trip my resume between .odt and .doc. The error is subtle but I can see it every time I try.

I just pulled this out of my ass a little while ago. I can do much better and go into a lot more detail, but not just now.

I'll post a far better discussion later after I write up an article for my website.

However I figured it would be helpful to post this particular piece so as to stimulate discussion that I can reflect in my upcoming article.

I'm looking forward to it. Hopefully I can learn from your post.

At the moment I proceed like you outlined above, but I'm stuck reverse engineering a save game file format from an old sports simulation game written in VB6. The program seems to write the savefiles with the put command (__vbaPut3 in the assembly) and reads the savefiles with get (__vbaGet3).

Even though this is my first try at reversing a file format, thanks to Sweetscape's 010 Editor I was able to make sense of one third of the file, but I've been stuck ever since. :D

What version of Windows does it run on?

If it's one of the DOS-based versions, you'll find junk in your files. That is, suppose your savefile was 512 bytes long, however it actually wrote 510 bytes starting at an offset of 2; the first two bytes might be, say, 0x1234, rather than the 0x0000 that anyone would reasonably expect.

There are two ways to deal with this: the most straightforward way would be to use an NT-based Windows - NT4, Win2k, XP &c.

But if your game won't run on NT (that's the case for many, as they write directly to the video card and sound card), an acceptable solution would be to write a tool that fills up all the free space on your filesystem with zeroes.
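A rough sketch of such a zero-filling tool. The function name, the temp file name and the max_bytes safety cap are my own additions for illustration; to really fill a disk you would pass a cap larger than the free space and let the "disk full" error end the loop. Whether the zeroes actually land on every free block depends on the filesystem, so treat this as an approximation on anything modern.

```python
import os


def zero_fill(directory, max_bytes, chunk_size=1 << 20):
    """Write up to max_bytes of zeroes to a temp file in `directory`, then delete it.

    Returns the number of bytes written. Stops early if the disk fills up.
    """
    path = os.path.join(directory, "zerofill.tmp")
    chunk = bytes(chunk_size)  # chunk_size zero bytes
    written = 0
    try:
        with open(path, "wb", buffering=0) as f:
            try:
                while written < max_bytes:
                    written += f.write(chunk[: max_bytes - written])
                os.fsync(f.fileno())  # push the zeroes out to the device
            except OSError:
                pass  # typically "no space left on device", which is the goal
    finally:
        # Deleting the file hands the now-zeroed blocks back to the filesystem.
        if os.path.exists(path):
            os.remove(path)
    return written
```

After this runs, newly created files on an old system that does not clear blocks should pick up zeroes instead of leftover junk, which makes the hex diffs much easier to read.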

That's what I myself actually did for Movie Magic, as I was using Mac OS 8.1 at the time - long before we had Mac OS X.

And that right there was my most significant competitive advantage; as far as I have ever been able to tell, no one else has ever figured out how to do that.

I am willing to spill the beans as nowadays everyone uses at least allegedly "secure" operating systems.

Finally you have to be concerned about junk returned by the system memory allocator. Even on *NIX, malloc doesn't initialize its buffers; however, the kernel initializes memory to zero when you grow the process memory with brk or sbrk.

Loosely speaking, when malloc runs out of memory, it asks the kernel for more memory by increasing the break. The break is the boundary between mapped memory and unmapped memory; it grows upward. The stack starts towards the top of the address space and grows downward. Between the two is unmapped memory; while strictly speaking it is addressable, actually accessing it will cause a segmentation violation.

On a secure system, decrementing the stack pointer or incrementing the break will result in zero-filled memory.

On MS-DOS, Windows 98, Mac OS 8 &c. that memory will be full of junk.

I don't know how I'd do this on Windows, but on the Classic Mac OS it was not hard at all to patch the system, so I'd hack the Memory Manager to always initialize the buffers it returns.

Thanks for the insight. At the moment this sounds a little bit too advanced for me, as I am a beginner at reversing and a novice programmer, but these are nice avenues I can explore.

The game runs on Windows XP upwards. Luckily it seems like it does not compress or encrypt the savefiles.

Another possibility - and lots of folks do this - would be to reverse-engineer formats all on my own, without pay, but then to sell either format source code, binary executables or specification documents.

Each of these works really well for various kinds of applications.

The AutoCAD binary format is apparently quite complex. Last I checked, it had been reverse-engineered, but the people who accomplished that work charged quite a lot for the written specification. On the other hand, that spec is doubtless worth lots of money to the kind of people who need it.

A while back my ex asked me to buy her a compact digital audio recorder. I picked one up at Fry's for eighty bucks; however, I did not realize that it only wrote to WMA, and Bonita had an iMac.

I asked her to send it back to me so I could exchange it for one that saved to MP3 or Ogg Vorbis or AAC (the "iTunes Audio" format), but no, Bonita is quite resourceful, so she purchased an inexpensive WMA to AAC converter utility.

That would be quite a lot of work but no doubt the guy who sells the utility makes out like a bandit.

I am very much a fan of Free Software, so much so that I personally regard Open Source as The Enemy. So I am uncomfortable with selling executables but not the source.

I need a way to pay the rent, so I can't really do all that work then just give away the source for free. However there are many ways to - oh how I hate this word! - "monetize" source code.

One way might be to publish it under the Affero GPLv3 license, but then sell proprietary licenses. That's what Trolltech did with Qt. That worked real well for them; we all got KDE, whereas Qt has been for many years a popular framework for purely proprietary applications.

Another would be to offer RansomWare. I expect it would be best were I to come out and call it "RansomWare" right up front. Something like a KickStarter campaign, here is a binary, you can see that it really does work, for fifty grand you get the source code.

Another idea would be to sell binaries for a year or so, but then publish the source under a Free Software license.

Another - and this is what I presently do - would be to just publish the source without expecting any money from it, but with the expectation that hosting the download on my own website would attract consulting clients.

There have been times that has worked real well for me but not always, and not lately so I want to get away from doing that.

Another take would be to write directly competing applications. One specific reason that it's legal to reverse engineer, is to enable interoperability. Movie Magic Scheduling is quite costly; it would not be hard at all to write a far better competitor in C++ with ZooLib. I wouldn't even need SQLite as ZooLib has - in some respects - a far better database.

I expect I would do that in the long run but in the short run, writing a complex application then selling it is a HUGE PITA.
