Can anyone recommend a good tool for exploring and decoding other binary formats? I am interested in analysis of non-code binary data.
A tool that interactively generates a parser for these would be great.
With struct you write a string that describes the data and the module packs/unpacks the bytes. For example you can do: struct.unpack(">4sBhLq", data) and you are asking struct to parse a 19 byte long string, data (it has to be strictly 19 bytes), because you expect the following:
* ">" big endian data
* "4s" a 4 byte field to read as a string
* "B" one unsigned byte
* "h" one signed short (2 bytes)
* "L" one unsigned long (4 bytes)
* "q" one signed quad (8 bytes)
I haven't read of anything that's automatic; if you have the API or some docs, this is the closest I can think of. Of course, the online Python documentation covers this module in greater detail: https://docs.python.org/2/library/struct.html
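To make the format string above concrete, here is a minimal sketch. The sample record is made up purely for illustration (the tag b"DEMO" and the field names are hypothetical, not taken from any real format).

```python
import struct

FMT = ">4sBhLq"  # big endian: 4-byte string, u8, i16, u32, i64

# Hypothetical sample record, built by hand purely for illustration.
data = struct.pack(FMT, b"DEMO", 7, -2, 1000, -1)

# With ">" there is no alignment padding, so the sizes simply add up:
# 4 + 1 + 2 + 4 + 8 bytes.
size = struct.calcsize(FMT)

# unpack() raises struct.error unless len(data) matches the format exactly.
tag, flags, delta, count, offset = struct.unpack(FMT, data)
print(size, tag, flags, delta, count, offset)
```

struct.calcsize is a handy sanity check while guessing at a layout: if the size it reports doesn't match the record you are looking at, the guess is wrong somewhere.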
Is it possible to backtrack using this framework? (I.e., execute code, stop it, go back a bit, continue, etc.)
Is it possible to save/restore the machine state to a file?
And can it deterministically (re)run multi-threaded code? And other (normally non-deterministic) OS functions?
In the develop branch we have a fully integrated debugger (using python-ptrace), so this should be doable.
> Is it possible to save/restore the machine state to a file?
Nope. If you want to do this, you should try to find a more VM-oriented framework (pyqemu maybe?)
> And can it deterministically (re)run multi-threaded code? And other (normally non-deterministic) OS functions?
Not yet, but if you instrument some functions using the debugger in the develop branch, you could do it.
edit: PANDA is also a very good option (https://github.com/moyix/panda)
I've been thinking just in the last few days that it would be cool to write up how I did it.
The basic way is to create a document with nothing in it, then a second document with a very small difference. For example in Movie Magic I had one new document that I saved without putting anything in it, then a second where I entered the letter "A" into a certain field.
I made hex dumps of them both, then compared the hex.
Then I would make some modest changes and additions, such as the letter "B" in a different field, or I would change the "A" in that first field to "AB".
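The compare step can be sketched in a few lines of Python. This is only an illustration of the idea, not the tooling I actually used; the two byte strings below are invented stand-ins for an empty document and one with "A" typed into a field.

```python
def diff_runs(a: bytes, b: bytes):
    """Return (offset, bytes_from_a, bytes_from_b) for each run of differing bytes."""
    runs = []
    start = None
    n = min(len(a), len(b))
    for i in range(n):
        if a[i] != b[i]:
            if start is None:
                start = i  # a new run of differences begins here
        elif start is not None:
            runs.append((start, a[start:i], b[start:i]))
            start = None
    if start is not None:
        runs.append((start, a[start:n], b[start:n]))
    if len(a) != len(b):
        # Whatever one file has beyond the other's end counts as a final run.
        runs.append((n, a[n:], b[n:]))
    return runs

# Hypothetical stand-ins for the two saved documents.
empty_doc = bytes.fromhex("4d4d0001" "00000000" "00")
with_a = bytes.fromhex("4d4d0001" "00000001" "41")

for offset, old, new in diff_runs(empty_doc, with_a):
    print(f"0x{offset:04x}: {old.hex()} -> {new.hex()}")
```

A plain byte-by-byte comparison like this stops being useful once an edit shifts everything after it, which is one more reason to keep each change as small as possible.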
This approach only really works if you've had experience implementing file format code; it's particularly helpful to have designed original formats completely from scratch. For these specific reasons I am better than most at file format reverse-engineering; however, I don't really have a clue about doing that for network protocols as the Samba folks did.
Once I have some guess about the file format, I start writing a file interpreter, typically to dump the binary format into human-readable text. "Actor's Name: A", "Movie Title: B".
Rather more importantly, that binary file dumper is chock-full of assertions. It's rather more important to get the assertions right than it is to display the actual file payload values.
Loosely speaking, a set of correctly-implemented assertions is, in itself, the documentation for the file format you just reverse-engineered.
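Here is a rough sketch of what an assertion-heavy dumper looks like. The layout is entirely invented for the example (a b"DEMO" magic, a version word, a length-prefixed name and a zero terminator); it is not the Movie Magic or Zeni format.

```python
import struct

def dump_record(buf: bytes) -> str:
    # Every field below is a guess about an invented layout. The assertions
    # write the guess down, so a file that contradicts it fails loudly.
    magic, version, name_len = struct.unpack_from(">4sHB", buf, 0)
    assert magic == b"DEMO", f"unexpected magic {magic!r}"
    assert version in (1, 2), f"unknown version {version}"
    name_end = 7 + name_len  # the header is 4 + 2 + 1 = 7 bytes
    assert name_end + 1 == len(buf), "length byte disagrees with file size"
    assert buf[name_end] == 0, "trailing byte assumed to be a zero terminator"
    name = buf[7:name_end]
    return f"Actor's Name: {name.decode('ascii')}"

# Hypothetical sample file: header, the name "A", zero terminator.
sample = struct.pack(">4sHB", b"DEMO", 1, 1) + b"A" + b"\x00"
print(dump_record(sample))
```

Each time a real file trips an assertion, either the file is corrupt or the guess was wrong, and fixing the assertion is how the documentation gets corrected.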
It's also important to round-trip. So I come up with some very simple text input format, then a filter that creates documents that are readable by Zeni or Movie Magic, then I try to open the files. If that doesn't puke on my shoes, I edit the file a bit, then run that edited file through my human-readable dumping program.
Lather, rinse and repeat. It is very tedious and slow but it is quite cool when you discover something that works.
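The part of the round trip that can be automated looks roughly like the sketch below. The binary layout (a 16-bit count followed by length-prefixed ASCII fields) is made up for the example, and of course the real test, opening the generated file in the original application, still has to be done by hand.

```python
import struct

def encode(fields):
    """Text fields -> bytes in an invented layout: u16 count, then (u8 length, bytes) per field."""
    out = [struct.pack(">H", len(fields))]
    for value in fields:
        raw = value.encode("ascii")
        assert len(raw) < 256, "field too long for a one-byte length"
        out.append(struct.pack(">B", len(raw)) + raw)
    return b"".join(out)

def decode(buf):
    """Inverse of encode(); asserts that every byte of the input is accounted for."""
    (count,) = struct.unpack_from(">H", buf, 0)
    pos, fields = 2, []
    for _ in range(count):
        n = buf[pos]
        fields.append(buf[pos + 1:pos + 1 + n].decode("ascii"))
        pos += 1 + n
    assert pos == len(buf), "trailing bytes the format guess does not explain"
    return fields

original = ["A", "AB", ""]
blob = encode(original)
assert decode(blob) == original      # text -> binary -> text
assert encode(decode(blob)) == blob  # binary -> text -> binary
```

Checking both directions matters: a decoder that silently ignores bytes will pass the first check and fail the second.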
I've been puzzling over how I could offer a consulting service where I would reverse-engineer documents as a service to other companies. No doubt there is lots of need for this but I am unclear as to how to market it.
The DMCA only covers Digital Restrictions Management. If the DMCA doesn't apply, as it did not in the case of files that you create yourself as with Movie Magic Scheduling or Zeni 4, while there are some procedures one must take care to follow, reverse-engineering is completely legal.
Another way to put it is that one of the reasons we have patents is so that reverse-engineering won't be necessary.
One tip I have, which might be obvious: Write docs as you work. While I was working on an app that reads MS Access files, I simultaneously worked on a guide that documented everything I found out: http://jabakobob.net/mdb/
That guide will be useful when I need to fix a bug in the future, and I hope it will also be useful for other people.
Really the written documentation is far more important than any code. Consider someone who wanted to write a file filter in some other programming language. I use C or C++; they don't look a whole lot like Python or Haskell.
Really the written document should be regarded as authoritative, and not any of the code.
However, it's not clear how to go from a written specification to a working file filter.
One way that would work well but that would be laborious, would be to give the doc to someone who had no prior knowledge of the format in question, so they could write their own filter.
If I gave such a doc to you, then you and I could round-trip our files back and forth between my implementation and yours, then we could be quite confident that the document was correct.
To this very day, OpenOffice will not round trip my resume between .odt and .doc. The error is subtle but I can see it every time I try.
I'll post a far better discussion later after I write up an article for my website.
However I figured it would be helpful to post this particular piece so as to stimulate discussion that I can reflect in my upcoming article.
At the moment I proceed like you outlined above, but I'm stuck reverse engineering a save game file format from an old sports simulation game written in VB6. The program seems to write the savefiles with the put command (__vbaPut3 in the assembly) and reads the savefiles with get (__vbaGet3).
Even though this is my first try at reversing a file format, thanks to SweetScape's 010 Editor I was able to make sense of one third of the file, but I've been stuck ever since. :D
If it's one of the DOS-based versions, you'll find junk in your files. That is, suppose your savefile was 512 bytes long, however it actually wrote 510 bytes starting at an offset of 2; the first two bytes might be, say, 0x1234, rather than the 0x0000 that anyone would reasonably expect.
There are two ways to deal with this: the most straightforward way would be to use an NT-based Windows - NT4, Win2k, XP &c.
But if your game won't run on NT - that's the case for many as they write directly to the video card and sound card - an acceptable solution would be to write a tool that fills up all the free space on your filesystem with zeroes.
That's what I myself actually did for Movie Magic, as I was using Mac OS 8.1 at the time - long before we had Mac OS X.
And that right there was my most-significant competitive advantage; as far as I have ever been able to tell, no one else has ever figured out how to do that.
I am willing to spill the beans as nowadays everyone uses at least allegedly "secure" operating systems.
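For what it's worth, the zero-filling tool is conceptually tiny. The sketch below is a modern Python rendering of the idea, not the original Mac OS 8 tool; the scratch file name and the max_bytes safety cap are my own additions for the example. Whether the filesystem later hands those same blocks to your savefile is up to the filesystem, so treat it as a best-effort trick.

```python
import os

def zero_fill(directory, chunk_size=1 << 20, max_bytes=None):
    """Write zeroes to a scratch file until the disk fills up (or max_bytes
    is reached), then delete the file. Returns the number of bytes written."""
    path = os.path.join(directory, "zerofill.tmp")  # hypothetical scratch name
    chunk = bytes(chunk_size)  # a buffer of chunk_size zero bytes
    written = 0
    try:
        # buffering=0 so each write goes straight to the OS and nothing is
        # left in a Python-level buffer when the disk fills up.
        with open(path, "wb", buffering=0) as f:
            while max_bytes is None or written < max_bytes:
                want = chunk_size if max_bytes is None else min(chunk_size, max_bytes - written)
                try:
                    n = f.write(chunk[:want])
                except OSError:
                    break  # disk full: the free space has been overwritten
                if not n:
                    break
                written += n
            try:
                os.fsync(f.fileno())  # push the zeroes out to the device
            except OSError:
                pass
    finally:
        if os.path.exists(path):
            os.remove(path)  # hand the now-zeroed space back
    return written
```

Run it with a small max_bytes in a scratch directory first; without the cap it really will fill the disk before it cleans up after itself.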
Finally you have to be concerned about junk returned by the system memory allocator. Even in *NIX, malloc doesn't initialize its buffers; however, the kernel initializes memory to zero when you grow the process memory with sbrk.
Loosely speaking, when malloc runs out of memory, it asks the kernel for more memory by increasing the break. The break is the boundary between mapped memory and unmapped memory; it grows upward. The stack starts towards the top of the address space and grows downward. Between the two is unmapped memory; while strictly speaking it is addressable, actually touching it will cause a segmentation violation.
On a secure system, decrementing the stack pointer or incrementing the break will result in zero-filled memory.
On MS-DOS, Windows 98, Mac OS 8 &c. that memory will be full of junk.
I don't know how I'd do this on Windows but on the Classic Mac OS it was not hard at all to patch the system, so I'd hack the Memory Manager to always initialize the buffers it returns.
The game runs on Windows XP upwards. Luckily it seems like it does not compress or encrypt the savefiles.
Each of these work really well for various kinds of applications.
The AutoCad binary format is apparently quite complex. To get the written specification, last I checked, while it had been reverse-engineered, the people who accomplished that work charged quite a lot for the spec. On the other hand, that spec is doubtlessly worth lots of money to the kind of people who need it.
A while back my ex asked me to buy her a compact digital audio recorder. I picked one up at Fry's for eighty bucks; however, I did not realize that it only wrote to WMA - and Bonita had an iMac.
I asked her to send it back to me so I could exchange it for one that saved to MP3 or Ogg Vorbis or AAC - the "iTunes Audio" format - but no, Bonita is quite resourceful, so she purchased an inexpensive WMA to AAC converter utility.
That would be quite a lot of work but no doubt the guy who sells the utility makes out like a bandit.
I am very much a fan of Free Software, so much so that I personally regard Open Source as The Enemy. So I am uncomfortable with selling executables but not the source.
I need a way to pay the rent, so I can't really do all that work then just give away the source for free. However there are many ways to - oh how I hate this word! - "monetize" source code.
One way might be to publish it under the Affero GPLv3 license, but then sell proprietary licenses. That's what Trolltech did with Qt. That worked real well for them; we all got KDE, whereas Qt has been for many years a popular framework for purely proprietary applications.
Another would be to offer RansomWare. I expect it would be best were I to come out and call it "RansomWare" right up front. Something like a KickStarter campaign, here is a binary, you can see that it really does work, for fifty grand you get the source code.
Another idea would be to sell binaries for a year or so, but then publish the source under a Free Software license.
Another - and this is what I presently do - would be to just publish the source without expecting any money from it, but with the expectation that hosting the download on my own website would attract consulting clients.
There have been times that has worked real well for me but not always, and not lately so I want to get away from doing that.
Another take would be to write directly competing applications. One specific reason that it's legal to reverse engineer, is to enable interoperability. Movie Magic Scheduling is quite costly; it would not be hard at all to write a far better competitor in C++ with ZooLib. I wouldn't even need SQLite as ZooLib has - in some respects - a far better database.
I expect I would do that in the long run but in the short run, writing a complex application then selling it is a HUGE PITA.