
Show HN: Binspector – A Binary Format Analysis Tool - MontagFTB
http://binspector.github.io/blog/2014/10/13/binspector-a-binary-format-analysis-tool/
======
danbruc
I tried several times to build such a thing but gave up every time. It is easy
for simple file formats but if you want something universal it gets quite
complex pretty quickly. There are file formats where the endianness is
specified in the file and not statically known, there are file formats where
the bit width of numbers is not statically known, there are (binary) strings
delimited with some special marker and there are many different ways to escape
the delimiter within the string, there are data blocks starting at fixed
offsets, at offsets relative to some other block or the end of the file, there
are offsets computed from different values found in the file, compressed
blocks, unions of different block types discriminated by a tag, tags that need
to be computed from multiple values, ...
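
To make the first case concrete, here is a minimal Python sketch of a format whose byte order is only known after reading the file. It uses the TIFF header, which starts with "II" (little-endian) or "MM" (big-endian), followed by the magic number 42 and the offset of the first IFD. The function name `read_tiff_header` is made up for illustration; it is hand-written and unrelated to any tool discussed here.

```python
import struct

def read_tiff_header(data: bytes) -> tuple[str, int]:
    """Return (struct byte-order prefix, offset of first IFD) from a TIFF header."""
    order_mark = data[:2]
    if order_mark == b"II":      # "Intel": little-endian
        prefix = "<"
    elif order_mark == b"MM":    # "Motorola": big-endian
        prefix = ">"
    else:
        raise ValueError(f"unknown byte-order mark: {order_mark!r}")
    # Every later multi-byte field must be decoded with the order found above.
    magic, first_ifd_offset = struct.unpack(prefix + "HI", data[2:8])
    if magic != 42:
        raise ValueError(f"bad TIFF magic number: {magic}")
    return prefix, first_ifd_offset
```

The returned prefix has to be threaded through every subsequent read, which is exactly the kind of state a purely static description cannot express.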

I would really love such a tool, but a language to describe most of the file
formats in the wild is probably going to be (close to) Turing complete and is
unfortunately nothing you can hack together over a weekend.

~~~
MontagFTB
Let me start out by saying I do not claim the current binspector grammar can
cover all the cases you've described. Yes, at some point binary formats can
get so unwieldy the only way to read them is with some kind of Turing-complete
system.

What I have discovered, however, is that there is still a vast array of
binary formats that can be well described with the format language as it
stands. I have tried to devise workarounds to skip past the parts of formats
that cannot be handled well. I hope to double-back on those limitations and
extend the feature set of the language, but I hope what's in there is enough
to get people started.

Of the issues you mentioned, Binspector's language can handle dynamic field
endianness, dynamic field size, terminators and delimiters,
absolute/relative/dynamic offsets, tag-discriminated unions, and lambda
calculations. It doesn't do everything, but it does allow a lot.
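
For readers unfamiliar with two of the features listed, here is a rough Python sketch of a tag-discriminated union combined with a terminator-delimited string. It is not Binspector's grammar or implementation; the record format (tag 0x01 for a little-endian uint16, tag 0x02 for a NUL-terminated ASCII string) and the name `parse_records` are invented purely for illustration.

```python
import struct

def parse_records(data: bytes) -> list:
    """Parse a hypothetical stream of tagged records into Python values."""
    records = []
    pos = 0
    while pos < len(data):
        tag = data[pos]
        pos += 1
        if tag == 0x01:
            # Variant 1: fixed-size unsigned 16-bit little-endian integer.
            (value,) = struct.unpack_from("<H", data, pos)
            pos += 2
        elif tag == 0x02:
            # Variant 2: ASCII string terminated by a NUL byte.
            end = data.index(b"\x00", pos)
            value = data[pos:end].decode("ascii")
            pos = end + 1  # skip the terminator
        else:
            raise ValueError(f"unknown tag {tag:#04x} at offset {pos - 1}")
        records.append(value)
    return records
```

A declarative format language expresses the same branching and scanning as data rather than as control flow, which is what makes it analyzable.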

~~~
danbruc
You already made it far beyond what I ever achieved. I aimed at an XML-based
file format description and that gets messy pretty quickly once you have to
express expressions. After the initial friction of having to write a parser,
going with a custom language seems way more promising. I always wanted to peek
under the hood of Wireshark and see how they do it - do they have specialized
code for every protocol or do they use an abstract protocol description, too?
I never did it. I hope your project matures, there is a lot of space in this
niche.

------
davelnewton
I wrote a tool I called the "data file disassembler" somewhere around 15 years
ago that is basically this, along w/ a user interface that allowed you to drag
a selection area and define a "section" and its type.

I used this primarily for reverse-engineering proprietary formats, both for
personal entertainment and, while working for a legal firm, to determine
when/if patents were being violated.

I had a variety of supported formats, including several microcontroller
instruction sets, as I spent a lot of time disassembling ROMs. These had full
support for labels etc.

This seems very similar.

~~~
kennytm
On Mac there's a GUI tool, ["Synalyze It!"][1], with which one can apply a
grammar to binary files. But it's not free anymore.

[1]: [https://www.synalysis.net](https://www.synalysis.net)

~~~
MontagFTB
There was another tool called General Edit
([http://www.quadrivio.com/ge.html](http://www.quadrivio.com/ge.html)) which
used to be available for the Mac but is now defunct. It was one of the
inspirations behind Binspector and sounds very similar to the tool you
mentioned.

edit: grammar

~~~
delhanty
The "Synalyze It!" grammar format is XML, and the developer has some sample
grammars for download:

[https://www.synalysis.net/formats.xml](https://www.synalysis.net/formats.xml)

Having wanted a binary grammar format for a while, and just purchased the Pro
version of "Synalyze It!" as a starting point, it would be interesting to hear
a short non-specialist's summary of the relative merits of your approach to
expressing grammars compared with an XML based approach like "Synalyze It!"
...

Overall, "Synalyze It!" seems mostly fairly stable and certainly a useful tool
well worth the purchase price for the Pro version. However, it is closed
source, by a small developer, on a single platform: would much prefer to rely
on something with source code available ...

~~~
MontagFTB
What if you compared two descriptions side by side? What about PNG, for which
a grammar exists for both:

[https://www.synalysis.net/Grammars/png.grammar](https://www.synalysis.net/Grammars/png.grammar)

v.

[https://raw.githubusercontent.com/binspector/binspector/mast...](https://raw.githubusercontent.com/binspector/binspector/master/bfft/png.bfft)

Anything I'd say at this point would be biased, but I would be interested in
continuing the conversation.

~~~
delhanty
A comparison w.r.t. the PNG grammar sounds like an excellent starting point.

I could see educational value (and more) in attempting a bi-directional
translator between the two grammar formats.

------
steventhedev
Link to the repo for the lazy:
[https://github.com/binspector/binspector](https://github.com/binspector/binspector)

Interesting project. Is there support for importing/exporting from a C header?
Or generating file import code given a bfft file?

~~~
MontagFTB
There isn't any support for importing from a C header. In order to get it
right you would need a decent C parser as well as details about the
compile-time environment (e.g., char being (un)signed). At this point it's probably
faster to code up the format grammar manually.

Generating import code given a bfft is a thought I had considered. One of the
features I would like to implement is to separate the parse tree generator
from the analyzer and expose an analyzer API. Applications would be able to
use the Binspector core as a library, then, and read file contents directly by
providing their own bffts and hooking the API to populate their own internals.

edit: spelling.

~~~
steventhedev
Very good point. Perhaps you could use libclang?

Regarding the import code, even if you start with just emitting a C header, it
would make it easier to use your tool to design file formats.

~~~
MontagFTB
Good idea leveraging Clang for struct round-tripping. Yet another thing to
look into!

------
fdb
Looks really cool! I'd love to see an open grammar definition format for all
kinds of tools, both command-line and GUI.

I've worked on parsing TrueType files, and they have some really "interesting"
grammars. For example, they have lookup tables that define offsets at the
beginning of the file. Also, a format flag at the beginning of a table might
define the structure of the rest of the table.

It seems that once you decide you want to be able to parse all of this, your
grammar will turn into a Turing-complete language. Have you considered this?
Where do you stop?

~~~
MontagFTB
Fonts have been notorious for being ill-defined, and this can wreak havoc on
applications that do not handle them properly. I would be very interested in
seeing some font related bffts surface for the purposes of font validation and
analysis.

I have concerns about Turing-completeness, and am interested in maintaining
the declarative nature of bffts. The grammar has drifted from that goal,
and I am not entirely sure how to bring it back.

~~~
fdb
I have an interest in helping out. I've created opentype.js, an OpenType
parser and generator
([https://github.com/nodebox/opentype.js](https://github.com/nodebox/opentype.js)).

I understand your concern for keeping the format declarative, but I'm not
entirely sure you can handle TTF files that way, as you'll run into walls very
quickly.

For example, to know the number of entries in the `loca` table, you have to
parse the `numGlyphs` field from the `maxp` table. The `maxp` table doesn't
have to come first...
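
A rough Python sketch of that dependency, assuming the standard sfnt layout (a 12-byte offset table followed by 16-byte table records, and `numGlyphs` stored as a uint16 right after the 32-bit version in `maxp`). The function names are made up for illustration. The trick is to index the table directory first, so `maxp` can be read no matter where it sits relative to `loca`:

```python
import struct

def read_table_directory(font: bytes) -> dict:
    """Map each 4-byte table tag to its (offset, length) within the font."""
    # Offset table: sfnt version (uint32), numTables (uint16), then three
    # uint16 binary-search hints that are not needed here.
    _version, num_tables = struct.unpack_from(">IH", font, 0)
    tables = {}
    for i in range(num_tables):
        # Each 16-byte record: tag, checksum, offset, length.
        tag, _checksum, offset, length = struct.unpack_from(
            ">4sIII", font, 12 + 16 * i
        )
        tables[tag] = (offset, length)
    return tables

def loca_entry_count(font: bytes) -> int:
    """Number of offsets in `loca`: numGlyphs + 1, with numGlyphs from `maxp`."""
    tables = read_table_directory(font)
    maxp_offset, _length = tables[b"maxp"]
    # maxp starts with a 32-bit version, followed by numGlyphs (uint16).
    (num_glyphs,) = struct.unpack_from(">H", font, maxp_offset + 4)
    return num_glyphs + 1
```

In imperative code the random access is trivial; the hard part is saying "resolve `maxp` before `loca`" in a grammar that otherwise reads the file front to back.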

~~~
MontagFTB
That sounds positively diabolical. I'll have to think through this one some
more, but off hand I wonder if something can be done with Binspector's
slot/signal mechanism to detect when all the necessary pieces are in place.
Something to ponder.

------
zwischenzug
Attempting to dockerize this...

Build script here:

[https://github.com/ianmiell/shutit/blob/master/library/binsp...](https://github.com/ianmiell/shutit/blob/master/library/binspector/binspector.py)

Get this error:

./smoke_test.sh: line 8: ./bin/debug/binspector: No such file or directory

anyone know why?

~~~
MontagFTB
I'm not familiar with Docker, but something might have failed in the configure
or build phases. What kind of output are you getting from those scripts?

~~~
zwischenzug
Thanks - I'll pastebin output later when I'm off the tube.

~~~
zwischenzug
I think it's that ./b2 was not run.

~~~
MontagFTB
That's strange - the build.sh script should run b2 as long as the $BUILDMODE
is either not set or is set to 'bjam'.

I write rudimentary bash. If there's a problem in build.sh, debugging the
script should be straightforward.

~~~
zwischenzug
[http://pastebin.com/x9iHg5e0](http://pastebin.com/x9iHg5e0)

~~~
MontagFTB
According to your pastebin, cstddef cannot be found. This is a standard C++
header. An environment issue, perhaps?

[http://en.cppreference.com/w/cpp/header/cstddef](http://en.cppreference.com/w/cpp/header/cstddef)

~~~
zwischenzug
OK, done:

docker pull imiell/binspector

[https://registry.hub.docker.com/u/imiell/binspector/](https://registry.hub.docker.com/u/imiell/binspector/)

Bit of a hack required to get it compiled; there's probably a better solution:

[https://github.com/ianmiell/shutit/blob/master/library/binsp...](https://github.com/ianmiell/shutit/blob/master/library/binspector/binspector.py)

------
wlievens
Looks cool, but in my view a key feature is: can it _save_ a file back, using
the same grammar?

~~~
MontagFTB
(This response is a repost from here:
[http://binspector.github.io/blog/2014/10/13/binspector-a-bin...](http://binspector.github.io/blog/2014/10/13/binspector-a-binary-format-analysis-tool/#disqus_thread))

I have given a lot of thought to the question of output. Throughout the
parse tree there is an implicit DAG of dependencies - this value affecting
that read operation over there, etc. The real trick in making a generic output
routine from a bfft is reversing this DAG, so e.g., if I add a pixel down here
the parse tree can re-stabilize automatically into something valid. This also
opens the door to generic editing of binary formats. My understanding is that
DAG reversal is an NP-complete problem, but I suspect with file formats we're
dealing with a subset of the space and it might not be as difficult as I am
imagining.
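
As a small illustration of the easy end of that spectrum, here is a Python sketch of writing a single PNG chunk, where the dependent fields (the length and the CRC) are recomputed from the payload instead of being stored independently. The name `build_chunk` is hypothetical and this is hand-written code, not something generated from a bfft:

```python
import struct
import zlib

def build_chunk(chunk_type: bytes, payload: bytes) -> bytes:
    """Serialize one PNG chunk, deriving length and CRC from the payload."""
    if len(chunk_type) != 4:
        raise ValueError("PNG chunk type must be exactly 4 bytes")
    length = struct.pack(">I", len(payload))
    # The CRC covers the type and data fields, but not the length field.
    crc = struct.pack(">I", zlib.crc32(chunk_type + payload) & 0xFFFFFFFF)
    return length + chunk_type + payload + crc
```

Here the dependency runs in one obvious direction (payload to length and CRC), so "reversing" it is straightforward. The difficult cases are those where an edit deep in the tree changes offsets or counts stored far away.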

~~~
wlievens
I would think that for some formats it's trivial whereas for others it's nigh
impossible. There's nothing wrong with offering that feature to a subset of
grammars though.

Similar idea: generating a nicely formatted spec document from the grammar.

