
Reverse engineering the Notability file format - zdw
https://jvns.ca/blog/2018/03/31/reverse-engineering-notability-format/
======
userbinator
To summarise, it's a .zip file containing a binary serialisation of XML, which
then contains the base64'd raw coordinates...

Reminds me of this (it gets even better in the comments):
[https://thedailywtf.com/articles/XXL-XML](https://thedailywtf.com/articles/XXL-XML)

 _Nothing in here was really complicated – it was just some existing standard
formats (zip! apple plist! an array of floats!) combined together in a pretty
simple way._

It's rather telling that wrapping points in 3 layers of encodings is
considered "pretty simple", almost the norm today, when years ago doing such a
thing would probably have elicited reactions more towards _WTF_.
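For the curious, the three layers can be peeled apart with nothing but the Python standard library. A hypothetical sketch - the entry name "Session.plist" and the key "curvespoints" are assumptions for illustration, not confirmed names from the app:

```python
import plistlib
import struct
import zipfile

def read_points(path):
    """Unpack zip -> plist -> packed float32 coordinates."""
    with zipfile.ZipFile(path) as zf:                      # layer 1: zip
        plist = plistlib.loads(zf.read("Session.plist"))   # layer 2: plist
    blob = plist["curvespoints"]   # bytes; base64 only existed in the XML text
    return struct.unpack("<%df" % (len(blob) // 4), blob)  # layer 3: floats
```

Note that plistlib does the base64 decoding for you when the plist is the XML flavor, so the "third layer" you see in the raw file disappears once a plist library is involved.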

~~~
derefr
• Very likely the files are only .zip files for network portability, and when
in their native environment (iOS?) they’re unzipped document bundles (see e.g.
the RTFd format)

• Likely this file isn’t literally a PList, but the output of a data storage
framework (e.g. CoreData) that uses PLists as one possible backend. In the
code, it just looks like adding annotations to native data structures to make
them “storable” within a database-like object.

• Only the text-formatted PLists contain base64’ed binary data. The binary-
formatted PLists just contain length-prefixed binary data. There isn’t the
“binary in base64 in XML in binary” that you’re imagining.
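A quick way to convince yourself of this point: round-trip the same bytes through both plist formats. The base64 only ever exists inside the XML serialization itself (the key name here is made up):

```python
import plistlib
import struct

# base64 is only how the *XML* plist serialization represents binary data;
# the loaded value is plain bytes in both formats.
points = struct.pack("<4f", 1.0, 2.0, 3.5, -1.0)  # some packed float32s

xml_form = plistlib.dumps({"points": points}, fmt=plistlib.FMT_XML)
bin_form = plistlib.dumps({"points": points}, fmt=plistlib.FMT_BINARY)

# The XML form carries the bytes as base64 text inside a <data> element;
# the binary form stores them length-prefixed. Loading either yields bytes.
assert plistlib.loads(xml_form)["points"] == points
assert plistlib.loads(bin_form)["points"] == points
```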

~~~
jvns
thanks for this!! I've never worked with iOS so I was definitely fumbling in
the dark. the CoreData hypothesis explains why the PList file looks like a
huge weird unstructured array! So in the code there's just some object with
all the data in it and then CoreData serializes that object?

This explains the feeling throughout of "this doesn't feel like a format the
app developer _designed_ ".

------
wpietri
Julia Evans is the best. I've been following her on Twitter for years because
she brings so much enthusiasm about learning technology that she makes me see
familiar things anew. By the end of an exploration, even if it's something I
know well, I end up saying, "Holy shit, strace is cool."

To get a feel for her work, a bunch of her posts are here:
[https://jvns.ca/categories/favorite/](https://jvns.ca/categories/favorite/)

------
Anderkent
Reminds me of when I had to reverse-engineer the configuration and language
format for an audiobook reader - I think it was the Milestone 312 [1]. We
bought it for our grandpa, who is hard of seeing, but installing the Polish
language pack would stop the device from working correctly.

In the end I had to compare a working English pack and the dysfunctional one;
the actual issue was something pretty simple - the names of the properties in
their ini file were bad for some languages, and some checksums were wrong, or
something like that. But it made for an interesting black-box experience;
there's of course no feedback from the device about what went wrong, nor any
accessible logs :P

[https://bones.ch/milestone312.php](https://bones.ch/milestone312.php)

------
UncleEntity
Totally randomly I came across a paper that directly addresses this, _The Next
700 Data Description Languages_ [0].

    
    
      Today, many programmers tackle the challenge of ad hoc data by writing scripts in a language
      like PERL. Unfortunately, this process is slow, tedious, and unreliable. Error checking
      and recovery in these scripts is often minimal or nonexistent because when present,
      such error code swamps the main-line computation. The program itself is often unreadable
      by anyone other than the original authors (and usually not even them in a month or two)
      and consequently cannot stand as documentation for the format. Processing code often
      ends up intertwined with parsing code, making it difficult to reuse the parsing code for different
      analyses. Hence, in general, software produced in this way is not the high-quality,
      reliable, efficient and maintainable code one should demand.
    

[0] [https://www.cs.princeton.edu/~dpw/papers/ddc-jacm.pdf](https://www.cs.princeton.edu/~dpw/papers/ddc-jacm.pdf) (pdf)

------
acemarke
_Way_ back in the day, I built a PocketPC HTML editor. It used a plain textbox
control. Meanwhile, a lot of people were building apps and had expressed
interest in having some kind of a rich text edit control, similar to the
RichEdit control available as part of the desktop Win32 API. Microsoft's
PocketWord app used a rich text control, but it wasn't actually part of the
mobile Win32 API, or documented at all.

After finding a couple hints on what WndProc messages the control would
respond to, I spent a few weeks reverse-engineering it, and wrote up my
findings here:
[https://groups.google.com/d/msg/microsoft.public.dotnet.fram...](https://groups.google.com/d/msg/microsoft.public.dotnet.framework.compactframework/MOk776gh5Fs/9VUDZhFJTokJ).

I never did wind up building anything useful with that information, but it was
a pretty fun exercise.

------
ggambetta
Great fun :) Good insight into the process of reverse-engineering and
regenerating. I'm working (on and off) on something similar, to convert
Kdenlive project files to Final Cut Pro, so that I can collaborate with people
who don't use Linux. Both formats are XML, but figuring out the semantics can
be tricky!

------
gravypod
Does anyone here know of a native Linux drawing toolkit? Something like the
Apple Pencil? I've wanted a portable note-taking device, and all of the native
Linux note-taking software is not up to par with its OSX/Windows counterparts.

It's kind of a tangent but it's somewhat related.

~~~
killjoywashere
There's a developer at SUSE who has done a lot of work on tablet mode, fwiw.

------
nchelluri
I kind of want to see that ACID comic now.

~~~
jvns
[https://drawings.jvns.ca/acid/](https://drawings.jvns.ca/acid/)

------
anomie31
Nice write-up.

