
In praise of... text files and protocols - jgrahamc
http://blog.jgc.org/2012/04/in-praise-of-text-files-and-protocols.html
======
strags
So... for the one occasion out of a million where somebody needs to "debug" a
piece of data, it's necessary to suffer the bloat of a text format for every
other piece of data we transmit?

How about we just standardize on a binary data representation (eg.
MessagePack), and use common tools to export/import to/from a human-readable
format? Best of both worlds.
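For illustration, a minimal sketch of that round trip with the Python
msgpack library (assuming it's installed):

    import json
    import msgpack

    record = {"values": [1, 2, 3]}

    # Compact binary on the wire: 12 bytes for this record.
    packed = msgpack.packb(record)

    # One line back to a human-readable form when you need to debug.
    print(json.dumps(msgpack.unpackb(packed)))
    # {"values": [1, 2, 3]}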

And, as an aside - why are we using XML? It's ok as a markup language, I
guess, but as a container for data? We could hardly have picked a worse
format:

It's _crazy_ verbose - even when compared to other text formats (eg. JSON).
Compare:

    values: [1,2,3]

with:

    <values>
      <value>1</value>
      <value>2</value>
      <value>3</value>
    </values>

Its verbosity makes it hard to read, and hard to edit.

It has a poor mapping to the structures we actually use while programming - it
has no built-in notion of arrays. It has superfluous node "attributes" that
don't map well to common run-time constructs.

~~~
specialist
plaintext + gzip is preferable to binary.

The early adopters of XML saw their choices as semistructured text vs nice
parseable XML. The third choice, creating grammars, was ignored.

Grammars for most configuration, data-transfer, or protocol formats are
trivial - certainly since ANTLR 2.x - and far simpler than any equivalent
XML-based parse-and-validation tool stack.
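As a rough illustration of the scale involved - a hand-rolled Python sketch
rather than an ANTLR grammar, for a toy key=value format:

    import re

    # Toy grammar: file -> line* ; line -> comment | KEY '=' value
    LINE = re.compile(r'^\s*(\w+)\s*=\s*(.+?)\s*$')

    def parse_config(text):
        config = {}
        for raw in text.splitlines():
            if not raw.strip() or raw.lstrip().startswith('#'):
                continue  # skip blank lines and comments
            m = LINE.match(raw)
            if m is None:
                raise ValueError('bad line: %r' % raw)
            key, value = m.groups()
            config[key] = int(value) if value.isdigit() else value
        return config

    print(parse_config('# demo\nport = 8080\nhost = localhost'))
    # {'port': 8080, 'host': 'localhost'}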

FWIW, ASN.1 is worse than XML. Not a defense of XML; all uses of XML are
incorrect.

For my own work, I use a format descended from VRML that I call ARON (A
Righteous Object Notation).

It concisely describes groves (trees annotated with key/values) and supports
most commonly used datatypes. So it's a bit more concise than JSON or YAML and
a lot more strongly typed.

Here's an example test file:

[http://code.google.com/p/aron/source/browse/trunk/test/cronk...](http://code.google.com/p/aron/source/browse/trunk/test/cronk/test1.aron)

I use ARON for all my own projects; as you can see, it's not really polished
enough for others (yet). As this example shows, I mostly use it to loft Java
object graphs. I haven't reimplemented VRML's prototyping (DEF / USE)
functionality in this branch (yet).

~~~
strags
>>> plaintext + gzip is preferable to binary.

I'm not sure I agree with you on that. You're imposing a load of extra CPU
(and, I suspect, some bandwidth) overhead where it's not necessary for all
except the most infrequent cases, and you're inheriting all the weaknesses of
text as a data format.

Plus, since your data is now gzipped, it's no longer human-readable on the
wire. In order to read it, you need to pipe it through a decoder (gunzip) -
why not use a sensible binary protocol, and pipe it through the decoder for
that?

~~~
derleth
> In order to read it, you need to pipe it through a decoder (gunzip) - why
> not use a sensible binary protocol, and pipe it through the decoder for
> that?

I don't have your decoder. I have gunzip.

gunzip is not threatened by a patent. gunzip doesn't cause a Drama Meltdown.
gunzip won't be a proven attack vector for remote execution exploits. gunzip
does not require a contract in my hand or money in my bank. While your decoder
is being debugged, gunzip will be live.

(To the tune of "The Revolution Will Not Be Televised")

~~~
strags
Oh, I'm just advocating using a _standard_ binary serialization format like
MessagePack that is far more efficient, can easily represent binary data, and
has a far more obvious mapping to runtime data structures.

~~~
seanp2k2
>"standard binary serialization format"

Mmmm, yes, good luck with that :)

~~~
strags
A man can dream.

------
mwexler
This reminds me of the angst that came when Windows shifted from ".ini" files
with clear name=value lines to the "registry", which paid the bills of many
consultants, utility programmers, and "fix-it guys" via the fun that is
Regedit.

+1 for more simple text files... and hey, devs: if you don't need a nested
object format, perhaps even leave out the JSON or XML and just make a simple
file...

~~~
Anderkent
JSON looks pretty good even if it's a flat list, so I don't see a problem
there.

~~~
mechanical_fish
One annoyance with JSON as a config-file format is that you apparently can't
put comments in it. At least not out-of-band comments:

[http://stackoverflow.com/questions/244777/can-i-comment-a-
js...](http://stackoverflow.com/questions/244777/can-i-comment-a-json-file)

You can put in in-band comments by defining a JSON format that has a bunch of
'_comment' keys sprinkled through it, but that's annoying because you've got
to design that in, and decide which elements deserve comments and which don't,
and that's just a big hairy yak which sits there _daring_ you to shave it,
practically _begging_ you to do a bunch of YAGNI up-front design of the
comment protocol, and then even when you're done you still have to fret about
how those keys might get misinterpreted by clients from now until the end of
time. Even _human_ readers will have to intuit the semantics of your
'_comment' key ("oh, the computer never reads this, this is just for me!"),
whereas humans generally recognize the significance of real, standard comment
syntaxes like # or ;; or // or what have you.

Or you could use a JSON preprocessor that strips the comments, but that's a
little trap that you build for yourself, because now your JSON is no longer
universally-parseable by any language's JSON library. You could strip the
comments at build time, but now your production JSON files on the server don't
have the comments, and this is likely to be just where you need them - you put
them there for the troubleshooter, who will find them at 3am when the server
just crashed.
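(For illustration, a minimal sketch of such a preprocessor, assuming //-style
line comments; the regex keeps quoted strings intact so that URL values
survive:)

    import json
    import re

    # Match a JSON string (captured) or a // comment; keep the former,
    # drop the latter.
    _COMMENT = re.compile(r'("(?:[^"\\]|\\.)*")|//[^\n]*')

    def loads_with_comments(text):
        return json.loads(_COMMENT.sub(lambda m: m.group(1) or '', text))

    print(loads_with_comments('''
    {
        // only for the 3am troubleshooter
        "endpoint": "http://example.com",
        "port": 5432
    }
    '''))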

But I've sometimes used JSON for these things anyway. Life is too short for
perfectionism, and it _is_ a nice clean format. Almost too clean.

~~~
d0mine
You could try YAML as a superset of JSON with comments.

~~~
mechanical_fish
Indeed. This is pretty much my plan for the future.

------
ejames
I've seen the value of text formats personally.

I work mainly on an iPhone project. In order to handle customers' requests for
this or that custom UI feature, I invented a tool that generates something
like a simplified XIB file from an image and a chunk of CSV. I made the end
result of the tool a text file that the iPhone code parses.

Working in text saved a lot of time. Since I invented my own tool, naturally
there were things to debug and tweak in the results, but I could do that with
a text editor and commit the changes as easily diff-able deltas in git.
Although I work on iPhone code, everything also has to be implemented in
Android - but it's no problem for the Android developer to use the text file
since it's just text.

I've also written tools for migrating chunks of customer data from an old
back-end system to a new one using the new server's customer-facing API. This
was partly for dogfooding purposes, since the API was new and had very suspect
stability.

I wrote the tools to generate text files where each line is a JSON payload
that would be sent to the server. It made everything easier to debug -
examining the payloads lets you distinguish between errors in the export tool
and errors in the API. The text files themselves could then become test cases
if there was a bug in the API, or be quickly hacked to contain correct
payloads via find-and-replace if the export tool was wrong but we still needed
the migration to finish right away.
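A minimal sketch of that one-payload-per-line pattern (the function names
here are hypothetical, not the actual tool):

    import json

    def write_payloads(records, path):
        # One JSON payload per line: diffable, greppable, and each
        # line can be edited or deleted independently.
        with open(path, 'w') as f:
            for record in records:
                f.write(json.dumps(record) + '\n')

    def replay_payloads(path, send):
        # Re-send the same payloads later, e.g. as API test cases.
        with open(path) as f:
            for line in f:
                send(json.loads(line))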

~~~
tedmiston
I'm building a tool very similar to what you describe for Android right now.
It involves a simplified XML description of the GUI which is sent over a
network to an Android device, then "inflated" to actual GUI code.

Could you elaborate on your system a bit?

------
ajuc
Also important - you can version control text files easily.

One of the more brain-damaging environments I've worked in (Oracle Forms)
uses a binary format for source code. There is a utility to convert the
binary files into source code, but the primary files development works with
are binary.

It means there's no way to merge changes and no easy way to see what changed
in which commit. Even simple text search across the whole project is hard -
you have to open all the files at once in the IDE and search from there,
which is slow and awkward.

If text files are mediocrity, let's wait until something better comes along,
because binary formats are not it.

~~~
arethuza
What can be even worse than using binary files for development artefacts is
storing code in the underlying database with no straightforward way to map
to/from files.

------
antirez
One of the main goals of the Redis protocol was to try to find a point in the
middle between text and binary: the protocol is completely text based, but
designed to also handle binary payloads without problems. So far we have
never seen the protocol become a bottleneck for Redis performance.
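For illustration, a sketch of building a request in that protocol - an array
of length-prefixed bulk strings, so the framing stays readable text while the
payloads can be arbitrary bytes:

    def resp_command(*args):
        # Byte-length prefixes make the protocol binary-safe while
        # keeping it line-oriented and easy to eyeball.
        out = b'*%d\r\n' % len(args)
        for arg in args:
            data = arg.encode() if isinstance(arg, str) else arg
            out += b'$%d\r\n%s\r\n' % (len(data), data)
        return out

    print(resp_command('SET', 'key', b'\x00binary\xff'))
    # b'*3\r\n$3\r\nSET\r\n$3\r\nkey\r\n$8\r\n\x00binary\xff\r\n'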

------
viraptor
I must disagree here. Once you define complex enough text protocol, there are
ways to mess it up, or be incompatible with. Say what you want about ASN.1 and
similar formats, but if you have a correct parser, you know the values are
exactly the ones you expect.

A real-life example I kept running into was SIP implementations (think:
someone who knew only HTTP decided to create internet telephony). First,
there's line length when parsing - some implementations will limit it, some
won't. If it's limited, some implementations will let you wrap lines
according to the protocol, but others will say that's too complicated and
decide unlimited line length is the way to go. Then you have alternative
forms for SIP URIs: you can say "me" <number@ip>. Or just <number@ip>. Or
number@ip. Or <number@ip>;some_parameter. Or someone else decides to go with
<number@ip;some_parameter>. Some parameters have associated values, some
don't - guess how many implementations don't support both ways...
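A toy illustration of the parameter-placement ambiguity (not a real SIP
parser):

    # The same logical address can arrive in several syntactic shapes.
    forms = [
        '"me" <sip:1234@10.0.0.1>',
        '<sip:1234@10.0.0.1>',
        'sip:1234@10.0.0.1',
        '<sip:1234@10.0.0.1>;transport=udp',  # parameter on the header
        '<sip:1234@10.0.0.1;transport=udp>',  # parameter on the URI
    ]

    def naive_parse(header):
        # Assumes the angle-bracket form with parameters outside the
        # brackets; mishandles the bare form and conflates URI
        # parameters with header parameters.
        uri = header.split('<')[-1].split('>')[0]
        params = header.split('>')[-1].lstrip(';')
        return uri, params

    for form in forms:
        print(naive_parse(form))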

Before you know it, there's 1000 conventions and everyone supports some
minimal common core, but fails for at least one other specific implementation.

So you say - there's always structured text - json/xml/... But look at the
example in the blog post:

    <dict>
      <key>Bounds</key>
      <string>{{25.4278, 76.3008}, {104.75, 91.8751}}</string>

So how long until someone comes up with implementation that doesn't use spaces
between the numbers? What's the supported precision? What happens when one
tuple is skipped? Do you have to parse exponential numbers correctly? How do
you deal with duplicate keys in the dict?

I see the appeal in text formats and then remember... no - it's not the right
way. JSON is a bit less wrong than XML here, because you know what's a
number, what's a string, what's a list. You'd probably do
{"Bounds": [[25.4278, 76.3008], [104.75, 91.8751]]}.

Until someone comes out with a popular implementation that does case-
insensitive matching for keys... and switches from 'Bounds' to 'bounds' in
some version.

~~~
zwp
> JSON is a bit less-wrong than XML here, because you know what's a number,
> what's a string, what's a list

<http://www.w3.org/TR/xmlschema-2/#typesystem>

~~~
viraptor
True. But since it's optional, how many people actually use it, and how many
parsers care? It's a bit like XML namespaces.

------
pornel
Sadly, we're moving away from text protocols. SPDY and WebSockets are binary.

I've telneted many times to debug HTTP issues, and I wonder what I'm going to
do with SPDY problems.
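For illustration, the same by-hand poking with a raw socket instead of telnet
- possible only because HTTP/1.1 is plain text:

    import socket

    s = socket.create_connection(('example.com', 80))
    s.sendall(b'GET / HTTP/1.1\r\n'
              b'Host: example.com\r\n'
              b'Connection: close\r\n\r\n')
    print(s.recv(4096).decode('latin-1'))  # status line + headers
    s.close()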

~~~
zokier
note:

> Only use binary protocols where the performance is so sensitive that it's
> worth the implementation and debugging downside

SPDY was created explicitly to improve performance.

~~~
pornel
There are surely huge wins from having a single TCP/IP stream and gzipping
all of it, headers included, but I wonder how much difference the binary
framing makes on top of that?

Maybe you save 100B per request? On a large page that makes 100 requests,
that's 10KB - simply _80 milliseconds_ on basic 1Mbit broadband, next to the
single stream already eliminating round-trip time and TCP slow start.

~~~
kijin
At Google's scale, you might want to multiply that by a trillion or two.

------
peterwwillis
In some cases, yes, text is very nice to be able to read. But the only reason
we find it easier than binary data is that we don't have built-in arbitrary
binary parsers for the data.

When you read a text format or protocol your brain is doing the job that a
good debugger or parser should be doing. Because these tools don't exist for
your protocol, you think _"boy how handy that I can just parse this file with
my brain! It's a good thing I learned <INSERT NATURAL LANGUAGE> and that this
format/protocol was written with it, and that I have decades of experience
groking it."_ If your only option for debugging is to open the raw format or
protocol and pick at it by hand or try to eyeball it looking for some anomaly,
you're just lacking a real tool to help you do the job quicker and more
effectively.

Consider two separate formats: ELF and PostScript.

ELF is a binary format which is flexible and extensible. The data it contains
is almost entirely non-human-oriented; you aren't going to read it, because
it's not for you - it's for the computer only. Yet a wide range of platforms
have adopted it as their code file format. It could easily have been
implemented as text, but what would be the point? Easier debugging? Plenty of
tools exist to examine, dump and compare the properties of these files,
making an inherent textual representation redundant.

Now look at PostScript. A (relatively) simple, readable text control language
that determines the output of a complex document. The problem is that
PostScript is more of an interpreted programming language than a file format.
Instead of the host crafting a pre-processed document for the printer, the
printer had to include a costly interpreter for the document format in order
to produce documents on the fly. Early laser printers had microprocessors
faster than the Macintosh computers that connected to them. Was it easy to
grok, edit, debug? Yes. But it also made for a more costly device to handle
all that built-in textual flexibility.

To me, the best way to tell whether you need a text format/protocol is to ask
how human-facing your program is. Will I need to interface with its internals
on a regular basis? Might it become so complex and large in the future that a
dedicated debugging tool becomes necessary? And how much work would it really
be to just write that tool once and enjoy it forever?

------
npsimons
See also "The Power of Plain Text" in "The Pragmatic Programmer." Why people
think binary formats are the be-all-end-all (or even the correct solution for
the majority of problems) continues to confound me. But then again, I don't
get NIH syndrome either, and many who insist that binary is better usually
want to invent their own binary format.

------
ezy
One thing that seems to be missed here is that text formats tend to be self-
documenting. That is, if I'm handed a blob of text vs a blob of binary, I am
quite a bit more likely to be able to hack the text than the binary. This is
usually couched in terms of being able to process the text form using standard
tools, but it goes way, way beyond that. It matters most in the situation
outlined above -- when third parties need to get at the format without having
to rely on some provided tool.

Most protocols and file formats are _not_ documented sufficiently. Encoding
data in binary (unnecessarily) is unfriendly because there is a huge
difference between "ACK" and 0x06 when you're a third party looking at the
data with no reference. Sure, you can probably figure it out given enough
time, trial and error, or perhaps beg the developer for specs, but it's not
particularly efficient. Most developers don't have the time or the
inclination to publish a public spec for all the binary formats used in their
product.

You can make illegible text formats, of course, but I'd argue that then you're
simply making a binary format that's confined to the range of 7-bit ASCII.
Similarly, when the goal is to obfuscate (e.g. algorithm IP), binary formats
work well to dissuade casual investigations.

------
ralph
Text files are great, especially if one has the Unix shell skills to quickly
manipulate and query them without having to write a program. But XML is often
overkill compared to a simpler text format.

------
yason
Of all the things on which we spend our newly gained processor power and
memory bandwidth each year, textual formats are probably one of the most
worthwhile.

The window where binary formats are absolutely required has shrunk down to
the lowest levels. With data formats we're basically where we've been with
"scripting" languages for years: we can afford to start at the highest
possible level and trickle down towards compiled code only where necessary.

Further, whenever I've had to create a binary format, I've written a
translator first. The translator is a program that can read the binary format
and write the same information out in editable text form, and that can also
parse the text and write out the equivalent binary. Then I only work and
debug using the text format and just convert to binary when needed.
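A minimal sketch of such a translator pair, for a hypothetical record format
(a length-prefixed UTF-8 name plus a float64 value):

    import json
    import struct

    def binary_to_text(blob):
        # <H: uint16 name length, then the name, then <d: float64.
        (name_len,) = struct.unpack_from('<H', blob, 0)
        name = blob[2:2 + name_len].decode('utf-8')
        (value,) = struct.unpack_from('<d', blob, 2 + name_len)
        return json.dumps({'name': name, 'value': value})

    def text_to_binary(text):
        rec = json.loads(text)
        name = rec['name'].encode('utf-8')
        return struct.pack('<H', len(name)) + name + \
               struct.pack('<d', rec['value'])

    # Debug in text, ship in binary; the round trip keeps them honest.
    blob = text_to_binary('{"name": "temperature", "value": 21.5}')
    assert binary_to_text(blob) == '{"name": "temperature", "value": 21.5}'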

Usually this approach implies defining an API that you can use to construct
the text or binary message. The API gets tested automatically while I work
with the text representation, so when I finally move on to writing binary
natively (for performance reasons, obviously), I can trust it works as well
as it did during development. And I can use the API to generate text, too, so
I can easily compare and see what's going on.

------
gbog
The "text" part is important and has been under attack by all proprietary
formats for a long time, but right now it is the "file"part that is under
attack, by the cloudy services we are using more and more. I think it may be
an even greater threat under individuals' control over their own properties.

------
read_wharf
I have a great big hammer in the *nix text tools, and I try very hard to buy
nothing but text nails.

------
mseebach
Off-topic, but still: Probably the reason JGC is having problems with CMYK
colours is that he specifies them in RGB. Those are not the same.

~~~
jgrahamc
No. Those are examples from the samples that OmniGraffle gives away. I didn't
want to show the actual file I was working on.

------
seanp2k2
I think about how great text-based protocols are every day, every time I use
them (multiple times per day).

This is my biggest turn-off with Microsoft products and the Windows world. I
actually like XML over undocumented JSON sometimes, because it's so easy to
figure out what everything does (in the case of succinct XML, which is
admittedly rare).

------
keithpeter
Disclaimer: I am not a programmer

I've written little scripts to produce diagrams in DXF and in PS format for
astronomical maps and for maths teaching. It's often quicker, and neater,
than plotting with a mouse.

Typical end-user hacks - nothing that would be of any use for published
software.
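(For illustration, a toy version of that kind of script: a few prints make a
valid PostScript drawing, no plotting library required.)

    # Emit a minimal PostScript file that strokes a circle.
    lines = [
        '%!PS-Adobe-3.0 EPSF-3.0',
        '%%BoundingBox: 0 0 200 200',
        '100 100 80 0 360 arc stroke',  # x y radius start-angle end-angle
        'showpage',
    ]
    with open('circle.ps', 'w') as f:
        f.write('\n'.join(lines) + '\n')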

------
tbsdy
100% agreed! Currently I'm trying to get access to an ITSM tool that uses
HTTP as its transport mechanism. It uses ActiveX controls to gather data into
a grid-like mechanism.

As I particularly hate ActiveX, I have started reverse engineering what these
controls do. So far so good, except that the format used for the data that the
controls receive is an application/octet-stream binary format.

Now I've worked out how the format works, and by using JDataView I'm parsing
it. But you know what? Internet Explorer hits null characters in strings and
will not go any further, even though ECMA-262 states that:

"The String type is the set of all finite ordered sequences of zero or more
16-bit unsigned integer values... All operations on Strings (except as
otherwise stated) treat them as sequences of undifferentiated 16-bit unsigned
integers; they do not ensure the resulting String is in normalised form, nor
do they ensure language-sensitive results."

If they had passed the data back in something saner like JSON, or heck even
XML, then things would have been fine. As it is, I've decided to skip
Internet Explorer as it's just not worth my time to get around this issue,
and every other browser works fine with JDataView.

------
shasta
Let's hear it for 1970s technology. May we ever be stuck with Unix mediocrity!

~~~
bryanlarsen
In the 1970s, when a megabyte cost many thousands of dollars, there was much
more reason to use binary formats than there is now. If text made sense even
then, it makes much more sense now.

