
The Internet is running in debug mode (2014) - sciurus
http://java-is-the-new-c.blogspot.com/2014/10/why-protocols-are-messy-concept.html
======
kentonv
OK, we're going to see a lot of rehashing of the old arguments:

Pro-text: Human-readability saves immeasurable human developer time!

Pro-binary: Text parsing wastes a lot of _very measurable_ machine time!

Problem is, half of this argument is based entirely on anecdotal evidence and
gut feelings. We really have _no idea_ how much developer time is saved by
having messages be human-readable. You will find smart people who believe
_very strongly_ both ways. In a seemingly very high fraction of these cases,
people are really just taking the practice they are more comfortable with
(because they've used it more) and rationalizing an argument
for it because it makes them feel good about their choices. When hard evidence
is lacking, confirmation bias, unfortunately, takes over.

As the author of Cap'n Proto and former maintainer of Protobufs, I obviously
come down pro-binary... but I won't bore you with my argument as I don't
really have any hard facts either.

~~~
wmil
You're missing a major feature. Text formats don't have any natural integer
size limits.

That's been a big win for HTTP. It's easy to imagine the original HTTP authors
thinking that 24 bits is plenty of space for file size (anything bigger should
use FTP) or using seconds since 1970 in 32 bits instead of text dates.

Those are a bit obvious, but a developer could easily put in something like a
255 chunk limit without thinking too hard about it.

Even John Carmack accidentally screwed over OpenGL driver developers by
copying the GL_EXTENSIONS list into a 1000 char buffer in GL Quake (that's way
too small).

A binary format makes hard limits too easy.
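
To make the failure mode concrete, here's a minimal sketch (a hypothetical wire
layout, not any real protocol) of how a fixed-width count field bakes in a
ceiling that the textual equivalent never had:

    import struct

    # Hypothetical record: a single unsigned byte holds the chunk count,
    # so the format can never describe more than 255 chunks.
    def encode_chunk_count(n: int) -> bytes:
        if n > 0xFF:
            raise ValueError("binary format caps chunk count at 255")
        return struct.pack(">B", n)

    # The textual equivalent has no such ceiling; any limit only shows up
    # later, if ever, as a parser buffer size or an explicit spec decision.
    def encode_chunk_count_text(n: int) -> bytes:
        return str(n).encode("ascii")

    print(encode_chunk_count(200))         # b'\xc8'
    print(encode_chunk_count_text(70000))  # b'70000', still fine
    print(encode_chunk_count(70000))       # raises ValueError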

~~~
kentonv
All good formats -- text or binary -- are capable of being extended over time
in a backwards-compatible way, in order to fix mistakes. Formats that don't
allow this are _bad_ and should not be used. Protobufs and Cap'n Proto in
particular have very strong support for forwards- and backwards-compatibility.

Choosing a too-small fixed integer size is one way that a format can screw up
and need correction -- one which happens to be (somewhat) exclusive to binary
formats. But there are plenty of ways text formats can screw up too, like
trying to pack fundamentally structured data into an ad-hoc hard-to-parse
string format in order to make it look more pleasant to humans (I'm looking at
you, MIME type specifications), or failing to use consistent parsing rules
(cookie and user-agent headers totally flout otherwise-consistent HTTP grammar
rules, confounding well-written parsers).
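
To illustrate the kind of extensibility in question (a hand-rolled
tag-length-value sketch, not Protobuf's or Cap'n Proto's actual wire format):
an old reader simply skips fields it doesn't recognize, so new fields can be
added later without breaking it.

    import struct

    def encode_field(tag: int, payload: bytes) -> bytes:
        # tag (1 byte) + length (4 bytes, big-endian) + payload
        return struct.pack(">BI", tag, len(payload)) + payload

    def decode_known_fields(buf: bytes, known_tags: set) -> dict:
        fields, i = {}, 0
        while i < len(buf):
            tag, length = struct.unpack_from(">BI", buf, i)
            i += 5
            value = buf[i:i + length]
            i += length
            if tag in known_tags:
                fields[tag] = value
            # unknown tags are skipped, not rejected: that skip is what
            # lets new fields be added without breaking old readers
        return fields

    msg = encode_field(1, b"alice") + encode_field(7, b"added in v2")
    print(decode_known_fields(msg, known_tags={1}))  # old reader sees {1: b'alice'}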

~~~
mcv
Is this an unbiased argument for binary formats in general, or is it a sales
pitch for Protobufs and Cap'n Proto?

If a fairly recent tool is the only way to get binary formats right, then it's
entirely understandable that they've fallen out of use.

~~~
kentonv
Obviously I am biased, but my intent is to make a technical argument, not a
sales pitch. I don't make any money from these (and I no longer have anything
to do with protobufs).

Protobufs was open sourced in 2008 (after being in use inside Google since
around 2001). I suspect other binary formats prior to that had a solution for
compatibility as well, but I haven't done a survey.

------
Udo
The problem I see with binary protocols is the same one that plagues XML: as
soon as you build something to be "not human-readable", any potential
efficiencies gained will quickly be overshadowed by increasing bloat.

When you as a developer look at JSON messages, and you see endless walls of
irrelevant text scroll by, you're disgusted. This disgust drives minimalism.
With a machine-centric format, the excuse quickly becomes "oh, this is
intended for machines anyway, so who cares". If you can access something
through tools only, the bloat becomes hidden and is encouraged to grow.

At that point, you'll start seeing articles asserting that, sure, the new
super-efficient binblob messages are 10x-20x as large as JSON used to be, but
_look at all the things we gained_ , like automatic protocol negotiation,
contracts, actual serialized objects. Any of these sounds reasonable at first
but in reality will only benefit tool vendors in a vicious feedback cycle
where the format slowly evolves itself to death.

I'll take that 3-5x overhead of parsing JSON any time over the
non-human-readable alternatives. That doesn't mean it's the right choice for
all protocols. But it's a reasonable default for a lot of systems.

~~~
Jweb_Guru
XML was never intended as a memory-efficient format (or if it was, it was an
utter failure before a single tag had been transmitted over the wire). I don't
understand why you believe that binary formats naturally lead to increased
bloat; while it may indeed be the case, your post largely reads like
speculation to me. One could just as easily posit that the rigidity required
of such a format makes it much harder to change (and examples supporting this
viewpoint abound, including things like IPv4). Can you provide some examples
of actual binary protocols that underwent this transformation?

~~~
Udo
_> XML was never intended as a memory-efficient format_

I never said it was. I apologize if it wasn't clear from my comment, but it
was centered solely on the aspect of human readability. You're not wrong about
it being speculation though, I just don't see the harm since we're talking
about a speculative source article in the first place.

 _> Can you provide some examples of actual binary protocols that underwent
this transformation?_

My entire comment was addressing an example where I felt a format had
degenerated because it transitioned from a human-readable to a decidedly
machine-readable form. My criticism is not based on the notion that binaries
are bad in and of themselves.

------
sirsar
I wouldn't be so quick to judge the energy lost by using JSON over some binary
format as "waste". Standardization on JSON has saved countless developer hours
by being so easy to read, write, implement, and debug.* This translates
directly and indirectly into economic benefits for everyone. Similarly, I
would have taken much longer to start learning HTTP if it was not inspectable
with near-zero friction. People "waste" electricity for a reason.

* Is it perfect? No. Is any protocol? No. Would a binary format be better? Very likely, no.

~~~
nutate
Not to mention, most importantly, that JSON sits firmly on the safe side of
the undecidability cliff in terms of parsing limits. Nesting and length limits
make parsing it safe, whereas a given binary protocol may have recursion or
buffer-size definitions that make for ripe security holes. (Even XML has
parser bombs.)
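
A rough illustration of the asymmetry (the JSON half relies on CPython's real
recursion guard; the binary half is a hypothetical length-prefixed record, not
any particular protocol):

    import json
    import struct

    # Deeply nested JSON: the parser hits Python's recursion limit and
    # fails loudly instead of blowing the stack.
    try:
        json.loads("[" * 100_000 + "]" * 100_000)
    except RecursionError:
        print("json parser refused pathological nesting")

    # A binary reader that trusts a declared length is the classic hole;
    # the bounds check below is the part that's easy to forget.
    def read_record(buf: bytes) -> bytes:
        (length,) = struct.unpack_from(">I", buf, 0)
        if length > len(buf) - 4:
            raise ValueError("declared length exceeds available data")
        return buf[4:4 + length]

    print(read_record(struct.pack(">I", 5) + b"hello"))  # b'hello'
    read_record(struct.pack(">I", 2**31) + b"oops")      # raises ValueError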

------
delinka
The Internet _is_ running in debug mode. Because it's large and complex and
humans have to _debug_ the problems. Humans need to be able to read the data
as it whizzes by to spot the problem[1]. Often, this needs to happen in
environments with minimal tooling; e.g. you're staring down a problem on a
production server, and you're forbidden to install the newest analysis script
along with the latest version of Python it requires.

We've already agreed on a binary protocol: UTF-8 (previously it was ASCII).
But we've also built redundancy into it for the humans to make sense of it
with their high-level brains. Instead of a single byte representing an HTTP
header, we use a string of bytes. Now the human involved can tap the wire and
watch the request in real-time without processing anything.

Now, if you'd like to remove redundancy without the need for a compression
library, we'll just need to agree on shortening those strings. And we'll need
a new diagnostic/parsing tool for each [binary] protocol that's invented --
unless you can convince the grep/sed/awk developers to add every protocol to
their tools. Or maybe we could all agree on a single binary encoding for every
potential combination of strings; something like an index into a dictionary.
It might be better (i.e. higher compression ratios) if we let the computer
decide on the dictionary for each message.

Do you see where this is headed?
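
A toy version of that dictionary idea (a made-up two-entry table; real HPACK
also Huffman-codes values and maintains a dynamic table):

    # Hypothetical static table: common header names collapse to one byte.
    STATIC_TABLE = {1: "content-type", 2: "content-length"}
    REVERSE = {name: idx for idx, name in STATIC_TABLE.items()}

    def encode_header(name: str, value: str) -> bytes:
        return bytes([REVERSE[name]]) + value.encode("ascii") + b"\x00"

    def decode_header(buf: bytes):
        name = STATIC_TABLE[buf[0]]
        return name, buf[1:buf.index(b"\x00")].decode("ascii")

    wire = encode_header("content-type", "text/html")
    print(wire)                 # b'\x01text/html\x00', no longer greppable as-is
    print(decode_header(wire))  # ('content-type', 'text/html')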

1 - This, of course, is only the case until the machines can accurately gauge
human intent and respond appropriately, preventing us from making mistakes to
begin with.

------
andrewstuart2
Let's not forget about TCP, IP, and all the other protocols involved in the
internet that are binary by design. It's more than just HTML, JSON, and
JavaScript over HTTP.

> Standardize on some simple encodings (pure binary, self describing binary
> ("binary json"), textual)

Maybe like gzip [1], hpack [2], bson, or others?

I realize the point he's making about doing unnecessary work, but there's also
a reason we haven't expanded human language past written characters or spoken
syllables. It's efficient for our brains, and for preserving and transmitting
knowledge.

There's just no way to create a binary format (character encodings aside) that
can encompass all the possible ideas that can be communicated. Instead, the
common text protocols eventually get optimized into binary (HTTP/2) without
compromising the ability to express the rest.

[1]
[https://www.ietf.org/rfc/rfc1952.txt](https://www.ietf.org/rfc/rfc1952.txt)

[2] [https://tools.ietf.org/html/draft-ietf-httpbis-header-compre...](https://tools.ietf.org/html/draft-ietf-httpbis-header-compression-02)

~~~
quotemstr
> Let's not forget about TCP, IP, and all the other protocols involved in the
> internet that are binary by design.

It's funny that you mention IP and TCP. It's sad that we've degraded the end-
to-end principle those protocols embodied. Today, it's not really practical to
use IP protocol numbers other than 6 (TCP), 17 (UDP), and 1 (ICMP) and their
IPv6 equivalents. Middleboxes, out of misplaced caution, reject packets that
look unusual. We can't even use ECN. To work around this problem, we run
everything over TCP and UDP.

That doesn't even work, though, because other middleboxes, working one level
up the stack, reject TCP on anything but ports 80 and 443.

So now we run everything over multiplexed encrypted connections on one TCP
port on one IP protocol. That's just silly.

The low-level protocol situation is like a fossilized idiom in human language.
In English, we can say that something "wreaks havoc", but what about "wreaks
happiness"? Nobody understands "wreaks" anymore. It's no longer a productive
rule. Likewise, IP and (arguably) TCP protocol details aren't productive
either, except in specialized cases, or on closed networks.

The internet isn't just running in debug mode. It's also flinging around lots
of bytes that do nothing other than reflect past aspects of its evolution,
like an embryo's tail and gills.

------
kijin
The fact that the Web (not the "internet") runs in debug mode is directly
responsible for its popularity and accessibility.

Anyone can look at a web page and learn how it works. Anyone can copy bits and
pieces of code from various places and put them together into a web page of
their own. Most "web programmers" manage to make a living without ever having
to learn a complicated protocol or trying to figure out what a long string of
hexadecimal digits means. Easy to learn = more casual tinkerers = more people
learning how to code, at least at a basic level.

You could design perfectly efficient protocols and encoding formats, but if
people don't use them, what's the point?

HTTP/2 is a nice compromise. The web server and browser abstract away all of
the binary layers, so most programmers only need to care about human-readable
text.

~~~
Sirenos
> Anyone can look at a web page and learn how it works.

Not so true anymore. It's true that HTML/CSS are easy to inspect on a browser,
but the meat of most websites today (JavaScript) is heavily minified to the
point that you'd be hard-pressed to call it plain text.

~~~
kijin
True, but minified scripts are often accompanied by sourcemaps. Technically
sourcemaps don't need to be public, but they often are because they make
debugging easier for in-house developers.

So even if not all websites are in debug mode, there are incentives to make it
easy to get them into debug mode.

------
serve_yay
I know I'm not making any big revelations here, but the funny thing is there's
no "plain text" at all. It's just that our tools all know how to decode ASCII.
ASCII bytes are just as non-human-readable as any other until they're decoded
as ASCII and displayed on screen. Another encoding could be just as
transparent if all our packet sniffers, editors, etc., spoke it.
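
Easy to demonstrate: the same sixteen bytes are "plain text" only because we
ask a tool to render them as ASCII (a trivial illustration, nothing
protocol-specific):

    raw = bytes([0x47, 0x45, 0x54, 0x20, 0x2F, 0x20, 0x48, 0x54,
                 0x54, 0x50, 0x2F, 0x31, 0x2E, 0x31, 0x0D, 0x0A])

    print(raw.hex(" "))         # 47 45 54 20 2f 20 48 54 54 50 2f 31 2e 31 0d 0a
    print(raw.decode("ascii"))  # GET / HTTP/1.1  (the same bytes, decoded)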

~~~
fapjacks
This here is really the only revelation that's ever made me think twice about
text-encoded protocols. Someone else here mentions TCP, IP, etc, and it's true
that it's pretty readable in Wireshark, when you spend a little time learning
where to look for the information. BUT... You still essentially need Wireshark
to read it.

------
nchudleigh
The bit about global warming - come on. Reminds me of HBO's Silicon Valley.
"And we're making the world a better place, through standardized binary
encoded web protocols"

~~~
quonn
Especially since most requests will originate from a desktop or mobile client.
For each request, the client will surely display the result for a few seconds
or even minutes, using way more energy than parsing those text protocols.
Therefore, switching to a binary protocol can't make a big difference, not
even for the comparatively small part of human energy consumption that's used
for computing devices.

------
ChuckMcM
This is from last year (Oct 2014); it describes textual protocols as "debug
mode" since they consume cycles to parse.

It's an interesting claim, and it is certainly true that encoding numeric
information into UTF-8 consumes CPU cycles. But what isn't quite so clear is
"What percentage of packet latency is dedicated to encoding and decoding
packets?"

Back in the old days I was the ONC RPC architect for Sun, and we spent a lot
of time on "RPCL" (RPC Language), which was a way to describe a protocol
textually and then compile that description into library calls into XDR (the
External Data Representation). We did that because you burned a lot of CPU
time trying to parse a C structure out of the network, and more importantly
the way it was represented in memory was an artifact of the computer
architecture (endianness, whether structures got packed or were
word-addressable, etc.). XDR solved all of those problems by putting data on
the network in a canonical format, and local libraries could always convert
from the canonical format into the local format correctly.
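
The flavor of a canonical on-the-wire form is easy to sketch with Python's
struct (XDR proper fixes big-endian byte order, 4-byte alignment, and so on;
this is only a sketch, not real rpcgen output):

    import struct

    def xdr_int(n: int) -> bytes:
        # XDR-style: every integer goes out as 4 bytes, big-endian,
        # regardless of the sender's native byte order.
        return struct.pack(">i", n)

    def xdr_opaque(data: bytes) -> bytes:
        # length-prefixed, padded to a 4-byte boundary
        pad = (-len(data)) % 4
        return struct.pack(">I", len(data)) + data + b"\x00" * pad

    print(xdr_int(1).hex())            # 00000001 on every architecture
    print(xdr_opaque(b"hello").hex())  # 0000000568656c6c6f000000 (length, data, padding)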

That actually works quite well. It almost became the standard way to do things
on the Internet, but politics got in the way. The big argument was that if you
converted things into big-endian form on the network, then a little-endian
processor had to convert to send and convert to receive, but a big-endian
processor got a free pass without "painful" conversion steps.

Later, rather than converting big endian to little endian, people just convert
to text (which has the same effect as a canonical form), but it hides the
religious argument behind the "hey, it's just text, we all know how to parse
text, right?" sort of abstraction. At least then it penalizes everyone equally.

But the truth, which came out in the RPC wars, and is even more true today, is
that you have to burn a few billion CPU instructions to have any impact at all
on latency. That is because computers are so much faster, while the network,
though faster, isn't a million times faster; it's really just barely, and only
on a good day, a hundred times faster than it was back in 1985. What that
means in principle is that whether it takes 1 uS or 30 uS to queue your packet
for the internet, it doesn't even show up against the 5,000 uS it takes to
send a small packet from here to there, or the 200,000 uS it more typically
takes.

If you're a supercomputer sending data around to simulate fluid dynamics, that
stuff adds up. If you're sending ajax calls from here to there, not so much.

~~~
nomel
> Back in the old days I was the ONC RPC architect

I just recently used ONC+ for controlling a low-spec, lowest-possible-latency
embedded Linux controller. The ridiculously low memory footprint and CPU
usage, compared to all of the others, along with the whole 5 minutes it took
me to write the server and client stubs with rpcgen, made it an easy winner.
So I thank you. :)

------
rbanffy
We have realized, long ago, that computers are cheap and programmers (the
humans who most often read "human readable" stuff) are expensive.

We also tend to forget, or never learn, what was discovered before our careers
began.

~~~
guhcampos
Could not agree more.

Not only is hardware cheaper, it is also much more easily replaceable, has
well-defined behavior, and usually doesn't go nuts and need to take
prescription drugs. It also mostly doesn't have a life and family outside
work.

This is why technology usually converges to serve the humans, not the
machines.

Or else we would all be writing assembly code instead of garbage-collected,
duck-typed, compiler-aided languages.

------
nickpsecurity
I used to use Sun XDR [1] for this reason. I pushed Juice [2] applets briefly
while they existed, with efficiency neither JS nor Java delivered, and other
benefits the authors didn't see. Plus, client-server or P2P architecture with
native code gave me performance, portability, and security benefits the Web
can't deliver.

The whole Web is ridiculously inefficient. Even in the 90's, there were better
architectures [3] to choose from. It's unsurprising that "Web" companies such
as Facebook have moved back to Internet apps where possible (esp mobile) and
often avoid vanilla Internet/Web protocols within their datacenters. There's
better stuff in every area. Here's to hoping that more of it gets
mainstreamed.

[1]
[https://en.wikipedia.org/wiki/External_Data_Representation](https://en.wikipedia.org/wiki/External_Data_Representation)

[2] ftp://ftp.cis.upenn.edu/pub/cis700/public_html/papers/Franz97b.pdf

[3] [http://www.cs.vu.nl/~philip/globe/](http://www.cs.vu.nl/~philip/globe/)

~~~
tptacek
[http://cr.yp.to/sarcasm/modest-proposal.txt](http://cr.yp.to/sarcasm/modest-proposal.txt)

~~~
nickpsecurity
That's hilarious. Strangely enough, there are people in the country without
access to fast Internet who would've appreciated proposals like this to shave
much time off their email downloads, so long as the client automatically
decompresses the message during viewing. I think we can settle for a more
conventional algorithm for that, though. ;)

------
nitwit005
Actually, the internet is dominated by binary. The wire formats are binary,
and a huge hunk of all bandwidth is being eaten up by binary formats: video,
bittorrent, voice, JPG images, etc.

Sure, HTML, CSS and JS are "text", but it's usually optimized, compressed
text, and pretty soon we'll be shipping that around in HTTP/2.

People optimize what matters. Look at the HTML source for this page. It's
mostly user content. More efficient packaging for the HTML structure wouldn't
make a measurable difference in load time.

------
jpatokal
Google has an elegant solution for this in the form of protocol buffers, which
are human-readable when expanded but compress down to a very efficient binary
format:

[https://developers.google.com/protocol-buffers/](https://developers.google.com/protocol-buffers/)

Unfortunately it was made public only after JSON had already become "the XML
replacement".

------
deepuj
The more readable and accessible data is, the more "future-proof" it will be.
Unless you have a specific requirement for performance, it is never a good
idea to sacrifice accessibility. JSON has won over the developer community
primarily because it is simple, readable, and easy to implement.

------
bcheung
Accessibility and ease of use contribute to the advancement of technology. The
easier something is to work with, the more people will use it.

I used to be obsessed with performance and refused to write in anything other
than assembly. When I started working with teams of other people, I realized
there is efficiency for the computer, and there is efficiency for the developer
and the company. The latter two trump efficiency for the computer pretty much
every time.

I'm more than happy to sacrifice some efficiency if it means getting the job
done faster, cranking out more features, greater compatibility, more leverage
with existing tools / standards, and so on.

If you want to talk about efficiency, though: why don't we get rid of time
zones and daylight saving time?

~~~
PavlovsCat
Also, I like the idea of someone getting interested in coding because somebody
made something that is interesting to them, and it's in unminified JavaScript,
for example. I know when I started digging into things it was mostly inspired
by random things like that; they didn't even have to be great, they just had
to be _there_ at the moment. First I experimented with stuff others made, then
I learned, and if playing around with stuff hadn't been that fun, I might
never have gotten to the learning-more part.

------
chillingeffect
> 99% of webservice and webapp communication is done using textual protocols.

Yep. <img src="whatever.jpg"> is indeed all text, and zip would have turned
those 25 bytes into 191. And then a 300KB highly-compressed binary data file
would be transmitted.

aka "Penny-wise and pound-foolish."

------
wtbob
One of the many nice things about canonical s-expressions[1] is that they
combine many of the features of binary and textual protocols. Fast to parse,
easy to write a parser for, and easy to examine by hand, they were ahead of
their time.
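
The canonical form really is that easy to parse: every atom is
length-prefixed, so there is no escaping and no lookahead. A minimal reader
for the canonical subset (length-prefixed atoms and parentheses only, none of
the display hints or transport encodings):

    def parse_csexp(buf: bytes, i: int = 0):
        # returns (parsed value, index just past it)
        if buf[i:i + 1] == b"(":
            items, i = [], i + 1
            while buf[i:i + 1] != b")":
                item, i = parse_csexp(buf, i)
                items.append(item)
            return items, i + 1
        j = buf.index(b":", i)
        length = int(buf[i:j])
        return buf[j + 1:j + 1 + length], j + 1 + length

    value, _ = parse_csexp(b"(4:http4:host11:example.com)")
    print(value)   # [b'http', b'host', b'example.com']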

[1]
[http://people.csail.mit.edu/rivest/Sexp.txt](http://people.csail.mit.edu/rivest/Sexp.txt)

~~~
nickpsecurity
Good points. I used them as a text data structure in the past. Saying the
parser was easy to write is an understatement: years of papers and code from
the LISP/Scheme community had done it dozens of times. Even in hardware. We used
it for architecture-neutral data format and mobile code for agent-oriented
programming. Fun times.

I've considered going back to that in custom designs. It's just that Galois
built a high assurance ASN.1 parser, INRIA has a verified parser generator,
and ZeroMQ is pretty solid. Too many neat things to consider these days...
haha

------
bcheung
To counter my earlier statement though I think there are scenarios where you
can have your cake and eat it too.

JSON is easier to use and debug, and is more efficient than XML.

WebAssembly is going to reduce data transfer sizes and load times while
increasing developer productivity as the ecosystem and tools surrounding it
expand.

~~~
Sirenos
> WebAssembly is going to reduce data transfer sizes and load times while
> increasing developer productivity as the ecosystem and tools surrounding it
> expand.

That's being a little too optimistic. And assuming that does happen, it's
going to take years before it surpasses the current toolset.

------
nemasu
I just recently implemented something using binary websockets, so it's easily
possible now.

------
choward
So he wants HTTP/2?

------
mcv
Isn't it the very fact that the Web uses plain text for everything, from HTML
to CSS to JavaScript and JSON, that contributed so enormously to its success?

Suppose HTML had been a binary format. Would it have gotten anywhere near as
far?

------
floatboth
JSON+UTF-8 is just another binary encoding...

CBOR is great, it's similar to JSON & already an RFC
[http://cbor.io](http://cbor.io)
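
A quick size comparison, assuming the third-party cbor2 package is installed
(pip install cbor2); the encoding itself is what the RFC specifies:

    import json

    import cbor2  # third-party: pip install cbor2

    doc = {"id": 7, "name": "Ada", "tags": ["net", "fmt"]}

    as_json = json.dumps(doc, separators=(",", ":")).encode("utf-8")
    as_cbor = cbor2.dumps(doc)

    print(len(as_json), as_json)        # the familiar text form
    print(len(as_cbor), as_cbor.hex())  # same data, fewer bytes
    print(cbor2.loads(as_cbor) == doc)  # True, round-trips cleanly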

------
contingencies
"Human society is running in debug mode!"

------
smegel
> textual encodings like xml...have become very popular

Written by someone with "Java" in the URL. Why did I even click.

------
fapjacks
You can read the author's inexperience between every single line.

~~~
dang
Perhaps. But if you know more, why not teach us? A good HN comment would
convey relevant information and not be snarky.

~~~
fapjacks
What's there to say to this person, who has clearly made a decision that
they're going to stand by forever? People don't change their minds even in the
face of reason, let alone their religious convictions. I'll be downvoted for
this comment, but there is absolutely no point in trying to reason here,
because of the barrier of human nature, and what I'm saying is the truth,
unfortunately.

~~~
dang
It's not only a question of convincing the other person but of providing clear
information to everybody else.

Perhaps you feel there's no point to doing that either, but then the only HN-
appropriate thing to do would be to just not comment. Posting empty swipes
doesn't help anyone, and breaks the site guidelines.

~~~
fapjacks
I've posted two other comments in this thread with the information you're
looking for. I don't think it's an empty swipe. It's how I feel about the
content in the link. Part of discussing technology is using social cues to
reinforce or reprimand certain kinds of behavior, just as you're doing now. If
some new person came to this thread and saw nothing but chipper, positive,
glowing comments about this guy's wonderful insight, there would be nothing
here indicating the enormous friction around this very topic. Comments like
mine reinforce that the community needs to inform themselves about a topic
before deciding for themselves what is the best way forward. If we stumble
blindly into the future, coddled by the happy comments we are allowed to post,
innovation suffers and ultimately dies. This is about much more than just
negativity in posts. I hope what I've said makes some kind of sense.

