
Bits on the Wire - yarapavan
https://www.tbray.org/ongoing/When/201x/2019/11/17/Bits-On-the-Wire
======
jkaptur
> ... schemas aren’t free. You have to coördinate between sender and receiver,
> and you have to worry what happens when someone wants to add a new field to
> a message type — but in raw JSON, you don’t have to worry, you just toss in
> the new field and things don’t break. Flexibility is a good thing.

Much like type systems in general, you always _have_ a schema, and you always
have to worry about what happens when someone adds a field. You just get to
choose whether it's explicit and you have to type it alllll the way out
allllll the time, or whether it's implicit, and is defined as "whatever all
the usages are everywhere".

Just as gradual typing is taking over the world (TypeScript, MyPy, etc.), I
predict success for something that lets you gradually add schemas to JSON.
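
A rough sketch of what "gradually adding a schema" could look like, assuming a made-up partial schema that only names two fields. Nothing here is a real library; `PARTIAL_SCHEMA` and `check_gradually` are hypothetical names. Fields the schema doesn't mention pass through unchecked, which is the gradual part.

```python
import json

# Hypothetical partial schema: only the fields listed here are checked.
# Anything not listed is left alone.
PARTIAL_SCHEMA = {
    "id": int,
    "name": str,
}

def check_gradually(message, schema):
    """Return a list of problems, looking only at the declared fields."""
    problems = []
    for field, expected_type in schema.items():
        if field not in message:
            problems.append(f"missing field: {field}")
        elif not isinstance(message[field], expected_type):
            actual = type(message[field]).__name__
            problems.append(f"{field}: expected {expected_type.__name__}, got {actual}")
    return problems

old_message = json.loads('{"id": 7, "name": "widget"}')
new_message = json.loads('{"id": 7, "name": "widget", "colour": "red"}')
bad_message = json.loads('{"id": "7", "name": "widget"}')

print(check_gradually(old_message, PARTIAL_SCHEMA))  # no problems
print(check_gradually(new_message, PARTIAL_SCHEMA))  # extra field is tolerated
print(check_gradually(bad_message, PARTIAL_SCHEMA))  # declared field has wrong type
```

Tightening the contract later would then just mean adding entries to the schema, one field at a time.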

~~~
vincnetas
> ... you just toss in the new field and things don’t break. Flexibility is a
> good thing.

Except when it's not you who is flexible but other parties, and they express
that flexibility not only by adding but also by removing or modifying fields
without notice :/

~~~
erik_seaberg
I very rarely see systems that distinguish between "this new feature can be
safely ignored" and "it's not safe to process this message if you don't
support this new feature", but I've always thought it a good idea. Like
semantic versioning without a global ordering over schema changes.
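
One way this could be sketched, similar in spirit to the critical flag on X.509 extensions: the sender lists which features a receiver must understand. The message layout, the `critical` key and the `SUPPORTED` set are all assumptions made up for illustration.

```python
# Features this (hypothetical) receiver understands.
SUPPORTED = {"compression", "priority"}

def can_process(message):
    """Accept a message unless it marks an unsupported feature as critical."""
    critical = set(message.get("critical", []))
    return critical <= SUPPORTED

# Unknown feature, but the sender says it is safe to ignore.
ignorable = {"body": "hi", "features": {"sparkle": True}, "critical": []}

# Unknown feature that the sender says must be understood.
must_understand = {
    "body": "hi",
    "features": {"encryption": "x"},
    "critical": ["encryption"],
}

print(can_process(ignorable))
print(can_process(must_understand))
```

There is no version number anywhere in this sketch: each feature is judged on its own, so no global ordering of schema changes is needed.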

------
kragen
> _Then XML came along in 1998._

Useful context for the youngsters: Tim invented XML. I mean, technically, there
were a few other people who helped write the spec, but Tim was using
more-or-less XML in shipped applications before they got involved.

In my book, that means his opinion about the importance of bits on the wire is
pretty weighty, but you also shouldn't discount the possibility that he's in
some sense constitutionally inclined toward standardized data formats.

About other data formats:

I have some experience with CBOR which I regret; it's more compact than JSON,
but it's usually not significantly more compact than gzipped JSON, and it's
usually slower than JSON, which is saying something, and is surprising given
that no scanning for delimiters is needed. Its only real advantage is that you
can safely include binary blobs. It's worth noting that originally CBOR was
part of MessagePack until its author decided to fork it off and name it after
himself, Carsten Bormann.

Cap'n Proto, FlatBuffers, and SBE (which I guess is too obscure for Tim to
mention) get their "deserialization in negative nanoseconds" by virtue of not
deserializing; instead, the respective libraries access the desired fields of
the serialized data in place, which involves an extra pointer indirection or
two and possibly some byte swapping, so it's a little slower than accessing
fields of native structs. But it means you don't have to parse the entire data
structure, just the parts you're accessing, so you can do things like memory-
map a read-only data file instead of parsing it. In many cases, even if you're
accessing the whole thing, saving the extra memory copy is actually worth
_more_ performance than avoiding the byte-swapping and indirections. This is
not a new idea — COBOL was based on it, though without these systems'
extensibility and consequently without the cost of the extra indirections —
but it's an extremely effective idea in many contexts.
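
A toy illustration of the access-in-place idea, using only Python's `struct` module rather than any of the libraries named above. The record layout (`u32 id, u16 flags, f64 price`) is an assumption invented for the example; real formats add offsets and indirections for extensibility.

```python
import struct

# Hypothetical fixed layout (little-endian, no padding): u32 id, u16 flags, f64 price.
RECORD = struct.Struct("<IHd")
PRICE_OFFSET = 6  # 4 bytes of id + 2 bytes of flags

def build_file(records):
    """Serialize records back to back; stands in for a data file on disk."""
    return b"".join(RECORD.pack(*r) for r in records)

def price_at(buf, index):
    """Read one field of one record in place, without decoding the others."""
    (price,) = struct.unpack_from("<d", buf, index * RECORD.size + PRICE_OFFSET)
    return price

data = build_file([(1, 0, 9.5), (2, 1, 12.25), (3, 0, 0.75)])
view = memoryview(data)  # an mmap.mmap object could be passed the same way

print(price_at(view, 1))
```

Only the eight bytes of the requested field are touched; the other records are never parsed or copied.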

All of FlatBuffers, Cap'n Proto, SBE, Protobufs, Thrift, and I think Avro do
have support for adding new fields, contrary to what Tim says about binary
formats in his note; but they mostly do presuppose that those new fields are
being defined by some central authority, because there are things like
sequentially assigned field IDs, so certainly you don't have the level of safe
extensibility that you have in JSON (or HTTP or RFC-822 mail or HTML forms or
SNMP ASN.1 BER data or, of course, XML and HTML tags and attributes).

~~~
tialaramex
To be fair, most of those "extensible" formats aren't safely extensible. ASN.1
is because it uses OIDs, and some scenarios in XML and parts of HTTP are
safely extensible because they use URIs, but anywhere which relies on just
arbitrary text, like JSON, is just shrugging when it comes to global
collisions.

What I'm getting at here is, OIDs and URIs are globally namespaced. When I
decide to add my own custom foo to a format, OIDs let me go get myself an arc
and assign foo an OID from that arc, and URIs let me go get myself an FQDN and
then assign foo a URL with that FQDN. In HTML if I just add a foo attribute
that collides with everybody else's foo attribute, in RFC-822 mail my foo
header can collide with anybody else's foo header. And so on.
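
To make the difference concrete, here is a small sketch with made-up extension fields and made-up example domains. `merge_extensions` is a hypothetical helper that refuses to silently overwrite a key.

```python
def merge_extensions(base, *extensions):
    """Merge extension fields into a copy of base, raising on any key collision."""
    merged = dict(base)
    for ext in extensions:
        for key, value in ext.items():
            if key in merged:
                raise KeyError(f"collision on {key!r}")
            merged[key] = value
    return merged

base = {"subject": "hello"}

# Bare names: two parties independently pick "foo".
try:
    merge_extensions(base, {"foo": 1}, {"foo": "bar"})
    collided = False
except KeyError:
    collided = True

# Names scoped under a domain each party controls.
scoped = merge_extensions(
    base,
    {"https://alice.example/ns/foo": 1},
    {"https://bob.example/ns/foo": "bar"},
)

print(collided)
print(sorted(scoped))
```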

~~~
kragen
In theory, I sympathize with this view, but it doesn't really seem to be much
of a problem in practice. I don't think I've ever seen a colliding RFC-822
header name or HTML tag name. (There are lots of problems with RFC-822 but
colliding header names are not one that I've seen.) I _have_ occasionally had
problems in SNMP where the meaning of the same OID changed from one version of
a MIB to the next. I don't think I've seen XML namespaces maliciously
collided, but there aren't very many of them, maybe a couple dozen in
mainstream use.

I think this is sort of like how Wikis don't work in theory but do work in
practice.

One nitpick: I don't think the extensibility of SNMP is due to ASN.1 but due
to the key-value semantics of SNMP. It's true that the keys are OIDs, and the
OIDs are delegated to organizations to avoid collisions (by the ITU I guess),
but if you have a field in some arbitrary ASN.1 syntax that is just an OID,
that doesn't give you any kind of backward-compatible extensibility. However,
although I spent a couple of years struggling with SNMP on a daily basis, I'm
not an expert on ASN.1 by any means, so I could be mistaken about this.

~~~
tialaramex
I get your point that yes, you could use ASN.1 in a way that isn't extensible.
To pick a nit in your nit though:

OIDs are arranged in a hierarchy, so anybody who already has an arc can just
delegate you an arc from their part of the hierarchy, much the same way
foo.example could delegate kragen.foo.example to you, and then you could
delegate tialaramex.kragen.foo.example from that to me and so on.

The IETF hijacked the 1.3.6.1 arc, by basically arguing that if hypothetically
the US Department of Defence were asked to delegate them the 1.3.6.1 arc from
its 1.3.6 arc it would doubtless agree, and so no need to actually ask...

For example 1.3.6.1.5.5.7.3.1 is "serverAuth" the OID used to indicate that an
X.509 certificate is intended to identify a TLS Server, that was assigned by
the IETF's PKIX many years ago.

But the IETF has an arc under 1.3.6.1 for giving out OID arcs to anybody who
needs one, 1.3.6.1.4.1, you can just fill out a form for IANA explaining why
you need one (e.g. to make some new SNMP features work) and you'll get an arc
back typically within a few days. e.g. 1.3.6.1.4.1.44947 is ISRG, the charity
which provides the Let's Encrypt service.
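
The hierarchy is easy to check mechanically. A minimal sketch, using only the OIDs quoted above; `parse_oid` and `is_under` are names invented here.

```python
def parse_oid(dotted):
    """Turn '1.3.6.1' into (1, 3, 6, 1)."""
    return tuple(int(part) for part in dotted.split("."))

def is_under(oid, arc):
    """True if oid lies within (or equals) the subtree rooted at arc."""
    o, a = parse_oid(oid), parse_oid(arc)
    return o[:len(a)] == a

IANA_ARC = "1.3.6.1.4.1"        # the arc IANA hands sub-arcs out of
ISRG = "1.3.6.1.4.1.44947"
SERVER_AUTH = "1.3.6.1.5.5.7.3.1"

print(is_under(ISRG, IANA_ARC))
print(is_under(SERVER_AUTH, IANA_ARC))
print(is_under(SERVER_AUTH, "1.3.6.1"))
```

Comparing tuples of integers rather than string prefixes matters: as strings, "1.3.6.10" would wrongly appear to sit under "1.3.6.1".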

Colliding HTML tag names has been a problem, albeit a small one, even inside
"standard" HTML itself because of the conflict between WHATWG and W3C for
years. The small number of web browsers with significant market share made
this less serious than it might be. As to people redefining MIB OIDs, I can't
help idiots, I would regard every instance of this happening as a bug in the
later definition.

~~~
kragen
What I mean about ASN.1 is that if you have a field foo that's defined as

    foo SEQUENCE { corge INTEGER, grault OBJECT IDENTIFIER }

you can't compatibly augment it with a new field quux as

    foo SEQUENCE { corge INTEGER, grault OBJECT IDENTIFIER, quux OBJECT IDENTIFIER }

and it doesn't help you at all that grault and quux are both OBJECT
IDENTIFIERs.
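
A toy model of the problem, not real BER/DER decoding: it assumes the receiver decodes the SEQUENCE strictly by position and that the elements have already been parsed into Python values. `decode_foo_v1` is a hypothetical name.

```python
def decode_foo_v1(elements):
    """Strict positional decoder for the original two-field foo."""
    if len(elements) != 2:
        raise ValueError(f"expected 2 elements, got {len(elements)}")
    corge, grault = elements
    return {"corge": corge, "grault": grault}

old_wire = [42, "1.3.6.1.5.5.7.3.1"]
new_wire = [42, "1.3.6.1.5.5.7.3.1", "1.3.6.1.4.1.44947"]  # sender added quux

print(decode_foo_v1(old_wire))

try:
    decode_foo_v1(new_wire)
    old_reader_survives = True
except ValueError as err:
    old_reader_survives = False
    print("old reader rejects new message:", err)
```

The field types play no part in the failure; the old reader simply has no slot for a third element.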

Which HTML tag names had collisions? I can't think of any at the moment, but
admittedly I'm no HTML expert, though I've been reading HTML specs since HTML+
and before.

I'm more interested in problems not happening than in having someone to blame
the problems on.

~~~
tialaramex
<cite> is the canonical HTML tag collision. WHATWG and W3C defined this
differently. In practice as so often in the Web it's resolved by a mixture of
just accepting everything and so nobody gets what they wanted, and by
semantics of the document being ignored anyway.

Yes, you're correct that you can't just shove arbitrary things into ASN.1 in
this way and so ASN.1 itself is arguably not where the extensibility is,
that's fair.

~~~
kragen
Thanks for the note on <cite>! I don't think that's actually a _collision_ ;
<cite> was in HTML 2, so there wasn't much time for people to start using it
in conflicting ways. What happened with <cite> was that, for about four years,
Hixie published a _changed_ definition in the HTML5 spec, one which didn't
allow citing people's names, just the titles of their works. This attempted
change was eventually reverted. The URL-based and OID-based extensibility
mechanisms we're discussing don't even attempt to prevent such problems.

What I mean by "collision" is something like the following: one person (say,
Hixie) defines <c> in their browser or stylesheet to mean "canonical name"
while another person (say, the W3C) defines <c> to mean "category", resulting
in problems when you try to view documents written for one browser using the
other browser. I don't know of anything like this having happened in HTML,
ever.

------
sarosh
"The take-away is that if you’re going to start emitting events from a piece
of software, put just as much care into it as you would as you do in
specifying an API. Because event formats are a contract, too."

------
AdamN
To me the biggest annoyance of JSON is there's no standard way to put comments
in there. What's the point of a text-based serialization format if it can't
carry comments???

It has one advantage over YAML though, with JSON you know when you're not at
the end of the file :-)

~~~
akersten
The downside of this is that the comment becomes just another piece of the
data model, because anything that consumes that JSON will need to decide: do I
preserve comments, or remove them, when re-emitting the JSON?

If I remove them, comments are kind of only useful for things like initial
config files or what have you. Which might be fine for your use case, but JSON
broadly is not about making configuration files - this is more suited for
something like YAML.

If you preserve them - how is that different from a key-value just named
"comment": "Here's my comment"? (And if the answer is "because it might not
crash poorly coded applications," that's more of a discussion about the
application than it is about comments in JSON)
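
A quick check of both halves with Python's standard `json` module (the `retries` field is made up): a "comment" key survives a load/modify/dump cycle like any other data, while a `//` comment is rejected by the parser outright.

```python
import json

doc = '{"comment": "Here\'s my comment", "retries": 3}'

# The comment-as-data survives a read/modify/write cycle.
data = json.loads(doc)
data["retries"] += 1
re_emitted = json.dumps(data)
print(re_emitted)

# A real comment does not even parse with the stdlib decoder.
try:
    json.loads('{"retries": 3 // how many times to retry\n}')
    comment_parsed = True
except json.JSONDecodeError:
    comment_parsed = False
print(comment_parsed)
```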

~~~
quantified
Consider the comment as like a header in HTTP in relation to body content: a
side channel for information. I despair that anyone was putting processing
pragmas in comments, but you can’t beat all the human out of a programmer.

The word “Notation” in JSON and the fact that JSON allows an arbitrary
quantity of ignored bytes in any object (whitespace) imply that comments would
have been just fine. How many JSON processors are required to preserve
whitespace? (Ignorant here, I’ll take any informed answer and learn from it.)
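
For one data point, Python's standard `json` module does not preserve whitespace: it parses to plain values and re-emits with its own default separators. This says nothing about what other processors do.

```python
import json

spaced = '{\n    "a" :   1,\n\n    "b" : [ 1,  2 ]\n}'
compact = '{"a":1,"b":[1,2]}'

# Insignificant whitespace does not change the parsed value.
print(json.loads(spaced) == json.loads(compact))

# Re-emitting discards the original layout entirely.
round_tripped = json.dumps(json.loads(spaced))
print(round_tripped)
```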

------
jolmg
My browser opens up a download dialog on clicking this link. It's probably
because the response headers don't include "Content-Type", and so it doesn't
know if it will be able to interpret the response.

~~~
kragen
Interesting, what browser?

~~~
jolmg
Firefox on Arch Linux. Though, now I think the issue is with an extension I
have.

~~~
kragen
Weird, I'm not having that problem in Firefox on Debian. But it's sure enough
true that there's no Content-Type header, either with LWP or with Firefox.
Tim, are you doing that on purpose?

~~~
jolmg
I found it's my extension[1]. That's what's interpreting the lack of
Content-Type as an indication that Firefox is going to force a download (it
actually doesn't) and showing me its own download dialog.

[1] [https://github.com/Rob--W/open-in-browser](https://github.com/Rob--W/open-in-browser)

------
geraldbauer
A quick note about json next. I collect new JSON variants, extensions and
formats at the Awesome JSON Next page [1].

[1] [https://github.com/json-next/awesome-json-next](https://github.com/json-next/awesome-json-next)

------
Thorentis
Interesting article, but lots of "probably the next biggest consumer" etc. Was
hoping he would have some kind of network traffic analysis report in there
that backs up his ranking of the formats. Seems to be anecdotal more than
anything. Still, a good overview of the different formats available.

------
commandersaki
A quick note about protobuf. It's just way too complicated in my opinion. If
you need a binary protocol just use TLVs (Tag-Length-Value) - so much less
cognitive load and the parsing is easy to implement and read compared to the
complicated autogenerated mess that is protobuf.
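
For a sense of scale, a minimal TLV codec sketch. The framing (1-byte tag, 2-byte big-endian length) and the tag numbers are assumptions picked for the example, not any standard.

```python
import struct

def tlv_encode(tag, value):
    """One record: 1-byte tag, 2-byte big-endian length, then the value bytes."""
    return struct.pack(">BH", tag, len(value)) + value

def tlv_decode(buf):
    """Yield (tag, value) pairs; raise ValueError on truncated input."""
    offset = 0
    while offset < len(buf):
        if offset + 3 > len(buf):
            raise ValueError("truncated header")
        tag, length = struct.unpack_from(">BH", buf, offset)
        offset += 3
        if offset + length > len(buf):
            raise ValueError("truncated value")
        yield tag, buf[offset:offset + length]
        offset += length

TAG_NAME, TAG_AGE, TAG_FUTURE = 1, 2, 99  # hypothetical tag assignments

wire = (
    tlv_encode(TAG_NAME, b"alice")
    + tlv_encode(TAG_FUTURE, b"??")   # a tag this reader does not know
    + tlv_encode(TAG_AGE, b"\x2a")
)

# Unknown tags can be skipped because the length says how far to jump.
known = {tag: value for tag, value in tlv_decode(wire) if tag in (TAG_NAME, TAG_AGE)}
print(known)
```

What this leaves out is most of what a schema compiler provides: nested messages, typed values, and agreement on who assigns which tag.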

