
Please stop writing new serialization protocols - otterley
https://scottlocklin.wordpress.com/2017/04/02/please-stop-writing-new-serialization-protocols/
======
jmillikin

      > Imagine if there were 40 competing and completely
      > mutually unintelligible versions of html or text encodings
    

There are.

    
    
      > There really should be a one size fits all minimal
      > serialization protocol
    

There can't be.

    
    
      > just the same way there is a one size fits all network
      > protocol which moves data around the entire internet
    

There isn't.

~~~
std_throwaway
Oh no, there are 40 competing standards. We must make one to unify them all!
Oh no, there are 41 competing standards.

~~~
bluejekyll
In fairness, they are saying to pick one: the earliest one.

~~~
std_throwaway
He picks it up and adds one tiny modification (changing endianness). Others
might need to change a different detail, making their versions incompatible again.

~~~
fnord123
He describes MessagePack, afaict. Then acknowledges it, tongue in cheek.

------
trishume
I don't think this is as big a problem as the article suggests. There are a
substantial number of tradeoffs in serialization protocols, and each
application/ecosystem can choose the protocol that strikes the best balance
among them. As long as there are few enough protocols that every popular one
has a library in almost every language, this isn't too bad.

One example of a tradeoff that is hard to eliminate is that you can reduce
size and increase performance substantially if you pre-specify a schema, as
Cap'n Proto (and others) do. The downside is that if you receive a message
without knowing its schema, it's difficult to figure out what it contains. The
only way out of this tradeoff that I can see is a global schema registry, with
every message dedicating 8 bytes to a schema ID, and that has downsides of
its own, especially for small messages.
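The registry idea above can be sketched in a few lines of Python. Note that the registry contents, schema ID, and names here are entirely made up for illustration; no such global registry exists:

```python
import struct

# Hypothetical global schema registry: maps an 8-byte schema ID to a
# human-readable schema name. In a real system this would be a shared
# service, not a dict.
REGISTRY = {0x1122334455667788: "monitoring.CpuStats v2"}

def frame(schema_id: int, payload: bytes) -> bytes:
    """Prefix the payload with its 8-byte big-endian schema ID."""
    return struct.pack(">Q", schema_id) + payload

def identify(message: bytes) -> str:
    """Recover the schema name from a framed message, if registered."""
    (schema_id,) = struct.unpack_from(">Q", message)
    return REGISTRY.get(schema_id, "<unknown schema>")

msg = frame(0x1122334455667788, b"...")
assert identify(msg) == "monitoring.CpuStats v2"
```

The cost is visible immediately: 8 bytes of overhead on every message, which dominates for small payloads.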

I do agree with the author though that we could do with more binary
serialization protocols with tools to easily translate back and forth to a
human-readable text format for debugging.

~~~
galacticpony
>> The downside is then if you just get a message without knowing what it is
about it's difficult to figure out.

You're saying that as if there were a clerk reading the message at the other
end. If a computer program gets a message it doesn't "understand", it's not
going to "figure it out" either way.

Worse, if you're a programmer who doesn't know the actual schema and you're
trying to figure out what the schema might be _just by looking at the data_,
you'll probably run into trouble.

~~~
trishume
I was talking purely about humans finding messages they don't have the schema
for during the course of debugging. Computers do equally badly with JSON data
and Cap'n Proto data they don't have the schema for.

A fake example scenario of what I'm talking about: someone notices a server is
producing way too many errors; they check the logs and see a bunch of "invalid
request: ...." messages. The messages are JSON like
{"api_version":2,"cpu_usage":0.57}. They figure out that the server monitoring
system is forwarding stats to the wrong IP address. Even though they don't
know the full schema of the messages, they get the gist, and it helps them
debug. With Cap'n Proto that would have been a few nonsensical bytes.

I don't think this is a very big downside, so in general I like protocols like
Cap'n Proto.

Perhaps a much bigger downside is that in most languages integrating code
generation into the build pipeline is a pain in the ass, so it is way easier
to use dynamic serialization protocols that don't require compiling a schema.

------
niftich
Thrift and Avro -- and later, gRPC, but not protobuf -- are full RPC stacks
where you use an IDL to codegen your endpoints, and those endpoints
communicate using their own serialization. Since this form-on-the-wire is an
"internal" concern not meant for direct public consumption, I find this
acceptable.

Meanwhile, XML-RPC (which is not a serialization format!), JSON-RPC, SOAP,
and Swagger are stacks that intentionally leave open the possibility that
someone will come along and consume the form-on-the-wire directly, outside
the tooling of the environment. Most in-the-wild JSON-responding APIs have the
same expectation.

IDLs themselves are a very old idea, probably because we like declarative ways
of specifying contracts that are then applicable across a heterogeneous
environment, or in different languages and runtimes, and so on.

As for why there are dozens of offshoots of standalone serialization formats
which are all predominantly occupied with the efficient packing of numbers
while keeping the general data model of JSON, I can't answer [1].

[1]
[https://news.ycombinator.com/item?id=12440783](https://news.ycombinator.com/item?id=12440783)

------
jnordwick
The only three I ever use:

Cap'n Proto [https://capnproto.org/](https://capnproto.org/)

Simple Binary Encoding (by Martin Thompson) [https://github.com/real-logic/simple-binary-encoding/wiki/Design-Principles](https://github.com/real-logic/simple-binary-encoding/wiki/Design-Principles)

and if neither of those will do, raw C structs on the wire (basically what the
other two are anyway).

------
userbinator
As much as it seems to be recommended against (by... authors of serialisation
protocols?), I am a strong believer in just using simple structures (and unions
if necessary) directly --- all these serialisation abstractions appear to have
been invented at a time when machines varied far more widely in
characteristics such as endianness, alignment, word size, integer
representation, and even byte size. Now that your platform is almost certainly
going to be x86 or ARM, it makes little sense to add a layer of (sometimes
substantial) complexity in essentially attempting to accommodate flexibility
that won't be needed. I can see the necessity if e.g. you need to communicate
with a 36-bit 1's complement mainframe, but otherwise it's just bloat.
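For illustration, the fixed-layout "struct on the wire" approach can be mimicked in Python with the standard `struct` module. The message layout here (u32 id, u16 flags, f64 value) is hypothetical:

```python
import struct

# '<' fixes little-endian and disables native alignment; the two pad
# bytes ('xx') make the layout explicit rather than relying on whatever
# padding a C compiler would insert before the double.
FMT = "<IHxxd"
assert struct.calcsize(FMT) == 16  # 4 + 2 + 2 (pad) + 8

wire = struct.pack(FMT, 42, 0b101, 3.5)   # serialize: just a memory copy
msg_id, flags, value = struct.unpack(FMT, wire)
assert (msg_id, flags, value) == (42, 0b101, 3.5)
```

The point userbinator makes holds: as long as both ends agree on endianness, sizes, and padding, "parsing" is essentially free.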

Along the same sentiment, I'm not a fan of APIs using JSON and/or XML or some
other overly-flexible textual encoding. Simple binary encodings, TLV-ish if
necessary, are the best.
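A minimal TLV (type-length-value) framing along these lines might look like the following sketch; the tag values and field sizes (1-byte type, 2-byte big-endian length) are arbitrary choices for illustration:

```python
import struct

def encode_tlv(t: int, value: bytes) -> bytes:
    """One TLV record: 1-byte type, 2-byte big-endian length, payload."""
    return struct.pack(">BH", t, len(value)) + value

def decode_tlv(buf: bytes):
    """Yield (type, value) pairs from a concatenated TLV buffer."""
    off = 0
    while off < len(buf):
        t, length = struct.unpack_from(">BH", buf, off)
        off += 3
        yield t, buf[off:off + length]
        off += length

wire = encode_tlv(1, b"hello") + encode_tlv(2, b"\x00\x2a")
assert list(decode_tlv(wire)) == [(1, b"hello"), (2, b"\x00\x2a")]
```

A nice property of TLV is that a reader can skip records whose type it doesn't recognize, since the length is always explicit.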

I was never really convinced by the "human readable" argument for textual
encodings either --- you just need to get used to it, then you can read and
write the bytes in a hexdump as easily as you can English. In fact I'd prefer
working with hexdumps to XML. But unfortunately there's now a whole generation
of developers who can't even count to 2 in binary and don't know what a hex
editor is...

~~~
protomyth
I can see "human readable" for system config files since that can save your
bacon in an emergency (editable in a text editor), but I'm really at a loss
when we are talking about a serialization protocol or app config files. I want
efficiency and minimal parsing code, which makes it problematic to throw in
human concerns.

~~~
jimmaswell
Why would you want human editable for system configs but not app configs?

~~~
protomyth
I can probably have a program running to edit the app config file, but if I
cannot edit the system config file in a text editor (particularly if things get
bad enough that I'm booting in single-user mode), I am really hosed.

------
bogomipz
The author states:

>"Java monkeys eventually noticed how slow XML was between garbage collects
and wrote the slightly less shitty but still completely missing the point
Avro."

I would like to know why the author feels that Avro misses the point. Can
anyone hazard a guess?

and similar for:

>"Oh yeah, I do like Messagepack; it’s pretty cool."

It would be interesting to hear why they (or anyone else, for that matter)
consider MessagePack a worthwhile contribution to the serialization tool shed
but Avro not.

~~~
linkmotif
I really don't get it either. What is interesting about Messagepack?

------
mmphosis
I had to look up XDR and ASN.1 ...

External Data Representation (XDR)

[https://en.wikipedia.org/wiki/External_Data_Representation](https://en.wikipedia.org/wiki/External_Data_Representation)

[https://tools.ietf.org/html/rfc4506](https://tools.ietf.org/html/rfc4506)

Abstract Syntax Notation One (ASN.1)

[https://en.wikipedia.org/wiki/Abstract_Syntax_Notation_One](https://en.wikipedia.org/wiki/Abstract_Syntax_Notation_One)

~~~
floatboth
ASN.1 is quite well known because of X.509 certificates though! XDR, yeah, I
had to look up as well.

~~~
trav4225
SNMP is also a reason ASN.1 is well known...

~~~
wichert
And LDAP is pretty popular as well.

~~~
trav4225
truedat.

------
etjossem

      "human readable JSON like format"
      "JSON style human semi-readable form"
      > accidentally mentions JSON twice in the list
    

JSON is the most widely used serialization protocol in web application
development today. The article's repeated mention of it is no accident at all:
it's considered best practice across the industry. Frankly, I don't know why
we are still talking about this.

~~~
jwilliams
The "semi" in semi-readable is no mistake either. The lack of comments is a
real drawback in JSON for human readability. It's a major reason why YAML is
used as a substitute in complex cases (AWS CloudFormation templates are a
good example; they've all gone YAML).

Similarly, JSON still has ambiguity around encoding (which flavor of UTF).
Dates are the same. Even integers are effectively limited to JavaScript's
53-bit implementation (all numbers in JavaScript are doubles, so you're stuck
with the mantissa limit). This is a huge pain if you're dealing with 64-bit values.
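The 53-bit limit is easy to demonstrate: JSON's grammar allows arbitrary-precision numbers, but any consumer that parses them as IEEE-754 doubles (as JavaScript does) loses exactness above 2**53. A small Python check:

```python
import json

big = 2**53 + 1          # 9007199254740993, a 64-bit-ish value

# As a double, 2**53 + 1 rounds to 2**53 -- the precision JavaScript
# consumers would silently lose:
assert float(big) == float(2**53)

# Python's json module keeps it exact, because it parses integer
# literals as arbitrary-precision ints rather than doubles:
assert json.loads(json.dumps(big)) == big
```

So whether a 64-bit value survives a JSON round trip depends entirely on the parser at the other end, which is exactly the ambiguity being complained about.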

UTF-8 and JSON is a good place to start for Web Development, but there is no
way it's the final or only answer.

~~~
Dylan16807
> integers, which are limited to JavaScript's 53-bit implementation

According to what? JSON just says that a number is a series of digits.

------
simonw
As Douglas Crockford, inventor of JSON, once said: "The good thing about
reinventing the wheel is that you can get a round one."
[https://en.wikiquote.org/wiki/Douglas_Crockford](https://en.wikiquote.org/wiki/Douglas_Crockford)

------
js8
There are three main design trade-offs in the space of serialization protocols
(even ignoring RPC):

1\. Do you want an efficient format (i.e. binary) or a format that is easily
readable/writable by humans (i.e. text)?

2\. Do you want self-describing format (data schema is part of the standard)
or schema-less format, to save space on transmission? And how much?

3\. How do you deal with mixed text data? Do you want primarily to store
strings and escape any structured data within them (a la XML), or primarily to
store structured data and escape all text within it (a la JSON)?
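Point 3 can be illustrated in a couple of lines: XML treats text as primary and escapes markup embedded in it, while JSON treats structure as primary and escapes text embedded in it. A quick Python comparison (the sample string is arbitrary):

```python
import json
from xml.sax.saxutils import escape

note = 'Size <small> & "cheap"'

# XML: the text is the payload; markup characters inside it get escaped.
assert escape(note) == 'Size &lt;small&gt; &amp; "cheap"'

# JSON: the structure is the payload; the text is escaped into a string
# literal (quotes escaped, angle brackets left alone).
assert json.dumps(note) == '"Size <small> & \\"cheap\\""'
```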

Personally, I like CBOR, which interestingly wasn't mentioned.

~~~
twic
4\. Do you want to optimise for space (in which case use variable-length
integers) or speed of parsing (in which case use fixed-length integers)?

The speed-of-parsing case is definitely rare, but I've done some work where
packets come in off a 10-gig Ethernet port, and every nanosecond counts.
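For reference, the variable-length side of this tradeoff is usually a protobuf-style base-128 varint: 7 payload bits per byte, with the high bit marking continuation. Small values take one byte where a fixed-width u64 always takes eight. A minimal sketch:

```python
def encode_varint(n: int) -> bytes:
    """Encode a non-negative int as a little-endian base-128 varint."""
    out = bytearray()
    while True:
        b = n & 0x7F
        n >>= 7
        if n:
            out.append(b | 0x80)  # continuation bit set: more bytes follow
        else:
            out.append(b)         # final byte: high bit clear
            return bytes(out)

def decode_varint(buf: bytes) -> int:
    """Decode a base-128 varint from the start of buf."""
    result = shift = 0
    for b in buf:
        result |= (b & 0x7F) << shift
        if not b & 0x80:
            break
        shift += 7
    return result

assert encode_varint(1) == b"\x01"        # one byte instead of eight
assert encode_varint(300) == b"\xac\x02"  # the classic protobuf example
assert decode_varint(encode_varint(300)) == 300
```

The branchy, byte-at-a-time decode loop is exactly what you pay for the space savings, which is why fixed-width fields win when every nanosecond counts.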

------
i80and
The big problem is that rpcgen is hairy and not reeeeeeally portable (in my
experience, if there's a feature you want, it's not supported on your Unix).

XDR is nice, though, apart from being big-endian and not having widely-
supported 64-bit integers. It's a pity it's unfashionable.

~~~
cdegroot
I think I re-implemented rpcgen a couple of times in Ye Olde Days, so it can't
be hard. Then DCE came by, and then I lost track of the serialization protocols
that are essentially rehashes of SunRPC/XDR. So yeah, I've been thinking along
the same lines as Scott - why?

These days, I've mostly thrown in the towel and plunk JSON in everywhere. At
least it's hip, and if I ever find a piece of code where serialization is the
bottleneck, that will be early enough to start pushing XDR (not ASN.1 - I once
tried to write a parser for _that_ and ended up with some extra grey hairs ;-)).

------
jcdavis
When Facebook wrote Thrift, protobuf wasn't open-sourced yet.

------
erikb
The only problem with this kind of rant is the assumption that people actually
talk to each other. But why would they?

When was the last time you talked to people working on another software stack?
Cooperation between different tribes needs to be enforced by strong leadership
or a Big Need, like imminent extinction of the tribe. As long as that doesn't
exist and the whole ecosystem continues to grow, you can just sit there and
watch people building the next silo, and the next, instead of getting to a
higher step in evolution.

And it's actually the reasonable thing to do. I mean, would you rather have a
minuscule share of a cake others baked, or your own cake? When both are about
the same effort, I'd rather have my own cake, even if I have to define a new
serialization protocol to store it.

------
zeta0134
There are some really good reasons to pick an unusual serialization protocol,
and even sometimes reasons to invent your own. (Embedded systems, limited
environments, licensing restrictions, etc.) Generally though, you should use
something the rest of your development team / community is familiar with. Not
because this is efficient in terms of resource usage on the machine, but
because this is efficient in terms of teaching your other developers how your
serialization protocol works.

JSON may be everywhere, and it's tempting to look at its flaws and think, "we
can do better", but it also has the great benefit of decent serialization
libraries already written in the vast majority of programming languages.
That's one heck of a feature.

------
linkmotif
I wonder what the author dislikes about gRPC.

------
dkarapetyan
Along a similar vein, please stop writing external DSLs. Especially in the
DevOps ecosystem. I'm really tired of learning yet another syntax for (bash +
ssh).

~~~
coldtea
If you're tired, then just use bash + ssh.

People use those "external DSLs" because they are tired of bash and ssh for the
things they want to do.

~~~
dkarapetyan
Or you know they could just write a library in a real programming language
instead of forcing people to write YAML. Don't quite get the fascination with
YAML. Give me actual fucking syntax instead of some bastardized serialization
format that is badly trying to ape lisp.

~~~
camus2
The issue is that as soon as you introduce a scripting language into a
solution, people will write monstrosities with it. A configuration format
guarantees that configuration will stay simple and "readable". I agree that
LISP should be used more often, though.

~~~
dkarapetyan
I have now seen enough deployments of puppet, salt, ansible, etc. to last me a
lifetime and then some. Trust me. People have managed to turn configuration
formats into unreadable monstrosities as well. At least with a real
programming language I would have sensible debugging capabilities and maybe
some refactoring tools to help but with a half-baked YAML programming language
I have no such recourse.

------
ricardobeat

        > superset of this with a JSON style human semi-readable form, 
        and an optional self-description field, and you’ve solved all 
        possible serialization problems
    

I'm really curious about Ion: [https://amznlabs.github.io/ion-docs/](https://amznlabs.github.io/ion-docs/)

Is it in use anywhere else besides Amazon?

------
peterkelly
And we should have one standard programming language to process the data.

------
ahupp
As a general rule, when you find that many people have come up with different
solutions to a similar problem it's a good bet that there's actually not a
single good solution available.

"Facebook couldn’t possibly use something written at Google, so they built
“Thrift,” which hardly lives up to its name, but at least has a less shitty
version of RPC in it."

Thrift predates the public release of protobuf.

------
dweis
Disclaimer: I have written and designed many parts of the Protocol Buffer
libraries.

[https://tools.ietf.org/html/rfc1832.html#section-6](https://tools.ietf.org/html/rfc1832.html#section-6)
has a good encoding example of XDR.

Contrasting that with Protocol Buffers is enlightening as it clearly
demonstrates some differences in design goals and where tradeoffs are made.
Feel free to correct my interpretations as I may have missed something!

1\. XDR appears to have no equivalent of a Protocol Buffer field number => the
format is not self-describing. That is, one must have a schema to properly
interpret the data.

2\. XDR appears to encode lengths as fixed-width based on the block size =>
faster to read/write but larger on the wire than a varint encoding.

3\. XDR's string data type is defined as ASCII => it does not support modern
Unicode outside of the variable-length opaque type.
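On point 1, the field number is encoded into a small tag preceding each field, `tag = (field_number << 3) | wire_type`, which is what lets a protobuf parser identify and skip fields it doesn't know about; XDR has nothing equivalent, so position alone carries meaning. A tiny sketch (wire-type constants per the protobuf encoding documentation):

```python
# Protobuf wire types (from the official encoding spec).
VARINT, I64, LEN, I32 = 0, 1, 2, 5

def tag(field_number: int, wire_type: int) -> int:
    """Compute the tag value that precedes a field on the wire."""
    return (field_number << 3) | wire_type

# Field 1 as a varint: tag byte 0x08 -- the familiar first byte of
# a message whose first field is a small integer.
assert tag(1, VARINT) == 0x08
# Field 1 as a length-delimited value (string/bytes/submessage): 0x0A.
assert tag(1, LEN) == 0x0A
```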

1 would seem to present difficulty for modern distributed systems, as one
cannot control the release process of every distinct binary in the ecosystem
to ensure that they are all schema-equivalent at the same time. This can be
remedied by propagating the schema as a header to the underlying data for
consumption on the other side, but that bloats the wire format. Header
compression could help with this but may be problematic for systems with
severely constrained networks that constantly reestablish new connections (ex.
mobile).

1 also has implications for data persistence. One cannot ever remove or
reorder an XDR struct member else they will incorrectly parse data that was
written in the older format. This is in contrast with Protocol Buffers, where
one can remove or reorder message members whenever they'd like, as long as
they take care not to reuse a tag number (and the newer `reserved` feature can
help with that).

2 is just a performance tradeoff: binary size or (de|en)code performance?

3 has implications for memory constrained systems. Ex. on Android we eagerly
parse string fields to avoid doubling the allocation overhead (first as raw
bytes, then as a String object). If we required all string datatypes in
Protocol Buffers to be defined as bytes fields (the variable-length opaque
data type equivalent), we wouldn't be able to provide this optimization.

Overall, XDR looks like a good fit for inter-process communication in a
homogeneous environment. Protocol Buffers looks like a good fit for
cross-language communication across heterogeneous and unversioned
environments. Directly comparing the two, XDR is much more verbose on the wire
(particularly if we mitigate the versioning issues by serializing the schema
in a header), whereas it's likely significantly faster to (en|de)code. I.e.,
there's a tradeoff between networking/storage costs and CPU performance.

Scott makes a bunch of provocative declarations in his post, but I think many
of them betray a lack of background to appropriately understand the tradeoffs
involved. As illustrated above, Protocol Buffers makes a bunch of design
affordances for compactness on the wire which XDR does not accommodate. He
also believes Google's RPC system to be "unarguably shitty" even though it has
never been open sourced due to dependency issues (what is open sourced as part
of Protocol Buffers is a shim; gRPC is the future here). His impression of why
Facebook built Thrift is similarly misinformed, as Protocol Buffers was not
open source when Thrift was written.

~~~
tjalfi
TL;DR. XDR was designed in the 1980s. This is a major factor in Decisions 2
and 3.

RFC1832 has a rationale for decision 2 towards the end of the document. Here
is an excerpt.

    
    
       "(4) Why is the XDR unit four bytes wide?
          There is a tradeoff in choosing the XDR unit size.  Choosing a small
       size such as two makes the encoded data small, but causes alignment
       problems for machines that aren't aligned on these boundaries.  A
       large size such as eight means the data will be aligned on virtually
       every machine, but causes the encoded data to grow too big.  We chose
       four as a compromise.  Four is big enough to support most
       architectures efficiently, except for rare machines such as the
       eight-byte aligned Cray*.  Four is also small enough to keep the
       encoded data restricted to a reasonable size."
    

Most RISC architectures of the time did not support unaligned memory accesses.

Decision 3 is a matter of timing. RFC1014 (
[https://tools.ietf.org/html/rfc1014](https://tools.ietf.org/html/rfc1014))
was released in 1987. Unicode was still a work in progress.

Edited to fix formatting of the excerpt and a typo.

~~~
dweis
Thanks for pointing these out! This sort of archaeology is always interesting.
I tend to have to do it for Protocol Buffers too to understand various design
decisions.

------
YZF
One big influence was HTTP. You really need to consider HTTP as part of the
serialization story. HTTP led to XML over HTTP and then JSON over HTTP. Then
came HTTP/2, and then gRPC.

Would it be better if we still sent HTML over ASN.1? I think the _state of
things_ would be much worse.

------
netmask
hoo maaaan, i was about to write a new serialization protocol just after my
weekly JS framework :(

------
Sphax
Maybe if the author wasn't an arrogant jackass I would be more inclined to
listen to him.

------
philipov
At the end, it sounds very much like he's suggesting adding another protocol
based on XDR, with just a few reasonable changes. It's like the article is a
living example of the classic xkcd joke.

[https://xkcd.com/927/](https://xkcd.com/927/)

------
trishume
Situation: there are now n+1 competing serialization protocols

[https://xkcd.com/927/](https://xkcd.com/927/)

~~~
trendia
Sad you got downvoted. That's pretty much what the author did -- list a bunch
of serialization protocols, say that the ones developed in the 80's were good
enough, and _then_ recommend we all use MessagePack.

~~~
yborg
That isn't accurate; he states a preference for XDR, which is ancient, and
says he "likes" MessagePack. He didn't explicitly advocate it.

Not blaming anybody for skimming the post, which was a pretty typical blog
rant; if it were from certain other people it would clearly be clickbait, but
this seemed like at least genuine ranting.

~~~
scottlocklin
It was actually an email to a colleague who is a little too enthusiastic about
serialization protocols. I thought of it as funny rather than anything
resembling advice.

I'll probably use MessagePack instead of JSON in an upcoming project, for
performance reasons and the fact that it works better than the JSON thing in J.

