Every now and then I come across someone who thinks data is like water and we should have standardised 'pipes' and components for processing and transporting data and converting it between formats like we have pipes and valves for transporting and purifying water.
But data isn't like water, it's like 'chemicals'. You can't have a standard component for processing data, the same way you can't have a single component that knows how to process sulphuric acid, crude oil, hydrazine, mercury and molten salt.
Data can be binary, delimited, fixed-field, lossy, ASCII, many variations of Unicode, executable, contain attacks such as SQL injection, be encrypted, have time-sensitive delivery requirements, include checksums, require checksums to be applied in the protocol, be meaningless without metadata or other data, and have who knows what other constraints, limitations and requirements.
Building a data processing system is not at all like building a water works. It's more like building a chemicals processing plant.
All valid. But if you want to avoid a shock, please don't talk to people who transport actual water in actual pipes. I bet you'd be surprised how many different standards of pipes there are and the long list of features that separate one kind of water from the other. ;-)
If you want to offend both groups: Germany has a joke item called a "Gardena RJ45 Adapter"[0] which allows you to connect your garden hose to your network...
Honestly that enhances the narrative. Just be like "data isn't like our idealized view of water" and "even water requires different pipes in different circumstances".
I think the point may include the fact that a chemical plant and a water plant are not that much different in comparison to something which would appear to be a radical shift in approach.
It is my opinion that this will be the function AI serves in reality.
Same thoughts on AI + data. Well, as a dream. In reality programming seems to be one of the last areas AI is applied to despite the potential benefits.
A "protocol" is just a good excuse to create a consciousness entity, which itself is designed to interpret the communication between two other entities.
He picks it up and adds one tiny modification (changing endianness). Others might need to change another detail, making their version incompatible again.
edit: scrolling down shows me this has already been posted 3 times in this thread (and will probably be posted a few more times before it's off the front page)
If you remove the indentation, your > will markup the text as a quote. The indentation marks it as unwrapped preformatted text, which renders this text illegible on mobile.
> If you remove the indentation, your > will markup the text as a quote.
No it won't. It just puts a '>' in front of the line. You can also put asterisks around the text to italicize it (as I've done here). But the '>' has no special support.
Nothing is competing with HTML on being what HTML is. Nothing is even close.
All text encodings in common use build on top of ASCII. The de-facto standard (now) is UTF-8 and there's no reason to change that.
IP really does move all the data around the entire internet (that's kind of in the name right there!).
Yes, there are variations, extensions and incompatibilities, but these things are not being re-invented all the time, exactly because these things are ubiquitous standards.
I don't think this is as big a problem as the article suggests. There are a substantial number of tradeoffs in serialization protocols, and each application/ecosystem can choose their protocol to get the best of these tradeoffs. As long as there are few enough that every popular protocol has a library in almost every language this isn't too bad.
One example of a tradeoff that is hard to eliminate is that you can reduce size and increase performance substantially if you pre-specify a schema, as Cap'n Proto (and others) do. The downside is that if you then get a message without knowing what it is about, it's difficult to figure out. The only way out of this tradeoff that I can see is having a global schema registry and every message carrying 8 bytes dedicated to a schema ID, and that has downsides of its own, especially for small messages.
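To make the registry idea concrete, here is a rough sketch of what such framing might look like in C (the layout and names are made up for illustration; this is not any existing protocol's format):

    #include <stdint.h>
    #include <string.h>

    /* Hypothetical frame: an 8-byte registry-assigned schema ID in front
       of the schema-compiled payload. */
    struct frame_header {
        uint64_t schema_id;    /* key into a global schema registry */
        uint32_t payload_len;  /* bytes of encoded message that follow */
    };

    /* Write header + payload into out; caller guarantees capacity. */
    size_t frame_write(uint8_t *out, uint64_t schema_id,
                       const uint8_t *payload, uint32_t len)
    {
        struct frame_header h = { schema_id, len };
        memcpy(out, &h, sizeof h);            /* NB: host byte order */
        memcpy(out + sizeof h, payload, len);
        return sizeof h + len;
    }

Twelve-plus bytes of framing on a message that might itself only be a handful of bytes is exactly the small-message overhead mentioned above.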
I do agree with the author though that we could do with more binary serialization protocols with tools to easily translate back and forth to a human-readable text format for debugging.
>> The downside is that if you then get a message without knowing what it is about, it's difficult to figure out.
You're saying that as if there was a clerk reading the message at the other end. If a computer program gets a message it doesn't "understand", it's not going to "figure it out" either way.
Worse, if you're a programmer that doesn't know the actual schema and you're trying to figure out what the schema might be just by looking at the data, you'll probably run into trouble.
I was talking purely about humans finding messages they don't have the schema for during the course of debugging. Computers do equally badly with JSON data and Cap'n Proto data they don't have the schema for.
A fake example scenario of what I'm talking about: Someone notices a server is producing way too many errors, they check the logs and see a bunch of "invalid request: ...." messages. The messages are JSON like {"api_version":2,"cpu_usage":0.57}. They figure out that the server monitoring system is forwarding stats to the wrong IP address. Even though they don't know the full schema of the messages, they get the gist and it helped debug. With Cap'n Proto that would have been a few nonsensical bytes.
I don't think this is a very big downside, so in general I like protocols like Cap'n Proto.
Perhaps a much bigger downside is that in most languages integrating code generation into the build pipeline is a pain in the ass, so it is way easier to use dynamic serialization protocols that don't require compiling a schema.
> You're saying that as if there was a clerk reading the message at the other end. If a computer program gets a message it doesn't "understand", it's not going to "figure it out" either way.
Self-describing formats do make backwards compatibility a bit easier. With Cap'n Proto, adding a new field in the middle will break consumers.
Thrift and Avro -- and later, gRPC, but not protobuf -- are full RPC stacks where you use an IDL to codegen your endpoints, and those endpoints communicate using their own serialization. Since this form-on-the-wire is an "internal" concern not meant for direct public consumption, I find this acceptable.
Meanwhile, XML-RPC (which is not a serialization format!), JSON-RPC, SOAP, Swagger, are stacks that intentionally leave open the possibility that someone will come along and consume the form-on-the-wire directly, outside of the tooling of the environment. Most in-the-wild JSON-responding APIs have the same expectation.
IDLs themselves are a very old idea, probably because we like declarative ways of specifying contracts that are then applicable across a heterogeneous environment, or in different languages and runtimes, and so on.
As for why there are dozens of offshoots of standalone serialization formats which are all predominantly occupied with the efficient packing of numbers while keeping the general data model of JSON, I can't answer [1].
As much as it seems to be recommended against (by... authors of serialisation protocols?) I am a strong believer in just using simple structures (and unions if necessary) directly --- all these serialisation abstractions appear to have been invented at a time when machines varied far more widely in their characteristics such as endianness, alignment, word size, integer representation, and even byte size. Now that your platform is almost certainly going to be x86 or ARM, it makes little sense to add a layer of (sometimes substantial) complexity in essentially attempting to accommodate flexibility that won't be needed. I can see the necessity if e.g. you need to communicate with a 36-bit 1's complement mainframe, but otherwise it's just bloat.
Along the same sentiment, I'm not a fan of APIs using JSON and/or XML or some other overly-flexible textual encoding. Simple binary encodings, TLV-ish if necessary, are the best.
I was never really convinced by the "human readable" argument for textual encodings either --- you just need to get used to it, then you can read and write the bytes in a hexdump as easily as you can English. In fact I'd prefer working with hexdumps to XML. But unfortunately there's now a whole generation of developers who can't even count to 2 in binary and don't know what a hex editor is...
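For anyone who hasn't worked this way, the approach being argued for here is roughly the following (a minimal sketch, assuming both ends are the same little-endian architecture built with the same packing settings; the message fields are invented):

    #include <stdint.h>
    #include <stdio.h>

    /* "Just use a struct": fixed-width fields, no schema, no encoding
       layer. Both sides must share this exact definition. */
    #pragma pack(push, 1)
    struct sensor_msg {
        uint32_t msg_type;
        uint64_t timestamp_ns;
        int32_t  temp_milli_c;
    };
    #pragma pack(pop)

    /* The bytes on the wire are exactly the in-memory layout. */
    int send_msg(FILE *wire, const struct sensor_msg *m)
    {
        return fwrite(m, sizeof *m, 1, wire) == 1 ? 0 : -1;
    }

    int recv_msg(FILE *wire, struct sensor_msg *m)
    {
        return fread(m, sizeof *m, 1, wire) == 1 ? 0 : -1;
    }

Everything in that sketch is fixed-width; strings, lists and optional fields have no natural home in it.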
Please don't do this. It may sound reasonable at first but 'simple structures' don't define strings, variable length ints nor any kind of list type.
Eventually you'll need some of that and your protocol will become a mess of implicit padding and implicit stride rules, where every list is handled in a different way.
Not to mention all the deserialization security holes you are going to have.
If you don't believe me take a look at X11.
> you just need to get used to it, then you can read and write the bytes in a hexdump as easily as you can English. In fact I'd prefer working with hexdumps to XML. But unfortunately there's now a whole generation of developers who can't even count to 2 in binary and don't know what a hex editor is...
Man, when I read this, all I can think is "real programmers don't eat quiche." [1]
We will continue to see more abstraction in our tools. By and large, the changes we've seen so far have made us more productive as programmers. You should try quiche sometime. You might like it.
So long as you're willing to document exactly how your compiler packs those structs into memory, and your bytestream, go for it. But don't come crying to me when you upgrade compiler versions or switch compilers or add fields to your struct and the rules change on you. I'm going to hold you to your earlier requirements.
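In practice, "documenting how the compiler packs those structs" usually ends up looking something like this (a sketch with made-up fields; C11 for static_assert):

    #include <stdint.h>
    #include <stddef.h>
    #include <assert.h>

    struct wire_record {
        uint16_t version;
        uint16_t flags;
        uint32_t length;
        uint64_t id;
    };

    /* Freeze the layout: if a compiler upgrade, a new field, or a
       different ABI changes any size or offset, the build fails
       instead of the wire format silently changing. */
    static_assert(sizeof(struct wire_record) == 16, "layout changed");
    static_assert(offsetof(struct wire_record, flags)  == 2, "layout changed");
    static_assert(offsetof(struct wire_record, length) == 4, "layout changed");
    static_assert(offsetof(struct wire_record, id)     == 8, "layout changed");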
I mean hell, why try and abstract out anything. We should go back to programming in binary, or at most assembly. We have a whole generation of developers who can't do either
I'm not saying all abstractions are bad. The problem is adding abstractions when they're not actually needed, which increases complexity instead of reducing it.
I can see "human readable" for system config files since that can save your bacon in an emergency (editable in a text editor), but I'm really at a loss when we are talking about a serialization protocol or app config files. I want efficiency and minimal parsing code, which makes it problematic to throw in human concerns.
I can probably have a program running to edit the app config file, but if I cannot edit the system config file in a text editor (particularly if it gets bad enough I'm booting in single user mode) I am really hosed.
Until you actually do need the flexibility and then you are hosed. People who design protocols with some degree of flexibility have usually learned this lesson the hard way. To quote Kennedy quoting Chesterton: don't take down a fence until you know why it was put up.
It sounds reasonable at first. But I'm very happy that we have a lot of stuff supporting very rare edge cases. Maybe it's just bad luck, but at work I often run into edge cases and I'm very happy if it is at least partly supported.
Many languages in use today don't even have a (first class) concept of unions or structures, at least not in the sense that you are talking about. If you're talking about "bloat" in those contexts, you might want to start somewhere else...
>"Java monkeys eventually noticed how slow XML was between garbage collects and wrote the slightly less shitty but still completely missing the point Avro."
I would like to know why the author feels that Avro misses the point. Can anyone hazard a guess?
and similar for:
>"Oh yeah, I do like Messagepack; it’s pretty cool."
It would be interesting to hear why they (or anyone else, for that matter) consider Messagepack a worthwhile contribution to the serialization tool shed but Avro is not.
Author doesn't know history. XDR pre-dates ASN.1 by a long while. Then there was NDR. ASN.1 is an utter horror that crawled out from under the festering corpse of the OSI stack. Many, many recent CVEs are bugs in ASN.1 parsers. Look up DER/BER, but they are only some of the shitty ways ASN.1 has of creating enough complexity to hide CVE bugs in.
"human readable JSON like format"
"JSON style human semi-readable form"
> accidentally mentions JSON twice in the list
JSON is the most widely used serialization protocol in web application development today. The article's repeated mention of it is no accident at all: it's considered best practice across the industry. Frankly, I don't know why we are still talking about this.
The "semi" in semi-readable is no mistake either. The lack of comments is a real drawback in JSON for human readability. It's a major reason why YAML is used as a substitute in complex cases (an AWS CloudFormation template might be a good example; they've all gone YAML).
Similarly, JSON still has ambiguity around encoding (which UTF encoding is used). Dates have the same problem. Even integers are limited by JavaScript's 53-bit integer precision (all numbers in JavaScript are doubles, so you're stuck with the mantissa limit). This is a huge pain if you're dealing with 64-bit values.
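The 53-bit point is easy to demonstrate outside of JavaScript as well; anything that round-trips a 64-bit value through a double (which is what a stock JSON-in-JS pipeline effectively does) silently drops the low bits. A small sketch:

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        uint64_t id = 9007199254740993ULL;  /* 2^53 + 1 */
        double   d  = (double)id;           /* what a JS "number" holds */

        /* Prints: lost precision: 9007199254740992 -- the low bit is gone. */
        if ((uint64_t)d != id)
            printf("lost precision: %llu\n", (unsigned long long)d);
        return 0;
    }

This is why many JSON APIs end up shipping 64-bit IDs as strings.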
UTF-8 and JSON is a good place to start for Web Development, but there is no way it's the final or only answer.
There are three main design trade-offs in the space of serialization protocols (even ignoring RPC):
1. Do you want an efficient format (i.e. binary) or a format that is easily readable/writable by humans (i.e. text)?
2. Do you want a self-describing format (the structure travels with the data) or one that relies on an out-of-band schema, to save space on transmission? And how much?
3. How do you deal with mixed text data? Do you primarily want to store strings and escape any structured data within them (a la XML), or do you primarily want to store structured data and force all text to be escaped within it (a la JSON)?
Personally, I like CBOR, which interestingly wasn't mentioned.
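To put some bytes behind tradeoff 1, here is the same trivial map as JSON text and as CBOR (byte values per RFC 7049; just an illustration):

    /* {"a":1} as UTF-8 JSON text: 7 bytes, readable in any editor. */
    static const unsigned char as_json[] = { '{', '"', 'a', '"', ':', '1', '}' };

    /* The same map as CBOR: 4 bytes, needs a decoder or a trained eye.
       0xA1 = map of 1 pair, 0x61 = text string of length 1,
       0x61 = 'a', 0x01 = unsigned integer 1. */
    static const unsigned char as_cbor[] = { 0xA1, 0x61, 0x61, 0x01 };

The gap obviously grows with longer field names and numeric payloads, which is where binary formats earn their keep.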
The big problem is that rpcgen is hairy and not reeeeeeally portable (in my experience, if there's a feature you want, it's not supported on your Unix).
XDR is nice, though, apart from being big-endian and not having widely-supported 64-bit integers. It's a pity it's unfashionable.
I think I re-implemented rpcgen a couple of times in Ye Olde Days, so it can't be hard. Then DCE came by and then I lost track of serialization protocols that are essentially rehashes of SunRPC/XDR. So yeah, I've been thinking along the same lines as Scott - why?
These days, I've mostly thrown in the towel and just plunk JSON in everywhere. At least it's hip, and if I ever find a piece of code where serialization is the bottleneck, it will be early enough to start pushing XDR (not ASN.1 - I once tried to write a parser for _that_ and ended up with some extra grey hairs ;-)).
The only problem with this kind of rant is the assumption that people actually talk to each other. But why would they?
When was the last time you talked to people working on another software stack? Cooperation between different tribes needs to be enforced by strong leadership or a Big Need like imminent extinction of the tribe. As long as that doesn't exist and the whole ecosystem keeps growing, you can just sit there and watch people building the next silo and the next instead of getting to a higher step in evolution.
And it's actually the reasonable thing to do. I mean, would you rather have a minuscule share of a cake others baked, or would you rather have your own cake? When both are about the same effort, I'd rather have my own cake, even if I have to define a new serialization protocol to store it.
There are some really good reasons to pick an unusual serialization protocol, and even sometimes reasons to invent your own. (Embedded systems, limited environments, licensing restrictions, etc.) Generally though, you should use something the rest of your development team / community is familiar with. Not because this is efficient in terms of resource usage on the machine, but because this is efficient in terms of teaching your other developers how your serialization protocol works.
JSON may be everywhere, and it's tempting to look at its flaws and think, "we can do better" but it also has the great benefit of having decent serialization libraries already written in the vast majority of programming languages. That's one heck of a feature.
Along a similar vein, please stop writing external DSLs. Especially in the DevOps ecosystem. I'm really tired of learning yet another syntax for (bash + ssh).
This is a hairier problem than you make it out to be. Eventually, you end up needing to describe the state of a system based on input values you may not be able to control. Arbitrarily complex dependency graphs add another layer of complication. Being able to construct simple data structures can suffice for almost all use cases, which is why YAML is so popular. Eventually, though, you end up needing to programmatically generate those trees of structures and oops now you're Turing-complete. If the goal is to decouple describing the state and executing the actions leading to convergence, there's no good solution to bridge the gap between the two, so you end up writing a DSL or creating an unholy marriage of a data description language and an imperative templating language (YAML+Jinja2), or just doing it all in BASH and giving up on the idea of clean separation.
It's not hairy at all. All these "orchestration" things written in some weird YAML format end up doing the same thing. Re-inventing their own syntax for modules. Re-inventing their own querying language. Re-inventing idempotent building blocks and then forcing you to compose it all in some YAML format.
I'm tired of this state of affairs. All of the above can be done in real programming languages, with real syntax. There is no need for yet another external DSL when an embedded DSL in the form of a library will suffice.
But at the same time there is little chance that, say, Puppet, Ansible, Salt or Chef would build their newer/better system by emulating some other existing language/DSL.
For example, do you see Chef taking off and becoming popular if it had just implemented Puppet's language? Or would Salt have any appeal if it just had Ruby syntax like Chef does?
It just won't happen easily. One reason is that the API surface and the DSL are so varied that chances are significant design mistakes were made in a previous version. And one of the reasons people want the new shiny is that it fixes some of those fundamental design issues. Copying and emulating an existing thing means sticking with broken fundamental design decisions.
The irony, of course, is that the new shiny also has its own fundamental design flaws, and so now there will be a new group of people implementing the newer and shinier thing. (With obligatory articles on HN on "Why we switched from X to Y".)
Programming languages and their libraries come with their idiomatic conventions of API design, calling conventions, data structures, etc, which tend to translate awkwardly to a "foreign" programming language.
The solution is usually one of two choices: pick a lowest-common-denominator programming language and implement everything in it such that all higher-level wrappers are only cosmetic shims; or, make a DSL that has clear and defined mapping to language-specific constructs and conventions in each programming language.
The former is the coder's solution, the latter is the designer's solution.
But which programming language? Some like Python, some like Go, some like Ruby. Whatever one chooses, it will alienate a large group of people who could otherwise use the product. It's not black and white.
In this case functional ones come out as the obvious choice for declarative specs.
The trick for adoption, I imagine, is designing a DSL in any LISP of one's choice that defines systems as s-expressions... and just turning it into its equivalent XML instead.
Once the project inevitably becomes ubiquitous in enterprise IT, you make a blog post thanking everyone for being part of the social experiment, but the DSL is back to being LISP again, with an optional per-processor pricing model a la SQL Server for those companies wishing to continue using XML instead of passing their specs through an automated converter.
On an unrelated note, I'm looking for a cofounder with a FP background...
But that's the point. Why X? Maybe I don't like X. Maybe s-expressions, paredit and all that lisp stuff makes me upset. XML isn't a programming language either, we're going the DSL route again.
Maybe I like, I don't know, cobol. What if I feel happy with my non-DSL DSL being cobol? Would people like it? Or maybe, really, the dedicated DSL is a better approach? To give people, at least, some handlebars, a lowest common denominator, if you wish.
The point is, no matter what language is selected, someone is always going to be unhappy about the choice. Look at Chef, great technology, IMO. Gives all the power of Ruby. One can do totally whatever they want. Yet people were not happy with it. Alternatives came out, people started reinventing stuff all over again. But even with those, one can use whatever language they want to automate these tools even further. All these DSLs are just helpers, a starting point, make the initial adoption easy.
I think people are looking at this aspect from a wrong perspective. "Is my time worth learning this new DSL / am I compensated enough to make it worth learning". It does not matter, as long as it all calculates.
My idea for a saner more lispy ansible would be something like:
1) You define your configuration in (common lisp) s-expr's.
2) The configuration management system is a lisp program which reads the configuration defined in (1), generates bash code, uses ssh to contact the nodes to be configured and runs said bash code.
That would retain one of the big advantages of ansible, namely you don't need to have some agent running on each node, and you can leverage ssh for security/cert handling etc. And you don't need a lisp environment installed on the nodes either.
Sysadmins and SRE types typically work close to the metal, where C is the language of choice (i.e. how the kernel and libraries are generally programmed). We don't typically grok FP; we understand imperative programming. Python is close enough not to disorient us, but LISP and its derivatives are pretty far out there in left field.
Or, you know, they could just write a library in a real programming language instead of forcing people to write YAML. I don't quite get the fascination with YAML. Give me actual fucking syntax instead of some bastardized serialization format that is badly trying to ape lisp.
The issue is that as soon as you introduce a scripting language into a solution, people will write monstrosities with it. A configuration format guarantees that configuration will stay simple and "readable". I agree that LISP should be used more often, though.
I have now seen enough deployments of puppet, salt, ansible, etc. to last me a lifetime and then some. Trust me. People have managed to turn configuration formats into unreadable monstrosities as well. At least with a real programming language I would have sensible debugging capabilities and maybe some refactoring tools to help but with a half-baked YAML programming language I have no such recourse.
> superset of this with a JSON style human semi-readable form, and an optional self-description field, and you’ve solved all possible serialization problems
As a general rule, when you find that many people have come up with different solutions to a similar problem, it's a good bet that there's actually not a single good solution available.
"Facebook couldn’t possibly use something written at Google, so they built “Thrift,” which hardly lives up to its name, but at least has a less shitty version of RPC in it."
Contrasting XDR with Protocol Buffers is enlightening as it clearly demonstrates some differences in design goals and where tradeoffs are made. Feel free to correct my interpretations as I may have missed something!
1. XDR appears to have no equivalent of a Protocol Buffer field number
=> the format is not self describing. That is, one must have a schema to properly interpret the data
2. XDR appears to encode lengths as fixed width based on the block size
=> faster to read/write but larger on the wire than using a varint encoding
3. XDR's string data type is defined as ASCII
=> it does not support modern Unicode outside of the variable-length opaque type
1 would seem to present difficulty for modern distributed systems as one cannot control the release process of every distinct binary in the ecosystem to ensure that they are all schema equivalent at the same time. This can be remediated by propagating the schema as a header to the underlying data for consumption on the other side, but that bloats the wire format. Header compression could help with this but may be problematic for systems with severely constrained networks that constantly reestablish new connections (ex. mobile).
1 also has implications for data persistence. One cannot ever remove or reorder an XDR struct member else they will incorrectly parse data that was written in the older format. This is in contrast with Protocol Buffers, where one can remove or reorder message members whenever they'd like, as long as they take care not to reuse a tag number (and the newer `reserved` feature can help with that).
2 is just a performance tradeoff: binary size or (de|en)code performance?
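To make 1 and 2 concrete, here is the integer 150 on the wire both ways (the tagged form is the standard example from the Protocol Buffers encoding documentation; the XDR form follows RFC 1014's fixed four-byte unit):

    /* Protocol Buffers: field number 1, wire type 0 (varint), value 150.
       0x08 = (field 1 << 3) | 0, then 150 as a varint = 0x96 0x01.
       3 bytes, and the field number travels with the data. */
    static const unsigned char pb_field1_150[] = { 0x08, 0x96, 0x01 };

    /* XDR: an int is always a four-byte big-endian unit, no tag at all.
       Only its position in the struct says what it means. */
    static const unsigned char xdr_int_150[]   = { 0x00, 0x00, 0x00, 0x96 };

A reader that doesn't recognize field 1 can still skip or preserve the protobuf bytes; the XDR bytes are meaningless without the schema, which is points 1 and 2 in miniature.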
3 has implications for memory constrained systems. Ex. on Android we eagerly parse string fields to avoid doubling the allocation overhead (first as raw bytes, then as a String object). If we required all string datatypes in Protocol Buffers to be defined as bytes fields (the variable-length opaque data type equivalent), we wouldn't be able to provide this optimization.
Overall, XDR looks like a good fit for inter-process communication in a homogeneous environment. Protocol Buffers looks like a good fit for cross-language communication across heterogeneous and unversioned environments. Directly comparing the two, XDR is much more verbose on the wire (particularly if we mitigate the versioning issues by serializing the schema in a header), whereas it's likely significantly faster to (en|de)code; i.e. there's a tradeoff for networking/storage costs vs. CPU performance.
Scott makes a bunch of provocative declarations in his post but I think many of them betray a lack of background to appropriately understand the tradeoffs involved. As illustrated above, Protocol Buffers makes a bunch of design affordances for compactness on the wire which XDR does not accommodate. He also believes Google's RPC system to be "unarguably shitty" even though it has never been open sourced due to dependency issues (what is open sourced as part of Protocol Buffers is a shim, gRPC is the future here). His impression of why Facebook built Thrift is similarly misinformed as Protocol Buffers was not open source when Thrift was written.
TL;DR. XDR was designed in the 1980s. This is a major factor in Decisions 2 and 3.
RFC1832 has a rationale for decision 2 towards the end of the document. Here is an excerpt.
"(4) Why is the XDR unit four bytes wide?
There is a tradeoff in choosing the XDR unit size. Choosing a small
size such as two makes the encoded data small, but causes alignment
problems for machines that aren't aligned on these boundaries. A
large size such as eight means the data will be aligned on virtually
every machine, but causes the encoded data to grow too big. We chose
four as a compromise. Four is big enough to support most
architectures efficiently, except for rare machines such as the
eight-byte aligned Cray*. Four is also small enough to keep the
encoded data restricted to a reasonable size."
Most RISC architectures of the time did not support unaligned memory accesses.
Decision 3 is a matter of timing. RFC1014 (https://tools.ietf.org/html/rfc1014) was released in 1987. Unicode was still a work in progress.
Edited to fix formatting of the excerpt and a typo.
Thanks for pointing these out! This sort of archaeology is always interesting. I tend to have to do it for Protocol Buffers too to understand various design decisions.
One big influence was HTTP. You really need to consider HTTP as part of the serialization story. HTTP led to XML over HTTP and then JSON over HTTP. Then came HTTP/2, and then gRPC.
Would it be better if we still sent HTML over ASN.1? I think the state of things would be much worse.
At the end, it sounds very much like he's suggesting adding another protocol based on XDR, with just a few reasonable changes. It's like the article is a living example of the classic xkcd joke.
Sad you got downvoted. That's pretty much what the author did -- list a bunch of serialization protocols, say that the ones developed in the 80's were good enough, and then recommend we all use MessagePack.
That isn't accurate: he states a preference for XDR, which is ancient, and says he "likes" MessagePack. He didn't explicitly advocate it.
I'm not blaming anybody for skimming the post, which was a pretty typical blog rant; if it were from certain other people it would clearly be clickbait, but this seemed like genuine ranting at least.
It was actually an email to a colleague who is a little too enthusiastic about serialization protocols. I thought it was funny rather than anything resembling advice.
I'll probably use messagepack instead of json in an upcoming project for performance reasons, and because it works better than the json thing in J.