Fallacies of Distributed Computing Explained (2006) [pdf]

gwbas1c · on July 11, 2018

I think we need to add a fallacy about serialization and protocol changes, but I'm not sure how to summarize it in a simple sentence.

Specifically, "magic" serializers often have problems interoperating across different languages or across different versions of a protocol.

I've come across serious problems where:

- An application's deserializer can't handle a new value included in a new version of a protocol, and thus crashes.

- A server needs to forward part of a serialized object unmodified to another server. It strips out unrecognized values, thus requiring that we upgrade all three pieces of software instead of only two.

- A server's serializer is so rigid that we can't add XML tags or attributes without lots of involvement from the team that works on the server, even though there's no real reason for them to be involved in all the details.
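
One way to avoid the first two problems is a "tolerant reader" that neither crashes on unknown values nor strips them when forwarding. Below is a minimal sketch in Python, assuming a JSON payload. The field names (`id`, `status`, `priority`) and the helpers `parse_tolerant` and `forward` are hypothetical, made up for illustration rather than taken from any particular serializer.

```python
import json

# Hypothetical: the only fields this intermediate server understands.
KNOWN_FIELDS = {"id", "status"}

def parse_tolerant(raw: str):
    """Split a JSON object into known fields and unrecognized extras."""
    data = json.loads(raw)
    known = {k: v for k, v in data.items() if k in KNOWN_FIELDS}
    extras = {k: v for k, v in data.items() if k not in KNOWN_FIELDS}
    return known, extras

def forward(known: dict, extras: dict) -> str:
    """Re-serialize for the next hop, passing unrecognized fields through untouched."""
    return json.dumps({**extras, **known}, sort_keys=True)

# "priority" stands in for a value added in a newer protocol version.
incoming = '{"id": 7, "status": "ok", "priority": "high"}'
known, extras = parse_tolerant(incoming)
known["status"] = "forwarded"      # this server only touches what it understands
outgoing = forward(known, extras)  # "priority" survives the hop
```

With this shape, only the sender and the final receiver need to know about the new field; the server in the middle can stay on the old version.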

EdSharkey · on July 11, 2018

I share your frustrations.

I am enamored with serialization libraries like Protocol Buffers that schematize data in its unmarshaled form and strip the symbols from the data in its marshaled form. By representing properties of the data with codes, one is free to add or rename (but not remove) properties in the schema over time. Different versions of the schema can successfully marshal and unmarshal the same data as long as the codes (and the semantics of what the codes represent) never change. It's a very forwards/backwards-compatible way of working with data, as long as you can handle managing semantics over time.
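
To make the idea concrete, here is a toy sketch in Python. It is not the Protocol Buffers wire format or API; it only imitates the principle of putting numeric codes on the wire instead of property names. The schemas, field names, and the `marshal`/`unmarshal` helpers are all assumptions made up for this example.

```python
import json

# Two hypothetical schema versions mapping property names to stable numeric codes.
SCHEMA_V1 = {"name": 1, "email": 2}
SCHEMA_V2 = {"full_name": 1, "email": 2, "phone": 3}  # code 1 renamed, code 3 added

def marshal(record: dict, schema: dict) -> str:
    """Encode a record keyed by numeric code, dropping the property names."""
    return json.dumps({str(schema[name]): value for name, value in record.items()})

def unmarshal(wire: str, schema: dict) -> dict:
    """Decode using this side's schema; codes it does not know are skipped."""
    by_code = {code: name for name, code in schema.items()}
    decoded = {}
    for code, value in json.loads(wire).items():
        name = by_code.get(int(code))
        if name is not None:
            decoded[name] = value
    return decoded

wire_v1 = marshal({"name": "Ada", "email": "ada@example.com"}, SCHEMA_V1)
wire_v2 = marshal(
    {"full_name": "Ada", "email": "ada@example.com", "phone": "555-0100"},
    SCHEMA_V2,
)

new_reads_old = unmarshal(wire_v1, SCHEMA_V2)  # rename is invisible on the wire
old_reads_new = unmarshal(wire_v2, SCHEMA_V1)  # unknown code 3 is ignored, no crash
```

The rename works because only code 1 travels over the wire. It would break if code 1 were ever reused for something with a different meaning, which is the "managing semantics over time" part.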

By contrast, schema languages like JSON Schema or XML Schema may be more expressive about target language constraints than something like Protocol Buffers, but they often suffer from the brittleness of the "messages I consider correct must be considered correct by you in precisely the same way"-problem when their messages move over the wire.

mmt · on July 12, 2018

I don't recall seeing this explanation before, though I'm pretty sure I've read a version of it.

What I noticed immediately was something I felt was missing from the explanation of fallacy #1 (The network is reliable), something I predicted decades ago when the network gear salesmen first came around to try to convince us to replace our aging concentrators (multiport hubs) with fancy new switches: losing the backpressure mechanism of collisions would lead to bandwidth-related failures.

The article's explanation mentions the reliability of modern switches but dismisses it fairly shallowly by referencing things like pulled power cords, and then spends most of the text talking about WAN reliability.

What I've seen in practice, including with (maybe even especially with) distributed computing environments, is the scenario I proposed to the salesperson way back when: What happens when two servers each send data out of their network ports at full line speed (100Mb/s) at the same time, for a total of 200Mb/s, to another server's network port that can only receive 100Mb/s?

Of course, once the switch's buffer fills up, half the packets get dropped, and it's not practically deterministic which half. In a distributed environment, where the number of sources is greater than 2 and the number of sinks is greater than 1, it gets even less so and often more intermittent.
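
The arithmetic can be sketched with a toy tick-based simulation in Python. Everything here is an assumption for illustration: one "tick" is the time to send one packet at line rate, `buffer_slots` is an arbitrary buffer size, and the `simulate` function is hypothetical. Unlike a real switch, this model is fully deterministic (the same source always loses once the buffer is full), so it only shows the drop rate, not which packets get dropped.

```python
from collections import deque

def simulate(ticks: int, sources: int = 2, buffer_slots: int = 100):
    """Model N senders at line rate feeding one egress port at the same rate."""
    queue = deque()
    sent = delivered = dropped = 0
    for _ in range(ticks):
        # Each source offers one packet per tick (full line rate).
        for src in range(sources):
            sent += 1
            if len(queue) < buffer_slots:
                queue.append(src)
            else:
                dropped += 1  # tail drop: buffer is full
        # The egress port drains at most one packet per tick (its line rate).
        if queue:
            queue.popleft()
            delivered += 1
    return sent, delivered, dropped

sent, delivered, dropped = simulate(ticks=10_000)
drop_fraction = dropped / sent  # approaches 1/2 once the buffer has filled
```

The buffer only delays the problem: with two senders it fills after about `buffer_slots` ticks, and from then on one of every two arriving packets is lost no matter how large the buffer is.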