The dangers of conditional consistency guarantees (dbmsmusings.blogspot.com)
32 points by greghn on June 30, 2023 | 7 comments



Great insight from Abadi, as always. To add to his point about guarantees and network partitions: I find that the way consistency is explained in documentation, or even in courses, can be somewhat misleading, or at least incomplete. If you read the Dynamo paper and take away from it a "classic" design of three replicas doing quorum reads and writes, you could easily convince yourself that the system will stay consistent and keep operating even after losing a replica.

What this simple approach doesn't cover is all the cases that aren't your perfectly well-behaved read and write operations. It doesn't cover the case where your client gets a timeout and has no idea how many replicas were written to. You could have one replica that has persisted the (failed) write, and read back an old copy from the other two. Then your next read could pick up the new version, giving you alternating views of this data that was supposed to be consistent.

With a database like Cassandra you also have to consider that, among whichever replicas respond, only the copy with the newest timestamp is returned. In that case you could read an old value, then the new one, then the old one again, all at quorum, when it was supposed to be consistent.
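To make the flip-flopping concrete, here is a toy model in Python. It is a deliberately simplified sketch, not Cassandra's actual read path, and it ignores read repair (which the reply below brings up): a write that reached only one of three replicas before the client timed out, and quorum reads that take the value with the newest timestamp from whichever two replicas answer.

```python
import itertools

# Toy model: three replicas, a "failed" write that landed on only one of them,
# and quorum reads (any 2 of 3) that return the value with the newest
# timestamp among the replicas that answered.
replicas = {
    "r1": ("new", 200),   # the timed-out write persisted here anyway
    "r2": ("old", 100),
    "r3": ("old", 100),
}

def quorum_read(responding):
    return max((replicas[r] for r in responding), key=lambda v: v[1])[0]

# Depending on which two replicas answer first, successive quorum reads can
# alternate between the new and old value.
for quorum in itertools.combinations(replicas, 2):
    print(quorum, "->", quorum_read(quorum))
```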

Failures are often messy, and they are the source of most of the complexity in reasoning about consistency in distributed systems. Many concepts (like network partitions) are also often misunderstood, or considered in a way that's too "theoretical". Nodes don't always shut down instantly. You likely won't see a clean network split that starts at a fixed point in time and is resolved in an instant. It'll be partial, asymmetric, with nodes coming in and out, etc. Good luck reasoning about these scenarios…


With Cassandra, if you wrote at quorum and the write succeeded, and you then read at quorum, you will get your value back. Where it gets hairy is when a write fails (e.g. only one replica got it), or when writes land so close together that clock resolution becomes a problem. It is definitely non-trivial to build systems on top of that, but then it's non-trivial to build systems that scale horizontally and geographically, are highly available, and can recover from failure, in the first place.

As the article implies, what tends to happen is that people don't understand the guarantees and build software that works, except when it doesn't. Then you build on top of that, and so on.


It's not possible to read a new value at quorum and then later read the old value at quorum in Cassandra. This is enforced by "read repair", which ensures the quorum agrees on the reply by propagating the newer value to any replica in the quorum that is missing it, before replying.
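A rough sketch of that idea, reusing the same toy setup (simplified; not Cassandra's actual implementation): before replying, the coordinator writes the newest value it saw back to any replica in the quorum that is missing it, so a later quorum read can't go back to the older value.

```python
# Same toy setup: the newer value only made it to r1.
replicas = {
    "r1": ("new", 200),
    "r2": ("old", 100),
    "r3": ("old", 100),
}

def quorum_read_with_repair(responding):
    value, ts = max((replicas[r] for r in responding), key=lambda v: v[1])
    for r in responding:
        if replicas[r][1] < ts:
            replicas[r] = (value, ts)   # read repair: bring stale replicas up to date
    return value

print(quorum_read_with_repair(("r1", "r2")))  # "new", and r2 now has it too
print(quorum_read_with_repair(("r2", "r3")))  # still "new": every quorum now overlaps a repaired replica
```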

That’s not to say failures in distributed systems aren’t messy; they definitely are.


> Nodes don't always shut down instantly. You likely won't see a clean network split that starts at a fixed point in time and is also resolved in an instant. It'll be partial, asymmetric, with nodes coming in and out, etc.

Excellent point, it's precisely this scenario that has caused the biggest issues in distributed systems I've been working with.


The systems I work with seem to choose the even worse "PA/EA", to use the article's terminology, i.e., consistency never, even during apparent normal operation.

We had a whole slew of bugs that got discovered because of this: we use a vendor for storing Docker images, and AFAICT empirically, their system does not obey read-your-writes. We already knew this about images, but what tripped us up the other day was that layers are also susceptible: when the manifest of a new image was uploaded, it referenced layers whose writes had already been acknowledged to the client, yet the write of the manifest apparently failed because a layer "didn't exist".

(The script that was uploading the images was also written in TypeScript, and Deno's API for executing child processes uses C-style "return the error" handling¹. The script forgot to check, so our CI run "passed", all while missing an image. A deployment then bumped the image versions to the new (non-existent) version, and that deploy failed. Thankfully k8s keeps running the old versions, but the number of Swiss-cheese holes everything lined up through was pretty astounding.)

So many systems fail read-your-writes, and it just makes them all the more annoying to deal with. Inevitably you are writing to a system because you want to do something with that write: I'm not persisting data into a black hole, never to be seen again, after all. But sanely sequencing the next action when you do not know whether the write is "good" yet is impossible, and it makes things brittle.

(We asked the vendor about this, and the response from support types / vendors is just so utterly disappointing. "Can't you just write a loop to check if the image appears to exist?" This is one of the many spots where CS theory matters in the actual engineering! No, no I cannot "just" write a "simple" loop to check, and the theory can tell you why that doesn't work. Not to mention the layers/manifest part means I need to re-write `docker push`…)
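For what it's worth, here is the shape of the problem with that suggested loop as a toy simulation (nothing registry-specific, just an assumed eventually consistent store): a poll that happens to observe the layer says nothing about the next read, which is the one the registry performs while validating the manifest.

```python
import random

# Toy eventually consistent store: each read hits a random replica, and only
# one of the two replicas has seen the layer so far.
replicas = [{"layer": True}, {}]

def read(key):
    return random.choice(replicas).get(key, False)

# "Just write a loop to check if the image appears to exist":
while not read("layer"):
    pass

# The poll above eventually "succeeded", but the registry's own read while
# validating the manifest PUT is a fresh read and can still miss the layer.
print("registry sees the layer during manifest PUT:", read("layer"))
```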

(¹ But child-process APIs are rife with this. Heck, even Rust's would bite you all the same. AFAIK, only Python exposes safe-ish APIs in this area?)
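(As a concrete contrast for the footnote: Python's subprocess module lets you opt into exception-style failure handling, so a failed push can't silently "pass" CI. The image name below is just a placeholder.)

```python
import subprocess

# check=True makes a non-zero exit raise subprocess.CalledProcessError instead
# of returning a status code that's easy to forget to inspect.
subprocess.run(["docker", "push", "registry.example.com/app:1.2.3"], check=True)
```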


Fun article, neat trick with lock fences. Distributed systems are cool, but easy to build wrong and impossible to fix after the fact.

I think multithreading is effectively a distributed system. Is GPU rendering one too? I've tried reducing Dolphin Emulator's latency with vsync on, and run into a wall trying to understand GPU frame fences, the relative timings of CPU operations and GPU polygon rendering on Wii vs. PC GPUs, and the occupied length of the swapchain.


2019 ...

Interesting read though. I've seen Hazelcast in the wild but hadn't really heard much about it.



