

Strong Consistency Models - cjbprime
http://aphyr.com/posts/313-strong-consistency-models

======
rdtsc
Very good post! Well all of Aphyr's "Call Me Maybe" series is great.

A lot of modern systems are distributed, and quite often they seem to be
designed and built without much thought given to these kinds of issues. In the
"Call Me Maybe" series, for example, he shows how a lot of popular NoSQL,
webscale, etc. databases really fall on their faces when network partitions
hit.

And there clearly is enough literature written on this, but it is scholarly
articles or research papers that don't necessarily get read by the architects
of most distributed systems. (Heck, I imagine a lot of distributed systems are
created in an ad-hoc way. Something like, "Hey Jim, we need to have a hot
spare fail-over machine in the US-EAST zone"... and now they have a
distributed system.)

Also, I am not sure how much this is taught in schools. When I was in school,
distributed systems meant MPI, mesh topologies and so on. HPC-type things.
Nobody said absolutely anything about network partitions.

Speaking of network partitions: the question I have is, can one make the
chance of partitions happening low enough to be discounted? Or does it not
even make sense to talk about that? For example, imagine the classic scenario
of a single data center. Switches do fail sometimes. What if there is a
redundant network, running on a separate physical interface (say eth1, not
the default eth0)? Now partitions can be detected, and the simultaneous
failure of two network switches has to happen. This is not even a hypothetical
scenario. There is a Japanese NoSQL database out there that probably nobody
has heard of -- Hibari. It features such a "partition detector" application.

[http://hibari.github.io/hibari-doc/hibari-sysadmin-guide.en.html#partition-detector](http://hibari.github.io/hibari-doc/hibari-sysadmin-guide.en.html#partition-detector)

And the code:

[https://github.com/hibari/partition-detector](https://github.com/hibari/partition-detector)

I saw that a while back and just now remembered it. So is that a practical
solution, or are there fundamental theoretical issues with this approach?
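For the curious, the core idea of such a detector can be sketched in a few
lines (this is my own illustrative Python, not Hibari's actual Erlang
implementation; the function and constant names are made up):

```python
import time

HEARTBEAT_TIMEOUT = 5.0  # seconds without a heartbeat before a link is suspect

def classify_peer(last_hb_eth0, last_hb_eth1, now=None):
    """Classify a peer's state from heartbeat times seen on two networks.

    If heartbeats stopped on one interface but continue on the other, the
    problem is almost certainly a network (switch/cable) failure, not a
    dead peer -- so we should NOT trigger failover.
    """
    now = time.monotonic() if now is None else now
    eth0_ok = (now - last_hb_eth0) < HEARTBEAT_TIMEOUT
    eth1_ok = (now - last_hb_eth1) < HEARTBEAT_TIMEOUT
    if eth0_ok and eth1_ok:
        return "healthy"
    if eth0_ok or eth1_ok:
        return "partition-suspected"   # one network failed; peer still alive
    return "peer-down-suspected"       # both silent: dead peer OR double failure
```

The interesting case is the middle one: the second interface lets the node
distinguish "the network to my peer died" from "my peer died", which a
single-network design cannot do.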

~~~
ryanobjc
On the partition side of things... not really. Usually you end up with
coordinated failures of various types.

For example, let's say you run redundant networks. Are the physical paths
diverse? What happens if someone trips on BOTH network cables? What happens if
the redundant network is on the SAME switch hardware? What happens if, yes,
you use two network switches, but they are hooked up to the same UPS? Or both
UPSs are on the same circuit breaker?

The classic reference on this is from Google:
[http://www.catonmat.net/blog/wp-content/uploads/2008/11/thatcouldnthappentous.pdf](http://www.catonmat.net/blog/wp-content/uploads/2008/11/thatcouldnthappentous.pdf)

There are a number of theoretical issues with these approaches, starting with
the FLP result.

Also realistically, eventually all practical solutions fail, often due to user
error.

~~~
rdtsc
OK, let's say mostly diverse physical paths, both tied up nicely into bundles,
different switch hardware, different UPSs, different power grids (some nice
data centers will even have that!).

At this point other kinds of odd failures are more likely than the failure of
both network paths at the same time; let's say meteor strikes, or a
disgruntled employee wiping out backups and data... Can one even begin then
to consider a CA system? And if so, what would it look like?

------
Morgawr
Nice article, all of this was covered in my Concurrency and Multithreading
course in university and we used this book:
[https://www.elsevier.com/books/the-art-of-multiprocessor-programming-revised-reprint/herlihy/978-0-12-397337-5](https://www.elsevier.com/books/the-art-of-multiprocessor-programming-revised-reprint/herlihy/978-0-12-397337-5)

It's a really great book, I suggest people read it if they are interested in
this type of stuff (it focuses on concurrent, not distributed systems, so it's
not quite 100% the same topic).

------
mjb
This is a very interesting and useful post. I like how Aphyr started with
linearizability, which, despite the unfamiliar name, is much easier to reason
about (for most) than serializability and most other models. This is well
worth reading for anybody interested in distributed databases.

A few nit-picks:

- The notion of linearizability he uses seems to be stronger than the original
definition from Herlihy and Wing
([http://cs.brown.edu/~mph/HerlihyW90/p463-herlihy.pdf](http://cs.brown.edu/~mph/HerlihyW90/p463-herlihy.pdf)).
Compare Theorem 1 in that paper to the hints made about global consistency in
the post. Still, this is very nit-picky, and the main points of the
definition are captured.

- Serializability, as typically implemented in most of the RDBMSs in common
use, is both weaker (as Aphyr points out) and stronger than the notion used in
the literature. Weaker is covered well in Aphyr's other posts. Most databases
are also stronger in some sense, and closer to strong serializability, because
they allow clients to observe values directly in transaction scope, which
constrains their ability to reorder transactions. There are many legal
reorderings of transactions that databases could do under the definition of
serializability that they don't do.
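To illustrate that last point, here's a toy Python sketch (my own, with
hypothetical names) showing how a value a client observes inside a
transaction rules out serial orders that bare serializability would otherwise
allow:

```python
from itertools import permutations

# Two toy transactions over a single key "x" (initially 0):
#   T1 writes x = 1; T2 reads x and reports what it saw.
def t1(db, obs):
    db["x"] = 1

def t2(db, obs):
    obs.append(db["x"])

def serial_orders_consistent_with(observed_read):
    """Return the serial orders of {T1, T2} that reproduce the value the
    client actually saw inside T2."""
    legal = []
    for order in permutations([("T1", t1), ("T2", t2)]):
        db, obs = {"x": 0}, []
        for _name, txn in order:
            txn(db, obs)
        if obs == [observed_read]:
            legal.append([name for name, _ in order])
    return legal
```

With no observation, both serial orders are legal under serializability; once
the client has seen x == 1 inside T2, only the T1-before-T2 order remains,
which is exactly the extra constraint direct in-transaction reads impose.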

------
ryanobjc
Great write up!

I wish there were a little bit more color around what "availability" really
means. Yes, some data stores offer higher availability, but that doesn't
mean they are immune to node failure. Even for a "CP" system, latencies can go
up, and with enough node failures, partial unavailability and data loss can
occur.

And few systems survive the loss of a rack without some issues, let alone a
partition that splits the network in half!

In theory, a strongly consistent system could have a failover window so small
that it is, from a user's point of view, always available. It would have a
very easy-to-understand programming model. But such a system doesn't really
exist, open source wise, at this time. So we just don't know.

~~~
mjb
> I wish there was a little bit more color around what "availability" really
> means.

The definition used by Gilbert and Lynch in their CAP proof requires
availability for updates at all nodes. Other proofs use availability for
updates (or consistent reads) at a majority of nodes.

> Yes some data stores are offering higher availability, but that doesn't mean
> they are immune to node failure.

All theoretical models like this are susceptible to real-world
implementation limitations. Still, it's very useful to think about these
theoretical models, because they tell us what we could achieve with an ideal
implementation. Knowing that allows us to choose implementation techniques and
approaches suitable to the problem, and not to waste time trying to
implement the impossible. Just because something is possible doesn't mean that
it can be, or has been, practically implemented. If something is impossible,
though, we know we shouldn't spend time trying to do it.

> Even for a "CP" system, latencies can go up, and with enough node failures,
> partial unavailability and data loss can occur.

Of course. These models are actually pretty silent on durability, which is
frequently seen as a separate matter, or a limiting case of availability,
depending on the area of research. Durability is really important in
practice, but this isn't the area of research that addresses it.

Partial unavailability because of theoretical limitations and partial
unavailability because of implementation limitations are different things. We
can improve on the latter, and insist our software vendors do the same, but
the former we just need to work around.

> In theory, a strongly consistent system could have a failover that is so
> small that it is, from a user pov, always available.

The definition of consistency and availability doesn't allow you to "fail
over". Completely ignoring availability, there are two cases of failover: one
where the system still appears to be consistent (using one of the strong
consistency models described here), and one where the system becomes
eventually consistent. The various proofs (such as Gilbert and Lynch's CAP
proof) imply that you can't "fail over" and keep consistency in the case where
some nodes are uncontactable. The definition of "some nodes" depends on the
exact proof, but there is no way to fail over into a minority partition and
keep consistency. It's not possible.

On the other hand, there are loads of practical and useful ways to fail over
into a minority partition that still gives useful eventual consistency
semantics. It all depends what you need.

~~~
ryanobjc
Yes, there is the notion of what CAP requires, but we cannot cling to the
strict definitions of these papers. They must be translated into layperson's
terms. If you don't, how else can you communicate with your stakeholders? How
can you communicate with your users?

Personally, I reject the notion that this is outside the scope of our jobs.
Think of the ability and power of a so-called 'renaissance person' who can do
anything.

> The definition of consistency and availability doesn't allow you to "fail
> over". Completely ignoring availability, there are two cases of failover:
> one where the system still appears to be consistent (using one of the strong
> consistency models described here), and one where the system becomes
> eventually consistent. The various proofs (such as Gilbert and Lynch's CAP
> proof) imply that you can't "fail over" and keep consistency in the case
> where some nodes are uncontactable. The definition of "some nodes" depends
> on the exact proof, but there is no way to fail over into a minority
> partition and keep consistency. It's not possible.

Minority partition... maybe you could restate this paragraph in terms of
discoveries such as Paxos? Does Paxos not allow progress and data retrieval
during some failure scenarios? Yes, realistically, once enough nodes are lost,
things grind to a halt. But this is how Dynamo works as well. Once the number
of nodes available declines below the R or W factor, the algorithm stops
making progress.
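A rough sketch of that quorum arithmetic in Python (my own illustrative
function, not Dynamo's actual API):

```python
def quorum_properties(n, r, w):
    """Basic Dynamo-style quorum checks for N replicas with read quorum R
    and write quorum W."""
    return {
        # Every read quorum overlaps every write quorum iff R + W > N,
        # so reads are guaranteed to see the latest completed write.
        "strongly_consistent_reads": r + w > n,
        "max_failures_allowing_writes": n - w,
        "max_failures_allowing_reads": n - r,
    }
```

With N=3, R=2, W=2, reads overlap writes (2 + 2 > 3), but losing two of the
three nodes blocks both reads and writes; that's the "stops making progress"
point above, and a Paxos majority quorum hits the same wall once a majority
of acceptors is unreachable.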

~~~
nemothekid
> but we cannot cling to the strict definitions of these papers

This seems like an incredibly strange sentiment; I can't grasp what you mean.
In another field, it would be like saying "we cannot cling to the strict
definition of gravity." At the end of the day, proofs are proofs.

