

Errors in Database Systems, Eventual Consistency, and the CAP Theorem - dm_mongodb
http://cacm.acm.org/blogs/blog-cacm/83396-errors-in-database-systems-eventual-consistency-and-the-cap-theorem/fulltext

======
mmalone
I posted this comment to the ACM article, but they're pre-moderating and
apparently someone's shirking their responsibilities... so I'm reposting here.
Hope you guys don't mind:

Hey Michael,

Interesting read. I'm with you on most points. I do have a couple comments
though...

Failure mode 1 is an application error.

Failure mode 2 should result in a failed write. It shouldn't be too hard to
trap errors programmatically and handle them intelligently / not propagate
them. Of course the devil's in the details and hardware / interpreter / other
problems in components that are outside of the DB's control can make things
more difficult. These are the sorts of issues that tend to stick around until
a system is "battle tested" and run in a couple of large / high volume
operations.

Failure modes 3, 4, 5, 6 (partition in a local cluster) - this seems to be
where the meat of your argument is... but you totally gloss over your
solution. I'm not sure I believe that network partitions (even single node
failures) are easily survived by lots of algorithms... Or, more specifically,
I don't believe that they can be survived while maintaining consistency (in
the CAP sense, not the ACID sense). I threw together a super simplified
"proof" of why consistency is essentially impossible in this situation in a
recent talk; see http://www.slideshare.net/mmalone/scaling-gis-data-in-nonrelational-data-stores
(slides 16 through 20). What algorithms are there
to get around this? If a replica is partitioned you either can't replicate to
it and have to fail (unavailable) or you can't replicate to it and succeed
anyways (replica is inconsistent).
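
To make the dilemma concrete, here's a minimal sketch of a synchronous
replication write path (hypothetical names, not any particular system's
API). When a replica is unreachable you either raise (unavailable) or
skip it (inconsistent):

    # Hypothetical sketch of the write-path dilemma during a partition.
    class ReplicaUnreachable(Exception):
        pass

    def replicate_write(key, value, replicas, require_all=True):
        """Send a write to every replica; a partition forces a choice."""
        acked = []
        for replica in replicas:
            try:
                replica.write(key, value)
                acked.append(replica)
            except ReplicaUnreachable:
                if require_all:
                    # Choice 1: fail the write -- consistent but unavailable.
                    raise
                # Choice 2: carry on -- available, but this replica is stale.
        return acked  # replicas missing from this list now hold stale data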

I also don't buy the argument that partitions (LAN or WAN) are rare and
therefore we shouldn't worry about them. For a small operation this may be
true, but when you're doing a million operations a second then a one-in-a-
million failure scenario will happen every second.
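
As back-of-the-envelope arithmetic (illustrative numbers only):

    ops_per_second = 1_000_000   # assumed request rate
    failure_probability = 1e-6   # the "one in a million" scenario

    expected_failures = ops_per_second * failure_probability
    print(expected_failures)     # 1.0 -- roughly one failure every second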

Failure mode 7 will probably result in some data loss unless (as you mention)
you're willing to live with the latency of waiting for durable multi-
datacenter writes to occur. But having that option is definitely nice, and
that's a tradeoff that I'd like to be able to make on a per-write basis. I
may choose to accept that latency when I'm recording a large financial
transaction, for example. Another thought related to this issue - in a lot of
ways writing something to memory on multiple nodes is more "durable" than
writing it to disk on one. So you may be able to do multi-DC replicated writes
in memory with tolerable latency assuming your DCs are close enough that the
speed of light isn't limiting. That should get you durability up to the point
where the entire eastern seaboard disappears, at least.
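
A rough estimate of the speed-of-light point (the distance and fiber
speed here are assumptions for illustration):

    # Light travels at roughly 2/3 c in fiber, i.e. about 200 km/ms.
    distance_km = 400            # assumed separation between the two DCs
    fiber_km_per_ms = 200

    one_way_ms = distance_km / fiber_km_per_ms
    rtt_ms = 2 * one_way_ms
    print(one_way_ms, rtt_ms)    # 2.0 4.0 -- tolerable for in-memory writes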

Failure mode 8 is another core issue that I think you're glossing over. WAN
failures (particularly short-lived ones) can and do happen on a regular basis.
It's true that routing issues are typically resolved quickly, but it's another
law-of-large-numbers thing. Amazon AWS had an issue that took an entire data
center offline a while back, for example. Shit happens. In CAP terms this is
really the same thing as failure modes 3, 4, 5, 6, and 7, though. So the same
arguments apply. Re: your argument that only a small segment splits - what
happens when a read comes into the small split segment (maybe from a client in
the same datacenter)? If data has been updated on the larger segment it
couldn't have been replicated, so again you're either serving stale data or
your data store is unavailable.

Thanks for putting this together, it was an interesting read. Looking forward
to hearing more about some of these issues!

------
strlen
It's an interesting article, as fundamentally sound systems have indeed been
built that are "CA" within a single datacenter. BigTable is the premier example.

There are however several issues with this. First, eventual consistency means
eventual consistency of data on the replicas; it doesn't mean that a client
can't be presented with a consistent view. Dynamo-pattern systems use quorums
to provide this. Even when quorums aren't used (R + W ≤ N), at least with
Voldemort, the weak (no read-your-writes) form of eventual consistency only
surfaces in a failure condition: when the "coordinator" node (first in the
preference list) for the specific key fails after a write has happened but
before the next read is made.
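
The overlap argument behind quorums is compact enough to state in code (a
sketch of the rule, not any particular system's API):

    # With R + W > N every read set overlaps every write set, so a read
    # is guaranteed to hit at least one replica holding the latest write.
    def is_strict_quorum(n, r, w):
        return r + w > n

    print(is_strict_quorum(n=3, r=2, w=2))  # True  -- read-your-writes
    print(is_strict_quorum(n=3, r=1, w=1))  # False -- weak eventual consistency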

Second, a CA system means that the core switch (and there may be multiple
within a datacenter, especially in the case of startups leasing colocation
space or using a managed/"cloud" hosting provider) is now a single point of
failure. The way to build around this is by building an "AP" layer on top of
the "CA" layer, spanning multiple core switches. This is similar to Yahoo's
PNUTS and Google's Spanner. Both of these were multi-year projects, built by
companies with expertise in distributed computing, with very specific and
limited requirements (i.e., they were building these for themselves, not
selling them as a general-purpose solution). Which brings me to the next point:

"CA" consistency is much more difficult (that is, error prone) to implement
than "AP" consistency. Two phase commit is one way to do so, but doesn't
provide fault tolerance (it won't withstand the failure of the
coordinator/master node). Paxos is one way to do so, but a high performance
Paxos implementation requires leases and is still very tricky to implement.
Again, it took _years_ for Google to build, trouble shoot and perfect the
Chubby/GFS/BigTable stack; the first version did _not_ have a fault tolerant
master and the query model is still much simpler than SQL.
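
To illustrate the two-phase commit problem, here's a stripped-down sketch
with toy participants (all names made up):

    # Sketch of two-phase commit from the coordinator's side.
    class Participant:
        def prepare(self, txn):   # vote yes and hold locks for txn
            return True
        def commit(self, txn):
            pass
        def abort(self, txn):
            pass

    def two_phase_commit(participants, txn):
        # Phase 1: prepare -- every participant votes and holds its locks.
        if not all(p.prepare(txn) for p in participants):
            for p in participants:
                p.abort(txn)
            return False
        # If the coordinator dies HERE, participants are stuck: they voted
        # yes, hold locks, and can neither commit nor abort unilaterally.
        # Phase 2: commit.
        for p in participants:
            p.commit(txn)
        return True

    print(two_phase_commit([Participant(), Participant()], "t1"))  # True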

That's why I am skeptical when people claim to be able to deliver to market
(within months, not years) a commercial solution that provides strong
consistency, fault tolerance, horizontal scalability (without a hard upper
limit), supports multiple datacenters and _still_ allows execution of SQL
queries (even if without certain types of JOINs) with OLTP-suitable
performance. That's not to say it's logically impossible, it's just a very
bold claim to make.

------
justinsb
I think Stonebraker is right: CAP has been dramatically over-applied.

~~~
aristus
It's probably to do with differences in scale. At the LAN level, he's right:
partitioning is rare and C/A might make the most sense. At the WAN level,
partitioning is a fact of life.

There are at least two problems to solve on opposite ends of scale: how to get
many components inside one computer to cooperate without stepping all over
each other, and how to get many computers to cooperate without drowning in
coordination overhead. They may be special cases of a more general problem,
and one solution will work for all. Or perhaps we'll have one kind of
programming for the large and another for the small, just as the mechanics of
life are different inside and outside of the cell.

~~~
justinsb
Agreed, but he's addressed WAN partitioning very well. Quoting, with edits:
"[Consider] a
partition in a WAN network. There is enough redundancy engineered into today’s
WANs that a partition is quite rare ... the most likely WAN failure is to
separate a small portion of the network from the majority. In this case, the
majority can continue with straightforward algorithms, and only the small
portion must block. Hence, it seems unwise to give up consistency all the time
in exchange for availability of a small subset of the nodes in a fairly rare
scenario."

That last sentence is a very strong put-down of NoSQL; if it weren't published
on the ACM website, it would have a "zing!" at the end of it.

~~~
mmalone
Actually he's not really addressing it at all. What stats does he have that
show WAN partitions are rare? I find that short WAN partitions occur
regularly. And how does the small portion "block"? How does it even know that
it's the "small portion"? If a read comes into the "small portion" (maybe
because the client's in the same data center) what does the small portion do?
If it responds, the response may be inconsistent. If it doesn't, the data
store is unavailable (at least as far as that client is concerned).

~~~
cx01
Paxos (which you'll probably use to implement a strongly consistent
distributed system) allows nodes to reach consensus as long as a majority of
them are able to talk to each other. So if a partition includes a majority of
nodes, this partition will just continue working like before. The minority
partition will not be able to reach any decisions at all and will be
unavailable. Strictly speaking, the minority partition doesn't even know it's
a minority partition, because this is an undecidable problem in asynchronous
distributed systems. But the important point is: At most one partition will
continue making decisions so that consistency is guaranteed.
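
The majority rule itself is simple to state (a sketch of the rule only;
real Paxos needs ballots, acceptors, and usually leases on top of this):

    # Only a partition holding a strict majority of the full membership may
    # make decisions, so two partitions can never both make progress.
    def can_make_progress(reachable_nodes, cluster_size):
        return len(reachable_nodes) > cluster_size // 2

    print(can_make_progress({"a", "b"}, cluster_size=3))  # True  -- majority
    print(can_make_progress({"c"}, cluster_size=3))       # False -- blocks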

So you're right in that the minority partition will be unavailable, but I think
it's a worthwhile tradeoff.

~~~
mmalone
It _may_ be a worthwhile tradeoff. But if you're in three datacenters and one
of them splits you could have a third of your requests fail. Depending on what
you're doing, that may be unacceptable. I agree that it's a tradeoff. I
disagree with the people who are saying that emphasizing consistency is always
the right thing to do.

~~~
cx01
The problem is: if a node performs a write in an eventually consistent data
store, the write will not be visible immediately. Alternatively (in a strongly
consistent data store) the node could just retry the write until it succeeds.
From the perspective of a user that wouldn't really make much of a difference,
but in the latter case there would at least be no risk of having multiple
inconsistent versions of the same data.

~~~
mmalone
Your first point is not true. In the absence of partitions, all "eventually
consistent" data stores that I know of will give you strong consistency. The
eventual consistency bit only comes into play if a problem occurs. (It's
probably also worth noting that many RDBMS replication strategies won't give
you strong consistency at all - even in ideal circumstances.)

I suppose you could retry if a write fails (e.g., can't reach a quorum), but
you could theoretically end up retrying forever (and 10 seconds or so is
forever as far as an interactive user is concerned)... eventually you need to
either fail or write inconsistently. So we're just delaying the inevitable.
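
In code, a bounded retry loop still has to pick a side when its deadline
expires (quorum_write here is a hypothetical stand-in for whatever the
store provides):

    import time

    def write_with_deadline(quorum_write, key, value, deadline_s=10.0,
                            backoff_s=0.5):
        start = time.monotonic()
        while time.monotonic() - start < deadline_s:
            if quorum_write(key, value):  # True once a quorum acks
                return True
            time.sleep(backoff_s)         # partition still open; try again
        # Deadline hit: either surface the failure (unavailable) or fall
        # back to an under-replicated write -- the same CAP choice again.
        return False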

Also if you're accepting quorum writes to the "major partition" you still have
to repair the "minor partition" when it comes back online. There's no
traditional DB that implements the sort of read-repair/hinted-handoff/anti-
entropy mechanics that Cassandra, Voldemort, and the various proprietary big
data stores use.

~~~
cx01
The case where I have to retry indefinitely will only occur when the partition
is permanent. And in that case an eventually consistent system will not help
either, because clients will not see the writes (as long as the partition
exists), so the eventually consistent system doesn't offer me any advantage.

~~~
mmalone
My point was that for an interactive application it doesn't take a very long
time for user experience to degrade substantially.

And we've glossed over my points about newer databases exposing these knobs to
clients. It's possible to do exactly what you're describing using Cassandra,
for example (er, you might have to do some consistency level tweaking to make
writes fail if you can't get a quorum of authoritative nodes, but it wouldn't
be hard - and I'm not even sure if that's necessary). It's not possible to do
it with MySQL or PostgreSQL without building some intelligent partitioning
layer on top. And that layer will probably make it impossible to do joins and
add relationship constraints, so you lose any benefits these systems bring to
the table.
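
For what it's worth, here's roughly what that knob looks like with the
open-source Python driver for Cassandra (a sketch; the keyspace and table
are made up, and you'd need a reachable cluster for this to run):

    from cassandra import ConsistencyLevel
    from cassandra.cluster import Cluster
    from cassandra.query import SimpleStatement

    cluster = Cluster(["127.0.0.1"])
    session = cluster.connect("demo")

    # QUORUM writes fail fast when a majority of replicas is unreachable,
    # instead of silently accepting an under-replicated write.
    insert = SimpleStatement(
        "INSERT INTO users (id, name) VALUES (%s, %s)",
        consistency_level=ConsistencyLevel.QUORUM,
    )
    session.execute(insert, (42, "alice"))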

------
kennu
I always figured that the C in CAP is more about scalability. If you keep
adding servers indefinitely, it becomes slower and slower to commit every
transaction to every server. But if you ditch the C, you can write fully in
parallel to all servers.

~~~
cx01
ACID transactions require consistency (C in CAP) in order to guarantee
atomicity. So if you ditch C, you'll just be unable to perform transactions at
all.

In general, weakening the consistency guarantees is not done to improve
scalability, but rather to improve the partition tolerance. There are
replication techniques (e.g. chain replication) that allow very high
performance without sacrificing consistency.
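
Chain replication fits in a short sketch (the idea only, without the
failure handling a real implementation needs):

    # Writes enter at the head and propagate down the chain; reads are
    # served by the tail, which only ever holds fully replicated values.
    class ChainNode:
        def __init__(self, successor=None):
            self.store = {}
            self.successor = successor

        def write(self, key, value):
            self.store[key] = value
            if self.successor:            # pass the write down the chain
                self.successor.write(key, value)

    tail = ChainNode()
    head = ChainNode(successor=ChainNode(successor=tail))

    head.write("x", 1)
    print(tail.store["x"])  # 1 -- the tail sees only committed writes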

~~~
kingkilr
The C in CAP and the C in ACID are not the same C. Consistency in the ACID
sense means that all constraints are met; consistency in the CAP sense means
that all servers have the same data.

~~~
cx01
I know. I actually explained this not too long ago here:
<http://news.ycombinator.com/item?id=1191003>

But what I'm saying here is: You cannot have ACID transactions without the C
in CAP.

