

Problems with CAP, and Yahoo’s little known NoSQL system - jermo
http://dbmsmusings.blogspot.com/2010/04/problems-with-cap-and-yahoos-little.html

======
ryanobjc
Classic blog from 2010. The idea didn't catch on, however, and no one uses
these terms. PNUTS is still completely closed source inside Yahoo -- this is
because PNUTS depends on an internal queuing system.

As for YCSB, it's commonly referenced and is easily available on github. It
hasn't changed much lately, and now with tools like jepsen that focus on
correctness as well as performance, YCSB is no longer the preferred testing
tool.

~~~
anonetal
I agree, and I wish the concepts here were more widely adopted. Many people
seem to equate low-consistency modes (offered by Dynamo/Cassandra etc) with
CAP and with guaranteeing availability in the presence of network partitions.
In practice, network partitions seem much less of a concern, and the low-
consistency modes are really needed to get reasonable performance/latency
instead.

~~~
ryanobjc
The concepts listed here are either (a) too complex or (b) not complete
enough to provide a full, rich mental model.

CAP was highly successful because it fit in with an existing meme - 3 things,
you can only get 2 of them. It's simple, and easy to apply in a trivial
fashion.

But CAP theorem doesn't have enough "meat" as an engineering analysis, and
doesn't guide your system design. Yes there's a proof, but it doesn't tell you
how to balance the 3 concepts and how to directly compare system designs.

The thing is that as a descriptive model of how an entire distributed database
works, CAP just ain't enough. That's why blog posts like
[https://aphyr.com/posts/313-strong-consistency-models](https://aphyr.com/posts/313-strong-consistency-models)
and concepts like 'linearizability' are very useful.

------
alexnewman
Having implemented and used multiple consistent coordination systems (Raft+ in
c5, which I helped write; Single-Decree Paxos via DConE at WanDISCO; and
timeline-consistent implementations via HBase), I have to say the academics
miss that the devil is in the implementation details.

We all focus on high-level things like CAP (although FLP is what the real
hard-core academics care about), when system details like how you can
aggregate fsyncs, system pauses (especially with GC), and how you integrate
your coordination system into the larger system play a much larger role in
overall system latency. The coordination posture is a minor detail when
compared to GC issues.

Now I know what you are going to say: "He's not an academic, he ran a real DB
company." I totally disagree. The DB community in general focuses on the wrong
things. That's true in Hadoop, it's true with Cassandra, and I would bet that
it's true at Google as well.

------
hiphipjorge
A classic! If you had to take away one thing from this article, it would be
that the CAP theorem (while useful) doesn't take latency into consideration,
which might be even more important than partition tolerance. Yes, partition
tolerance is inevitable, but you have to deal with latency every single time!

------
shin_lao
The author of CAP himself said that the CAP theorem gained undue popularity.
It is an interesting theorem only because it models what partitions imply in
terms of trade-offs when they happen.

But partitions are rare. And when a partition happens, generally you can't
access the partitioned area, so a lot of problems disappear.

Availability isn't an on/off switch; there is a wide range of how "available"
you can be and what you can do. For example, you can allow reads but disallow
writes.

Last but not least, the most important thing is what happens after the
partition is over and what level of guarantees you offer regarding
consistency.
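The read-but-no-write degraded mode described above can be sketched like this
(a toy model; the `Node` class and all names are hypothetical, not from any
real system):

```python
class Node:
    """Toy sketch of a node that degrades to read-only during a partition.

    All names here are illustrative. The `partitioned` flag would be set
    by a failure detector in a real system (not shown).
    """

    def __init__(self):
        self.data = {}
        self.partitioned = False

    def read(self, key):
        # Reads stay available during a partition (they may be stale).
        return self.data.get(key)

    def write(self, key, value):
        # Writes are refused rather than risking divergent replicas.
        if self.partitioned:
            raise RuntimeError("write rejected: partitioned, read-only mode")
        self.data[key] = value


node = Node()
node.write("x", 1)
node.partitioned = True
assert node.read("x") == 1   # still readable during the partition
try:
    node.write("x", 2)       # refused until the partition heals
except RuntimeError:
    pass
```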

~~~
yummyfajitas
Partitions are not rare. A partition simply means that one node is not
reachable from another, which does not have to be caused by a network failure.
For example, a stop-the-world GC pause is equivalent to a network failure: the
node is unreachable during that time. So is a deployment with downtime (e.g. a
10-second restart per node).

Also, while you can't access the partitioned area, you might have nodes
outside that area. The question is whether the non-partitioned nodes become
useless to you or not.

[http://kellabyte.com/2013/11/04/the-network-partitions-are-r...](http://kellabyte.com/2013/11/04/the-network-partitions-are-rare-fallacy/)

[https://webcache.googleusercontent.com/search?q=cache:D3htez...](https://webcache.googleusercontent.com/search?q=cache:D3htezwrrNIJ:kellabyte.com/2013/11/04/the-network-partitions-are-rare-fallacy/+&cd=1&hl=en&ct=clnk&gl=us)

~~~
shin_lao
Partitions are rare relative to other events, and I was quoting Eric Brewer,
the author of the CAP theorem.

If you have one partition per day (which is very frequent) but serve 100,000
requests per second, partitions are rare.

10 seconds is hardly a partition, as it is below the default TCP timeout (one
minute).

 _Also, while you can't access the partitioned area, you might have nodes
outside that area._

So these nodes can't access the partition and you don't have any conflict.

~~~
yummyfajitas
10 seconds/day already breaks 4 9's (which allows only about 8.6 seconds of
downtime per day). The fact that this might be below the TCP timeout is
irrelevant; that's almost certainly NOT the timeout that will be considered
as unavailability. E.g., suppose you are delivering an advertising tag:
unavailable means > 200ms.
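Working the downtime budgets out (plain arithmetic; the
`daily_downtime_budget` helper is just for illustration):

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

def daily_downtime_budget(availability):
    """Seconds of downtime per day allowed at a given availability target."""
    return SECONDS_PER_DAY * (1 - availability)

three_nines = daily_downtime_budget(0.999)   # 86.4 s/day
four_nines = daily_downtime_budget(0.9999)   # ~8.64 s/day

# A 10 s/day pause fits within three nines but blows the four-nines budget.
assert four_nines < 10 < three_nines
```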

 _So these nodes can't access the partition and you don't have any conflict._

Conflict comes when you reconnect.

------
int19h
The way I've heard the CAP theorem used practically, at a high level, is to
frame the question as follows: what happens in the case of a (logical) network
partition? An AP system continues taking requests and provides eventual
consistency, while a CP system waits for the partition to go away, or tells
clients to come back later.
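That framing can be sketched as a toy function (all names illustrative, not
any real database's API):

```python
def handle_write(mode, partitioned, store, key, value):
    """Toy sketch of the AP/CP split during a partition.

    `mode` is "AP" or "CP"; `store` stands in for a local replica.
    None of this reflects a specific product.
    """
    if not partitioned:
        store[key] = value
        return "ok"
    if mode == "AP":
        # AP: accept the write locally; replicas reconcile once the
        # partition heals (eventual consistency).
        store[key] = value
        return "ok (will sync later)"
    # CP: refuse rather than risk divergence; the client should
    # come back after the partition is resolved.
    return "unavailable, come back later"


store = {}
assert handle_write("AP", True, store, "x", 1) == "ok (will sync later)"
assert handle_write("CP", True, store, "y", 2) == "unavailable, come back later"
assert "y" not in store  # the CP refusal left no partial write behind
```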

------
explosion
I like the author's concept of PACELC, though it seems a bit implementation-
specific.

------
jchrisa
title should say (2010)

------
aaa667
I don't understand what is meant by "this means that the roles of the A and C
in CAP are asymmetric" - could someone explain this to me?

~~~
jfoutz
You can be available, or you can be consistent. The higher the consistency
guarantee you demand, the lower your availability.

Imagine a small database with a million copies on a million machines. When I
write something, I'm going to have to wait a while for all million machines to
acknowledge that the write is complete. If I have a lower threshold, say 1,
the write completes _real fast_, but the million machines aren't consistent.

Real systems balance consistency and availability.
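The threshold idea above is essentially a write quorum W: the write completes
when the W-th fastest replica acknowledges it, so its latency is the W-th
smallest replica latency. A toy sketch (hypothetical names and numbers):

```python
def write_latency(replica_latencies_ms, w):
    """Latency of a write acknowledged by w of n replicas: the write
    completes when the w-th fastest replica responds. Toy model only."""
    assert 1 <= w <= len(replica_latencies_ms)
    return sorted(replica_latencies_ms)[w - 1]


# One slow replica and one very slow (GC-pausing?) replica:
latencies = [5, 12, 30, 200, 900]  # ms, illustrative

print(write_latency(latencies, w=1))               # 5: fast, weakly consistent
print(write_latency(latencies, w=len(latencies)))  # 900: waits for everyone
```

Raising W strengthens consistency but makes every write wait on the slowest
acknowledging node, which is why GC pauses dominate tail latency.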

