
Beating the CAP Theorem Checklist - pjvds
http://ferd.ca/beating-the-cap-theorem-checklist.html
======
finnh
Nice checklist! I really like Coda Hale's "You Can't Sacrifice Partition
Tolerance" [0]

I think part of people's frequent confusion is thinking it's "CAP: Pick two"
when really it's "C or A: pick at most one for any particular situation".

0: [http://codahale.com/you-cant-sacrifice-partition-tolerance/](http://codahale.com/you-cant-sacrifice-partition-tolerance/)

~~~
atondwal
The Oct 22 update basically says "nevermind".

~~~
saurik
No; the Oct 21st update notes that someone (important) "disagrees". The Oct
22nd update then takes that disagreement, quotes a particular paragraph from
it, and concedes a single point (one that is mostly about terminology). It
then has a "that said" which disputes the overall conclusion of that
disagreement, both in this specific aspect and in the general case, over the
following paragraphs (all of which back up his overall position that A vs C
is valuable to consider).

In fact, the part being disagreed with on Oct 21st wasn't even the idea that
CAP is itself correct, or that "you can't give up partition tolerance", but
instead is over the contention that choosing A over C is ever the correct
design decision: the argument from Stonebraker is that in real-world systems A
is hopeless to obtain anyway, since while network partitions do occur, they
are far less frequent than "bohrbugs, application errors, human errors and
reprovisioning events".

------
jchrisa
One thing they don't teach in CAP theorem class is that you can be AP
sometimes and CP others. For instance you can be strongly consistent within
one system boundary (perhaps a datacenter) and eventually consistent between
datacenters.

The CAP theorem can thus be annoying on both sides of the divide: people who
think they can beat it, and people who think every system must be fully
described by either AP or CP. Real life is messy, and for most real
distributed systems you can't adequately characterize their behavior just by
pointing to a single spot on a Venn diagram of CAP.

~~~
mononcqc
More than that: individual operations in a system could be AP while others are
CP. If you have something like dynamic quorum sizes, making quorums a majority
would work towards some form of CP, while making them smaller than a majority
could work towards AP constraints.

There are of course more mechanisms to consider when positioning yourself on
the spectrum, but nothing would keep you from requiring CP for operations that
have to do with subscriptions, and AP for operations that show how many people
are online on a website, for example.
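
A rough sketch of that idea in Python (the operation names and replica count here are made up for illustration):

```python
# Hypothetical per-operation quorum sizing: majority quorums lean CP,
# sub-majority quorums lean AP.

N = 5  # assumed number of replicas

QUORUM_BY_OPERATION = {
    "update_subscription": N // 2 + 1,  # majority quorum: leans CP
    "count_online_users": 1,            # any one replica: leans AP
}

def required_acks(operation):
    """How many replica acknowledgements to wait for before replying."""
    return QUORUM_BY_OPERATION[operation]
```

With N = 5, a subscription write waits on 3 replicas, while the online-user count is served by whichever single replica answers first.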

Then you have to consider 'refinements' over CAP such as PACELC, where you can
decide to switch modes depending on cluster health. For example, you could
decide that when there is a (P)artition, you are (A)vailable; (E)lse, you are
(C)onsistent. Nothing stops you, either, from providing no consistency at all
if you want extremely low latency. For a cache, choosing latency (which would
be PA/EL, where L stands for Latency) would make sense.
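
A minimal sketch of that PACELC-style mode switch (the mode names are invented for illustration):

```python
# Hypothetical PACELC mode selection:
#   PA: during a (P)artition, stay (A)vailable and read locally.
#   EL: (E)lse, a latency-first system (e.g. a cache) still reads locally.
#   EC: (E)lse, a consistency-first system pays the quorum round-trip.

def read_mode(partition_detected, prefer_latency):
    if partition_detected:
        return "local-read"   # PA: answer from whatever we have
    if prefer_latency:
        return "local-read"   # EL: trade consistency for latency
    return "quorum-read"      # EC: no partition, so pay for consistency
```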

CAP is just the very basic set of rules to consider when going distributed.

------
zorked
In case someone does not recognize the reference, this is a parody of the Spam
Solutions Checklist that used to be circulated when e-mail spam was a new
issue:
[https://craphound.com/spamsolutions.txt](https://craphound.com/spamsolutions.txt)

------
justinsb
I think everyone accepts that the CAP theorem (as proved) is true. The issue
is what conclusion you draw. The theorem just says you can't be perfect on all
three axes. Some people conclude that we must throw everything away, and e.g.
use a key-value store.

To my mind, that's a bit like learning about random bit-flips, and giving up
programming.

~~~
mononcqc
Here are a few examples:

[http://nathanmarz.com/blog/how-to-beat-the-cap-theorem.html](http://nathanmarz.com/blog/how-to-beat-the-cap-theorem.html)

[http://dominictarr.com/post/44516618714/working-around-the-c...](http://dominictarr.com/post/44516618714/working-around-the-cap-theorem-with-rumours-db)

[http://scale-out-blog.blogspot.ca/2012/04/disproving-cap-the...](http://scale-out-blog.blogspot.ca/2012/04/disproving-cap-theorem.html)

Not everyone accepts the CAP theorem yet, and some have an understanding of it
flawed enough that they think they can beat it.

~~~
justinsb
One of those was posted on April 1st :-)

Nathan's post is very interesting - thanks for posting that. He has taken on
board the implications of CAP, and built a system (Storm) that is genuinely
different.

I wish more people did that. If you're going to use CAP to justify building a
database that is simply less capable, that's just marketing. If you're going
to look at CAP and think "OK, we can't get to 100%, but how can we get
closer?", then that's great. For blog marketing reasons, you might describe
that as "beating the CAP theorem"; but it's actually finding new solutions
within the constraints.

~~~
mononcqc
Good catch on the April Fools' post! I shouldn't have included that one,
obviously.

As for Nathan's post, reading the comments reveals that he ends up
acknowledging CAP is true, which would make him fall into the "( ) nice try,
but blatantly false advertising" box of the list.

The CAP theorem _can_ unfortunately be used as an excuse to provide weak
systems that don't do much, but that isn't what the idea is meant to validate.
As I mentioned in another comment, it's more or less a very basic set of rules
(axioms?) to consider when building distributed systems.

------
kbenson
It would be fairly awesome if the page parsed params to put x's in the
appropriate spots, so you could use the appropriate link to customize it for
the recipient you send it to.
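
The parsing side of that could be as simple as this sketch (the `checked` parameter name is made up; the real page would have to pick its own scheme):

```python
from urllib.parse import urlparse, parse_qs

def checked_items(url):
    """Return the set of checklist item indices marked in the URL,
    e.g. ...?checked=2,5 -> {2, 5}."""
    query = parse_qs(urlparse(url).query)
    raw = query.get("checked", [""])[0]
    return {int(i) for i in raw.split(",") if i}

def render(items, url):
    """Render the checklist with (x) next to the marked items."""
    marked = checked_items(url)
    return "\n".join(
        "(%s) %s" % ("x" if i in marked else " ", item)
        for i, item in enumerate(items)
    )
```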

------
dustingetz

        * Rich Hickey on Datomic, CAP and ACID [1]
        * Exploiting Loopholes in CAP (Michael Nygard of Relevance) [2]
        * How to beat the CAP theorem (by Nathan Marz of Twitter) [3]
    

[1] [https://news.ycombinator.com/item?id=5093037](https://news.ycombinator.com/item?id=5093037)
[2] [http://www.infoq.com/presentations/Loopholes-CAP](http://www.infoq.com/presentations/Loopholes-CAP)
[3] [http://nathanmarz.com/blog/how-to-beat-the-cap-theorem.html](http://nathanmarz.com/blog/how-to-beat-the-cap-theorem.html)

------
vinkelhake
Seems to have been inspired by the Programming Language Checklist:
[http://colinm.org/language_checklist.html](http://colinm.org/language_checklist.html)

------
superuser2
I understand that this is probably wrong somehow, but what about a system
wherein there is sufficient capacity to take a "partitioned" datacenter out of
the loop entirely and serve web traffic from your other locations, then bring
it back up to speed while still offline after the disruption ends?

Again, I accept that that's probably bad, but why?

~~~
lotyrin
Your strategy as I understand it is CP. In the case that you have a partition
and it's not a majority, it stops accepting traffic or goes read-only.

Say an outage leaves your three DCs (or more generally, N partitions, of which
none are a majority) unable to talk to one another. They are all a minority
and so all go offline or read-only. No partition can accept writes without
risking that another partition is too, and that those writes will be
inconsistent.

You're partition tolerant, and have consistency guarantees, but you're not
available.
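
The majority rule above can be sketched in a few lines (node counts are illustrative):

```python
def can_accept_writes(reachable_nodes, cluster_size):
    """A partition may accept writes only with a strict majority, so at
    most one partition can ever be writable at a time."""
    return reachable_nodes > cluster_size // 2

# Three single-node DCs, all isolated from each other: each sees only
# 1 of 3 nodes, so every DC refuses writes (goes read-only).
```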

~~~
superuser2
Can I "cheat" by having satellite, PSTN, cell phones, semaphore, smoke
signals, sneakernet, etc. to allow the DCs to negotiate (perhaps randomly)
which will consider itself the master? Then the non-masters could direct users
to the master.

I guess this isn't a true "partition" and if we're talking about an internet
backbone cable cut, it might go unavailable for _some_ users who could not be
redirected, but for the rest it is intact. While networking failures are to be
expected, could I minimize my risk of a complete partition by sharing this
status information? I suppose if we're in a cartoon end-of-the-world scenario
where my 3 DCs drift out to sea on separate continents, I go unavailable, but
at that point everyone has bigger problems to worry about.

If the PSTN, cell phone network, Internet, dedicated fiber link, shortwave
radio, roads, and navigable airspace (drone, helicopter, carrier pigeon)
between all 3 of my locations are severed, then the system goes completely
unavailable, but at this point we're talking about a mass casualty incident,
civilization collapse, or other situation where no one would be trying to use
my system anyway. Medical records for first responders would have to fall back
to "eventually consistent", but patients can only be in one place at a time.
Unless my system is the nuclear deterrent, does it matter?

~~~
dedward
The partition part is about how tolerant your system is to a partition
situation. Having massively reliable communications between systems will
increase your MTBF, but it doesn't address how your system will actually
handle a partition. You do need to account for it one way or the other. If you
partition, you can't keep operating with consistency across the whole system.

Now when you look at patient data: sure, eventually consistent might be just
fine, but that's not consistency as referred to in CAP. So while it's
perfectly acceptable for your app, it's not about making CAP invalid.

~~~
superuser2
>If you get a partition, you can't keep operating with consistency across the
whole system.

But why can't I shut down some nodes and direct all my users to the other(s)?
As long as I have enough of a channel to communicate which node is now
authoritative (and if all systems, including driving between DCs, fail
simultaneously, nobody needs my app anyway), can't I keep serving my users out
of one DC?

Thank you for engaging this thought experiment, by the way. This is
fascinating stuff.

~~~
Scramblejams
As the linked site notes, clients are part of the distributed system, which
means they can be partitioned as well, both from each other (more important in
a P2P app) and from your servers. So you may not be able to "direct all your
users to the other(s)."

