
How to beat the CAP theorem - icey
http://nathanmarz.com/blog/how-to-beat-the-cap-theorem.html
======
mononcqc
This doesn't beat the CAP theorem. The approach can be quickly described as a
log-based append-only database with read-repair code based on last-write-wins.

The system will have inconsistent views for the whole time a netsplit lasts.
Suppose I have two nodes, A and B, which both hold a piece of data about a
meeting M0 scheduled at 11AM on Monday. Then a netsplit happens. The strategy
suggested won't help with the following case:

Jim and Peter have access to node A, and Mary and Julie have access to node
B. Because of the netsplit between A and B, the two nodes cannot
communicate. Jim and Peter agree to move the meeting M0 to 3PM
on Monday, and we call it {A,M1}. Mary and Julie, however, agree to move the
meeting to 1PM on Tuesday. We call it {B,M1}. Now, the M1 record on A and B is
no longer the same. If the netsplit isn't resolved before Monday, then
inconsistent data will have caused the loss of our team meeting!

When the netsplit is resolved and we consult the meeting, the approach from the
blog post will pick either A's version of M1 or B's version, based on the
map-reduce pass. From the reader's point of view, one of them has been
discarded. This is why the last write wins -- it's the latest state in the log
of events, and the other one is ignored.
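Sketched in a few lines of Python (the `lww_merge` name, timestamps, and values are mine, just to make the failure mode concrete):

```python
# Two partitions each record a new time for meeting M1 during the netsplit.
# Each write carries a (timestamp, value) pair, as in a last-write-wins log.
node_a = {"M1": (1001, "Monday 3PM")}   # Jim and Peter's update
node_b = {"M1": (1002, "Tuesday 1PM")}  # Mary and Julie's update

def lww_merge(a, b):
    """Read-repair by last-write-wins: keep the entry with the highest
    timestamp, silently discarding the other."""
    merged = dict(a)
    for key, (ts, val) in b.items():
        if key not in merged or ts > merged[key][0]:
            merged[key] = (ts, val)
    return merged

# After the netsplit heals, one of the two updates simply vanishes:
print(lww_merge(node_a, node_b)["M1"])  # (1002, 'Tuesday 1PM'); Monday 3PM is lost
```

Nothing in the merge knows that Jim and Peter's update ever existed; that is the discarded write.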

Last-write-wins is a default mode for read-repair, not a solution to the
problem of losing consistency. During the netsplit, views of the data remain
inconsistent.

The way to keep them consistent would have been to have A and B block writes.
That would have kept Jim, Peter, Julie and Mary from writing new meeting
times and ensured that they were all shown the same hour.

An easy way to test whether you beat the CAP theorem or not is to imagine your
system functional during a week-long netsplit. Either it keeps on working with
inconsistent data, or it has a way to keep it consistent. The CAP theorem
tells us it's impossible to do both at once, and the blog post didn't disprove
that, but merely provided a way to resolve conflicts in data in an automatic
way.

I would expect someone to 'beat' the CAP theorem as much as I would expect
someone to 'beat' the Pythagorean theorem. The post takes an interesting
approach to making the CAP theorem simpler to handle, but misses its own
point about beating it.

------
pdbne
I'm not sure if you're trolling here:

> The CAP theorem has been beaten.

> As you are about to see, a system like this not only beats the CAP theorem,
> but annihilates it.

The CAP theorem states that you can't have all three of strong consistency,
partition tolerance, and availability at once. This is a proven fact
([http://www.cs.cornell.edu/courses/cs6464/2009sp/papers/brewe...](http://www.cs.cornell.edu/courses/cs6464/2009sp/papers/brewer.pdf)).

I don't see how your proposed system provides strong consistency. Eventual
consistency is not strong consistency. Let's pretend the HDFS deployment in
your batch workflow is partitioned from the rest of the system: sorry, but you
won't be able to get strongly consistent data from it, and you'll have to
either read stale data or simply block until the partition ends. This may be
pedantic, but the CAP theorem is pretty specific.

Now, if you're saying something about the eventual consistency of logically
monotonic facts in the system (see the CALM conjecture, now theorem:
[http://databeta.wordpress.com/2010/10/28/the-calm-conjecture-reasoning-about-consistency/](http://databeta.wordpress.com/2010/10/28/the-calm-conjecture-reasoning-about-consistency/)),
which I think you are, then I agree that you
have a system providing (even provable) eventual consistency. The basic idea
is that if code is logically monotonic, meaning the set of facts it operates
on continues to grow over time, then it will obey eventual consistency
properties. If I'm not mistaken, your notion of "immutable data" is equivalent
to restricting programs to operate on logically monotonic data (in your
examples, this is accomplished by making data take the form "fact X is true at
time T").
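A rough sketch of why logically monotonic, grow-only fact sets converge (illustrative Python, not from the post; the facts here are made up):

```python
# Grow-only sets of immutable facts converge under any delivery order,
# because set union is commutative, associative, and idempotent.
facts_a = {("sally", "chicago", 1), ("bob", "nyc", 2)}
facts_b = {("sally", "atlanta", 3)}

def merge(x, y):
    """Replication of append-only facts is just set union."""
    return x | y

# Any interleaving of merges (including redundant redelivery)
# yields the same final set of facts:
assert merge(merge(facts_a, facts_b), facts_a) == merge(facts_b, facts_a)
```

That order-independence is exactly what mutable cells lack: overwriting a cell is not commutative, so delivery order suddenly matters.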

~~~
nathanmarz
I never said anywhere that it provides strong consistency. You still choose
between availability and consistency. What you beat is the complexity the CAP
theorem normally causes. You regain the ability to easily reason about your
systems.

"There is another way. You can't avoid the CAP theorem, but you can isolate
its complexity and prevent it from sabotaging your ability to reason about
your systems."

"What caused complexity before was the interaction between incremental updates
and the CAP theorem. Incremental updates and the CAP theorem really don't play
well together; mutable values require read-repair in an eventually consistent
system. By rejecting incremental updates, embracing immutable data, and
computing queries from scratch each time, you avoid that complexity. The CAP
theorem has been beaten."

~~~
adgar
> What you beat is the complexity the CAP theorem normally causes. You regain
> the ability to easily reason about your systems.

So instead of saying "The CAP theorem, while still being completely valid,
doesn't have to be as annoying as you might think," you said "The CAP theorem
[has been] annihilated."

Gotta love blogging.

~~~
nathanmarz
Wow. This is one of the most blatant examples of taking things out of context
I've ever seen. The "annihilated" sentence you're referencing is part of a
thought experiment in the post about an ideal system that's infeasible but
instructive!

Here's the proper context for that statement:

"The simplest way to compute a query is to literally run a function on the
complete dataset. If you could do this within your latency constraints, then
you'd be done. There would be nothing else to build.

Of course, it's infeasible to expect a function on a complete dataset to
finish quickly. Many queries, such as those that serve a website, require
millisecond response times. However, let's pretend for a moment that you can
compute these functions quickly, and let's see how a system like this
interacts with the CAP theorem. As you are about to see, a system like this
not only beats the CAP theorem, but annihilates it.

The CAP theorem still applies, so you need to make a choice between
consistency and availability. The beauty is that once you decide on the
tradeoff you want to make, you're done. The complexity the CAP theorem
normally causes is avoided by using immutable data and computing queries from
scratch."
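The thought experiment is easy to write down; in illustrative Python (the pageview dataset is my example, not from the post):

```python
# The complete dataset is an append-only collection of immutable facts.
dataset = [
    {"fact": "pageview", "url": "/home", "time": 100},
    {"fact": "pageview", "url": "/home", "time": 101},
    {"fact": "pageview", "url": "/about", "time": 102},
]

def query(data, url):
    """A query is literally a pure function run over all the data, from
    scratch. There is no mutable state, so there is nothing to read-repair."""
    return sum(1 for d in data if d["fact"] == "pageview" and d["url"] == url)

print(query(dataset, "/home"))  # 2
```

The infeasible part, of course, is running that function within millisecond latency constraints, which is what the rest of the post's architecture exists to approximate.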

------
rxin
Good write up, Nathan. Nonetheless, you started by saying you were going to
challenge the basic assumptions of how data systems should be built, but the
resulting system architecture is in fact very similar to a lot of data
streaming systems that academia and industry have proposed and built for a
number of years. An example of such system is Google's Percolator [1].

In essence, you separate the system into two parts, a read-only part and a
real-time part. CAP explains fundamental trade-offs in OLTP type of queries,
whereas this system design is catered towards streaming, analytical queries.
These are two very different contexts.

[1] <http://research.google.com/pubs/pub36726.html>

~~~
nathanmarz
I was under the impression that Percolator was a purely incremental system.
I'll have to give the paper another read to see if they're using a hybrid
batch/realtime approach.

The system I described isn't limited to any particular "type" of query. It's a
way of supporting arbitrary queries on arbitrary data, which encapsulates
everything.

Most people assume mutable state and incremental updates when building data
systems. This is in fact the approach that almost all databases take. These
are the assumptions I challenged.

~~~
VladRussian
>Most people assume mutable state and incremental updates when building data
systems. This is in fact the approach that almost all databases take. These
are the assumptions I challenged.

They aren't assumptions. They are requirements for some systems. Drop the
requirements and you get breathing room to build a different system.

~~~
nathanmarz
Mutability and incremental updates are never requirements. Requirements have
to do with performance and resource usage: X amount of space, Y amount of
memory, Z amount of latency, etc.

Mutability and incremental updates are approaches. I showed a different
approach with immutability at the core which leads to a system with some
amazing properties.

~~~
eru
I hope that examples like this and git make more people see the value of
immutability. In the context of functional programming the same concept is
called purity.

------
perlgeek
> The key is that data is immutable. Immutable data means there's no such
> thing as an update, so it's impossible for different replicas of a piece of
> data to become inconsistent.

I'm not following that point.

Suppose you have an event-sourcing data store as the OP suggests, and some
nodes go out of sync. So one of the nodes records that Sally now lives in
Atlanta, the other doesn't.

The query for Sally's location asks for the last update. One node says
Atlanta, the other still says Chicago. What's that, if not inconsistent?

Only appending data means you don't corrupt old data, but you still can get
conflicting responses from different parts of your system.

~~~
nathanmarz
What you're describing is a query. "What is Sally's current location?"

Queries can be inconsistent during partitions. This is what eventual
consistency means. But once the partition clears, the queries will be
consistent again.

The data is immutable. Both "Sally lives in Atlanta as of time X" and "Sally
lives in Chicago as of time Y" are true.

This is far, far different from databases based on mutable state. In order for
a database like that to be eventually consistent, you need to do read-repair
to enforce consistency. That's aside from the other problems I brought up with
mutable databases, such as a lack of human fault-tolerance.
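As a rough sketch (the fact tuples and function name are illustrative, not from the post):

```python
# Immutable facts never conflict: both statements below are true forever.
facts = [
    ("sally", "chicago", 1000),  # Sally lives in Chicago as of time 1000
    ("sally", "atlanta", 2000),  # Sally lives in Atlanta as of time 2000
]

def current_location(facts, person):
    """'Where does Sally live now?' is a query: derive the answer from the
    full history rather than mutating a 'location' cell in place."""
    relevant = [(ts, city) for (p, city, ts) in facts if p == person]
    return max(relevant)[1] if relevant else None

print(current_location(facts, "sally"))  # atlanta
```

A partition can delay a node from seeing the Atlanta fact, making the query stale, but no fact ever has to be repaired or overwritten once it arrives.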

~~~
kahirsch
If I only have one antique pocket watch and, during a partition, Alan buys the
pocket watch while communicating with partition A and Bob buys the pocket
watch while communicating with partition B, how do we get back to a consistent
state without application code?

~~~
nathanmarz
A good example to look at is banks. Banks are eventually consistent systems.
An ATM allows you to withdraw funds (up to a set limit) even if it can't
communicate with the bank. However, because banks keep full audit logs of all
transactions (immutable data), they eventually discover that you took out too
much money and charge you an overdraft fee.

<http://en.wikipedia.org/wiki/Overdraft>
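A toy version of that reconciliation (the amounts and fee are made up):

```python
# An ATM keeps accepting withdrawals during a partition; the immutable
# transaction log lets the bank reconcile (and charge a fee) afterwards.
log = [
    ("deposit", 100),
    ("withdraw", 80),   # accepted by an ATM in partition A
    ("withdraw", 50),   # accepted by an ATM in partition B
]

def reconcile(log, overdraft_fee=35):
    """Replay the full audit log; an overdraft is discovered after the
    fact and penalized, rather than prevented up front."""
    balance = 0
    for kind, amount in log:
        balance += amount if kind == "deposit" else -amount
    if balance < 0:
        balance -= overdraft_fee
    return balance

print(reconcile(log))  # -65: overdrawn by 30, plus a 35 fee
```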

~~~
lemming
This isn't really relevant to the pocket watch problem. This only works
because the bank doesn't care about over-committing a finite resource (mostly
because they can charge you that fee). My company processes transactions for
prepaid credit cards, so any money that is overdrawn is essentially lost -
it's important to understand the characteristics of the problem and that there
really is no magic anti-CAP bullet.

~~~
nathanmarz
Exactly right. The bank example just demonstrates an alternative approach to
full consistency.

------
ww520
OP adds confusion by calling data immutable. Data has a well-accepted common
usage of being updatable. "Event" is the term the OP wants: an event is a
snapshot of a fact at a point in time; it cannot be changed, and it's
immutable.

What OP described in general is the append-log based approach to storing and
querying data. There are a whole bunch of databases and file systems built on the
append-log idea. It's well worth the time to check out the research in this
area to see the benefits and pitfalls of the approach.

Also check out the Complex Event Processing topic which bears similarity to
what OP described.

~~~
richhickey
Sorry, no. It is calling data mutable that is the confusion.

Try this exercise:

Identify something you consider a datum. Write it down. Note that it has more
than one part. Now imagine it 'changing' and write that datum down. Now,
identify the 'it' that (you would say) changed. Note that the 'it' part is
_the same_ in the two data. Let's call this 'it' part the identity. Now look at
the part that is different. For discussion's sake let's say that part was 42
in the first case and 99 in the second. Did 42 change into 99, or was it
replaced by it? Let's call 42 and 99 values. You'll not be able to describe
something that 'changed', that incorporates _both_ the identity and value,
that doesn't involve some memory/disk/system that had a (just one!) 'place'
for storing information about 'it', and it is that place that has had its
contents replaced/changed, not the datum/fact.

Did Obama becoming president _change_ the fact that Bush had been president?
No. Could you make an information system that forgot Bush was president? Sure,
but that's your fault, not a characteristic of data.

Finally, check out this 30 year old paper that contains this interesting
section: UPDATE IN PLACE: A poison apple?

[http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.108....](http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.108.7810)

As long as we continue to use lax language like mutable/updateable data we
will be without the tools to describe, think about, and build better systems
that independently manipulate identity, value and place, and correctly handle
time.

~~~
ww520
I'm sorry. I don't understand the point of the exercise.

First, what kind of data are you talking about? Are you talking about data
value? Data record? Data object with identity?

Second, are you saying that database records being mutable cannot model that
Bush was president once Obama has become president?

I don't know the reason for the aversion to mutable data record/object. Both
mutable and immutable data record/objects have their places in modeling real
world objects.

~~~
jcromartie
> I don't know the reason for the aversion to mutable data record/object.

I've found that mutable data is a root of all kinds of evil. In fact, I'd
guess that _most_ bugs stem from a mis-management of mutable state.

I know that pretty much all of the bugs I see every day are the result of
failure to coordinate state changes, or poorly-designed state changes.

~~~
ww520
Can you give some specific examples of bugs stemming from mismanagement of
mutable state? Some examples of bugs resulting from a failure to coordinate
state changes, or from poorly-designed state changes? And how would immutable
data solve those problems?

~~~
jcromartie
I'm not sure why you don't believe me :)

There are literally too many to list. But if I just had to pick some recent
ones:

* A dropdown on a web form that displays a different value than is returned (failure to coordinate the displayed value and the "actual" value)

* Logins remain locked after resetting a password (failure to change a value from "locked" to "not locked", instead of "is the user locked?" being a function of history)

* If you navigate to a business object from a list, and perform an action (via HTTP POST) on that object, you can accidentally perform the same action on other objects from that list through refreshing or clicking multiple times (since the "current" object is in the session state)

* Bugs that only happen when you navigate to a page from a certain other page (because the destination page depends on certain session state that was not set, instead of being a simple function of URL parameters)

~~~
ww520
Not that I don't believe you. It's just that I want to learn from specific
examples what kinds of problems people encounter when dealing with mutable
data, and how immutable data would solve them.

So how would immutable data solve the above problems? Thanks.

------
balakk
If I am not completely mistaken, the pattern being hinted at is called
Command-Query Responsibility Segregation (CQRS) in certain circles:

<http://cqrsinfo.com/documents/cqrs-introduction/>

While making the problem easier, I do not know if this actually "solves" the
CAP theorem issues.

~~~
joeyespo
At a high level, CQRS is simply separating your reads from your writes. It
follows the observation that most web-related DB queries are reads, with far
fewer writes. I don't know about solving the CAP theorem, but it does a very
nice job of scaling horizontally.

------
Maro
In practice, beating the CAP theorem inside the data center is achieved by
using a 2-tiered system.

Have a set of lightweight controller processes which use a consistent majority
voting algorithm (Paxos) to replicate the state of the cluster. Since it's
lightweight, you can run 5 or 7 of them, meaning 2 or 3 can die. This is
enough Availability in practice, per customer input.

Have a set of data nodes which actually store the data. Put groups of 3 of
these in replica sets and use consistent replication (Paxos). If one or two of
them go down, the controllers reconfigure the replica set, so you get
Availability even if 2 of 3 go down. (Customers love that they get
Consistency even though 2 of 3 can go down.)

This scheme is used in my company's product. The idea comes from
academic/Google papers, and is used in other products, too.

The scheme does not work for multi-datacenter use-cases. If you want that, you
have to give up consistent replication (Paxos) inside the replica sets and use
some kind of eventual consistency. This variation of the 2-layered scheme is
used by one of the popular competing NoSQL systems.
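The availability arithmetic in both tiers comes down to majority quorums; a minimal, illustrative sketch (not the product's actual code):

```python
def has_quorum(total_nodes, alive):
    """Majority-vote replication (as in Paxos) stays both consistent and
    available as long as a strict majority of replicas can still talk."""
    return alive > total_nodes // 2

# 5 or 7 lightweight controllers tolerate 2 or 3 failures respectively:
assert has_quorum(5, 3) and not has_quorum(5, 2)
assert has_quorum(7, 4) and not has_quorum(7, 3)

# A 3-node replica set tolerates only 1 failure on its own; surviving
# 2-of-3 failures requires the controllers to reconfigure the set.
assert has_quorum(3, 2) and not has_quorum(3, 1)
```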

~~~
VladRussian
>Beating the CAP theorem inside the data center is fairly easy in practice by
using a 2-tiered system.

>Have a set of lightweight controller processes which use a consistent
majority voting algorithm (Paxos, Consistency, Partition Tolerance) to
replicate the state of the cluster. Since it's lightweight, you can run 5 or 7
of them, meaning 2 or 3 can die. This is more than enough Availability in
practice, per customer input.

It isn't beating the CAP theorem. It is decreasing Partition Tolerance and
thus getting more Availability while preserving Consistency, in full
accordance with CAP.

------
seabee
I wrote my thesis on an immutable data store as a means to beat CAP. I didn't
get quite as far as your other observations though, as mine was in the context
of continuation web frameworks.

Very nice write-up however.

~~~
nathanmarz
Thank you. Got a link to your thesis anywhere? Would love to read it.

~~~
seabee
It's not in a state I would like to make public yet. Sufficient to get my
degree, but the quality is poor due to personal circumstances. However I sent
you a link via your contact form.

------
Uchikoma
Beating the CAP theorem by proving it, i.e. by dropping the requirement for
consistency? Sorry to be conservative and follow Aristotelian logic, but this
title does not make sense.

~~~
lmm
Indeed, it's not quite "beating". "How to get full availability and eventual
consistency (the best you can do under CAP) with no effort in the application
layer" might have been a truer title. Nevertheless, this is something useful.

------
iandanforth
A few days ago I posted my frustration with Cassandra counters, this article
articulates a solution elegantly:

'if you make a mistake or something goes wrong in the realtime layer, the
batch layer will correct it.'

The resiliency of this protection, however, is limited by the time it takes
to do a full re-processing of the historical data. And if there is a
processing mistake in your real-time layer, it is likely to exist in your
batch layer too, invalidating the pre-computed results as well.

~~~
nathanmarz
But you can always correct the mistake and recompute things. Unlike systems
based on mutable data, mistakes aren't permanent.

~~~
ww520
If the transaction log is kept for the mutable system, you can always replay
the transactions to arrive at the correct state. Most systems truncate the
transaction log after a checkpoint to save space. If space is not an issue,
you can keep the transaction log from the beginning.
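For example, a toy replay (the `set` operation and the records are illustrative):

```python
# Rebuilding state by replaying a transaction log from the beginning.
log = [
    ("set", "sally", "chicago"),
    ("set", "bob", "nyc"),
    ("set", "sally", "atlanta"),
]

def replay(log):
    """Fold the log, in order, into the current state. With the full log
    retained, any past checkpoint can be reconstructed the same way."""
    state = {}
    for op, key, value in log:
        if op == "set":
            state[key] = value
    return state

print(replay(log))  # {'sally': 'atlanta', 'bob': 'nyc'}
```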

------
pierrebai
How does the system deal with real-world data that has cross-datum consistency
requirements? Many systems modeling the real world have this issue: accounting,
inventory, ...

I'm also annoyed by the repeated mantra that you must have partition
tolerance. In fact, most naïve systems are designed without it, with some
central authority that must always be accessible. (Any system that is strictly
a tree with an essential root node is like that. Any partition that does not
include the root node cannot do any work, thus the system is not partition
tolerant.) You can then have CA (by requiring that all requests and updates
must propagate to the root.)

~~~
IgorPartola
You then do not have the A part of CA. Your root node is not accessible =>
your entire system is not accessible.

------
jgavris
'How to _tolerate_ CAP, according to application / system specifications..'

------
dr_rezzy
Great write up. You really nail the essence of a true data driven approach
(DDA) to solving problems. Simplification of the data world view can lead to
great things. The key to your batch computation example is proper data
modeling IMO. This leads to flexibility which is crucial for a design to be
successful. I feel people will stumble with the content of this post because
of the lack of proper understanding of DDA.

------
chrisdew
This looks like Event Sourcing / CQRS. Is that correct?

~~~
nathanmarz
There are similarities. It's really closer to functional programming than
anything else.

------
gfodor
Nice post! To me, the logical next step is a mathematical framework for
proving that all major relational operators (and in partition intolerant mode,
ACID) can be implemented using this approach -- if such a model can be cleanly
laid out it could drive the future of database systems.

------
curveship
Thanks for the long and interesting article. Two comments, one an observation
I'd be interested in hearing your response to, the second a bit of a devil's
advocate question.

First, in your write-up, you describe data as singular, but of course, it's
often a tuple. To take a concrete example, let's say that Sally, in order to
make her move from Atlanta to Chicago, goes to U-Haul's website and reserves a
14' truck for Monday. She thinks a bit more and decides that she needs a
bigger truck, so she goes to the site and changes her reservation to a 17'
truck. After even more thinking, she decides she really won't be ready by
Monday, so she changes her reservation date to Tuesday.

Let's call these three moments t0, t1 and t2. So her order has changed from:

t0: 14' truck, Monday

t1: 17' truck, Monday

t2: 17' truck, Tuesday

BUT, it so happens that U-Haul has implemented a datastore like you suggest,
one that is only "eventually consistent", and when Sally made her change at
t2, the node she was talking to hadn't heard yet about the switch from a 14'
to a 17' truck. So the actual tuples that were recorded were:

t0: 14' truck, Monday

t1: 17' truck, Monday

t2: 14' truck, Tuesday

Now you're in a pickle: how can you tell that the record stored at t2 was
partially out of date, versus the possibility that maybe Sally really did want
to change back to a 14' truck at the same time she changed the date to
Tuesday.

To solve this, you need to do two things: the table that records new items is
different from the ones that record updates, in that the new item tables
contain full tuples, while the update tables are radically normalized, to the
extent that they'll usually have just two columns: a key and an atomic
replacement value. So what we'd actually record is:

t0: 14' truck, Monday in the "new reservation" table

t1: 17' truck in the "reservation item change" table

t2: Tuesday in the "reservation date change" table

I don't have any knee-jerk opposition to a schema like that, but it seems to
me you'll run into two issues:

- all the usual issues with extreme normalization, like performance and
massive and complicated joins. And since, as I understand it, your realtime
iterative db is consuming the same data stream as your main store, it will
face this same performance penalty, at least in its input.

- your schema darn well better be in BCNF, or enforcing data constraints
across the hyper-normalized update tables is going to get very hairy very
fast. Like here, let's say that U-Haul has a policy that it never rents 17'
trucks on Tuesdays (I don't know why, maybe they wash them on Tuesdays). Since
when Sally made her switch to Tuesday, the node she was talking to thought she
was still renting a 14' truck, it was happy to record the change. But when
your batch process comes along and tries to resolve the two updates, what does
it do? There's no obvious answer (going with the last consistent state -- 17'
for Monday -- is certainly wrong, as we know that's no longer what Sally
wants). In other words, you're back in the "read repair" complexity and hell
you thought you'd avoided.

Second point ... eh, I think I've well overspent my word budget. In summary: a
devil's advocate would say that the realtime db you've called just an
"optimization" is, in fact, the real db, since if the realtime and batch dbs
differ, at least at any time other than when a new batch update lands, then
you'll go with the value from the realtime db (if, during normal operation,
batch says a count is 4 but realtime says 5, you report 5). So what you've
really done is attached a fairly massive auditing system onto the same-old
database design. Its job is to periodically check that no inconsistencies have
been introduced between nodes. It's a clever auditing system, in that the
immutable model means it may have a leg up on deciding what the right result
is, but still just an auditing system.

tl;dr: won't this lead to hyper-normalization, and is it really a new db
design or just a new kind of auditing system

edit: formatting, tl;dr

~~~
nathanmarz
I would store the reservation date and truck length as separate data objects
in the batch layer. You would tie them together with the reservation id. It
isn't a big deal to do those joins since the batch system is good at that kind
of large-scale computation and you've already accepted high latency in the
batch layer.

I wrote an article about graph schemas on the batch layer here:
[http://nathanmarz.com/blog/thrift-graphs-strong-flexible-schemas-on-hadoop.html](http://nathanmarz.com/blog/thrift-graphs-strong-flexible-schemas-on-hadoop.html)

The realtime db only accounts for the last few hours of data. Unless you're
doing a query that only needs the last few hours of data, you need to do a
merge between the batch view and the realtime view. We constantly flush old
data out of our realtime db so as to keep it small and fast.
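Roughly, for a simple count query (the view contents here are made up for illustration):

```python
# Batch view: counts precomputed over all data up to the last batch run.
batch_view = {"/home": 4, "/about": 1}

# Realtime view: counts covering only the last few hours, since that run.
realtime_view = {"/home": 1, "/contact": 2}

def merged_count(url):
    """Answer a query by combining the high-latency batch view with the
    small, fast realtime view over recent data."""
    return batch_view.get(url, 0) + realtime_view.get(url, 0)

print(merged_count("/home"))  # 5
```

Each batch run absorbs the older data, so the realtime view (and any mistakes in it) stays small and short-lived.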

------
relaunched
While I don't have an answer on how to beat the CAP theorem, I know someone
who is working on it as we speak, along with many other issues associated with
the cloud. See his tweet below:

@eric_brewer Eric Brewer I will be leading the design of the next gen of
infrastructure at Google. The cloud is young: much to do, many left to reach.
10 May via web

------
heyrhett
Wow, this article is full of clear, coherent, well thought out points... and
references!

------
lenary
I love that research is going into disproving already superseded theories.

