How to beat the CAP theorem 183 points by icey on Oct 13, 2011 | hide | past | web | favorite | 77 comments

 I'm not sure if you're trolling here:> The CAP theorem has been beaten.> As you are about to see, a system like this not only beats the CAP theorem, but annihilates it.The CAP theorem states that you can't have all three of strong consistency, partition tolerance, and availability at once. This is a proven fact (http://www.cs.cornell.edu/courses/cs6464/2009sp/papers/brewe...).I don't see how your proposed system provides strong consistency. Eventual consistency is not strong consistency. Let's pretend the HDFS deployment in your batch workflow is partitioned from the rest of the system: sorry, but you won't be able to get strongly consistent data from it, and you'll have to either read stale data or simply block until the partition ends. This is pedantic, the CAP theorem is pretty specific.Now, if you're saying something about the eventual consistency of logically monotonic facts in the system (see the CALM conjecture, now theorem: http://databeta.wordpress.com/2010/10/28/the-calm-conjecture...), which I think you are, then I agree that you have a system providing (even provable) eventual consistency. The basic idea is that if code is logically monotonic, meaning the set of facts it operates on continues to grow over time, then it will obey eventual consistency properties. If I'm not mistaken, your notion of "immutable data" is equivalent to restricting programs to operate on logically monotonic data (in your examples, this is accomplished by making data take the form "fact X is true at time T").
 I never said anywhere that it provides strong consistency. You still choose between availability and consistency. What you beat is the complexity the CAP theorem normally causes. You regain the ability to easily reason about your systems."There is another way. You can't avoid the CAP theorem, but you can isolate its complexity and prevent it from sabotaging your ability to reason about your systems.""What caused complexity before was the interaction between incremental updates and the CAP theorem. Incremental updates and the CAP theorem really don't play well together; mutable values require read-repair in an eventually consistent system. By rejecting incremental updates, embracing immutable data, and computing queries from scratch each time, you avoid that complexity. The CAP theorem has been beaten."
 > What you beat is the complexity the CAP theorem normally causes. You regain the ability to easily reason about your systems.So instead of saying "The CAP theorem, while still being completely valid, doesn't have to as annoying as you might think," you said "The CAP theorem [has been] annihilated."Gotta love blogging.
 Wow. This is one of the most blatant examples of taking things out of context I've ever seen. The "annihilated" sentence you're referencing is part of a thought experiment in the post about an ideal system that's infeasible but instructive!Here's the proper context for that statement:"The simplest way to compute a query is to literally run a function on the complete dataset. If you could do this within your latency constraints, then you'd be done. There would be nothing else to build.Of course, it's infeasible to expect a function on a complete dataset to finish quickly. Many queries, such as those that serve a website, require millisecond response times. However, let's pretend for a moment that you can compute these functions quickly, and let's see how a system like this interacts with the CAP theorem. As you are about to see, a system like this not only beats the CAP theorem, but annihilates it.The CAP theorem still applies, so you need to make a choice between consistency and availability. The beauty is that once you decide on the tradeoff you want to make, you're done. The complexity the CAP theorem normally causes is avoided by using immutable data and computing queries from scratch."
 So what you're saying is... the CAP theorem has NOT been beaten.
 Beaten != Eliminated
 It's like beating cancer by learning to mitigate the downsides.
 Isn't "beating cancer" typically reserved for people who go into remission and aren't going to be killed by their cancer?The analogy doesn't hold. OP is still stuck with CAP.
 He even said he annihilated the CAP theorem. You can't weasel out of it, the title and intro is linkbait.
 Good write up, Nathan. Nonetheless, you started by saying you were going to challenge the basic assumptions of how data systems should be built, but the resulting system architecture is in fact very similar to a lot of data streaming systems that the academia and industry have proposed and built for a number of years. An example of such system is Google's Percolator [1].In essence, you separate the system into two parts, a read-only part and a real-time part. CAP explains fundamental trade-offs in OLTP type of queries, whereas this system design is catered towards streaming, analytical queries. These are two very different contexts.
 I was under the impression that Percolator was a purely incremental system. I'll have to give the paper another read to see if they're using a hybrid batch/realtime approach.The system I described isn't limited to any particular "type" of query. It's a way of supporting arbitrary queries on arbitrary data, which encapsulates everything.Most people assume mutable state and incremental updates when building data systems. This is in fact the approach that almost all databases take. These are the assumptions I challenged.
 >Most people assume mutable state and incremental updates when building data systems. This is in fact the approach that almost all databases take. These are the assumptions I challenged.it isn't assumptions. It is requirements for some systems. You drop requirements - you get a breathing room and can build a different system.
 Mutability and incremental updates are never requirements. Requirements have to do with performance and resource usage: X amount of space, Y amount of memory, Z amount of latency, etc.Mutability and incremental updates are approaches. I showed a different approach with immutability at the core which leads to a system with some amazing properties.
 I hope that examples like this and git make more people see the value of immutability. In the context of functional programming the same concept is called purity.
 > The key is that data is immutable. Immutable data means there's no such thing as an update, so it's impossible for different replicas of a piece of data to become inconsistent.I'm not following that point.Suppose you have an event-souring data store as the OP suggests, and some nodes go out of sync. So one of the nodes records that Sally now lives in Atlanta, the other doesn't.The query for Sally's location asks for the last update. One node says Atlanta, the other still says Chicago. What's that, if not inconsistent?Only appending data means you don't corrupt old data, but you still can get conflicting responses from different parts of your system.
 What you're describing is a query. "What is Sally's current location?"Queries can be inconsistent during partitions. This is what eventual consistency means. But once the partition clears, the queries will be consistent again.The data is immutable. Both "Sally lives in Atlanta as of time X" and "Sally lives in Chicago as of time Y" are true.This is far, far different from databases based on mutable state. In order for a database like that to be eventually consistent, you need to do read-repair to enforce consistency. This is besides the other problems I brought up with mutable databses, such as a lack of human fault-tolerance.
 So you would still have to deal with the case of someone reading inconsistent data and taking wrong action as a result. If that action is only internal to the system you could go on and do a cleanup when the system becomes consistent again. If the action is external you can not.
 Right. If full consistency is a requirement than you can still have it, at the cost of availability. Alternatively, since the dataset is immutable, it contains a history of everything that happened. So you can resolve problematic actions later on (this is similar to what banks do).It's important to realize that the tradeoff between consistency and availability is a limitation of nature, not of our tooling.
 So CAP is still a problem, not beaten.
 If I only have one antique pocket watch and, during a partition, Alan buys the pocket watch while communicating with partition A and Bob buys the pocket watch while communicating with partition B, how do we get back to a consistent state without application code?
 A good example to look at is banks. Banks are eventually consistent systems. An ATM allows you to withdrawal funds (to a set limit) even if it can't communicate with the bank. However, because banks keep full audit logs of all transactions (immutable data), they eventually discover that you took out too much money and charge you an overdraft fee.
 This isn't really relevant to the pocket watch problem. This only works because the bank doesn't care about over-committing a finite resource (mostly because they can charge you that fee). My company processes transactions for prepaid credit cards, so any money that is overdrawn is essentially lost - it's important to understand the characteristics of the problem and that there really is no magic anti-CAP bullet.
 Exactly right. The bank example just demonstrates an alternative approach to full consistency.
 This is a use case where it sounds like you want full consistency, so in the realtime layer you would use a database that becomes unavailable during partitions.
 I think that most database applications cannot be seen as a monotonically-growing collection of facts, but as a sequence of operations, and those operations don't necessarily commute.Most of the operations do commute, or almost commute--there are edge cases involving, for example, balances or inventory falling to zero, and with side effects, duplicates, generated ids, timestamps, etc. I think it's difficult for these to be handled automatically, because of semantic issues. For side effects, there have to be compensating actions--charging back credit cards for orders not filled, for example, or sending an email saying, sorry, you're not actually getting the watch. For operations that don't commute, having a batch system isn't going to be adequate.For actions that do commute, I don't see how having a batch system is necessary. It just means having a third opinion of what the value should be. Unless you have a define down-time, you're introducing more consistency issues.Also, "online" (as in OLTP), rather than "realtime" is more consistent with standard DBMS terminology.
 the database does it when partition ends. Its called eventual consistency. The database would use something along the lines of vector clocks.
 The problem is that in this case the database can only apply some ad-hoc heuristic. In the case of the pocket watch, this will be: the first user to buy it gets the watch, and the second user gets annoyed by an email saying that "the watch we said you bought was actually bought by someone else". There's no magic bullet here - this may be acceptable for some use cases but will not be for others.
 >Both "Sally lives in Atlanta as of time X" and "Sally lives in Chicago as of time Y" are true.>This is far, far different from databases based on mutable state. In order for a database like that to be eventually consistent, you need to do read-repair to enforce consistency.what databases you're talking about? any specific example?
 Dynamo, Riak, Cassandra
 so this is more complicated than your schema ? :http://wiki.basho.com/Replication.html#Read-Repair"Read repair occurs when a successful read occurs — that is, the quorum was met — but not all replicas from which the object was requested agreed on the value. There are two possibilities here for the errant nodes:1. The node responded with a “not found” for the object, meaning it doesn’t have a copy.2. The node responded with a vector clock that is an ancestor of the vector clock of the successful read.When this situation occurs, Riak will force the errant nodes to update their object values based on the value of the successful read."
 Yes. First of all, not every algorithm is amenable to read-repair. Imagine, for example, storing a unique count in the database. There's no way to know how to combine divergent values in that case. (If the root value is 4, and you have two divergent values of 5, you have no idea if the increment was due to the same element or not. The right answer is either 5 or 6, but you have no idea).More importantly, if you make a mistake, you corrupt the database. The system I described based on immutable data is human fault-tolerant, which is a critical property. If you mess up, you can always correct things.
 >Imagine, for example, storing a unique count in the database. There's no way to know how to combine divergent values in that case. (If the root value is 4, and you have two divergent values of 5, you have no idea if the increment was due to the same element or not. The right answer is either 5 or 6, but you have no idea).if 2 nodes are allowed to accept writes for the same "cell" independently without synchronization, ie. node A : 4->5, node B : 4->5->6 how your schema would work in this case? (of course any schema would work fine if only one node allowed to master the "cell" )
 I think the point here is that nodes don't accept these random writes; any error that's introduced into a system with this structure is fixed on recompute.
 You use the timestamp to resolve the two values to see whether 5 or 6 is the latest.Your system would have the same problem to resolve which data is the latest.
 In his system you neither partition would have written "5" or "6". Rather, the one on the left would have written a "+1", and the one on the right would have written a "+1", and you can tell whether these are the same "+1" or not. You only combine them when you do the query.
 One is +1 (to 5) and the other one is +2 (to 6). Which one is the correct one?
 This is the key point. You either set your writes to be synchronous or asynchronous, depending on whether you want consistency or high availability. The point of having no updates means that you have less possibility for conflicts when sites A and B are appended to simultaneously. This means robust replication even with async writes.
 You add a version number or timing information to each update. Node A says Atlanta, with version 42. Node B says Chicago, with version 41. 42 > 41, so you go with Atlanta.
 when you have fault in the system so you can't access node A, node B returns Chicago with version 41 and you proceed with it if your system choice is Availability at the price of Consistency, or you wait for the node A if you choice Consistency at the price of Availability.
 OP adds confusion by calling data immutable. Data has a well accepted common usage of being updatable. Event is the term what OP wants. Event is a snapshot of fact happened in time and cannot be changed and it's immutable.What OP described in general is the append-log based approach to storing and querying data. There are a whole bunch databases and file systems built on the append-log idea. It's well worth the time to check out the research in this area to see the benefits and pitfalls of the approach.Also check out the Complex Event Processing topic which bears similarity to what OP described.
 Sorry, no. It is calling data mutable that is the confusion.Try this exercise:Identify something you consider a datum. Write it down. Note that it has more than one part. Now imagine it 'changing' and write that datum down. Now, identify the 'it' that (you would say) changed. Note that the 'it' part is the same in the 2 datum. Let's call this 'it' part the identity. Now look at the part that is different. For discussion's sake let's say that part was 42 in the first case and 99 in the second. Did 42 change into 99, or was it replaced by it? Let's call 42 and 99 values. You'll not be able to describe something that 'changed', that incorporates both the identity and value, that doesn't involve some memory/disk/system that had a (just one!) 'place' for storing information about 'it', and it is that place that has had it's contents replaced/changed, not the datum/fact.Did Obama becoming president change the fact that Bush had been president? No. Could you make an information system that forgot Bush was president? Sure, but that's your fault, not a characteristic of data.Finally, check out this 30 year old paper that contains this interesting section: UPDATE IN PLACE: A poison apple?http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.108....As long as we continue to use lax language like mutable/updateable data we will be without the tools to describe, think about, and build better systems that independently manipulate identity, value and place, and correctly handle time.
 I'm sorry. I don't understand the point of the exercise.First, what kind of data are you talking about? Are you talking about data value? Data record? Data object with identity?Second, are you saying database records being mutable cannot model Bush was a president once Obama has become president?I don't know the reason for the aversion to mutable data record/object. Both mutable and immutable data record/objects have their places in modeling real world objects.
 > I don't know the reason for the aversion to mutable data record/object.I've found that mutable data is a root of all kinds of evil. In fact, I'd guess that most bugs stem from a mis-management of mutable state.I know that pretty much all of the bugs I see every day are the result of failure to coordinate state changes, or poorly-designed state changes.
 Can you give some specific examples of bugs stemming from mismanagement of mutable state? Some examples on bugs resulted from the failure to coordinate state changes or poorly-designed state changes? And how the immutable data solve the problems?
 I'm not sure why you don't believe me :)There are literally too many to list. But if I just had to pick some recent ones:* A dropdown on a web form that displays a different value than is returned (failure to coordinate the displayed value and the "actual" value)* Logins remain locked after resetting a password (failure to change a value from "locked" to "not locked", instead of "is the user locked?" being a function of history)* If you navigate to a business object from a list, and perform an action (via HTTP POST) on that object, you can accidentally perform the same action on other objects from that list through refreshing or clicking multiple times (since the "current" object is in the session state)* Bugs that only happen when you navigate to a page from a certain other page (because the destination page depends on certain session state that was not set, instead of being a simple function of URL parameters)
 Not that I don't believe you. It's just that I want to learn from specific examples on what kind of problems people encounter when dealing with mutable data and how would immutable data solve the problems.So how would immutable data solve the above problems? Thanks.
 > OP adds confusion by calling data immutable. Data has a well accepted common usage of being updatable.I find this point to be generally contentious. Certainly in the realm of programming languages, data is often held immutable. I think mutable vs. immutable data is an important distinction with wide-ranging implications, so I find being explicit here adds clarity rather than confusion.
 I agree. Saying that data must be updatable is a truism. Of course the vast majority of systems must have updatable data, as that is a fundamental building block of common-sense, real-world reasoning—to the end user data is always mutable. But to the systems designer, thinking about data being immutable and reasoning from there allows deeper thinking that takes us to more interesting architectures.
 The key insight is that an "update" is really a query getting the most recent data record for some entity/type.
 Like I said, I was going to break your assumptions on data systems.Treating data as mutable (whether you use MVCC or append logs or any other approach) leads to major complexity problems, as I detailed in the post.
 Redefining the meaning of general term just creating more confusion about the point you want to make. Why not use precise definition.The term "data" is really really general and it's bad to convey what you really mean. Are you talking about data value? Data object? Data record? Data version? Data event?Quoting from your blog, "A piece of data is a fact that you know to be true at some moment of time." That's called an Event. That's a snapshot of fact in time.MVCC is one way to do transaction. The blog didn't deal with transaction at all so I don't see the comparison.Append log-based databases do the closest to what you described in avoiding changing data in place and recompute/precompute data at query time. It's a very simple mechanism. What complexity problem do you see with it?
 If I am not completely mistaken, the pattern that's being hinted is called Command-Query Responsibility Segregation(CQRS) in certain circles:http://cqrsinfo.com/documents/cqrs-introduction/While making the problem easier, I do not know if this actually "solves" the CAP theorem issues.
 At a high level, CQRS is simply separating your reads from your writes. It follows the observation that most web-related DB queries are in the form of reads and much fewer writes. I don't know about solving the CAP theorem, but it does a very nice job of scaling horizontally.
 In practice, beating the CAP theorem inside the data center is achieved by using a 2-tiered system.Have a set of lightweight controller processes which use a consistent majority voting algorithm (Paxos) to replicate the state of the cluster. Since it's lightweight, you can run 5 or 7 of them, meaning 2 or 3 can die. This is enough Availability in practice, per customer input.Have a set of data nodes which actually store the data. Put groups of 3 of these in replica sets and use consistent replication (Paxos). If one or two of them goes down, the controllers reconfigure the replica set, so you get Availability even if 2 of 3 goes down. (Customers love that they get Consistency but 2 of 3 can go down.)This scheme is used in my company's product. The idea comes from academic/Google papers, and is used in other products, too.The scheme does not work for multi-datacenter use-cases. If you want that, you have to give up consistent replication (Paxos) inside the replica sets and use some kind of eventual consistency. This variation of the 2-layered scheme is used by one of the popular competing NoSQL systems.
 >Beating the CAP theorem inside the data center is fairly easy in practice by using a 2-tiered system.>Have a set of lightweight controller processes which use a consistent majority voting algorithm (Paxos, Consistency, Partition Tolerance) to replicate the state of the cluster. Since it's lightweight, you can run 5 or 7 of them, meaning 2 or 3 can die. This is more than enough Availability in practice, per customer input.it isn't beating of the CAP theorem. It is decreasing the Partition Tolerance and thus getting more Availability while preserving Consistency in full accordance with CAP.
 I wrote my thesis on an immutable data store as a means to beat CAP. I didn't get quite as far as your other observations though, as mine was in the context of continuation web frameworks.Very nice write-up however.
 It's not in a state I would like to make public yet. Sufficient to get my degree, but the quality is poor due to personal circumstances. However I sent you a link via your contact form.
 Beating the CAP theorem by proofing it by dropping the requirement for consistency? Sorry to be conservative and following Aristotelean logic, but this title does not make sense.
 Indeed, it's not quite "beating". "How to get full availability and eventual consistency (the best you can do under CAP) with no effort in the application layer" might have been a truer title. Nevertheless, this is something useful.
 A few days ago I posted my frustration with Cassandra counters, this article articulates a solution elegantly:'if you make a mistake or something goes wrong in the realtime layer, the batch layer will correct it.'The resiliency of this protection however is limited by the time it takes to do a full re-processing of the historical data. If there is a processing mistake in your real-time layer, it is likely to exist in your batch layer as well, thus invalidating the pre-computed results as well.
 But you can always correct the mistake and recompute things. Unlike systems based on mutable data, mistakes aren't permanent.
 If the transaction log is kept for the mutable system, you can always replay the transaction to arrive at the correct state. Most systems truncate the transaction log after a checkpoint to save space. If space is not an issue, you can always keep the transaction log since the beginning.
 How does the system deal with real-world data that has cross-datum consistency requirements? Many system mnodeling real-world have this issue: accounting, inventory, ...I'm also annoyed by the repeated mantra that you must have partition tolerance. In fact, most naïve systems are designed without it, with some central authority that must always be accessible. (Any system that is stricly a tree with an essential root node is like that. Any partition that does not include the root node cannot do any work, thus the system is not partition tolerant.) You can then have CA (by requiring that all requests and updates must propagate to the root.)
 You then do not have the A part of CA. Your root node is not accessible => your entire system is not accessible.
 The system computes functions on all the data you have. That describes every possible data system.In the realtime layer where you compensate for the last few hours of data, you'll need an incremental strategy to maintain that "cross-datum" consistency.
 'How to tolerate CAP, according to application / system specifications..'
 Great write up. You really nail the essence of a true data driven approach (DDA) to solving problems. Simplification of the data world view can lead to great things. The key to your batch computation example is proper data modeling IMO. This leads to flexibility which is crucial for a design to be successful. I feel people will stumble with the content of this post because of the lack of proper understanding of DDA.
 This looks like Event Sourcing / CQRS. Is that correct?
 There are similarities. It's really closer to functional programming than anything else.
 Nice post! To me, the logical next step is a mathematical framework for proving that all major relational operators (and in partition intolerant mode, ACID) can be implemented using this approach -- if such a model can be cleanly laid out it could drive the future of database systems.