The Dynamo SOSP paper had two goals: 1) show how systems are a composition of techniques, and how all of these need to work together to build a production system; 2) given that it was based on a variety of research results, it was intended to give feedback to the academic community about the difficulties of moving from research results to production, and what matters in real-life production versus in research.
The paper was never intended to be a complete blueprint for easy design of follow-up systems. There just isn't enough room in an academic paper to do justice to that. If it was going to play any role there, the best we could hope for was that it would serve as a collection of points you would have to think hard about, and make decisions about, when designing your own storage engine.
That doesn't mean I agree with the conclusions of the analysis; on the contrary, I think they are seriously flawed as well (as with everything else that is not absolutely perfect :-)). But I can understand that when people expect the Dynamo paper to be a blueprint that solves all their storage needs and provides a perfectly available service under all failure scenarios (and solves world peace), they may be left with a few questions afterwards.
I always thought that the real contribution of the paper was that it made you think hard about the trade-offs you face when you have to design highly available, ultra-scalable systems that are cost-effective and provide guaranteed performance. Five years later we know a lot more, but this stuff is still hard, and we still need to balance rigorous principles with production magic to make it work. But you do need to fully understand the principles before you can make the production trade-offs.
Caveat: some of these remarks were tongue-in-cheek fun; I leave it to the reader which ones :-)
Here's the core misunderstanding:
"It is hinted that by setting the number of reads (R) and number of writes (W) to be more than the total number of replicas (N) (ie. R+W>N) - one gets consistent data back on reads. This is flat out misleading. On close analysis one observes that there are no barriers to joining a quorum group (for a set of keys). Nodes may fail, miss out on many many updates and then rejoin the cluster - but are admitted back to the quorum group without any resynchronization barrier."
That's because adding a barrier (a) isn't necessary for the consistency guarantee, (b) doesn't add any extra safety in the worst case, and (c) adds complexity (always be suspicious of complexity!)
Consider the simple case of a 3-node cluster (A, B, C) with N=3.
For quorum reads and writes, R = W = 2.
Then reads will be consistent if any two nodes are up. It doesn't matter if nodes A and B are up for the write, then B and C are up for the read -- the reader will still always see the latest write.
Of course you will have to block writes and reads if more than one node is down for either operation. This is what the Dynamo paper means when it talks about allowing clients to decide how much availability to trade for consistency.
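To make the overlap argument concrete, here's a tiny sketch (not Dynamo code, just the pigeonhole argument in Python): with N=3 and R=W=2, every possible read quorum intersects every possible write quorum, so at least one node contacted on a read has seen the latest write, and version reconciliation returns it.

```python
from itertools import combinations

# Pigeonhole check for R + W > N: every write quorum of size W shares
# at least one node with every read quorum of size R.
N, R, W = 3, 2, 2
nodes = ["A", "B", "C"]

for write_quorum in combinations(nodes, W):
    for read_quorum in combinations(nodes, R):
        overlap = set(write_quorum) & set(read_quorum)
        assert overlap, "impossible when R + W > N"
        print(f"write={write_quorum} read={read_quorum} overlap={sorted(overlap)}")
```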
The rest of the article is basically variations on this theme. (The only way to provide strong consistency is to "read from all the replicas all the time"!? Wrong, wrong, wrong.)
"Using hinted handoff, Dynamo ensures that the read and write operations are not failed due to temporary node or network failures. Applications that need the highest level of availability can set W to 1, which ensures that a write is accepted as long as a single node in the system has durably written the key it to its local store."
So, if W < N, then N - W writes can be written using hinted handoff to nodes other than the final destination, and forwarded when the natural nodes come back online. BUT the W writes specified by the requested consistency level do have to go to the right nodes, or everything falls apart.
This is the implementation that makes the most sense when taken with the rest of the system design, so even though it's not quite spelled out I'm pretty sure that's how it works. (I _am_ sure that's how Cassandra's consistency levels work, as possibly the highest-profile eventually consistent open source database. :)
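To spell out the behavior I'm describing, here's a sketch of my reading (made-up class and function names; it mirrors how Cassandra treats hints, not necessarily Dynamo's exact code): hinted writes land on stand-in nodes, but they don't count toward W.

```python
class Node:
    """Toy replica for illustration only."""
    def __init__(self, name, up=True):
        self.name, self.up, self.data, self.hints = name, up, {}, []

    def store(self, key, value):
        self.data[key] = value

    def store_with_hint(self, key, value, intended):
        # Keep the write plus a hint naming the natural replica,
        # so it can be forwarded when that replica comes back.
        self.data[key] = value
        self.hints.append((intended.name, key, value))

def sloppy_write(key, value, natural, spares, W):
    # W acks must come from the natural replicas; down replicas get
    # their copies via hinted handoff to healthy stand-ins, but those
    # hinted copies do NOT count toward W.
    acks, standby = 0, iter(n for n in spares if n.up)
    for node in natural:
        if node.up:
            node.store(key, value)
            acks += 1
        else:
            next(standby).store_with_hint(key, value, intended=node)
    if acks < W:
        raise RuntimeError("fewer than W natural replicas up: reject the write")
    return acks

# Natural replicas A, B, C with B down; D sits further along the ring.
a, b, c, d = Node("A"), Node("B", up=False), Node("C"), Node("D")
sloppy_write("cart:42", {"book"}, natural=[a, b, c], spares=[d], W=2)
print(d.hints)  # [('B', 'cart:42', {'book'})] -- D holds B's copy plus a hint
```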
The paper doesn't say this. The passage you cited explicitly says that if W=1 any node in the whole system could take the write, contradicting your statement.
Even then, it wouldn't make sense to apply your rule. If W writes have to happen on the first N nodes in the preference list, then I wouldn't need sloppy quorums at all.
No, it says "a single node," not "any node." The distinction is important: I'm saying the former means "a single [real destination] node."
> If W writes have to happen on the first N nodes in the preference list, then I wouldn't need sloppy quorums at all.
Hinted handoff is there to make "eventually" happen faster for clients who _don't_ specify R + W > N. Not to magically make W not matter for clients who do, because clearly that doesn't work. (And if the authors didn't understand that, why bother discussing W at all? It doesn't make sense.)
But the paper explicitly says: "Applications that need the highest level of availability can set W to 1, which ensures that a write is accepted as long as a single node in the system has durably written the key it to its local store. Thus, the write request is only rejected if all nodes in the system are unavailable." The last sentence makes clear that they mean any single node in the whole system.
Additionally the paper states: "all read and write operations are performed on the first N healthy nodes from the preference list, which may not always be the first N nodes encountered while walking the consistent hashing ring." There is no mention that at least W of the first N nodes have to accept a write.
If you want to keep arguing that Dynamo's HH ignores the W value they spend a lot of effort explaining, fine, maybe you're right; maybe Dynamo's authors really are a bunch of idiots. Nobody outside Amazon can say for sure.
But Cassandra doesn't ignore W, and neither does any other actual system I know of: HH is just for the N - W nodes. Because that's the only way to implement it that makes sense. :)
i understand you are one of the core committers for Cassandra. It speaks volumes about the design when a person so familiar with the code base cannot speak accurately of it (to say nothing of all those open source freeloaders!).
i would be wary of marketing a commercial service based on such consistency protocols.
I think this is based on a flawed application of metrics from other organizations that do not match Amazon's actual business needs.
The author talks about bank transaction systems with five nines of availability. Sure, you can build a centralized system with a critical core that has near-zero downtime - usually at great operational cost - but there's not a bank in the world that's actually 99.999% available to customers over the internet. Most retail banks I've used take several hours of outages per month for maintenance on their web services.
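For scale, some standard uptime arithmetic (nothing bank-specific about it): five nines allows only minutes of downtime per year, while a few hours of maintenance per month is roughly two-nines territory.

```python
# Standard uptime arithmetic: what each availability figure allows.
minutes_per_year = 365.25 * 24 * 60

five_nines = minutes_per_year * (1 - 0.99999)
print(round(five_nines, 1))       # 5.3 minutes of downtime per YEAR

two_nines = minutes_per_year * (1 - 0.99) / 12
print(round(two_nines / 60, 1))   # 7.3 hours of downtime per MONTH
```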
Amazon wants its systems not just to keep running but to keep taking orders from millions of people all over the world. This means that they are concerned with the reliability not just of a storage server, but everything needed to connect it to its application servers and to customers. They have found that the cost-effective way to do this is to distribute every component across geographically separated datacenters. Amazon has remained available through real and simulated datacenter-wide outages ranging from power/cooling failures to floods, fire, and hurricanes. No Amazon system lives within a single building, much less a single network switch or rack.
Finally, although the formally provable aspects of Dynamo's "eventual consistency" guarantee may be vague, any team operating a distributed system will study and understand the actual operational characteristics in practice, under both normal conditions and various failure modes. Some systems may have realistic failures that lead to days or hours of inconsistency (in which case the team has deliberately chosen this and will write client software to be aware of it); others might be tuned to achieve consistency within milliseconds under normal operation and to set off alarms within seconds after a failure. I've never known any that would be used in circumstances where "human lifetimes" are a relevant time period.
The most often quoted example of Dynamo's use is the shopping cart application on every Amazon page. In the worst case, your shopping cart will mysteriously empty itself. This is a huge pain, and a potential loss for Amazon, but it's not catastrophic in the way that is implied here. Indeed, assuming the liveness of a quorum, the application will read back all conflicting entries for the shopping cart (those that aren't ordered under their vector clock timestamps) and the onus is on it to merge the conflicts. Of course, the shopping cart will take the union of all updates to ensure that nothing is dropped (and therefore some delete operations may be lost).
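A rough sketch of those merge semantics (illustrative Python, not Amazon's actual cart code): conflicting versions are merged by set union, so concurrent adds survive, but a delete that only one branch saw gets resurrected.

```python
# Conflicting cart versions come back from a read; the application
# merges them by union, so adds survive but deletes can be undone.
def merge_carts(conflicting_versions):
    merged = set()
    for cart in conflicting_versions:
        merged |= cart
    return merged

# Start from {"book", "cd"}: one replica saw a delete of "book",
# a concurrent write on another replica added "dvd".
branch_a = {"cd"}
branch_b = {"book", "cd", "dvd"}
print(merge_carts([branch_a, branch_b]))  # {'book', 'cd', 'dvd'}: the delete is lost
```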
The key point is that some applications can do without observing a linearisable history, and the interest of this paper is that it explores the design space if you drop that requirement.
I don't understand the post's points about CAP; all three requirements are in tension. Dynamo is unusual in that it is live in the case of a network partition while still maintaining its consistency guarantees.
Similarly - those systems that use chain replication asynchronously, as he describes, can still suffer from the same read-old-value-after-it-was-written consistency issue, if the reader jumps between two replicas for consecutive reads. Avoiding that can require synchronous coordination of updates (via Paxos, for example), which is, I think, what the paper is driving at. Otherwise, there are still failure modes that, in order to patch up, require stronger guarantees about quorum liveness than Dynamo needs.
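A toy model of that failure mode, for concreteness (made-up names, with asynchronous replication reduced to an explicit pending queue):

```python
# Asynchronous replication in miniature: the head acks a write before
# the tail applies it, so a reader hopping between replicas can see
# the new value and then the old one.
head, tail, pending = {"x": 1}, {"x": 1}, []

def write(key, value):
    head[key] = value              # acked immediately at the head
    pending.append((key, value))   # the tail will apply this later

def drain_replication():
    while pending:
        key, value = pending.pop(0)
        tail[key] = value

write("x", 2)
print(head["x"])    # 2 -- first read hits the head
print(tail["x"])    # 1 -- second read hits the tail: old value after new
drain_replication()
print(tail["x"])    # 2 -- consistent, eventually
```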
I understand that Dynamo is no longer used internally at Amazon at scale, so maybe some of the practical points this post makes about the realities of central coordination held water for real deployments. Still, I don't buy the reaction that prioritising availability uber alles and designing a system that does not behave exactly like a strongly-consistent key-value store immediately invalidates it for workloads that have high availability requirements and lower consistency needs.
That's certainly relevant. I'd love to know more. Do you know why? What are they doing instead?
regarding CAP - CA is achievable in a single data center (where, as i argue, one cannot tolerate partitions anyway). the problem with Dynamo is that the overheads incurred in tolerating partitions are imposed on environments that do not suffer from partitions.
you are right that point-in-time consistency is no good for partition tolerance. i didn't mean to say that it was. all i wanted to point out was that very few, if any, commercial database deployments do concurrent writes across continents (and do synchronous replication across the same). cross-data-center replication is used for disaster recovery - and at best analytics - where point-in-time consistency is just fine.
Also, his treatment of partition tolerance assumes that the operations team has set up the Dynamo cluster machines incorrectly: all in the same rack.
He also completely glosses over Merkle trees, assuming that the sync they help perform must be a heavy operation.
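For reference, a minimal sketch of why Merkle-tree anti-entropy is cheap when replicas mostly agree (illustrative Python, not Dynamo's or Cassandra's actual code): peers compare hashes top-down and only descend into subtrees whose hashes differ, so the work is proportional to the divergence, not to the size of the dataset.

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def build_tree(leaf_hashes):
    """Merkle tree as a list of levels: leaves first, root last."""
    levels = [leaf_hashes]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([h(prev[i] + prev[i + 1]) if i + 1 < len(prev) else h(prev[i])
                       for i in range(0, len(prev), 2)])
    return levels

def differing_leaves(t1, t2):
    """Descend only where hashes disagree; return mismatched leaf indices."""
    suspects = [0] if t1[-1][0] != t2[-1][0] else []
    for depth in range(len(t1) - 2, -1, -1):
        suspects = [c for i in suspects for c in (2 * i, 2 * i + 1)
                    if c < len(t1[depth]) and t1[depth][c] != t2[depth][c]]
    return suspects

# Two replicas of 8 keys that disagree on exactly one value:
keys = [f"key{i}".encode() for i in range(8)]
replica_a = [h(k + b"v1") for k in keys]
replica_b = list(replica_a)
replica_b[5] = h(keys[5] + b"v2")
print(differing_leaves(build_tree(replica_a), build_tree(replica_b)))  # [5]
```

Only the key range under the mismatched leaf needs to be streamed; everything else is pruned at the first level where the hashes agree.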
@jbellis - i hope the responses to your comment have convinced you about the problems with the Dynamo quorum scheme/read-write protocols. i can confirm that the problems i described definitely do exist in Cassandra. Jun Rao made this point on the Cassandra public mailing lists a long time back, and i have a pointer to the JIRA that he filed in my post as well.
regarding Dynamo being an interesting design space for the academic community etc. - that may very well have been the case, but the reality of the situation is that the world is now overflowing with Dynamo clones, with people considering them for all kinds of uses. Hey - if it was good for Amazon (and Facebook and LinkedIn) - it's probably good for me! The people trying to use Dynamo clones do not understand all the small details. They don't understand which applications would be safe to write on them and which would not. I hope my posts (imperfect and opinionated, undoubtedly) will provide a counterpoint to this sentiment and make users think harder and deeper before they make the leap.
i was also hoping to trigger a discussion (looks like i succeeded). i hope that it takes us to a better design space than what exists currently.