Basically, the author doesn't understand quorum protocols and draws a bunch of erroneous conclusions because of that.
Here's the core misunderstanding:
"It is hinted that by setting the number of reads (R) and number of writes (W) to be more than the total number of replicas (N) (ie. R+W>N) - one gets consistent data back on reads. This is flat out misleading. On close analysis one observes that there are no barriers to joining a quorum group (for a set of keys). Nodes may fail, miss out on many many updates and then rejoin the cluster - but are admitted back to the quorum group without any resynchronization barrier."
That's because adding a barrier (a) isn't necessary for the consistency guarantee, (b) doesn't add any extra safety in the worst case, and (c) adds complexity (always be suspicious of complexity!)
Consider the simple case of a 3-node cluster (A, B, C) with N=3.
For quorum reads and writes, R = W = 2.
Then reads will be consistent if any two nodes are up. It doesn't matter if nodes A and B are up for the write, then B and C are up for the read -- the reader will still always see the latest write.
Of course you will have to block writes and reads if more than one node is down for either operation. This is what the dynamo paper means when it talks about allowing clients to decide how much availability to trade for consistency.
The rest of the article is basically variations of this theme. (The only way to provide "strong consistency is to read from all the replicas all the time"!? Wrong, wrong, wrong.)
I think you're missing the fact that Dynamo uses sloppy quorums (see section 4.6 in the paper), i.e. writes don't have to happen on the first N members of the preference list. For example, if W=3 and the first 3 nodes in the preference list are down, the write would be taken by the next 3 nodes in the preference list. Now if the first 3 nodes come back online, the system would be in an inconsistent state.
The Dynamo paper is very information-dense, so you have to do a little reading between the lines. Here's the relevant quote from 4.6:
"Using hinted handoff, Dynamo ensures that the read and write operations are not failed due to temporary node or network failures. Applications that need the highest level of availability can set W to 1, which ensures that a write is accepted as long as a single node in the system has durably written the key it to its local store."
So, if W < N, then N - W writes can be written using hinted handoff to nodes other than the final destination, and forwarded when the natural nodes come back online. BUT the W writes specified by the requested consistency level do have to be to the right nodes or everything falls apart.
This is the implementation that makes the most sense when taken with the rest of the system design, so even though it's not quite spelled out I'm pretty sure that's how it works. (I _am_ sure that's how Cassandra's consistency levels work, as possibly the highest-profile eventually consistent open source database. :)
> BUT the W writes specified by the requested consistency level do have to be to the right nodes or everything falls apart.
The paper doesn't say this. The passage you cited explicitly says that if W=1 any node in the whole system could take the write, contradicting your statement.
Even then, it wouldn't make sense to apply your rule. If W writes have to happen on the first N nodes in the preference list, then I wouldn't need sloppy quorums at all.
> any node in the whole system could take the write
No, it says "a single node," not "any node." The distinction is important: I'm saying the former means "a single [real destination] node."
> If W writes have to happen on the first N nodes in the preference list, then I wouldn't need sloppy quorums at all.
Hinted handoff is there to make "eventually" happen faster for clients who _don't_ specify R + W > N. Not to magically make W not matter for clients who do, because clearly that doesn't work. (And if the authors didn't understand that, why bother discussing W at all? it doesn't make sense.)
> No, it says "a single node," not "any node." The distinction is important: I'm saying the former means "a single [real destination] node."
But the paper explicitly says: "Applications that need the highest level of availability can set W to 1, which ensures that a write is accepted as long as a
single node in the system has durably written the key it to its local store. Thus, the write request is only rejected if all nodes in the system are unavailable.". The last sentence makes clear that they mean any single node in the whole system.
Additionally the paper states: "all read and write operations are performed on the first N healthy nodes from the preference list, which may not always be the first N nodes encountered while walking the consistent hashing ring.". There is no mention that at least W of the first N nodes have to accept a write.
You're missing the forest for the trees. You're taking a pedantically literal view of one sentence and saying "see? it can't work!" when that interpretation of the sentence in question makes no sense at all when taken with the paper as a whole.
If you want to keep arguing that Dynamo's HH ignores the W value they spend a lot of effort explaining, fine, maybe you're right; maybe Dynamo's authors really are a bunch of idiots. Nobody outside Amazon can say for sure.
But Cassandra doesn't ignore W, and neither does any other actual system I know of: HH is just for the N - W nodes. Because that's the only way to implement it that makes sense. :)
cx01 has already pointed the problems in your arguments. unfortunately i understand quorum protocols (they are so simple!) and you are dead wrong. because of hinted handoffs, lack of central commit log and lack of resync barrier - transient failures will lead to stale reads being returned to clients.
i understand you are one of the core committers for Cassandra. It speaks much for the design when a person so familiar with the code base cannot speak accurately of it (what to talk of all those open source freeloaders!).
i would be wary of marketing a commercial service based on such consistency protocols.