

Doozer: a consistent, highly-available data store from Heroku labs - edwardog
http://xph.us/2011/04/13/introducing-doozer.html

======
benblack
(Reposting my blog comment here)

"I mean primarily the ability to face a network partition and continue
providing write service to all clients transitively connected to a majority of
servers. Secondarily, I mean the ability to continue providing read-only
service to all clients connected to any server."

A truly available system, in the sense of CAP, allows writes, not just reads,
even under partition, even for clients not connected to the "majority" nodes.
This leads inevitably to the need for conflict detection and resolution, and
that whole "eventually consistent" thing. What you are describing is very
useful, hence the existence of things like Chubby and ZK, but is most
definitely not "available", per CAP.

Folks might also be interested in the classic paper on constructing locks from
low level primitives like CAS, Herlihy's 'Wait-free Synchronization"
[http://citeseerx.ist.psu.edu/viewdoc/download;jsessionid=A01...](http://citeseerx.ist.psu.edu/viewdoc/download;jsessionid=A01FADDC263D88A280509C0D8BF86C6F?doi=10.1.1.87.871&rep=rep1&type=pdf)
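To make the Herlihy reference concrete: a lock is one of the simplest objects you can build from compare-and-set. Here is a toy sketch (mine, not from the paper or from doozer) of a spinlock whose only synchronization primitive is a CAS operation; the inner `threading.Lock` just stands in for the atomic hardware instruction Python lacks.

```python
import threading

class CASLock:
    """Illustrative spinlock built from compare-and-set.
    A sketch only: the _guard lock emulates the atomicity that a
    hardware CAS instruction would provide natively."""

    def __init__(self):
        self._cell = False          # False = free, True = held
        self._guard = threading.Lock()

    def _cas(self, expected, new):
        # Atomically: if the cell holds `expected`, store `new`
        # and report success; otherwise leave it alone.
        with self._guard:
            if self._cell == expected:
                self._cell = new
                return True
            return False

    def acquire(self):
        while not self._cas(False, True):
            pass                    # spin until we win the CAS

    def release(self):
        self._cas(True, False)
```

The point of the paper is that CAS is universal: any number of threads can coordinate through it, which is why a data store exposing CAS (as doozer does) is such a useful low-level building block.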

------
jdf
There are a number of other systems that offer this same combination of
consistency and high availability. For example, Cassandra, which is freely
available (as required by the poster), appears to give you this behavior if
you set ConsistencyLevel to QUORUM.
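For readers unfamiliar with why QUORUM gives this behavior: with N replicas, if a write must be acknowledged by W nodes and a read consults R nodes, then R + W > N guarantees every read quorum overlaps every write quorum, so a quorum read always touches at least one replica holding the latest quorum write. A quick sanity check of that arithmetic (my own illustration, not Cassandra code):

```python
from itertools import combinations

N = 5
nodes = range(N)
W = R = N // 2 + 1  # QUORUM: a majority for both reads and writes

def quorums(size):
    """All possible quorums of the given size."""
    return list(combinations(nodes, size))

# Because R + W > N, every read quorum shares at least one node
# with every write quorum.
assert R + W > N
assert all(set(w) & set(r)
           for w in quorums(W)
           for r in quorums(R))
```

Note (as leef points out below) that quorum overlap alone gives you read-your-quorum-writes, not the stronger atomicity that CAS needs.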

Clustrix, the company I work for, offers a full SQL data store with similar
quorum semantics. However, it's not free.

Google Megastore allows similar consistency semantics (with its own data
model) in a cross data center "cloud" fashion. It's also not free, but it
would probably be suitable for some set of Heroku customers, particularly if
they're already using Google App Engine.

~~~
leef
You actually cannot implement strongly consistent compare-and-set on a quorum
based system. This is because quorums do not have strong failure semantics.

For example, let's say you have key A with data 'foo' living on all 3 nodes of
a quorum-based system. You perform a write to key A with data 'bar'; one host
accepts the write and two fail, so your client gets a failure back. Reading
from all nodes still returns 'foo'. Time goes on, and read repair or some
other anti-entropy process kicks in and replicates 'bar' to the other hosts.
Now a read for key A returns 'bar'. In short, when a client performs a write
and gets a failure back, it actually has no idea whether the write will
eventually update the data store or not. You cannot implement compare-and-set
in that type of environment.
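The scenario above is easy to simulate. This is a toy model of my own (not any real system's code) showing the ambiguous failure: the client is told the write failed, yet anti-entropy later makes it win anyway.

```python
class Replica:
    """One node holding a (value, version) pair for a single key."""
    def __init__(self, value):
        self.value = value
        self.version = 0

def quorum_write(replicas, value, reachable):
    """Apply the write on reachable nodes; report success only if a
    majority acknowledged it."""
    acks = 0
    for i, r in enumerate(replicas):
        if i in reachable:
            r.value = value
            r.version += 1
            acks += 1
    return acks >= len(replicas) // 2 + 1

def read_repair(replicas):
    # Anti-entropy: propagate the highest-versioned value everywhere.
    newest = max(replicas, key=lambda r: r.version)
    for r in replicas:
        r.value, r.version = newest.value, newest.version

replicas = [Replica('foo') for _ in range(3)]

# Only node 0 is reachable, so the client is told the write failed...
ok = quorum_write(replicas, 'bar', reachable={0})
assert not ok

# ...yet after read repair, the "failed" write is on every node.
read_repair(replicas)
assert all(r.value == 'bar' for r in replicas)
```

A CAS built on such a store could observe and act on a value whose writer was told it never happened, which is exactly why you need consensus (Paxos) rather than plain quorums underneath it.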

~~~
krarick
That is not how paxos works. It guarantees that all hosts in the system learn
the same sequence of writes. The scenario where "one host accepts the write
and two fail" does not happen. This is proved in
[http://research.microsoft.com/users/lamport/pubs/paxos-
simpl...](http://research.microsoft.com/users/lamport/pubs/paxos-simple.pdf)

~~~
leef
Right, I was talking about the quorum semantics as used by systems like Dynamo
and Cassandra. Paxos provides a much stronger guarantee than simple quorums.

------
vicaya
"The only freely-available tool that attempts to provide both of these is
zookeeper, but zookeeper is specialized, focused on locking and server
management. We need a general-purpose tool that provides the same guarantees
in a clean, well-designed package."

I think they dismissed Zookeeper too quickly, without trying to understand it
first. The zookeeper primitives (get/set nodes and their children) seem
simpler and cleaner than doozer's client protocol, AFAICT. Zookeeper should
also scale better (especially for readers) than typical Paxos-based systems.
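For those who haven't used it, the ZooKeeper data model really is small: a tree of "znodes", each carrying data and a version number, where a conditional set keyed on the version gives you CAS. A rough in-memory sketch of those primitives (my simplification; real ZooKeeper adds watches, ephemeral nodes, ACLs, and sequence nodes):

```python
class ZNodeStore:
    """Toy in-memory model of ZooKeeper's znode primitives."""

    def __init__(self):
        self._nodes = {}  # path -> (data, version)

    def create(self, path, data=b''):
        self._nodes[path] = (data, 0)

    def get(self, path):
        return self._nodes[path]  # (data, version)

    def set(self, path, data, version=-1):
        # version == -1 means unconditional; any other value makes
        # this a compare-and-set keyed on the znode's version.
        _, cur_version = self._nodes[path]
        if version not in (-1, cur_version):
            raise ValueError('bad version')
        self._nodes[path] = (data, cur_version + 1)

    def get_children(self, path):
        prefix = path.rstrip('/') + '/'
        return sorted(p[len(prefix):] for p in self._nodes
                      if p.startswith(prefix)
                      and '/' not in p[len(prefix):])
```

That versioned `set` is the same compare-and-set shape doozer exposes, which is part of why the two tools overlap so much.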

~~~
leef
The problem with zookeeper is there is no safe way to dynamically change the
nodes in the cluster. That's where Paxos comes in since it is provably correct
even with nodes joining and leaving.

------
sigil
Cool. Is the consensus algorithm straight-up Paxos, or are there
modifications? The Paxos Made Live [1] paper from Google raised some
interesting issues.

[1] <http://labs.google.com/papers/paxos_made_live.html>

~~~
bmizerany
Yes, it is Paxos. We have a post coming up about it.

~~~
sigil
Great, looking forward to it.

------
besquared
What happens when you CAS the same key from two sides of a network partition
and the partitions rejoin?

~~~
benblack
As explained in the post, the minority side of the partition is not writable.
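The rule from the post reduces to simple majority arithmetic; a one-liner sketch (my paraphrase, not doozer's code) makes the consequence clear: in an even split, neither side can write.

```python
def writable(side_size, cluster_size):
    """Per the post: only the side of a partition holding a strict
    majority of the servers keeps accepting writes."""
    return side_size > cluster_size // 2

# 5-node cluster split 3/2: only the 3-node side stays writable.
assert writable(3, 5) and not writable(2, 5)

# 4-node cluster split 2/2: no side has a majority, so no writes anywhere.
assert not writable(2, 4)
```

This also means there is never a CAS conflict to resolve on rejoin: at most one side ever accepted writes.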

~~~
besquared
Ahh, I see, it's in the extra-small note at the bottom:

"I mean primarily the ability to face a network partition and continue
providing write service to all clients transitively connected to a majority of
servers."

That probably should be a little more prominent, considering it's one of the
first questions you'd want answered, but I was a bit careless to miss it.

