
Using Paxos to Build a Scalable, Consistent, and Highly Available Datastore - systems_we_make
http://www.systemswemake.com/papers/spinnaker
======
boskone
I was posting about a similar system around a year ago.

Ended up standardizing on clusters of only 3 nodes. Initially I wrote a
Paxos+Dynamo hybrid, but it was a dead end for our needs. I did not need a
massive data store, but a cheap, highly available / scalable enterprise service
that consistently gave the same answer, e.g. an eCommerce pricing engine. 3
commodity-box nodes is all you ever need. I've done over 500,000 prices per
second with ~9 unique ways of pricing ~475,000 SKUs, based off the equivalent
of 100s of millions of data rows.

Use IP multicast for cluster configuration and leader election. Much faster.
All a server needs configured to bootstrap into the cluster is a single
multicast IP. Opted not to use ZooKeeper, though it came out while I was
writing my system.
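Once discovery settles and every node has heard its peers announce themselves on the multicast group, the election rule itself can stay trivial: something deterministic like lowest-IP-wins names the same leader on every box with no further round trips. A rough sketch of that rule (the names and the lowest-IP convention here are illustrative, not my production code):

```scala
// Sketch: deterministic leader election over a discovered peer set.
// Assumption: after a multicast discovery window, every node holds
// the same set of peer IPs, so "lowest IP wins" picks the same
// leader everywhere without any extra coordination.
object LeaderElection {
  // Zero-pad each octet so dotted-quad IPs compare numerically
  // ("10.0.0.2" < "10.0.0.12", not the other way around).
  private def ipKey(ip: String): String =
    ip.split('.').map(o => f"${o.toInt}%03d").mkString(".")

  def electLeader(peers: Set[String]): String = peers.minBy(ipKey)
}
```

The point is that no messages beyond the discovery announcements are needed; every node computes the same answer locally.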

I tried a Dynamo/Cassandra approach for the updates but abandoned it early on
as too slow. I ended up using something more like LinkedIn's Kafka for the
"log" shuttling. In Kafka parlance: one topic with 3 partitions, one for each
node. Each node mirrors all three partitions but is the owner of 1. One other
difference is that I do push/push while Kafka does push/pull. This makes sense
as each node is both a broker for and a consumer of the other nodes.
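The per-node layout is easy to picture in code. A sketch (illustrative names, plain buffers standing in for the real log files): every node holds all three partitions, appends client writes only to its own, and receives pushed records from peers into their partitions.

```scala
import scala.collection.mutable

// Sketch: one topic, three partitions, each node mirrors all three
// but owns exactly one (its own node id). NodeLog is an illustrative
// stand-in, not the actual on-disk log implementation.
class NodeLog(val nodeId: Int, nodeCount: Int = 3) {
  // partition id -> append-only log of records
  private val partitions =
    Array.fill(nodeCount)(mutable.ArrayBuffer.empty[String])

  def ownedPartition: Int = nodeId

  // Local client writes go only to the owned partition;
  // returns the offset of the appended record.
  def appendLocal(record: String): Long = {
    partitions(ownedPartition) += record
    partitions(ownedPartition).size - 1L
  }

  // Records pushed from a peer land in that peer's partition mirror.
  def acceptPush(fromPartition: Int, record: String): Unit =
    partitions(fromPartition) += record

  // Next offset (= number of records) in a partition.
  def offsetOf(partition: Int): Long = partitions(partition).size.toLong
}
```

Because each node is broker and consumer at once, "replication" is just the owner pushing its appends into the same partition slot on the other two nodes.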

Node 0 writes incoming updates to partition 0 and does a 1-out-of-2 push to
partition 0 on nodes 1 and 2 (the third ack automatically drops to async, as
we already have 2 of 3 durable stores).
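The ack discipline above boils down to: push to both replicas, return to the caller on the first ack, and let the second push finish in the background. A sketch with Futures (`replicaPushes` is a stand-in for the real replication RPC):

```scala
import scala.concurrent.{Await, Future, Promise}
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global

// Sketch: "1 out of 2" durable push. The owner has already appended
// locally; it acks the client as soon as the FIRST replica confirms
// (local + one replica = 2 of 3 durable). The slower push completes
// asynchronously. Illustrative, not the production code.
object QuorumPush {
  def writeDurable(record: String,
                   replicaPushes: Seq[String => Future[Unit]]): Future[Unit] = {
    val firstAck = Promise[Unit]()
    replicaPushes.foreach { push =>
      // trySuccess: only the first completed push fulfils the promise.
      push(record).foreach(_ => firstAck.trySuccess(()))
    }
    firstAck.future
  }
}
```

The returned future completes as fast as the quickest replica, which is where most of the latency win over waiting on all replicas comes from.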

For update application, Paxos is then used only for consensus on the
high-water mark of the message ID (the Kafka offset), analogous to
Spinnaker's LSN.
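Since the only value under consensus is a single offset, a single-decree Paxos round is enough. A from-scratch sketch (in-process acceptors, majority = 2 of 3; this is illustrative, not my actual implementation):

```scala
// Sketch: single-decree Paxos for agreeing on the high-water mark
// (the offset up to which updates may be applied). Three acceptors,
// one proposer round at a time.
class Acceptor {
  private var promised: Long = -1L
  private var acceptedBallot: Long = -1L
  private var acceptedOffset: Option[Long] = None

  // Phase 1: promise to ignore lower ballots; report any prior accept.
  def prepare(ballot: Long): Option[(Long, Option[Long])] =
    if (ballot > promised) { promised = ballot; Some((acceptedBallot, acceptedOffset)) }
    else None

  // Phase 2: accept the offset unless a higher ballot was promised since.
  def accept(ballot: Long, offset: Long): Boolean =
    if (ballot >= promised) {
      promised = ballot; acceptedBallot = ballot; acceptedOffset = Some(offset)
      true
    } else false
}

object HighWaterPaxos {
  // Returns the chosen high-water mark, or None if no majority.
  def propose(acceptors: Seq[Acceptor], ballot: Long, myOffset: Long): Option[Long] = {
    val majority = acceptors.size / 2 + 1
    val promises = acceptors.flatMap(_.prepare(ballot))
    if (promises.size < majority) return None
    // If any acceptor already accepted an offset, re-propose the one
    // with the highest ballot; otherwise propose our own offset.
    val offset = promises.filter(_._2.isDefined).sortBy(_._1)
      .lastOption.flatMap(_._2).getOrElse(myOffset)
    val acks = acceptors.count(_.accept(ballot, offset))
    if (acks >= majority) Some(offset) else None
  }
}
```

The key property for the apply loop: once a high-water mark is chosen, any later proposer converges on the same offset, so no node ever applies past a mark the others could disagree on.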

Consensus updates to the message high-water mark are "batch" applied by a
single thread, and only a single thread, to the in-memory store, which uses a
special array-based hash map (Compact Hashmap). The single thread with a
reader/writer lock gives transactions for free, with a "database lock" being
faster than row locking as it's all in memory. I have the equivalent of 100s
of millions of rows in memory, all synched between the three nodes.
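The "transactions for free" claim is just the lock discipline: the one writer thread takes the write lock for a whole batch, so readers can only ever observe state at a batch boundary. A sketch (a plain mutable map stands in for the compact array-based hash map):

```scala
import java.util.concurrent.locks.ReentrantReadWriteLock
import scala.collection.mutable

// Sketch: single-writer batch apply behind a reader/writer lock.
// One thread, and only one, calls applyBatch; any number of reader
// threads call read. Readers never see a half-applied batch, which
// is the whole-database-lock "transaction for free".
class InMemStore {
  private val lock = new ReentrantReadWriteLock()
  private val rows = mutable.HashMap.empty[String, Long]

  def applyBatch(batch: Seq[(String, Long)]): Unit = {
    lock.writeLock().lock()
    try batch.foreach { case (k, v) => rows(k) = v }
    finally lock.writeLock().unlock()
  }

  def read(key: String): Option[Long] = {
    lock.readLock().lock()
    try rows.get(key)
    finally lock.readLock().unlock()
  }
}
```

Since everything is in memory, holding one coarse lock for a batch is cheaper than fine-grained row locking would be: the write critical section is a tight loop over a hash map, not I/O.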

One final optimization is the observation that if the overlaying application
is REST based, then updates commute under PUT semantics (versioned with a
vector clock). For this type of application the total ordering of the
application of updates across the nodes can be relaxed, as it is guaranteed
to converge (eventually consistent).
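Concretely, "commute under PUT semantics" means each value carries a vector clock and a PUT only wins if its clock dominates (with a deterministic tie-break for concurrent clocks), so applying the same set of PUTs in any order converges to the same state on every node. A sketch of that merge rule (illustrative, not the production code):

```scala
// Sketch: commuting PUTs via vector clocks. The clock maps a node id
// to that node's update counter for the value.
case class Versioned(value: String, clock: Map[String, Long])

object VectorPut {
  // a dominates b iff a's counter is >= b's for every node id.
  private def dominates(a: Map[String, Long], b: Map[String, Long]): Boolean =
    (a.keySet ++ b.keySet).forall(k => a.getOrElse(k, 0L) >= b.getOrElse(k, 0L))

  // Order-independent merge: the dominated version loses; truly
  // concurrent versions tie-break deterministically on the value so
  // every node picks the same winner.
  def put(current: Option[Versioned], incoming: Versioned): Versioned =
    current match {
      case None                                               => incoming
      case Some(cur) if dominates(incoming.clock, cur.clock)  => incoming
      case Some(cur) if dominates(cur.clock, incoming.clock)  => cur
      case Some(cur) => if (incoming.value > cur.value) incoming else cur
    }
}
```

Because `put` gives the same result regardless of arrival order, the nodes are free to apply replicated PUTs in whatever order the log shuttling delivers them.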

Since all updates are batch-applied, in memory, on a single thread with a
reader/writer lock, a client always views only the converged consistent
state, though each node's path to convergence may differ.

I had looked at using a Dynamo/Cassandra approach, like Spinnaker, but after
coding it I found it too slow. When I went with log-shuttling push/push
similar to LinkedIn's Kafka, update performance jumped quite nicely. I note
one of the authors works at LinkedIn.

Wrote the thing in Scala, which was also sort of new at the time, but it
turned out to be a good choice.

