This is the same technology that is used by the webgrepper tool [http://blekko.com/webgrep] (a grep for the source of web pages).
Disclaimer: I work at blekko and I developed the webgrepper.
As a side note, we have used this for various other purposes. Some fun ones: storing a big music collection (to extract metadata via a map job), citizenship test Q&A (to pick random questions), and the 'joke of the day' (which is, of course, our internal "hello world" example for new employees), etc.
Greg has planned a whole series about the combinator architecture behind blekko's datastore. Greg and I have both presented aspects of the system at various conferences, but we're happy to chat about it with you directly too. I think this might be the first time it's been published on the web though.
Paxos would be good for electing a master, but we wanted to avoid having any masters in the architecture. There are also scenarios where Paxos can be slow or fail to reach consensus. We wanted high availability from each node in the cluster even if 2/3 of the rest of the cluster were down or unreachable; both halves of a partitioned cluster should also be able to continue functioning as best they can.
Individual nodes can often make "personal" decisions about what to do in suboptimal situations. If you can answer an incoming request, even with partial or out-of-date data, do so; it's better than not replying. For the repair agent, each node can see its own view of "holes" in the 3-way replication and can offer to make copies of buckets with fewer than three replicas to bring them back up to three.
The first is on the frontend side. We have two nginx servers (using Linux HA and VIPs) which send traffic out to whichever nodes of the cluster are up, retrying against a different node in the case of a failure or a slow reply.
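That failover behavior can be sketched with stock nginx directives (`proxy_next_upstream` retries the next backend on an error or timeout). The addresses and timeout values below are placeholders, not our real config:

```nginx
upstream cluster_nodes {
    server 10.0.0.11:8080;
    server 10.0.0.12:8080;
    server 10.0.0.13:8080;
}

server {
    listen 80;
    location / {
        proxy_pass http://cluster_nodes;
        # on a connection error or a slow reply, retry a different node
        proxy_next_upstream error timeout;
        proxy_connect_timeout 1s;
        proxy_read_timeout 2s;
    }
}
```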
Deeper in the system, there are 3 copies of every piece of data.
Both of these are fairly normal mechanisms; the 3-copy thing is used at Google and by Hadoop and friends.
Within the datastore, there are 3 copies of each piece of data. When a get() request is made, it goes out to the "closest" copy; if an answer hasn't come back within some threshold, a 2nd request is made to one of the other replicas. Whichever copy returns the data first wins.
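That "hedged request" pattern is easy to sketch with a thread pool. Everything here is illustrative (the replica names, the threshold, the fake `fetch_from`); it just shows the wait-then-hedge, first-answer-wins shape:

```python
import concurrent.futures
import random
import time

REPLICAS = ["replica-a", "replica-b", "replica-c"]  # hypothetical names
HEDGE_THRESHOLD = 0.05  # seconds to wait before trying a second replica

def fetch_from(replica, key):
    """Stand-in for a network get(); sleeps to simulate variable latency."""
    time.sleep(random.uniform(0.01, 0.2))
    return (replica, f"value-of-{key}")

def hedged_get(key):
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=2)
    closest, backup = REPLICAS[0], REPLICAS[1]
    futures = [pool.submit(fetch_from, closest, key)]
    # give the closest replica a head start
    done, _ = concurrent.futures.wait(futures, timeout=HEDGE_THRESHOLD)
    if not done:
        # closest copy is slow: hedge with a request to another replica
        futures.append(pool.submit(fetch_from, backup, key))
    # whichever replica answers first wins
    done, _ = concurrent.futures.wait(
        futures, return_when=concurrent.futures.FIRST_COMPLETED)
    return next(iter(done)).result()
```

The extra request costs a little bandwidth in the slow case but caps tail latency, since one slow or dead replica can no longer stall the read.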
Like many NoSQL databases, we don't really have a query language. Mostly you read a few columns from a particular row. A column could contain a combinator, such as a TopN containing 1,000 inlinks to a webpage... something that you would ordinarily fetch using a range query that returned 1,000 rows from a table.
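A toy version of a TopN combinator might look like the class below (my own simplified sketch, not our implementation). The key property is that the whole top-N list lives in a single cell, so one get() of one column returns it, instead of a 1,000-row range scan:

```python
import heapq

class TopN:
    """Toy TopN combinator: keeps the N highest-scoring items in one cell."""

    def __init__(self, n):
        self.n = n
        self.heap = []  # min-heap of (score, item); smallest kept score on top

    def add(self, score, item):
        if len(self.heap) < self.n:
            heapq.heappush(self.heap, (score, item))
        elif score > self.heap[0][0]:
            # new item beats the current weakest entry; swap it in
            heapq.heapreplace(self.heap, (score, item))

    def items(self):
        """The retained items, highest score first."""
        return sorted(self.heap, reverse=True)
```

Inserts are also cheap to merge, which matters when three replicas and many writers are all updating the same cell.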
Partitions are based on a hash of the primary key. The number of buckets in the system has to be a power of 2. But we can split buckets to increase the number, or even have some buckets that have split and some that haven't yet. Each bucket is stored on 3 separate servers (and the assignment makes sure the three servers are on separate racks).
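The nice thing about a power-of-2 bucket count selected by the low bits of the hash is that splitting is purely local: when you double the count, every key in bucket b lands in either bucket b or bucket b + old_count. Here's a sketch of that math (md5 and the helper names are my choices for illustration; any stable hash works the same way):

```python
import hashlib

def bucket_of(key, num_buckets):
    """Map a primary key to a bucket by hashing. num_buckets must be a power of 2."""
    assert num_buckets & (num_buckets - 1) == 0, "bucket count must be a power of 2"
    h = int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")
    return h & (num_buckets - 1)  # low bits of the hash pick the bucket

def split_children(bucket, old_num_buckets):
    """After doubling the bucket count, keys from `bucket` end up in
    exactly one of these two buckets."""
    return (bucket, bucket + old_num_buckets)
```

Because each bucket splits only into its own two children, some buckets can be split while others haven't been yet, and no data outside the split bucket ever moves.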