Disclaimer: I work at blekko and I developed the webgrepper.
As a side note, we have used this for various other purposes. Some fun ones: storing a big music collection (to extract metadata via a mapjob), citizenship test Q&A (to pick random questions), and the 'joke of the day' (of course, this is our "hello world" example internally for new employees).
First is on our frontend side. We have 2 nginx servers (using Linux-HA and VIPs) that send traffic to whichever cluster nodes are up, retrying on a different node if one fails or replies slowly.
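To make the retry behaviour concrete, here is a rough Python sketch of the idea, not our actual nginx setup. The names `fetch_with_retry`, `send` and `NodeError` are made up for illustration, and it assumes a failure or a slow reply shows up as an exception from `send`.

```python
import random


class NodeError(Exception):
    """Hypothetical error: the node failed or replied too slowly."""


def fetch_with_retry(nodes, send, max_attempts=3):
    """Try up to max_attempts distinct nodes and return the first good reply.

    nodes: the nodes currently believed to be up.
    send:  callable(node) -> reply; raises NodeError on failure or slow reply.
    """
    candidates = list(nodes)
    random.shuffle(candidates)  # spread load; never retry the same node twice
    last_error = None
    for node in candidates[:max_attempts]:
        try:
            return send(node)
        except NodeError as err:
            last_error = err  # remember the failure and move to another node
    raise NodeError(f"all attempts failed: {last_error}")
```

The caller only sees an error if every node it tried failed, which is the point of retrying on a different node rather than the same one.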
Deeper in the system, there are 3 copies of every piece of data.
Both of these are fairly normal mechanisms; the 3-copy thing is used at Google and by Hadoop and friends.
It would have been nice to see some examples of the query language they use, and whether it is comparable to other NoSQL databases.
Individual nodes can often make "personal" decisions about what to do in suboptimal situations. If you can answer an incoming request, even with partial or out-of-date data, do so; it's better than not replying. For the repair agent, each node can see its own view of "holes" in the 3-level replication, and offer to make copies of buckets that have fewer than three copies to bring them back up to three.
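As a rough illustration of the repair idea (a sketch only, not the real agent), a node could scan its own view of where buckets live and pick out the ones below three copies. The data layout (a dict from bucket name to the set of nodes holding it) and the function names are assumptions, as is the rule that a node only offers to copy buckets it does not already hold.

```python
TARGET_COPIES = 3  # the 3-copy replication level described above


def find_holes(bucket_locations, target=TARGET_COPIES):
    """Return {bucket: number of missing copies} for under-replicated buckets.

    bucket_locations: this node's view, mapping bucket -> set of holder nodes.
    """
    return {
        bucket: target - len(holders)
        for bucket, holders in bucket_locations.items()
        if len(holders) < target
    }


def repair_offers(me, bucket_locations, target=TARGET_COPIES):
    """Buckets this node could offer to copy (assumption: only ones it lacks)."""
    holes = find_holes(bucket_locations, target)
    return sorted(b for b in holes if me not in bucket_locations[b])


# Example view: b1 is fully replicated, b2 is missing one copy, b3 two.
view = {"b1": {"n1", "n2", "n3"}, "b2": {"n1", "n2"}, "b3": {"n4"}}
print(find_holes(view))
print(repair_offers("n4", view))
```

Each node runs this against its own view, so two nodes may disagree about the holes; that is fine as long as the offers get reconciled before any copying happens.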