Glad you enjoyed the presentation, and I hope your experimentation goes well.
Re: HTTP Stack: "Performance" is such a slippery word here. If what an application is most sensitive to is the time it takes to fulfill one request, then yes, the HTTP stack may be an important component to keep track of, as it plays a role in defining the lower bound of request fulfillment time.
However, since all HTTP requests are handled in parallel, and since any node can handle any request, applications whose main sensitivity is not latency, but is instead throughput or availability, are less likely to find the HTTP stack to be a performance bottleneck.
There is also the option of doing away with the HTTP stack completely, and speaking Erlang straight to Riak.
They serve different purposes. Riak is a DHT (key-sharding) layer intended to run on top of different backends, just like Dynamo; CouchDB has an integrated storage engine. Riak doesn't support range queries (the nature of a DHT), while I'm pretty sure CouchDB does, which is easy on a single node. CouchDB doesn't scale in the sense that it doesn't do any automatic sharding yet. It's quite possible to use CouchDB as a backend for Riak, though that implies crippling some of CouchDB's more interesting features.
Technically, I found CouchDB more interesting as it's more "different" from other approaches.
A DHT can certainly support range queries by broadcast and merge. The keyword in this context, though, is efficiently. Scalaris is not a typical Dynamo-like DHT: it's based on Chord, so it can support range scans with O(log n) metadata RPCs per query (n being the number of nodes), which I wouldn't call efficient, as these RPCs are expensive in that they're not very cacheable (the nature of the DHT, unless the range query is exactly the same). A typical Bigtable implementation can easily cache a portion of the metadata with clients, resulting in 0 metadata RPCs in many cases.
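To make the broadcast-and-merge point concrete, here's a toy sketch (all names here are illustrative, not taken from Scalaris, Riak, or any real DHT): because keys are hashed, consecutive keys scatter across nodes, so even a tiny range query has to hit every node.

```python
import hashlib

class Node:
    """One storage node in the toy DHT."""
    def __init__(self):
        self.store = {}

class DHT:
    def __init__(self, n_nodes):
        self.nodes = [Node() for _ in range(n_nodes)]

    def _owner(self, key):
        # Hash placement: adjacent keys land on unrelated nodes.
        h = int(hashlib.md5(key.encode()).hexdigest(), 16)
        return self.nodes[h % len(self.nodes)]

    def put(self, key, value):
        self._owner(key).store[key] = value

    def range_query(self, lo, hi):
        # Broadcast to every node and merge the results:
        # O(#nodes) requests per query, no matter how small the range.
        hits = []
        for node in self.nodes:
            hits.extend((k, v) for k, v in node.store.items() if lo <= k <= hi)
        return sorted(hits)

dht = DHT(4)
for i in range(10):
    dht.put(f"key{i:02d}", i)
print(dht.range_query("key02", "key05"))
# [('key02', 2), ('key03', 3), ('key04', 4), ('key05', 5)]
```

A range-partitioned system like Bigtable would instead route this query to the one or two tablets covering [lo, hi], which is where the efficiency gap comes from.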
i was at the nyc nosql conference where mongodb, couchdb and riak were all discussed by core contributors. google "nyc nosql" for more details and possibly slide links. i know slides and video are floating around somewhere.
anyway, mongodb and couchdb are more closely related to each other than they are to riak. a closer comparison to riak would be cassandra (also based off dynamo).
riak is based off the amazon dynamo paper and is a pretty faithful adaptation at that: circular hashing, a node gossip ("chit chat") protocol, eventual consistency, and the CAP theorem. it uses the n/w/r paradigm, where n = number of replicas, w = number of replicas that must be available in the cluster to write, and r = number of replicas that must be available in the cluster to read. so in this mechanism the balance of CAP is up to you, the programmer. disconnected writes written to two separate nodes due to a network flap are both returned to the client for conflict resolution. interestingly, riak is written in erlang and uses rest heavily; because of that they were able to demo access libraries in virtually every language.
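the n/w/r mechanics described above can be sketched in a few lines (illustrative only, not riak's actual code): a write succeeds once `w` of the `n` replicas acknowledge it, a read once `r` respond, and choosing w + r > n makes the read quorum overlap the write quorum.

```python
N, W, R = 3, 2, 2  # replicas, write quorum, read quorum (W + R > N)

class Replica:
    def __init__(self):
        self.data = {}  # key -> (version, value)

    def put(self, key, version, value):
        self.data[key] = (version, value)

    def get(self, key):
        return self.data.get(key)

replicas = [Replica() for _ in range(N)]

def quorum_write(key, version, value, available):
    acks = 0
    for rep in available:
        rep.put(key, version, value)
        acks += 1
    if acks < W:
        raise RuntimeError("write quorum not met")

def quorum_read(key, available):
    responses = [rep.get(key) for rep in available[:R]]
    if len(responses) < R:
        raise RuntimeError("read quorum not met")
    # Return the newest version seen. Real systems use vector clocks,
    # and conflicting siblings get handed back to the client (the
    # "disconnected writes" case above).
    return max(r for r in responses if r is not None)

quorum_write("k", 1, "hello", replicas)
print(quorum_read("k", replicas))  # (1, 'hello')
```

lowering W trades consistency for write availability, which is exactly the "balance of CAP is up to you" knob.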
regarding mongo v couch:
mongo (written in c++) has sharding/replication and rebalancing (via addition to the cluster) in production now. rebalancing via removal is on the drawing board. mongo also supports dynamic querying and bson (binary json). i also believe there is a 4MB max document size limitation. it runs multiple daemons for different functions, and has limited creature comforts like a web gui, with efforts underway to rectify that.
couch is more complete in its restful implementation and general overall approach. i feel it is an easier setup and administration; one daemon pretty much handles everything, i believe. couch is a never-delete database in that it always appends writes and internally versions them. this allows for very interesting replication topologies like master-master-master due to their unique versioning scheme. this will also lead to an "offline" couch version that you can run on your desktop. couch also has "futon", which is their admin web gui.
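the never-delete idea is easy to sketch (a toy model, not couchdb's actual on-disk format or revision scheme): every write appends a new revision to the log, and old revisions stay readable until a compaction pass reclaims them.

```python
class AppendOnlyDB:
    """Toy append-only, internally versioned store (illustrative)."""
    def __init__(self):
        self.log = []      # append-only list of (key, rev, value)
        self.latest = {}   # key -> index of newest entry in log

    def put(self, key, value):
        prev_rev = self.log[self.latest[key]][1] if key in self.latest else 0
        rev = prev_rev + 1
        self.log.append((key, rev, value))      # never overwrite in place
        self.latest[key] = len(self.log) - 1
        return rev

    def get(self, key):
        _, rev, value = self.log[self.latest[key]]
        return rev, value

db = AppendOnlyDB()
db.put("doc", {"a": 1})
db.put("doc", {"a": 2})
print(db.get("doc"))   # (2, {'a': 2})
print(len(db.log))     # 2 -- the old revision is still in the log
```

per-document revision histories like this are what let two masters accept writes independently and reconcile later, which is why the replication topologies get interesting.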
the teams behind each of them are committed and talented.
I ran the conference and have been shamefully behind in cutting all the video. So far I've only gotten the video linked in the article done. Working on more today. They should all be available on that vimeo page when all is said and done.
That said, Bryan gave a fantastic presentation and really knows his stuff.
Yet another Dynamo/DHT. Now I've seen implementations in Java (Dynamo, Cassandra, Voldemort), Perl (Blekko) and now Erlang. I wonder if any of these has been used on a >1000-node cluster? I haven't seen any mention of checksums (besides those in TCP) to ensure that a switch/router doesn't corrupt data (including cluster state), as Amazon learned the hard way.
Of course, they don't support range scan efficiently either. A proper Bigtable implementation is much harder, I guess :)
> I wonder if any of these has been used on >1000 node cluster?
1000 nodes is a _lot_. Facebook's inbox-search Cassandra cluster was at 150 nodes six months ago, and is probably around 200 now given their growth trend -- so unless you have more data than FB you probably don't need to worry too much. :)
IIRC the largest OSS bigtable-clone cluster is Baidu's at around 120. (maybe 150 now?) So same ballpark really.
> they don't support range scan efficiently either
> A proper Bigtable implementation is much harder
It's actually much easier to get something that works with the bigtable single-master model when all your hardware behaves well; it's just that having those single points of failure is a bitch (as even google, with their vast engineering resources and 3-year head start, is still working out, as in the appengine outage a couple of months ago -- http://groups.google.com/group/google-appengine/msg/ba95ded9...). Once you put the engineering in, the fully distributed model is much better.
(For those following along: I'm a Cassandra developer; vicaya is a developer of hypertable, a bigtable clone.)
1000 nodes is nothing if you crawl the entire web and keep a history. I'm sure someone already has a modified version of an OSS bigtable cluster at 1PB+ and close to 1k nodes :)
>> they don't support range scan efficiently either
> Cassandra does.
Only if you call order-preserving hashing efficient when all the keys can easily hash into one bucket, no matter what hash function you choose :) Cassandra's range query is an occasionally useful hack, which is not comparable to Bigtable-like implementations in terms of efficiency, scalability and robustness.
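The hot-bucket problem is easy to demonstrate with a toy partitioner (the scheme below is a made-up illustration, not Cassandra's actual order-preserving partitioner): a random hash spreads a skewed key distribution evenly, while a static order-preserving split piles it onto one node.

```python
import hashlib
from collections import Counter

NODES = 4

def hash_partition(key):
    # Random placement: skewed keys still spread across nodes.
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % NODES

def order_preserving_partition(key):
    # Toy order-preserving scheme: split 'a'-'z' into equal static ranges.
    return min(NODES - 1, (ord(key[0]) - ord('a')) * NODES // 26)

# Skewed workload: most keys share a prefix (think timestamps, usernames).
keys = [f"a{i}" for i in range(97)] + ["m1", "t1", "z1"]

print(Counter(hash_partition(k) for k in keys))              # roughly even
print(Counter(order_preserving_partition(k) for k in keys))  # one hot node
```

Range-partitioned systems like Bigtable sidestep this by splitting and migrating hot ranges dynamically, which is the part that's hard to do robustly.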
Bigtable's single master with standbys is never a problem in practice. The AppEngine outage was due to a bug in GFS master protocol decoding. If this bug were in every node, all nodes would be crashing instead of one, with some service (reads and some writes) still available. The so-called "fully" distributed model is only better when you assume that you don't have bugs in your code and that node failures are random. I argue that separating different responsibilities/functionalities into different components is good for fault isolation, a sound software engineering practice. A naive fully distributed model is much more brittle in practice due to code bugs.
BTW, although Hypertable is inspired by Bigtable in design, the implementation and feature set (support for a dozen languages via Thrift and full read isolation via MVCC) are different enough that I wouldn't call it a clone :)
Disclaimer: I've never actually run Dynomite, so I'm going on what I can find in docs.
In any case, a few of the differences I found:
1. Configuration of N/R/W parameters. Dynomite appears to set these properties once, at startup time, for all values. Riak configures N per-bucket ("bucket" being like a 1-level directory structure), and R/W per-request. Specifying N/R/W so "late" means that applications can put different CAP requirements on different sets of data and tune those requirements at request time.
2. Dynomite's main external interface seems to be Thrift-based, along with some RPC exposure. Riak's main external interface is JSON-based, over ReSTful HTTP.
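As a sketch of what "R/W per-request" looks like in practice: Riak's HTTP interface accepted quorum settings as query parameters on each request. The URL shape below (`/riak/<bucket>/<key>?r=...` on port 8098) matches my recollection of Riak's REST layout at the time, but treat the exact paths and defaults as assumptions.

```python
from urllib.parse import urlencode

def riak_get_url(host, bucket, key, r):
    # Read quorum chosen per request, not fixed at startup.
    return f"http://{host}:8098/riak/{bucket}/{key}?{urlencode({'r': r})}"

def riak_put_url(host, bucket, key, w):
    # Same for the write quorum: each PUT can pick its own w.
    return f"http://{host}:8098/riak/{bucket}/{key}?{urlencode({'w': w})}"

print(riak_get_url("localhost", "users", "alice", 1))
# http://localhost:8098/riak/users/alice?r=1
print(riak_put_url("localhost", "users", "alice", 3))
# http://localhost:8098/riak/users/alice?w=3
```

The same application can thus do a sloppy r=1 read on one code path and a strict w=n write on another, against the same bucket.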