

Riak - web-shaped data storage system - aatif
http://highscalability.com/blog/2009/10/8/riak-web-shaped-data-storage-system.html

======
paraschopra
It looks really, really interesting - the presentation they have here
<http://riak.basho.com/nyc-nosql/> explains CAP, Amazon's dynamo concepts,
Map/Reduce beautifully. Definitely going to try it.

The only thing I'm concerned about here is that since the data is accessed over
an HTTP stack, won't that become a performance bottleneck?

~~~
beerriot
Glad you enjoyed the presentation, and I hope your experimentation goes well.

Re: HTTP Stack: "Performance" is such a slippery word here. If what an
application is most sensitive to is the time it takes to fulfill one request,
then yes, the HTTP stack may be an important component to keep track of, as it
plays a role in defining the lower bound of request fulfillment time.

However, since all HTTP requests are handled in parallel, and since any node
can handle any request, applications whose main sensitivity is not latency,
but is instead throughput or availability, are less likely to find the HTTP
stack to be a performance bottleneck.

There is also the option of doing away with the HTTP stack completely, and
speaking Erlang straight to Riak.
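To make the "any node can handle any request" point concrete, here's a minimal client-side sketch (Python; not part of Riak, and the node addresses/port are made up) that spreads requests round-robin across nodes using Riak's `/riak/<bucket>/<key>` HTTP layout:

```python
import itertools
from urllib.parse import quote

class RiakHttpRouter:
    """Toy client-side router: since any Riak node can coordinate any
    request, a client can spread load across all of them round-robin."""

    def __init__(self, nodes):
        self._nodes = itertools.cycle(nodes)

    def url_for(self, bucket, key):
        # classic Riak HTTP layout: /riak/<bucket>/<key>
        node = next(self._nodes)
        return "%s/riak/%s/%s" % (node, quote(bucket, safe=""), quote(key, safe=""))

router = RiakHttpRouter(["http://node1:8098", "http://node2:8098"])
print(router.url_for("users", "alice"))  # http://node1:8098/riak/users/alice
print(router.url_for("users", "bob"))    # http://node2:8098/riak/users/bob
```

Per-request latency still includes the HTTP round trip, but throughput scales with the number of nodes you fan requests out to.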

------
jnaut
What are the differences, advantages, and disadvantages between Riak and CouchDB?
I haven't found a comprehensive comparison to date.

Edit: Yes, I have been a bit lazy in not going and trying Riak first hand. I
think I will do it sooner rather than later, but as of now, any light on the
subject will be appreciated.

~~~
vicaya
They serve different purposes. Riak is a DHT (for key sharding) layer intended
to run on top of different backends, just like Dynamo. CouchDB has an
integrated storage engine. Riak doesn't support range queries (the nature of a
DHT); I'm pretty sure CouchDB does, which is easy on a single node. CouchDB
doesn't scale in the sense that it doesn't do any automatic sharding yet. It's
quite possible to use CouchDB as a backend for Riak, though it implies
crippling some interesting features in CouchDB.

Technically, I found CouchDB more interesting as it's more "different" from
other approaches.
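To illustrate why range queries don't come naturally to a DHT, here's a toy placement sketch (Python; node names made up, and a simple modulo stands in for a real consistent-hash ring): hashing scatters lexicographically adjacent keys across nodes, so a range scan has to ask every node.

```python
import hashlib

def node_for(key, nodes):
    # DHT-style placement: hash the key, map it onto one of the nodes
    # (real systems use a consistent-hash ring; modulo keeps the toy short)
    h = int(hashlib.sha1(key.encode()).hexdigest(), 16)
    return nodes[h % len(nodes)]

nodes = ["node-a", "node-b", "node-c", "node-d"]
for key in ["user:0001", "user:0002", "user:0003", "user:0004"]:
    print(key, "->", node_for(key, nodes))
# adjacent keys land on different nodes, so a scan over user:0001..0004
# has to contact (nearly) every node in the cluster
```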

~~~
puzza007
The Scalaris DHT supports range queries, I think.

<http://www.zib.de/CSR/Projects/scalaris/>

~~~
vicaya
A DHT can certainly support range queries by broadcast and merge. The keyword
in this context, though, is _efficiently_. Scalaris is not a typical
Dynamo-like DHT: it's based on Chord, so it's possible to support range scans
with O(log n) metadata RPCs per query (where n is the number of nodes), which I
wouldn't call _efficient_, as these RPCs are expensive in that they are not
very cacheable (the nature of the DHT, unless the range query is exactly the
same). A typical Bigtable implementation can easily cache a portion of the
metadata on clients, resulting in zero metadata RPCs in many cases.
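Where the O(log n) comes from can be seen in a toy Chord-style lookup (Python sketch; ring size and node count are arbitrary): each hop uses a finger table to jump most of the remaining distance to the key's successor.

```python
import hashlib

M = 16                # identifier bits; ring positions are 0 .. 2^M - 1
RING = 2 ** M

def h(name):
    return int(hashlib.sha1(name.encode()).hexdigest(), 16) % RING

def successor(ids, point):
    """First node id at or clockwise after `point` on the ring."""
    for i in sorted(ids):
        if i >= point:
            return i
    return min(ids)   # wrap around

def lookup_hops(ids, start, key):
    """Greedy Chord-style routing: from `start`, repeatedly jump to the
    finger that lands closest to the key's successor."""
    target = successor(ids, key)
    cur, hops = start, 0
    while cur != target:
        best = cur
        for k in range(M):  # fingers point at successor(cur + 2^k)
            f = successor(ids, (cur + 2 ** k) % RING)
            if (target - f) % RING < (target - best) % RING:
                best = f
        cur = best
        hops += 1
    return hops

ids = {h("node-%d" % i) for i in range(64)}
start = min(ids)
hops = [lookup_hops(ids, start, h("key-%d" % i)) for i in range(100)]
print("max hops:", max(hops))  # typically near log2(64) = 6, never more than M
```

Every hop is a metadata RPC to a different node, which is exactly the part a client-cached Bigtable tablet map avoids.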

------
vicaya
Yet another Dynamo/DHT. Now I've seen implementations in Java (Dynamo,
Cassandra, Voldemort), Perl (Blekko), and now Erlang. I wonder if any of these
has been used on a >1000-node cluster? I haven't seen any mention of checksums
(besides those in TCP) to ensure that a switch/router doesn't corrupt data
(including cluster state), like Amazon learned the hard way.

Of course, they don't support range scans efficiently either. A proper Bigtable
implementation is much harder, I guess :)

~~~
jbellis
> I wonder if any of these has been used on >1000 node cluster?

1000 nodes is a _lot_. Facebook's inbox-search Cassandra cluster was at 150
nodes six months ago; probably around 200 now given their growth trend -- so
unless you have more data than FB, you probably don't need to worry too much. :)

IIRC the largest OSS bigtable-clone cluster is Baidu's, at around 120 (maybe
150 now?). So same ballpark, really.

> they don't support range scan efficiently either

Cassandra does.

> A proper Bigtable implementation is much harder

It's actually much easier to get something that works with the bigtable
single-master model when all your hardware behaves well; it's just that having
those single points of failure is a bitch (as even Google, with their vast
engineering resources and 3-year head start, is still working out, as in the
appengine outage a couple months ago --
<http://groups.google.com/group/google-appengine/msg/ba95ded980c8c179>). Once
you put the engineering in, the fully distributed model is much better.

(For those following along: I'm a Cassandra developer; vicaya is a developer
of hypertable, a bigtable clone.)

~~~
vicaya
1000 nodes is nothing if you crawl the entire web and keep a history. I'm sure
someone already has a modified version of an OSS bigtable cluster at 1PB+ and
close to 1k nodes :)

>> they don't support range scan efficiently either

> Cassandra does.

Only if you call order-preserving hashing efficient, when all the keys can
_easily_ be hashed into one bucket no matter what hash function you choose :)
Cassandra's range query is an occasionally useful hack, which is _not_
comparable to Bigtable-like implementations in terms of efficiency,
scalability, and robustness.
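The hotspot problem with order-preserving placement is easy to demonstrate with a toy comparison (Python; the node count and the first-byte split are made up for illustration, not any system's actual partitioner):

```python
import hashlib
from collections import Counter

NODES = 4

def random_partition(key):
    # hash-style placement: even spread, but ranges scatter across nodes
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % NODES

def order_preserving_partition(key):
    # order-preserving placement: split the keyspace by first byte, so
    # ranges stay local -- but a skewed key distribution piles up
    return ord(key[0]) * NODES // 256

keys = ["user:%05d" % i for i in range(1000)]   # realistic skew: shared prefix
print(Counter(random_partition(k) for k in keys))            # roughly even
print(Counter(order_preserving_partition(k) for k in keys))  # one hot node
```

Bigtable-style systems sidestep this by splitting tablets on load rather than by a fixed key-to-node function.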

Bigtable's single-master-with-standbys is never a problem in practice. The
AppEngine outage was due to a bug in GFS master protocol decoding. If this bug
were in every node, all nodes would have been crashing instead of one, with
some service (reads and some writes) still available. The so-called "fully"
distributed model is only better when you assume that you don't have bugs in
your code and that node failure is random. I argue that separating different
responsibilities/functionalities into different components is good for fault
isolation, a sound software-engineering practice. A naive fully distributed
model is much more brittle in practice due to code bugs.

BTW, although Hypertable is inspired by Bigtable in design, the implementation
and feature set (support for a dozen languages via Thrift, and full read
isolation via MVCC) are different enough that I wouldn't call it a clone :)

------
benreesman
The Basho guys are insanely smart, and I've been dying to get a look at Riak;
really excited to see it released.

------
jbellis
What's the 10,000 ft summary of how Riak is different from Dynomite,
PowerSet's Dynamo-clone-in-Erlang?

~~~
beerriot
Disclaimer: I've never actually run Dynomite, so I'm going on what I can find
in the docs.

In any case, a few of the differences I found:

1. Configuration of N/R/W parameters. Dynomite appears to set these properties
once, at startup time, for all values. Riak configures N per bucket ("bucket"
being like a 1-level directory structure), and R/W per request. Specifying
N/R/W this "late" means that applications can put different CAP requirements on
different sets of data and tune those requirements at request time.

2. Dynomite's main external interface seems to be Thrift-based, along with some
RPC exposure. Riak's main external interface is JSON-based, over RESTful HTTP.

3. Hinted handoff is a target for Dynomite's 1.0 milestone (according to
<http://wiki.github.com/cliffmoon/dynomite/roadmap>). Riak already fully
supports hinted handoff.
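To make point 1 concrete, here's a toy model (Python; not Riak's actual code or API) of what per-request R/W buys you: with N copies of a value, a write waits for W acks, a read consults R replicas, and choosing R + W > N per request guarantees the read set overlaps the write set.

```python
class ToyBucket:
    """Toy Dynamo-style quorum bucket: N replicas, per-request R and W."""

    def __init__(self, n):
        self.n = n
        self.replicas = [{} for _ in range(n)]   # replica: key -> (version, value)

    def put(self, key, value, version, w):
        # the write returns once w replicas have accepted it; the rest
        # would catch up asynchronously (read repair / hinted handoff)
        for rep in self.replicas[:w]:
            rep[key] = (version, value)

    def get(self, key, r):
        # the read consults r replicas and returns the newest version seen
        replies = [rep[key] for rep in self.replicas[:r] if key in rep]
        return max(replies)[1] if replies else None

b = ToyBucket(n=3)
b.put("k", "v1", version=1, w=2)   # W=2 write
print(b.get("k", r=2))             # R=2, so R+W > N: prints v1
```

In the toy, reads and writes both start from replica 0, so the overlap is by construction; real systems pick replicas per key and rely on the R + W > N arithmetic for the guarantee, trading latency for consistency request by request.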

~~~
jbellis
Thanks!

