

Digg Saying Yes to NoSQL; Going Steady with Cassandra - jbellis
http://about.digg.com/node/564

======
zaidf
For someone who knows nothing about NoSQL but is decent with MySQL, can someone
give a brief Idiot's overview of how NoSQL works? If I add a record in say a
table named "news", where is the data stored? If I need to do a search by news
id or description, what's the front-end api like and what happens at the
backend when the api is called?

~~~
buro9
I would very seriously recommend reading the paper on Amazon Dynamo, whose
design heavily influenced Apache Cassandra:
[http://s3.amazonaws.com/AllThingsDistributed/sosp/amazon-
dyn...](http://s3.amazonaws.com/AllThingsDistributed/sosp/amazon-dynamo-
sosp2007.pdf)

PS: Why aren't we in the habit of citing papers like this? So many links are
to websites rather than direct sources, and the info in papers is usually
highly readable and extremely informative about the details.

~~~
vog
I agree so much! Most blogs fail to summarize such papers properly. And if I
want a summary or introduction, I can find those in the papers, too.

The same goes for RFCs and W3C documents. All of that is written very well,
both technically and in terms of readability. Why put a layer of blogs and
websites around them, which actually lowers the perceived quality?

Also, I'm frequently annoyed by articles that write about what some other
people (e.g. RMS) wrote. These aren't much shorter than the source article,
and aren't written nearly as clearly as the original.

I appreciate it when an author just links to the source and elaborates on that
topic, stating his own opinion, and prefers to quote rather than to paraphrase
what's already clearly written in the source article.

------
rgrieselhuber
I'm looking pretty seriously at MongoDB and I've heard that Cassandra is worth
considering. I do a lot of data warehousing / statistical analytics which
generally means some sort of star schema-based reporting with lots of
crosstabs, dimensions, etc.

If anyone can relate their experience with either of these two platforms,
would either be a good choice for live querying for these types of
applications? I know you can use MapReduce to eventually get the data you
need, but I need to support queries that respond in (well) less than a second,
even for very large data sets.

~~~
rbranson
HBase is probably closer to what you're looking for.

~~~
rgrieselhuber
Why?

------
joshd
I'd love to see an example of how people are redefining their schema with
NoSQL databases (especially with document based databases like CouchDB). A
common example you hear is "Your blog document can contain an array of comment
nodes". Which is all great in theory but obviously won't scale.

If Digg are using Cassandra as a big key-value store then how do they look up
comments for posts? If they're storing one entry per comment with an index on
post ID then it's not really key based any more.

We hear a lot of case studies about performance increases, but I never seem to
see any practical details on how the databases should be used correctly.

~~~
ieure
Comments are denormalized and stored two ways: One in a plain ColumnFamily
(for random access) and one in a SuperCF (for sequential access).

The plain CF ("Comments") uses the comment ID (which is a timestamp+salt) as
the row key. The fields of the comment are columns in the row (username, text,
date_created etc).

The SuperCF ("StoryComments") uses the story ID as the row key. Each row
contains one SuperColumn per comment. The SC name is the comment_id, and the
columns are the fields of the comment.

So, say you want to get the first 50 comments for a given story, you'd do a
get_slice on StoryComments, passing the story ID in as the row key, "" as the
start column, and a count of 50. You get back 50 SuperColumns, each of which
contains one comment.

Cassandra is extremely good for sequential reads, not just random lookups.
Just about any list of things can be efficiently stored and retrieved with its
batch insertion and slicing operations.
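
To make the layout described above concrete, here is a minimal in-memory sketch in Python. It is not Cassandra's client API and not Digg's code: `ToyStore`, `insert_comment`, `make_comment_id`, and the field names are hypothetical, and the ID format (zero-padded microsecond timestamp plus a random salt) is only one assumed way to get time-ordered IDs. It only illustrates the idea of writing each comment twice and slicing a story's row by column name.

```python
import bisect
import random
import time


def make_comment_id(now=None):
    """Hypothetical ID scheme: zero-padded timestamp + salt, so IDs sort by time."""
    micros = int((time.time() if now is None else now) * 1_000_000)
    salt = random.randrange(16 ** 4)
    return f"{micros:020d}-{salt:04x}"


class ToyStore:
    """In-memory stand-in for the two column families described above."""

    def __init__(self):
        # Plain CF "Comments": comment_id -> {field: value}
        self.comments = {}
        # Super CF "StoryComments": story_id -> {comment_id: {field: value}}
        self.story_comments = {}

    def insert_comment(self, story_id, comment_id, fields):
        # Denormalized write: the same fields are stored in both places.
        self.comments[comment_id] = dict(fields)
        self.story_comments.setdefault(story_id, {})[comment_id] = dict(fields)

    def get(self, comment_id):
        # Random access by comment ID (the plain CF).
        return self.comments[comment_id]

    def get_slice(self, story_id, start="", count=50):
        # Sequential access: up to `count` super columns of one row,
        # in column-name order, beginning at `start` ("" means the first).
        row = self.story_comments.get(story_id, {})
        names = sorted(row)
        first = bisect.bisect_left(names, start)
        return [(name, row[name]) for name in names[first:first + count]]
```

With this sketch, `store.get_slice(story_id, start="", count=50)` returns up to 50 `(comment_id, fields)` pairs for one story, and passing the last ID seen as `start` fetches a following page (that page begins with the same ID again, since `start` is inclusive here). A real store keeps the column names sorted on disk, so it would not re-sort on every read as this toy does.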

------
thinker
For those who may have missed this a couple of weeks ago, Twitter is also
considering Cassandra - [http://nosql.mypopescu.com/post/407159447/cassandra-
twitter-...](http://nosql.mypopescu.com/post/407159447/cassandra-twitter-an-
interview-with-ryan-king)

~~~
jbellis
Past the considering stage and on to the "deploying" stage.

------
JulianMorrison
Nice to see Digg pouring effort into making Cassandra itself better for
everyone.

