

Looking into the future with Cassandra - bnmrrs
http://blog.digg.com/?p=966

======
justin_vanw
SELECT `digdate`, `id` FROM `Diggs` WHERE `userid` IN (59, 9006, 15989, 16045,
29183, 30220, 62511, 75212, 79006) AND itemid = 13084479 ORDER BY `digdate`
DESC, `id` DESC LIMIT 4;

Ok, how do we optimize this query?

Step 1: Keep all dugg items in memcached for the last N days, where N is
determined by when you run out of memory. Then, your query becomes:

SELECT `digdate`, `id` FROM `Diggs` WHERE `userid` IN (59, 9006, 15989, 16045,
29183, 30220, 62511, 75212, 79006) AND digdate < now() - interval '5 days' AND
itemid = 13084479 ORDER BY `digdate` DESC, `id` DESC LIMIT 4; /* Excuse the
postgresql syntax */

If your database is properly clustered, this will mean you are only running
the query against partitions holding old diggs, which are probably not as hot
as the more recent stuff. Additionally, I strongly suspect that recent
articles are viewed far more often than old ones; if the article is less than
5 days old you need no SQL at all, just the memcache lookup. For example, if
you are looking at the homepage, and there are 15 articles on it, you have to
do a single memcached get request for all the pairs like (article_id,
friend_id), so if you have 100 friends that is 100 * 15 keys to request. This
is large, but who cares, you can add memcached servers and webservers until
you puke and this will keep scaling without limit. When browsing old articles
the db will get hit heavily, but only the partitions holding old data, and I
would guess that this is a very very small fraction of their overall use.
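
Roughly, that read path in python with the python-memcached client (the
dugg:<item_id>:<user_id> key format and the host are made up for
illustration):

    import memcache  # python-memcached

    mc = memcache.Client(['127.0.0.1:11211'])

    def friend_diggs_for_page(article_ids, friend_ids):
        # 15 articles x 100 friends -> 1500 keys fetched in one round trip
        keys = ['dugg:%d:%d' % (a, f) for a in article_ids for f in friend_ids]
        return mc.get_multi(keys)  # dict containing only the keys that exist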

Step 2: When a user is actively using the site, like they have viewed 2 pages
in the last 10 minutes or something, shove all their old (article_id,
friend_id) pairs into memcached as well. Once a user has reached the 'activity
threshold' and the cache is filled, no sql is necessary to find all their
friends' dugg articles. As a bonus, no weirdo software like 'cassandra' which
may or may not continue to exist in 1 year is necessary.
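
Something like this, reusing the client above; the Diggs columns come from
the query at the top of the thread, the cursor is any DB-API cursor, and the
TTL is arbitrary:

    ACTIVE_TTL = 60 * 60  # keep the backfilled keys warm for an hour (arbitrary)

    def backfill_active_user(cursor, friend_ids):
        # friends' diggs that are too old to be in the rolling 5-day cache
        in_clause = ','.join(str(int(f)) for f in friend_ids)
        cursor.execute(
            "SELECT itemid, userid, digdate, id FROM Diggs"
            " WHERE userid IN (%s)"
            " AND digdate < NOW() - INTERVAL 5 DAY" % in_clause)
        for itemid, userid, digdate, digg_id in cursor.fetchall():
            mc.set('dugg:%d:%d' % (itemid, userid), (digdate, digg_id), ACTIVE_TTL)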

For step 1 you need very little effort, just put a key into memcached every
time a user digs something, and put a 5 day timeout on that key. This is 1
line of code in whatever code handles the http request representing a 'dig'.
Then you have to build up the list of friends and keep it somewhere when a
user logs in to the site (or returns with a cookie that has them logged in).
This would take one memcache request when the user logs in/comes back to see
if their friends list is in memcached, one sql statement if it is not, and a
line in the area that handles adding friends to invalidate the key if their
friends list changes (you could try updating it, but why bother; just let it
be regenerated on their next http request). Finally, you have to generate the
keys for the (article_id, friend_id) pairs on each page view, and do a
multi_get from memcached.
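
In code, very roughly (the Friends table and friendid column are guesses;
everything else is as above):

    FIVE_DAYS = 5 * 24 * 60 * 60

    def on_digg(user_id, item_id, digdate, digg_id):
        # the one extra line in whatever handles the http request for a 'dig'
        mc.set('dugg:%d:%d' % (item_id, user_id), (digdate, digg_id), FIVE_DAYS)

    def friends_of(cursor, user_id):
        friends = mc.get('friends:%d' % user_id)
        if friends is None:
            cursor.execute("SELECT friendid FROM Friends WHERE userid = %s", (user_id,))
            friends = [row[0] for row in cursor.fetchall()]
            mc.set('friends:%d' % user_id, friends)
        return friends

    def on_friends_changed(user_id):
        # don't update the cached list in place; let the next request rebuild it
        mc.delete('friends:%d' % user_id)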

Step 2 would require an asynchronous process, so would be more complex.

I could implement step 1 in an hour or so if familiar with the digg codebase,
and step 2 in perhaps 2 days. However, if they have other async processes that
occur when a user logs in that this could be integrated with, it could take as
little as an hour or two as well: the logic is dead simple, it is the
mechanics of running a process to do it that is time consuming.

Finally, you would have to figure out how much memory you would need to store
N days of digs (users with no friends do not count in this). I believe it
would not be very much.

~~~
ojbyrne
Since I wrote that query back in 2005, and it was actually somewhat more
complex (it had to deal with users banned after they dug the same story as
you, for starters), I assume Ian (whom I've never met) is simplifying for
pedagogical reasons. And we had something similar to what you describe in no.
2 in 2006.

~~~
ojbyrne
Also I find it fascinating that the table is "Diggs" (it was originally "digs"
and I suspect it still is). Gotta get the branding right, even in sql queries.

------
rarrrrrr
In SQL systems, you can easily shift the computation cost from read time to
write time. That's what triggers are for. When new data comes in that changes
a result you know you need quickly, a trigger can automatically add to a work
queue to update a result table.

Since the trigger has automatic access to the contents of the new data (and
old data in case of an update or delete) the computation to update the results
table can often be made much faster.

Every situation is different of course, but it's overreaching to say that SQL
systems have no options beyond read-time computation of results.
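
For instance, a single-statement MySQL trigger that enqueues work on every new
digg; the fanout_queue work-queue table is made up, and it's wrapped in python
only so it can be installed from the app:

    CREATE_TRIGGER = """
    CREATE TRIGGER digg_fanout AFTER INSERT ON Diggs
    FOR EACH ROW
        INSERT INTO fanout_queue (itemid, userid, digdate)
        VALUES (NEW.itemid, NEW.userid, NEW.digdate)
    """

    def install_trigger(cursor):
        # a separate worker drains fanout_queue and updates the result table
        cursor.execute(CREATE_TRIGGER)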

~~~
staunch
This was my first thought too. Why not just denormalize the data in MySQL?
What benefit is Cassandra really offering here over that option?

~~~
jbellis
Cassandra gives you two benefits.

First, Cassandra uses a disk layout similar to the one described in the
Bigtable paper (<http://labs.google.com/papers/bigtable.html> sections 5.3 and
5.4); in particular, it does no random writes. Relational databases like mysql
pretty much all use btree-based storage, which was great 20 years ago but is
terrible today, when seeks are your bottleneck.

I was talking to some people today who are struggling to get mysql to do ~100
insert/update operations per second. Cassandra will easily give you 10x that
-- _per node_.

The second benefit is that Cassandra gives you real, scalable partitioning,
invisible to the app, for when you do need to add nodes. When you have more
than a handful of machines, not having to babysit replication + partitioning
is a huge, huge deal.

~~~
aaronblohowiak
"I was talking to some people today who are struggling to get mysql to do ~100
insert/update operations per second."

I apologize for being flippant, but.. Were they running MySQL on an EeePC ?

They are probably Doing Something Wrong if they are struggling to do >100
insert/updates a second on even modest hardware.

~~~
jbellis
Because of how the storage engine works, mysql insert performance degrades
significantly as table size grows into double digits of GB, and especially
once the PK index no longer fits in RAM.

Cassandra doesn't have to read indexes to figure out where to put the data so
write performance stays constant as your data set grows.

------
mcav
Wow:

> _For this feature, the fully denormalized Cassandra dataset weighs in at 3
> terabytes and 76 billion columns._

~~~
breck
What's amazing is that 3 terabytes costs about $300 nowadays. Include good
components, power, redundancy, and maintenance, and it probably comes to
$1,000 per year.

So this feature is actually pretty cheap to do, unless I'm missing something.

~~~
SwellJoe
Every time I read "terabyte", I have to pause, because my brain goes "That's
gigantic!" and then a little voice says, "It's the same size as the hard disk
in my desktop...it cost $89."

I have the same sort of weird mental disconnect with SD cards. It's just hard
to fathom 32GB of data fitting into something about the size of a quarter.

~~~
imajes
also note that WD (?) have managed to make a terabyte 2.5" disk. Expect high-
end laptops to ship with that soon enough. :)

------
mikeryan
This is weird to me.

"We started thinking seriously about deploying Cassandra in production around
three weeks ago. After looking at the site for something that would be a good
fit, we settled on green badges."

It seems completely baffling to me that someone would go out and compare
different db solutions, pick one, and THEN try to find a way to fit it into
their site architecture.

~~~
jbellis
I read that as, "which feature is easiest to port from the existing mysql
solution?" Which is a reasonable question to ask, given how different the two
databases are.

~~~
mikeryan
My process would be this.

1. Decide that mysql (or other RDB) is problematic.

2. Decide which features cause the most pain (list them with the most painful
on top).

3. Discover which alternative db solutions would do the best in alleviating
the top issues.

~~~
jbellis
4. Start implementing, beginning with the most-improvement-for-least-effort.

I'm saying the two lists are not mutually exclusive. :)

------
gcv
I read in several blog posts that Cassandra has its share of data-corrupting
bugs.
<http://blog.evanweaver.com/articles/2009/07/06/up-and-running-with-cassandra/>
mentioned that even Facebook does not use it as a system of record. Wonder how
Digg deals with that.

~~~
jbellis
What that article says is if you do something analogous to backing up a live
mysql database with dumb cp, Bad Things Happen.

Yes, we do plan to make it harder to shoot yourself in the foot, but this
isn't a data-corrupting bug of the kind you meant.

------
imajes
Anyone from digg here? I'd love to know how long it took to build that dataset
the first time. In other words: what's the recovery window like to reset that
size dataset?

~~~
ieure
It takes around 27 hours. 3 to load the MySQL data into Hadoop, then 24 to
compute and insert into Cassandra.

~~~
imajes
thanks!

------
by
Why does the first step "Query Friends for all my friends" take 1.5 seconds? I
am struggling to understand this. If this simple table of, say, 100,000,000
rows is indexed on userid and we are only reading and returning, say, 200 rows
for a particular userid, what makes it so slow?

------
Quarrelsome
But surely that's broken. If I dig something but then add a friend AFTERWARDS,
they won't see the shield, as the bucket for my digg didn't contain their user
id at write time. Am I missing something?

