Ok, how do we optimize this query?
Step 1: Keep all dugg items in memcached for the last N days, where N is determined by when you run out of memory. Then, your query becomes:
SELECT `digdate`, `id` FROM `Diggs`
WHERE `userid` IN (59, 9006, 15989, 16045, 29183,
30220, 62511, 75212, 79006)
AND digdate < now() - interval '5 days'
AND itemid = 13084479 ORDER BY `digdate` DESC, `id` DESC LIMIT 4; /* Excuse the postgresql syntax */
If your database is properly clustered, this means you are only running the query against partitions holding old diggs, which are probably not as hot as the more recent stuff. Additionally, I strongly suspect that recent articles are viewed far more often than old ones; if the article is less than 5 days old you need no SQL at all, just the memcached lookup. For example, if you are looking at the homepage and there are 15 articles on it, you do a single memcached multi-get for all the (article_id, friend_id) pairs, so if you have 100 friends that is 100 × 15 = 1,500 keys to request. That is a lot, but who cares: you can add memcached servers and webservers until you puke, and this will keep scaling without limit. When browsing old articles the DB will get hit heavily, but only the partitions holding old data, and I would guess that this is a very small fraction of their overall traffic.
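The multi-get described above can be sketched as follows. The `dugg:<article>:<user>` key scheme is my invention, and a plain dict stands in for a memcached client's `get_multi` so the logic is runnable:

```python
# Hypothetical key scheme: one key per (article_id, friend_id) pair,
# set whenever a friend diggs an article.
def dugg_key(article_id, friend_id):
    return "dugg:%d:%d" % (article_id, friend_id)

def friends_who_dugg(cache, article_ids, friend_ids):
    """Return {article_id: [friends who dugg it]} with one multi-get."""
    keys = [dugg_key(a, f) for a in article_ids for f in friend_ids]
    # One round trip; with a real client this would be cache.get_multi(keys).
    hits = {k: cache[k] for k in keys if k in cache}
    result = {a: [] for a in article_ids}
    for a in article_ids:
        for f in friend_ids:
            if dugg_key(a, f) in hits:
                result[a].append(f)
    return result
```

For 15 homepage articles and 100 friends this builds 1,500 keys, but they all go out in a single batched request.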
Step 2: When a user is actively using the site, say they have viewed 2 pages in the last 10 minutes, shove all their older (article_id, friend_id) pairs into memcached as well. Once a user has crossed the 'activity threshold' and the cache is filled, no SQL is necessary to find all the articles their friends have dugg. As a bonus, no weirdo software like 'cassandra', which may or may not continue to exist in 1 year, is necessary.
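A hedged sketch of that step-2 cache warm; every name here is made up, and `fetch_old_diggs` stands in for the SQL query against the old partitions:

```python
# Once a user crosses the activity threshold, copy their friends' older
# diggs into the cache so subsequent page views need no SQL at all.
def warm_cache(cache, user_id, fetch_old_diggs, page_views, threshold=2):
    """Return the number of pairs warmed (0 if below the threshold)."""
    if page_views < threshold:
        return 0
    pairs = fetch_old_diggs(user_id)  # [(article_id, friend_id), ...]
    for article_id, friend_id in pairs:
        cache["dugg:%d:%d" % (article_id, friend_id)] = 1
    return len(pairs)
```

In production this would run in the asynchronous process mentioned below, not in the request path.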
For step 1 you need very little effort: just put a key into memcached every time a user diggs something, with a 5-day timeout on that key. That is 1 line of code in whatever handles the HTTP request representing a 'dig'. Then you have to build up the list of friends and keep it somewhere when a user logs in to the site (or returns with a cookie that has them logged in). This takes one memcache request when the user logs in/comes back to see if their friends list is in memcached, one SQL statement if it is not, and a line in the area that handles adding friends to spoil the key if their friends list changes (you could try updating it in place, but why bother, just let it be regenerated on their next HTTP request). Finally, you have to generate the keys for the (article_id, friend_id) pairs on each page view and do a multi-get from memcached.
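Those three pieces might look roughly like this. A dict-backed `FakeCache` stands in for a real memcached client (whose `set()` similarly takes an expiry argument); the TTL is accepted but ignored here so the sketch stays runnable:

```python
class FakeCache(object):
    """Dict-backed stand-in for a memcached client (TTLs ignored here)."""
    def __init__(self):
        self._data = {}
    def set(self, key, value, ttl=0):
        self._data[key] = value
    def get(self, key):
        return self._data.get(key)
    def delete(self, key):
        self._data.pop(key, None)

FIVE_DAYS = 5 * 24 * 3600

def on_digg(cache, article_id, user_id):
    # The one line in the 'dig' handler: a key with a 5-day timeout.
    cache.set("dugg:%d:%d" % (article_id, user_id), 1, FIVE_DAYS)

def friends_for(cache, db, user_id):
    # One cache get on login; one SQL query only on a miss.
    friends = cache.get("friends:%d" % user_id)
    if friends is None:
        friends = db.load_friends(user_id)
        cache.set("friends:%d" % user_id, friends)
    return friends

def on_friend_change(cache, user_id):
    # Just spoil the key; it regenerates on the next request.
    cache.delete("friends:%d" % user_id)
```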
Step 2 would require an asynchronous process, so would be more complex.
I could implement step 1 in an hour or so if familiar with the Digg codebase, and step 2 in perhaps 2 days. However, if they already have other async processes that run when a user logs in, integrating with those could bring step 2 down to an hour or two as well: the logic is dead simple, and it is the mechanics of running a process to do it that is time-consuming.
Finally, you would have to figure out how much memory you need to store N days of diggs (users with no friends do not count toward this). I believe it would not be very much.
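For a rough sense of scale, here is a back-of-the-envelope version of that estimate; every number below is an assumption of mine, not a Digg figure:

```python
# Assumed inputs: diggs per day, the retention window N, and the cost of
# one cached item (key bytes + value + memcached's per-item overhead).
diggs_per_day = 5_000_000   # assumption
n_days = 5
bytes_per_item = 100        # ~30-byte key plus overhead, assumption

total_bytes = diggs_per_day * n_days * bytes_per_item
gib = total_bytes / 2**30   # roughly 2.3 GiB under these assumptions
```

Even with generous numbers this lands in the low single-digit GiBs, i.e. a slice of one memcached box.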
Further, step two is irrelevant. There's no way of knowing whether an (article_id, friend_id) pair in memcached is from the most recent N days or has been stuffed in because a user was active. Therefore, searching the DB for older diggs is still necessary, and should generate exactly the same load as if they weren't in memcached at all.
Memcached + MySQL makes great sense when the data set is small and simple. If all the content on the site fits within 1 GB, you could probably easily push a hundred million unique visitors a day. For an application like this, the relatively poor performance of MySQL and the inflexibility of memcached cause problems.
It seems to me that using Cassandra, even in its current immature state, makes much more sense than the solution you're proposing.
Step 2 is fine; memcached allows setting timeouts on keys. If you always set a timeout of (dig_time + five_days) - now() you are set. Be careful: expiry values greater than 30 days are interpreted as a unix timestamp rather than a relative timeout.
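That expiry rule can be written down directly (`digg_ttl` is my name for it). The relative TTL here is at most 5 days, safely under memcached's 30-day cutoff, so it is always treated as a duration; it is only when you pass a value over 30 days' worth of seconds, such as an absolute timestamp, that memcached reads it as a unix time:

```python
FIVE_DAYS = 5 * 24 * 3600

def digg_ttl(dig_time, now):
    """Seconds of cache life left for a digg made at dig_time."""
    remaining = int(dig_time + FIVE_DAYS - now)
    # A TTL of 0 means 'never expire' to memcached, so callers should
    # skip the set() entirely when nothing remains.
    return max(remaining, 0)
```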
MySQL kind of sucks. The clustering is easy to set up, at least.
Cassandra and other big on-disk hash tables are pretty cool. I think once they have datastore-like indexing capabilities they will be totally usable. My qualm with using them is that there are 80 of them right now, and they are all pretty immature. The ones that are mature, like bsddb, are complicated to use.
My point was that cassandra is just a big memcached.
That's pretty much the idea. Cassandra makes replication + failover totally seamless, so there's no excuses in that respect. :) Cassandra also supports replicating across multiple data centers.
But NoSQL kinda gives you memcache-like awesomeness with some level of persistence. Think of Cassandra as a more feature-rich, persistent memcache. :)
So saying "NoSQL gives you [or doesn't give you] X" is meaningless.
And for the record, SQL is great, but trying to keep using it once your data set has outgrown a single machine is like "trying to build a bookshelf out of mashed potatoes." (Gotta love jwz.)