Also, if you're attending or watching at home and there is something specific you're hoping I will cover, tell me here in the comments and I'll do my best to cover it!
It seemed like the object-blob Postgres store was a disaster; then they brought in Cassandra, which was initially used as a cache of sorts but is now apparently the primary point of contact for threads?
Reverse-engineering the persistence strategy from the open-sourced code is more tedious and potentially inaccurate than I am comfortable with, so could a Reddit employee or code contributor provide some insight into how those problems were worked around?
There was Reddit downtime today (read-only mode), which is common. What's the usual cause of that sort of thing these days?
Now, maybe that was Amazon's fault for selling a flaky storage solution, or maybe it was our fault for not getting our databases the hell off of it, but yeah, it definitely wasn't Postgres's fault.
I'm gonna go page ketralnis, since he got his hands a lot dirtier than I did with respect to reddit's storage technologies.
As a side note, things have been much better, from what I hear, since they moved to local storage. Things work much better here at Netflix for the same reason.
I haven't had much of a chance to play with the new Provisioned IOPS, but from what I've seen it's a pretty solid storage solution.
This bug was a pretty rare case, but to be specific, here it is:
1. Databases for a particular data type are specified in the config file like:
dbmaster, dbslave1, dbslave2, dbslave3
3. Reads are sent to (sort of) random.choice(active_list) and writes are sent to active_list[0]
4. But wait! If the master is "down" (the EBS backing him has slowed down so far as to mark him unresponsive) then active_list[0] may very well be a slave
5. Londiste now can't reconcile the slave database with the master database. There's a sequence conflict!
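A rough Python sketch of that routing, just to make the failure mode concrete. The names here (build_active_list, is_responsive, pick_write_db) are mine for illustration, not the actual reddit code:

    import random

    # Hypothetical names; the real config parsing and health checks differed.
    CONFIGURED_DBS = ["dbmaster", "dbslave1", "dbslave2", "dbslave3"]

    def build_active_list(dbs, is_responsive):
        # Drop any database whose EBS volume has slowed down enough to
        # fail the health check. On a bad day that can include the master.
        return [db for db in dbs if is_responsive(db)]

    def pick_read_db(active_list):
        # Reads go to a (more or less) random active database.
        return random.choice(active_list)

    def pick_write_db(active_list):
        # Writes go to the first active entry, assumed to be the master.
        # If the master was marked unresponsive, active_list[0] is really
        # a slave, and writing to it produces the sequence conflict that
        # Londiste can't reconcile against the master.
        return active_list[0]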
The symptom would be that a replica would refuse to replicate, we'd detect the replication lag on that slave, and remove it from the active list. Then its load would fall, we'd get an alert, and we'd say "well crap" and rereplicate to it.
In practice this didn't happen very often; I can only think of a few occasions off-hand, and generally the master's disk lag was so bad that a downed slave was the least of our problems.
It got progressively worse over time as the EBS performance got worse. It was happening every couple of weeks after you left, and I think it got even worse after that.
Luckily we had dug into the londiste internals and managed to figure out how to restore replication much faster, but that was still super painful. :)
But you're right, the disk lag usually caused the immediate problem -- the out of sync slave was usually pretty easy to deal with, depending on how many got out of sync at once. It was just annoying and time consuming.
Many of us are at Hipmunk now, so at least that many. We're also still friends with each other and with the current reddit folk
Increased traffic? I don't think I understand the question
I was trying to determine what component broke first.
An app like reddit has most of its backend tweaked or rewritten continuously, as finding and alleviating the next bottleneck becomes ever more important. If you do it right, you predict the next bottleneck before it becomes one, but there will always be something.
Maybe writes take too long. Why? You've hit the performance cap of the disk in your DB master (or more importantly, of the most expensive single set of disks you can afford). Maybe adding app servers proportional to site-wide traffic stopped helping you keep up at some point. Turns out you're network-bandwidth bound. Maybe requests are too slow, but the apps aren't using much CPU. Why? Because you're network-latency bound. Hey, it turns out that some tight loop was doing single memcache GETs when it could have been doing a single multi-get.
That last one is just a general performance bug, but that's just it. Every bottleneck that keeps you from just adding resources is a performance bug. Assuming your app is not infinitely fast, there's always something.
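To make that last memcache example concrete, here's a hedged sketch using python-memcached (I'm not claiming this is the client or the keys reddit used); the point is collapsing N round trips into one:

    import memcache

    mc = memcache.Client(["127.0.0.1:11211"])
    link_ids = ["link_1", "link_2", "link_3"]  # made-up keys for illustration

    # Network-latency bound: one round trip per key inside a tight loop.
    links_slow = {link_id: mc.get(link_id) for link_id in link_ids}

    # One round trip for the whole batch.
    links_fast = mc.get_multi(link_ids)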
We were already using memcachedb as a persistent key-value store, "caching" precomputed listings and comment threads. When it started becoming more trouble than it was worth to maintain, we decided to try Cassandra as a beefier replacement.
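Roughly what that pattern looks like, as a hedged sketch (memcachedb speaks the memcache wire protocol; the key names, port, and listing format here are made up, not reddit's):

    import json
    import memcache  # memcachedb is accessed with an ordinary memcache client

    store = memcache.Client(["127.0.0.1:21201"])  # assuming memcachedb's default port

    # Store a precomputed listing under a key, e.g. link IDs for a hot page.
    store.set("listing:hot:programming", json.dumps([101, 202, 303]))

    # Render path: fetch the precomputed listing instead of recomputing it.
    link_ids = json.loads(store.get("listing:hot:programming"))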
Mostly by getting off of EBS
> It seemed like the object-blob Postgres store was a disaster
> then they brought in Cassandra which was initially used as a cache of sorts, but now is the primary point of contact for threads apparently?
I don't know what "primary contact for threads" means; Cassandra doesn't really do IPC.
Some datatypes are in Postgres (Links, Accounts) and some datatypes are in Cassandra (Votes, Saves/Hides).
> There was Reddit downtime today (read-only mode), which is common. What's the usual cause of that sort of thing these days?
I can't speak for today now that EBS is mostly out of the loop, but it used to be EBS issues on a DB master
I'd love to know what I did to harm you so that I can apologize.
Or do you just like giving me a hard time for the lulz?
Or maybe I'm missing someone's dickishness, who knows.
1. An exemplary specimen from this very thread: http://news.ycombinator.com/item?id=4681972
To clarify, I'm a long-time redditor who was pretty impressed with how well you guys managed given the small team you had.
My mistake! :)