

Reddit explains their architecture, scaling and recent downtime - mqt
http://blog.reddit.com/2010/03/and-fun-weekend-was-had-by-all.html

======
KrisJordan
"It turns out that this one little decision makes it so that we can't
horizontally scale that layer of our architecture without losing all the data
already there (because all of the keys would point to the wrong server if we
added a new one)."

I'm surprised they're looking to switch stacks entirely as opposed to a
consistent hashing / distributed hash table (a la Chord, Dynamo, etc.)

<http://en.wikipedia.org/wiki/Consistent_hashing>

<http://en.wikipedia.org/wiki/Chord_(peer-to-peer)>
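
To make the problem concrete, here is a toy sketch (not reddit's code) of how a
naive hash-mod-N scheme strands existing keys the moment a server is added:

    # Toy illustration: with hash(key) % N sharding, growing the cluster
    # from 3 to 4 servers changes the answer for most keys, so lookups
    # against the old layout miss the data.
    import hashlib

    def shard(key, num_servers):
        # Use the MD5 digest as an integer, then take it modulo the server count.
        h = int(hashlib.md5(key.encode()).hexdigest(), 16)
        return h % num_servers

    keys = ["comment:%d" % i for i in range(10000)]
    moved = sum(1 for k in keys if shard(k, 3) != shard(k, 4))
    print("keys that change servers going 3 -> 4: %d of %d" % (moved, len(keys)))
    # Roughly 3/4 of the keys move; a consistent hashing ring would move ~1/4.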

~~~
jedberg
We can't just switch hashing at this point, because none of the data would be
in the right place.

Since we have to move all the data anyway, we figured now would be a good time
to switch stacks. MemcacheDB isn't really a very solid product, so even if we
scale it, it isn't a good long-term solution.

------
showerst
I love how transparent reddit is about these sorts of things. In their video
from PyCon, they even discussed their server hosting bills!

Crazy how such a minor design decision (not using consistent hashing) could
have such huge consequences down the line...

------
kscaldef
"[memcached] can no longer return data fast enough for our needs, due to the
way it interacts with BDb (its underlying data store)."

Does anyone know what the specific issue being alluded to here is?

~~~
patio11
MemcacheDB periodically flushes its in-memory DB to disk, using BDb. That
causes a global read/write lock for a looooong time. (This pauses BCC for 5
seconds at a time when it happens, and I only have 20 MB stored in memcachedb.
To rectify this I'm moving to Redis as soon as I get a day free. Suffice it to
say that if Bingo Card Creator taxes the architecture of your key/value store,
you may not quite be ready for prime time.)
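
To illustrate the failure mode (a toy sketch, not MemcacheDB internals): if the
checkpoint to disk is taken while holding the store's only lock, every read
stalls for the duration of the flush.

    # Toy sketch of a store whose flush holds a global lock, so concurrent
    # reads block until the checkpoint finishes.
    import threading, time

    class BlockingStore:
        def __init__(self):
            self.data = {}
            self.lock = threading.Lock()

        def get(self, key):
            with self.lock:            # readers queue up here during a flush
                return self.data.get(key)

        def set(self, key, value):
            with self.lock:
                self.data[key] = value

        def flush(self):
            with self.lock:            # global lock held for the whole checkpoint
                time.sleep(5)          # stand-in for writing a large on-disk snapshot

    store = BlockingStore()
    store.set("k", "v")
    threading.Thread(target=store.flush).start()
    time.sleep(0.1)                    # let the flush grab the lock first
    t0 = time.time()
    store.get("k")                     # blocks until the flush releases the lock
    print("get() took %.1f seconds" % (time.time() - t0))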

~~~
antirez
I may be a bit biased, but if what they want is a fast persistent cache, Redis
should really be a good fit.

~~~
zepolen
Does Redis block when writing to disk?

~~~
antirez
No, writes are asynchronous.
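
For context, a minimal sketch (assuming the redis-py client and a local Redis
server, not reddit's setup): a background save forks a child process to write
the snapshot, so the server keeps answering commands while the dump is in
progress.

    # Sketch assuming the redis-py client and a Redis server on localhost.
    import redis

    r = redis.Redis(host="localhost", port=6379)
    r.set("greeting", "hello")

    r.bgsave()                 # snapshot to disk in a background child process
    print(r.get("greeting"))   # reads and writes are not blocked by the save
    print(r.lastsave())        # time of the last completed snapshot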

------
synnik
"It is a highly customized user experience, on par with something like
Facebook (just not as many users)"

Really, Reddit?

While I don't doubt that they have complexities to deal with, this sounds like
a skewed perspective: either overestimating their own complexity or
underestimating Facebook's.

~~~
jedberg
We have had multiple discussions with the Facebook folks, and are well aware
of the complexities of both sites. I believe you are actually underestimating
reddit's complexity.

Here are some examples:

When you load a comments page with 500 comments, there are 1000s of data
points that have to be loaded. For every comment, we have to check how you
voted on it to draw the arrows, we have to check if you are the author, we
have to check if you are allowed to remove that comment as a moderator, we
have to check if you can edit it, if the author is your friend, and so on. And
we have to do that 500 times.

When you load a listing, we have to pull your subscriptions, and then merge
the results from all the reddits you subscribe to. Then we have to check most
of the same things as we do for comments.

After all that checking, we have to render a page that is customized just
for you. Some of that will come from the render cache and some from the data
cache, but it is still highly customized.
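
To make that concrete, here is a hypothetical sketch (illustrative names only,
not reddit's code) of the per-comment checks: even with the user's votes,
friends and moderated reddits pulled in bulk up front, every comment on the
page still needs its own round of checks before it can be rendered.

    # Hypothetical per-comment annotation pass; votes, friends and moderated
    # were bulk-fetched once for the logged-in user before this loop runs.
    def annotate_comments(user, comments, votes, friends, moderated):
        out = []
        for c in comments:                                  # ~500 per comments page
            out.append({
                "id": c["id"],
                "vote": votes.get(c["id"]),                 # which arrow to highlight
                "is_author": c["author"] == user["name"],   # show edit/delete links
                "can_remove": c["subreddit"] in moderated,  # show moderator controls
                "author_is_friend": c["author"] in friends  # friend highlighting
            })
        return out

    # Tiny usage example with made-up data:
    user = {"name": "alice"}
    comments = [{"id": 1, "author": "bob", "subreddit": "pics"},
                {"id": 2, "author": "alice", "subreddit": "funny"}]
    print(annotate_comments(user, comments,
                            votes={1: "up"}, friends={"bob"}, moderated={"funny"}))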

The big difference is that Facebook doesn't have any logged-out users, whereas
we do.

Akamai takes care of our logged-out users, but rendering the page for a
logged-in user is extremely complicated.

~~~
synnik
Thanks for the reply --

And I'm not trying to be difficult, but... I'm not convinced.

In your scenario, the 500 comments are already loaded for any user. The
comments you have authored are a small dataset, not a check for each and every
comment. (At least, I HOPE not.) Similarly for your friends' comments... and
whether or not you are a mod is a single check.

For the average thread and user, this really should be a handful of checks.
Surely, there are some very active users with many friends for whom the checks
could hit numbers in the 1000s. But that should be the edge case, shouldn't
it?

Same thing for loading a listing. Pulling subscriptions and merging sounds
like just DB query design. All the "same things as we do for comments" should
be minimized as described above.

Rendering a page is just that -- rendering... highly customized, yes. But
customized on the previously discussed criteria, not a whole new set of
customizations.

And like you said, if you aren't logged in, none of this applies.

Which all raises two questions: 1) What percentage of users are logged in? 2)
What percentage of page renders really fit the scenario you described above,
with enough comments and related data points that the checks hit 4 digits?

If you come back and say 90% of your page renders need 1000 data points, I'd
certainly concede the point. But I'd be very surprised. (And I'd highly suggest
that new algorithms are needed.)

But I am sincerely curious what those percentages are.

------
papaf
Two things surprise me about this article - probably because I've
misunderstood it and don't see the big picture.

One is that there are master and slave databases and searches are done off the
master - I've always seen them done off the slaves in other systems. The other
is that they state that using MD5 doesn't allow for horizontal scaling. One of
the qualities of MD5 is that all bits have an equal probability of being 0/1.
Surely the last 1 or 2 bits can be used to indicate which server is holding
the data?

~~~
smanek
Searches are likely done off slaves - I suspect that is not presented properly
because of the oversimplification of the diagram.

You can just use a few bits from an MD5 hash to pick a server, as long as you
know up front how many servers you're going to have. The problem is that if
you later wanted to add or remove a server, you would need to come up with a
new scheme and move every piece of data around so it's on the right server
(which would take days/weeks).

The more scalable/flexible solution is to use a consistent hashing algorithm
(check out some of the papers on Chord) so that adding or removing a server
doesn't require you to move as much data around.
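
A minimal consistent-hashing sketch (toy code: no virtual nodes or replication)
shows the difference -- adding a server only claims one arc of the ring, so
only the keys on that arc relocate:

    # Each server owns the arc of the hash ring up to its position; a lookup
    # walks clockwise to the first server at or past the key's hash.
    import bisect, hashlib

    def _hash(value):
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    class HashRing:
        def __init__(self, servers):
            self.ring = sorted((_hash(s), s) for s in servers)

        def add(self, server):
            bisect.insort(self.ring, (_hash(server), server))

        def lookup(self, key):
            points = [p for p, _ in self.ring]
            i = bisect.bisect(points, _hash(key)) % len(self.ring)  # wrap around
            return self.ring[i][1]

    keys = ["comment:%d" % i for i in range(10000)]
    ring = HashRing(["cache1", "cache2", "cache3"])
    before = {k: ring.lookup(k) for k in keys}
    ring.add("cache4")
    moved = sum(1 for k in keys if ring.lookup(k) != before[k])
    print("keys moved after adding cache4: %d of %d" % (moved, len(keys)))
    # Only the keys on cache4's new arc move (about 1/N on average),
    # versus roughly 3/4 with a plain hash(key) % N scheme.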

