
Why Reddit's been slow lately (dev group post) - rufo
http://groups.google.com/group/reddit-dev/msg/c6988091fda9672d?pli=1
======
dalore
What if instead of looking up each user in the thread to see if they are a
friend, they just serve the same page to all users? Then a bit of JavaScript
loads your friend list from the server and uses dynamic CSS to change the
style of your friends' usernames to the friend style.

That way everyone can share the cached page, and the dynamic icing is separate.

~~~
jonknee
Probably better to send the friend list as JSON down the wire with the page;
otherwise you'll double the number of requests (not something you usually want
to do when trying to scale).
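A minimal sketch of this idea, assuming a hypothetical `render_page` helper on the server: the page markup stays identical for every user (and thus cacheable), while the per-user friend list is inlined as JSON for client-side code to pick up.

```python
import json

# Hypothetical shared page body, identical for all users; only the small
# JSON blob below varies per request, so no extra HTTP round trip is needed.
PAGE_TEMPLATE = """<html><body>
<div id="comments">...shared, cacheable markup...</div>
<script>
  // Client-side code would read this and apply the "friend" CSS class.
  var FRIENDS = %(friends_json)s;
</script>
</body></html>"""

def render_page(friend_names):
    """Render the shared page with the user's friend list inlined as JSON."""
    return PAGE_TEMPLATE % {"friends_json": json.dumps(sorted(friend_names))}

html = render_page({"alice", "bob"})
```

The shared markup and the per-user blob are produced in one response, which is the point of jonknee's objection to a second request.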

~~~
enneff
You could use the client-side store (supported in a bunch of common browsers)
as a cache.

~~~
die_sekte
Or cookies. 4 kB is a lot of friends. Invalidate them every time someone adds
a friend.

Both of these, however, make cache invalidation (in the case of multiple
browsers/computers) hard, nearly impossible even.

~~~
wanderr
Cookies suck for that sort of thing because they are sent over the wire with
every request.

------
antirez
Apparently the Reddit folks don't like Redis too much (private email
exchange), but I'm practically sure that Redis could help them a great deal
here...

There are two strategies for mitigating Reddit's problems using Redis, IMHO:
one is simple to plug in, the other is more advanced.

Strategy #1: Use Redis, instead of memcached, as a cache that does not need to
be recomputed.

To do this, for all the recent "hot" stories, they should keep everything
inside Redis and update the Redis side whenever they write to the database.

For instance, they could use a Redis hash per story to store all of its
comments, indexed by comment ID for easy updates. Every time the comment page
needs rendering, a single HGETALL call fetches everything, like a cache, but
still with the ability to update single items easily (including vote counters
if needed, using HINCRBY).
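The call pattern antirez describes can be sketched as follows. To keep this runnable without a Redis server, an in-memory dict stands in for Redis; with the redis-py client the same operations would be `r.hset()`, `r.hincrby()`, and `r.hgetall()`. The key names are made up for illustration.

```python
from collections import defaultdict

# In-memory stand-in for a Redis server, just to show the call pattern.
store = defaultdict(dict)

def hset(key, field, value):        # HSET comments:<story_id> <comment_id> <body>
    store[key][field] = value

def hincrby(key, field, amount=1):  # HINCRBY votes:<story_id> <comment_id> 1
    store[key][field] = int(store[key].get(field, 0)) + amount
    return store[key][field]

def hgetall(key):                   # HGETALL comments:<story_id> -> whole thread
    return dict(store[key])

# Writing a comment updates the cache in place instead of invalidating it.
hset("comments:42", "c1", "First comment")
hset("comments:42", "c2", "Second comment")
hincrby("votes:42", "c1")           # one upvote on comment c1

page = hgetall("comments:42")       # one round trip fetches the whole thread
```

The key property is that a single write touches a single hash field, so the cached thread never has to be rebuilt from scratch.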

The same goes for friendship relations and so forth. Each piece can be
reimplemented as an updatable cache, starting with the slowest parts.

Strategy #2: Use Redis directly as the data store, killing the need for a
cache.

This needs a major redesign, but it can probably be done incrementally
starting from #1: when you use Redis as a smart cache, you write both the code
that reads the cache and the code that updates it, so eventually killing the
code that updates the "real" database makes Redis the only store.
Alternatively, the code updating the old data store can be kept around just to
maintain another copy of the whole dataset where it's easy to run complex
queries for data mining and so forth, which is something an SQL database does
well but Redis does not.

I think David King evaluated Redis against Cassandra, and he did not like the
lack of a clustering solution with failover, resharding, and so forth (which
is what we are trying to build with Redis Cluster), but I think he missed part
of the point: Redis can be used in many different ways, more as a flexible
tool than a pre-cooked solution, and in their case the "smart cache" is
probably the best approach.

If Reddit will reconsider the issue and give Redis a chance, I'm here to help.

------
mikey_p
Here's your sign: "7s of that was waiting on memcached."

If memcache is slower than your DB, you're doing something wrong.

~~~
Confusion
I don't think you've told them anything new. The problem is figuring out what
that 'something' is.

------
smackfu
"A request that I just made on my staging instance took 13s (!) to render the
front page. That's on its own cache so it should be slower than the live site,
but that's still pretty ridiculous."

I'm actually seeing that kind of speed on the live site front page.

------
aarongough
I'm curious as to how they're structuring the data for their comment trees.
Does each comment only have one parent? That parent either being another
comment or the parent article? Are they using nested sets?

My preferred alternative is to give every comment two separate parent fields:
one that always points to the parent article, and another that points to the
parent comment if it has one (null otherwise).

Structuring the data this way means that you can fetch all the comments for a
particular article _very_ quickly and if you wish simply hand that raw data
over to the client to be structured using JavaScript, which helps offload some
of the work your server would otherwise be doing...

</armchair_development>
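The two-parent scheme above can be sketched in a few lines. This is a hypothetical illustration, not Reddit's actual schema: each comment row carries both an `article_id` (so one flat query fetches the whole thread) and an optional `parent_id` (so the tree can be reassembled afterwards, here server-side for simplicity).

```python
# Flat rows as they might come back from a single "all comments for this
# article" query.  Field names are made up for illustration.
comments = [
    {"id": 1, "article_id": 7, "parent_id": None, "body": "root A"},
    {"id": 2, "article_id": 7, "parent_id": 1,    "body": "reply to A"},
    {"id": 3, "article_id": 7, "parent_id": None, "body": "root B"},
]

def build_tree(rows):
    """Turn one flat comment list into a nested tree of root comments."""
    nodes = {row["id"]: dict(row, children=[]) for row in rows}
    roots = []
    for node in nodes.values():
        if node["parent_id"] is None:
            roots.append(node)          # top-level comment on the article
        else:
            nodes[node["parent_id"]]["children"].append(node)
    return roots

tree = build_tree(comments)
```

The same loop could run in the browser over raw JSON, which is the offloading aarongough suggests.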

~~~
aarongough
So I decided to get out of my armchair and have a look at the code. It's a lot
to take in and I've never done any work in Python, but:

It looks like they're already doing part of what I proposed above. Each
comment is associated directly with a 'link', and after retrieval the tree is
sorted on the server-side.

Personally I don't see any reason why the tree couldn't be sorted client-side.
Sorting definitely seems to be one of their time sinks, especially given that
each tree has to be sorted a number of different ways (by controversy, heat,
age, score, etc.) and that the trees tend to change often (with each vote and
with each new comment).

~~~
ketralnis
> I don't see any reason why the tree couldn't be sorted client-side

The sorting isn't the expensive bit, tmk

~~~
aarongough
I'd be interested to know what percentage of the time to render a comment
thread it takes up in the live system, though.

------
iampims

      But when we render the Comment back to you in that same request we need
      the ID that the comment will have, but we don't know the ID until we
      write it out.

Wouldn't something like Snowflake help for this particular case?

      Snowflake is a network service for generating unique ID numbers
      at high scale with some simple guarantees.

<http://github.com/twitter/snowflake>

Kellan (from Flickr) has a neat post about _Ticket servers_:
<http://laughingmeme.org/2010/02/08/ticket-servers-distributed-unique-primary-
keys-on-the-cheap/>
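The idea behind Snowflake-style IDs is that a comment's ID can be minted locally, before the database write, because it is composed of a timestamp, a worker ID, and a per-millisecond sequence. Here is a toy generator in that spirit; the bit layout is illustrative, not Twitter's exact one.

```python
import time

class IdGenerator:
    """Toy snowflake-style IDs: ordered and unique without a database round
    trip, laid out as (timestamp | worker id | per-millisecond sequence)."""

    def __init__(self, worker_id):
        self.worker_id = worker_id & 0x3FF   # 10 bits of worker id
        self.sequence = 0
        self.last_ms = -1

    def next_id(self):
        now_ms = int(time.time() * 1000)
        if now_ms == self.last_ms:
            # Same millisecond: bump the 12-bit sequence counter.
            self.sequence = (self.sequence + 1) & 0xFFF
        else:
            self.sequence = 0
            self.last_ms = now_ms
        return (now_ms << 22) | (self.worker_id << 12) | self.sequence

gen = IdGenerator(worker_id=1)
a, b = gen.next_id(), gen.next_id()
```

For the quoted problem, the comment's ID would exist the moment the request handler calls `next_id()`, so it can be rendered back immediately.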

------
xtacy
The author of the post also says that the EC2 network is "slow". Does anyone
have numbers on EC2 network performance in general, and on why this is so?

~~~
mjschultz
A few months ago I read: "The Impact of Virtualization on Network Performance
of Amazon EC2 Data Center"
(<http://www.cs.rice.edu/~eugeneng/papers/INFOCOM10-ec2.pdf>) which has some
performance numbers (latency and throughput) for small and medium instances.

Unfortunately, I can't recall enough of the paper right now to give you the
nickel-and-dime overview, but it has graphs you can look at!

------
AgentConundrum
As a bit of a side note: as a regular user of Reddit, it felt a bit odd to see
this as coming from "David King".

It says something about the level of interaction between the reddit admins and
its users that I recognize him primarily as "ketralnis".

~~~
code_duck
Indeed, I had no idea who David King was, but ketralnis is familiar.

------
elblanco
Reddit is _always_ going through slow phases. I bet those probably coincide
with growth phases and they run rather lean (or so I've heard). I'd be more
worried if they started getting very fast, that might mean their user base is
shrinking and they have too much infrastructure.

------
shuri
Start with graceful degradation when things get tough? (Fewer comments, top
friends only, ...)

~~~
krakensden
My initial worry about that would be datastore support: can you really do that
efficiently with Postgres/memcached?

~~~
shuri
In memcached, caching smaller things should allow more to be cached. On the
Postgres side, when the disks are hard at work any access is expensive but
reading less should still help. Depending on the query you can try to get it
to read less. In other situations maybe just turn stuff off. I don't know the
specifics but simple things like not displaying the exact number of comments
may help (counting stuff can be frustratingly expensive sometimes).
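The "don't show exact counts" trick can be sketched concretely. A minimal example, assuming hypothetical table and column names and using the stdlib `sqlite3` module as the database: capping the COUNT with a LIMIT-ed subquery lets the database stop scanning after a handful of rows, and big threads just display "N+".

```python
import sqlite3

CAP = 3  # illustrative; a real site might cap at 500

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE comments (article_id INTEGER)")
db.executemany("INSERT INTO comments VALUES (?)", [(7,)] * 10)

def comment_count_label(article_id):
    # LIMIT CAP+1 inside the subquery bounds the scan; COUNT(*) outside
    # then never sees more than CAP+1 rows.
    (n,) = db.execute(
        "SELECT COUNT(*) FROM (SELECT 1 FROM comments "
        "WHERE article_id = ? LIMIT ?)", (article_id, CAP + 1)).fetchone()
    return f"{CAP}+" if n > CAP else str(n)

label = comment_count_label(7)
```

The display degrades gracefully (an exact count for small threads, a floor for large ones) while the expensive full scan is avoided.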

