
Caching at Reddit - d23
https://redditblog.com/2017/1/17/caching-at-reddit/
======
chime
> When you vote, your vote isn’t instantly processed—instead, it’s placed into
> a queue.

I remember looking into this a while ago and was bewildered to find that when
I upvoted or downvoted, there was no XHR call to the backend! There was no
hidden iframe/image, no silent form post. Absolutely no network activity. Yet
when I refreshed, my vote was shown correctly. I thought I was going crazy.

This was long ago so I'm a bit fuzzy on the details but after a bit of
digging, I found the most elegant data collection technique I've ever seen.
Instead of sending network data when I voted, a local cookie was set with the
link id and vote value. Then when I went to another page, my browser naturally
sent the cookie to the server, where I believe it was processed, and then a
fresh cookie was sent back to my browser. I could vote on 10 links, the local
cookie would get large and then on the next page refresh, the backend would
receive my batch of votes, process them, and send me a fresh cookie again.
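The scheme described above can be sketched in miniature. All names and the cookie payload format here are my invention (the real cookie format is long gone); a plain string stands in for the browser cookie, and `apply_vote` stands in for reddit's real vote-processing queue:

```python
import json

VOTES = {}  # link_id -> net score, standing in for the real vote store


def add_vote(cookie_value, link_id, direction):
    """Browser side: instead of firing an XHR, push the vote onto the
    queue stored in the cookie and return the updated cookie value."""
    pending = json.loads(cookie_value) if cookie_value else []
    pending.append({"link": link_id, "dir": direction})
    return json.dumps(pending)


def drain_votes(cookie_value):
    """Server side, on the next ordinary page load: process the queued
    votes the browser sent along, then hand back a fresh empty cookie."""
    pending = json.loads(cookie_value) if cookie_value else []
    for vote in pending:
        apply_vote(vote["link"], vote["dir"])
    return "[]"  # fresh cookie for the browser


def apply_vote(link_id, direction):
    VOTES[link_id] = VOTES.get(link_id, 0) + direction
```

The elegance is that the batch rides along on a request the browser was going to make anyway, so voting itself costs zero network round-trips.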

I don't think they do that now and I've never seen anyone do something like
this. Even HN just makes an XHR call on voting. After twenty years on the web,
it's not often that I am surprised so this was quite a thrill.

~~~
dirtae
What if the next page that you go to isn't on Reddit?

~~~
jhanschoo
Then the next time you visit reddit, I guess.

~~~
test1235
Maybe I'm not a 'normal' user so it wouldn't have mattered to them, but I used
to open loads of tabs at once, slowly make my way across all the stories, and
then, more often than not, close the browser, at which point all my cache,
cookies, history and whatnot get cleared.

This would mean none of my votes would ever have made it to the server!

~~~
dspillett
To an extent it might be worth letting those votes be lost to keep things
simple, if only relatively few users browse that way.

There are ways around it if it were a problem though:

* Have a timeout on updates. If a vote has been sitting in the local cookie for some time, send an XHR request to push the current queue to the server.

* Use onunload event handlers to push any remaining queued votes when a tab closes. You could even try to maintain a count of how many tabs are currently open (in the same cookie) and only send that request when the last one closes.

* If the queue of votes reaches a certain length, also send an XHR request to register them.

Probably not perfect (I have in the past found onunload events to not be
terribly reliable) but it would capture most of the otherwise lost votes you
describe. It might be too much work to write/debug/maintain though, if the
number of lost votes would be small anyway.
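The timeout and queue-length heuristics above could be sketched like this (names are mine; `flush` stands in for the XHR, and the clock is injectable so the logic is testable):

```python
import time


class VoteQueue:
    """Queue votes locally; flush to the server when the oldest queued
    vote is too stale or the queue grows too long."""

    def __init__(self, max_age_seconds=30, max_len=10, now=time.monotonic):
        self.max_age = max_age_seconds
        self.max_len = max_len
        self.now = now
        self.queue = []    # (timestamp, vote) pairs
        self.flushed = []  # what the "XHR" has delivered so far

    def add(self, vote):
        self.queue.append((self.now(), vote))
        self._maybe_flush()

    def _maybe_flush(self):
        too_long = len(self.queue) >= self.max_len
        too_old = self.queue and (self.now() - self.queue[0][0]) >= self.max_age
        if too_long or too_old:
            self.flush()

    def flush(self):
        # In a browser this would be the XHR (or an onunload handler).
        self.flushed.extend(v for _, v in self.queue)
        self.queue.clear()
```

As noted, onunload is the unreliable part; the timeout and length triggers cover most of it without depending on the tab closing cleanly.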

------
slavik81
> Performance matters.

It took 8.4 seconds to load the Reddit front page on my phone. Hacker News
took 1.1 seconds. This feels like advice from the overweight gym teacher on
how to do pushups.

The desktop Reddit site took 2.2 seconds over the same connection, by the way.
It seems like it would be much more valuable to optimize whatever is taking up
>75% of page time on mobile.

~~~
raldi
HN serves the same front page to all users, other than the bar at the top. To
make a reddit front page, you have to look up the hottest results of ~100
subreddits and shuffle them all together in the correct order. Literally every
time a vote is cast, that could change the front page for every single user,
in a different way for each one.
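The shape of that work can be sketched: each subreddit keeps its own hot-sorted listing, and one user's front page is a merge of the top slices of ~100 of them. Since any vote can change a hot score, there is no single shared result to cache (listing format here is illustrative):

```python
import heapq
from itertools import islice


def front_page(listings, n=25):
    """Merge per-subreddit hot listings into one front page.

    Each listing is a list of (hot_score, link) pairs, already sorted
    hottest-first, so we can lazily merge them in descending order."""
    merged = heapq.merge(*listings, reverse=True)
    return [link for _score, link in islice(merged, n)]
```

The merge itself is cheap; the expense raldi describes is that the inputs differ per user (their subscriptions) and shift under every vote.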

~~~
swsieber
False. It cannot, since it has to look up the vote status for logged-in
users.

~~~
flashman
I dunno man, I think I'm going to trust the former Reddit admin (/u/raldi)

~~~
Strom
Blind trust in such a simply verifiable case is quite naive.

You can easily open up the front page of HN as a logged in user and see that
it contains information about which stories you've voted for and which you've
flagged. On top of that, you can click "hide" to hide stories. These will only
be hidden for your specific account, not for every person who loads the front
page.

What's more, the go-to call-to-action by HN admins during really popular
stories is to have people log out, because logged in users don't get cached
results. Have a look at the comment by HN admin dang on the Trump winning
story. [1] The key part being " _please log out to read HN today, unless you
want to comment. Then we can serve you from cache_ ".

[1]
[https://news.ycombinator.com/item?id=12907201](https://news.ycombinator.com/item?id=12907201)

~~~
justinlaster
That's not the same thing as showing different results _per_ user. It's just
tacking on the user's operations when it renders out the front page.

They're two completely different operations, with completely different
complexities behind them.

The "hide" feature is not analogous to what reddit is doing.

~~~
raldi
Exactly. It's the difference between "Look up the first 100 people in this
phone book not named Justin" and "Look up the 100 Justins that come first
alphabetically across these 100 phone books"

------
sciurus
The pain of static slab allocation is real! Changing usage patterns causing
problems can be tricky to track down too; mcsauna looks helpful for this.
Upgrading to memcached 1.4.25 and running with
"slab_reassign,slab_automove,lru_crawler,lru_maintainer" was a huge
improvement for our primary memcached cluster at Eventbrite.

------
jjoe
The slowness of page load mentioned by folks here is the reason why I think
caching at the HTTP level (ex: Varnish) is much more efficient than caching at
the service level (ex: Memcached), which is much further down the stack and is
bound to be latency-sensitive. HTTP-level caching is also much less entangled
in your code and your infrastructure (less technical debt). A hybrid approach
can work too, but only if it's light and unobtrusive.

By the way, and I'm going out on a limb with my shameless plug, I built a
Varnish-as-a-Service kind of infrastructure called Cachoid (
[https://www.cachoid.com](https://www.cachoid.com) ). But in my own defense,
I'm putting my energy, time, and money where my mouth is.

~~~
ssambros
To my understanding, Reddit serves highly customized content to each logged
in user. Can you help me understand how HTTP level caching will solve this
problem more efficiently than their service level caching does right now?

~~~
jjoe
You cache non-logged-in users to start with. And then you cache based on
sessions (logged-in users), because you don't really need to show fresh votes
on every visit, right away (the article admits as much). Plus there's lots of
room for ESI.

~~~
Klathmon
They already do those things.

Logged out users see a "snapshot" of the page updated every so often.

And I really don't think that caching pages per session would help with their
load all that much. Why not just use HTTP cache headers at that point?

Plus while you don't really need to show votes ASAP, logged in users will want
up to date comments.

~~~
jjoe
This is where ESI helps. You can cache the portions of the page that don't change much.

------
jrowley
The permacache seems pretty clever to me.

> For example, when new comments are added or votes are changed, we don’t
> simply invalidate the cache and move on—this happens too frequently and
> would make the caching near useless. Instead, we update the backend store
> (in Cassandra) as well as the cache. Fallback can always happen to the
> backend store if need be, but in practice this rarely happens. In fact,
> permacache is one of our best hit rates—over 99%.

They basically have their application state duplicated in both places.
Interesting architectural choice.

~~~
hehheh
This sounds an awful lot like an old(ish?) technique termed "write-through
caching". Building data structures that can take advantage of write-through
caching at reddit-comment-scale does sound like it'd be an interesting problem
to optimize.

~~~
jedberg
That's basically what it is. The data is abstracted away so that when you do a
write it just goes to both places as needed.
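A minimal sketch of write-through caching as the quote describes it, with plain dicts standing in for memcached (the permacache) and Cassandra:

```python
class WriteThroughStore:
    """Writes go to both the cache and the backend; reads are served from
    the cache with a rare fallback to the backend on a miss."""

    def __init__(self):
        self.cache = {}    # permacache (memcached in reddit's case)
        self.backend = {}  # authoritative store (Cassandra)
        self.hits = self.misses = 0

    def write(self, key, value):
        # Update both places instead of invalidating the cache; this is
        # what keeps the hit rate high under frequent comment/vote writes.
        self.backend[key] = value
        self.cache[key] = value

    def read(self, key):
        if key in self.cache:
            self.hits += 1
            return self.cache[key]
        # Fallback: repopulate the cache from the backend store.
        self.misses += 1
        value = self.backend.get(key)
        if value is not None:
            self.cache[key] = value
        return value
```

The trade-off is exactly the "state duplicated in both places" jrowley notes: you pay double writes and memory to avoid invalidation storms.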

------
bluedino
I wonder how much better/worse the site would run if they had their own
hardware like StackExchange. And I wonder how StackExchange would run if it
were on AWS.

~~~
nasalgoat
In my experience running large sites, dedicated hardware is not only cheaper
but orders of magnitude faster, because you can finely tune the hardware to
very, very specific use cases, and you can have a fully private, highly
optimized network with very small cable runs to effectively eliminate network
latency issues.

Most of my current job involves solving problems that are caused by cloud
limitations.

~~~
mtanski
And if you need throughput, 40GbE / 56GbE is not that expensive with your own
hardware. In the cloud you end up with something like 4x to 10x the servers to
handle the same load.

------
eunoia
Some back of the envelope math for their caching costs:

54 x R3.2xlarge EC2 instances

On demand = $314,571.6/year

w/ 1 year term = $166,860/year

w/ 3 year term = $110,340/year

w/ convertible 3 year term = $150,174/year

Is that a lot? Seems like a lot
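The on-demand figure above can be reproduced from the hourly rate; the rate here is inferred back from the quoted total (it matches the r3.2xlarge on-demand price in us-east at the time, ~$0.665/hr):

```python
instances = 54
hourly_rate = 0.665        # USD per instance-hour (inferred from the total)
hours_per_year = 24 * 365  # 8760

on_demand_per_year = instances * hourly_rate * hours_per_year  # ~$314,571.60
```

The reserved-instance figures follow the same shape with the discounted effective rates.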

~~~
degenerate
For a text link site that lets you vote submissions up and down? Yes that's a
lot. I'm sure a lot of the decisions at reddit corp were along the lines of,
"well, we can't go back and change that now, so let's do whatever we need to
fix it and move forward" AKA lots of RAM and caching all over the place.

~~~
mwpmaybe
If a company is grossing $20MM/year and stands to improve that by more than
0.75% by having a faster site, then yes, it makes sense to spend 0.75% of your
revenue to do so.

------
bbeausej
Thanks for sharing the details. It's impressive to see the memory allocation
and pool size for a site handling this much traffic. I would love to get some
more information on reddit's platform overall traffic volume as I feel this
would complement the discussion nicely.

------
nodesocket
I wonder if switching from memcached to redis would make a bottom line
difference in terms of the number of instances needed (cost) and performance?

~~~
deedubaya
Maybe a difference in performance, but caching comes down to bits-to-be-
stored. That likely wouldn't change.

~~~
cookiecaper
redis does have some built-in features that automatically compact data. [0]
I'm sure reddit has considered this, but worth noting.

[0] [https://redis.io/topics/memory-
optimization](https://redis.io/topics/memory-optimization)

~~~
sgmansfield
Client-side compression works out better for a couple of reasons:

1. You use less bandwidth in/out of the box. This matters in AWS because your
bandwidth is (relatively) limited.

2. Your CPU usage to (de)compress is distributed over a larger fleet. The
clients spend the CPU time, not the cache servers.
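Point 2 is roughly what common memcached client libraries do with their `min_compress_len` knob: compress on set, decompress on get, so the CPU cost lands on the large web-server fleet rather than the cache boxes. A sketch, with a dict standing in for the memcached cluster:

```python
import json
import zlib

CACHE = {}  # stands in for the memcached cluster


def cached_set(key, value, min_compress_len=100):
    """Serialize, and compress only values big enough to be worth it.
    A one-byte tag records whether the entry is compressed."""
    raw = json.dumps(value).encode()
    if len(raw) >= min_compress_len:
        CACHE[key] = b"Z" + zlib.compress(raw)
    else:
        CACHE[key] = b"R" + raw


def cached_get(key):
    blob = CACHE[key]
    raw = zlib.decompress(blob[1:]) if blob[:1] == b"Z" else blob[1:]
    return json.loads(raw)
```

Real clients use a flags field rather than a tag byte, but the division of labor is the same: the cache servers only ever see opaque (smaller) bytes.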

------
Florin_Andrei
Are those visualizations done in Grafana?

~~~
dkasper
Yes, Grafana with Graphite as the backend store.

------
QuercusMax
Ironic timing that Reddit is undergoing a major outage right now...

~~~
gooeyblob
Not really, there were some routing issues with our CDN affecting an extremely
small percentage of traffic.

~~~
QuercusMax
Seemed to be down for lots of people in the bay area:
[http://downdetector.com/status/reddit](http://downdetector.com/status/reddit)

Maybe only small because most Californians were still sleeping? I was only up
due to a sick kid. :D

------
egonschiele
What's the difference between mcrouter vs something like haproxy?

------
ksec
Slightly off topic: as of 2017, what's the advantage of Memcached over Redis?
I thought we were basically in the era of Redis.

~~~
antirez
It's hard to talk in general terms, but Memcached is threaded, so you can
saturate your CPUs without requiring multiple instances, while Redis has more
advanced features, both in its programmer-facing API and operationally. But if
we want to zoom in on Reddit itself: changing the caching paradigm to really
use Redis the proper way, that is, storing metadata in Redis data structures
and not only in their main DB, would, I suspect, provide a very big boost to
Reddit. For some reason Reddit has always been an "anti Redis" shop. I'm sure
they have their good reasons, and btw I love Reddit too much to complain;
whatever they do to run it, I don't care, as long as they provide such a
wonderful service to the community :-)

But... their use case is IMHO one that you can accelerate tremendously by
using Redis. I ran the most popular Italian Reddit-alike site for years, and I
wrote a simple Reddit clone that uses Redis so I had the opportunity of
exploring the problem a bit.
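The idea antirez is gesturing at, as I read it: keep listing metadata (hot scores) in sorted-set form so the listing is read directly rather than recomputed from the main DB. A pure-Python stand-in, with the equivalent Redis commands in comments (key names are made up):

```python
SCORES = {}  # link_id -> score; Redis: one sorted set per listing


def on_vote(link_id, delta):
    # Redis: ZINCRBY hot:frontpage delta link_id
    SCORES[link_id] = SCORES.get(link_id, 0) + delta


def hot_listing(n=25):
    # Redis: ZREVRANGE hot:frontpage 0 n-1
    return sorted(SCORES, key=SCORES.get, reverse=True)[:n]
```

With real sorted sets the vote path and the listing read are both O(log N)-ish server-side operations, which is the "very big boost" being suggested.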

~~~
gooeyblob
Reddit is most definitely not an "anti Redis" shop! If we were to rewrite it
today, we might start with Redis. We've just put so much work into making
memcached reliable and easy to understand for developers that the possible
benefits of switching to Redis for many of these use cases don't outweigh the
operational knowledge we have from years of running memcached at this scale.

That said - we use Redis in at least 2 (soon to be 3) capacities across the
site for different services (outside of the monolith) and it works really
well.

~~~
antirez
Thanks! That's great to know. Please ping me anytime if you want certain
features or alike. Thanks for your work on Reddit, I love it.

------
akjainaj
Taking into account how reddit performs, I'll take this as a guide on how not
to use cache.

(This is a joke, please understand it as such. I know reddit has the problems
it has because it is severely understaffed.)

~~~
dageshi
Honestly I do recall periods when reddit wasn't performing particularly well
but recently it's seemed fine to me.

~~~
akjainaj
For me all pages take 3+ seconds to load (generation time), which means reddit
is easily the slowest site I visit.

~~~
praneshp
Have you tried turning off all adblockers on your browser? Or is your time
before all those come into play?

~~~
pfg
FWIW, a GET / (front page) while logged in takes 1-1.5s for me (TTFB). ~3s
till load with ads blocked, ~5.5s with ads. Both numbers without caching, but
that only seems to shave off about 200ms (most of the time is spent in JS and
rendering the site).

~~~
praneshp
Okay, thank you.

