

How Google Taught Me to Cache and Cash-In - timf
http://highscalability.com/how-google-taught-me-cache-and-cash

======
amix
Following this guide blindly is a bad idea. Premature caching can be
dangerous: you may end up caching stuff that gets hit once, you may miss
caching stuff that gets hit many times, and you may end up making your
caching a lot more complex than it needs to be.

A better caching strategy is to collect data (for example, which queries are
used the most) and then optimize your caching based on that data. That is,
don't apply caching blindly; apply it by profiling your application. This
approach also works for testing how new caching strategies affect your cache
and your database.
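
A minimal Python sketch of that profile-first idea (the `run_query` callable
and the hit threshold are assumptions, not anything from the article):

    from collections import Counter

    query_counts = Counter()
    cache = {}
    HOT_THRESHOLD = 100  # only cache queries seen at least this often

    def cached_query(sql, run_query):
        query_counts[sql] += 1
        if sql in cache:
            return cache[sql]
        result = run_query(sql)
        # Cache only once profiling shows the query is actually hot,
        # instead of caching everything from day one.
        if query_counts[sql] >= HOT_THRESHOLD:
            cache[sql] = result
        return result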

~~~
lucifer
You let the cache expire stale objects and update it on db IO. What's the
problem, exactly?

~~~
amix
Updating the cache is far from easy. An example that showcases the non-
triviality is MySQL's query cache: it will purge the whole cache on a table
update. If you get a lot of reads this works great, but if you get a lot of
updates the MySQL query cache will be very inefficient, as the cache is
purged on every update. The MySQL query cache has a lot of information
available that could be used to implement a smarter cache; unfortunately,
this problem is far from easy to solve even with all that information
available.

The bottom line: good luck on expiring the cache on database updates ;)
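
One well-known workaround for this kind of coarse table-level invalidation is
the memcached "generation key" trick: namespace every cached query under a
per-table version number and bump it on writes, so stale entries simply
become unreachable. A minimal sketch, assuming a memcached-style client `mc`
with get/set/incr (the key names are illustrative):

    def table_generation(mc, table):
        gen = mc.get("gen:" + table)
        if gen is None:
            gen = 1
            mc.set("gen:" + table, gen)
        return int(gen)

    def cache_key(mc, table, sql):
        # The generation number is part of the key, so bumping it
        # implicitly invalidates every cached query for the table.
        return "q:%s:%d:%d" % (table, table_generation(mc, table), hash(sql))

    def invalidate_table(mc, table):
        mc.incr("gen:" + table)  # one increment "purges" the whole table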

~~~
caffeine
What I did in Rails on a previous app was:

- Tie hooks into the logger so I can track exactly which ActiveRecord models
are accessed in a given page view.

- Have 100% test coverage AND make sure the coverage pattern (i.e. the
frequency distribution of line execution) at least looks something like
production AND use a subset of production data for testing.

When tests are run:

- Use the logger hooks to build the _tree_ of partials included in each view,
and the models included in each view/partial (i.e. we got a User which had
Friends and an Avatar which had a PicResource, etc.), where the models are
linked by foreign-key relations and pages/partials are linked by inclusion.
This is why you need 100% coverage & real data - you don't want to miss
anything.

Then:

- Tie cache-busting code to ActiveRecord's after_save hook (sketched after
this list). When a table is updated, you go to every tree in which that model
appears and walk up until you hit the highest partial. You bust that partial
and then you bust the top-level page.

- If you want to tune this further (for expensive pages), you can
differentiate between CREATE and UPDATE. On CREATE you need to bust the whole
cache for those partials/pages. On UPDATE you only need to bust existing
cached renders of those pages, so you can just store the N most-viewed
pages/partials with those models and keep their IDs around.
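
A rough Python sketch of that after_save step (the original was Rails; the
dependency map and the `bust` callback here are assumptions):

    from collections import defaultdict

    # model name -> set of (partial, top_level_page) pairs, built from
    # the logger hooks during test runs
    dependency_tree = defaultdict(set)

    def record_render(model, partial, page):
        dependency_tree[model].add((partial, page))

    def after_save(model, bust):
        # Walk every tree this model appears in, bust the highest
        # partial, then bust the top-level page that includes it.
        for partial, page in dependency_tree[model]:
            bust(partial)
            bust(page)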

Why we did this:

I realize this isn't typical for a web app. Our app had to render binaries for
embedded devices, so we had to compile everything into a single file before it
went out the door. This was a highly interactive app, so pages changed all the
time. Using this system, we could:

a) maintain quick response times by almost never rendering on page view
(rendering was compute-intensive and required about 500ms per page)

b) pre-render often-used and often-changed pages

c) save our precious database from wear and tear

d) do all this WITHOUT harassing the page/content developers, who could throw
together new rails views/controllers using whatever AR models they please,
without having to maintain onerous "cache-busting" lists or strict model-
controller ties.

(I've omitted some details here, like dealing with cyclical dependencies .. no
fun otherwise :)

------
alexandros
I wonder if anyone's gone the next step and written updates to the caches
directly, leaving the database write out of the critical path.

~~~
laut
Do you mean that when you get an update, you both write to the database and
perform an update in the cache? If so, you would want it to be a transaction
so that the cache and database don't get out of sync.

Maybe a write cache, like on hard drives with write-back caching: you would
first write the update to a cache (a queue of updates), and then it would be
written to both the cache (cached HTML, for instance) and the database in one
transaction.

But I think it sounds like more trouble than it's worth.
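
For what it's worth, a bare-bones sketch of that write-behind queue (the `db`
and `cache` objects and their `set` methods are placeholders; note a crash
between the two writes still loses the atomicity a real transaction would
give you):

    import queue
    import threading

    writes = queue.Queue()

    def enqueue_update(key, value):
        writes.put((key, value))  # the only thing on the critical path

    def writer(db, cache):
        while True:
            key, value = writes.get()
            db.set(key, value)     # persist first...
            cache.set(key, value)  # ...then refresh the cached copy
            writes.task_done()

    # threading.Thread(target=writer, args=(db, cache), daemon=True).start()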

~~~
alexandros
Transactions are unnecessary unless you are looking for 100% consistency
between cache and database at all times. Usually this is not the case, and if
it is, you will have extremely serious problems scaling. All I'm saying is: if
you have an update that you know will invalidate some cached pages, why
invalidate them and then re-render them from the database once the update is
committed, instead of updating them directly and lazily queueing the database
update? That takes load off the database (less pressure to commit instantly,
and fewer reads). In the Facebook inbox example, you know that a new message
will increase the user's message count by one. Just update the cache directly
with this new info. When/if it explodes, by all means, re-render everything.
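
A minimal sketch of that direct update, assuming a memcached-style client
`mc` (the key name and the `rerender_inbox` fallback are illustrative):

    def on_new_message(mc, user_id, rerender_inbox):
        # We know a new message bumps the count by exactly one, so
        # mutate the cached value in place instead of invalidating
        # and re-querying.
        if mc.incr("unread:%d" % user_id) is None:
            # Cache miss (or it "exploded"): fall back to a rebuild.
            mc.set("unread:%d" % user_id, rerender_inbox(user_id))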

------
Edinburger
Another idea: populate ('warm') the cache automatically after server restarts
to get up to speed quickly.
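
A minimal warm-up sketch (`cache`, `top_pages`, and `render` are placeholders
for whatever the application provides):

    def warm_cache(cache, top_pages, render):
        # Pre-render the most popular pages before taking traffic.
        for path in top_pages:  # e.g. the N most-viewed URLs from logs
            cache.set("page:" + path, render(path))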

------
jasonkester
One extra point: Don't Add Caching Yet.

1. Wait until something presents itself as a bottleneck.

2. Optimize it until it's not a bottleneck anymore.

3. Wait until it presents itself as a bottleneck again anyway.

...Then add caching.

You'll never find the low-hanging yet dog-slow fruit if you put your
aggressive caching scheme in place right off the bat. If you have a bunch of
poorly optimized code running with a ton of caching to hide it from you and
you suddenly have enough traffic to cause scaling pain, that's a big problem.

Speed it up. _Then_ cache it.

------
desu
Seems to mainly be about Reddit.

Didn't seem to answer the #1 question I had in my mind - how do they cache
votes for logged-in users? Do they, say, cache a generic front page and then
apply the user's prior voting choice via JS? Or do they cache the page only
for the general public, and for logged-in users just cache their voting
history and generate it on-the-fly?

However they do it, they do a damn good job. I'm always impressed by how
responsive Reddit is, along with the other highly dynamic / high load sites
like Digg, etc. They manage to stay up and running fine even with all that
going on; but just a link on their front page to someone's blog, which should
be almost static, crushes that server into goo. Shows the power of good
design!

~~~
uggedal
Use the source, Luke: <http://code.reddit.com/>

~~~
desu
Thanks. I want to know the answer, but not so much that I'm willing to spend
the several hours necessary to familiarise myself with a large foreign code
base. I was hoping someone here could give a quick top-level explanation.

From looking around, though, I think they're caching the user's votes and then
just assembling the page on the fly from cache fragments. You could do that
pretty quickly. Might be wrong.
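
Purely as an illustration of that guess (none of this reflects Reddit's
actual code), assembling a page from a shared fragment plus a per-user vote
overlay might look like:

    def render_front_page(cache, user_id, stitch):
        links = cache.get("frontpage:links")     # shared, cached fragment
        votes = cache.get("votes:%s" % user_id)  # per-user voting history
        return stitch(links, votes or {})        # overlay arrows per link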

------
ecq
For people familiar with Oracle: 11g has built-in caching called the result
cache. It's transparent and works well. We saw a 400-800+% increase in
performance in our application just by turning this feature on, without any
change to our application. Also note that you can still use memcached on top
of the result cache for even better performance/scalability.
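
For illustration, the result cache can also be requested per statement via
Oracle's RESULT_CACHE hint; a sketch using the cx_Oracle driver (connection
details and the query are placeholders):

    import cx_Oracle

    conn = cx_Oracle.connect("user", "password", "dbhost/orcl")
    cur = conn.cursor()
    # The hint asks Oracle to serve repeats of this query from the
    # server-side result cache instead of re-executing it.
    cur.execute("SELECT /*+ RESULT_CACHE */ id, title FROM stories")
    rows = cur.fetchall()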

