I work on memcache at Facebook, and cache invalidation is a hard, hard problem. It's easy to think that we just overlooked this trivial fix, but the reality is that this just doesn't solve all possible cache consistency problems. Namely, how do you deal with racing writes where the clocks are skewed? How do you deal with multiple copies of the same data in different caches (different data centers, for example). What about axe-murdering your top level db because that data is inherently uncacheable using this scheme?
I'm not trying to hate on the post at all, I think it's awesome that product people are thinking about this problem, and the more of them that do the better the world is going to be for guys like me. I just want them to continue to think hard about it :)
good points. but to be fair most problems become hard, hard problems when your site operates at large, large scale. for small traffic sites, caching architecture is easy and most of the subtle vulnerabilities don't manifest.
background: i've been a senior staff engineer at Orbitz and i did a lot of work on the caches at Cheaptickets.com
It's good to see this written up somewhere. The technique is widely known in the Rails community but I haven't seen it explained in one place. (For example, the Rails API documentation and guides don't even mention that the cache helper can take an Active Record instance as the key, so you'd be hard pressed to discover this technique without reading the Action Pack source.)
As DHH says, the upside is that it's simple and, as long as you get the cascading updates right, makes sure you don't ever show stale content.
There are a couple of downsides.
Firstly, once a fragment becomes invalid, it's not regenerated until the next request that needs it, so someone somewhere gets a slow[er] page load; even worse, if yours is a large public site with lots of long-tail pages, that "someone" is likely to be the Googlebot, and the slower page times might adversely affect your search ranking. You could ameliorate this by having some kind of automated process for periodically pre-warming certain parts of the cache after certain kinds of update, but that can get difficult to maintain.
Secondly, this technique is very conservative: the goal is to invalidate all fragments which might have been affected by an update, rather than only the fragments which actually have been. (It's less brute-force than the even simpler "invalidate the entire cache every time any record is updated" strategy, but it's making a similar trade-off.) This works well for an application like Basecamp where the hierarchical presentation of the data matches its hierarchical organisation at the model level, but is more problematic in applications with lots of different ways of presenting the same underlying data; if you have lots of cached representations of projects which don't actually include any information about the child todo lists or todos, then tough shit, those are all going to be invalidated when a single todo is updated, and you'll have to spend (page load) time waiting for them to be regenerated. In principle you can get around this by having multiple timestamps per model to represent different kinds of update, but again that's more difficult to manage correctly over time.
An interesting alternative approach is James Coglan's experimental Primer (https://github.com/jcoglan/primer), which dynamically tracks which Active Record instances & attributes are used during the generation of a cached fragment, and then actively invalidates that cache when any of those attributes change. The downside to Primer is that it's more complex, not production-ready, and may contain traces of magic.
I'm interested in trying out this technique for a site we built but your comments about Googlebot and the associated downside with long-tail pages gives me pause. Do you have any suggestions on how to evaluate whether or not a technique like this would bring advantages, or is it better to simply give it a try and see how it works out?
In our case, the site features real estate listings, and at any given time there are about 50,000 listings. Each listing has its own page. There are numerous other pages, some of which are just index pages (lists of listings), a couple of search pages, etc. But the majority of pages are the single listing view pages, and then following that, the lists of listings. Let's call it around 60,000 pages or so.
The total number of pageviews per day (not counting search engine crawlers) is less than the total number of pages, but I would be willing to be that some individual listing pages are accessed at a much higher rate than others (namely, the newest ones). The listings themselves are updated on a daily basis, so the cache stores would get invalidated at least once a day.
In this situation, how do I determine the appropriate caching strategy?
If you have enough cache capacity to keep every individual listing in cache then you could simply write a script that periodically hits the detail page for each listing to make sure the latest version of each listing is always available in cache.
This might already be covered, but unless you're running a threaded server or have developed some other sort of centralized locking, it's not hard to get race conditions in the cache generation.
For Basecamp it probably doesn't matter that much. For complex actions that serve mostly public requests that hit mostly "all at the same time" (think of a Media Embargo lifting on a new product announcement) it can cause real problems.
Ideally you want the first request to block the rest, so that the cache is only generated once, the other requests just spinning until the first is done, the cache is ready, and the lock released. That way all those other requests just consume sockets instead of unnecessary CPU/IO/DB time.
See step 6. The view is pulling the cache key straight from the domain model. It relies on being the informed as well about changes in its children to do that, which is what the setup in step 5 is about.
You hit the DB for the top-level. So in step 6, we'll always hit the db for the latest project, but we will not hit the db for the todolists or the todos of those todolists if the project cache is fresh.
So you save the render plus any db calls that would be triggered as part of that render.
I was thinking the other day, wouldn't it be easyish and optimal to specify/calculate all the data that is used to generate a fragment, store the information in a tree, and invalidate the parents when a node changes.
This would only invalidate data that needs to be invalidated, and it doesn't sound too hard to implement... Does it exist anywhere? What are downsides I haven't thought of?
I assume this has the effect that memcached is always full. How do you know when the hot cache is starting to fill memcached? Ie, how do you know when your cache memory is closing up to beeing to small?
But that wont show anything until it most likely already is to late. You dont realy want that value to ever go down, then you already have customers with a worse experience, and when that starts, it goes very fast down the hill.
The general gist of it is that you store HTML snippets (e.g. a todo, a list containing, or a project containing lists) in some quick storage medium (in this case Memcached). Each snippet is stored under and accessed by a key that uniquely identifies the version saved (object type + ID + update timestamp). When a new version is cached, the old version stays in the cache, unused, until it is eventially dropped out by Memcached.
Now all you need to do is to ensure that whenever an object or its children change, you update their timestamp (the "touch" part), and if you make a changes to the code, you increment the version key for said cache piece. This way new requests will not hit old cached content.
Ok, so the first interesting part is that instead of caching the result of a database request we cache the complete rendered result. And the bigger rendered parts are just constructed by combining the smaller (also cached) parts, right?
How does the application learn which version an element is at right now? Don't you have to call the database to find out?
If you're rendering an object, you've likely already fetched it from the DB. Rails models (by default) will have a created_at and updated_at datetime columns that are automatically set and updated for you, and you can use them for exactly what is shown in the article.
when you cache some html wich depends on $some_object, use a key that includes that object "version" (e.g. a timestamp).
Then on first access you save the html for a key "object-x-v1" and on subsequent access you get the same html.
When the object changes (v=2), so will the key ("object-x-v2), and when you access the same content again you will have a cache miss and the html will be regenerated, while the stale cache will just stay in memory until it's naturally evicted from memcache.
It sounds really impressive, smart, innovative UI and underlying architecture. 37 signals did it again!
What is the minimum requirements of a team, programmers/skills/time, to get into such a project and actually deliver?