
Rethinking caching in web apps - martinkl
http://martin.kleppmann.com/2012/10/01/rethinking-caching-in-web-apps.html
======
jessedhillon
Keeping in mind that I've never built anything that is Rapportive-sized, it
seems that the problems Martin is talking about here can be mitigated thus.

1. Never access models directly from controllers. Build an API layer that
exposes discrete methods which can store and retrieve data, and use these
exclusively from the higher layers of your app. Using the example given in the
article (with a hypothetical Python + SQLAlchemy app):

This:

    
    
      post = Post.query.get(post_id)
      posts_by_date = Post.query.filter(Post.date >= start_date).filter(Post.date <= end_date).all()
      posts_by_author = Post.query.filter(Post.author_id == user_id).all()
    

Becomes:

    
    
      post = api.get_post(id=post_id)
      posts_by_date = api.get_posts(before=end_date, after=start_date)
      posts_by_author = api.get_posts(author_id=user_id)
    
    

2. APIs should avoid auto-querying collections wherever possible, and
should accept configurable options to use an abstract storage class
representing tables, buckets, interfaces to external services or whatever.

So, this:

    
    
      def get_post(id):
          return Post.query.options(joinedload(Post.comments)).filter(Post.id == id).one()
    
    

becomes:

    
    
      def get_post(id):
          return PostStore.query(post_id=id, with_comments=True)
    
    

Essentially, the idea is to channel access to the data store through only
discrete paths. First, split the model layer into methods that support higher-
layer needs -- these are what Martin is calling the dependencies, I think.
Identify methods like `get_post` which contain logic and information such as
how to get posts by id, and whether or not comments should be eager-loaded.

Second, abstract the actual dispatch of queries to the data store into domain-
specific stores. Instead of using a generic data model, write stores that know
about Posts, Comments, Users etc and wrap the generic model classes. In this
way, a RelationalPostStore knows what it means to get a post from the store,
along with its comments, users, author information etc. A quickly changed
user-configurable setting can switch that out with an HDFSPostStore, which
knows how to get those objects from a Hadoop backend. A CouchPostStore can do
the same, etc.
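
As a rough sketch (building on the hypothetical Python + SQLAlchemy example
above; the class names here are mine, not Martin's), the store layer might look
something like this:

      from abc import ABC, abstractmethod

      from sqlalchemy.orm import joinedload

      class AbstractPostStore(ABC):
          # A store knows how to fetch Posts (optionally with comments)
          # from one particular backend.
          @abstractmethod
          def query(self, post_id, with_comments=False):
              ...

      class RelationalPostStore(AbstractPostStore):
          def query(self, post_id, with_comments=False):
              q = Post.query.filter(Post.id == post_id)
              if with_comments:
                  q = q.options(joinedload(Post.comments))
              return q.one()

      # Chosen from configuration; an HDFSPostStore or CouchPostStore
      # implementing the same interface could be swapped in instead.
      PostStore = RelationalPostStore()

`api.get_post` can then call `PostStore.query(post_id=id, with_comments=True)`
without caring which backend is currently active.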

This is the pattern that has been emerging through my own repeated web dev
experiences. I'd be interested to know if there are obvious/subtle
improvements or flaws.

~~~
gfodor
Somehow I think adding yet another layer to software that ultimately consumes
some inputs and spits out some text is a little extreme. Hopefully we can
figure out a solution that doesn't require it.

~~~
jessedhillon
Two thoughts:

1) Which software does anything other than merely operate on given inputs
and deterministically generate text, graphics, or, more generally, colored
squares? I think that description fully characterizes _all_ non-hardware-
driving code.

2) Web apps do more than generate text. They modify stored data in domain-
specific ways, and ultimately that data is what's important -- it's everything
from opinions to orders.

------
gfodor
This is a great re-imagining of web application design through the lens of the
Lambda Architecture proposed by Nathan Marz. Decomposing problems into pure
functions and immutable data structures always seems to tease out these nice
ideas. I think this trend of re-imagining long-held ideas through these
concepts can be traced back to Rich Hickey's work on Clojure (and surely
further, but Clojure really set the spark off for me). It's going to be amazing
to see all the intellectual dividends that work pays beyond the concrete
language itself.

The key though is these abstractions need to be water-tight and as easy to use
as their contemporary, side-effecting counterparts. It's just plain easy to
rig up a Rails model/controller binary system right now and get things drawing
on the screen. It stands clear on the "Easy not Simple" side of things while
these new concepts are still very much in the "Simple but not that Easy" side.
For these types of architectures to take hold they have to become natural,
conventional, and require little extra mental baggage to use them. Fortunately
it's much easier to make difficult-to-use, simple things easy than complicated,
easy things simple. These simple ideas will eventually rise to be easy as
well, but as we are in the early wild-west stages of these ideas being turned
into systems we are going to see a lot of rapid evolution on the way there.

~~~
martinkl
Rich and Nathan are really leading the thinking in this area; they are doing
great stuff. But you're completely right that it is currently much easier to
throw together a side-effect-ful system with frameworks like Rails. That's not
surprising: the RDBMS way of thinking has been with us for a long time, and so
the tooling and people's understanding of the model has become very good.

The challenge for any new architecture is to create frameworks and tools that
are even better (easier to understand and learn, more efficient to work with,
easier to maintain, etc) than what is already out there. That's what will
determine their adoption, and that's a good thing: we all want good tools to
work with, so the bar should be high.

------
omarqureshi
For anyone using Postgres who doesn't mind getting their hands dirty with a bit
of plpgsql, there are always Materialized Views -
[http://tech.jonathangardner.net/wiki/PostgreSQL/Materialized...](http://tech.jonathangardner.net/wiki/PostgreSQL/Materialized_Views)

------
lkrubner
For incredible scale, and heretical ideas, I think people should consider the
decisions that Colin Steele made as CTO of RoomKey. He was thinking about how
to build a database that could offer near-real-time searching and yet handle
any level of traffic. The idea he hit upon was compiling his code with a
static snapshot of his database. Think about this carefully: instead of lots
of webservers putting strain on a central database, every web server has its
own snapshot of the database. If traffic doubles, you can double the number of
instances you have on Amazon, doubling both the number of web servers and the
number of databases you have. The team at
RoomKey has built this app with Clojure, and stores the data in an instance of
Solr that actually gets compiled with the code. With an automated build
system, you could probably roll out new snapshots of the database several
times a day, so the data is never very stale.
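
To make the shape of that concrete, here is a loose sketch in Python (not
RoomKey's actual Clojure/Solr implementation; the table, column and file names
are made up): at packaging time you freeze the content database into a file
that ships with the application, and every web server reads only that file.

      import json
      import sqlite3

      SNAPSHOT_FILE = "hotels_snapshot.json"  # hypothetical snapshot artifact

      def build_snapshot(db_path):
          # Run at build/packaging time: freeze the messy relational data
          # into a static, non-relational snapshot.
          conn = sqlite3.connect(db_path)
          rows = conn.execute("SELECT id, name, city FROM hotels").fetchall()
          with open(SNAPSHOT_FILE, "w") as f:
              json.dump([{"id": i, "name": n, "city": c} for i, n, c in rows], f)

      def load_snapshot():
          # Run at server startup: every web server gets its own full copy,
          # so doubling servers also doubles "databases".
          with open(SNAPSHOT_FILE) as f:
              return json.load(f)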

This is how Colin Steele wrote about it on his blog:

"Put another way, users of this system have a high tolerance for inconsistent
reads. Bob’s and Jane’s hotel universes need not be identical. (They can’t be
completely divergent; eventual consistency is fine.)

So: A-ha! The messy relational data could live in a secluded back-end “content
sausage factory”, whose sole purpose in life would be to produce a crisp, non-
relational version of the hotel universe as known best at that point in time.

This “golden master”, non-relational database of hotels could then be shipped
off to the live, operational system which faces users.

Moreover, different users might be exposed to different versions of the
“golden master” hotel database, allowing us to test and to do progressive and
continuous rollouts. Decision One: I put relational data on one side and
“static”, non-relational data on the other, with a big wall of verification
process between them.

This led to Decision Two. Because the data set is small, we can “bake in” the
entire content database into a version of our software. Yep, you read that
right. We build our software with an embedded instance of Solr and we take the
normalized, cleansed, non-relational database of hotel inventory, and jam that
in as well, when we package up the application for deployment.

Egads, Colin! That’s wrong! Data is data and code is code!

We earn several benefits from this unorthodox choice. First, we eliminate a
significant point of failure - a mismatch between code and data. Any version
of software is absolutely, positively known to work, even fetched off of disk
years later, regardless of what godawful changes have been made to our content
database in the meantime. Deployment and configuration management for
differing environments becomes trivial.

Second, we achieve horizontal shared-nothing scalability in our user-facing
layer. That’s kinda huge. Really huge. One of our crack developers and now our
Front-End Development Manager drove Decision Three. Our user-facing servers
render JSON. A Javascript application, bootstrapped by a static HTML page,
creates the entire UI, rendering elements into the browser's DOM as it
interacts with our API. This so-called “fat client” or “single page Javascript
app” has been steadily coming into vogue for the last few years, but it was
far from an obvious choice at the time.

We reap several benefits from this decision. First, our UI gets faster without
us having to pay for new servers that take advantage of Moore’s Law. Second,
our UI gets faster because browser makers are dumping tons of time and energy
into improved JS performance. Third, HTML is verbose and JSON isn’t, saving us
both compute cycles and bandwidth on the server side.

These decisions in toto yield a picture of a three-layer cake, with a messy
relational tier on the bottom, an operational API and SOA tier in the middle,
and a fat client presentation layer on top."

[http://www.colinsteele.org/post/27929539434/60-000-growth-in...](http://www.colinsteele.org/post/27929539434/60-000-growth-in-7-months-using-clojure-and-aws)

~~~
erichocean
At my company, we're actually doing that as well, but we took it further: no
tiers at all.

We do load-balancing at the client, every server has a public IP address, we
use DNS round-robin only to distribute the login phase; after that, it's
location-based with re-connect logic in the app.

We don't store the entire database on every node, but we do the Cassandra-
style equivalent: each datacenter has all of the data. We can store all of the
data in a single datacenter.

We never have to worry if the "database" is down, because the app server is
also down whenever it is.

We even have global visibility on all machines in a Cassandra-like design --
something that took Google's Spanner team custom GPS and atomic timekeeping
hardware to achieve. We're doing it with a lightweight consensus algorithm
that achieves a reliable commit timestamp globally, and NTP. We can take a
global snapshot, or return a globally-consistent query across all data at any
time.

I think we'll see a lot of architectures built like this in the (near) future.
Collapsing the tiers is great for maintainability and performance.

----

FWIW I found <http://nathanmarz.com/blog/how-to-beat-the-cap-theorem.html> to
be...lacking. He ignores all problems related to multi-update completely.
Nathan's solution does nothing to solve consistency problems except in the
trivial situation where every piece of data can be "updated" (really appended)
completely in isolation from any other update.

You actually _do_ need multi-data consistency _somewhere_ to write virtually
all non-trivial apps, and both Spanner (and the architecture we use) deal with
that in a sane way, that developers can actually use.

Nathan is definitely on the path towards how to implement such a beast;
hopefully his book will have a fuller implementation of where he's heading.

~~~
cwp
We do this at instandomainsearch.com too.

Our domain database is relatively small and changes slowly, so we do daily
updates and distribute snapshots to all the servers. We have copies in 4
different AWS regions, and all the load-balancing and failover is done in
client-side Javascript.

One server in each region is plenty for our current traffic levels, but if we
need to, scaling horizontally is trivial. More important for us is that the
in-browser load-balancing is based on measured latency, which ensures that
each user gets the snappiest interface that we can provide.
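
The selection logic itself is simple; here's a rough sketch (in Python for
brevity, though the real thing runs as client-side Javascript, and the
hostnames are made up):

      import time
      import urllib.request

      SERVERS = ["https://us-east.example.com", "https://eu-west.example.com"]

      def measure_latency(url):
          # Time a tiny request; unreachable servers get infinite latency,
          # which is what gives us failover for free.
          start = time.monotonic()
          try:
              urllib.request.urlopen(url + "/ping", timeout=2).read()
          except OSError:
              return float("inf")
          return time.monotonic() - start

      def best_server():
          return min(SERVERS, key=measure_latency)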

All in all, it works really well.

~~~
cvillecsteele
At Room Key we're using Amazon's DNS solution for load balancing (lowest
latency).

~~~
cwp
When Amazon came out with DNS-based load-balancing, we thought about
switching, but it'd be a downgrade from what we've got already. We direct
traffic to the server that's showing the lowest latency _right now_ and switch
to another server if the measured latency changes. Amazon's latency data will
be much slower to adapt to network conditions, and with DNS caching, it
wouldn't provide fail-over if the lowest-latency server happened to go down or
become unreachable.

Amazon's scheme is pretty nifty though, and we'd definitely use it if we
didn't already have a more dynamic solution in place.

------
akurilin
Very interesting read. It's fascinating to see the functional paradigm being
applied to individual services and components rather than just functions.

It's somewhat unfortunate that most of us will never have to deal with this
kind of scale/employee ratio, where engineers need to fully understand the
implications of dozens of major trade-offs like the ones Martin talks about. I
think the state of the art would advance much faster if more of us had to deal
with this kind of work on daily basis.

------
comain
I think this is just what every search engine has been doing every day for
years. Anyway, very nice idea!

------
joevandyk
I dunno, seems like a ton of work just to avoid doing a materialized view.

------
petewailes
K, I'm going to throw my hat in the ring here and say this is a good idea, and
outline (mostly through referencing other people's work as it's 00.38) how
I've spent the last few years building applications.

I'm going to intentionally simplify so that the less experienced here can keep
up (hopefully). For those with more code-fu, feel free to poke me for more
information.

Right, imagine we have a blog engine for something that needs to be able to
handle huge data requests. Our code base logic at its simplest looks something
like this:

Down:

Request ->

Router (turns URL structure into a set of useful data inputs) ->

Pre-processor (uses those inputs to fire up the classes we'll need, namely to
get data first) ->

Primary validation (makes sure everything that was put in is sensible and
safe) ->

Model pre-processor (calculates which data store we need to query - a key-
value store of all data, a cache of various pieces of data, or our single
monolithic store) ->

Data store (cache/database/whatever)

Up:

Model post-processor (where our data came from anywhere other than our pre-
computed store of values, generate whatever needs to be there and store it for
next time) ->

Secondary validation (ensure our output is sanitized) ->

Business logic post-processor (to apply any data transformation required) ->

Output

It's specifically the pre-processor that we're talking about here. There are
millions of ways to implement it, but something like what's outlined in the
original post is currently my favourite way, both for the logging/debugging
options it gives and for the data manipulation tools it provides de facto.
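
A toy sketch of that store-choosing step, in Python with entirely hypothetical
names, might be:

      class Store:
          # Stand-in for a real backend: pre-computed values, cache, or DB.
          def __init__(self, name, data):
              self.name, self.data = name, data

          def has(self, key):
              return key in self.data

          def fetch(self, key):
              return self.data[key]

      def choose_store(key, precomputed, cache, monolithic):
          # Prefer the fully pre-computed store, then the cache,
          # then fall back to the single monolithic store.
          for store in (precomputed, cache, monolithic):
              if store.has(key):
                  return store
          return monolithic

      precomputed = Store("precomputed", {"post:1": {"title": "Hello"}})
      cache = Store("cache", {})
      monolithic = Store("database", {"post:2": {"title": "World"}})

      store = choose_store("post:2", precomputed, cache, monolithic)
      print(store.name, store.fetch("post:2"))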

Whilst some of it isn't very sensible for modern web practice (XML/XSLT for
example), this piece by Tony Marston is well worth studying until you really
grok it, as a fuller representation of what we're talking about here:
<http://tonymarston.net/php-mysql/infrastructure.html>

This is essentially taking elements 3, 4, 5 & 6 and abstracting them out so
that rather than just talking to databases as data stores, they're able to
talk to a variety of other model data sources - cache levels, the database and
a globally pre-computed store of all possible values (for the fastest level of
access).

I'm also going to re-state the shout-out made in the original post to Nathan
Marz's work here: <http://nathanmarz.com/blog/how-to-beat-the-cap-theorem.html>.
Excellent reading for anyone interested in large-scale application design. I'll
also throw in this excellent post on event sourcing by Martin Fowler:
<http://www.martinfowler.com/eaaDev/EventSourcing.html>

Questions?

------
programminggeek
The real problem is that the whole Rails, MVC, ActiveRecord structure ties
together three separate things - your app (biz logic, entities), your
framework (Rails), and your persistence mechanism (ActiveRecord). Rails should
only be used for what it's good at - routes, controllers, views, assets. Your
app should only be used for what it's good at - biz logic and entities. Your
persistence mechanism should only be used for retrieving and saving data, not
defining your application models.

Lately I've been going down the rabbit hole of Uncle Bob's Screaming
Architecture and Alistair Cockburn's Ports and Adapters Architecture to create
something that attempts to make it super obvious what is going on and where.

The idea is you have an "app" that is totally separate from Rails or Active
Record. It consists of Actions, Entities, and Contracts. Actions are basically
use cases, user stories, interactors, or some equivalent to that. Entities are
just objects that hold data and do validation. Contracts act as a proxy for
your data gateways and ensure the input/output formats.

Gateways are the persistence mechanism. They are objects defined by a contract
and are implemented as datasource-specific drivers that are swappable, so long
as the driver adheres to the contract. The "app" side of things doesn't care
what kind of driver you use; it only cares that you send and receive data
using the right formats.

In the end you end up with your "app" being a nice bit of code that is totally
testable outside of a framework or the database. To make it run with a
database, you just write a gateway driver for your database that matches the
Contract you wrote. It uses the simplest form of Dependency Injection to make
this work.
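
A stripped-down sketch of the gateway/contract idea (in Python rather than
Ruby, with made-up names) could look like:

      class PostGatewayContract:
          # The contract: whatever driver is injected, the app always gets
          # records back in this shape.
          REQUIRED_KEYS = {"id", "title", "body"}

          def __init__(self, driver):
              self.driver = driver  # injected: in-memory, SQL, 3rd-party API, ...

          def find(self, post_id):
              record = self.driver.find(post_id)
              assert self.REQUIRED_KEYS <= record.keys(), "driver broke the contract"
              return record

      class InMemoryPostDriver:
          # Trivial driver, handy for fast tests without a database.
          def __init__(self, records):
              self.records = records

          def find(self, post_id):
              return self.records[post_id]

      post_gateway = PostGatewayContract(
          InMemoryPostDriver({1: {"id": 1, "title": "Hi", "body": "..."}}))
      print(post_gateway.find(1)["title"])

Swapping in a MySQL- or Redis-backed driver just means writing another class
with the same find method; the "app" code above it doesn't change.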

The nice thing about this approach is that your persistence mechanism is
totally pluggable, so as you need to scale, you can swap out your gateway
driver without touching the app. Also, it's obvious when DB calls are
happening in this system, because when you call, say, post_gateway.find you
know that it's going through the contract, is going to connect to whatever
gateway driver you specify, and that it will return data in the right format
regardless of whether it hits memcache, mysql, mongo, redis, cassandra, a
3rd-party api or whatever.

Also, you can write the whole "app" part of your code TDD and get like 100%
test coverage. If you feel like you need to test against real data instead of
mocking, you can write an in-memory or filesystem driver and your tests will
still be crazy fast.

In this kind of system you could still use Rails or any other framework for
your views/routing/controllers. You can use ActiveRecord or anything else as
long as that code lives behind a gateway instead of being used to define your
business objects.

If you are interested in this kind of architecture, I'm planning on open-
sourcing it soon, once I have a few command-line generators written to make it
a bit easier to get started with.

