The key, though, is that these abstractions need to be water-tight and as easy to use as their contemporary, side-effecting counterparts. It's just plain easy to rig up a Rails model/controller binary system right now and get things drawing on the screen. It stands clearly on the "Easy but not Simple" side of things, while these new concepts are still very much on the "Simple but not that Easy" side. For these types of architectures to take hold they have to become natural, conventional, and require little extra mental baggage to use. Fortunately, it's much easier to make difficult-to-use, simple things easy than complicated, easy things simple. These simple ideas will eventually become easy as well, but as we are in the early wild-west stages of these ideas being turned into systems, we are going to see a lot of rapid evolution on the way there.
The challenge for any new architecture is to create frameworks and tools that are even better (easier to understand and learn, more efficient to work with, easier to maintain, etc) than what is already out there. That's what will determine their adoption, and that's a good thing: we all want good tools to work with, so the bar should be high.
1. Never access models directly from controllers. Build an API layer that exposes discrete methods which can store and retrieve data, and use these exclusively from the higher layers of your app. Using the example given in the article (here as a hypothetical Python + SQLAlchemy app):
post = Post.query.get(post_id)
posts_by_date = Post.query.filter(Post.date >= start_date).filter(Post.date <= end_date).all()
posts_by_author = Post.query.filter(Post.author_id == user_id).all()
post = api.get_post(id=post_id)
posts_by_date = api.get_posts(before=end_date, after=start_date)
posts_by_author = api.get_posts(author_id=user_id)
return Post.query.options(joinedload(Post.comments)).filter(Post.id == id).one()
return PostStore.query(post_id=id, with_comments=True)
Second, abstract the actual dispatch of queries to the data store into domain-specific stores. Instead of using a generic data model, write stores that know about Posts, Comments, Users etc and wrap the generic model classes. In this way, a RelationalPostStore knows what it means to get a post from the store, along with its comments, users, author information etc. A quickly changed user-configurable setting can switch that out with an HDFSPostStore, which knows how to get those objects from a Hadoop backend. A CouchPostStore can do the same, etc.
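A minimal sketch of this store pattern in Python. All names here (PostStore, InMemoryPostStore) are hypothetical; a RelationalPostStore or CouchPostStore would implement the same interface on top of SQLAlchemy or CouchDB:

```python
class PostStore:
    """The contract consumers code against; backends are swappable."""
    def get_post(self, post_id, with_comments=False):
        raise NotImplementedError


class InMemoryPostStore(PostStore):
    """Stand-in backend: the same interface an HDFSPostStore or
    CouchPostStore would implement over its own storage engine."""
    def __init__(self, posts, comments):
        self._posts = posts          # {post_id: {"title": ...}}
        self._comments = comments    # {post_id: [comment, ...]}

    def get_post(self, post_id, with_comments=False):
        post = dict(self._posts[post_id])
        if with_comments:
            post["comments"] = self._comments.get(post_id, [])
        return post


# A config setting picks the concrete store; consumers never know.
store = InMemoryPostStore({1: {"title": "Hello"}}, {1: ["First!"]})
post = store.get_post(1, with_comments=True)
```

Because every backend satisfies the same PostStore contract, switching from a relational store to a Hadoop or Couch store is a one-line configuration change rather than a rewrite of the callers.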
This is the pattern that has been emerging through my own repeated web dev experiences. I'd be interested to know if there are obvious/subtle improvements or flaws.
That's what models are for though. The problem is your models are acting the part of light wrappers over a database. That's not what models are, and treating them like that is bad.
post = api.get_post(id=post_id)
post = Post.get(post_id)
All remote calls and fancy caching get added to the api module so that neither the model nor other consumers, like the view, need to be aware of them. Also, this keeps a stable interface for consumers to build on while the model and view evolve (great for unit tests, other apps using your app, etc.).
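As a rough sketch of that split, with a toy Post model standing in for the real one, the api module might memoize reads so that consumers never see the caching:

```python
class Post:
    """Toy model standing in for a real ORM class."""
    _db = {42: {"title": "Cached post"}}
    fetches = 0  # counts how often we actually hit the "database"

    @classmethod
    def get(cls, post_id):
        cls.fetches += 1
        return cls._db[post_id]


# --- the api module: caching is an internal detail here ---
_cache = {}

def get_post(post_id):
    """Stable interface for consumers; the model call stays hidden."""
    if post_id not in _cache:
        _cache[post_id] = Post.get(post_id)
    return _cache[post_id]


first = get_post(42)
second = get_post(42)  # served from the cache, no second model call
```

The view only ever imports `get_post`; swapping the cache for memcached, or the model call for a remote service, touches nothing outside the api module.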
Further, this lets you reuse the same "business logic" in different frontend scenarios, such as exposing parts of the API over HTTP: https://github.com/shazow/pyramid-sampleapp/blob/master/foo/...
However, I would like to take it further. Even if you move the data store access into a separate API layer, that API is still written in an imperative language, which limits the transformations that a framework can perform on it. I dream of a fully declarative way of specifying the communication dependencies, which would open up completely new processing modes, such as running all the business logic in Hadoop.
1) Which software does something other than merely operating on given inputs and deterministically generating text, graphics, or more generally, colored squares? I think that description fully characterizes all non-hardware-driving code.
2) Web apps do more than generate text. They modify stored data in domain-specific ways, and ultimately that data is what's important -- it's everything from opinions to orders.
This is how Colin Steele wrote about it on his blog:
"Put another way, users of this system have a high tolerance for inconsistent reads. Bob’s and Jane’s hotel universes need not be identical. (They can’t be completely divergent; eventual consistency is fine.)
So: A-ha! The messy relational data could live in a secluded back-end “content sausage factory”, whose sole purpose in life would be to produce a crisp, non-relational version of the hotel universe as known best at that point in time.
This “golden master”, non-relational database of hotels could then be shipped off to the live, operational system which faces users.
Moreover, different users might be exposed to different versions of the “golden master” hotel database, allowing us to test and to do progressive and continuous rollouts.
Decision One: I put relational data on one side and “static”, non-relational data on the other, with a big wall of verification process between them.
This led to Decision Two. Because the data set is small, we can “bake in” the entire content database into a version of our software. Yep, you read that right. We build our software with an embedded instance of Solr and we take the normalized, cleansed, non-relational database of hotel inventory, and jam that in as well, when we package up the application for deployment.
Egads, Colin! That’s wrong! Data is data and code is code!
We earn several benefits from this unorthodox choice. First, we eliminate a significant point of failure - a mismatch between code and data. Any version of software is absolutely, positively known to work, even fetched off of disk years later, regardless of what godawful changes have been made to our content database in the meantime. Deployment and configuration management for differing environments becomes trivial.
Second, we achieve horizontal shared-nothing scalability in our user-facing layer. That’s kinda huge. Really huge.
We reap several benefits from this decision. First, our UI gets faster without us having to pay for new servers that take advantage of Moore’s Law. Second, our UI gets faster because browser makers are dumping tons of time and energy into improved JS performance. Third, HTML is verbose and JSON isn’t, saving us both compute cycles and bandwidth on the server side.
These decisions in toto yield a picture of a three-layer cake, with a messy relational tier on the bottom, an operational API and SOA tier in the middle, and a fat client presentation layer on top."
We do load-balancing at the client, every server has a public IP address, we use DNS round-robin only to distribute the login phase; after that, it's location-based with re-connect logic in the app.
We don't store the entire database on every node, but we do the Cassandra-style equivalent: each datacenter has all of the data. We can store all of the data in a single datacenter.
We never have to worry if the "database" is down, because the app server is also down whenever it is.
We even have global visibility on all machines in a Cassandra-like design -- something that took Google's Spanner team custom GPS and atomic timekeeping hardware to achieve. We're doing it with a lightweight consensus algorithm that achieves a reliable commit timestamp globally, and NTP. We can take a global snapshot, or return a globally-consistent query across all data at any time.
I think we'll see a lot of architectures built like this in the (near) future. Collapsing the tiers is great for maintainability and performance.
FWIW I found http://nathanmarz.com/blog/how-to-beat-the-cap-theorem.html to be...lacking. He ignores all problems related to multi-update completely. Nathan's solution does nothing to solve consistency problems except in the trivial situation where every piece of data can be "updated" (really appended) completely in isolation from any other update.
You actually do need multi-data consistency somewhere to write virtually all non-trivial apps, and both Spanner and the architecture we use deal with that in a sane way that developers can actually use.
Nathan is definitely on the path towards how to implement such a beast; hopefully his book will have a fuller implementation of where he's heading.
One server in each region is plenty for our current traffic levels, but if we need to, scaling horizontally is trivial. More important for us is that the in-browser load-balancing is based on measured latency, which ensures that each user gets the snappiest interface that we can provide.
All in all, it works really well.
Amazon's scheme is pretty nifty though, and we'd definitely use it if we didn't already have a more dynamic solution in place.
Your system sounds really interesting though, thanks for writing it up. For most people, the hardware requirement of a Spanner-like solution does seem impractical, so maybe that's a good tradeoff.
I'm curious what industry this is in? Or what company? 17ms is really low! I wish more sites would put that much effort into their latency.
afaik you have 3+ data centers, and need to ensure consistency between them. They agree on an exact time (hmmm, how - NTP offsets where each data center acts as stratum two for the others, so eventually you see each other's drift???)
anyway, exact time agreed, timestamps on every transaction, labelled UUIDs, and then send them as updates to other centers. Is that roughly it? Or am I way off?
how do you deal with collisions - if I sell the same room twice in two data centers, who loses, and how is the business process handled?
The servers do not agree on exact time, but they do agree on the timestamp for a particular update, using a lightweight consensus protocol.
So, here's the properties that actually matter:
1. A client is connected to a node in a single datacenter.
2. Multiple clients are connected to multiple nodes across datacenters.
3. At timestamp T, all clients should be able to read the same data from every datacenter if they can read from any datacenter (this is what is meant by "global consistency").
4. It should always be possible to read at time T even if it is actually some time T' in the future. (This is what's needed to support snapshots.)
We label an arbitrary collection of records as an "entity". All records for an entity can be stored on a single machine, and are updated atomically (transactionally). On average, an entity should be < 1MB in size, but we currently allow them to get as large as 64MB if needed. (We serialize updates to the same entity in the same datacenter; in practice, we could use Paxos here (like Google), but our particular application doesn't need that complexity.)
Entities are identified by a UUID, and a particular commit id refers to a particular version of an entity. (We do append updates.) The commit id is a git-style hash of the contents and the parent commit(s). This facilitates both auditing and merging (see below).
I'll cover how that commit gets a timestamp and thus becomes globally visible as part of the next question.
The same entity can be updated at multiple datacenters containing it at the "same" time. If these updates are identical, they result in the same commit id (remember, it's a hash); otherwise, the commits will be different. (Unlike git, we DO NOT hash a timestamp with the commit.)
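Assuming a JSON-style serialization (a detail not given above), the content-addressed commit id could be sketched like this. The point is that no timestamp goes into the hash, so identical concurrent updates collapse to the same commit:

```python
import hashlib
import json

def commit_id(contents, parent_ids):
    """Git-style id: hash of the entity's contents plus its parent
    commit(s). The exact serialization here is an assumption."""
    payload = json.dumps(
        {"contents": contents, "parents": sorted(parent_ids)},
        sort_keys=True,
    )
    return hashlib.sha1(payload.encode()).hexdigest()

# Two datacenters applying the same update to the same parent
# produce the same commit id, so there is no conflict to resolve:
a = commit_id({"room": 101, "status": "booked"}, ["c0ffee"])
b = commit_id({"room": 101, "status": "booked"}, ["c0ffee"])
```

Different contents (or different parents) yield different ids, which is exactly the signal the voting round uses to detect a conflict and trigger a merge.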
When committed, initially there is no timestamp. The commit is "in the system", but it has not yet become visible, except (conditionally) on the datacenter where it was just written to. It's not globally visible at this point, and we don't let any other client see it.
Next, the datacenter "votes" for the commit by associating it with a timestamp, and also forwards the commit/vote/timestamp to another datacenter. That datacenter receives the commit, and if it has no other, conflicting commit, votes with its timestamp, and then forwards it on to the next data center, recursively, until it has reached every datacenter.
A commit in our system becomes globally visible precisely when all datacenters have voted. The commit is made globally visible at the latest vote/timestamp.
At this point, we are guaranteed two things:
1. All datacenters have the commit.
2. They all have voted there is no conflict, and also provided a timestamp.
So what about the case of a conflict? When that occurs, by definition, one of the datacenters will not vote. Instead, what happens is our version of "read repair", which really, is just a merge between two commits (think: git). The merged commit and its parent commits are sent on to the next datacenter, and also given a vote/timestamp. (Our merge procedure is a pure function, and can be customized per entity.)
The only remaining details are as follows:
1. How does each datacenter know when it has received all votes?
2. How do we handle partitions, which by definition prevent voting?
We handle 1. with a global datacenter topology that is distributed as an entity, so it too becomes globally visible atomically.
We handle 2. by arranging our datacenters in a tree topology. When a partition occurs, we disable branches lower in a tree, and we empower the parent node to "vote" for the disabled node.
In practice, thanks to this topology, we don't actually need to send around all of the votes; we can actually just forward the timestamp of the parent node to its parent, recursively.
NTP works for all of this because they all agree on the timestamp, since it's part of the vote. Each datacenter has the invariant that each new vote timestamp is monotonically the same or increasing up the datacenter tree, so we'll never have a situation where an "old" commit becomes visible after a new commit. A datacenter can only vote (for a particular dataset) with a timestamp greater than its previous vote.
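A toy sketch of the voting invariant described above, heavily simplified (real votes propagate through the datacenter tree; here each datacenter just votes once). The key properties shown: each datacenter's votes are monotonically non-decreasing, and the commit becomes visible at the latest vote:

```python
class Datacenter:
    def __init__(self, name):
        self.name = name
        self.last_vote = 0.0

    def vote(self, local_clock):
        # Invariant: a datacenter never votes with a timestamp
        # earlier than its previous vote, so an "old" commit can
        # never become visible after a newer one.
        self.last_vote = max(self.last_vote, local_clock)
        return self.last_vote


def commit_visible_at(datacenters, local_clocks):
    """Collect one vote per datacenter; the commit becomes globally
    visible at the latest vote/timestamp."""
    votes = [dc.vote(t) for dc, t in zip(datacenters, local_clocks)]
    return max(votes)


dcs = [Datacenter("us"), Datacenter("eu"), Datacenter("ap")]
# Even with skewed local clocks, all datacenters agree on the
# visibility timestamp, because it is part of the vote itself.
ts = commit_visible_at(dcs, [10.0, 10.3, 9.8])
```

Because the agreed timestamp travels with the vote, NTP-level clock skew only affects how far into the "future" a visibility timestamp lands, never whether the datacenters agree on it.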
At the business process level, we basically don't react to a commit until it's globally visible. That's when consensus is achieved and it's safe to move forward with a workflow, such as updating other entities.
Anyway, hope that helps!
That's a fantastic response, thank you. Sorry for the delay - ironically, timezones are a problem!
I have been wanting to think a bit more about the time / global consistency architecture that is creeping up, so if you do not mind could I ask a couple more questions
1. Your approach seems to conflate system time and real-world time. That is, do you actually store the real-world time an event took place as well as the time the datacenters all agreed? (I am assuming vote/timestamp is an MVCC-style identifier.)
2. If I added time.sleep(5) to datacenter X on each vote, it seems that there is a ratchet effect - since all datacenters take the latest timestamp, and no one can vote with an earlier timestamp than agreed, you will always start pushing out to the future - how can you recover? Or am I off?
3. Just how much traffic is there between your data centers - is it worth it?
In general I like the distributed datacenter / global consistency approach based on a form of MVCC and I like your voting approach. However 3. is my biggest question - sharding does seem such a simpler fix. What is the problem you are solving with everything in each datacenter?
We have two simultaneous goals overall:
1. Erlang-like uptime (preferably, as close to 100% as possible). Our product requires this, because downtime is extremely costly for our customers. Our product is correspondingly expensive, although not overly so.
2. Extremely low latency. This is why we replicate a dataset to multiple datacenters in the first place, and why all of them are always "online", in the sense that failover can be instantaneous. We want each device that connects (especially mobile devices) to be near the dataset.
We actually don't use the highest-level datacenters (topologically speaking) to service clients most of the time, although we can if a particular client (usually a mobile device) has better latency to it than to other datacenters. The physical datacenter cost goes up considerably as you proceed up the tree, since we increase the replication factor at each level (maximum of 7). We really, really don't want downtime.
So that's why we do this: crazy high uptime coupled with very low-latency. We also put a datacenter with a single dataset on a customer's premises, which gives us incredibly low latency on their local LAN for reads, which of course dominate our system. Our target is a 99.9% mean < 17ms measured at the client (so, round-trip). An edge datacenter is still part of the whole datacenter network, so if it goes down, the client connections will switch automatically to the next fastest datacenter containing their dataset (generally, the next one up in the tree).
Finally, our maximum datacenter size is currently 48 nodes. There is no one datacenter in our system that holds every dataset for every customer. We assign new datasets to existing datacenters using a scale-free (randomized) algorithm, which also helps in terms of downtime from random failures.
Timestamps are actually per dataset, so drift isn't a datacenter-wide issue. Nevertheless, persistent, large drift would cause timestamps to move ahead of clients, which would be weird. Currently, that never happens.
A parent rejects obviously bad data from a child datacenter, such as a timestamp 5 seconds in the future on a vote. That would simply (temporarily) mark the datacenter as down. Timestamps are not set by external clients, so this isn't really a concern in practice, it's more of a bug detection mechanism.
The traffic between datacenters is actually extremely minimal, akin to what git does in a push. We send only de-duplicated data between datacenters, not an entire entity. So, for example, if you had an entity with 100 records, and you updated one of them, we'd send the new commit and the updated record (only); the other 99 records would not be sent. The data is also compressed.
We do not replicate secondary indexes. This is one of the problems (IMO) with Cassandra's native multi-datacenter functionality. We expand a de-duplicated commit at each data center into a coalesced commit (everything in the Entity is now physically close to each other on-disk) and we fully expand it out into any secondary indexes, which can be 100x the data of the coalesced entity.
Notably, secondary indexes are also globally consistent in our system, not just the primary entity. I don't know if this is true in Spanner or not -- if anyone knows, do share! :)
Seems topology plays a fairly big role, larger than I expected.
Thank you. I am sorry if it seems like I was trying to steal your innermost secrets, but I find that unless I know enough to (think I can) rebuild it, I really have not understood.
What I think is exciting now is that there is a rush of new distributed solutions from many different directions - the RDBMS has held sway since the late 70s and now hierarchical and network designs are coming back into vogue - and almost anyone can play (not that you are just anyone, but you don't need to be Oracle or Sybase).
Part of me misses the old certainties of RDBMSes and part of me is excited - I have to know the tech and choose the best option for the domain.
One final question, more business-oriented. You grew a solution to meet specific business needs - how long was it from the first idea till live client connect, and was there business support for the time it would take (i.e., were you constantly fighting for time, or were you given space and time)?
Hard to answer, I suspect, but interested anyway.
So far it has worked really well, but I don't think it really offers many advantages scalability wise over other NoSQL solutions. Like most other solutions it trades fast writes/consistent reads for fast reads, which is fine in many cases. Around the time we were adding Solr though, I heard about Datomic (http://www.datomic.com/). I haven't had time to really investigate it, but it seems to me that Datomic provides fast reads in a manner very similar to the Solr configuration you described. Live versions of the database's index are deployed to agents that live on your app servers, but unlike the Solr configuration these agents receive updates to the index in real time. At a glance at least, it seems to offer all of the advantages of the Solr solution while still allowing relatively fast updates to the data.
This is an interesting talk on the topic by Rich Hickey (who created Datomic): http://www.youtube.com/watch?v=Cym4TZwTCNU. It's long but I thought it was worthwhile.
This is handled by grouping nodes together physically, and calling the resulting bundle a datacenter.
Data is stored using a Cassandra-like algorithm, that locates the data on each node in the cluster.
This solves both the "too much data for one machine" and the low-latency problem. Now the data just needs to fit in a single datacenter.
Is this sharding, or is the whole database in each instance?
If the latter, how do you deal with two instances selling the same room on the same day?
It seems nice, but do they receive updates from each other or is it fed in from the back end? Either way, the nice simplicity gets all confused handling updates.
It's somewhat unfortunate that most of us will never have to deal with this kind of scale/employee ratio, where engineers need to fully understand the implications of dozens of major trade-offs like the ones Martin talks about. I think the state of the art would advance much faster if more of us had to deal with this kind of work on daily basis.
I'm going to intentionally simplify so that the less experienced here can keep up (hopefully). For those with more code-fu, feel free to poke me for more information.
Right, imagine we have a blog engine for something that needs to be able to handle huge data requests. Our code base logic at its simplest looks something like this:
Router (turns URL structure into a set of useful data inputs) ->
Pre-processor (uses those inputs to fire up the classes we'll need, namely to get data first) ->
Primary validation (makes sure everything that was put in is sensible and safe) ->
Model pre-processor (calculates which data store we need to query, a key-value store of all data, a cache of various pieces of data, or our single monolithic store) ->
Data store (cache/database/whatever) ->
Model post-processor (where our data came from anything other than our pre-computed store of values, generate whatever needs to be there and store it for next time) ->
Secondary validation (ensure our output is sanitized) ->
Business logic post-processor (to apply any data transformation required)
It's specifically the pre-processor that we're talking about here. There's millions of ways to implement it, but something like what's outlined in the original post is currently my favourite way, for both the logging/debugging options it gives, as well as the data manipulation tools that it de facto provides.
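The stages above compose naturally as a function pipeline; here is a toy sketch with placeholder stages (the real router, validators, and stores would each be full modules):

```python
def pipeline(url, stages):
    """Thread a request through each stage in order; every stage
    takes the previous stage's output as its input."""
    data = url
    for stage in stages:
        data = stage(data)
    return data


# Toy stages standing in for the router, validation, and data store:
router = lambda url: {"path": url, "params": {}}   # URL -> inputs
validate = lambda req: req                          # would raise on bad input
fetch = lambda req: {**req, "body": "hello"}        # data store lookup

result = pipeline("/posts/1", [router, validate, fetch])
```

Structuring it this way is what gives the logging/debugging leverage mentioned above: you can wrap or replace any single stage (say, to trace the pre-processor's output) without touching the others.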
Whilst some of it isn't very sensible for modern web practice (XML/XSLT, for example), this piece by Tony Marston is well worth studying until you really grok it, as a fuller representation of what we're talking about here: http://tonymarston.net/php-mysql/infrastructure.html
This is essentially taking elements 3, 4, 5 & 6 and abstracting them out so that rather than just being for talking to databases as data stores, they're able to talk to a variety of other model data sources - cache levels, the database, and a globally pre-computed store of all possible values (for the fastest level of access).
I'm also going to re-state the shout-out made in the original post to Nathan Marz's work here: http://nathanmarz.com/blog/how-to-beat-the-cap-theorem.html. Excellent reading for anyone interested in large-scale application design. I'll also throw in this excellent post on event sourcing by Martin Fowler: http://www.martinfowler.com/eaaDev/EventSourcing.html
Lately I've been going down the rabbit hole of Uncle Bob's Screaming Architecture and Alistair Cockburn's Ports and Adapters Architecture to create something that attempts to make it super obvious what is going on and where.
The idea is you have an "app" that is totally separate from Rails or Active Record. It consists of Actions, Entities, and Contracts. Actions are basically use cases, user stories, interactors, or some equivalent to that. Entities are just objects that hold data and do validation. Contracts act as a proxy for your data gateways and ensure the input/output formats.
Gateways are the persistence mechanism. They are objects defined by a contract and are implemented as datasource-specific drivers that are swappable, so long as the driver adheres to the contract. The "app" side of things doesn't care what kind of driver you use; it only cares that you send and receive data using the right formats.
In the end you end up with your "app" being a nice bit of code that is totally testable outside of a framework or the database. To make it run with a database, you just write a gateway driver for your database that matches the Contract you wrote. It uses the simplest form of Dependency Injection to make this work.
The nice thing about this approach is that your persistence mechanism is totally pluggable, so as you need to scale, you can swap out your gateway driver without touching the app. Also, it's obvious when DB calls are happening in this system, because when you call, say, post_gateway.find, you know that it's going through the contract, is going to connect to whatever gateway driver you specify, and that it will return the right data format regardless of whether it hits memcache, mysql, mongo, redis, cassandra, a 3rd-party API, or whatever.
Also, you can write the whole "app" part of your code TDD and get like 100% test coverage. If you feel like you need to test against real data instead of mocking, you can write an in-memory or filesystem driver and your tests will still be crazy fast.
In this kind of system you could still use Rails or any other framework for your views/routing/controllers. You can use ActiveRecord or anything else as long as that code lives behind a gateway instead of being used to define your business objects.
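A minimal sketch of the Contract/Gateway idea in Python (the original is presumably Ruby, and all names here are invented): the app depends only on the contract, and drivers are injected.

```python
class PostContract:
    """Proxies a gateway driver and enforces the data format."""
    def __init__(self, driver):
        self.driver = driver  # simplest form of dependency injection

    def find(self, post_id):
        record = self.driver.find(post_id)
        # The contract guards the output format regardless of backend.
        assert set(record) >= {"id", "title"}, "driver broke the contract"
        return record


class InMemoryPostGateway:
    """Test driver; a MySQL/Redis/Mongo driver would expose the
    same find() and so be swappable behind the same contract."""
    def __init__(self):
        self._rows = {1: {"id": 1, "title": "Hello"}}

    def find(self, post_id):
        return self._rows[post_id]


posts = PostContract(InMemoryPostGateway())
post = posts.find(1)
```

This is also what makes the "app" testable outside the framework: the in-memory driver above is exactly the kind of fast test double the comment describes.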
If you are interested in this kind of architecture, I'm planning on open sourcing it soon once I have a few command line generators written to make it a bit easier to get started on.