Hacker News

Do you happen to know any articles on people splitting their data across 2 types of storage this way?

If there is any data that should be stored in 2 separate stores, I can see applications becoming messy rather fast in trying to replicate changes.

Unless, of course, this is usually done with a clean separation between which data goes where - but I doubt any domain can really be split up that easily.



Well, for starters - most companies don't run reporting queries on their production SQL database. They mirror, summarize, partition, index, and cube a separate reporting DB, so that big, mean queries don't cause massive latency on the site/product/production system. That isn't quite the same as using a different data store altogether, but some kind of split, with different queries and applications for reporting versus production, is already commonplace. The key here, though, is that setting up an RDBMS that can handle analytics on even a moderately large data set is a major task: it can be complex and pricey, tends to use big iron, and only scales so far before it gets very, very expensive.
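To make the "summarize into a reporting DB" idea concrete, here's a minimal Python sketch: raw production rows get rolled up into a pre-aggregated summary, and reporting queries hit the small summary instead of the production tables. The data and field names are hypothetical, purely for illustration.

```python
from collections import defaultdict
from datetime import date

# Hypothetical "production" rows: one dict per order.
orders = [
    {"day": date(2009, 6, 1), "product": "widget", "amount": 10.0},
    {"day": date(2009, 6, 1), "product": "widget", "amount": 5.0},
    {"day": date(2009, 6, 2), "product": "gadget", "amount": 7.5},
]

def summarize(rows):
    """Roll raw rows up into (day, product) -> total revenue,
    the kind of pre-aggregated table a reporting DB would hold."""
    summary = defaultdict(float)
    for r in rows:
        summary[(r["day"], r["product"])] += r["amount"]
    return dict(summary)

report = summarize(orders)
# Reporting queries now scan the summary, not production.
print(report[(date(2009, 6, 1), "widget")])  # 15.0
```

In a real shop this rollup runs as a periodic ETL job into a separate database; the point is just that reports read pre-computed aggregates, not the live transactional tables.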

But yes, there are examples of what I just described. In practice, in many problem domains, most data of interest for reports does not change once it is written, so syncing up is not a major issue.

Streamy is a good example, I think. They use HBase for the front end, and run MapReduce jobs on the back end. http://wiki.apache.org/hadoop-data/attachments/HBase(2f)HBas... Another presentation is here: http://www.docstoc.com/docs/2996433/Hadoop-and-HBase-vs-RDBM... That's Hadoop on both ends, which is nice - but HBase is optimized for the front end and is fundamentally different from Hadoop's typical batch operation.

CouchDB sort of takes this approach, albeit with key/value access and pre-defined, materialized map/reduce views on the same store. I think this dichotomy will become increasingly common, and it will be less cumbersome than it currently is as the tools mature.
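The CouchDB-style view idea can be sketched in a few lines of Python (real CouchDB views are JavaScript map/reduce functions stored in design documents and maintained incrementally; this toy version just recomputes the whole view, and the documents are made up):

```python
# Hypothetical documents in the store.
docs = [
    {"_id": "1", "type": "comment", "author": "alice"},
    {"_id": "2", "type": "comment", "author": "bob"},
    {"_id": "3", "type": "comment", "author": "alice"},
]

def map_fn(doc):
    # Emit (author, 1) for every comment document.
    if doc.get("type") == "comment":
        yield doc["author"], 1

def reduce_fn(values):
    return sum(values)

# "Materialize" the view: group emitted rows by key, then reduce each group.
rows = {}
for doc in docs:
    for key, value in map_fn(doc):
        rows.setdefault(key, []).append(value)
view = {key: reduce_fn(vals) for key, vals in rows.items()}
# view maps each author to their comment count.
```

The appeal is that the view is computed ahead of time, so queries against it are cheap key lookups rather than scans over the raw documents.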

Key/value for the front end and map/reduce on the back end makes a lot of sense for a lot of problems, since key/value access is how many applications actually work. And there is the added benefit that systems like these scale linearly on commodity hardware using FOSS, which can make them cost-effective - and much simpler - than scaling a traditional RDBMS as an analytic data store. The upside is too good for these systems not to win a big chunk of the market. And you can have your SQL - albeit on top of MapReduce - in reports, where it belongs :)
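The split described above - cheap key/value reads for serving, a batch map/reduce pass over everything for analytics - fits in a short Python sketch. The store contents and key scheme here are invented for illustration:

```python
from collections import defaultdict

# Front end: O(1) key/value reads, as an HBase-like store would give you.
store = {
    "user:1:pageviews": ["/", "/pricing"],
    "user:2:pageviews": ["/", "/docs", "/pricing"],
}

# Back end: a batch map/reduce pass over all rows for reporting.
def map_fn(key, pages):
    for page in pages:
        yield page, 1

def reduce_fn(page, counts):
    return page, sum(counts)

grouped = defaultdict(list)
for key, value in store.items():
    for out_key, out_val in map_fn(key, value):
        grouped[out_key].append(out_val)
totals = dict(reduce_fn(p, c) for p, c in grouped.items())

# The app serves reads straight from `store`; reports come from `totals`.
```

Note that the two sides never contend: the serving path only does point lookups, while the batch job scans everything on its own schedule - which is exactly why this layout scales so cheaply.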



