
A. That is fine until the first time somebody has to write a report.

B. With the right product, administration is easy. MySQL is completely ubiquitous, and something like SQL Server is easy. This admittedly conflicts with point C.

C. Most of these people aren't doing high-volume anything. If you are doing high-volume stuff (like Google) then of course it makes sense to use something very specific to your task. But that doesn't mean Bigtable, for example, makes for a good generic solution.




If you anticipate having to write a report, obviously you should use the right tool. If you don't, use what's simple and gets the job done. It's not incredibly hard to switch later.

On the low end, administration for MySQL is certainly not easy; for an app you build in a day or two on Rails or Django or Sinatra (or as a CLI tool, for that matter) you can easily spend more time doing sysadmin work to provision, configure and maintain MySQL than you do writing software. (MySQL shared hosting isn't everywhere.) There are plenty of throwaway webapps out there that just aren't worth setting up a database server for.

On the high end, it's not a matter of what's easy; it's a matter of what's possible.

These are different use cases, and there are different software packages being advocated for them. No one who needs a disk-backed hash table is using Hypertable, and no one who needs massively parallel analytical capability is using Tokyo Cabinet. And no one's advocating reconsidering RDBMS use as a whole; just RDBMS use as the default choice.
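To make the low-end case concrete, here's a minimal sketch (in Python, with the stdlib dbm module standing in for something like Tokyo Cabinet) of a disk-backed hash table serving as the entire persistence layer for a small app; the key names and record fields are made up for illustration.

    import dbm
    import json

    # Open (or create) a disk-backed hash table; no server to provision.
    with dbm.open("sessions.db", "c") as store:
        # Values must be bytes, so serialize small records as JSON.
        store[b"user:42"] = json.dumps({"name": "alice", "visits": 3}).encode()

        # Read, modify, write back.
        record = json.loads(store[b"user:42"].decode())
        record["visits"] += 1
        store[b"user:42"] = json.dumps(record).encode()

That's the whole "database": one file, no daemon, no credentials, nothing to administer.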


I think the key insight in all this is that the database you store your objects in... shouldn't be the same one you write reports from. You can have two: one object store for your application, and another analytic DB for your reports.

The complaint about SQL is that it is optimized for report writing, not application development. Fine, so split that out.

Which is happening. Which is working.
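For concreteness, here's a minimal sketch of that split, with Python's stdlib shelve standing in for the object store and sqlite3 for the reporting DB; the order/customer/total fields are invented for illustration.

    import shelve
    import sqlite3

    def app_write(order_id, order):
        # Application path: store the whole object under one key.
        with shelve.open("objects") as store:
            store[order_id] = order

    def sync_reports():
        # Batch path: flatten objects into a table built for report queries.
        con = sqlite3.connect("reports.db")
        con.execute("CREATE TABLE IF NOT EXISTS orders "
                    "(id TEXT PRIMARY KEY, customer TEXT, total REAL)")
        with shelve.open("objects") as store:
            for order_id, order in store.items():
                con.execute("INSERT OR REPLACE INTO orders VALUES (?, ?, ?)",
                            (order_id, order["customer"], order["total"]))
        con.commit()
        con.close()

    app_write("o-1", {"customer": "alice", "total": 19.99})
    sync_reports()
    # Reports query reports.db; the application only ever does key lookups.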


Do you happen to know of any articles on people splitting their data across two types of storage this way?

If any data needs to live in both stores, I can see applications becoming messy rather fast in trying to replicate changes between them.

Unless, of course, this is usually done with a clean separation between which data goes where; but I doubt whether any domain can really be split up that easily.


Well, for starters: most companies don't run reporting queries on their production SQL database. They mirror, summarize, partition, index and cube a separate reporting DB, so that big, mean queries don't cause massive latency on their site/product/production system. That isn't quite the same as using a different data store altogether, but some kind of split is commonplace, which means some difference between the queries run for reports and the queries run by the application is already the norm. The key here, though, is that setting up an RDBMS that can handle analytics on even a moderately large data set is a major task: it can be complex and pricey, tends to use big iron, and only scales so far before it gets very, very expensive.

But yes, there are examples of what I just described. In practice, in many problem domains, most of the data of interest for reports does not change once it is written, so syncing up is not a major issue.

Streamy is a good example, I think. They use HBase for the front end, and run MapReduce jobs on the back end. http://wiki.apache.org/hadoop-data/attachments/HBase(2f)HBas... Another presentation is here: http://www.docstoc.com/docs/2996433/Hadoop-and-HBase-vs-RDBM... That is Hadoop and Hadoop, which is nice, but HBase is optimized for the front end and is fundamentally different from the typical batch operation of Hadoop.
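Not Streamy's actual code, but the back-end half of that pattern usually looks something like a Hadoop Streaming job: a mapper and a reducer in Python reading and writing tab-separated lines on stdin/stdout. The log format (story id in the second column) is assumed here.

    import sys

    def mapper():
        # mapper.py: emit (story_id, 1) for every page-view log line.
        for line in sys.stdin:
            fields = line.rstrip("\n").split("\t")
            if len(fields) > 1:
                print(fields[1] + "\t1")

    def reducer():
        # reducer.py: Hadoop Streaming hands the reducer its input sorted by
        # key, so a running count per key yields the view-count report.
        current_key, count = None, 0
        for line in sys.stdin:
            key, value = line.rstrip("\n").split("\t")
            if key != current_key:
                if current_key is not None:
                    print(current_key + "\t" + str(count))
                current_key, count = key, 0
            count += int(value)
        if current_key is not None:
            print(current_key + "\t" + str(count))

In practice the two functions live in separate scripts passed to hadoop-streaming as -mapper and -reducer; the point is just that the reporting work is a batch job on the back end, not a query against the store serving the front end.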

CouchDB sort of takes this approach, albeit with key/value storage and pre-defined, materialized map/reduce views on the same store. I think this dichotomy will become increasingly common, and it will become less cumbersome than it currently is as the tools mature.
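As a sketch of the CouchDB flavour of this (assuming a local CouchDB at localhost:5984 and a database named "app"; the view name and document fields are illustrative): the view is defined once in a design document, materialized incrementally, and reports query the view rather than scanning documents.

    import json
    import urllib.request

    BASE = "http://localhost:5984/app"

    # A design document holding a map/reduce view (written in JavaScript,
    # as CouchDB expects). Re-running this PUT requires the current _rev.
    design_doc = {
        "views": {
            "totals_by_customer": {
                "map": "function(doc) { if (doc.type == 'order') "
                       "emit(doc.customer, doc.total); }",
                "reduce": "function(keys, values) { return sum(values); }",
            }
        }
    }

    req = urllib.request.Request(
        BASE + "/_design/reports",
        data=json.dumps(design_doc).encode(),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    urllib.request.urlopen(req)

    # Report query: grouped totals straight from the materialized view.
    url = BASE + "/_design/reports/_view/totals_by_customer?group=true"
    print(json.load(urllib.request.urlopen(url)))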

Key/value for the front end and MapReduce on the back end makes a lot of sense for a lot of problems, since key/value is how many applications actually work. There is the added benefit that systems like these scale linearly on commodity hardware using FOSS, which can make them cost-effective, and much simpler, than scaling a traditional RDBMS as an analytic data store. The upside is too good for these systems not to win a big chunk of the market. And you can have your SQL, albeit on top of MapReduce, in reports, where it belongs :)



