
I wonder if we just have a generation of programmers who can't think in SQL.

I think that's probably true. SQL requires thinking declaratively, not imperatively, and in sets, not objects. Kids these days don't understand the difference between a table definition and a class and the difference between a row and an object. The declarative/functional style isn't really taught anymore; everyone just wants to learn Java and get a job.
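To make the sets-not-objects point concrete, here's a small sketch (my illustration, using Python's built-in sqlite3 -- table and column names are made up): the imperative habit loops over one collection probing another, while the declarative habit states the set you want and lets the engine plan it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE signups (email TEXT);
    CREATE TABLE purchases (email TEXT);
    INSERT INTO signups VALUES ('a@x.com'), ('b@x.com'), ('c@x.com');
    INSERT INTO purchases VALUES ('b@x.com');
""")

# Imperative/object habit: loop over one collection, probe the other.
bought = {e for (e,) in conn.execute("SELECT email FROM purchases")}
loop_result = sorted(e for (e,) in conn.execute("SELECT email FROM signups")
                     if e not in bought)

# Declarative/set habit: state the set difference; the engine decides how.
set_result = sorted(e for (e,) in conn.execute(
    "SELECT email FROM signups EXCEPT SELECT email FROM purchases"))

assert loop_result == set_result == ["a@x.com", "c@x.com"]
```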

If anything, ORMs are less "scalable" than SQL because under the hood the system is based on SQL, and in an effort to be completely generic, the ORM generates really bad SQL.



Indeed, "kids" (by which I assume you mean "assembly-line CS grads") these days are pretty much using Hibernate, ActiveRecord or (I guess?) LINQ. This is a good thing: The O/R mismatch exists. For good programmers ORMs are more productive in most cases (and you can write raw SQL for the exceptions); bad programmers have never gotten the relational model (and there's plenty of legacy code to prove it.)

However, the discussion about "kids" is a total red herring, because I'm pretty sure all of the people involved in talking up Tokyo Cabinet, memcachedb, Couch etc on one end and Cassandra, Hypertable or HTable on the other are fully aware of the difference between a class and a table. The truly clueless don't even know the discussion is happening.

The object/document database revival circa 2009 is about three things:

A. People are already using object-based access for almost everything, so if they're not using relational features it's trivial to drop in an object store and get better performance/less overhead.

B. On the low-end, sometimes a relational database is too much overhead -- if not in performance, in administration. (SQLite is cheap, sure, but sometimes you want just a disk-backed hash table.)

C. On the high-volume-data-analytics end, RDBMSs don't perform well enough -- unless you shell out serious cash for Oracle or Greenplum, Aster Data, Teradata, etc, and even then they still can't handle the volumes MapReduce and Bigtable-alikes can.
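Point B is easy to make concrete: "just a disk-backed hash table" can be nothing more than the Python standard library's dbm module (my sketch, not something the commenters named) -- no server, no schema, no sysadmin work.

```python
import dbm
import os
import tempfile

# A throwaway disk-backed hash table: no server process, no schema.
path = os.path.join(tempfile.mkdtemp(), "sessions")
with dbm.open(path, "c") as db:        # "c" creates the file if absent
    db[b"user:42"] = b'{"name": "alice", "cart": [1, 2, 3]}'

# Reopen later: the data survived on disk with no database running.
with dbm.open(path, "r") as db:
    value = db[b"user:42"]

assert value == b'{"name": "alice", "cart": [1, 2, 3]}'
```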


A. That is fine until the first time somebody has to write a report.

B. With the right product, administration is easy. MySQL is completely ubiquitous and something like SQL Server is easy. This certainly conflicts with point C.

C. Most of these people aren't doing high-volume anything. If you are doing high-volume stuff (like Google) then of course it makes sense to use something very specific to your task. But that doesn't mean Bigtable, for example, makes for a good generic solution.


If you anticipate having to write a report, obviously you should use the right tool. If you don't, use what's simple and gets the job done. It's not incredibly hard to switch later.

On the low-end, administration for MySQL is certainly not easy; for an app you build in a day or two on Rails or Django or Sinatra (or as a CLI tool for that matter) you can easily spend more time doing sysadmin work to provision, configure and maintain MySQL than you do writing software. (MySQL shared hosting isn't everywhere.) There are plenty of throwaway webapps out there that just aren't worth setting up a database server for.

On the high end, it's not a matter of what's easy; it's a matter of what's possible.

These are different use cases, and there are different software packages being advocated for them. No one who needs a disk-backed hash table is using Hypertable, and no one who needs massively parallel analytical capability is using Tokyo Cabinet. And no one's advocating reconsidering RDBMS use as a whole; just RDBMS use as the default choice.


I think the key insight in all this is that the same database you store your objects in... shouldn't be the same one you write reports from. You can have two. One object store for your application, and another analytic DB for your reports.

The complaint about SQL is that it's optimized for report writing, not application development. Fine, so split that out.

Which is happening. Which is working.


Do you happen to know any articles on people splitting their data across 2 types of storage this way?

If any data needs to be stored in 2 separate stores, I can see applications becoming messy rather fast trying to replicate changes.

Unless of course this is usually done by having a clean separation between which data goes where, but I doubt any domain can ever really be split up that easily.


Well, for starters - most companies don't run reporting queries on their production SQL database. They mirror, summarize, partition, index and cube a separate reporting DB, so that big/mean queries don't cause massive latency on their site/product/production system. Which isn't quite the same as using a different data-store altogether, but some kind of split is commonplace. Which means that some kind of difference between queries/applications in reports/production is already common. The key here, though, is that setting up an RDBMS that can handle analytics on even a moderately large data-set is a major task, can be complex/pricey, tends to use big iron, and only scales so far before it gets very, very expensive.
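A minimal sketch of that production/reporting split (two in-memory SQLite connections standing in for the two databases; names are mine): a periodic ETL step summarizes production rows into the reporting store, so the big aggregate queries never touch production.

```python
import sqlite3

prod = sqlite3.connect(":memory:")      # stands in for the production DB
report = sqlite3.connect(":memory:")    # stands in for the reporting DB

prod.execute("CREATE TABLE sales (day TEXT, amount REAL)")
prod.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("2009-07-01", 10.0), ("2009-07-01", 5.0),
                  ("2009-07-02", 20.0)])

# Periodic ETL step: summarize into the reporting store.
report.execute("CREATE TABLE daily_sales (day TEXT, total REAL)")
rows = prod.execute(
    "SELECT day, SUM(amount) FROM sales GROUP BY day").fetchall()
report.executemany("INSERT INTO daily_sales VALUES (?, ?)", rows)

# Reporting queries now hit the reporting DB only.
totals = dict(report.execute("SELECT day, total FROM daily_sales"))
assert totals == {"2009-07-01": 15.0, "2009-07-02": 20.0}
```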

But yes, there are examples of what I just described. In practice, in many problem domains, most data of interest for reports does not change once it is written, so syncing up is not a major issue.

Streamy is a good example, I think. They use HBase for the front end, and run MapReduce jobs on the back end. http://wiki.apache.org/hadoop-data/attachments/HBase(2f)HBas... Another presentation is here: http://www.docstoc.com/docs/2996433/Hadoop-and-HBase-vs-RDBM... That is Hadoop and Hadoop, which is nice - but HBase is optimized for the front end and is fundamentally different than typical batch operation of Hadoop.

CouchDB sort of takes this approach, albeit with key/value and pre-defined and materialized map/reduce views on the same store. I think this dichotomy will become increasingly common, and will be less cumbersome than it currently is as the tools mature.

Key/Value for the front end and Map/Reduce on the back end makes a lot of sense for a lot of problems, since key/value is how many applications actually work. There's the added benefit that systems like these scale linearly on commodity hardware using FOSS, which makes them cost-effective -- and much simpler -- than scaling a traditional RDBMS as an analytic data-store. The upside to this is too good for these systems not to win a big chunk of the market. And you can have your SQL - albeit on top of MapReduce - in reports, where it belongs :)
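A toy illustration of that division of labor (pure Python, nowhere near Hadoop scale; all names are mine): the front end does plain key/value gets and puts, and the back end runs a map/reduce pass over the same records for analytics.

```python
from collections import defaultdict

# Front end: plain key/value access, which is how the app reads/writes.
store = {}
store["event:1"] = {"user": "alice", "bytes": 100}
store["event:2"] = {"user": "bob", "bytes": 50}
store["event:3"] = {"user": "alice", "bytes": 25}

# Back end: a map/reduce pass over all records.
def map_fn(record):
    yield (record["user"], record["bytes"])

def reduce_fn(key, values):
    return (key, sum(values))

grouped = defaultdict(list)
for record in store.values():
    for k, v in map_fn(record):
        grouped[k].append(v)

per_user = dict(reduce_fn(k, vs) for k, vs in grouped.items())
assert per_user == {"alice": 125, "bob": 50}
```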


in an effort to be completely generic, the ORM generates really bad SQL.

Maybe your hand-rolled ORM does, but most off-the-shelf ORMs generate excellent SQL.


It's a running joke where I work how bad Hibernate's SQL is compared to hand-written by experienced developers. Maybe it's "good enough" for some applications, but we don't really get out of bed for less than 5000 transactions/second...


There are potentially two problems:

Your database's query optimizer sucks, or

The overhead is object inflation, not a slow SQL query.


Hibernate generates optimal SQL in all situations for all databases and their different dialects? Really?


I'm just saying that the optimizer sucks if semantically identical queries don't execute the same instructions against the database.

I would certainly not be surprised if this happens in real life... but the solution is not to hand-code every SQL query; it's to fix the database.
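You can actually inspect this with SQLite's EXPLAIN QUERY PLAN -- a sketch (my query choices, not from the thread) comparing two phrasings of the same question; a good optimizer should pick the same plan for both.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO users VALUES (1, 'alice')")

# Two semantically identical queries, phrased differently.
q1 = "SELECT name FROM users WHERE id = 1"
q2 = "SELECT name FROM users WHERE id IN (1)"

# Same answer either way.
assert conn.execute(q1).fetchall() == conn.execute(q2).fetchall()

# Compare the plans the optimizer actually chose.
for q in (q1, q2):
    for row in conn.execute("EXPLAIN QUERY PLAN " + q):
        print(row)
```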


You speak as if writing SQL was some unpleasant or arduous task, whereas in reality it's just a DSL for data. In most cases an ORM is just another layer of complexity.


> In most cases an ORM is just another layer of complexity.

That, unlike SQL, drastically reduces the amount of code the programmer is forced to write for the vast majority of applications that programmers write.

SQL can't do the one thing most programs actually need: give them the ability to select a starting point, and then navigate the conceptual graph of data as the user moves around the application. ORMs provide that, object databases provide that, SQL doesn't.
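A sketch of what "navigating the graph" means in practice -- hand-rolled lazy loading over sqlite3 (my toy schema; a real ORM like Hibernate or ActiveRecord generates this machinery for you): pick a starting object, then walk to its related rows as attributes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE posts (id INTEGER PRIMARY KEY, author_id INTEGER, title TEXT);
    INSERT INTO authors VALUES (1, 'alice');
    INSERT INTO posts VALUES (10, 1, 'hello'), (11, 1, 'world');
""")

class Author:
    def __init__(self, row):
        self.id, self.name = row

    @property
    def posts(self):  # lazy: the rows are fetched only when you walk here
        return [t for (t,) in conn.execute(
            "SELECT title FROM posts WHERE author_id = ? ORDER BY id",
            (self.id,))]

# Select a starting point, then navigate the object graph.
author = Author(conn.execute(
    "SELECT id, name FROM authors WHERE id = 1").fetchone())
assert author.posts == ["hello", "world"]
```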


You're right, and that's what's great about ORMs! But if you're asking a question of your data, like "how many widgets did I sell today?", then navigating a conceptual graph of data is the slowest, most convoluted way of getting at that information.
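The contrast in miniature (my made-up data; sqlite3 again): answering one aggregate question by walking the object graph visits every object, while the set-oriented version is a single declarative query.

```python
import sqlite3

# The application's object graph: great for navigation, slow for questions.
customers = [
    {"name": "acme",   "orders": [{"widget": "sprocket", "qty": 3},
                                  {"widget": "sprocket", "qty": 2}]},
    {"name": "globex", "orders": [{"widget": "cog", "qty": 7}]},
]

# Graph walk: visit every customer and every order to answer one question.
walked = sum(o["qty"] for c in customers for o in c["orders"]
             if o["widget"] == "sprocket")

# Set-oriented: flatten once into a table and ask declaratively.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE order_lines (widget TEXT, qty INTEGER)")
conn.executemany(
    "INSERT INTO order_lines VALUES (?, ?)",
    [(o["widget"], o["qty"]) for c in customers for o in c["orders"]])
(queried,) = conn.execute(
    "SELECT SUM(qty) FROM order_lines WHERE widget = 'sprocket'").fetchone()

assert walked == queried == 5
```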



