

The Trickery in "MapReduce vs. Parallel DBs" - neilc
http://everythingisdata.wordpress.com/2009/05/04/mapreduce-vs-parallel-dbs/

======
lacker
It doesn't make sense to compare mapreduce to databases. You might as well
compare C++ with a 5 iron.

MapReduce is for batch processing; databases are for live queries. You often
use MapReduce _with_ a database: a MapReduce job can fill a database with data,
or take all the data out of a database and produce something in a different
format. You don't write 100,000-line SQL statements, but MapReduce is designed
to run arbitrary code in the inner loop. You can't write a MapReduce-backed
website that kicks off a MapReduce job with each request, but that's a great
use of SQL. And so on.

~~~
neilc
_databases are for live queries_

No, databases are also commonly used for batch queries. Are you familiar with
data warehousing? Teradata, Greenplum, Netezza, and similar products? This
sort of stuff, for example:

http://www.dbms2.com/2009/04/30/ebays-two-enormous-data-warehouses/

Parallel databases have been around since the late 1980s, and can be used for
many of the same things MapReduce is often used for (large-scale batch data
analysis). Hence the comparison.

~~~
lacker
It's true there is overlap. But the original article as well as that dbms2
link focus on "query" performance. Mapreduce is specifically not designed for
live queries, or really anything where the output is less than 10G or so. So
its performance is no good for those things. On the other hand, at least as
far as I am aware, with a data warehouse you can do some batch processing, but
SQL UDFs are fundamentally restricted. It is implausible to communicate with
remote servers or parse html using a database. But those are normal things to
do in a mapreduce.
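
To make "arbitrary code in the inner loop" concrete, here is a minimal
single-process sketch of a MapReduce-style job whose map step parses HTML. It
is not tied to any real MapReduce framework: the names `LinkExtractor`,
`map_page`, `reduce_counts`, and `run_job` are made up for illustration, and
the in-memory grouping only stands in for the distributed shuffle a real
system would perform.

```python
from collections import defaultdict
from html.parser import HTMLParser
from urllib.parse import urlparse


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def map_page(page_url, html):
    """Map step: run arbitrary code on one record, emit (host, 1) per link."""
    parser = LinkExtractor()
    parser.feed(html)
    for link in parser.links:
        host = urlparse(link).netloc
        if host:  # relative links have no host and are skipped
            yield host, 1


def reduce_counts(host, counts):
    """Reduce step: total the counts emitted for one host."""
    return host, sum(counts)


def run_job(pages):
    """Hypothetical driver: groups map output by key, then reduces each group."""
    grouped = defaultdict(list)
    for page_url, html in pages:
        for key, value in map_page(page_url, html):
            grouped[key].append(value)
    return dict(reduce_counts(k, v) for k, v in grouped.items())
```

Calling `run_job` with a list of `(url, html)` pairs returns a dict mapping
each linked host to the number of links pointing at it. The map function is
ordinary Python, so it can call any library it likes, which is the part that
is awkward to express as a SQL UDF.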

~~~
neilc
_the original article as well as that dbms2 link focus on "query" performance.
Mapreduce is specifically not designed for live queries_

"query" != "live query".

_[MapReduce was not designed for] anything where the output is less than 10G
or so_

Where does the MapReduce architecture make any such assumption?

_It is implausible to communicate with remote servers or parse html using a
database. But those are normal things to do in a mapreduce._

Sure -- while it is quite possible to do such things using SQL UDFs, I can
believe it isn't done very often. Nevertheless, there is _considerable_
overlap between the two tools, and plenty of scope for comparison.

