The Trickery in "MapReduce vs. Parallel DBs"

lacker · on May 5, 2009

It doesn't make sense to compare mapreduce to databases. You might as well compare C++ with a 5 iron.

Mapreduce is for batch processing; databases are for live queries. You often use mapreduce with a database: a mapreduce can fill a database with data, or take all the data out of a database and produce something in a different format. You don't write 100,000-line SQL statements, but mapreduce is designed to run arbitrary code in the inner loop. You can't write a mapreduce-backed website that kicks off a mapreduce with each request, but that's a great use of SQL. And so on.
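
To make the "mapreduce fills a database" pattern concrete, here is a minimal sketch, assuming a toy single-process runner rather than any real MapReduce framework. The names `run_mapreduce`, `map_fn`, `reduce_fn`, `load_counts`, and the `word_counts` table are all made up for illustration. The batch step runs arbitrary code per record; the result is loaded into SQLite, which then answers the cheap per-request query.

```python
import sqlite3
from collections import defaultdict


def map_fn(line):
    # Arbitrary code could run here; this toy version just tokenizes a line.
    for word in line.lower().split():
        yield word, 1


def reduce_fn(key, values):
    # Combine every value emitted for one key.
    return key, sum(values)


def run_mapreduce(records, map_fn, reduce_fn):
    # Hypothetical in-process stand-in for a real distributed runner:
    # map each record, group by key (the "shuffle"), then reduce each group.
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    return [reduce_fn(key, values) for key, values in groups.items()]


def load_counts(conn, counts):
    # Batch-load the job output so it can be served by ordinary SQL queries.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS word_counts (word TEXT PRIMARY KEY, n INTEGER)"
    )
    conn.executemany("INSERT OR REPLACE INTO word_counts VALUES (?, ?)", counts)
    conn.commit()


logs = ["the quick brown fox", "the lazy dog", "the fox"]
conn = sqlite3.connect(":memory:")
load_counts(conn, run_mapreduce(logs, map_fn, reduce_fn))

# The "live query" side: a small lookup that would be fine to run per request.
top = conn.execute(
    "SELECT word, n FROM word_counts ORDER BY n DESC LIMIT 1"
).fetchone()
```

The split is the point: the expensive pass over all the data happens once in the batch job, and the website only ever touches the table.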

neilc · on May 5, 2009

databases are for live queries

No, databases are also commonly used for batch queries. Are you familiar with data warehousing? Teradata, Greenplum, Netezza, and similar products? This sort of stuff, for example:

http://www.dbms2.com/2009/04/30/ebays-two-enormous-data-ware...

Parallel databases have been around since the late 1980s, and can be used for many of the same things MapReduce is often used for (large-scale batch data analysis). Hence the comparison.

lacker · on May 5, 2009

It's true there is overlap. But the original article as well as that dbms2 link focus on "query" performance. Mapreduce is specifically not designed for live queries, or really anything where the output is less than 10G or so. So its performance is no good for those things. On the other hand, at least as far as I am aware, with a data warehouse you can do some batch processing but SQL UDFs are fundamentally restricted. It is implausible to communicate with remote servers or parse html using a database. But those are normal things to do in a mapreduce.
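
As a rough illustration of the "parse html in a mapreduce" case, here is a sketch using only the Python standard library. It assumes a toy in-process driver, not a real cluster; `LinkExtractor`, `map_page`, `reduce_host`, and `count_link_hosts` are hypothetical names. The map step parses each page and emits one record per outbound link host, and the reduce step sums them.

```python
from collections import defaultdict
from html.parser import HTMLParser
from urllib.parse import urlparse


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def map_page(page_html):
    # Map step: parse one page and emit (host, 1) for each absolute link.
    parser = LinkExtractor()
    parser.feed(page_html)
    for href in parser.links:
        host = urlparse(href).netloc
        if host:  # relative links have no host, so they are skipped
            yield host, 1


def reduce_host(host, counts):
    # Reduce step: total number of links pointing at one host.
    return host, sum(counts)


def count_link_hosts(pages):
    # Toy driver standing in for the framework's shuffle between map and reduce.
    groups = defaultdict(list)
    for page in pages:
        for host, one in map_page(page):
            groups[host].append(one)
    return dict(reduce_host(host, counts) for host, counts in groups.items())


pages = [
    '<p><a href="http://example.com/a">a</a> <a href="http://example.org/">b</a></p>',
    '<a href="http://example.com/b">c</a><a href="/relative">d</a>',
]
host_counts = count_link_hosts(pages)
```

Nothing in `map_page` is special to this example; it is ordinary application code, which is the kind of thing that gets awkward inside a SQL UDF.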

neilc · on May 5, 2009

the original article as well as that dbms2 link focus on "query" performance. Mapreduce is specifically not designed for live queries

"query" != "live query".

[MapReduce was not designed for] anything where the output is less than 10G or so

Where does the MapReduce architecture make any such assumption?

It is implausible to communicate with remote servers or parse html using a database. But those are normal things to do in a mapreduce.

Sure -- while it is quite possible to do such things using SQL UDFs, I can believe it isn't done very often. Nevertheless, there is considerable overlap between the two tools, and plenty of scope for comparison.

anamax · on May 5, 2009

There's a difference between batch processing and batch queries.

At least one clustered SQL database vendor supports the use of map-reduce to do data transforms, the results of which are made available via standard SQL queries.