

Parallel DBMSs faster than MapReduce - finin
http://database.cs.brown.edu/sigmod09/

======
andr
They are different beasts really. It'd be hard to implement something complex,
like the PageRank algorithm, in SQL. On the other hand, MapReduce is not
designed for tasks such as "SELECT * FROM table WHERE field > 5", which is one
of the benchmarks the paper uses.
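
(For a sense of the mismatch: here is a minimal sketch of that one-line selection as a Hadoop Streaming map-only job, in Python; the tab-separated layout and the column position of "field" are assumptions made up for illustration.)

    #!/usr/bin/env python
    # filter_mapper.py -- map-only Streaming sketch approximating
    # "SELECT * FROM table WHERE field > 5" over tab-separated rows.
    # The position of "field" (second column) is an assumed layout.
    import sys

    FIELD_INDEX = 1  # zero-based position of "field" in each row

    for line in sys.stdin:
        cols = line.rstrip("\n").split("\t")
        try:
            if float(cols[FIELD_INDEX]) > 5:
                sys.stdout.write(line)  # emit the whole row, like SELECT *
        except (IndexError, ValueError):
            pass  # skip malformed rows rather than failing the task

You'd run it with zero reducers (something like "-mapper filter_mapper.py -numReduceTasks 0" in Streaming): a full job launch and a scan of all the data, for what a DBMS answers with one indexed query.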

Also, parallel DBMSs don't handle failure well. Ask any of the MapReduce
developers at Google and they'll tell you failure handling is the hardest part
of the problem.

With all that said, I'd be very interested to see a comparison between
Vertica, DBMS-X and MySQL, running in sharded or MySQL Cluster mode.

PS The paper is coauthored by Microsoft.

~~~
gaius
_Also, parallel DBMSs don't handle failure well._

It depends on the type of failure. You can simply skip offline partitions, for
example.

 _PS The paper is coauthored by Microsoft._

Microsoft were doing shared-nothing clustering on SQL Server LONG before MySQL
types had even coined the word sharding to mean the same thing...

~~~
andr
_Microsoft were doing shared-nothing clustering on SQL Server LONG before MySQL
types had even coined the word sharding to mean the same thing..._

Microsoft are heavily invested in SQL Server, so they are biased against
MapReduce.

~~~
gaius
_Microsoft are heavily invested in SQL Server, so they are biased against
MapReduce._

Microsoft is a huge company, and parts of it are biased in favour of things
like this (e.g. the bits working on F#).

~~~
eob
The paper was co-authored by David DeWitt, not Microsoft. DeWitt is a well-
known database researcher from Wisconsin. And Microsoft, despite their awful
history of business practices, pours tons of funding into really good quality
research (via MSR).

I'm not saying people aren't biased by their associations -- certainly, the
traditional DBMS community feels like they have to defend their research &
products against new techniques like MapReduce... but if you're going to
criticize the paper, criticize it because you found flaws in its contents, not
because of the organization listed beneath an author.

------
codeslinger
Did no one else notice that this paper was authored by the C-Store guys, the
ones who went on to found Vertica (specifically Stonebraker)? They have an
intense bias against map/reduce, as could be expected.

Also, one important fact that people seem to have missed is that they admit
m/r was better at loading the data, a task which, under its usual name (ETL),
dominates most data warehousing workloads.

Finally, please note the price tags on the solutions advocated by this paper.
They are _far_ out of the reach of any startup founders reading this paper. I
looked at them at both Commerce360 and Invite Media, and they were just
nowhere near reasonable enough for us to replace our existing Hadoop
infrastructure.

------
vicaya
The paper is written to sell Vertica, which is quickly becoming less appealing
given the rise of open source parallel DBs like HBase and Hypertable.

Hadoop is built for high throughput, not low latency. If you read the paper,
Hadoop runs circles around Vertica et al. for data loading.

It's entirely possible to build a low latency map-reduce implementation.

~~~
neilc
_Hadoop is built for high throughput, not low latency._

That said, the parallel DBs still beat Hadoop at throughput for almost all of
the queries, just not the data loading.

------
strlen
Keep in mind that the article itself admits that loading the data into an
RDBMS took _much longer_ than loading it onto a Hadoop cluster.

That's the crux of it: if your data isn't relational to start with
(e.g. messages in a message queue, being written to files in HDFS), you're
better off with Map/Reduce. If your data is relational to start with, you're
better off doing traditional OLAP on an RDBMS cluster.

When I worked for a start-up ad network, we couldn't afford the
latency/contention implications of writing to a database each time we served
an ad. Initially, we moved the model to "log impressions to text files, rsync
the text files over, run a cron job to enter them into an RDBMS". The problem
was that processing the data (on a single node) took so long that it would no
longer be meaningful by the time it was in the RDBMS (and the OLAP queries had
finished running).

I thought about how I could ensure that a specific machine only processed a
specific log file, or a specific section of a log file -- but then I realized
that's what the map part of map/reduce is. In the end I set up a Hadoop
cluster, which made both "soft real time" (hourly/bi-hourly) data processing
and long-term queries (running on multi-month, multi-terabyte datasets)
much faster and easier.

Theoretically, yes: an RDBMS is the most efficient way to run complex
analytics queries. Practically, we have to handle such issues as
asynchronicity, contention, ease of scalability, and getting the data into a
relational format (which is itself something that can be done using Hadoop:
one of the tasks we used it for was transforming all sorts of text data --
crawled webpages, log files -- into CSV that could be loaded onto the
OLAP/OLTP clusters with "LOAD DATA INFILE").

~~~
neilc
_if your data isn't relational to start with (e.g. messages in a message
queue, being written to files in HDFS), you're better off with Map/Reduce_

If you query the data once and then throw it away, then certainly load
performance is a critical issue. If you load the data once and then query it
many times, load performance is far less important -- trading a longer load
time for much better query performance would be a good idea. So it depends on
the workload as much as the format the data happens to start in.

~~~
strlen
I think Hadoop and M/R are best for:

"Take the data, transform it into another form _ONCE_ , query the transformed
using a low latency scheme many times".

(E.g. building a web index, or, in my case, building per-ad/per-user/per-
publisher key/value pairs for ad targeting.)
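
(A sketch of the reduce side of that "transform once" step, assuming --
hypothetically -- that mappers emit "user_id<TAB>ad_id" pairs; Hadoop delivers
them to the reducer grouped and sorted by key.)

    #!/usr/bin/env python
    # targeting_reducer.py -- Streaming reducer sketch: collapse each
    # user's ad impressions into a single key/value record that a
    # low-latency store can serve at ad-serving time. The input format
    # ("user_id<TAB>ad_id" per line) is assumed for illustration.
    import sys
    from itertools import groupby

    def keyed_lines(stream):
        for line in stream:
            key, _, value = line.rstrip("\n").partition("\t")
            yield key, value

    for user_id, group in groupby(keyed_lines(sys.stdin), key=lambda kv: kv[0]):
        ad_ids = sorted({ad_id for _, ad_id in group})
        print("%s\t%s" % (user_id, ",".join(ad_ids)))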

------
strlen
Another issue:

The CAP theorem: consistency, availability, and (network) partition tolerance
-- pick two.

RDBMSs are excellent at providing consistency and availability with low
latency. Yet imagine building a massively parallel web map/crawling
infrastructure on top of an RDBMS -- particularly with multiple datacenters
and high-latency links between those datacenters: it may be feasible for an
enterprise/intranet crawler, but not for something at the scale of Yahoo or
Google search (both of which build up their _datasets/indices_ using
map/reduce and then distribute those indices to search nodes for low-latency
querying via more traditional approaches).

------
mjs
The conclusion is a pretty good summary, if you don't want to read the whole
lot.

A big thing that's missing from parallel DBMSs is fault tolerance, which the
paper cops to (but also downplays):

"MR is able to recover from faults in the middle of query execution in a way
most parallel database systems cannot. ... If a MR system needs 1,000 nodes to
match the performance of a 100 node parallel database system, it is ten times
more likely that a node will fail while a query is executing. That said,
better tolerance to failures is a capability that any database user would
appreciate."

Also: "Although we were not surprised by the relative performance advantages
provided by the two parallel database systems, we were impressed by how easy
Hadoop was to set up and use in comparison to the databases."

------
wmf
Previous HN discussions:

<http://news.ycombinator.com/item?id=562324>

<http://news.ycombinator.com/item?id=562732>

------
rgrieselhuber
Speed isn't necessarily the primary advantage that MapReduce provides; it's
more about shifting the burden of scaling backend processes from the
programmer to the infrastructure.

That said (caveat: I don't have a lot of real-world MapReduce experience), in
my line of work I do a lot of analysis at less than web scale (there's still
a lot of business value there too, folks!), and the thought of writing a
bunch of MapReduce jobs just to achieve the analytical capabilities available
in vanilla SQL is not a pleasant one.
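
(To make that concrete, here is roughly what a one-line
"SELECT field, COUNT(*) FROM table GROUP BY field" turns into as a Streaming
mapper/reducer pair -- a minimal Python sketch, with the column position
assumed.)

    #!/usr/bin/env python
    # groupby_count.py -- run with "map" or "reduce" as the argument.
    # The grouping column's position (first field) is an assumption.
    import sys
    from itertools import groupby

    FIELD_INDEX = 0  # assumed position of the GROUP BY column

    def mapper():
        for line in sys.stdin:
            cols = line.rstrip("\n").split("\t")
            if len(cols) > FIELD_INDEX:
                print("%s\t1" % cols[FIELD_INDEX])

    def reducer():
        # Hadoop sorts mapper output by key before the reduce phase,
        # so equal keys arrive adjacent and groupby can sum them.
        pairs = (line.rstrip("\n").split("\t", 1) for line in sys.stdin)
        for key, group in groupby(pairs, key=lambda kv: kv[0]):
            print("%s\t%d" % (key, sum(int(n) for _, n in group)))

    if __name__ == "__main__":
        mapper() if sys.argv[1:] == ["map"] else reducer()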

If people have experience in getting around this, I'd love to hear about it.

~~~
gaius
_it's more about shifting the burden of scaling backend process from the
programmer to the infrastructure_

Ermm, that's exactly what parallel query does. Most programmers never even see
the query plan. And you mean, from the programmer to the DBA ;-)

------
cnlwsu
The empirical evidence Google provides gives a strong argument that map reduce
can perform pretty well ' Results 1 - 10 of about 1,990,000,000 ... (0.13
seconds) ' I would say it really comes down to preference, the data set, and
how you plan to handle scaling/development.

~~~
almost
I doubt they're running each search as a mapreduce job. Your search is
probably making use of indexes and such generated by mapreduce jobs, though.

That said, I think you've got it spot on with your last sentence.
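
(That index-building step is the textbook MapReduce job. A minimal sketch,
with a made-up "doc_id<TAB>document text" input format.)

    #!/usr/bin/env python
    # inverted_index.py -- the kind of batch job that builds the indexes
    # the live search then queries. Run with "map" or "reduce" as the
    # argument; the "doc_id<TAB>document text" input format is invented.
    import sys
    from itertools import groupby

    def mapper():
        for line in sys.stdin:
            doc_id, _, text = line.rstrip("\n").partition("\t")
            for word in set(text.lower().split()):
                print("%s\t%s" % (word, doc_id))

    def reducer():
        # Input arrives grouped and sorted by word; collect each word's
        # documents into a posting list.
        pairs = (line.rstrip("\n").split("\t", 1) for line in sys.stdin)
        for word, group in groupby(pairs, key=lambda kv: kv[0]):
            postings = sorted({doc_id for _, doc_id in group})
            print("%s\t%s" % (word, ",".join(postings)))

    if __name__ == "__main__":
        mapper() if sys.argv[1:] == ["map"] else reducer()

The 0.13 seconds comes from serving precomputed posting lists like these, not
from running a job per query.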

