

MapReduce and RDBMS: Practice and Theory - grigoryy
http://grigory.us/blog/rdbms-mapreduce/

======
AtlasLion
I disagree with the last statement: "While there is room for apple-oranges
none of these seem to be successful so far. It seems to be common sense that
SQL-on-Hadoop just like an apple-on-orange is not a great idea in terms of
performance. Limited success in attempts such as Hive on Hadoop seem to prove
this so far. Using low-level programming languages such as C++ with RDBMSs is
also possible (see e.g. SQLAPI). However, described above advantages of RDBMSs
most likely vanish if you do so."

Actian Vortex has shown that you can have a fully ACID-compliant database
running on top of Hadoop, all while providing exceptional performance.
http://wwwcdn.actian.com/wp-content/uploads/2014/06/AAP-Hadoop-SQL-Edition-Benchmark.jpg

------
mystique
MapReduce and RDBMS are apples and oranges - both are good at what they do and
are effective within their own use cases. One lets you handle any type of data
and manage it however you like; the other lets you understand your data if you
can live within some defined structure. It is silly to suggest using MapReduce
to power a dashboard with sub-second response times. In the same way, it is
silly to suggest using MPP or RDBMS-like techniques for processing highly
unstructured or even semi-structured content.

Apache Spark is getting close to being able to do both, but as a developer
building a data stack, I would still not inspect terabytes of data every
single time if 80% of the questions can be answered by looking at the data
once and saving summarized results in relational format.
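The "scan once, summarize" pattern above can be sketched in miniature. This is a toy plain-Python illustration with invented data, not any particular system's API: one pass over the raw records produces a small relational-style summary table, and later questions hit the summary instead of re-reading every raw record.

```python
# Invented raw event records: (day, event_type, count).
raw_events = [
    ("2015-01-01", "click", 3),
    ("2015-01-01", "view", 7),
    ("2015-01-02", "click", 5),
    ("2015-01-02", "view", 2),
]

# One pass over the raw data, saving summarized results keyed like a
# (day, event_type) -> total relational table.
summary = {}
for day, kind, count in raw_events:
    summary[(day, kind)] = summary.get((day, kind), 0) + count

# Later questions are answered from the summary, not the raw events.
clicks_day1 = summary[("2015-01-01", "click")]
total_day2 = sum(v for (day, _), v in summary.items() if day == "2015-01-02")
print(clicks_day1)  # 3
print(total_day2)   # 7
```

In a real stack the summary would live in a warehouse table (e.g. written out via Spark SQL), but the economics are the same: the expensive full scan happens once, and the dashboard-style queries run against the small aggregate.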

I thought the Hadoop vs. RDBMS fight was settled maybe 4-5 years ago! Amusing
to see it being raised again now.

~~~
JakaJancar
We have stored many terabytes of unaggregated transaction records in Vertica
and analyzed large subsets fully ad hoc, on the fly, in less than 5 seconds.
We have also used BigQuery to directly power dashboards with few-second
response times, also analyzing billions of records at a time.

On the other hand, we both build and use summary tables with Spark (in a
relational format, to boot, and using Spark SQL).

I think you would benefit from re-evaluating the assumptions you made 4-5
years ago.

------
JakaJancar
Parquet + Spark SQL makes me wonder if MapReduce and columnar MPP are really
so far apart?

I know they're orders of magnitude slower than e.g. Vertica today, but I
wonder whether there is a fundamental reason for that, or whether it is just
the implementation.

~~~
jandrewrogers
Most of the performance differences are in how these systems handle
representation, I/O, and execution scheduling. In other words, implementation.
The challenge in closing the gap is that the architectures of these systems
are quite different, so you can't simply bolt on performance that is largely
derived from the fundamental architecture.
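The representation point can be illustrated with a toy sketch (plain Python, invented data, no real engine's API): a columnar layout stores each field as its own contiguous array, so an aggregate over one field scans only that array, while a row layout drags every field of every record through the scan. Real columnar engines build their I/O and vectorized execution advantages on exactly this property.

```python
# Row-oriented representation: each record carries all of its fields.
rows = [
    {"user": "a", "country": "US", "amount": 10},
    {"user": "b", "country": "DE", "amount": 20},
    {"user": "c", "country": "US", "amount": 30},
]
# Summing one field still touches every whole record.
row_total = sum(r["amount"] for r in rows)

# Column-oriented representation of the same table: one array per column.
columns = {
    "user": ["a", "b", "c"],
    "country": ["US", "DE", "US"],
    "amount": [10, 20, 30],
}
# The same aggregate scans a single contiguous array.
col_total = sum(columns["amount"])

print(row_total, col_total)  # 60 60
```

The answers are identical; what differs is how many bytes the scan has to move, which is an implementation property rather than anything fundamental to the query model.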

Columnar MPP needs to be carefully defined to answer the question. Some
important data models that fit these architectures operate on data types that
are not meaningfully orderable at a mathematical level, i.e. you can't sort
them. A lot of columnar implementations, and virtually all of the open-source
ones, assume sortability as a property of the represented data types.

tl;dr: Columnar MPP is sometimes not far outside what you can express with
Parquet/Spark/MapReduce/etc., just much faster, but there are data models
supported in some advanced columnar MPP systems that are not usefully
expressible with that stack. It depends on the platform and the use case.

~~~
rxin
Sorry, I don't think the last paragraph you wrote is true. I don't see any
special data types or models that can be modeled by an MPP architecture but
cannot be modeled in Spark. In short, I don't believe there is much difference
at the physical execution level between MPP and Spark.

I also don't understand what you mean by orderable. Spark does not require
records to be orderable. Maybe you can elaborate?

------
bra-ket
These are not directly comparable, but look at Apache HBase + Phoenix, which
is a distributed RDBMS on top of the underlying Hadoop filesystem.

