
Hadoop Is About Scalability, Not Performance - jacquesm
http://www.manamplified.org/archives/2008/11/hadoop-is-about-scalability.html
======
physcab
Hadoop is amazing when you need to run huge batch-processing jobs that
would otherwise be too taxing to run on a relational database.

I've been working a lot with Hive lately--Facebook's creation that lets you
basically type SQL and the Map/Reduce jobs are automatically generated. Hive
vastly speeds up development, as you don't need to write custom Map/Reduce
jobs to do the tasks you had in mind when working with straight Hadoop.
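
To make the contrast concrete, here is a minimal sketch in plain Python (not the Hadoop or Hive APIs) of the map, shuffle, and reduce steps that a query along the lines of `SELECT page, COUNT(*) FROM logs GROUP BY page` would stand in for. The log format, the `page` column, and every function name below are assumptions made up for illustration.

```python
from collections import defaultdict

# Hypothetical log format: "<client-ip> <page>" per line (an assumption).

def map_phase(lines):
    """Map step: emit a (page, 1) pair for every well-formed log line."""
    for line in lines:
        fields = line.split()
        if len(fields) >= 2:
            yield fields[1], 1

def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce step: sum the values collected for each key."""
    return {key: sum(values) for key, values in groups.items()}

def count_hits(lines):
    """Run the three steps in order on an in-memory list of lines."""
    return reduce_phase(shuffle(map_phase(lines)))

sample_logs = [
    "10.0.0.1 /index.html",
    "10.0.0.2 /about.html",
    "10.0.0.1 /index.html",
]
print(count_hits(sample_logs))  # {'/index.html': 2, '/about.html': 1}
```

With straight Hadoop you write the map and reduce steps yourself; with Hive the single SQL statement is all you type.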

If you're looking to learn more about Hadoop and Hive, Cloudera puts together
an awesome starter kit available here: <http://www.cloudera.com/distribution>

~~~
neilc
There's no reason you can't use a relational database for "huge batch-
processing jobs." You might not be able to use MySQL for such a task, but
there is a large collection of DBs designed for large-data
analytics: Greenplum, Aster Data, Teradata, Vertica, etc. Of course, most of
these have a non-zero price tag, but it is certainly possible to use a DB for
that sort of workload.

~~~
physcab
Right. I didn't say it wasn't possible to do these with relational databases.
Typically you use Hadoop when you want to take the processing "offline",
as in data mining of log data. For our company, our log data is
hundreds of millions of rows, and if I were to do any significant processing
in the database, it would block anyone else from running even less-intensive
tasks. By exporting this data first into HDFS, I can now run Map/Reduce jobs
more quickly and not hang up DB connections (we're using MySQL).

Another thing to note is that Hadoop becomes incrementally more useful as the
size of the files increases--which, to my limited DB knowledge, is not the case
for relational DBs.

~~~
neilc
_By exporting this data first into HDFS, I can now run Map/Reduce jobs more
quickly and not hang up DB connections_

Right; the typical configuration is to have one database that does transaction
processing, and another separate database (a "data warehouse") that collects
data from multiple operational DBs and runs analytic queries over it. Hadoop
is basically just an alternative analytic query processor in this
configuration.

_Hadoop becomes incrementally more useful as the size of the files increases_

I think the same would apply to parallel DBs: you are basically talking about
just partitioning the data over the storage nodes, which is a common feature
in both systems.
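
A toy sketch of what "partitioning the data over the storage nodes" means, in plain Python. This is not how any parallel DB or Hadoop actually places data; the node count, the CRC32-based placement, and all names are assumptions for illustration only.

```python
import zlib
from collections import Counter

def node_for(key, num_nodes):
    """Pick a node for a key using a stable hash (CRC32, an arbitrary choice)."""
    return zlib.crc32(key.encode("utf-8")) % num_nodes

def partition(rows, num_nodes):
    """Spread (key, value) rows across num_nodes in-memory 'nodes'."""
    nodes = [[] for _ in range(num_nodes)]
    for key, value in rows:
        nodes[node_for(key, num_nodes)].append((key, value))
    return nodes

def parallel_count(nodes):
    """Count rows per key on each node separately, then merge the partial counts."""
    total = Counter()
    for local_rows in nodes:
        local = Counter(key for key, _ in local_rows)  # work done per node
        total.update(local)                            # cheap merge step
    return total

rows = [("alice", 1), ("bob", 2), ("alice", 3), ("carol", 4), ("bob", 5)]
nodes = partition(rows, num_nodes=3)
print(parallel_count(nodes))
```

Each node only touches its own slice, so adding nodes shrinks the per-node work; that holds whether the engine on top is a parallel DB or a MapReduce job.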

------
neilc
Yes, MapReduce is very scalable -- that is well understood. There is more to
building a system than pure scalability, though: for example, energy
efficiency. If Hadoop were more efficient, it would be significantly
greener. As energy grows to become a greater portion of the total operating cost
of a typical datacenter, this will become increasingly important.

~~~
artsrc
Our SQL database experience is two servers (Active plus BCP) which are close
to 0% utilized 99% of the time. Then once a month another application on one
of the servers runs a batch and our users' response time increases
significantly. And we don't use MySQL.

The reality of SQL implementations with the leading vendors is infrastructure
overhead. Even scaling with many servers inside one application is not really
recommended:

[http://wedonotuse.blogspot.com/2006/11/brief-intro-to-market...](http://wedonotuse.blogspot.com/2006/11/brief-intro-to-marketing-or-you-need.html)

Going to a truly shared infrastructure takes you so far out of your comfort
zone that you might as well dump SQL databases, since people only pick them
because of familiarity.

------
bayareaguy
I'm still astonished by this post[1] showing Hadoop MapReduce using 40x as
much hardware as Greenplum MapReduce.

1- See [http://databeta.wordpress.com/2009/05/14/bigdata-node-densit...](http://databeta.wordpress.com/2009/05/14/bigdata-node-density/)

------
jbellis
It's good to design for scalability first since it's harder, but if you don't
do both eventually, someone else will.

