

Democratizing Big Data - Is Hadoop Our Only Hope? - ahoff
http://gigaom.com/cloud/democratizing-big-data-is-hadoop-our-only-hope/

======
martincmartin
I think the real question is "Is MapReduce Our Only Hope?"

MapReduce has a lot of limitations. It doesn't have a query language; instead,
you need to figure out the sequence of map and reduce steps and implement
them in your favourite low-level language yourself.

And it can't do efficient joins. That means you need to visit each and every
row for each and every map-reduce stage. There's no b-tree or other "lookup
structure".

And it's a batch based framework, which means if you add 1% more data, you
have to re-analyze the entire data set, rather than update previous results
with the new 1%.
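
To make the first limitation concrete, here's a rough sketch (plain Python, not any particular framework) of what even a trivial word count looks like when you have to hand-write the map, shuffle, and reduce steps yourself, instead of writing something like SELECT word, COUNT(*) ... GROUP BY word:

```python
# Sketch: word count as explicit map and reduce steps.
# Note there is no index anywhere -- the "shuffle" touches every emitted pair.
from itertools import groupby
from operator import itemgetter

def map_step(record):
    # emit (key, value) pairs for one input record
    for word in record.split():
        yield (word, 1)

def reduce_step(key, values):
    return (key, sum(values))

def run_job(records):
    # "shuffle": every mapped pair is materialized and sorted by key
    pairs = sorted(p for r in records for p in map_step(r))
    return dict(reduce_step(k, [v for _, v in g])
                for k, g in groupby(pairs, key=itemgetter(0)))

counts = run_job(["big data is big", "data is data"])
# counts == {'big': 2, 'data': 3, 'is': 2}
```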

Disclaimer: I work at Endeca, which is about to launch Latitude, an Enterprise
platform for big data analysis. But (a) I work in Engineering, not sales or
marketing, so I spend my time thinking about the advantages and disadvantages
of various technologies, rather than how to sell them, and (b) I'm an actual
human being who has independent thoughts.

~~~
cdavid
I don't understand the claim that adding new data forces you to recompute
everything. What requires re-computation will depend on the algorithm, but for
most simple cases I can think of, at least the map part won't require
re-computation if you record its result. I believe that's how CouchDB views
work, for example.

~~~
yummyfajitas
Hadoop is a fairly low level framework. So it requires the programmer to write
logic to incrementally calculate the map (i.e., input_data_may ->
map_results_may, repeat for june).

Similarly, it's up to the programmer to write a reducer which allows
incremental additions of data. And even then, you still need to make a pass
over all the data.
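
A toy sketch of that split (hypothetical names, no framework): the map output is cached per batch so only new batches get mapped, but the reduce still re-reads every cached pair:

```python
# Sketch: incremental map, non-incremental reduce.
map_cache = {}  # batch name -> mapped (word, 1) pairs

def map_batch(name, records):
    if name not in map_cache:  # incremental: already-mapped batches are skipped
        map_cache[name] = [(w, 1) for r in records for w in r.split()]
    return map_cache[name]

def reduce_all():
    # no incrementality here: every pair from every batch is re-read
    totals = {}
    for pairs in map_cache.values():
        for word, n in pairs:
            totals[word] = totals.get(word, 0) + n
    return totals

map_batch("may", ["big data"])
map_batch("june", ["more data"])   # only June's records get mapped...
totals = reduce_all()              # ...but the reduce scans May's pairs again
```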

~~~
cdavid
That's a limitation of hadoop, though, not the MR idea by itself.

~~~
yummyfajitas
The limitation on the _map_ end is only a limitation for hadoop.

But the limit on the reducer is fundamental. Some reduce functions are not
associative, and some don't even have type [T x U] -> T x U, so their output
can't be fed back in as input. In those cases, there is nothing to be done but
redo the reduce.
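
A small illustration of the associativity point: sum lets you combine an old partial result with a partial over just the new data, while median gives you no way to combine partial results, so the reduce has to rerun over everything:

```python
import statistics

old, new = [1, 2, 100], [3, 4]

# associative reduce: the old result can be updated from the new data alone
total = sum(old) + sum(new)        # same as sum(old + new)

# non-associative reduce: median(old) == 2 and median(new) == 3.5 do not
# determine median(old + new); the only option is a full pass over all the data
med = statistics.median(old + new)
```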

~~~
cdavid
Indeed, reduce is the difficult part. OTOH, I think this limitation is seen in
many algorithms at a fairly fundamental level, and not just an artefact of MR.
The only alternative framework I can think of for dealing with really large
datasets in a distributed manner is sampling-based methods, with one-pass
algorithms (or mostly one pass algorithm).
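
As a sketch of the sampling-based, one-pass idea: reservoir sampling keeps a fixed-size uniform sample of an arbitrarily large stream in a single pass, using constant memory (a seeded RNG here just for reproducibility):

```python
# Sketch: reservoir sampling, a one-pass sampling-based method.
import random

def reservoir_sample(stream, k, rng=None):
    rng = rng or random.Random(42)
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)
        else:
            j = rng.randint(0, i)  # item i survives with probability k/(i+1)
            if j < k:
                sample[j] = item
    return sample

sample = reservoir_sample(range(10**5), 10)  # one pass, 10 items kept
```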

------
gtuhl
I am not a big fan of Hadoop. It is a headache to configure, and it's
optimized for installs with node counts only a few companies could make use
of. I really
wish there were more options as I believe Hadoop is overkill for most of the
people using it.

For quick and dirty map reduce on a smaller node count I've started to really
like Disco (discoproject.org). You just pull down the backend with your
package manager, push your files into ddfs, write a python script, and run it.

~~~
mlmilleratmit
Interesting, I haven't looked at disco in quite a while. How does disco
compare to hadoop streaming these days? (I'm highly biased, so I reach for
bigcouch most of the time now)

~~~
jamii
The latest release supports workers written in any language. Disco comes with
worker libraries for python and ocaml.

<http://discoproject.org/doc/howto/worker.html>

------
helwr
A good list of hadoop alternatives: [http://www.quora.com/What-are-some-
promising-open-source-alt...](http://www.quora.com/What-are-some-promising-
open-source-alternatives-to-Hadoop-MapReduce-for-map-reduce)

my personal favorite is BashReduce (~120 lines shell script vs ~600k lines of
java code in hadoop): <http://blog.last.fm/2009/04/06/mapreduce-bash-script>

If you're in bioinformatics you might be interested in this talk on handling
ridiculous amounts of data (PyCon 2011): [http://blip.tv/pycon-us-
videos-2009-2010-2011/pycon-2011-han...](http://blip.tv/pycon-us-
videos-2009-2010-2011/pycon-2011-handling-ridiculous-amounts-of-data-with-
probabilistic-data-structures-4899047)

~~~
helwr
Also see GraphLab, a New Parallel Framework for Machine Learning:
<http://www.graphlab.ml.cmu.edu/>

------
rryan
Hadoop is hardly our only hope -- off the top of my head there is Yahoo S4 for
expressing streaming topologies of large-scale data processing. There is
Google's Sawzall language for efficiently 'sawing' through and aggregating
stats over large amounts of data. Databases like MongoDB are slowly
enabling the FLOSS community to process larger and larger datasets, which
was previously very difficult for anyone other than an engineer with a Google
datacenter, GFS, and BigTable at their disposal. And that's just scratching the
surface of great, Free and open-source projects available. AWS and Google App
Engine are making it affordable for the common man to run computations that we
could only dream about just a decade ago. I, for one, am very excited about
this and think we're doing just fine.

~~~
mlmilleratmit
Indeed, there are plenty of alternatives; that was the subtle theme of my
post, but unfortunately I didn't have the space to go into any more detail...
Follow-up piece!

------
Todd
I'm working on YAMRF (yet another MR framework) in C#. I wanted something like
Hadoop but something that was much simpler to setup and to hack on.

There's no question that Hadoop is the elephant in the room, so to speak. It
is very robust and performant, and there's a great ecosystem and community. It
is quite complex as a result, and getting it set up and tuned can take a lot
of time and effort.

I've got the distributed file system working and am working on the processing
part now. The underlying framework is more general purpose than MR, working at
the level of data or record streams which can be run through LINQ, for
example. Dryad has this but it's a much more complicated beast.

Even though more general purpose computation is possible with such a
framework, it turns out that to achieve scale, your problem needs to be
parallelizable and MR is a good way to do that. I think that's why we aren't
seeing much in the way of alternatives, yet--it's a question of the "enemy of
good enough."

