Democratizing Big Data - Is Hadoop Our Only Hope? (gigaom.com)
34 points by ahoff on June 11, 2011 | 15 comments

I think the real question is "Is MapReduce Our Only Hope?"

MapReduce has a lot of limitations. It doesn't have a query language; instead, you need to work out the sequence of map and reduce steps and implement them yourself in your favourite low-level language.
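To make "implement the map and reduce steps yourself" concrete, here is a minimal single-process sketch of a word count in Python. It is not Hadoop code; the function names (`map_step`, `shuffle`, `reduce_step`, `word_count`) are made up for illustration, and `shuffle` stands in for the grouping a real framework does between stages.

```python
from collections import defaultdict

def map_step(line):
    # Emit a (word, 1) pair for every whitespace-separated word in the line.
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    # Group values by key; a framework normally does this between stages.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_step(key, values):
    # Collapse all values for one key into a single count.
    return key, sum(values)

def word_count(lines):
    pairs = [pair for line in lines for pair in map_step(line)]
    return dict(reduce_step(k, v) for k, v in shuffle(pairs).items())
```

Anything a query language would express in one line (a GROUP BY with a count, say) has to be spelled out as stages like these.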

And it can't do efficient joins. That means you need to visit each and every row for each and every map-reduce stage. There's no B-tree or other "lookup structure".
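As a rough illustration of the join point, here is a sketch of a reduce-side join in plain Python. The record shapes (`id`, `name`, `user_id`, `item`) and the function name are assumptions made up for the example. Note that every row of both inputs is tagged and shuffled, even rows that end up matching nothing, because there is no index to consult.

```python
from collections import defaultdict

def reduce_side_join(users, orders):
    # Map phase: tag every row of both inputs with its source and join key.
    tagged = [(u["id"], ("user", u)) for u in users]
    tagged += [(o["user_id"], ("order", o)) for o in orders]

    # Shuffle phase: group all tagged rows by join key.
    groups = defaultdict(list)
    for key, value in tagged:
        groups[key].append(value)

    # Reduce phase: pair up users and orders that share a key.
    joined = []
    for values in groups.values():
        user_rows = [row for tag, row in values if tag == "user"]
        order_rows = [row for tag, row in values if tag == "order"]
        for u in user_rows:
            for o in order_rows:
                joined.append((u["name"], o["item"]))
    return joined
```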

And it's a batch-based framework, which means that if you add 1% more data, you have to re-analyze the entire data set rather than update previous results with the new 1%.

Disclaimer: I work at Endeca, which is about to launch Latitude, an Enterprise platform for big data analysis. But (a) I work in Engineering, not sales or marketing, so I spend my time thinking about the advantages and disadvantages of various technologies, rather than how to sell them, and (b) I'm an actual human being who has independent thoughts.

I don't understand the claim that adding new data forces you to recompute everything. What requires recomputation will depend on the algorithm, but for most simple cases I can think of, at least the map part will not require recomputation if you record its result. I believe that's how CouchDB views work, for example.

Hadoop is a fairly low-level framework, so it requires the programmer to write the logic to calculate the map incrementally (i.e., input_data_may -> map_results_may, repeat for June).

Similarly, it's up to the programmer to write a reducer which allows incremental additions of data. And even then, you still need to make a pass over all the data.
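A sketch of what that hand-written incremental logic might look like, for the easy case where the reduction is a plain sum of counts. This is single-process Python, not Hadoop; `map_cache`, `map_batch`, `reduce_incremental` and the `user` field are hypothetical names chosen for the example.

```python
from collections import Counter

# Hypothetical cache: batch name -> mapped result for that batch.
map_cache = {}

def map_batch(name, records):
    # Map a batch only the first time it is seen; later calls reuse the result.
    if name not in map_cache:
        map_cache[name] = Counter(r["user"] for r in records)
    return map_cache[name]

def reduce_incremental(previous_total, new_mapped):
    # Folding in a new batch works here because adding counts is
    # associative and commutative.
    return previous_total + new_mapped
```

With this shape, adding June means mapping only June's records and merging them into May's running total. Whether that is possible for a given job depends entirely on the reduce function.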

That's a limitation of Hadoop, though, not the MR idea by itself.

The limitation on the map end is only a limitation for Hadoop.

But the limit on the reducer is fundamental. Some reduce functions are not associative, and some don't even have type [T x U] -> T x U. In those cases, there is nothing to be done but redo the reduce.
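A small sketch of the difference, using made-up numbers. A mean can be carried as a (sum, count) state that merges associatively, so new data can be folded in. A median has no such small state: combining the medians of the parts does not give the median of the whole, so the reduce has to be redone.

```python
from statistics import median

def merge_mean_state(a, b):
    # State is (sum, count); merging two states is associative.
    return (a[0] + b[0], a[1] + b[1])

def mean_from_state(state):
    total, count = state
    return total / count

old = [1, 2, 3, 10]   # previously processed data (illustrative values)
new = [4]             # newly arrived data

old_state = (sum(old), len(old))
new_state = (sum(new), len(new))
incremental_mean = mean_from_state(merge_mean_state(old_state, new_state))

# Naively combining partial medians gives a different answer than
# recomputing the median over all the data.
median_of_parts = median([median(old), median(new)])
true_median = median(old + new)
```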

Indeed, reduce is the difficult part. OTOH, I think this limitation is seen in many algorithms at a fairly fundamental level, and is not just an artefact of MR. The only alternative framework I can think of for dealing with really large datasets in a distributed manner is sampling-based methods, with one-pass (or mostly one-pass) algorithms.
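One well-known example of a one-pass sampling method is reservoir sampling, which keeps a uniform random sample of k items from a stream of unknown length. The sketch below is a single-machine version in Python; the function name and the optional `rng` parameter are choices made for this example, and it says nothing about how to distribute the work.

```python
import random

def reservoir_sample(stream, k, rng=None):
    # Keep a uniform random sample of up to k items in one pass.
    rng = rng or random.Random()
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            sample.append(item)
        else:
            # Replace a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # randint is inclusive on both ends
            if j < k:
                sample[j] = item
    return sample
```

Each item is looked at exactly once and memory stays at k items, which is what makes this kind of method attractive when the data is too large to revisit.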

I am not a big fan of Hadoop. It is a headache to configure and optimized for installs with node counts only a few companies could make use of. I really wish there were more options as I believe Hadoop is overkill for most of the people using it.

For quick-and-dirty MapReduce on a smaller node count, I've started to really like Disco (discoproject.org). You just pull down the backend with your package manager, push your files into DDFS, write a Python script, and run it.

I just finished building a proof of concept with Disco, doing some simple analytics calculations on one machine. I did run into a couple of bugs (http://groups.google.com/group/disco-dev/browse_thread/threa...) and the documentation is awful, but overall I'm pretty impressed. Now if I could just get discodex to run...

Interesting, I haven't looked at Disco for quite a while. How does Disco compare to Hadoop Streaming these days? (I'm highly biased, so I reach for BigCouch most of the time now.)

The latest release supports workers written in any language. Disco comes with worker libraries for Python and OCaml.


A good list of Hadoop alternatives: http://www.quora.com/What-are-some-promising-open-source-alt...

My personal favorite is BashReduce (~120 lines of shell script vs ~600k lines of Java code in Hadoop): http://blog.last.fm/2009/04/06/mapreduce-bash-script

If you're in bioinformatics you might be interested in this talk on handling ridiculous amounts of data (PyCon 2011): http://blip.tv/pycon-us-videos-2009-2010-2011/pycon-2011-han...

Also see GraphLab, a New Parallel Framework for Machine Learning: http://www.graphlab.ml.cmu.edu/

Hadoop is hardly our only hope -- off the top of my head, there is Yahoo S4 for expressing streaming topologies of large-scale data processing. There is Google's Sawzall language for efficiently 'sawing' through large amounts of data and aggregating stats about it. Databases like MongoDB are slowly enabling the FLOSS community to process larger and larger datasets, which previously was very difficult for anyone other than an engineer with a Google datacenter, GFS, and BigTable at their disposal. And that's just scratching the surface of the great free and open-source projects available. AWS and Google App Engine are making it affordable for the common man to run computations that we could only dream about just a decade ago. I, for one, am very excited about this and think we're doing just fine.

Indeed, there are plenty of alternatives; that was the subtle theme of my post, but unfortunately I didn't have the space to go into any more detail... Follow-up piece!

I'm working on YAMRF (yet another MR framework) in C#. I wanted something like Hadoop, but much simpler to set up and to hack on.

There's no question that Hadoop is the elephant in the room, so to speak. It is very robust and performant, and there's a great ecosystem and community. It is quite complex as a result and getting it set up and tuned can take a lot of time and effort.

I've got the distributed file system working and am working on the processing part now. The underlying framework is more general purpose than MR, working at the level of data or record streams which can be run through LINQ, for example. Dryad has this but it's a much more complicated beast.

Even though more general purpose computation is possible with such a framework, it turns out that to achieve scale, your problem needs to be parallelizable and MR is a good way to do that. I think that's why we aren't seeing much in the way of alternatives, yet--it's a question of the "enemy of good enough."
