

Some Interesting Facts About Hadoop MapReduce [pdf] - snydeq
http://davidfrico.com/hadoop-considerations.pdf

======
VonCuattyaf9
Not sure what this is about. Most of his criticism could be made against any
form of computing cluster. Hadoop is optimized for processing raw
unstructured data fast. It's not an alternative to SQL and never was. It's
also optimized for data throughput, not, like MPI, for processing power.

MapReduce is an algorithm, or rather a paradigm, that has to fit your problem.
It is not the fault of Hadoop that it is overhyped and misused for other problems.
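To make the "paradigm that has to fit your problem" point concrete, here is a toy word count in plain Java, a sketch of the map → shuffle → reduce shape that Hadoop distributes across a cluster for you. This deliberately uses no Hadoop API; the class and method names are mine, purely for illustration:

```java
import java.util.*;
import java.util.stream.*;

// Toy illustration of the MapReduce paradigm (word count), independent of Hadoop:
// map emits (word, 1) pairs, the shuffle groups them by key, reduce sums the counts.
public class WordCount {

    // Map phase: each input line becomes a stream of (word, 1) pairs.
    static Stream<Map.Entry<String, Integer>> map(String line) {
        return Arrays.stream(line.toLowerCase().split("\\s+"))
                     .filter(w -> !w.isEmpty())
                     .map(w -> Map.entry(w, 1));
    }

    // Shuffle + reduce phase: group the pairs by key and sum the values per key.
    static Map<String, Integer> reduce(Stream<Map.Entry<String, Integer>> pairs) {
        return pairs.collect(Collectors.groupingBy(
                Map.Entry::getKey,
                Collectors.summingInt(Map.Entry::getValue)));
    }

    public static Map<String, Integer> run(List<String> lines) {
        return reduce(lines.stream().flatMap(WordCount::map));
    }

    public static void main(String[] args) {
        System.out.println(run(List.of("hadoop is a tool", "hadoop is not a database")));
    }
}
```

If your problem decomposes into independent per-record `map` steps and a per-key `reduce`, Hadoop fits; if it doesn't, no amount of tooling will make it fit.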

Hadoop is not a database or a datastore. If you want to extract data out of,
e.g., web pages, Hadoop is a great tool, as it abstracts away the problems of
distributing the algorithm and the data for you. Google (re)invented MapReduce
and is a search engine; Yahoo was also a search engine. Hadoop solved Yahoo's
problems.

I really don't understand the point he's trying to make. If you want to run
your business database on Hadoop, you are crazy. If you have 5 terabytes of
click data from your website and you want to cluster and analyze that data,
Hadoop (or better, HBase) can help you solve the problem.

It is a different paradigm. It is not like ordinary Java programming; it uses
Java to implement the MapReduce paradigm. If you don't want to think about how
to fit your problem into this paradigm, or if your problem is not solvable
with Hadoop, don't use it.

If you want to build a search engine Hadoop is a great fit. If you want to
replace your SQL database with Hadoop you are crazy.

Also, if the network is the bottleneck in your Hadoop cluster, you did something
wrong. Hadoop is designed for data locality. Usually the output of a Mapper
consists of less data than the input.
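A sketch of why map output usually shrinks: local aggregation (which Hadoop exposes as a combiner) collapses repeated keys on the mapper's node before anything crosses the network. The class and method names below are illustrative, not Hadoop's API:

```java
import java.util.*;

// Simulates one mapper's local aggregation over its input split: repeated
// keys are summed in place, so at most one pair per distinct key leaves the
// node. 1,000 click records for the same page shrink to a single pair.
public class LocalAggregation {

    static Map<String, Integer> combine(List<String> clickRecords) {
        Map<String, Integer> localCounts = new HashMap<>();
        for (String page : clickRecords) {
            localCounts.merge(page, 1, Integer::sum);  // aggregate locally
        }
        return localCounts;  // this, not the raw records, crosses the network
    }

    public static void main(String[] args) {
        List<String> split = new ArrayList<>();
        for (int i = 0; i < 1000; i++) split.add("/home");
        for (int i = 0; i < 500; i++) split.add("/checkout");
        Map<String, Integer> out = combine(split);
        // 1,500 input records become 2 output pairs
        System.out.println(out.size() + " pairs: " + out);
    }
}
```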

* I've used Hadoop as a student for several academic projects, mostly information retrieval, so I can't comment on "real world business" usage. I just have the impression that Hadoop is pretty overhyped and not well understood.

------
ripperdoc
And if you click back to his website, you'll find that it seems to have been
solely built using WordArt, which is quite an impressive feat!

------
stevedomin
This is quite an interesting document. I'd love to hear a Hadoop
"power user's" point of view on this.

~~~
blibble
I'd agree with most of his points, but he has conflated mapreduce, yarn and
hdfs into "hadoop".

the "assembly language" stuff is only accurate if you're writing raw mapreduce
jobs, but he's ignored projects like hbase, hive, pig, impala and cascading.

the HA issues with the namenode have mostly gone away with CDH4.1.

the one thing that's a complete pain is the lack of snapshotting functionality
in hdfs, making consistent backups nearly impossible.

the APIs for hdfs and yarn are poorly designed and pretty buggy; they're very
obtuse and hard to debug. if you manage to find any documentation, it's almost
certainly wrong for the version you're using.

nine times out of ten it's quicker to step into the source code rather than
digging into the documentation.

however once you've abstracted away the poor APIs it does work very well, and
the pay is good...

