

Apache Hadoop: Best Practices and Anti-Patterns - yarapavan
http://developer.yahoo.net/blogs/hadoop/2010/08/apache_hadoop_best_practices_a.html

======
earl
Interesting collection. Unfortunately, in my experience, these aren't that
useful -- they pretty much come down to understanding how Hadoop works, making
sure mappers and reducers have a reasonable amount of work and jobs are
balanced, etc. Essentially things you would easily see from the JobTracker
page. Others stem from defects in Hadoop's design -- why should users have to
care about file compression? Why should multiple output files from reducers be
a problem? What idiot decided to make the bloody JobTracker keep all counters
and statistics IN RAM (I've heard of these things called databases, which are
useful for offline storage of structured data)? For the interested: we
actually bounce the JobTrackers on bigger lanes daily to improve reliability
and avoid long GC pauses from tons of accumulated in-RAM stats.

More interesting would be how to solve hard problems -- one that took me quite
a while was computing a distributed, balanced, exact rank, particularly when
the field being ranked is not uniformly distributed, so you can't simply
partition the key space into equal-width ranges.
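(Editorially, one standard way to attack this -- not necessarily what the commenter did -- is sampling-based range partitioning, the same idea Hadoop's TeraSort uses via `InputSampler` and `TotalOrderPartitioner`: sample the skewed field, pick quantile cut points so each reducer gets roughly equal work, then turn local sorted positions into exact global ranks by offsetting each partition by the sizes of the partitions before it. A minimal single-process sketch, with all function names hypothetical:)

```python
import random

def cut_points(sample, num_partitions):
    """Quantile boundaries estimated from a sample of the (skewed) key space."""
    s = sorted(sample)
    return [s[len(s) * i // num_partitions] for i in range(1, num_partitions)]

def partition_of(key, cuts):
    """Index of the range partition holding `key` (linear scan for clarity)."""
    for i, c in enumerate(cuts):
        if key < c:
            return i
    return len(cuts)

def distributed_exact_rank(values, num_partitions, sample_size=100):
    # 1. Sample the data to estimate the key distribution (InputSampler's role).
    sample = random.sample(values, min(sample_size, len(values)))
    cuts = cut_points(sample, num_partitions)

    # 2. "Map" phase: route each value to its range partition, so partition i
    #    holds only keys smaller than everything in partition i+1.
    parts = [[] for _ in range(num_partitions)]
    for v in values:
        parts[partition_of(v, cuts)].append(v)

    # 3. "Reduce" phase: sort locally, then add the total size of all earlier
    #    partitions to each local position to get an exact global rank.
    ranks = {}
    offset = 0
    for p in parts:
        for i, v in enumerate(sorted(p)):
            ranks[v] = offset + i
        offset += len(p)
    return ranks
```

Balance depends on the sample being representative; with heavy skew you sample more, or fall back to a second pass counting keys per candidate cut. Exactness does not depend on the sample at all -- any monotone set of cut points yields correct ranks, just possibly unbalanced partitions.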

