
Improving MapReduce Performance - tlipcon
http://www.cloudera.com/blog/2009/12/17/7-tips-for-improving-mapreduce-performance/
======
strlen
Great article, Todd.

One key thing to highlight is the importance of compression and using a
streaming compression algorithm. Compression means there's less data to
transfer (across the network and -- even more importantly -- from disk), which
means the transfers will complete faster.

Not only does a streaming codec allow your compressed files to be splittable
(not possible with a conventional compression algorithm, where the whole file
shares a single Huffman tree and must be decompressed from the start), it also
runs very quickly and adapts easily to a _stream_ (rather than a monolithic
chunk) of data.
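To make this concrete, here's a minimal sketch of turning on map-output
compression with the 0.20-era JobConf API (the class name is illustrative;
DefaultCodec ships with Hadoop, and hadoop-lzo's codec can be swapped in if
it's installed):

    import org.apache.hadoop.io.compress.DefaultCodec;
    import org.apache.hadoop.mapred.JobConf;

    public class CompressedJobSetup {
        public static JobConf configure(Class<?> jobClass) {
            JobConf conf = new JobConf(jobClass);
            // Compress intermediate map output before it hits disk
            // and the shuffle.
            conf.setCompressMapOutput(true);
            conf.setMapOutputCompressorClass(DefaultCodec.class);
            // With hadoop-lzo on the classpath, point this at its LZO
            // codec class instead for a faster streaming codec.
            return conf;
        }
    }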

We've just added support for LZF (a similar Lempel-Ziv-based streaming
compression codec) into Voldemort, and the performance results have been great:

<http://groups.google.com/group/project-voldemort/browse_thread/thread/cb366257d3714da3>

Here's some background: <http://en.wikipedia.org/wiki/Arithmetic_coding>
<http://en.wikipedia.org/wiki/Lempel_Ziv>

(I had the good fortune to take an information theory class during undergrad)

------
jganetsk
Liked the post!

Interesting point about allocating too many Writables. This problem is an
indication that the *Writable classes are poorly designed. Instead of exposing
public constructors, each should provide some sort of static factory method
that implements intelligent pooling and reuse.

Also, NullWritable is awesome! I don't think you mentioned it. Very useful for
counters!
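For context, the allocation problem and the reuse idiom (plus NullWritable)
look roughly like this — a minimal sketch against the 0.20-era mapred API,
with illustrative class and field names:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class ReuseMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, NullWritable> {

        // Allocated once per task, not once per record.
        private final Text word = new Text();

        public void map(LongWritable key, Text value,
                        OutputCollector<Text, NullWritable> out,
                        Reporter reporter) throws IOException {
            // BAD: "new Text(...)" on every record churns the GC.
            // Instead, reset the task-scoped instance:
            word.set(value.toString().trim());
            // NullWritable is a singleton, so the value costs nothing.
            out.collect(word, NullWritable.get());
        }
    }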

~~~
tlipcon
The issue with intelligent pooling is that (a) you might end up with a lot of
bookkeeping, or (b) people will forget to return things to the pool. That kind
of design is much easier in C++, where you can use scoping and copy
constructors to refcount automatically and return objects to the pool when
they go out of scope.

NullWritable is pretty useful sometimes, but what's wrong with Counter objects
for counters?
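(For reference, the counter idiom in the 0.20-era API — the enum and its
names here are just illustrative:)

    // Counters with the old mapred API: define an enum once...
    public enum RecordCounters { MALFORMED_RECORDS }

    // ...then bump it from inside map()/reduce() via the Reporter:
    //     reporter.incrCounter(RecordCounters.MALFORMED_RECORDS, 1);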

~~~
jganetsk
Here's an example of how improving the design of your interfaces can mitigate
the problem...

The mapper function now receives two extra arguments, one Writable of key type
and one Writable of value type.

The emit method now takes zero arguments. When called, it emits the key-value
pair represented by the Writables passed in to the mapper function.

Now, you are forced to use and reuse the Writable objects passed in to your
mapper. Sure, you can allocate new ones, but they would be worthless since you
can't do anything with them. This would, hopefully, stop programmers from
allocating them.

No need to explicitly return things to any pools.
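A rough sketch of what that interface could look like (purely hypothetical;
none of these types exist in Hadoop):

    import org.apache.hadoop.io.Writable;

    // Hypothetical API: the framework owns the output Writables and
    // hands them to the mapper; emit() takes no arguments.
    public interface Emitter {
        void emit();  // emits whatever is currently set on outKey/outValue
    }

    public interface ReusingMapper<KIN, VIN,
                                   KOUT extends Writable,
                                   VOUT extends Writable> {
        void map(KIN inKey, VIN inValue,
                 KOUT outKey, VOUT outValue, Emitter emitter);
    }

    // A user mapper can only fill in the objects it was given:
    //     outKey.set(...); outValue.set(...); emitter.emit();
    // so a freshly allocated Writable has nowhere to go.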

------
houseabsolute
Has any non-Googler here actually used Hadoop or any of the other public
MapReduce solutions?

~~~
strlen
We use it _extensively_ at LinkedIn. Many of the data-driven features you see
on the site are powered by it. It has greatly improved not only the speed at
which the data can be built but, most importantly, what can be done at all.

I've also introduced it to a start-up I worked at 1.5-2 years ago. I
essentially did many of the optimizations Todd described "in the dark"; this
was before Cloudera was formed, when #hadoop on FreeNode only had a handful
of Yahoo, Facebook, and Rapleaf people.

Despite the amount of work involved (including introducing a Java-based
project to a LAMP-based start-up that was very ambivalent towards Java), it
was a great productivity boost (over the mix of ad-hoc shell/Perl/PHP scripts
and the MySQL data warehouse we started with).

MapReduce isn't just about scalability and performance. You don't need to have
a "scalability crisis" to benefit from it. It's also about being able to do a
great deal more by applying parallelizable algorithms (see, for example, the
Mahout or Katta projects).

------
brg
On the topic of MapReduce, does anyone have pointers to articles detailing
different implementations of the shuffle phase?

