

A Guide to Python Frameworks for Hadoop - laserson
http://blog.cloudera.com/blog/2013/01/a-guide-to-python-frameworks-for-hadoop/

======
stevejohnson
Former[1] primary mrjob maintainer here, thanks for the shout-out! I'd like to
make a couple of notes and corrections. In particular, thank you for
recommending mrjob for EMR usage, as that's something we've made a point of
trying to be the best at.

First of all, _all of these frameworks use Hadoop Streaming._ As mentioned in
the mrjob 0.4-dev docs [2] (pardon the run-on sentence):

"Although Hadoop is primarily designed to work with JVM code, it supports other
languages via Hadoop Streaming, a special jar which calls an arbitrary program
as a subprocess, passing input via stdin and gathering results via stdout."

mrjob's role is to give you a structured way to write Hadoop Streaming jobs in
Python (or, recently, any language). When your task runs, it takes input and
produces output the same way your raw Python example does, except that mrjob
runs each line through a deserialization function before passing it to the
methods you've defined. It picks which code to run based on command-line
arguments such as --mapper, --reducer, etc. The output of your code is then
serialized again. The methods of [de]serialization are defined declaratively
in your class, as you showed.
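
Very roughly, the wrapper the framework puts around your mapper looks like
this (a sketch of the idea, not mrjob's actual internals, using JSON as the
[de]serialization format):

```python
import json

def run_mapper_step(lines, mapper):
    """Deserialize each incoming "key<TAB>value" line, hand it to the
    user-defined mapper method, and serialize whatever it yields. This
    mimics the plumbing described above; mrjob's real code lives in
    mrjob/protocol.py and its job runner."""
    for line in lines:
        raw_key, raw_value = line.rstrip("\n").split("\t", 1)
        key, value = json.loads(raw_key), json.loads(raw_value)
        for out_key, out_value in mapper(key, value):
            yield "%s\t%s" % (json.dumps(out_key), json.dumps(out_value))
```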

So why did you find mrjob to be slower than bare Hadoop Streaming? I don't
know! In theory, you're running approximately the same code between the mrjob
and the bare Python versions of your script. If anyone has time to dig into
this and find out where that time is being spent, I would be grateful. Results
should be sent to the issue tracker on the Github page [3].

Feel free to ask clarifying questions. I realize I may not be explaining this
effectively to people unfamiliar with the ins and outs of Python MapReduce
frameworks.

I'm thinking of organizing a second "mrjob hackathon" in the near future, so
please ping me if you're interested in contributing to an approachable OSS
project with lots of low-hanging fruit. (Particularly if you're a Rubyist,
because we have an experimental way to use Ruby with it.)

[1] mrjob is maintained by Yelp, where I worked until recently. It's still
under active development, though it's slowed somewhat since Dave and I moved
on.

[2] <http://mrjob.readthedocs.org/en/latest/guides/concepts.html#hadoop-streaming-and-mrjob>

[3] <http://www.github.com/yelp/mrjob/issues/new>

~~~
stevejohnson
One last thing I forgot to mention: there was a typedbytes support pull
request for mrjob at one point, but it had fallen far enough behind the dev
branch that it wasn't practical to merge. It's possible (probable, even) that
typedbytes support will make it into a future release if an interested party
puts in the time. I would have done it myself if I had been able to make head
or tail of the typedbytes documentation.

Also, if you're just starting out or want to look over the docs, I'd recommend
using the dev version hosted on readthedocs instead of the PyPI version, as
the author did: <http://mrjob.readthedocs.org/en/latest/index.html>

~~~
laserson
I did only a bit of profiling on mrjob, but it seemed that the ser/de was
significantly slowing down the overall computation.

~~~
stevejohnson
Here's the code that's doing the actual line parsing:
<https://github.com/Yelp/mrjob/blob/master/mrjob/protocol.py#L144>

It's just splitting on tab for input and re-joining on tab for output, with
some extra logic for lines without tabs.
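
In spirit (mirroring the logic linked above, not copying mrjob's code), the
read and write sides behave like:

```python
def raw_read(line):
    # Everything before the first tab is the key, the rest is the value;
    # a line with no tab at all becomes a value with a None key.
    fields = line.split("\t", 1)
    if len(fields) == 1:
        return None, fields[0]
    return fields[0], fields[1]

def raw_write(key, value):
    # Re-join on tab, dropping a missing (None) key.
    return "\t".join(f for f in (key, value) if f is not None)
```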

EDIT: The example in the post is using JSON for communication between
intermediate steps, while the Hadoop Streaming example is using a custom
delimiter format. So this isn't really a fair comparison; the mrjob example
could just as easily use the same efficient intermediate format.

~~~
laserson
Here is the profiling output: <https://gist.github.com/4478737>

The input is RawProtocol, which simply splits on tab. But after that, mrjob
defaults to using JSON internally, and this is causing a lot of slowdown.

~~~
stevejohnson
So mrjob isn't slower at all! You just chose to use JSON for your intermediate
steps in the mrjob example, instead of the delimiter format you used in the
raw Python example. I believe the mrjob docs do say what the defaults are.
RawProtocol can be used for intermediate steps just fine.

Please either mention this difference in your post or update the code and
conclusions. If there's a place in the documentation where we should mention
optimizations or details like this, I'd be interested to know.

I should have thought of this before. Oh well.

~~~
laserson
My goal was to use mrjob's features, not strip them down for performance. I
find mrjob's natural use of JSON very appealing in terms of user-experience.
It means that keys can be more complex types without the user having to
manually figure out the best way to encode them. I make it clear in the post
that this is an appealing property of mrjob, and that it is the reason for
the slowdown. As is, the code will not work with RawProtocol internally
because the key is a tuple of words.

~~~
stevejohnson
I can't find the part of the post where you explain that JSON parsing is the
reason for the slowdown. You just say mrjob itself is slower. While I agree
that mrjob's defaults encourage the use of JSON, I think it's unfair to blame
lack of optimization on the framework, given that the bare Python example
could just as easily have used JSON.

One real issue with mrjob is that it assumes you're only going to have one key
and one value. It isn't straightforward to use multiple key fields. The
workaround is to write a custom protocol (which, btw, is very simple [1]) that
uses the line up to the first tab as the key, and the rest of the line as the
value, probably splitting it on tab as well and passing it through as a tuple.
If we had made multipart keys simpler to use, maybe you would have chosen to
use a more efficient format.
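
For what it's worth, such a protocol really is only a few lines. Here's a
hypothetical version (the class name and exact behavior are my own
illustration, in the shape mrjob protocols take: an object with read() and
write() methods):

```python
class TabTupleProtocol(object):
    """Hypothetical multipart-key protocol: the text before the first tab
    is the key, and the remaining tab-separated fields become a tuple."""

    def read(self, line):
        key, _, rest = line.partition("\t")
        return key, tuple(rest.split("\t"))

    def write(self, key, value):
        return "\t".join((key,) + tuple(value))
```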

Anyway, the main part I take issue with is:

"mrjob seems highly active, easy-to-use, and mature...but it appears to
perform the slowest."

That's just not true. It would be fair to say that optimizing jobs with
multipart keys isn't straightforward and therefore encourages non-optimal
code, but that's moot if you're just using one key and one value, as most
people do.

I'm really not trying to dump on you here. I liked the post! I would just
prefer that it was more precise about these things.

[1] <http://mrjob.readthedocs.org/en/latest/guides/writing-mrjobs.html#writing-protocols>

EDIT: If anyone's thinking about downvoting this guy (someone did), don't.
This is a discussion in good faith.

------
throwaway54-762
Slightly tangential, but I'd like to shamelessly plug my HDFS client library
(with nice Python bindings)[0].

If you want to access files on HDFS from your Python tasks outside of the
typical map/shuffle inputs from the Streaming API, it might come in handy. It
doesn't go through the JVM (the library is in C), so it may save a little
latency for short Python tasks.

[0]: <https://github.com/cemeyer/hadoofus>

Also, I'm pretty new to publishing my own open source libraries. If people
would be so kind, I'd love some constructive criticism. Thanks HN!

~~~
stevejohnson
The main thing you need to do is publish your Python package to PyPI. This
should fill in the gaps in your Python packaging knowledge:
<http://guide.python-distribute.org/>

~~~
throwaway54-762
It's a C library, though — is that acceptable for PyPI? Thanks for the
feedback!

~~~
stevejohnson
Yep - if it's a Python library that can be 'python setup.py install'ed, then
it should be on PyPI. That's true even of modules requiring C extensions.

------
bcbrown
The only framework I've used of the ones discussed is Hadoop Streaming with
Python. For our use case (rapid prototyping of statistical analytics on
several-terabyte structured data), it worked perfectly and was almost
frictionless.

As the article calls out, we had to detect the boundaries between keys
manually, but that didn't add much complexity. We called into Python through
Hive scripts. Our group had little prior experience: some had used Python
before, and no one had used Hive or Hadoop much, if at all, but we were all
productive within a day of ramp-up. I'm positive we were more productive than
if we'd implemented everything in Java, even for the people more experienced
in Java than in Python.
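
That boundary detection is a small amount of code. A sketch of the usual
pattern for a Streaming reducer (illustrative, not the actual code from the
project described above):

```python
def reduce_counts(lines, emit):
    # Hadoop Streaming gives the reducer sorted "key<TAB>value" lines, one
    # pair per line, so you detect a key boundary by watching the key change.
    current_key, total = None, 0
    for line in lines:
        key, value = line.rstrip("\n").split("\t", 1)
        if key != current_key:
            if current_key is not None:
                emit(current_key, total)
            current_key, total = key, 0
        total += int(value)
    if current_key is not None:
        emit(current_key, total)  # flush the final key
```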

If I have any future projects requiring similar analysis, I'd like to use Hive
with Hadoop Streaming and Python again.

------
monstrado
A while back I did some tests comparing raw Hadoop Streaming to the Dumbo
library. I have to agree: if your MapReduce jobs aren't especially
complicated and don't require chaining, it's best to just write them against
the raw Streaming interface.

------
seany
You can do HBase work via Streaming as well, with this library:
<https://github.com/vanship82/hadoop-hbase-streaming>

