
Mrjob: A Python 2.5+ package that helps you write and run Hadoop Streaming jobs - ColinWright
http://pythonhosted.org/mrjob/
======
stevejohnson
Oh hey, I work on this. It's come up on HN before, sometimes with
misinformation attached. I'm happy to answer questions.

~~~
andrewguenther
You may want to actually link to the project GitHub on your "Contribute" page.

~~~
stevejohnson
Good point.
[https://github.com/Yelp/mrjob/pull/677](https://github.com/Yelp/mrjob/pull/677)

(merged.)

------
brendoncrawford
MrJob is great. If you do not want to run full-blown Hadoop and are not using
Amazon, Gearman is also a great alternative.

------
kevnin
Does anyone know how this framework stacks up against Dumbo (Last.fm) and/or
Pydoop? I've used Dumbo before and had great luck.

With Dumbo, the only problem is the lack of consolidated documentation. Much
of the knowledge is buried in the maintainer's blog.

~~~
mattj
(original author of mrjob here)

Steve's post is 100% correct. I originally wrote mrjob as an internal tool at
Yelp out of my frustration with using Dumbo for multi-step jobs. Specifically,
I found myself writing the same incantation of "wrap a mapper / reducer
function with an encoding scheme" over and over again. I tried to add protocol
support to Dumbo (so you could specify that your job reads JSON, uses pickle
for intermediate data, and writes Thrift), but I had a hard time working with
the Dumbo codebase (disclaimer: I haven't looked at it since, so it might be
easy to do this now). I also wanted to represent mappers and reducers as
Python generators, which makes writing memory-efficient steps natural (e.g., you
normally want to rely on the shuffle / sort to do the hard work of
aggregating by key). Finally, I wanted my jobs to be easy to test both from
unittest and from the command line. Debugging Hadoop Streaming jobs is way
more of a pain in the ass than it should be.
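
To make the two ideas above concrete, here is a minimal sketch using only the standard library. It is not mrjob's actual API: `encode`, `decode`, and `run_local` are hypothetical names, the "protocol" is just tab-separated JSON, and an in-memory sort stands in for Hadoop's shuffle / sort. The point is that the mapper and reducer stay plain generators while the encoding lives in one place.

```python
import json
from itertools import groupby
from operator import itemgetter


def mapper(_, line):
    # A generator: yields one (word, 1) pair at a time instead of building a list.
    for word in line.split():
        yield word.lower(), 1


def reducer(word, counts):
    # `counts` is an iterator, so summing it never holds all values in memory.
    yield word, sum(counts)


def encode(key, value):
    # Hypothetical protocol: one tab-separated pair of JSON values per line.
    return json.dumps(key) + "\t" + json.dumps(value)


def decode(line):
    key, value = line.split("\t", 1)
    return json.loads(key), json.loads(value)


def run_local(lines):
    # Map step: raw text lines in, encoded key/value lines out.
    mapped = [encode(k, v) for line in lines for k, v in mapper(None, line)]
    # Stand-in for the shuffle / sort: order pairs so equal keys are adjacent.
    pairs = sorted((decode(m) for m in mapped), key=itemgetter(0))
    # Reduce step: each key sees a lazy stream of its values.
    for key, group in groupby(pairs, key=itemgetter(0)):
        for out_key, out_value in reducer(key, (v for _, v in group)):
            yield encode(out_key, out_value)
```

Because the steps are ordinary functions, they can be exercised from a unit test without a cluster, for example by calling `list(run_local(["a b a", "B c"]))` and checking the encoded lines that come back.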

------
miga
It would be nice to have something multilanguage for users of minority languages
(Erlang, Haskell, or Lisp).

~~~
stevejohnson
It's actually pretty close. It's theoretically possible for mrjob to run a
script written in any language as long as it supports a simple stdin/stdout
protocol. We just don't have any users dedicated enough to implement that
stuff.

Also, you can write a very small Python script that uses subprocess to call
your script. That way you can still use mrjob to set up all your dependencies
and handle AWS for you. If anyone is actually interested in that sort of
thing, I'd be willing to write a tutorial.
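
A rough sketch of what such a wrapper might look like, assuming the external program reads lines on stdin and writes lines on stdout. Everything here is illustrative: `run_external` and `EXAMPLE_COMMAND` are hypothetical names, and a one-line Python program stands in for a script written in another language. It uses only the standard library and does not show mrjob's own API.

```python
import subprocess
import sys


def run_external(command, lines):
    """Feed `lines` to `command` on stdin and return its stdout as a list of lines."""
    result = subprocess.run(
        command,
        input="".join(line + "\n" for line in lines),
        capture_output=True,
        text=True,
        check=True,  # raise if the external program exits with a non-zero status
    )
    return result.stdout.splitlines()


# Stand-in for a script in another language: upper-cases whatever arrives on stdin.
EXAMPLE_COMMAND = [
    sys.executable,
    "-c",
    "import sys; sys.stdout.write(sys.stdin.read().upper())",
]


def mapper(_, line):
    # Hypothetical mapper that delegates the real work to the external program.
    for out in run_external(EXAMPLE_COMMAND, [line]):
        yield None, out
```

Spawning one process per input line keeps the sketch short but would be slow on real data; a longer-lived child process fed through a pipe is the more likely design in practice.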

