

Dumbo - Python module that allows you to easily write and run Hadoop programs - coderdude
https://github.com/klbostee/dumbo

======
timr
At one point we used Dumbo for a lot of stuff at Yelp, but we finally ended up
writing our own framework:

<http://engineeringblog.yelp.com/2010/10/mrjob-distributed-computing-for-everybody.html>

It's open source and easy to use, and it makes Hadoop on EMR exceptionally clean.
Check it out.

~~~
samt
Nice work on mrjob! We've been really happy with it.

------
bretthoerner
See also: <https://github.com/bwhite/hadoopy> and
<https://github.com/Yelp/mrjob>

~~~
coderdude
And for anyone who didn't see it earlier: <http://yahoo.github.com/oozie/>

These four are the only ones I know of. Does anyone know of any others?

~~~
bretthoerner
Unlike Dumbo, mrjob & hadoopy, Oozie isn't a Python Streaming library, so I'm
not sure what you're looking for by 'others'?

~~~
coderdude
I didn't mean to imply that it's another Python Streaming library. I included
it because it's a library for working with Hadoop jobs. Unless I am mistaken
in that? I haven't yet tried Oozie myself.

~~~
bretthoerner
Right, it's just that there's a whole slew of them once you leave the Python
Streaming world (which I figured people reading this would care about).

Cascading, Cascalog, Wukong, and innumerable others.

~~~
coderdude
You were certainly justified in questioning what I was talking about. I didn't
know there were so many others actually. Thanks for the names of a few,
though.

------
boyter
<http://discoproject.org/>

That's another project in a similar vein, although it has native support for
Python and you don't need to install Hadoop (which may be better for some).

~~~
bretthoerner
To be clear, Disco is a full replacement for Hadoop (including MapReduce,
HDFS, etc). I've heard good things, but you also leave the (rather large)
Hadoop ecosystem.

As a Python developer I've found Dumbo / mrjob to be easy to use.

~~~
coderdude
Which do you prefer using, Dumbo or mrjob?

~~~
bretthoerner
They're pretty similar; it's hard to make MapReduce very complex in Python.

What mrjob has going for it is that it's very easy to run a job on Amazon's
EMR (Elastic MapReduce), so you don't have to do any Hadoop setup. You can run
Dumbo on EMR, too; it's just more manual (last I checked).

That said, Dumbo has been around longer and seems to be smarter about
input/output formats. It seems that by default it can read just about any
type of Hadoop format (look up SequenceFiles, for example) and by default
it'll output compressed SequenceFiles, too, which helps save space and I/O.
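
For anyone who hasn't seen what a Python MapReduce job looks like, here's a
minimal word-count sketch in plain Python. The generator-style
`mapper(key, value)` / `reducer(key, values)` signatures follow Dumbo's
convention as I understand it (mrjob uses similar methods on a class).
`run_locally` is a hypothetical stand-in I made up so the example runs without
Hadoop; it is not part of either library.

```python
from itertools import groupby
from operator import itemgetter


def mapper(key, value):
    # key is ignored here (think: line offset); value is one line of text.
    for word in value.split():
        yield word.lower(), 1


def reducer(key, values):
    # values iterates over every count the mappers emitted for this key.
    yield key, sum(values)


def run_locally(lines):
    # Hypothetical local driver (an assumption, not a Dumbo or mrjob API):
    # map every line, sort by key to imitate the shuffle, then reduce.
    mapped = [pair
              for offset, line in enumerate(lines)
              for pair in mapper(offset, line)]
    mapped.sort(key=itemgetter(0))
    results = []
    for key, group in groupby(mapped, key=itemgetter(0)):
        results.extend(reducer(key, (count for _, count in group)))
    return results


if __name__ == "__main__":
    for word, count in run_locally(["Dumbo and mrjob", "mrjob and hadoopy"]):
        print(word, count)
```

On a real cluster the framework does the sorting and grouping between the two
functions, which is why the job itself stays this small.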

