

mrjob: Yelp open sources its Elastic MapReduce framework for Python - pretz
http://engineeringblog.yelp.com/2010/10/mrjob-distributed-computing-for-everybody.html

======
stevejohnson
This past week I started working on a Python 3 port of this, mostly to learn.
No EMR unfortunately, but Hadoop should be possible. I just got back from a
trip, so it's still not very far along, just runs the "local" version, but it
should get a bit farther next week.

I can confirm that it is a _great_ way to learn about MapReduce.

Link: <http://github.com/irskep/mrjob/tree/py3k>

I will likely totally restart the py3k port now that I know what I am doing a
bit better. I've been writing Python 3 for about, oh, two weeks.

------
ashika
Amazon EMR is an amazing value proposition for virtually any research need,
and it's very cool to see wrapper frameworks targeting it directly. Still, for
anyone managing their own compute clusters and wanting to do MR in Python, I'd
suggest checking out Disco.

Disco (<http://discoproject.org>) is a really elegant MR framework implemented
in Erlang and Python, with additional support for jobs in C and Java. I've
used it for a little over a year and am convinced it is the superior MR
platform (Hadoop's TeraSort victories notwithstanding). New features are being
integrated very quickly, the core platform is rock solid, management is simple
and it's extremely flexible.

------
derwiki
This was a game changer for us -- instead of everyone contending for the
Hadoop cluster, each developer has their own personal arsenal of Hadoop
clusters. Huge win.

~~~
bravura
But then don't you have a lot of CPUs going unused, because you are
partitioning your resources?

Is it really difficult to automatically allocate shared resources?

~~~
timr
We're allocating EMR clusters as needed. When they're no longer needed, they
go away. Waste is minimal.

------
deathflute
On this note, does anyone know a good tutorial on MapReduce for experienced
programmers? Basically, I want to learn how to frame advanced problems in
terms of MR - I am particularly interested in expressing my discrete event
simulation in terms of MR.

~~~
snotrockets
You want exercises, not a walk-through.

The thing with using higher-order functions isn't learning the definitions
(which are rather simple, really) but figuring out how to use those tools.

And for that you need practice. Start by describing trivial problems (word
count, for example), and advance to more complicated ones. Any good book on
functional programming would have lots of exercises
(<http://mitpress.mit.edu/sicp/> is probably the most famous, but is surely
not the only one.)
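
For the word count exercise mentioned above, a minimal sketch in plain Python
might look like the following. It is not mrjob or Hadoop code: the names
mapper, reducer, and run_word_count are made up for illustration, and the
"shuffle" is just an in-memory dictionary standing in for what a real
framework does across machines.

```python
from collections import defaultdict


def mapper(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def reducer(word, counts):
    """Reduce step: sum all the counts collected for a single word."""
    return word, sum(counts)


def run_word_count(lines):
    """Hypothetical in-memory stand-in for a MapReduce runtime."""
    grouped = defaultdict(list)
    # Shuffle: group every mapper output by its key (the word).
    for line in lines:
        for word, count in mapper(line):
            grouped[word].append(count)
    # Reduce: one reducer call per distinct key.
    return dict(reducer(word, counts) for word, counts in grouped.items())


if __name__ == "__main__":
    print(run_word_count(["the quick brown fox", "the lazy dog"]))
```

The useful part of the exercise is noticing that mapper and reducer never
share state, which is what lets a real framework run them on different
machines. Harder problems are mostly a matter of choosing the right key.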

I grokked functional programming by learning the calculus of variations, but YMMV.

------
FraaJad
Nice to see one more production use of Cython.

~~~
sumeeta
mrjob doesn’t contain any Cython. The author was just stating it was a
challenge getting Yelp’s codebase (which contains some Cython) running on EMR.

~~~
FraaJad
I understood that from the article. But in light of the recent discussion on
Cython, I thought it was interesting to note a "2.0" company like Yelp using
Cython.

------
LiveTheDream
So does most of your data live in S3 in JSON format?

