

Mincemeat.py: a single-file Python implementation of MapReduce - michaelfairley
http://mincemeatpy.com

======
adamtj
What do people use this sort of thing for? The example is neat, but it's
trivial and would take 4 lines of non-mapreduce code. What sorts of problems
do people have that are big enough to _need_ mapreduce? It seems to me that
problems big enough to need it are going to be big enough to bother with
hadoop or maybe rolling something custom so you can control the details. Is
that not true?

~~~
michaelfairley
I spun this out of some work I was doing for an NLP-related thesis. Large-scale
text processing where there's already a shared file system is a perfect fit
for mincemeat.py.

I basically wanted something that was much, much easier to develop for, set
up, and run than Hadoop. I've heard many academic researchers complain that
although they had an algorithm that would fit neatly into MapReduce, they
didn't want to bother setting up Hadoop and importing all of their data for a
process that would only get run a few times (and that was already coded in
Python).

~~~
StavrosK
The only downside for me is that not only do you have to find a massively
parallel problem to work on, but the computation function also needs to be
much slower than network latency. With network latency being in the ms range,
the algorithm solving the problem needs to be _really_ slow to benefit.

~~~
leif
Not necessarily. It can also just be a vast amount of data, in which case
bandwidth (which is generally pretty good), not latency, is your limiting
factor with mapreduce.

Also, you only need one part of your infrastructure to require mapreduce's
parallelism in order to argue for using it across the board. If you have
simpler problems to solve, you may as well solve them with mapreduce since
you'll be thinking in that computation model anyway, and you can more easily
use the results later in a computation that may require mapreduce.

~~~
StavrosK
Your problem still has to have the property that loading and processing the
dataset is slower than sending it over the network and getting the result
back, though...
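A back-of-envelope version of this break-even point (all numbers here are
assumptions for illustration, not measurements): shipping 1 GiB over a
100 MiB/s link costs about 10 s, and under a naive model distribution only
pays off when local compute time exceeds transfer time plus the per-worker
share of the compute.

```python
# Hypothetical numbers, chosen only to make the arithmetic concrete.
data_bytes = 1 * 1024**3        # 1 GiB of input
bandwidth = 100 * 1024**2       # 100 MiB/s network throughput

transfer_s = data_bytes / bandwidth   # ~10.24 s just to ship the data

def worth_distributing(compute_s, transfer_s, workers):
    # Naive model: distributed time = full transfer + compute split
    # evenly across workers. Ignores latency, stragglers, and overlap.
    return transfer_s + compute_s / workers < compute_s

# A job that takes 60 s locally beats the network cost with 10 workers;
# a 5 s job does not.
```

The model is crude (real systems overlap transfer with compute), but it makes
the thread's point: the compute has to dominate the data movement.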

------
joshhemann
Thanks for sharing this, I will definitely be taking a look at it for a big
bootstrap (Monte Carlo) simulation I do. Have you used the parallel extensions
in IPython? I have used IPython with great success for the task farming needed
in my embarrassingly-parallel context, where the same algorithmic steps need
to be applied to hundreds or thousands of data sets. I'll be interested to use
your approach and consider the pros and cons, but I like how simple it appears
to use.

------
leif
I worry that in your example, vs is passed into reducefn as a list. A
generator would be more memory-efficient (though from the example it isn't
possible to tell, and I haven't looked at the source).
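To illustrate the concern (a generic sketch, not mincemeat.py's actual
internals): a reduce function like `sum` accepts any iterable, so it works the
same whether the framework hands it a list or a generator, but only the
generator avoids materializing every value in memory at once.

```python
import sys

def reducefn(key, values):
    # sum() accepts any iterable; with a generator the values are
    # consumed one at a time instead of being held in memory together.
    return sum(values)

as_list = [1] * 100_000                 # fully materialized
as_gen = (1 for _ in range(100_000))    # produced lazily, one at a time

total_list = reducefn("word", as_list)
total_gen = reducefn("word", as_gen)

# The list's size grows with its length; the generator object stays
# a small fixed size regardless of how many values it will yield.
print(sys.getsizeof(as_list), sys.getsizeof(as_gen))
```

Whether this matters depends on whether any single key's value list is large;
for word counts it rarely is, but for skewed keys it can be.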

------
mark_l_watson
Definitely cool, but Hadoop is fairly easy to set up and use, especially if
you use Elastic MapReduce.

