

Map Reduce in Python - transparently done over a network - ditados
http://remembersaurus.com/mincemeatpy/

======
jwegan
Another way to implement map reduce in Python is to use the Hadoop Streaming
Jar (<http://hadoop.apache.org/common/docs/r0.20.1/streaming.html>). You
basically write a mapper and a reducer as Python scripts and ship them out
with the job in a tarball. The scripts just need to read input from stdin and
write to stdout. When you kick off the job, you pass the streaming jar the
shell commands that invoke your Python scripts for the map and reduce tasks.
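
To make the stdin/stdout contract concrete, here is a minimal word-count
sketch. It assumes tab-separated `key<TAB>value` records and that the reducer
sees its input sorted by key; the function names and the `map`/`reduce`
command-line argument are made up for illustration and are not part of Hadoop.

```python
import sys
from itertools import groupby


def map_lines(lines):
    """Mapper: emit one 'word<TAB>1' record per word in the input."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reduce_lines(lines):
    """Reducer: sum the counts for each word.

    Assumes the records arrive sorted by key, so equal keys are adjacent.
    """
    records = (line.rstrip("\n").split("\t", 1) for line in lines if line.strip())
    for word, group in groupby(records, key=lambda rec: rec[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


def main(mode, stdin, stdout):
    """Run one phase, reading records from stdin and writing them to stdout."""
    step = map_lines if mode == "map" else reduce_lines
    for record in step(stdin):
        stdout.write(record + "\n")


# Hypothetical CLI: `python wordcount.py map` or `python wordcount.py reduce`.
if __name__ == "__main__" and len(sys.argv) == 2 and sys.argv[1] in ("map", "reduce"):
    main(sys.argv[1], sys.stdin, sys.stdout)
```

Saved as a hypothetical `wordcount.py`, this can be checked locally with
`cat input.txt | python wordcount.py map | sort | python wordcount.py reduce`
before handing the two commands to the streaming jar.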

~~~
espeed
Yes, for Python you can use Dumbo for this ([http://yz.mit.edu/wp/no-nonsense-
standalone-hadoop-and-dumbo...](http://yz.mit.edu/wp/no-nonsense-standalone-
hadoop-and-dumbo-on-ubuntu/)).

------
munin
This is pretty cool, but be aware that he is using marshal to send function
objects to remote systems, so you are marrying yourself to marshal's list of
restrictions as well: <http://docs.python.org/library/marshal.html>

It'll work great for simple use cases, but since he also does nothing to
capture dependencies, your map and reduce functions had better use only
standard Python objects, or Python objects that you are sure will be installed
on the remote systems ...
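
For anyone who hasn't seen the trick, this is roughly what shipping a function
with marshal looks like. It is a sketch of the general technique, not
necessarily how mincemeat does it; `square` and `rebuilt` are made-up names.

```python
import builtins
import marshal
import types


def square(x):
    return x * x


# "Sender" side: only the compiled code object is serialized to bytes.
payload = marshal.dumps(square.__code__)

# "Receiver" side: rebuild a callable around the received code object.
# The globals dict is whatever the receiver supplies, so any module the
# function refers to has to be available on the receiving machine.
code = marshal.loads(payload)
rebuilt = types.FunctionType(code, {"__builtins__": builtins}, "square")

print(rebuilt(7))
```

Nothing the function refers to travels with it: only the code object is in
the payload. Loading marshal data from a peer you don't trust is also unsafe.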

~~~
iandanforth
Would you suggest pickle or a more generic compressed JSON?

~~~
michaelfairley
Unfortunately, neither pickle nor the other non-marshal serializers will carry
a function's code: pickle only stores a reference to the function's module and
name. The only cross-version solution is to actually have Python read the
source code and send it (rather than the bytecode) over the wire, which seems
much more brittle to me.

See [http://stackoverflow.com/questions/1253528/is-there-an-
easy-...](http://stackoverflow.com/questions/1253528/is-there-an-easy-way-to-
pickle-a-python-function-or-otherwise-serialize-its-code)
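
A minimal sketch of the send-the-source approach, assuming the function is
self-contained and the receiver trusts the sender; `load_function` and
`count_words` are hypothetical names used only for illustration.

```python
# "Sender" side: the function travels as plain source text.
source = '''
def count_words(line):
    return len(line.split())
'''


def load_function(source_text, name):
    """Compile received source with the local interpreter and return one function."""
    namespace = {}
    exec(compile(source_text, "<received>", "exec"), namespace)
    return namespace[name]


# "Receiver" side: each machine compiles the source with its own Python,
# so no bytecode ever crosses the wire.
count_words = load_function(source, "count_words")
print(count_words("map reduce in python"))
```

This runs whatever source arrives, so it only makes sense between machines
you control.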

~~~
munin
Oh yeah, if you run this script with different Python versions on each end,
you will probably experience mysterious failures, since the marshal format and
the bytecode can change between versions.

In some limited testing I did, I was able to serialize a function on 64-bit
Linux and have it execute on 32-bit Windows. I even went crazy and wrote code
using ctypes against the Win32 API on Linux and sent it as a string to be
loads()ed by code on Windows... and that worked.

(This isn't really Python being awesome as much as bytecode languages working
as designed... but as far as I know this is all working by happy coincidence,
unlike, say, Java, where the bytecode is designed to be cross-platform...)

------
diwank
Amazing! Map Reduce couldn't get simpler.

------
ashrust
I'd love to see one of these Python MR approaches get some traction. Until
then, I expect most companies will follow Facebook's approach of putting data
into HDFS/Hive and then using Python, or whatever, to parse the output of
HiveQL.

