

FileMap: File-Based Map-Reduce - JensRantil
https://github.com/mfisk/filemap

======
dap
We've been using the file-oriented, shell-based map-reduce model for a while
with Joyent's Manta, and it's been a great fit for a variety of tasks. We've
used it internally for everything from log analysis to video transcoding to
Mario Kart analytics[0]. Map-reduce is a great model for distributing work,
and the shell is a great model for expressing _what_ that work is.

Disclaimer: I work at Joyent and helped build Manta. :)

[0] [http://kartlytics.com](http://kartlytics.com)

------
jon-wood
If you're just looking to parallelise an operation over some files, GNU
Parallel is a fantastic tool as well. On several occasions recently I've
combined the split command and Parallel to break a large CSV file up into
smaller chunks, and then run a Ruby process on each of those chunks.
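
A minimal sketch of that pattern (the filenames and the Ruby script are made
up):

    # Break the big CSV into 1M-line chunks, then fan the chunks out
    # with one Ruby process per chunk, one job per CPU core.
    split -l 1000000 big.csv chunk_
    ls chunk_* | parallel ruby process.rb {}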

Parallel is apparently also able to distribute a command over several hosts
using SSH, although I've not tried that one.
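
A sketch of that SSH mode (untested, as I said; the host names are
placeholders):

    # One job per chunk across host1 and host2; --trc transfers the input
    # chunk, returns the named output file, and cleans up on the remote end.
    ls chunk_* | parallel -S host1,host2 --trc {}.out "ruby process.rb {} > {}.out"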

~~~
JensRantil
Parallel is awesome. The only problem I have with it is that it doesn't
support data locality. If you have 500 GB of data, you don't want to copy it
to a machine just to run a command on it. Better to have the data stored on
your cluster directly, which is what FileMap does.

------
Cseraphi
Anyone willing to share some real-world examples of FileMap jobs? The examples
in the github page seem geared toward explaining things in terms familiar to
Hadoop users, which I am not one of. Seeing an actual command line (as opposed
to a contrived snippet of one) would be useful to me.

~~~
JensRantil
This example wasn't enough?
[https://github.com/mfisk/filemap/wiki/Examples](https://github.com/mfisk/filemap/wiki/Examples)
Hadoop experience is not necessary, but you need to know the basic idea behind
MapReduce
([https://en.wikipedia.org/wiki/MapReduce](https://en.wikipedia.org/wiki/MapReduce)).
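
If it helps, the core idea fits in an ordinary shell pipeline (this is the
generic pattern, not FileMap syntax):

    # Word count as map/shuffle/reduce:
    tr -s '[:space:]' '\n' < input.txt |  # map: emit one word (key) per line
      sort |                              # shuffle: group identical keys
      uniq -c                             # reduce: aggregate (count) per key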

------
awhitty
As someone who has never used Map-Reduce before, something about this
implementation makes the technology feel 100% more accessible to me.

~~~
andrewguenther
You should check out mrjob[1]. It wraps the Hadoop streaming API and makes it
super easy to write MapReduce jobs in Python. I find it much easier to
understand than this implementation.

[1] [https://pythonhosted.org/mrjob/](https://pythonhosted.org/mrjob/)
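
For context, mrjob sits on top of Hadoop Streaming, which just runs arbitrary
shell commands as the mapper and reducer; the jar and HDFS paths below are
illustrative:

    # Streaming sorts between the phases, so uniq -c sees grouped keys.
    hadoop jar hadoop-streaming.jar \
      -input /logs -output /wordcounts \
      -mapper "tr -s '[:space:]' '\n'" \
      -reducer 'uniq -c'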

~~~
JensRantil
mrjob looks nice, if you have a Hadoop cluster. But for "medium-sized big
data problems", FileMap is a very viable alternative if you have people who
know their way around a terminal, especially if you'd like to have something
set up fairly fast. For 500 GB of data, setting up Hadoop (steep learning
curve: NameNodes, JobTrackers, Thrift APIs, DataNodes, ZooKeeper, and
whatnot) is a lot of heavy lifting, not to mention administering it.

Sure, you have Cloudera et al., but I'm still trying to figure out whether
they really make things easier or harder when, two weeks later, it comes to
figuring out why something is broken, or how to install and start additional
Hadoop components.

~~~
andrewguenther
mrjob has really good integration with Amazon's Elastic MapReduce, which
makes it totally painless. I had to analyze 1 TB of logs for my thesis, and
in less than 8 hours I discovered mrjob, wrote my job, and successfully ran
it on EMR. Granted, I have prior experience with MapReduce, but even for a
newcomer I can't imagine that would add too much time.
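
Once the job file exists, the whole run is one command (the job file and
bucket names are placeholders; credentials come from the usual AWS
environment variables):

    # Same Python job, run on EMR instead of locally via mrjob's -r flag.
    python my_job.py -r emr s3://my-bucket/logs/ > counts.txt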

~~~
JensRantil
If your security policy allows putting stuff into Amazon... ;)

