FileMap: File-Based Map-Reduce

dap · on May 29, 2014

We've been using the file-oriented, shell-based map-reduce model for a while with Joyent's Manta, and it's been a great fit for a variety of tasks. We've used it internally for everything from log analysis to video transcoding to Mario Kart analytics[0]. Map-reduce is a great model for distributing work, and the shell is a great model for expressing what that work is.

Disclaimer: I work at Joyent and helped build Manta. :)

[0] http://kartlytics.com

jon-wood · on May 29, 2014

If you're just looking to parallelise an operation over some file, then GNU Parallel is a fantastic tool as well. On several occasions recently I've combined the split command and Parallel to break a large CSV file up into smaller chunks, and then run a Ruby process on each of those chunks.

Parallel is apparently also able to distribute a command over several hosts using SSH, although I've not tried that one.
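
For readers who want to see the shape of that split-then-parallelise workflow, here is a rough Python sketch. It is not GNU Parallel or the commenter's actual setup: `split_file`, `count_rows`, and `run_parallel` are hypothetical names, and `count_rows` merely stands in for whatever per-chunk job (the Ruby process above) you would really run.

```python
import itertools
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path


def split_file(path, lines_per_chunk, out_dir):
    """Write consecutive groups of lines to numbered chunk files, much like `split -l`.

    Assumption: one record per line. A CSV header ends up only in the first
    chunk, and quoted fields containing newlines would be split incorrectly.
    """
    out_dir = Path(out_dir)
    chunk_paths = []
    with open(path, newline="") as src:
        for index in itertools.count():
            lines = list(itertools.islice(src, lines_per_chunk))
            if not lines:
                break
            chunk_path = out_dir / f"chunk_{index:05d}"
            with open(chunk_path, "w", newline="") as dst:
                dst.writelines(lines)
            chunk_paths.append(chunk_path)
    return chunk_paths


def count_rows(chunk_path):
    """Hypothetical per-chunk job: count the non-empty lines in one chunk."""
    with open(chunk_path, newline="") as chunk:
        return sum(1 for line in chunk if line.strip())


def run_parallel(worker, chunk_paths, executor_cls=ProcessPoolExecutor):
    """Run `worker` on every chunk concurrently; results keep chunk order."""
    with executor_cls() as pool:
        return list(pool.map(worker, chunk_paths))
```

Calling `run_parallel(count_rows, split_file("big.csv", 100000, "chunks/"))` would return one row count per chunk, assuming `big.csv` and the `chunks/` directory exist. When using the default process pool from a script, the call should sit under an `if __name__ == "__main__":` guard.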

JensRantil · on May 29, 2014

Parallel is awesome. The only problem I have with it is that it doesn't support data locality. If you have 500 GB of data, you don't want to copy it to a machine to run a command on it. Better to have the data stored on your cluster directly, which is what FileMap does.

icebraining · on May 29, 2014

Distributed over SSH is awesome. Parallel can push the input files, run the command, and pull the output (and then clean up) all by itself.

Cseraphi · on May 29, 2014

Anyone willing to share some real-world examples of FileMap jobs? The examples on the GitHub page seem geared toward explaining things in terms familiar to Hadoop users, and I'm not one. Seeing an actual command line (as opposed to a contrived snippet of one) would be useful to me.

JensRantil · on June 1, 2014

This example wasn't enough? https://github.com/mfisk/filemap/wiki/Examples Hadoop experience is not necessary, but you need to know the basic idea behind MapReduce (https://en.wikipedia.org/wiki/MapReduce).

awhitty · on May 29, 2014

As someone who has never used Map-Reduce before, something about this implementation makes the technology feel 100% more accessible to me.

jasode · on May 29, 2014

If you're talking about other "Map-Reduce" as in Hadoop MapReduce being "less accessible", it's because Hadoop has a ton of housekeeping infrastructure. Hadoop is aware of the health of other machines and therefore slave jobs can be self-healing, any job errors are properly routed and centralized, etc. Leaving out the housekeeping stuff, the core "map & reduce" part of Hadoop is very simple. This FileMap doesn't seem to have all that infrastructure intelligence, hence it's "simpler".

It's like talking about BWR nuclear reactors. On the one hand, it's just a big glorified pot of boiling water. What's so hard about that?! The complexity isn't in boiling the water; it's the temperature regulation and redundant failsafes preventing an explosion of radioactive materials[1] that make it harder to implement than a kettle of hot water for brewing tea.

[1] http://en.wikipedia.org/wiki/Chernobyl_disaster
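
To illustrate how small the core "map & reduce" step is once the housekeeping is left out, here is a minimal single-process Python sketch. It is neither Hadoop's nor FileMap's API; `map_reduce`, `word_mapper`, and `count_reducer` are made-up names, and there is no scheduling, retrying, or fault tolerance here at all.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Apply `mapper` to each record, group emitted values by key, then reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)  # the "shuffle": collect values per key
    return {key: reducer(key, values) for key, values in groups.items()}


def word_mapper(line):
    """Emit (word, 1) for every whitespace-separated word in a line."""
    for word in line.split():
        yield word.lower(), 1


def count_reducer(word, counts):
    """Sum the per-occurrence counts emitted for one word."""
    return sum(counts)
```

For example, `map_reduce(lines, word_mapper, count_reducer)` gives a word-count dictionary for any iterable of text lines. Everything a real framework adds on top of this (distribution, health checks, error routing) is the part the comment above describes as the hard bit.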

andrewguenther · on May 29, 2014

You should check out mrjob[1]. It wraps the Hadoop streaming API and makes it super easy to write MapReduce jobs in Python. I find it much easier to understand than this implementation.

[1] https://pythonhosted.org/mrjob/

JensRantil · on May 29, 2014

mrjob looks nice. If you have a Hadoop cluster. But for "medium-sized big data problems" FileMap is a very viable alternative if you have people who know their way around a terminal. Especially if you'd like to have something set up pretty fast. For 500 GB of data, setting up Hadoop (steep learning curve; name nodes, jobtrackers, Thrift APIs, datanodes, ZooKeeper and whatnot) is a lot of heavy lifting. Not to mention administering it.

Sure, you have Cloudera et al., but I'm still trying to figure out if they really make things easier or worse when, two weeks later, it comes to figuring out why something is broken, or how to start/install additional Hadoop components.

andrewguenther · on May 29, 2014

mrjob has really good integration with Amazon's Elastic MapReduce, which makes it totally painless. I had to analyze 1 TB of logs for my thesis, and in less than 8 hours I discovered mrjob, wrote my job, and successfully ran it on EMR. Granted, I have prior experience with MapReduce, but even for a newcomer, I can't imagine that would add too much time.

JensRantil · on May 30, 2014

If your security policy allows putting stuff into Amazon... ;)

JensRantil · on May 29, 2014

Yes, FileMap was also extremely easy to set up.