
Map/Reduce in Bash - coderdude
http://github.com/erikfrey/bashreduce
======
gfodor
a cool hack, but a large part of what makes map reduce map reduce isn't just
the "map" and "reduce" but all the necessary fault tolerance features to
ensure that when you run a job that's computable, it damn well is going to
finish even if a meteor takes out half your data center.

~~~
saurik
That is not difficult to handle in bash. The expectation going in should be
that individual computations and even computers might fail in ways that keep
them from reporting failure, at which point the fact that it is written in a
language running on a VM with a poor memory manager (as it really wasn't
designed for this kind of stress, and rapidly dominates RAM and fragments its
heap to hell) is not a problem. (This happens to be a sore point of mine, as I
had a rather unenlightened professor in grad school who refused to believe I
could effectively solve one massive distributed problem that came up using a
"simple shell script for doing the distribution"; he gave me a bad grade in
his class and never forgave me for having made the claim. This despite the
fact that I sent him the finished result three days later, having narrowed the
search space with some light intelligence enough to get the same answer his
system took a week to produce on better hardware with his beloved
not-a-shell-script. _sigh_ ;P)

~~~
gfodor
It seems easier said than done, unless you think Hadoop's architecture and
fault tolerance logic is over-engineered.

------
mark_l_watson
Interesting, but considering the rich set of tools growing up around the
Hadoop infrastructure (e.g., Mahout, Cascading), I think it makes much more
sense to scale out horizontally using Hadoop and related technologies.

------
chuhnk
Very cool; something I'll try to utilize internally for processing large files.

------
wazoox
I don't know if this really qualifies as "map-reduce". It's very cool, anyway :)

~~~
StavrosK
Why not? There appears to be both a mapper and a reducer, so...

~~~
btilly
There is both a mapper and a reducer internally, but a real map-reduce
implementation should let you run custom code on both the map and the reduce,
and a full-featured one should let you independently specify how many nodes
you have mapping and reducing.

I may have missed it, but I don't see those things.

(I'm ignoring the fact that it is nice to be able to pass around arbitrary
data structures. While true, I wouldn't expect that out of a bash
implementation.)
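The property being asked for, user-supplied code in both phases, can be
sketched in a few lines of shell on a single machine (a hypothetical `mr`
helper for illustration, not part of bashreduce):

```shell
#!/bin/sh
# mr: hypothetical single-machine driver. The caller supplies *both*
# the map command and the reduce command; the framework contributes
# only the shuffle (grouping intermediate pairs by key with sort).
# Usage: mr 'mapper command' 'reducer command' < input
mr() {
  # $1: mapper  -- reads raw input, emits key<TAB>value lines
  # $2: reducer -- reads key-grouped lines, merges values per key
  sh -c "$1" | sort -k1,1 | sh -c "$2"
}

# Example: word count, with user-written map and reduce steps
printf 'to be or not to be\n' |
  mr "awk '{ for (i = 1; i <= NF; i++) print \$i \"\t\" 1 }'" \
     "awk -F'\t' '{ n[\$1] += \$2 } END { for (k in n) print k \"\t\" n[k] }'" |
  sort
```

Because the two phases are arbitrary commands passed in by the caller, this
sketch has the extensibility point the comment says is missing; what it
obviously still lacks is any distribution or control over node counts.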

~~~
adamtj
Map/reduce is a very simple, very general idea. It is also the name of
Google's and Hadoop's specific implementations, with lots of bells and
whistles, and of everything in between. This implementation has more than the
minimum two necessary features. If you are arguing that it isn't a true
map/reduce implementation, then I would argue that you aren't a true software
developer. <http://en.wikipedia.org/wiki/No_true_scotsman>

~~~
btilly
The idea of MapReduce was introduced to the world in the paper
<http://labs.google.com/papers/mapreduce.html>. The abstract starts off with
_MapReduce is a programming model and an associated implementation for
processing and generating large data sets. Users specify a map function that
processes a key/value pair to generate a set of intermediate key/value pairs,
and a reduce function that merges all intermediate values associated with the
same intermediate key. Many real world tasks are expressible in this model, as
shown in the paper._

If you lack that feature set, then you can't solve most of the problems that
MapReduce is actually used for in practice. Conversely if you have that
feature set, then you can, though not necessarily with good reliability or at
optimal speed. Therefore that description is what I take as the canonical
description of what it means to have implemented MapReduce.
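The model the abstract describes can be illustrated with a tiny single-machine
shell pipeline (a sketch with made-up sample data, not the paper's
implementation): map emits intermediate key/value pairs, a sort groups the
pairs by key, and reduce merges all values sharing a key, here by taking the
per-key maximum.

```shell
#!/bin/sh
# Sketch of the MapReduce programming model on one machine.
# Input records are "key value" lines (made-up sample data).
printf 'nyc 31\nsfo 18\nnyc 24\nsfo 22\n' |
  awk '{ print $1 "\t" $2 }' |   # map: emit key<TAB>value pairs
  sort -k1,1 |                   # shuffle: group pairs by key
  awk -F'\t' '
    $2 + 0 > max[$1] + 0 { max[$1] = $2 }       # reduce: merge all values
    END { for (k in max) print k "\t" max[k] }  # for a key (keep the max)
  ' |
  sort                           # deterministic output order
# prints: nyc<TAB>31 and sfo<TAB>22
```

Swapping in a different map command or a different merge rule gives a
different job; that interchangeability is the feature set at issue here.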

As for your arguing that I am not a true software developer, guilty as
charged. I've been joking for years that I am merely a profitably displaced
mathematician, and my current job role is a weird cross between software
development and system administration.

------
gcb
heh. considering that a shell by definition must rely on other programs to do
something, a simpler implementation can be:

#!/bin/sh /bin/hadoop start-all

~~~
gcb
...also close your eyes and imagine a line break there.

