

Write your first MapReduce program in 20 minutes - the_gws
http://michaelnielsen.org/blog/write-your-first-mapreduce-program-in-20-minutes/

======
wedesoft
And here's the equivalent Ruby program. In Ruby it is usually 'collect' and
'inject' instead of 'map' and 'reduce'.

    
    
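        # Map: build a word-count hash for each input file.
        result = Dir.glob('test?.txt').collect do |file_name|
          # File.read closes the file; File.new(...).read leaves the handle open.
          File.read(file_name).split(' ').collect do |word|
            word.downcase.tr '.,\'', ''
          end.inject Hash.new(0) do |hash,word|
            hash[word] += 1
            hash
          end
        end.inject do |all,hash|
          # Reduce: merge two per-file hashes, summing the counts for each word.
          (all.keys + hash.keys).uniq.inject Hash.new(0) do |acc,word|
            acc[word] = all[word] + hash[word]
            acc
          end
        end
        p result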
    

Edit: The Python example in the article is better because it merges hashes in
the reduce step which facilitates parallelisation.
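
A minimal sketch of the same idea in Ruby, assuming the per-input counts are plain hashes: the reduce step merges two count hashes with Hash#merge and sums the counts on key collisions. Since that merge is associative, partial results can be combined in any grouping. The names `count_words` and `merge_counts` are made up for this example.

```ruby
# Count the words in one chunk of text (the "map" step for a single input).
def count_words(text)
  text.downcase.tr(".,'", '').split.each_with_object(Hash.new(0)) do |word, counts|
    counts[word] += 1
  end
end

# Merge two count hashes (the "reduce" step), summing counts for shared words.
def merge_counts(a, b)
  a.merge(b) { |_word, x, y| x + y }
end

texts = ['The quick fox.', 'The lazy dog, the end.']
partials = texts.map { |t| count_words(t) }
total = partials.reduce({}) { |acc, h| merge_counts(acc, h) }
p total
```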

~~~
cdcarter
You can also use Enumerable#map and Enumerable#reduce, to use names that match
the pattern (and #map is arguably more idiomatic than #collect).
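
A small sketch of what that looks like, using an in-memory word list rather than files (the list is made up for illustration). `map` is an alias of `collect` and `reduce` is an alias of `inject`, so both versions give the same result.

```ruby
words = %w[a b a c b a]

# collect/inject, the names used in the parent comment
counts_a = words.collect { |w| [w, 1] }.inject(Hash.new(0)) do |hash, (word, n)|
  hash[word] += n
  hash
end

# map/reduce: the same methods under different names
counts_b = words.map { |w| [w, 1] }.reduce(Hash.new(0)) do |hash, (word, n)|
  hash[word] += n
  hash
end

p counts_a == counts_b
```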

~~~
chrismealy
I finally got inject() when they added the reduce() alias.

~~~
cdcarter
The thing that confused me most about #inject was the term 'memo'...

------
derwiki
MRJob is a great way to get up and running with MapReduce:

<http://engineeringblog.yelp.com/2010/10/mrjob-distributed-computing-for-everybody.html>

<https://github.com/Yelp/mrjob>

------
cdcarter
If you want to play with MapReduce on a dataset without setting up a Hadoop
node (or 4), check out CouchDB. It's designed around MapReduce (though not
distributed), and you even get to deal with solving re-reduce problems.
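
To give a feel for the re-reduce problem, here is a rough Ruby simulation. CouchDB's own reduce functions are normally written in JavaScript and receive a `rereduce` flag; `count_reduce` below is a hypothetical stand-in that drops the keys argument for brevity and splits the emitted values into chunks by hand.

```ruby
# Hypothetical stand-in for a CouchDB-style reduce function that counts values.
# rereduce == false: values are raw emitted values, so count them.
# rereduce == true:  values are earlier reduce results, so sum them.
def count_reduce(values, rereduce)
  rereduce ? values.sum : values.length
end

emitted = %w[a b a c b a d]          # values emitted by a map step
chunks  = emitted.each_slice(3).to_a # pretend the index splits them into chunks

partials = chunks.map { |chunk| count_reduce(chunk, false) }
total    = count_reduce(partials, true)

p partials
p total
```

A reduce that ignored the flag and always returned `values.length` would count the partial results instead of summing them, which is the kind of mistake re-reduce forces you to think about.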

~~~
klaruz
BigCouch will let you build indexes in parallel.

------
pangram
I wrote a little shim that allowed me to write Hadoop jobs in Clojure, and had
two small test functions that would apply a map / reduce to a test file -- it
made development of Hadoop jobs a bit easier. See:
<https://github.com/brool/hadoop-shim/blob/master/wordcount.clj>
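
For anyone who wants the same kind of local test loop without Hadoop, here is a rough Ruby sketch of the idea. It is not the shim's code: `run_map` and `run_reduce` are hypothetical helpers, and the input is a couple of in-memory lines rather than a test file.

```ruby
# Hypothetical local harness: apply a mapper to each line, collecting [key, value] pairs.
def run_map(lines, mapper)
  lines.flat_map { |line| mapper.call(line) }
end

# Group the pairs by key (a stand-in for the shuffle) and reduce each group.
def run_reduce(pairs, reducer)
  pairs.group_by(&:first).map do |key, kvs|
    [key, reducer.call(key, kvs.map(&:last))]
  end.to_h
end

mapper  = ->(line) { line.downcase.scan(/\w+/).map { |w| [w, 1] } }
reducer = ->(_key, values) { values.sum }

lines  = ['Hello world', 'hello MapReduce']
pairs  = run_map(lines, mapper)
result = run_reduce(pairs, reducer)
p result
```

Swapping `lines` for `File.readlines('some_test_file.txt')` (file name assumed) would run the same mapper and reducer against a file.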

