
Mapreduce Bash Script - danw
http://blog.last.fm/2009/04/06/mapreduce-bash-script
======
antirez
I guess Hadoop should rethink its "user interface" to the programmer if a
bash script can sometimes be handier. A lot of great code has so little taste
in its interface that I simply refuse to use it. I don't want to say that's
the case with Hadoop, but caring about the "interface with the programmer",
and the idea that "it should be simple to do simple things", are somehow not
part of the culture of many project leaders.

------
ashleyw
One thing I've always wondered about MapReduce... what is it used for? I mean,
what kind of data do you put in, and what do you aim to get out?

~~~
inerte
You put lists in, and you take lists out :p

Imagine a key-value store/database. Each key is a word, and the value is a
list of webpage ids. Those ids are in turn keys into a second store, where the
values are the webpage contents.

Get every value for the word "hacker", get every value for the word "news",
intersect these values (distributing the computation, or DTC), and get the
webpages for this intersection. Now you have the webpages that contain both
"hacker" and "news".

Key -> Value (word, webpage ids)

hacker -> page_1,page_255,page_600,page_5041

news -> page_5,page_600,page_1001,page_5041

(so, intersect == page_600,page_5041)

Key -> Value (webpage ids, contents)

page_600 -> "hacker news new threads comments leaders"

page_5041 -> "where I can find news for hackers"
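
A rough Python sketch of that lookup, using the toy data above. The two dicts
and the function name `pages_matching` are made up for illustration, and
everything runs in one process here, so the "DTC" part is left out:

```python
# Toy inverted index from the example above: word -> set of page ids.
index = {
    "hacker": {"page_1", "page_255", "page_600", "page_5041"},
    "news": {"page_5", "page_600", "page_1001", "page_5041"},
}

# Second store: page id -> page contents.
pages = {
    "page_600": "hacker news new threads comments leaders",
    "page_5041": "where I can find news for hackers",
}


def pages_matching(words, index):
    """Return the ids of pages listed under every one of the given words."""
    postings = [index.get(word, set()) for word in words]
    if not postings:
        return set()
    return set.intersection(*postings)


matches = pages_matching(["hacker", "news"], index)
for page_id in sorted(matches):
    print(page_id, "->", pages[page_id])
```

A word that is missing from the index contributes an empty set, so the whole
intersection comes back empty.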

Now let's sort these webpages. Take the relevancy algorithm and apply it to
your list of webpages (DTC), so now you have another list. Now take the list
of urls that the user has "banned" (think Google results wiki), and remove
them from the list (DTC). Now take the content from the webpages, select a
snippet where the words "hacker" and "news" appear, and wrap them in bold
tags (you guessed it... DTC).
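
Those three steps could be sketched in Python like this. It's only a sketch:
`relevancy` is a hypothetical stand-in score (it just counts query-word
occurrences), and `rank`, `drop_banned` and `bold_snippet` are names invented
here, not part of any real search engine or of Hadoop:

```python
import re


def relevancy(text, words):
    # Hypothetical stand-in score: how many times the query words occur.
    lowered = text.lower()
    return sum(lowered.count(word) for word in words)


def rank(page_ids, pages, words):
    """Order page ids from most to least relevant."""
    return sorted(
        page_ids,
        key=lambda page_id: relevancy(pages[page_id], words),
        reverse=True,
    )


def drop_banned(page_ids, banned):
    """Remove the pages the user has banned, keeping the order."""
    return [page_id for page_id in page_ids if page_id not in banned]


def bold_snippet(text, words):
    """Wrap every occurrence of a query word in <b> tags."""
    pattern = re.compile("|".join(re.escape(w) for w in words), re.IGNORECASE)
    return pattern.sub(lambda m: "<b>" + m.group(0) + "</b>", text)


pages = {
    "page_600": "hacker news new threads comments leaders",
    "page_5041": "where I can find news for hackers",
}
words = ["hacker", "news"]

ranked = rank(list(pages), pages, words)
visible = drop_banned(ranked, banned={"page_5041"})
snippets = [bold_snippet(pages[page_id], words) for page_id in visible]
```

Scoring, ban filtering and snippet building each look at one page at a time,
which is what makes them easy to spread over many machines. Only the final
sort needs to see all the scores together.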

The thing with so-called MapReduce is that it makes this distribution
somewhat easier. You map your data, and you reduce, ad infinitum or as many
times as you want, each time distributing the computation. I think I read
somewhere that a single query on Google can use up to 100 machines.
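
To make "map" and "reduce" a bit more concrete, here is a minimal
single-process sketch that builds the word -> page ids store from the page
contents. The function names (`map_page`, `shuffle`, `reduce_word`) are
assumptions for illustration, not Hadoop's API, and there is no stemming, so
"hackers" ends up under its own key rather than under "hacker". In a real
cluster the framework would run the map and reduce calls on different
machines and do the grouping in between:

```python
from collections import defaultdict


def map_page(page_id, text):
    """Map step: emit one (word, page_id) pair per distinct word on the page."""
    return [(word, page_id) for word in set(text.lower().split())]


def shuffle(pairs):
    """Group emitted values by key, as a framework would between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_word(word, page_ids):
    """Reduce step: collapse all page ids seen for a word into one sorted list."""
    return word, sorted(page_ids)


pages = {
    "page_600": "hacker news new threads comments leaders",
    "page_5041": "where I can find news for hackers",
}

# Every map_page call is independent of the others, so they could run anywhere.
emitted = [pair for page_id, text in pages.items()
           for pair in map_page(page_id, text)]

# Every reduce_word call only needs the values for its own key.
index = dict(reduce_word(word, ids) for word, ids in shuffle(emitted).items())
```

The output is again a key -> list store, so it can be fed straight into
another map/reduce round.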

