I agree. I hate this example. Not sure why it's so ubiquitous.
Think of it like this: each record in your dataset goes through the "map" phase. This is really just a categorization. You look at each record and assign it, or some parts of it, a "key" based on what it contains, and send those KV pairs back out into the ether. The framework will then automatically group those by key and send them, in no guaranteed order and in parallel, to your "reduce" phase. You get a key and a list of all the data points that were assigned that key. You do some computation on them and return the result. When all of this is done, the process is complete.
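To make that concrete, here's a minimal in-memory sketch of the same flow, using word count only because it's the example under discussion. This is not how Hadoop actually distributes anything; the records, function names, and the manual grouping step are just stand-ins for what the framework does for you.

```python
from collections import defaultdict

# Hypothetical records: one line of text per record (made up for illustration).
records = ["the quick brown fox", "the lazy dog", "the fox"]

def map_phase(record):
    # "Categorize" each piece of the record by emitting (key, value) pairs.
    # Here the key is the word itself and the value is just 1.
    for word in record.split():
        yield (word, 1)

def reduce_phase(key, values):
    # Given one key and every value emitted under it, compute one result.
    return (key, sum(values))

# The "shuffle": the framework groups the emitted pairs by key for you;
# this loop just simulates that grouping in memory.
grouped = defaultdict(list)
for record in records:
    for key, value in map_phase(record):
        grouped[key].append(value)

results = [reduce_phase(key, values) for key, values in grouped.items()]
print(results)  # e.g. [('the', 3), ('quick', 1), ('brown', 1), ...]
```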
I think what makes this example so confusing (as well as the other common Hadoop example, where you do a partial count in the mapper: "hello": 3, "goodbye": 2) is that the keys mean nothing to the final result (unless you really wanted to know the frequency of each word); they are only used to shard the work.
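For reference, that partial-count variant is roughly what a combiner does: the mapper pre-aggregates within its own chunk of input so it emits fewer pairs, and the reducer still just sums whatever values arrive for each key. A sketch, with made-up input:

```python
from collections import Counter

def map_with_partial_count(record):
    # Instead of emitting ("hello", 1) three times, pre-aggregate within
    # this record and emit ("hello", 3) once.
    counts = Counter(record.split())
    for word, count in counts.items():
        yield (word, count)

print(list(map_with_partial_count("hello hello hello goodbye goodbye")))
# [('hello', 3), ('goodbye', 2)]
```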
Sorry, I don't have a real-world example (crunching log files, as you mentioned, is a common one, but it's really just the same thing: counting frequencies).
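If it helps, the log-crunching case really is the same shape; here's a sketch with a made-up log format, keying each request by its status code instead of by word:

```python
from collections import defaultdict

# Hypothetical access-log lines; the format is invented for illustration.
log_lines = [
    "GET /index.html 200",
    "GET /missing 404",
    "POST /login 200",
]

def map_phase(line):
    # Key each request by its status code; the value is again just 1.
    yield (line.split()[-1], 1)

def reduce_phase(status, counts):
    return (status, sum(counts))

grouped = defaultdict(list)
for line in log_lines:
    for key, value in map_phase(line):
        grouped[key].append(value)

print([reduce_phase(k, v) for k, v in grouped.items()])
# [('200', 2), ('404', 1)]
```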