However, this is not bad.
Stochastic gradient descent (non-parallel, plain sequential) also computes the "wrong" gradient. Only if you compute the gradient over the entire example set do you get the "true" gradient.
Nonetheless, SGD (stochastic gradient descent) converges faster than batch gradient descent, because each update is cheaper (computing the gradient of one example vs. all examples). More importantly, stochastic gradient descent sometimes finds a better local optimum. Batch tends to land you in an easy-to-find local optimum, with no stochasticity to temporarily pop you out of a local minimum and let you find a better one. (See LeCun et al.'s 1998 paper, "Efficient BackProp".)
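Toy sketch of the difference (the least-squares objective, synthetic data, and learning rate here are just illustrative):

    import numpy as np

    def batch_gd_step(w, X, y, lr=0.01):
        # "True" gradient: averaged over the entire example set.
        grad = X.T @ (X @ w - y) / len(y)
        return w - lr * grad

    def sgd_step(w, x_i, y_i, lr=0.01):
        # "Wrong" (noisy) gradient from a single example. Each step is
        # cheap, and the noise can bounce you out of a shallow minimum.
        return w - lr * x_i * (x_i @ w - y_i)

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 5))
    y = X @ np.array([1., -2., 0.5, 3., 0.]) \
        + rng.normal(scale=0.1, size=1000)

    w = np.zeros(5)
    for i in rng.permutation(len(y)):  # one SGD epoch: 1000 cheap updates
        w = sgd_step(w, X[i], y[i])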
[edit: I initially wrote the following: "So computing a stochastic gradient over each example in parallel, and then aggregating them, isn't necessarily a bad idea. It's a particularly good idea when you are operating over Zipfian sparse distributions, like natural language inputs, where many words will be quite rare and infrequently updated in a sequential SGD approach."
Jimmy Lin presents that approach, but points out that the aggregator step is a bottleneck. So, as pointed out by iskander's child comment, each mapper trains SGD on a subset of the examples, and the trained models are aggregated into an ensemble. Thank you for the correction. The point stands that stochastic updates can give superior generalization accuracy to batch updates.]
Wait a minute, I don't think that's what the paper is suggesting. The communication costs of aggregating every micro-step of SGD would be extreme. I think Jimmy Lin is saying you should partition the data, train one model per partition using SGD, and then combine the models in an ensemble.
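Something like this toy sketch (the local learner and the averaging combiner are my stand-ins, not Lin's exact method):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(900, 4))
    y = X @ np.array([2., -1., 0., 0.5])

    def train_sgd(X_p, y_p, lr=0.01, epochs=5):
        # Each "mapper" runs plain SGD on its own partition only;
        # no per-step communication with the other mappers.
        w = np.zeros(X_p.shape[1])
        for _ in range(epochs):
            for i in rng.permutation(len(y_p)):
                w -= lr * X_p[i] * (X_p[i] @ w - y_p[i])
        return w

    # Partition the data and train one model per partition...
    shards = np.array_split(np.arange(len(y)), 3)
    models = [train_sgd(X[idx], y[idx]) for idx in shards]

    # ...then combine the models into an ensemble at prediction time.
    def ensemble_predict(x):
        return np.mean([x @ w for w in models])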
A more sophisticated alternative is the Terascale Learning Algorithm (http://www.slideshare.net/pauldix/terascale-learning) from John Langford & friends.
1. Seed every worker with a random initial parameter
2. Run SGD locally for some large number of steps
3. Choose the parameter with lowest training (or validation, but never test) error
4. Re-seed all workers with the best parameter
5. Go to 2 until you're satisfied.
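A self-contained toy version of that loop (a linear model and squared error stand in for the real learner):

    import numpy as np

    rng = np.random.default_rng(0)
    DIM = 5
    X = rng.normal(size=(4000, DIM))
    y = X @ rng.normal(size=DIM)
    shards = np.array_split(np.arange(len(y)), 4)  # one shard per worker

    def local_sgd(w, idx, steps=2000, lr=0.01):
        # Step 2: run SGD locally; no communication during the run.
        w = w.copy()
        for i in rng.choice(idx, size=steps):
            w -= lr * X[i] * (X[i] @ w - y[i])
        return w

    def train_error(w):
        return np.mean((X @ w - y) ** 2)

    # Step 1: seed every worker with a random initial parameter.
    params = [rng.normal(size=DIM) for _ in shards]
    for _ in range(5):
        params = [local_sgd(w, idx) for w, idx in zip(params, shards)]
        # Step 3: pick the parameter with the lowest training error.
        best = min(params, key=train_error)
        # Steps 4-5: re-seed all workers with it and repeat.
        params = [best.copy() for _ in params]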
All I'm really aware of are Google Proprietary Magic™ (motherfuckers. I want Colossus baaaaad), CFS from DataStax, Storm (sorta-not-really?), and Spark.
Instead of supplying map and reduce routines, you implement generic "user-defined functions". This gives you more flexibility in how the work is handled, and even if you only implement map and reduce UDFs, it supposedly gets better performance than Hadoop.
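In spirit it's something like this (a hypothetical illustration of the generic-UDF idea, not the system's actual API):

    # Hypothetical generic-UDF framing: the framework just applies
    # whatever operator you hand it, so map and reduce become two
    # special cases rather than the only primitives.
    def run(udf, partition):
        return list(udf(partition))

    def map_udf(partition):      # the classic per-record transform
        return (len(rec) for rec in partition)

    def sample_udf(partition):   # awkward to express as pure map/reduce
        return (rec for i, rec in enumerate(partition) if i % 10 == 0)

    print(run(map_udf, ["foo", "quux"]))                 # [3, 4]
    print(run(sample_udf, [str(i) for i in range(30)]))  # ['0', '10', '20']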
It's also designed to support distributing work over WANs. I think Hadoop really wants every compute node to be on the same LAN.
Fuckin' A right it does. You should see the labyrinthine depths people descend to in order to scale Hadoop: sub-clusters of sub-clusters, rack-local clusters, Zookeeper nodes all over the place.
It's like fuckin' 'Nam all over again, man.
I don't get it... what's the point of creating awesome software if you don't even make the effort to put the links out to help people find it?
I often see MapReduce used for "search requests" that would be more easily answered with indexes.
Am I missing something?
MapReduce isn't a database. It avoids mutable state such as an index. It's more like a command line tool such as grep.
You could also use MapReduce to build your index.
Finally, it is a simple example. Like how people use fib to demo parallelism.
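To the index-building point: a single-machine toy sketch of an inverted index in map/reduce terms:

    from collections import defaultdict

    documents = {1: "map reduce is simple", 2: "reduce the map"}

    # Map: emit (word, doc_id) pairs from each document.
    pairs = [(word, doc_id)
             for doc_id, text in documents.items()
             for word in text.split()]

    # Shuffle: group the pairs by word.
    grouped = defaultdict(set)
    for word, doc_id in pairs:
        grouped[word].add(doc_id)

    # Reduce: emit a sorted posting list per word, i.e. the index.
    index = {word: sorted(ids) for word, ids in grouped.items()}
    print(index["map"])  # [1, 2]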
Searching isn't really a MapReduce problem anyway. Think: what is the map? What is the reduce? They're not really any kind of computation, are they?
If you want to understand why MapReduce, find a better motivating example than search. PageRank is the classic.
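One PageRank iteration maps cleanly onto a MapReduce job. Toy single-machine sketch (the damping factor and three-page graph are just illustrative):

    from collections import defaultdict

    links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
    rank = {p: 1.0 / len(links) for p in links}
    d = 0.85  # damping factor

    for _ in range(20):  # one MapReduce job per iteration
        # Map: each page emits rank/outdegree to every page it links to.
        contribs = defaultdict(float)
        for page, outs in links.items():
            for dest in outs:
                contribs[dest] += rank[page] / len(outs)
        # Reduce: sum the contributions per page and apply damping.
        rank = {p: (1 - d) / len(links) + d * contribs[p] for p in links}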
Have you read the paper? If not, that would be the best start.
J. Dean and S. Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. In Proceedings of the Symposium on Operating Systems Design and Implementation (OSDI), 2004.
(map = search a partition and return the top k results; reduce = combine n lists of k results each into a single top-k result list)
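In toy form (the scorer is a placeholder):

    import heapq

    def score(doc, query):
        # Placeholder scorer: count occurrences of the query terms.
        return sum(doc.count(term) for term in query.split())

    def map_search(partition, query, k):
        # Map: score the docs in one partition, keep the local top k.
        return heapq.nlargest(k, ((score(d, query), d) for d in partition))

    def reduce_topk(result_lists, k):
        # Reduce: merge n lists of k results into one top-k list.
        return heapq.nlargest(k, (r for lst in result_lists for r in lst))

    parts = [["grep is fast", "map reduce"], ["reduce noise", "map of maps"]]
    print(reduce_topk([map_search(p, "map", 2) for p in parts], 2))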