

"MapReduce is Good Enough" by Twitter scientist - Peteris
http://arxiv.org/pdf/1209.2191v1.pdf

======
rm999
I don't quite understand the gradient descent section. How do you parallelize
stochastic gradient descent? Each iteration changes theta, so any parallel
computations will be computing the wrong gradient. His citation for that
section doesn't seem to address this, but I only skimmed that paper so I could
be wrong.

~~~
bravura
You are correct that each iteration changes theta. You are also correct that
parallel computations will be computing the wrong gradient.

However, this is not bad.

 _Stochastic_ gradient descent (non-parallel, simply sequential) also computes
the "wrong" gradient. If you compute the gradient over the entire example set,
you get the "true" gradient.

Nonetheless, SGD (stochastic gradient descent) converges faster than batch
gradient descent, because each update is cheaper (the gradient of one example
vs. the gradient over all examples). More importantly, _stochastic_ gradient
descent sometimes finds a _better_ local optimum. Batch tends to get you into
an easy-to-find local optimum, with no stochasticity to temporarily pop you
out of it and let you find a better one. (See LeCun's 1998 paper, "Efficient
BackProp".)
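
For concreteness, the two update rules look roughly like this (toy Python;
`grad` is a placeholder for whatever per-example gradient your model defines,
not anything from the paper):

    import random

    def batch_step(theta, examples, grad, lr=0.01):
        # Batch gradient descent: one update from the "true" gradient,
        # averaged over every example. Expensive per step.
        total = [0.0] * len(theta)
        for x, y in examples:
            g = grad(theta, x, y)
            total = [t + gi for t, gi in zip(total, g)]
        return [t - lr * gi / len(examples) for t, gi in zip(theta, total)]

    def sgd_step(theta, examples, grad, lr=0.01):
        # SGD: one update from the "wrong" (noisy) gradient of a single
        # random example. Much cheaper per step, and the noise can bounce
        # you out of a shallow local minimum.
        x, y = random.choice(examples)
        g = grad(theta, x, y)
        return [t - lr * gi for t, gi in zip(theta, g)]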

[edit: I initially wrote the following: "So computing a stochastic gradient
over each example in parallel, and then aggregating them, isn't necessarily a
bad idea. It's a particularly good idea when you are operating over Zipfian
sparse distributions, like natural language inputs, where many words are
quite rare and infrequently updated in a sequential SGD approach."

Jimmy Lin presents that approach, but points out that the aggregator step is a
bottleneck. So, as pointed out by iskander's child comment, each mapper trains
SGD _on a subset_ of the examples, and the trained models are aggregated into
an ensemble. Thank you for the correction. The point stands that stochastic
updates can give superior generalization accuracy to batch updates.]

~~~
iskander
>So computing a stochastic gradient over each example in parallel, and then
aggregating them, isn't necessarily a bad idea.

Wait a minute, I don't think that's what the paper is suggesting. The
communication costs of aggregating every micro-step of SGD would be extreme. I
_think_ Jimmy Lin is saying you should partition the data, train one model per
partition using SGD, and then combine the models in an ensemble.
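
Something like this, as a sketch (plain Python; `sgd_train` is a stand-in for
a sequential SGD trainer, and averaging the weights is just one simple way to
combine the per-partition models - you could equally keep them separate and
average their predictions):

    def train_ensemble(partitions, sgd_train):
        # "Map": each worker runs plain sequential SGD over its own data
        # partition, with no communication until the very end.
        models = [sgd_train(part) for part in partitions]

        # "Reduce": combine the independently trained weight vectors by
        # averaging them component-wise.
        dim = len(models[0])
        return [sum(m[i] for m in models) / len(models) for i in range(dim)]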

------
batgaijin
Funny how Twitter happens to have the guy who invented Storm...

~~~
heretohelp
On that note, what are the viable alternatives?

All I'm really aware of are Google Proprietary Magic™ (motherfuckers. I want
Colossus baaaaad), CFS from DataStax, Storm (sorta-not-really?), and Spark.

~~~
bunderbunder
Sector/Sphere is what the Sloan Digital Sky Survey uses.

Instead of supplying map and reduce routines, you implement generic "user
defined functions". That gives you more flexibility in how the work is
handled, and even if you only implement map and reduce UDFs, it supposedly
gets better performance than Hadoop.

It's also designed to support distributing work over WANs. I think Hadoop
really wants every compute node to be on the same LAN.

~~~
heretohelp
>I think Hadoop really wants every compute node to be on the same LAN.

Fuckin' A right it does. You should see the labyrinthine depths people descend
to in order to scale Hadoop. Sub-clusters of sub-clusters, rack-local
clusters, Zookeeper nodes all over the place.

It's like fuckin' 'Nam all over again man.

------
shin_lao
I have a naive question regarding MapReduce.

I tend to observe that MapReduce is often used to do "search requests" that
would be more easily answered with indexes.

Am I missing something?

~~~
johngibbon
Where does the index come from?

~~~
shin_lao
Indexing can be done on the fly, when adding or modifying entries.

~~~
johngibbon
Only if your data is stored in such a way that you can monitor adds and
modifications. MapReduce is often used on huge bags of data pulled from random
places or as one pass in a massive sequence of passes (think PageRank), where
any index will be destroyed by each pass.

MapReduce isn't a database. It avoids mutable state such as an index. It's
more like a command line tool such as grep.

You could also use MapReduce to build your index.

Finally, search is just a simple example, like how people use fib to demo
parallelism.
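
As a toy sketch of the index-building point above (in-memory Python, with the
shuffle/group step faked by a dict - real MapReduce would do this across
machines):

    from collections import defaultdict

    def index_map(doc_id, text):
        # Emit a (word, doc_id) pair for every word occurrence.
        for word in text.split():
            yield word.lower(), doc_id

    def index_reduce(word, doc_ids):
        # Collapse all postings for one word into a single index entry.
        return word, sorted(set(doc_ids))

    def build_index(docs):
        grouped = defaultdict(list)
        for doc_id, text in docs.items():
            for word, d in index_map(doc_id, text):
                grouped[word].append(d)
        return dict(index_reduce(w, ids) for w, ids in grouped.items())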

~~~
shin_lao
Thanks for your answer. My point was exactly what you're implying: wouldn't it
be cheaper to insert the data into an ad hoc "database" instead of using
Hadoop?

~~~
johngibbon
What kind of database are you going to insert into if you have 1000 TB of
data? Let alone an open source one. What kind of database is going to allow
you to set everything up and strip it all down just for one of a million
passes of your data? Do you have a simple database that you can distribute
over thousands of nodes?

Searching isn't really a MapReduce problem anyway - think: what is the map?
What is the reduce? They're not really any kind of computation, are they?

If you want to understand why MapReduce, find a better motivating example than
search. PageRank is the classic.
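
Roughly, one PageRank iteration fits map/reduce like so (toy Python; assumes
every page has at least one out-link, and the damping factor and page count
are just illustrative defaults):

    def pagerank_map(page, rank, out_links):
        # Each page sends an equal share of its rank to every page it
        # links to, and re-emits its link list for the next iteration.
        for target in out_links:
            yield target, rank / len(out_links)
        yield page, out_links

    def pagerank_reduce(page, values, damping=0.85, n_pages=1):
        # New rank = damped sum of incoming shares; the link list emitted
        # above is passed through unchanged to the next pass.
        out_links, total = [], 0.0
        for v in values:
            if isinstance(v, list):
                out_links = v
            else:
                total += v
        return page, (1 - damping) / n_pages + damping * total, out_links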

Have you read the paper? If not, that would be the best start.

J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large
clusters. In Proceedings of Operating Systems Design and Implementation, 2004.

~~~
iskander
It seems like a pretty straightforward MapReduce, though I agree that if
you're doing searches repeatedly or your data is small enough you should use a
database.

(map = search a partition and return its top k results, reduce = merge the
per-partition top-k lists into a single top-k result list)
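
In toy Python, with `score` standing in for whatever relevance function you
like and the shuffle elided:

    import heapq

    def search_map(partition, query, score, k):
        # Each mapper scores only its own partition and keeps its top-k.
        scored = ((score(query, doc), doc_id) for doc_id, doc in partition)
        return heapq.nlargest(k, scored)

    def search_reduce(partial_results, k):
        # Merge the per-partition top-k lists into one global top-k list.
        return heapq.nlargest(
            k, (hit for part in partial_results for hit in part))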

------
npguy
What are the author's thoughts on Dremel?

------
elchief
But Dremel...

