"MapReduce is Good Enough" by Twitter scientist (arxiv.org)
36 points by Peteris on Sept 14, 2012 | 29 comments



I don't quite understand the gradient descent section. How do you parallelize stochastic gradient descent? Each iteration changes theta, so any parallel computations will be computing the wrong gradient. His citation for that section doesn't seem to address this, but I only skimmed that paper so I could be wrong.


You are correct that each iteration changes theta. You are also correct that parallel computations will be computing the wrong gradient.

However, this is not bad.

Stochastic gradient descent (non-parallel, simply sequential) also computes the "wrong" gradient. If you compute the gradient over the entire example set, you get the "true" gradient.

Nonetheless, SGD (stochastic gradient descent) converges faster than batch gradient descent, because each update is cheaper (the gradient of one example vs. the gradient over all examples). More importantly, stochastic gradient descent sometimes finds a better local optimum. Batch tends to settle into an easy-to-find local optimum, and gives you no stochasticity that temporarily pops you out of a local minimum and allows you to find a better one. (See LeCun's "Efficient BackProp" paper from '98.)
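
To make the difference concrete, here is a rough sketch of the two update rules (toy linear model with squared loss in plain NumPy; illustrative only, not code from the paper):

    import numpy as np

    def batch_gradient_step(theta, X, y, lr=0.01):
        # One update using the gradient over the *entire* example set
        # (the "true" gradient).
        grad = X.T @ (X @ theta - y) / len(y)
        return theta - lr * grad

    def sgd_epoch(theta, X, y, lr=0.01):
        # One pass of stochastic updates: each step uses the gradient of a
        # single example, a noisy ("wrong") estimate of the true gradient,
        # but far cheaper per update.
        for i in np.random.permutation(len(y)):
            theta = theta - lr * X[i] * (X[i] @ theta - y[i])
        return theta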

[edit: I initially wrote the following: "So computing a stochastic gradient over each example in parallel, and then aggregating them, isn't necessarily a bad idea. It's particularly a good idea when you are operating over Zipfian sparse distributions, like natural language inputs, where many words will be quite rare and infrequently updated in a sequential SGD approach."

Jimmy Lin presents that approach, but points out that the aggregator step is a bottleneck. So, as pointed out by iskander's child comment, each mapper trains SGD on a subset of the examples, and the trained models are aggregated into an ensemble. Thank you for the correction. The point stands that stochastic updates can give superior generalization accuracy to batch updates.]


>So computing a stochastic gradient over each example in parallel, and then aggregating them, isn't necessarily a bad idea.

Wait a minute, I don't think that's what the paper is suggesting. The communication costs of aggregating every micro-step of SGD would be extreme. I think Jimmy Lin is saying you should partition the data, train one model per partition using SGD, and then combine the models in an ensemble.


I still don't get how it's parallelized: what's the reduce step? Is it averaging all the gradients from the maps? If so, that seems like what you were originally saying in your edit (and sounds like the functional equivalent of batch gradient descent).


He doesn't! Each mapper trains a model using stochastic gradient descent on a subset of the data and they are then combined into an ensemble.
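
In MapReduce terms it looks roughly like this (my sketch with a toy linear model, not Lin's actual code):

    import numpy as np

    def train_sgd(X, y, lr=0.01, epochs=5):
        # Plain sequential SGD on one partition (linear model, squared loss).
        theta = np.zeros(X.shape[1])
        for _ in range(epochs):
            for i in np.random.permutation(len(y)):
                theta -= lr * X[i] * (X[i] @ theta - y[i])
        return theta

    def mapper(X_part, y_part):
        # One mapper per data partition; each emits a fully trained model.
        yield ("ensemble", train_sgd(X_part, y_part))

    def reducer(key, thetas):
        # The single reducer just collects the per-partition models.
        yield (key, list(thetas))

    def ensemble_predict(thetas, x):
        # At query time, average the member predictions.
        return np.mean([x @ theta for theta in thetas])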

A more sophisticated alternative is the Terascale Learning Algorithm (http://www.slideshare.net/pauldix/terascale-learning) from John Langford & friends.


I used to do a poor man's version of this for neural networks: train many models in parallel on subsets of my data and then choose the one with the best validation performance (I avoided ensembles for speed reasons). But I often found that with good model parameters and good-quality data most of the models came out very similar, except for a few unlucky ones that found bad local minima. So I usually just downsampled the hell out of my data and built one or two models.


My officemate was playing with training neural networks in parallel and he found that iterating what you were doing gave faster training times and lower error in the long run:

   1. Seed every worker with a random initial parameter
   2. Run SGD locally for some large number of steps
   3. Choose the parameter with lowest training (or validation, but never test) error
   4. Re-seed all workers with the best parameter
   5. Go to 2 until you're satisfied.
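
In rough Python, with the SGD and error functions simplified to a toy linear model (my sketch of what he described, not his code):

    import numpy as np

    def local_sgd(theta, X, y, steps=10000, lr=0.01):
        # Step 2: run SGD locally on this worker's shard of the data.
        for i in np.random.randint(0, len(y), size=steps):
            theta = theta - lr * X[i] * (X[i] @ theta - y[i])
        return theta

    def train_error(theta, X, y):
        return np.mean((X @ theta - y) ** 2)

    def train(shards, dim, rounds=10):
        # Step 1: seed every worker with its own random initial parameters.
        thetas = [np.random.randn(dim) for _ in shards]
        for _ in range(rounds):
            # Step 2 on each worker (sequential here, parallel in practice).
            thetas = [local_sgd(t, X, y) for t, (X, y) in zip(thetas, shards)]
            # Step 3: choose the parameters with the lowest training error.
            best = min(thetas, key=lambda t: sum(train_error(t, X, y)
                                                 for X, y in shards))
            # Step 4: re-seed all workers with the best parameters.
            thetas = [best.copy() for _ in shards]
        # Step 5: repeat for a fixed number of rounds (or until satisfied).
        return best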


What do you mean by parameter? If you mean hyperparameters, that's pretty standard, it's a hyperparameter search followed by training. Randomized hyperparameter search is actually pretty state of the art though (as opposed to a grid search, for example): http://jmlr.csail.mit.edu/papers/volume13/bergstra12a/bergst...


I mean the actual weights of the neural network. After the first iteration, each worker starts with the same weights but performs SGD updates on different subsets of the data. After each epoch, you re-seed the workers with the best weights.


Ah, very interesting.


Funny how Twitter happens to have the guy who invented Storm...


On that note, what are the viable alternatives?

All I'm really aware of are Google Proprietary Magic™ (motherfuckers. I want Colossus baaaaad), CFS from DataStax, Storm (sorta-not-really?), and Spark.


Sector/Sphere is what the Sloan Digital Sky Survey uses.

Instead of supplying map and reduce routines, you implement generic "user defined functions". This gives you more flexibility over how the work is handled, though if you just want to implement map and reduce UDFs, it supposedly gets better performance than Hadoop.

It's also designed to support distributing work over WANs. I think Hadoop really wants every compute node to be on the same LAN.


>I think Hadoop really wants every compute node to be on the same LAN.

Fucking a-right it does. You should see the labyrinthine depths people descend to in order to scale Hadoop. Sub-clusters of sub-clusters, rack-local clusters, Zookeeper nodes all over the place.

It's like fuckin' 'Nam all over again man.



WTF that project is awesome and doesn't even have a page on Wikipedia?!?!?

I don't get it... what's the point of creating awesome software if you don't even make the effort to put the links out to help people find it?


You should write a page for it. You aren't supposed to create pages for your own projects/products on Wikipedia; they should come from neutral parties.


I have a naive question regarding MapReduce.

I tend to observe that MapReduce is often used to do "search requests" that would be more easily answered with indexes.

Am I missing something?


Even leaving aside the cost of storing an index, if you're only going to use that dataset for a small number of queries then it's cheaper to just perform them directly rather than construct an index first. As the paper points out at the end, Hadoop is fundamentally batchy; it works best when you have discrete chunks of data (e.g. weekly) and you want to reduce each one into some summarized form, then never use the raw data again.


The major search engines are using map/reduce to build the indexes that searches are run against.


Where does the index come from?


Indexing can be done on the fly, when adding or modifying entries.


Only if your data is stored in such a way that you can monitor adds and modifications. MapReduce is often used on huge bags of data pulled from random places or as one pass in a massive sequence of passes (think pagerank), where any index will be destroyed by each pass.

MapReduce isn't a database. It avoids mutable state such as an index. It's more like a command line tool such as grep.

You could also use MapReduce to build your index.
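
The classic example is building an inverted index, which fits map/reduce directly (illustrative sketch, Hadoop-Streaming style):

    def mapper(doc_id, text):
        # Emit a (term, doc_id) posting for every term in the document.
        for term in text.lower().split():
            yield (term, doc_id)

    def reducer(term, doc_ids):
        # All postings for a term arrive at the same reducer;
        # emit the term's complete posting list.
        yield (term, sorted(set(doc_ids)))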

Finally, it is a simple example. Like how people use fib to demo parallelism.


Thanks for your answer. My point was exactly what you're implying: wouldn't it be cheaper to insert the data into an ad hoc "database" instead of using Hadoop?


What kind of database are you going to insert into if you have 1000 TB of data? Let alone an open source one. What kind of database is going to allow you to set everything up and strip it all down just for one of a million passes of your data? Do you have a simple database that you can distribute over thousands of nodes?

Searching isn't really a MapReduce problem anyway - think, what is the map? What is the reduce? They're not really any kind of computation are they?

If you want to understand why MapReduce, find a better motivating example than search. PageRank is the classic.
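
For the flavor of it, one PageRank iteration as map/reduce looks roughly like this (a simplified sketch, ignoring dangling nodes and convergence checks):

    DAMPING = 0.85

    def mapper(page, value):
        rank, out_links = value                 # value = (current_rank, outgoing_links)
        yield (page, ("links", out_links))      # pass the graph structure through
        for target in out_links:
            yield (target, ("rank", rank / len(out_links)))

    def reducer(page, values):
        out_links, incoming = [], 0.0
        for kind, payload in values:
            if kind == "links":
                out_links = payload
            else:
                incoming += payload
        new_rank = (1 - DAMPING) + DAMPING * incoming
        yield (page, (new_rank, out_links))     # feeds the next iteration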

Have you read the paper? If not, that would be the best start.

J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. In Proceedings of Operating Systems Design and Implementation, 2004.


It seems like a pretty straightforward MapReduce, though I agree that if you're doing searches repeatedly or your data is small enough you should use a database.

(map = search a partition and return its top k results, reduce = merge the per-partition top-k lists into a single top-k result list)
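
As a sketch (the score function here is a stand-in for whatever per-document relevance scoring you use):

    import heapq

    def mapper(query, partition_docs, k=10):
        # Score this partition's documents and keep only its local top k.
        # score() is a hypothetical relevance function.
        scored = [(score(query, doc), doc) for doc in partition_docs]
        for item in heapq.nlargest(k, scored, key=lambda t: t[0]):
            yield ("results", item)

    def reducer(key, scored_docs, k=10):
        # Merge the per-partition lists into one global top-k result list.
        yield (key, heapq.nlargest(k, scored_docs, key=lambda t: t[0]))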


That's what I was telling you. A lot of people use MapReduce for simple search tasks, hence my remark.


What are the author's thoughts on Dremel?


But Dremel...



