"MapReduce is Good Enough" by Twitter scientist (arxiv.org)
36 points by Peteris on Sept 14, 2012 | 29 comments

I don't quite understand the gradient descent section. How do you parallelize stochastic gradient descent? Each iteration changes theta, so any parallel computations will be computing the wrong gradient. His citation for that section doesn't seem to address this, but I only skimmed that paper so I could be wrong.

You are correct that each iteration changes theta. You are also correct that parallel computations will be computing the wrong gradient.

However, this is not bad.

Stochastic gradient descent (non-parallel, simply sequential) also computes the "wrong" gradient. If you compute the gradient over the entire example set, you get the "true" gradient.

Nonetheless, SGD (stochastic gradient descent) converges faster than batch gradient descent, because each update is cheaper (computing the gradient of one example vs. all examples). More importantly, stochastic gradient descent sometimes finds a better local optimum. Batch descent tends to settle into an easy-to-find local minimum, and there is no stochasticity to temporarily pop you out of it and let you find a better one. (See LeCun's 1998 paper, "Efficient BackProp".)
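The difference is easy to see on a toy problem. Below is a minimal, self-contained sketch (invented data and learning rate, not from the paper) contrasting one batch update with a sweep of per-example SGD updates on a 1-D least-squares fit:

```python
import random

# Toy 1-D least-squares fit y = w * x, contrasting one "true"
# batch gradient update with a sweep of cheap per-example SGD
# updates. Data and learning rate are invented for illustration.
data = [(x, 3.0 * x) for x in [0.5, 1.0, 1.5, 2.0]]  # true w = 3

def batch_step(w, lr=0.05):
    # one update using the gradient over the entire example set
    grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
    return w - lr * grad

def sgd_sweep(w, lr=0.05):
    # one cheap, "wrong"-gradient update per example
    random.shuffle(data)
    for x, y in data:
        w -= lr * 2 * (w * x - y) * x
    return w

w_batch = w_sgd = 0.0
for _ in range(100):
    w_batch = batch_step(w_batch)
    w_sgd = sgd_sweep(w_sgd)
```

Both converge to w = 3 here; the point is that one SGD sweep makes four updates for roughly the same number of gradient evaluations as a single batch step.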

[edit: I initially wrote the following: "So computing a stochastic gradient over each example in parallel, and then aggregating them, isn't necessarily a bad idea. It's particularly a good idea when you are operating over Zipfian sparse distributions, like natural language inputs, where many words are quite rare and infrequently updated in a sequential SGD approach."

Jimmy Lin presents that approach, but points out that the aggregator step is a bottleneck. So, as pointed out by iskander's child comment, each mapper trains SGD on a subset of the examples, and the trained models are aggregated into an ensemble. Thank you for the correction. The point stands that stochastic updates can give superior generalization accuracy to batch updates.]

>So computing a stochastic gradient over each example in parallel, and then aggregating them, isn't necessarily a bad idea.

Wait a minute, I don't think that's what the paper is suggesting. The communication costs of aggregating every micro-step of SGD would be extreme. I think Jimmy Lin is saying you should partition the data, train one model per partition using SGD, and then combine the models in an ensemble.

I still don't get how it's parallelized, what's the reduce step? Is it averaging all the gradients from the maps? If so, that seems like what you were originally saying from your edit (and sounds like the functional equivalent of batch gradient descent).

He doesn't! Each mapper trains a model using stochastic gradient descent on a subset of the data and they are then combined into an ensemble.
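A minimal sketch of that scheme, with the mappers simulated sequentially (data, model, and hyperparameters invented, not Jimmy Lin's actual code):

```python
import random

# Each "mapper" runs SGD on its own partition of the data, and
# the reducer combines the per-partition models into an ensemble
# by averaging their predictions. Data, model, and hyperparameters
# are invented for illustration.
random.seed(0)
xs = [random.uniform(0, 1) for _ in range(400)]
data = [(x, 2.0 * x + random.gauss(0, 0.1)) for x in xs]

def train_sgd(partition, lr=0.1, epochs=20):
    w = 0.0
    for _ in range(epochs):
        for x, y in partition:
            w -= lr * 2 * (w * x - y) * x
    return w

# "map": one independent SGD model per partition
partitions = [data[i::4] for i in range(4)]
models = [train_sgd(p) for p in partitions]

# "reduce": the ensemble prediction averages the models
def predict(x):
    return sum(w * x for w in models) / len(models)
```

The only communication is shipping one trained model per partition back to the reducer, rather than a gradient per example.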

A more sophisticated alternative is the Terascale Learning Algorithm (http://www.slideshare.net/pauldix/terascale-learning) from John Langford & friends.

I used to do a cheap man's version of this for neural networks: train many models in parallel on subsets of my data and then choose the one with the best validation performance (I avoided ensembles for speed reasons). But I often found with good model parameters and good quality data most of the models came out very similarly, except a few unlucky ones who found bad local minima. So I usually just downsampled the hell out of my data and built one or two models.

My officemate was playing with training neural networks in parallel, and he found that repeating what you describe in a loop gave faster training times and lower error in the long run:

   1. Seed every worker with a random initial parameter
   2. Run SGD locally for some large number of steps
   3. Choose the parameter with lowest training (or validation, but never test) error
   4. Re-seed all workers with the best parameter
   5. Go to 2 until you're satisfied.

What do you mean by parameter? If you mean hyperparameters, that's pretty standard, it's a hyperparameter search followed by training. Randomized hyperparameter search is actually pretty state of the art though (as opposed to a grid search, for example): http://jmlr.csail.mit.edu/papers/volume13/bergstra12a/bergst...

I mean the actual weights of the neural network. After the first iteration, each worker starts with the same weights but performs SGD updates on different subsets of the data. After each epoch, you re-seed the workers with the best weights.
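A rough sketch of that loop, with the workers simulated sequentially (the model, data, and step counts are invented for illustration):

```python
import random

# Re-seeding loop: every epoch, each worker runs SGD on its own
# shard, the best weights are selected by training error, and all
# workers restart from them. Model and data are invented.
random.seed(1)
data = [(x, 4.0 * x) for x in [random.uniform(0, 1) for _ in range(200)]]
shards = [data[i::4] for i in range(4)]  # one data subset per worker

def sgd(w, shard, lr=0.1, steps=200):
    for _ in range(steps):
        x, y = random.choice(shard)
        w -= lr * 2 * (w * x - y) * x
    return w

def training_error(w):
    return sum((w * x - y) ** 2 for x, y in data) / len(data)

# 1. seed every worker with a random initial parameter
weights = [random.uniform(-1, 1) for _ in shards]
for _ in range(5):
    # 2. each worker runs SGD locally on its own shard
    weights = [sgd(w, s) for w, s in zip(weights, shards)]
    # 3. pick the weights with the lowest training error
    best = min(weights, key=training_error)
    # 4. re-seed all workers with the best weights, then repeat
    weights = [best] * len(shards)
```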

Ah, very interesting.

Funny how Twitter happens to have the guy who invented Storm...

On that note, what are the viable alternatives?

All I'm really aware of are Google Proprietary Magic™ (motherfuckers. I want Colossus baaaaad), CFS from DataStax, Storm (sorta-not-really?), and Spark.

Sector/Sphere is what the Sloan Digital Sky Survey uses.

Instead of supplying map and reduce routines, you implement generic "user defined functions". This gives you more flexibility in how the work is handled, and even if you just implement map and reduce as UDFs, it supposedly gets better performance than Hadoop.

It's also designed to support distributing work over WANs. I think Hadoop really wants every compute node to be on the same LAN.

>I think Hadoop really wants every compute node to be on the same LAN.

Fucking a-right it does. You should see the labyrinthine depths people descend to in order to scale Hadoop. Sub-clusters of sub-clusters, rack-local clusters, Zookeeper nodes all over the place.

It's like fuckin' 'Nam all over again man.

WTF that project is awesome and doesn't even have a page on Wikipedia?!?!?

I don't get it... what's the point of creating awesome software if you don't even make the effort to put the links out to help people find it?

You should write a page for it. You aren't supposed to create pages for your own projects/products on Wikipedia; they should come from neutral parties.

I have a naive question regarding MapReduce.

I tend to observe that MapReduce is often used to do "search requests" that would be more easily answered with indexes.

Am I missing something?

Even leaving aside the cost of storing an index, if you're only going to use that dataset for a small number of queries then it's cheaper to just perform them directly rather than construct an index first. As the paper points out at the end, Hadoop is fundamentally batchy; it works best when you have discrete chunks of data (e.g. weekly) and you want to reduce each one into some summarized form, then never use the raw data again.

The major search engines are using map/reduce to build the indexes that searches are run against.

Where does the index come from?

Indexing can be done on the fly, when adding or modifying entries.

Only if your data is stored in such a way that you can monitor adds and modifications. MapReduce is often used on huge bags of data pulled from random places or as one pass in a massive sequence of passes (think pagerank), where any index will be destroyed by each pass.

MapReduce isn't a database. It avoids mutable state such as an index. It's more like a command line tool such as grep.

You could also use MapReduce to build your index.

Finally, it is a simple example. Like how people use fib to demo parallelism.
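As a toy illustration of that last point, here is an inverted index built in map/reduce style (documents invented): map emits (word, doc_id) pairs, and the reduce side collects them into postings lists.

```python
from collections import defaultdict

# Using map/reduce to *build* an index rather than to serve
# queries. Documents are invented for illustration.
docs = {
    1: "mapreduce is good enough",
    2: "mapreduce is not a database",
}

def mapper(doc_id, text):
    # emit one (word, doc_id) pair per token
    for word in text.split():
        yield word, doc_id

def reducer(pairs):
    # the shuffle/sort phase groups pairs by key before reducing
    index = defaultdict(list)
    for word, doc_id in sorted(pairs):
        index[word].append(doc_id)
    return dict(index)

pairs = [kv for doc_id, text in docs.items() for kv in mapper(doc_id, text)]
index = reducer(pairs)
```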

Thanks for your answer. My point was exactly what you're implying: wouldn't it be cheaper to insert the data into an ad hoc "database" instead of using Hadoop?

What kind of database are you going to insert into if you have 1000 TB of data? Let alone an open source one. What kind of database is going to allow you to set everything up and strip it all down just for one of a million passes of your data? Do you have a simple database that you can distribute over thousands of nodes?

Searching isn't really a MapReduce problem anyway - think: what is the map? What is the reduce? They're not really any kind of computation, are they?

If you want to understand why MapReduce, find a better motivating example than search. PageRank is the classic.

Have you read the paper? If not, that would be the best start.

J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. In Proceedings of Operating Systems Design and Implementation, 2004.

It seems like a pretty straightforward MapReduce, though I agree that if you're doing searches repeatedly or your data is small enough you should use a database.

(map = search a partition and return top k results, reduce = combine multiple n*k result lists into a single result list)
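A minimal sketch of that map/reduce (docs and scores invented for illustration):

```python
import heapq

# Each mapper returns the top-k results from its own partition,
# and the reducer merges the n partial lists into one global
# top-k list. Docs and scores are invented for illustration.
K = 3
partitions = [
    [("doc1", 0.9), ("doc2", 0.4), ("doc3", 0.1)],
    [("doc4", 0.8), ("doc5", 0.7)],
    [("doc6", 0.5), ("doc7", 0.3), ("doc8", 0.2)],
]

def map_top_k(partition):
    # "search" one partition: keep only its k best-scoring docs
    return heapq.nlargest(K, partition, key=lambda r: r[1])

def reduce_top_k(result_lists):
    # merge the n partial top-k lists into a single top-k list
    merged = [r for lst in result_lists for r in lst]
    return heapq.nlargest(K, merged, key=lambda r: r[1])

top = reduce_top_k(map_top_k(p) for p in partitions)
```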

That's what I was telling you. A lot of people use MapReduce for simple search tasks, hence my remark.

What are the author's thoughts on Dremel?

But Dremel...
