For reference, I have a copy of my solutions here: https://github.com/danluu/UFLDL-tutorial. Debugging broken learning algorithms can be tedious in a way that's not particularly educational, so I tried to find a reference I could compare against when I was doing the exercises, and every copy I found had bugs. Hope having this reference helps someone.
1. When does it make sense to apply deep learning? Could it potentially be successfully applied to any difficult problem given enough data? Could it also be good at the types of problems that Random Forests and Gradient Boosting Machines are traditionally good at, versus the problems that SVMs are traditionally good at (computer vision, NLP)?
2. How much data is enough?
3. What degree of tuning is required to make it work? Are we at the point yet where deep learning works more or less out of the box?
4. Is it fair to say that dropout and maxout always work better in practice? 
5. What is the computational effort? How long, e.g., does it take to classify an ImageNet image (on a CPU / GPU)? How long does it take to train a model like that?
6. How on earth does this fit into memory? Say in ImageNet you have (256 pixels * 256 pixels) * (10,000 classes) * 4 bytes = 2.4 GB, for a NN without any hidden layers.
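A quick back-of-the-envelope check of the estimate in #6 (a minimal sketch; treating the input as a single grayscale 256x256 image and the weights as float32 is my assumption):

```python
# Memory estimate for a linear (no hidden layer) classifier on
# ImageNet-sized inputs: one weight per (input pixel, class) pair.
pixels = 256 * 256          # input dimension
classes = 10_000            # output dimension
bytes_per_weight = 4        # float32

weights = pixels * classes
gib = weights * bytes_per_weight / 2**30
print(f"{weights:,} weights -> {gib:.2f} GiB")
```

which comes out to about 2.44 GiB, matching the figure above.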
 I am overgeneralizing somewhat, I know. It's my way of avoiding overfitting.
 My lunch today was free.
#5) has some Python code and timings mixed into the docs. One such example (stacked denoising autoencoders on MNIST):
By default the code runs 15 pre-training epochs for each layer,
with a batch size of 1. The corruption level for the first layer is
0.1, for the second 0.2, and 0.3 for the third. The pretraining
learning rate is 0.001 and the finetuning learning rate is
0.1. Pre-training takes 585.01 minutes, with an average of 13
minutes per epoch. Fine-tuning is completed after 36 epochs in
444.2 minutes, with an average of 12.34 minutes per epoch. The
final validation score is 1.39% with a testing score of
1.3%. These results were obtained on a machine with an Intel Xeon
E5430 @ 2.66GHz CPU, with a single-threaded GotoBLAS.
It's ironic that deep neural networks have become the biggest machine learning breakthrough of 2013: they were also the biggest machine learning breakthrough of 1957. The idea dates back to the Perceptron, one of the oldest ideas in AI.
One thing to note: although there was a lot of initial excitement about Restricted Boltzmann Machines, autoencoders, and other unsupervised approaches, the best results in the last year or so have all used the conventional back-propagation algorithm from 1974, with a few tweaks.
Ben Lorica wrote a good article on the latest deep learning research from Google, and what's changed since neural networks were last popular in the 1980's:
What's old is new again.
The next big boom in AI (ignoring some logic/rules-based research in the 70s that I don't think is very interesting from an AI perspective) occurred in the 80s, when computational power increased and researchers discovered/rediscovered neural approaches, including the obscure 1974 research on backpropagation. This led to tons of press and funding from governments who dreamed of killer AI robots and what-not. But, once again, imagination raced ahead of reality and funding dried up when said robots didn't materialize. The field didn't really die off, but funding in AI went way down, leading to another major "AI winter".
I'd say the next big era of AI is the one we're in, driven largely by applied statistics that became known as "machine learning". This has been by far the most successful era, and has probably added 100s of billions of dollars to the economy (I'd argue Google is a machine learning company, for example). I think it's also the most pragmatic era, as people in the field have really learned from the past mistakes of overpromising. In fact, when I was studying "AI" in grad school, my professors warned me to always refer to what I did as machine learning because the concept of "intelligence" was such a joke to so many in the field.
1. Linear Regression (which, admittedly, was amazing)
2. Fourier Analysis (which is linear regression on orthonormal bases of functions. it blew people's minds)
3. Perceptrons (which is linear regression but with a logistic loss. it went back to its old name of "logistic regression" once its insane cachet of biological plausibility faded)
4. Neural Networks (stack of logistic regressors. popular with people who didn't know how to filter their inputs through fixed or random nonlinearities before applying linear regression)
5. Self Organizing Maps and Recurrent Nets (which were neural nets that feed back on themselves)
6. Fractals (which is recursion. they were useful for enticing children into math classes)
7. Chaos (which is recursion that's hard to model. useful for movie plots)
8. Wavelets (which is recursive Fourier analysis, and probably still way under-used)
9. Support Vector Machines (which replaces logistic regression's smooth loss with a kink that makes it hard to use a fast optimizer. often conflated with the "kernel trick", which appealed to people who didn't want to pass their inputs through nonlinearities explicitly)
10. Deep Nets (which are bigger neural networks. the jury's out on whether they work better because they're deeper, or because they're bigger and require a lot of data to train, or because they require a programmer to spend years developing a learning algorithm for each new dataset. also whether they do actually work better).
Once this Deep Net thing blows over again, my money's on Kernelized Recurrent Deep Self Organizing Maps.
(On a serious note: MNIST is considered a trivial dataset and doesn't require the heavy machinery of deep nets. Linear regression on almost any random nonlinearity applied to the data (say f(x; w, t) = cos(w'x + t) with w ~ N(0, I) and t ~ U[0, 2π]) will get you >98% accuracy on MNIST.)
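A minimal sketch of that random-nonlinearity trick, using scikit-learn's small 8x8 digits dataset as a stand-in for MNIST (the rescaling of w to set the kernel bandwidth is my own tweak for the smaller images, so don't expect the exact >98% figure here):

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X, y = load_digits(return_X_y=True)  # 1797 8x8 digit images, labels 0-9
X = X / 16.0                         # scale pixel values to [0, 1]

# Random nonlinearity: f(x; w, t) = cos(w'x + t), w ~ N(0, I), t ~ U[0, 2*pi].
# I rescale w by sqrt(2*gamma) to widen the implied RBF kernel for these
# small images -- a hand-picked bandwidth, not part of the quoted recipe.
n_features, gamma = 1000, 0.05
W = np.sqrt(2 * gamma) * rng.standard_normal((X.shape[1], n_features))
t = rng.uniform(0, 2 * np.pi, n_features)
F = np.cos(X @ W + t)

# Plain linear (logistic) classifier on top of the fixed random features.
Xtr, Xte, ytr, yte = train_test_split(F, y, test_size=0.3, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(Xtr, ytr)
print(f"test accuracy: {clf.score(Xte, yte):.3f}")
```

Note the features are never trained; only the linear classifier on top is.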
I would say that the next big thing is more:
Realizing even more that neural nets are an optimization problem and, instead of using some heuristics, waiting for some Russian mathematician to derive the right SGD schedule / batch solver for the problem. Then what 1,000 Google computers have been able to do for the cat face detector, we'll be able to do on a smartphone chip.
People have to realize that Deep Learning is a bit of a "brute force" solution for the moment (each node is a linear model). We need to derive smarter algorithms.
One of the best breakthroughs has been this notion of layer-wise pretraining, which allows the backpropagation algorithm to not get stuck in local minima so easily. It provides a good guess at the starting points for the weights. Otherwise, the biggest issue with backpropagation historically has been the diffusion of weights as the number of layers increases; it is hard to attribute the causality, or what portion of the update should be applied to each node, since the number of paths grows exponentially. This pretraining idea helps against that.
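The "diffusion" problem described above (more commonly called vanishing gradients) is easy to see numerically. A toy sketch, assuming a deep sigmoid network with naively scaled random weights (all sizes and scales here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# A toy 10-layer sigmoid net with naive N(0, 0.1) weights: backpropagate a
# unit error and watch the gradient norm decay as it moves toward the input.
depth, width = 10, 100
Ws = [0.1 * rng.standard_normal((width, width)) for _ in range(depth)]

# Forward pass, keeping activations for the backward pass.
a = rng.standard_normal(width)
acts = [a]
for W in Ws:
    a = sigmoid(W @ a)
    acts.append(a)

# Backward pass: delta_l = (W_l^T delta_{l+1}) * sigma'(z_l),
# with sigma'(z) = a * (1 - a) for the sigmoid.
delta = np.ones(width)
norms = []
for W, a in zip(reversed(Ws), reversed(acts[1:])):
    delta = (W.T @ delta) * a * (1 - a)
    norms.append(np.linalg.norm(delta))

print([f"{n:.2e}" for n in norms])  # norms shrink layer by layer
```

Each layer multiplies the error signal by a factor well below 1 (the sigmoid derivative is at most 0.25), so the earliest layers see almost no gradient; pretraining and careful initialization are two ways around this.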
In 2006, Hinton introduced greedy layer-wise pretraining, which was intended to solve the problem of backpropagation getting stuck in poor local optima. The theory was that you'd pretrain to find a good initial set of connection weights, then apply backprop to "fine-tune" discriminatively. And the theory seemed correct since the experimental results were good:
Does pretraining truly help solve the problem of poor local optima? In 2010, some empirical studies suggested the answer was yes:
But that same year, a student in Geoff Hinton's lab discovered that if you added information about the 2nd-derivatives of the loss function to backpropagation ("Hessian-free optimization"), you could skip pretraining and get the same or better results:
And around 2012, a bunch of researchers reported that you don't even need 2nd-derivative information. You just have to initialize the neural net properly. Apparently, all the most recent results in speech recognition just use standard backpropagation with no unsupervised pretraining. (Although people are still trying more complex variants of unsupervised pretraining algorithms, often involving multiple types of layers in the neural network.)
So now, after seven years of work, we're back where we started: the plain ol' backpropagation algorithm from 1974 worked all along.
This whole topic is really interesting to me from a history of science perspective. What other old, discarded ideas from the past might be ripe, now that we have millions of times more data and computation?
Definitely going to be researching this more throughout the year.
Machine learning is really just a form of non-human scripting. After all, every ML system running on a PC is either Turing equivalent or less. An analogy would be something that tries to generate the minimal set of regular expressions (matching non-deterministically) which cover given examples. The advantage of an ML model vs. a collection of regexes is that many interesting problems are amenable to calculus (optimization) or counting (probability, integration, etc.).
So like good notation, the stacking allows more complicated things to be said more compactly. But more complicated things need more explanation and more thinking to understand.
This sounds very interesting. How do you properly initialize the weights? Do you have a link to a paper about this?
Practical recommendations for gradient-based training of deep architectures, Y. Bengio
There is a section on weight initialization on page 15. In general, this paper has a lot of good information in one place.
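For reference, the "normalized initialization" heuristic covered there (due to Glorot & Bengio, 2010) is basically a one-liner; a minimal sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

def normalized_init(fan_in, fan_out):
    """Normalized (Glorot) initialization: uniform weights in
    [-sqrt(6/(fan_in+fan_out)), +sqrt(6/(fan_in+fan_out))], chosen to keep
    activation and gradient variances roughly constant across layers."""
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_out, fan_in))

# Hypothetical layer sizes, e.g. 784 inputs (28x28 image) -> 256 hidden units.
W = normalized_init(784, 256)
print(W.shape)
```

The layer sizes here are just an example; the point is that the scale of the initial weights depends on both fan-in and fan-out.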
(Thus concludes the smartest thing I've said all day.)
DISCLAIMER: This is my event
We'll also post that and any updates to the main site on Friday.
It appears to be a bug in the combination of the ::selection pseudo-element with the font Georgia. My guess is it's a Chrome-on-Mac bug (Firefox is fine), not a site coding error.
Disabling either the font or the selection style fixes it. Most likely a text rendering issue. At work we've noticed Chrome getting buggier in that area, as well as retaining DOM node properties across redraws.
I know how to code through self learning, and I've pretty much solely done web development. So I barely know much CS. Also not very good at math.
So what are the essential prerequisites you would say are necessary for doing neat, useful stuff with machine learning?
Linear algebra. Bayesian statistics. MUST know these inside out, upside down.
Vector calculus. Convex optimization.
A boatload of machine learning literature. The ideas coalescing into deep learning are based on more than a decade of research.
If you know nothing about math... I can't imagine getting to the point of understanding deep learning (which is a fairly rapidly evolving area) without at least 2-3 years of very hard work.
This class is a reasonable attempt to give a quick intro to one major source for DBNs: https://www.coursera.org/course/neuralnets. Understanding this course is a good benchmark.
Also, Andrew Ng's coursera course on machine learning is amazing (https://www.coursera.org/course/ml) as well as Norvig and Thrun's Udacity course on AI (https://www.udacity.com/course/cs271)
NIPS, a big ML conference, is in December, so expect to see a large amount of new ideas and applications re: deep learning to come out of that.