
Deep Learning 101 - mbeissinger
http://markus.com/deep-learning-101/
======
luu
Personally, I've found that I don't retain much of this sort of material
without working through exercises. If you learn the same way, you might want
to check out the series of progressive exercises from Andrew Ng here:
[http://ufldl.stanford.edu/wiki/index.php/UFLDL_Tutorial](http://ufldl.stanford.edu/wiki/index.php/UFLDL_Tutorial)

For reference, I have a copy of my solutions here:
[https://github.com/danluu/UFLDL-tutorial](https://github.com/danluu/UFLDL-tutorial).
Debugging broken learning algorithms can be tedious in a way that's
not particularly educational, so I tried to find a reference I could compare
against when I was doing the exercises, and every copy I found had bugs. Hope
having this reference helps someone.

~~~
msvan
For some more elementary material, I also recommend Andrew Ng's machine
learning course on Coursera. He's a great teacher.

~~~
kot-behemoth
Link for the impatient:
[https://www.coursera.org/course/ml](https://www.coursera.org/course/ml).
Looks great indeed!

------
ma2rten
I've followed the developments in Neural Networks somewhat, but have never
applied deep learning so far. This seems like a good place to ask a couple of
questions I've been having for a while.

1\. When does it make sense to apply deep learning? Could it potentially be
applied successfully to any difficult problem given enough data? Could it also
be good at the types of problems that Random Forests and Gradient Boosting
Machines are traditionally good at, versus the problems that SVMs are
traditionally good at (Computer Vision, NLP)? [1]

2\. How much data is enough?

3\. What degree of tuning is required to make it work? Are we at the point yet
where deep learning works more or less out of the box?

4\. Is it fair to say that dropout and maxout always work better in practice?
[2]

5\. What is the computational effort? How long does it take, e.g., to classify
an ImageNet image (on a CPU / GPU)? How long does it take to train a model like
that?

6\. How on earth does this fit into memory? Say in ImageNet you have (256
pixels * 256 pixels) * (10,000 classes) * 4 bytes = 2.4 GB, for an NN without
any hidden layers.

[1] I am overgeneralizing somewhat, I know. It's my way to avoid overfitting.

[2] My lunch today was free.

~~~
dwiel
I don't have great answers to the other questions, though I too am interested
in them.

#5) [1] has some Python code and timings mixed into the docs. One such
example (stacked denoising autoencoders on MNIST):

    
    
        By default the code runs 15 pre-training epochs for each layer,
        with a batch size of 1. The corruption level for the first layer is
        0.1, for the second 0.2, and 0.3 for the third. The pre-training
        learning rate is 0.001 and the fine-tuning learning rate is
        0.1. Pre-training takes 585.01 minutes, with an average of 13
        minutes per epoch. Fine-tuning is completed after 36 epochs in
        444.2 minutes, with an average of 12.34 minutes per epoch. The
        final validation score is 1.39% with a testing score of
        1.3%. These results were obtained on a machine with an Intel Xeon
        E5430 @ 2.66GHz CPU, with a single-threaded GotoBLAS.
    

#6) The size of the NN is typically not num_features * num_classes, but rather
roughly num_features * num_hidden_units for the first layer, plus the weights
between successive hidden layers, of which there are commonly 3-10 or so. If
you want a (multi-class) classifier, you first feed your neural network a bunch
of examples, unsupervised. Then once you've got your NN built, you feed the
outputs of the NN to a classifier like an SVM or a linear model trained with
SGD (a rough sketch of the two-phase recipe is below). The idea is that the net
provides more meaningful features than you would have if you used hand-crafted
features or the raw input data itself.

[1]
[http://deeplearning.net/tutorial/SdA.html#sda](http://deeplearning.net/tutorial/SdA.html#sda)
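
A minimal sketch of that two-phase recipe, with some loud assumptions: a tiny
one-layer autoencoder in plain NumPy stands in for the deep net, scikit-learn's
LogisticRegression stands in for the downstream classifier, and the data is
random noise rather than ImageNet. The point is just the data flow:
unsupervised feature learning first, supervised classification on the learned
features second.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    X = rng.random((500, 64))               # 500 unlabeled examples, 64 raw features
    y = (X.mean(axis=1) > 0.5).astype(int)  # toy labels, used only in phase 2

    n_hidden, lr = 16, 0.5
    W = rng.normal(0.0, 0.1, (64, n_hidden))  # tied weights: encode with W, decode with W.T
    b, c = np.zeros(n_hidden), np.zeros(64)
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

    # Phase 1 (unsupervised): minimize squared reconstruction error by gradient descent.
    for _ in range(200):
        H = sigmoid(X @ W + b)                 # encode
        X_hat = sigmoid(H @ W.T + c)           # decode
        D = (X_hat - X) * X_hat * (1 - X_hat)  # gradient at the decoder pre-activation
        E = (D @ W) * H * (1 - H)              # gradient at the encoder pre-activation
        W -= lr * (X.T @ E + D.T @ H) / len(X)
        b -= lr * E.sum(axis=0) / len(X)
        c -= lr * D.sum(axis=0) / len(X)

    # Phase 2 (supervised): the classifier sees learned features, not raw input.
    features = sigmoid(X @ W + b)
    clf = LogisticRegression().fit(features, y)
    print("train accuracy on learned features:", clf.score(features, y))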

~~~
ma2rten
My understanding is that this unsupervised pre-training approach has already
fallen out of fashion:

[https://plus.google.com/+YannLeCunPhD/posts/UVT2fYTfoAC](https://plus.google.com/+YannLeCunPhD/posts/UVT2fYTfoAC)

------
brandonb
This is a cool tutorial!

It's ironic that deep neural networks have become the biggest machine learning
breakthrough of 2013: they were also the biggest machine learning breakthrough
of 1957. The idea dates back to the Perceptron, one of the oldest ideas in AI.

One thing to note: although there was a lot of initial excitement about
Restricted Boltzmann Machines, auto-encoders, and other unsupervised
approaches, the best results in the last year or so have all used the
conventional back-propagation algorithm from 1974, with a few tweaks.
[http://en.wikipedia.org/wiki/Backpropagation](http://en.wikipedia.org/wiki/Backpropagation)

Ben Lorica wrote a good article on the latest deep learning research from
Google, and what's changed since neural networks were last popular in the
1980's: [http://strata.oreilly.com/2013/10/deep-learning-oral-traditions.html](http://strata.oreilly.com/2013/10/deep-learning-oral-traditions.html)

What's old is new again.

~~~
rm999
The history of AI is really interesting. Perceptrons were extremely oversold
by their inventor, Frank Rosenblatt, after he introduced them in 1958. This led
to a lot of funding and interest in AI and perceptrons. Then, in 1969, Marvin
Minsky coauthored a book _Perceptrons_ which harshly criticized how
underpowered perceptrons were. Most famously, the book proved that a
perceptron could not model a simple XOR function. In other words, a technique
that many had been led to believe would one day emulate human-like
intelligence couldn't even emulate a dead-simple logic gate! The book was
devastating and effectively led to a dark age of AI where funding and interest
dried up (later, the term "AI winter" was coined).
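
The XOR result is easy to reproduce today (a quick illustration with
scikit-learn's Perceptron, which is my choice of tooling, not anything from the
book): no line separates the XOR points, so a single-layer perceptron can get
at most 3 of the 4 cases right.

    import numpy as np
    from sklearn.linear_model import Perceptron

    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
    y = np.array([0, 1, 1, 0])         # XOR truth table

    clf = Perceptron(max_iter=1000).fit(X, y)
    print(clf.score(X, y))             # at most 0.75: XOR is not linearly separable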

The next big boom in AI (ignoring some logic/rules-based research in the 70s
that I don't think is very interesting from an AI perspective) occurred in the
80s, when computational power increased and researchers
discovered/rediscovered neural approaches, including the obscure 1974 research
on backpropagation. This led to tons of press and funding from governments who
dreamed of killer AI robots and what-not. But, once again, imagination raced
ahead of reality and funding dried up when said robots didn't materialize. The
field didn't really die off, but funding in AI went way down, leading to
another major "AI winter".

I'd say the next big era of AI is the one we're in, driven largely by the
applied statistics that became known as "machine learning". This has been by
far the most successful era, and has probably added hundreds of billions of
dollars to the economy (I'd argue Google is a machine learning company, for
example). I think
it's also the most pragmatic era, as people in the field have really learned
from the past mistakes of overpromising. In fact, when I was studying "AI" in
grad school, my professors warned me to always refer to what I did as machine
learning because the concept of "intelligence" was such a joke to so many in
the field.

~~~
rahimiali
Signal processing mysticism repeats itself every 20 years and has been fueled
by tremendous hype since its debut 400 years ago:

1\. Linear Regression (which, admittedly, was amazing)

2\. Fourier Analysis (which is linear regression on orthonormal bases of
functions. it blew people's minds)

3\. Perceptrons (which is linear regression but with a logistic loss. it went
back to its old name of "logistic regression" once its insane cachet of
biological plausibility faded)

4\. Neural Networks (stack of logistic regressors. popular with people who
didn't know how to filter their inputs through fixed or random nonlinearities
before applying linear regression)

5\. Self Organizing Maps and Recurrent Nets (which were neural nets that feed
back on themselves)

6\. Fractals (which is recursion. they were useful for enticing children into
math classes)

7\. Chaos (which is recursion that's hard to model. useful for movie plots)

8\. Wavelets (which is recursive Fourier analysis, and probably still way
under-used)

9\. Support Vector Machines (which replaces logistic regression's smooth loss
with a kink that makes it hard to use a fast optimizer. often conflated with
the "kernel trick", which appealed to people who didn't want to pass their
inputs through nonlinearities explicitly)

10\. Deep Nets (which are bigger neural networks. the jury's out on whether
they work better because they're deeper, or because they're bigger and require
a lot of data to train, or because they require a programmer to spend years
developing a learning algorithm for each new dataset. also on whether they do
actually work better).

Once this Deep Net thing blows over again, my money's on Kernelized Recurrent
Deep Self Organizing Maps.

(On a serious note: MNIST is considered a trivial dataset and doesn't require
the heavy machinery of deep nets. linear regression on almost any random
nonlinearity applied to the data (say f(x;w,t) = cos(w'x + t) with w ~ N(0,I)
and t ~ U[0, 2*pi]) will get you >98% accuracy on MNIST.)
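
that recipe in a few lines, hedged: scikit-learn's small 8x8 digits dataset
stands in for MNIST (so the >98% figure won't reproduce exactly here), and a
ridge classifier stands in for linear regression.

    import numpy as np
    from sklearn.datasets import load_digits
    from sklearn.linear_model import RidgeClassifier
    from sklearn.model_selection import train_test_split

    X, y = load_digits(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    rng = np.random.default_rng(0)
    d, D = X.shape[1], 2000                # input dim, number of random features
    W = 0.1 * rng.standard_normal((d, D))  # w ~ N(0, sigma^2 I); sigma is a knob
    t = rng.uniform(0.0, 2.0 * np.pi, D)   # t ~ U[0, 2*pi]
    phi = lambda Z: np.cos(Z @ W + t)      # the fixed random nonlinearity

    clf = RidgeClassifier().fit(phi(X_train), y_train)
    print("test accuracy:", clf.score(phi(X_test), y_test))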

~~~
nrmn
Could you explain "filtering their inputs through fixed or random
nonlinearities"? I haven't heard of this before.

~~~
rahimiali
you've actually probably done this yourself. it's often called
"featurization". for example, instead of applying a linear learner on vectors
x in R^d, you apply it to vectors f(x), where f computes a bunch of features
on x. a popular choice for f is the d-th order monomials (toy example below).
hashing families are another good idea (Alex Smola does this). more generally,
any random nonlinear function f is a good candidate (i call that analysis
"Random Kitchen Sinks"). when x is structured data, f usually just returns
counts in histogram bins of some kind.
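
a toy demonstration of why the featurization matters (my choice of dataset and
learner, nothing canonical): concentric circles are not linearly separable in
raw x, but become separable after second-order monomial features.

    from sklearn.datasets import make_circles
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import PolynomialFeatures

    X, y = make_circles(noise=0.1, factor=0.5, random_state=0)  # not linearly separable

    raw = LogisticRegression().fit(X, y)
    feat = make_pipeline(PolynomialFeatures(degree=2), LogisticRegression()).fit(X, y)

    print("linear on raw x:", raw.score(X, y))   # roughly chance
    print("linear on f(x): ", feat.score(X, y))  # near perfect on this toy problem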

------
shon
Google, Twitter, Netflix, Yelp, Pandora and more are speaking on Deep Learning
and RecSys this Friday at MLconf in San Francisco. We're trying to get a
streaming solution going as well for those who can't make it.
[http://mlconf.com](http://mlconf.com)

DISCLAIMER: This is my event

~~~
mbeissinger
I'll definitely check this out if you get a stream going.

~~~
shon
Check mlconf.com on Friday. The default will be Ustream here:
[http://www.ustream.tv/search?q=mlconf](http://www.ustream.tv/search?q=mlconf)

We'll also post that and any updates to the main site on Friday.

------
cocoflunchy
OT but the text selection behavior on this page is fascinating! (Or horrific
if you don't want to be nice). I've never seen anything like it.

[https://www.dropbox.com/s/4k72g8b2tl3mgzt/Screenshot%202013-...](https://www.dropbox.com/s/4k72g8b2tl3mgzt/Screenshot%202013-11-13%2020.29.48.png)

~~~
mbeissinger
Ahh what are you viewing it on?

~~~
cocoflunchy
Chrome 31.0.1650.48 on OSX

~~~
sudont
31.0.1650.48 here as well.

It appears to be a bug involving the ::selection pseudo-element in conjunction
with the font Georgia. My guess is it's a Chrome-on-Mac bug (Firefox is fine),
not a site coding error.

Disabling either the font or the selection style fixes it. Most likely a text
rendering issue. At work we've noticed Chrome getting buggier in that area, as
well as retaining DOM node properties across redraws.

------
Noxchi
What does it take to be good at machine learning like this, in terms of
mathematics and computer science knowledge?

I know how to code through self-learning, and I've pretty much solely done web
development, so I barely know any CS. I'm also not very good at math.

So what are the essential prerequisites you would say are necessary for doing
neat, useful stuff with machine learning?

~~~
elq
Math. math. math.

Linear algebra. Bayesian statistics. MUST know these inside out, upside down.

Vector calculus. Convex optimization.

A boatload of machine learning literature. The ideas coalescing into deep
learning are based on more than a decade of research.

If you know nothing about math... I can't imagine getting to the point of
understanding deep learning (which is a fairly rapidly evolving area) without
at least 2-3 years of very hard work.

This class is a reasonable attempt to give a quick intro to one major source
for DBNs:
[https://www.coursera.org/course/neuralnets](https://www.coursera.org/course/neuralnets).
Understanding this course is a good benchmark.

------
dave_sullivan
This is a really good write-up. For people looking for practical experience
with these types of methods, I'd also recommend checking out theano and/or
pylearn2 (which is built w/ theano).

theano: [http://deeplearning.net/tutorial/](http://deeplearning.net/tutorial/)

pylearn2: [http://deeplearning.net/software/pylearn2/](http://deeplearning.net/software/pylearn2/)

NIPS, a big ML conference, is in December, so expect to see a large number of
new ideas and applications re: deep learning come out of that.

~~~
mbeissinger
Thanks for putting those links up - the Theano documentation has some great
tutorials for how to code these in practice.

------
faxmulder
Very interesting stuff written in a clear way. I'm actually finishing my
master's thesis on music genre recognition through machine learning, which is
focused more on traditional ensemble learning, but I think it would be nice to
study deep learning in greater detail. Thanks!

~~~
mbeissinger
Awesome, do you have any demos for the music genre recognition?

~~~
faxmulder
Not yet, I still have some work to do. One question: do you think that Optimum-
Path Forests could also be used in the context of deep learning?

------
albertzeyer
Can anyone recommend a good book on the topic? Maybe also covering other recent
neural network research; I'm especially interested in recurrent networks like
LSTM.

------
arjunrajjain
This is awesome!

~~~
mbeissinger
Thanks!

------
sidcool
An excellent 101 article.

~~~
mbeissinger
Thanks!

