
How the backpropagation algorithm works - oskarth
http://neuralnetworksanddeeplearning.com/chap2.html
======
dave_sullivan
This book is really coming together. It's been a while since I last did this,
so here's a (100% not comprehensive) list of good places to start if you're
looking to learn more and/or use deep learning in your projects.

 _Open source_

Pylearn2 (used to win the Kaggle Galaxy Zoo competition):
[http://deeplearning.net/software/pylearn2/](http://deeplearning.net/software/pylearn2/)

Theano (symbolic math library used by Pylearn2):
[http://deeplearning.net/software/theano/](http://deeplearning.net/software/theano/)

Deep learning tutorials with Theano (build your own neural networks):
[http://www.deeplearning.net/tutorial/](http://www.deeplearning.net/tutorial/)

 _Demos_

ConvNetJS:
[http://cs.stanford.edu/people/karpathy/convnetjs/](http://cs.stanford.edu/people/karpathy/convnetjs/)

Sentiment Analysis:
[http://nlp.stanford.edu:8080/sentiment/rntnDemo.html](http://nlp.stanford.edu:8080/sentiment/rntnDemo.html)

3D word cloud (WebGL):
[http://wordcloud.ersatz1.com/](http://wordcloud.ersatz1.com/)

 _Commercial_

Ersatz (I'm a co-founder, it's a PaaS providing neural network software with
cloud GPU servers): [http://www.ersatzlabs.com](http://www.ersatzlabs.com)

 _Good Reading_

Deep learning of representations: looking forward
[http://arxiv.org/pdf/1305.0445v2.pdf](http://arxiv.org/pdf/1305.0445v2.pdf)

Zero-Shot Learning Through Cross-Modal Transfer:
[http://arxiv.org/pdf/1301.3666v2.pdf](http://arxiv.org/pdf/1301.3666v2.pdf)
<-- C'mon, that's pretty amazing...

Solution for the Galaxy Zoo challenge:
[http://benanne.github.io/2014/04/05/galaxy-zoo.html](http://benanne.github.io/2014/04/05/galaxy-zoo.html)

Pylearn2 in practice: [http://fastml.com/pylearn2-in-practice/](http://fastml.com/pylearn2-in-practice/)

~~~
ma2rten
A while ago I posted a comment here on HN which got a number of upvotes but no
answer. Could you maybe take a stab at it?

I've followed the developments in neural networks somewhat, but have never
applied deep learning myself so far. This seems like a good place to ask a
couple of questions I've been having for a while.

1. When does it make sense to apply deep learning? Could it potentially be
applied successfully to any difficult problem given enough data? Could it also
be good at the type of problems that Random Forests and Gradient Boosting
Machines are traditionally good at, versus the problems that SVMs are
traditionally good at (computer vision, NLP)? [1]

2. How much data is enough?

3. What degree of tuning is required to make it work? Are we at the point yet
where deep learning works more or less out of the box?

4. Is it fair to say that dropout and maxout always work better in practice?
[2]

5. What is the computational effort? For example, how long does it take to
classify an ImageNet image (on a CPU / GPU)? How long does it take to train a
model like that?

6. How on earth does this fit into memory? Say in ImageNet you have (256
pixels * 256 pixels) * (10,000 classes) * 4 bytes = 2.4 GB, for an NN without
any hidden layers.
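
(Checking that arithmetic for a single dense weight matrix of 32-bit floats:)

```python
>>> 256 * 256 * 10_000 * 4 / 2**30   # inputs x classes x 4 bytes, in GiB
2.44140625
```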

[1] I am overgeneralizing somewhat, I know. It's my way to avoid overfitting.

[2] My lunch today was free.

~~~
dave_sullivan
Sure, I'll give it a shot -- feel free to email me if you have further
questions; my email is in my profile.

1. I think it makes sense to try them for any classification, regression, or
feature extraction problem. They don't work all the time: sometimes you really
don't need the extra depth (one hidden layer can be fine), and they can be
pretty slow to train (even with a GPU). I've also seen people try to build
their own, implement it wrong, get bad results, then complain that NNs don't
work. So test for yourself, and just make sure you're not doing it wrong.

2. It really depends. More is almost always better.

3. Training a bunch of models using Bayesian optimization to tune the
hyperparameters (so you don't have to pick them), then putting the last few
models in an ensemble and averaging their results, is pretty close to out of
the box. This is the workflow we use with Ersatz.
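
(To make the shape of that workflow concrete, here is a minimal sketch using
plain random search as a simpler stand-in for Bayesian optimization; the toy
ridge-regression "model" and all names are my own illustration, not Ersatz's
API.)

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: a noisy linear problem standing in for a real task.
X = rng.normal(size=(200, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=200)
X_tr, y_tr, X_val, y_val = X[:150], y[:150], X[150:], y[150:]

def fit_ridge(lam):
    """Closed-form ridge regression; stands in for 'train a model'."""
    d = X_tr.shape[1]
    return np.linalg.solve(X_tr.T @ X_tr + lam * np.eye(d), X_tr.T @ y_tr)

# Sample hyperparameters at random (Bayesian optimization would propose
# them more cleverly) and record validation performance for each trial.
trials = []
for _ in range(30):
    lam = 10 ** rng.uniform(-4, 2)                 # log-uniform sample
    w = fit_ridge(lam)
    trials.append((np.mean((X_val @ w - y_val) ** 2), w))

# Ensemble: average the predictions of the few best models.
best = sorted(trials, key=lambda t: t[0])[:3]
y_pred = np.mean([X_val @ w for _, w in best], axis=0)
print("ensemble validation MSE:", np.mean((y_pred - y_val) ** 2))
```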

4. Despite lunches not being free, you should probably use dropout. It's
ridiculously good at preventing overfitting, but it can take longer to train
(although there's been some work with "fast dropout" to speed it up).

5. A GPU gets you a ~40x speedup over a CPU. So if you're using a CPU and I'm
using a GPU, I can do in 1 day what would take you a month and a half. And
then I might train for a week or more on the GPU (I think the ImageNet models
were trained for a week or two, but I'm not sure how many GPUs were used).
Otherwise, computational effort varies.

6. You use mini-batches: you load as many samples as fit in GPU memory (along
with the model params) and then split those into smaller batches, rotating the
"large batch" periodically. Neural networks can also keep taking in new data
and updating their model (online learning), which makes them particularly
attractive for very large data sets.
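
(A rough sketch of that loading pattern, purely my own illustration rather
than any particular library's API:)

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50_000, 20))   # stand-in for a large dataset

CHUNK = 10_000     # "large batch": as many samples as fit in GPU memory
MINIBATCH = 128    # what a single gradient step actually sees

for start in range(0, len(X), CHUNK):
    chunk = X[start:start + CHUNK]          # in real code: host-to-GPU copy
    for i in range(0, len(chunk), MINIBATCH):
        mb = chunk[i:i + MINIBATCH]
        # one parameter update on `mb` would happen here
        pass
```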

General points: use a GPU, don't build your own except as an academic
exercise, use dropout, and test empirically on your own data. And check out
Bayesian optimization of hyperparameters; I'm becoming more and more convinced
it's better at picking them than human experts anyway.

~~~
ma2rten
Thanks.

------
agibsonccc
This is a great explanation of backpropagation. For those who just want the
formulas, my personal favorite has been Stanford's UFLDL resource:

[http://ufldl.stanford.edu/wiki/index.php/Backpropagation_Alg...](http://ufldl.stanford.edu/wiki/index.php/Backpropagation_Algorithm)

The general intuition behind backprop: take the prediction error into account
(think of how many labels the network got wrong) and ask how far off the
predictions were. Based on that, go back and penalize the weights that caused
the error in proportion to how much they contributed to it.

Multi-layer perceptrons (as well as deeper multi-layer nets) have multiple
layers: you send the input through the network and make a guess.

Then you basically keep updating the weights iteratively (via gradient
descent, conjugate gradient, L-BFGS, ...) until they stop changing much. This
amounts to a search guided by the cost (or objective) function. For more
depth, the book above covers this thoroughly.
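
(As a minimal illustration of that loop, here is a sketch of my own on a toy
least-squares cost, not tied to any particular framework: compute the
gradient, step the weights against it, and stop once the updates get small.)

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy least-squares problem: cost(w) = ||Xw - y||^2 / 2n.
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5])

w = np.zeros(3)
lr = 0.1
for step in range(1000):
    grad = X.T @ (X @ w - y) / len(y)      # backprop computes this for a net
    w_new = w - lr * grad                  # plain gradient-descent update
    if np.linalg.norm(w_new - w) < 1e-8:   # "till it doesn't change much"
        break
    w = w_new
print(step, w)   # recovers approximately [1.0, -2.0, 0.5]
```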

For those who want to just use deep learning, I will be giving talks at both
OSCON and Hadoop Summit this year on distributed deep learning using two
different frameworks I commit to, [1] and [2]. Happy to answer questions!

[1]: [http://deeplearning4j.org/](http://deeplearning4j.org/)
[2]: [http://github.com/jpatanooga/Metronome/](http://github.com/jpatanooga/Metronome/)

------
avaku
Can I offer a bit of criticism? This book does provide a great description of
the details of the algorithm's inner workings (very cute demons too). However,
after reading this chapter (sorry, I haven't looked at the other ones), there
is still a feeling of a bit of mystery about _why_ it works, and even more
about why it might not work. Possibly this is covered in other parts of the
book, so I apologise if this criticism is not justified. I am personally a big
fan of Christopher Bishop's book Pattern Recognition and Machine Learning,
where backprop is described as an architecture for efficiently computing the
many partial derivatives needed for stochastic gradient descent... I was
involved with NNs before, but only after understanding where the algorithm for
individual neurons comes from could I properly appreciate the benefits of
backprop (and understand the drawbacks).

------
oskarth
This is chapter two of Michael Nielsen's book on Neural Networks and Deep
Learning [1].

If you haven't heard about it before, I highly suggest you check it out.

[1]: [http://neuralnetworksanddeeplearning.com/about.html](http://neuralnetworksanddeeplearning.com/about.html)

------
zackmorris
I can't emphasize enough how great it is that he actually provides CODE right
from chapter one that implements a network with an arbitrary number of neurons
at each layer using NumPy. I've read (arguably) better explanations of how
neural networks function, but the code was always either archaic or
nonexistent.
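
(For flavor, the core feedforward pass really does fit in a few lines of
NumPy. This is a minimal sketch of my own in the spirit of the book's code,
not Nielsen's actual implementation:)

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Arbitrary layer sizes: 784 inputs, a hidden layer of 30, 10 outputs.
sizes = [784, 30, 10]
rng = np.random.default_rng(0)
biases = [rng.normal(size=(n, 1)) for n in sizes[1:]]
weights = [rng.normal(size=(n, m)) for m, n in zip(sizes[:-1], sizes[1:])]

def feedforward(a):
    """Propagate a column-vector activation through every layer."""
    for b, w in zip(biases, weights):
        a = sigmoid(w @ a + b)
    return a

print(feedforward(rng.normal(size=(784, 1))).ravel())
```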

To me, the ability to learn artificial intelligence concepts and also pass
them on to others in a form that can be tinkered with signals a tipping point
in the field.

I would like to see all of the basic building blocks of AI (such as Bayes
classifiers, genetic algorithms, etc) packaged up this way into an API that is
as approachable as OpenGL. Then I would like to see multiprocessing libraries
like OpenCL/CUDA incorporated internally so that training can happen in
milliseconds instead of minutes or hours. With enough eyeballs looking at
these things, we might be able to get from heuristics and rules of thumb for
training values to something more concrete. It seems like every time I learn a
new paradigm it devolves into wishy-washiness because there are just not
enough years in a researcher's life to discover the subtle rules at work
behind the scenes. The progression seems to always be the same: failure to
achieve success over 50%, then reaching 95% after some hours/days of
tinkering, then finding the model hits a maximum of 99% and another model must
be learned. Rinse, repeat. If that changes, and we’re able to link up various
models without having to choose arbitrary constants, machine learning will
have arrived IMHO. We could throw hardware at it and let an array of agents
evolve in parallel without human intervention until we see which arrangements
work best. Eventually that could lead to a theory of mind that actually works
because it could learn anything a human could learn, for the most part
unsupervised.

I’ve learned just enough about this stuff to wonder about the endgame. I have
a hunch that it will involve something akin to version control, so that an AI
can try different approaches until it finds a solution, but with the ability
to roll back in case it goes off the rails. Does anyone have a starting point
for things like imagination in AI, or trial runs that happen in simulation
before the AI acts in real life? And maybe how to merge new solutions into
existing ones?

~~~
michael_nielsen
Author here. I'd love it if you could give me pointers to your favourite
explanations of how neural networks function (even better if you can say what
you particularly liked about them).

(I really enjoyed reading the remainder of your comment.)

~~~
zackmorris
Wow, cool -- sorry, I didn't mean to sound critical; I should have used
(perhaps) instead of (arguably). I had to search my bookmarks, but I honestly
can't remember if one of these was the explanation I remember (it could have
been in a book, back when Borders bookstore was still around):

[http://natureofcode.com/book/chapter-10-neural-networks/](http://natureofcode.com/book/chapter-10-neural-networks/)

[http://zerkpage.tripod.com/index.htm](http://zerkpage.tripod.com/index.htm)

[http://www.ai-junkie.com/ann/evolved/nnt1.html](http://www.ai-junkie.com/ann/evolved/nnt1.html)

Then I used to skim this compilation before it went away:

[https://web.archive.org/web/20030601074826/http://www.emsl.p...](https://web.archive.org/web/20030601074826/http://www.emsl.pnl.gov:2080/proj/neuron/neural/systems/shareware.html)

Mostly what I remember about the explanation that stood out was that it was
concise, so the way it described backpropagation just "clicked" in a way that
the previous articles, with their tons of summation/matrix math/probability
and calculus, had not. I've read some positively atrocious papers where it was
like the authors were showing off their notation prowess rather than conveying
information.

Your chapters have a good balance, although for someone who’s been out of
college for 15 years like me, it might be nice to have a tiny refresher about
derivatives before you dive in, especially over several variables/dimensions.
I mean like a paragraph, to explain the notation, because the knowledge is in
my head but foggy. Everything else was very good, for example when you
explained how bias comes from the threshold factor on the other side but makes
notation easier. I don’t remember seeing it explained quite that way before.
Oh and the sigmoid function always seemed arbitrary too (and frankly turned me
off from neural nets because it seemed too analog), but explaining how it
simplifies derivatives makes perfect sense now.
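
(For anyone as rusty as I was, the identity in question is that the sigmoid's
derivative can be written in terms of its own output, so it comes almost for
free once you've computed the activation. A quick numerical check, my own
sketch rather than anything from the book:)

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = 0.7
s = sigmoid(z)
analytic = s * (1 - s)   # sigmoid'(z) = sigmoid(z) * (1 - sigmoid(z))
numeric = (sigmoid(z + 1e-6) - sigmoid(z - 1e-6)) / 2e-6   # finite difference
print(analytic, numeric)   # the two agree to many decimal places
```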

Unfortunately, with other books I was also never able to find off-the-shelf
code that was approachable or up to date, so I ended up forgetting everything
I had learned and had to start over each time, until I got tired of trying. I
very much like your 72-line code example, where you provide a backpropagation
function without explaining it yet. That's OK; as programmers we encounter
that all of the time, and it's actually kind of fun to decipher the algorithm
before reading how it works later. I believe it even allows for multiple
hidden layers, which is just fantastic.

As for the other stuff, well, I just look at it like this:

Ask a developer to write an iOS app that, say, interacts with the Twitter or
Facebook API and presents the user's N closest friends in a list view, then
submit that to the App Store, and they can do it from start to finish with no
mystery (other than a little help from Stack Overflow). Each of those steps
relies on some pretty heavy lifting, like REST APIs, possibly SQL, message
passing in Objective-C, even a rudimentary understanding of security and
encryption for networking and app submission, yet those things have become
mainstream.

But ask a developer to do even the simplest machine learning task, like a
little data mining or spam filtering, heaven forbid weak AI like image
recognition, and it's a whole different can of worms. Why hasn't anyone
standardized this stuff in the languages that developers use every day, as
opposed to MATLAB or Prolog or whatnot? Why can't I literally read a text file
and pipe it through a shell script that does a Bayes classifier? Why doesn't
iOS have an AI.framework to go along with its Social.framework? To me these
aren't all that much to ask.

I really feel that we're still working at the assembly language or hand-rolled
affine texture mapping level of the '80s and '90s with respect to AI today. We
don't have something like the web metaphor yet to catapult it into the
mainstream. That said, I'm extremely optimistic that concurrent (or at least
parallelized) languages like Octave/NumPy/Go, and the more approachable
functional languages like Haskell/Scala/Clojure, might help deliver APIs that
we can interact with from more mainstream languages. To me, most weak AI
problems have been solved now, and it's time that programmers had an intuitive
understanding of how the algorithms work so that they can combine them in
novel ways and get to strong AI in, well, my lifetime, hah. Thanks for your
efforts and keep it up!

~~~
michael_nielsen
Belated thanks, that's extremely helpful. I'll take a look at the links that
are unfamiliar to me (I've already read the Nature of Code one; it's a great
book). Tidbits:

+ Yes, the code presented "works" for multiple hidden layers. But it converges
too slowly to be useful, except for very small networks. Later chapters
introduce new ideas which help backprop converge faster, and which start to
make multiple hidden layers quite feasible.

+ On the broader machine learning front, I think scikit-learn is a pretty nice
general library: it's approachable, easy to use, and has reasonable docs.
It'll be interesting to see how it develops.

+ For neural nets in particular, elsewhere in this thread Dave Sullivan has
done a nice job distilling out a list of good libraries. I think it'll be a
while before there's a real out-of-the-box solution for neural nets, though,
since setting parameters for neural nets is something of an art.

------
cjauvin
This seems like it will be a very interesting book. If anyone is interested, I
have written a short and compressed intro to NNs, using very simple Python
code:

[http://cjauvin.blogspot.ca/2013/10/neural-network-101.html](http://cjauvin.blogspot.ca/2013/10/neural-network-101.html)

------
jey
What are some problems where Deep Learning shines? Does it outperform other
algorithms on those problems? Is there an understanding of why?

Context: I have some background in statistical sorts of machine learning
algorithms and am genuinely puzzled by this "deep learning" phenomenon and why
it's catching on.

~~~
ma2rten
Deep learning can expose hidden non-linear relationships in the data. It's
state of the art in applications such as object recognition in images and
speech recognition. There have also been promising results in the natural
language processing field. What all of these have in common is that they are
"real" artificial intelligence, i.e. teaching computers things that they have
historically been bad at compared to humans.

Search for a talk by Geoffrey Hinton if you are genuinely interested.

------
turbolent
There's a really nice explanation at
[http://galaxy.agh.edu.pl/~vlsi/AI/backp_t_en/backprop.html](http://galaxy.agh.edu.pl/~vlsi/AI/backp_t_en/backprop.html)

------
plg
Nice description of backprop for sure.

Aren't people using conjugate gradient methods to optimize NN weights now?
Sure, you need the partial derivatives, but... that's what GPUs are for,
right? :)

~~~
nimish
Backprop lets you calculate the gradient efficiently. What you do with that
gradient is up to you (I would try L-BFGS or something akin to a stochastic
variant). So you could use conjugate gradient or some other optimization
method.
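
(A small sketch of that separation of concerns, using SciPy's generic
minimizer on a toy quadratic cost; the gradient function here stands in for
what backprop would supply for a real network, and the optimizer itself is
swappable:)

```python
import numpy as np
from scipy.optimize import minimize

# Toy cost: f(w) = ||Aw - b||^2 / 2, with a hand-coded gradient.
A = np.array([[2.0, 0.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])

def cost(w):
    r = A @ w - b
    return 0.5 * r @ r

def grad(w):                    # for a neural net, backprop supplies this
    return A.T @ (A @ w - b)

# Swap "L-BFGS-B" for "CG" (conjugate gradient) and the same gradient
# code is reused; the optimizer choice is independent of backprop.
res = minimize(cost, np.zeros(2), jac=grad, method="L-BFGS-B")
print(res.x)   # approximately [0.5, 0.5]
```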

------
j2kun
Can someone explain to me why NN's are always in layers?

~~~
_ikke_
The first chapter of this book describes how each layer adds more abstraction:
the first (input) layer works on individual pixels, the next layer on parts of
the image, and each deeper layer can make more high-level decisions.

~~~
j2kun
I understand why more layers add complexity, but that doesn't answer my
question: why layers instead of an arbitrary acyclic graph?

~~~
jbooth
I'm not an expert, but my guess is that it makes it really easy to materialize
the weights between layers as a 2D matrix, and we have some really good
library code out there for dealing with matrices.

There's probably room for experimenting with more novel constructions and even
some papers out there looking into the matter.

~~~
j2kun
Efficiency is the expected answer. I'm just wondering if there's a more
theoretical reason, such as "every function that can be computed by a non-
layered acyclic network can be computed by a complete layered network using
only a small number of extra nodes/layers."

~~~
jbooth
I think that it can. With some weights of 0 and some weights of 1, you can
trivially map 'jumps' that skip from a node in one layer to a node a couple of
layers distant, by means of intermediate pass-through nodes that incorporate
no other inputs, right? (One caveat: a sigmoid unit only copies its input
approximately, since sigmoid(1) is about 0.73 rather than 1, so an exact
pass-through needs linear units.) Once you have those, it's just a matter of
how many layers you need for any acyclic structure, I think.

Although if you wanted to come up with difficult scenarios, it's not hard to
think of structures that would make some of those middle layers really tall,
or add a lot of middle layers.
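
(A toy illustration of that pass-through idea, my own sketch: the "skip" unit
is made linear, since a sigmoid unit would only copy its input approximately:)

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.2, 0.9])            # layer-1 activations

# Layer 2: two ordinary sigmoid units plus one pass-through unit that
# copies x[0] forward (weight 1 from x[0], weight 0 from x[1]).
W2 = np.array([[0.5, -0.3],
               [0.1,  0.8],
               [1.0,  0.0]])        # last row: the pass-through wiring
h = np.append(sigmoid(W2[:2] @ x),  # the two real hidden units
              W2[2] @ x)            # linear copy of x[0]

# Layer 3 now "sees" x[0] directly via h[2], as if a skip edge existed.
W3 = np.array([0.4, -0.2, 0.7])
print(sigmoid(W3 @ h))
```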

~~~
j2kun
As I mentioned in another branch of this thread, selectively choosing edges
between nodes isn't an option, because in the standard model you have complete
incidence between nodes in adjacent layers.

------
argc
Or "Why I wish I was better at math."

