
Improving the way neural networks learn - fanfantm
http://neuralnetworksanddeeplearning.com/chap3.html
======
TrainedMonkey
The Neural Networks class on Coursera covered a lot of the same topics, with
both heavy math theorycrafting and a hefty amount of practical application.
[https://www.coursera.org/course/neuralnets](https://www.coursera.org/course/neuralnets)

~~~
mailshanx
It's a pity the lectures are no longer accessible. How I wish I could find
them somewhere!

~~~
varelse
"View Course Record" is your friend...

And then you can download them with coursistant or coursera-dl and have them
wherever you go...

------
heurist
>You have to realize that our theoretical tools are very weak. Sometimes, we
have good mathematical intuitions for why a particular technique should work.
Sometimes our intuition ends up being wrong [...] The questions become: how
well does my method work on this particular problem, and how large is the set
of problems on which it works well.

I'm not very familiar with this field. Has anyone made any progress on
formalizing ways to measure the capabilities of intelligent systems? If the
theory is weak, there must be someone working on improving it, right?

~~~
varelse
Probably Dileep George et al. at Vicarious....

But since that's a $55M black hole with no published results other than a
mostly meaningless claim to have solved CAPTCHA (which wasn't all that tough a
task to begin with), there's no way to tell; it doesn't seem like
practitioners of the art are the ones evaluating his prospects for further
funding. But don't just believe some random dude on HN; here's Yann LeCun
saying pretty much the same thing:

[https://plus.google.com/+YannLeCunPhD/posts/Qwj9EEkUJXY](https://plus.google.com/+YannLeCunPhD/posts/Qwj9EEkUJXY)

~~~
michael_nielsen
The OP is quoting LeCun, so this is not a coincidence!

~~~
varelse
Hey Michael, I loved your book on Quantum Computing, but don't get me started
on D-Wave, or as I see it: $15M for a huge magic box that _might_ be faster
than a $15,000 GPU cluster for _some_ problems.

But seriously, the book rocked, and this one's coming along nicely.

------
araes
It seems like the implicit target in the document is a critically damped
system with no ringing during learning. If they're going for speed, though,
it seems like they should accept some overshoot and use non-linear control
theory for the weights, so that they're underdamped during the initial descent
and then transition into critically damped gradient descent as they move into
the flat zone. Something like a variable "damper", or weights/springs keyed to
the current error. Perhaps that's done elsewhere, and just not described as a
technique here.
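
To make the idea concrete, here's a toy sketch in Python: plain momentum
gradient descent, but with the momentum coefficient (the "damper") tied to the
current error. This is entirely hypothetical, not a published method, and
loss_scale is just a made-up normalizer for what counts as a "large" error:

    import numpy as np

    def adaptive_momentum_step(w, v, grad, loss, loss_scale,
                               lr=0.1, mu_max=0.9):
        # High loss -> mu near mu_max: underdamped, fast, some overshoot.
        # Low loss  -> mu near 0: heavily damped, settles without ringing.
        mu = mu_max * min(loss / loss_scale, 1.0)  # crude error-based damper
        v = mu * v - lr * grad                     # velocity (momentum) update
        return w + v, v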

~~~
Houshalter
That sort of sounds like Rprop, which is supposedly one of the fastest
learning algorithms. There are also other "adaptive learning rate" algorithms.

[https://en.wikipedia.org/wiki/Rprop](https://en.wikipedia.org/wiki/Rprop)
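
If it helps, the core Rprop update is tiny: each weight keeps its own step
size, which grows while the gradient keeps its sign and shrinks when the sign
flips. A minimal sketch of the Rprop- variant (the eta values are the usual
defaults from Riedmiller & Braun; this isn't code from the article):

    import numpy as np

    def rprop_update(w, grad, prev_grad, step, eta_plus=1.2, eta_minus=0.5,
                     step_min=1e-6, step_max=50.0):
        same = grad * prev_grad               # >0: sign kept, <0: sign flipped
        step = np.where(same > 0, np.minimum(step * eta_plus, step_max), step)
        step = np.where(same < 0, np.maximum(step * eta_minus, step_min), step)
        grad = np.where(same < 0, 0.0, grad)  # Rprop-: skip the move on a flip
        w = w - np.sign(grad) * step
        return w, grad, step                  # feed grad back in as prev_grad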

~~~
araes
Very cool. I had never heard of Rprop, but that's a neat way of signaling to
the learner that it needs to damp rapidly. Kind of like a limiter in CFD.

------
kghose
I found the statement about the cross-entropy to be untrue. When y == a, the
function is non-monotonic, with zeros at the extremes but not in the middle.
So the "proof" shown is confusing to me.

~~~
michael_nielsen
Which statement about cross entropy is confusing?

~~~
kghose
Hi!

The statement "if the neuron's actual output is close to the desired output,
i.e., y=y(x) for all training inputs x, then the cross-entropy will be close
to zero"

is not true. The function peaks in the middle (~0.7).
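
(Concretely: setting a = y in C = -[y ln a + (1-y) ln(1-a)] gives the binary
entropy -[y ln y + (1-y) ln(1-y)], which is 0 at y = 0 and y = 1 but peaks at
y = 0.5 with value ln 2 ≈ 0.693, hence the ~0.7.)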

Thanks! -Kaushik

~~~
michael_nielsen
This is addressed in the marginal note attached to the sentence you quoted.

The essential point is that we're considering classification problems, for
which the output is intended to be 0 or 1. I address the more general case of
regression problems (where y may take any value) in a later exercise.
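
(For instance, for a training input with desired output y = 1 the cost reduces
to -ln a, which goes to 0 as the output a approaches 1.)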

Hope that helps!

~~~
kghose
I see it now, thanks!

------
saosebastiao
The sections on Regularization and Dropout have some amazing prose. I haven't
read any of the other chapters, but just skimming through those sections has
helped enlighten me on quite a few things that had confused me for years in
completely different fields, such as why a random forest made up of randomly
selected simple CARTs generally predicts better than a single complex CART, or
why fitting a distribution to empirical data can benefit from using AIC or BIC
methods.
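
For anyone who wants the dropout/ensemble analogy spelled out mechanically,
here is roughly what dropout does to one layer's activations at training time
(a toy sketch, not the book's code). Each minibatch effectively trains a
different thinned sub-network, loosely the same effect of averaging over many
simple models that makes a random forest of simple CARTs work:

    import numpy as np

    def dropout_forward(activations, p_keep=0.5, train=True, rng=np.random):
        # Inverted dropout: scale by 1/p_keep at training time so that
        # test-time activations need no rescaling.
        if not train:
            return activations
        mask = (rng.rand(*activations.shape) < p_keep) / p_keep
        return activations * mask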

