
I have a question about the information-theoretic view of backprop - britcruise
I'm trying to explain the highest-level reason why (modern) back propagation works better than previous methods (such as Rosenblatt's back propagation in 1958 - yes, it's that old). Without going into the calculus, I want to just look at the information side of things.

I want to say something like the following:

Back in the 1960s we tried back propagation with binary neurons. So when we tuned the parameters backwards (from output to input), no magnitude information passed back through the net (only DIRECTION, i.e. turn this knob left or right). This is similar to how we can't reverse an operation in modulo arithmetic. So it was a very 'coarse' training process due to a lack of information (it took forever).

When we moved to non-linear units (such as ReLU) there was now a direct relationship (or analog) between output and input magnitude, so when we passed backwards through the net we had MAGNITUDE and DIRECTION information (i.e. turn this knob to the right by x). That allowed us to train the net much faster because information from every neuron touched every other neuron. Put simply, "we knew how much to turn them, and in what direction" during training.

Thoughts? What am I glossing over?

Thank you.
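
One way to poke at the "direction only vs. magnitude and direction" idea is to measure how much the loss changes when a single weight is nudged. This is only a rough sketch under made-up assumptions: one weight, one input, a squared-error loss, and a finite-difference slope instead of real backprop. The names `step`, `relu`, `numeric_grad` and `loss_with` are hypothetical helpers, not anything from the historical systems.

```python
def step(z):
    """Binary threshold unit: outputs 1 if z > 0, else 0."""
    return 1.0 if z > 0 else 0.0


def relu(z):
    """Rectified linear unit: passes positive inputs through unchanged."""
    return max(0.0, z)


def numeric_grad(f, w, eps=1e-6):
    """Central finite-difference estimate of df/dw at w."""
    return (f(w + eps) - f(w - eps)) / (2 * eps)


# Assumed toy setup: a single input and a single target value.
x, target = 2.0, 3.0


def loss_with(unit):
    """Squared error of a one-weight 'network' that uses the given unit."""
    return lambda w: (unit(w * x) - target) ** 2


# Slope of the loss with respect to the weight, evaluated at w = 0.5.
step_grad = numeric_grad(loss_with(step), 0.5)
relu_grad = numeric_grad(loss_with(relu), 0.5)

print("step unit slope:", step_grad)
print("relu unit slope:", relu_grad)
```

In this toy case the threshold unit is flat around the chosen weight, so the slope carries no size information at all, while the ReLU slope says both which way to turn the knob and roughly how hard.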
======
sharemywin
Don't underestimate the 50Mx improvement in computer performance since 1970.

Also, ReLUs help with the vanishing/exploding gradient problem, which allows
the information to propagate without sending it into la la land.
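
A quick way to see the vanishing part is to multiply the per-layer slopes along a chain of units, which is what the chain rule does on the way back. This is a sketch under loud assumptions: scalar layers, every weight fixed at 1, and an arbitrary depth of 10. With weights fixed at 1 it only shows the shrinking side, not the exploding side. `backprop_factor` is a hypothetical helper name.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)  # never larger than 0.25


def relu(z):
    return max(0.0, z)


def relu_grad(z):
    return 1.0 if z > 0 else 0.0


def backprop_factor(act, act_grad, x, depth):
    """Chain-rule product d(output)/d(input) through `depth` scalar layers.

    Assumes every layer has weight 1, so only the activation slopes matter.
    """
    a, factor = x, 1.0
    for _ in range(depth):
        factor *= act_grad(a)  # slope of this layer at its input
        a = act(a)             # forward to the next layer
    return factor


sig_factor = backprop_factor(sigmoid, sigmoid_grad, 1.0, 10)
relu_factor = backprop_factor(relu, relu_grad, 1.0, 10)

print("sigmoid chain factor:", sig_factor)
print("relu chain factor:", relu_factor)
```

Each sigmoid layer scales the signal by at most 0.25, so after ten layers almost nothing is left, while the ReLU chain passes the signal back unchanged as long as the units stay active.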

CNNs helped because they don't have to calculate across a fully connected
network.

~~~
britcruise
thanks for the comment.

Yes the performance boost in training is critical.

So what I'm saying is this performance boost is thanks to the switch to
non-binary neurons (which allow a reversible operation to transmit a
magnitude) - that's MOST important.

And separately, ReLUs are just better at this: because they are linear for
positive inputs, they don't have "vanishing edges" (which prevents the
vanish/explode).

Separately, I'm glad you brought up CNNs, because CNNs are old and go back to
Rosenblatt (1958): his perceptron had a first layer of local connections in
it, based on the findings in biological systems.

and of course that's because nature has found it's more efficient, and so the
efficiency is huge.

but the point is CNN = fewer knobs to train.

and there are lots of simple ways to help reduce knobs:

- fix some connections
- drop some connections
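
To put a rough number on "fewer knobs", here is a small counting sketch. The 28x28 single-channel image, the 3x3 kernel and the 8 filters are all assumed for illustration, and `dense_params` and `conv2d_params` are hypothetical helpers that just apply the usual weights-plus-biases counts.

```python
def dense_params(n_in, n_out):
    """Fully connected layer: one weight per input/output pair, plus biases."""
    return n_in * n_out + n_out


def conv2d_params(kernel_h, kernel_w, in_channels, out_channels):
    """Convolutional layer: one small shared kernel per filter, plus biases."""
    return kernel_h * kernel_w * in_channels * out_channels + out_channels


# Assumed example: a 28x28 grayscale image.
pixels = 28 * 28

# Fully connected layer mapping every pixel to every unit of a same-size layer.
dense = dense_params(pixels, pixels)

# Convolutional layer with eight 3x3 filters over one input channel.
conv = conv2d_params(3, 3, 1, 8)

print("fully connected knobs:", dense)
print("convolutional knobs:", conv)
```

Under these assumptions the fully connected layer has 615,440 knobs and the convolutional layer has 80, because the local connections are reused across the whole image.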

