
Why Do Neural Networks Need an Activation Function? - strikingloo
http://www.datastuff.tech/machine-learning/why-do-neural-networks-need-an-activation-function/
======
ml_thoughts
The posted article isn't particularly fascinating, but for a bit of fun,
there's an OpenAI project where they demonstrate that due to the non-linear
rounding of Float32 values you can actually train "non-linear" linear
networks: [https://openai.com/blog/nonlinear-computation-in-linear-networks/](https://openai.com/blog/nonlinear-computation-in-linear-networks/)
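A quick sketch of the rounding nonlinearity (plain numpy, not the exact construction from the OpenAI post): in float32, adding and then subtracting a large constant is not the identity, because small inputs get rounded away.

```python
import numpy as np

# Near 1e8, adjacent float32 values are 8.0 apart, so "x + c - c"
# (a composition of linear ops) silently zeroes out small x.
c = np.float32(1e8)

def roundtrip(x):
    return (np.float32(x) + c) - c

print(roundtrip(1.0))   # -> 0.0   (below the rounding granularity)
print(roundtrip(16.0))  # -> 16.0  (large enough to survive)
```

So a "linear" layer computed in finite precision isn't exactly linear, which is the loophole the OpenAI post exploits.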

~~~
p1esk
Or any binarized or ternarized network.

------
united893
You don't need advanced math to answer this question. If there's no activation
function then the weight matrices of consecutive layers can be multiplied
together, and the whole network collapses to a single linear classifier.
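That collapse is easy to check numerically (a minimal numpy sketch):

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 8))   # layer 1: 8 -> 4
W2 = rng.normal(size=(3, 4))   # layer 2: 4 -> 3
x = rng.normal(size=8)

# Two linear layers with no activation in between...
two_layer = W2 @ (W1 @ x)
# ...are exactly one linear layer with weights W2 @ W1.
one_layer = (W2 @ W1) @ x

print(np.allclose(two_layer, one_layer))  # -> True
```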

~~~
H8crilA
Of course you're right. There is a linearity inherent in most neural network
designs that many people overlook: backpropagation is entirely linear.

~~~
p1esk
_backpropagation is entirely linear._

Huh?

~~~
H8crilA
I mean it in the sense of the sensitivity of the backpropagation updates to
the loss function value.

Backpropagation is essentially a big chain rule application. If you order the
layers into a list (with identity transformations to pipe the values that do
not change in the particular layer) then the whole process is just multiplying
the derivative matrices, starting with the loss function and then the bottom-
most layer. That's why it's difficult to train long unrolled sequences of RNNs
- the many matrices that you have to multiply on the way back are either
contracting (in which case the signal dies out) or expanding (in which case
you get NaNs). That's also why people "cheat" backpropagation by using things
like gradient clipping.

[https://medium.com/@karpathy/yes-you-should-understand-backprop-e2f06eab496b](https://medium.com/@karpathy/yes-you-should-understand-backprop-e2f06eab496b)
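The contracting/expanding effect is easy to demonstrate with random matrices standing in for the per-layer Jacobians (a sketch, not a real training loop):

```python
import numpy as np

rng = np.random.default_rng(0)

def backprop_signal_norm(scale, depth=50, dim=10):
    # Multiply `depth` random Jacobian-like matrices, as in backprop
    # through a deep (or long-unrolled) network.
    g = np.ones(dim)
    for _ in range(depth):
        J = scale * rng.normal(size=(dim, dim)) / np.sqrt(dim)
        g = J @ g
    return np.linalg.norm(g)

print(backprop_signal_norm(0.5))  # contracting: the signal dies out
print(backprop_signal_norm(2.0))  # expanding: the signal blows up
```

With 50 factors, a typical per-step factor of 0.5 or 2.0 compounds to roughly 1e-15 or 1e15 — exactly the vanishing/exploding behavior described above.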

~~~
p1esk
I think you might be confused: if there are activation functions then backward
pass is also non-linear. Try to derive backprop equations by hand and you will
see that.

~~~
H8crilA
I am quite sure that it is linear in terms of sensitivity to loss function
changes. It's a simple derivative. For example - if you multiply the loss
function by 2 all the weight updates will also multiply by 2.

~~~
p1esk
_Any_ function is linear in terms of linear transformation of that function.
If y = f(x), then multiplying f(x) by 2 will result in 2y. Here we are talking
about linearity or non-linearity of f(x), so using your example, if we
multiply x by 2, which in this context would be the error signal in the output
layer, do we get weight updates times 2? No, because f = dL(error)/dW is not a
linear transformation of error (in general).

Here's an attempt to make the backprop linear:
[https://openreview.net/forum?id=ByfPDyrYim](https://openreview.net/forum?id=ByfPDyrYim)

~~~
H8crilA
Yeah, but it doesn't hold for an arbitrary function that f(2x) = 2y given
f(x) = y. And the input side is what backpropagation works with (it tries to
find x such that f(x) equals some target).

Anyhow, we're talking past each other. I mean that it's linear in a very
strict/specific sense. Take a fixed network with a fixed loss function, for a
particular set of inputs and particular target outputs (think 1 batch, or even
just one example). Now as you backprop you successively compute derivatives of
what's currently on the layer with respect to the loss value, starting from
the loss function, then the bottom-most layer, then the previous layer, and so
on.

Now this can be decomposed into a series of matrix multiplications (chain
rule), where yes, in any particular run the matrices depend on the weights,
inputs and outputs in a nonlinear fashion. But the overall effect is just a
composition of linear operators (matrix multiplication). That's why things go
to shit when the network is deep and the derivative matrices are all
contracting or all expanding.

And just to connect with your example: if f=d(error)/dW then yes
d(2error)/dW=2f.
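For a fixed network and input, this is easy to verify with a hand-rolled backward pass (a minimal numpy sketch with one tanh hidden layer; the shapes and weights are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(5, 3))   # hidden layer: 3 -> 5
W2 = rng.normal(size=(1, 5))   # output layer: 5 -> 1
x = rng.normal(size=3)
y = 1.0                        # target

def grads(loss_scale):
    # forward pass
    z = W1 @ x
    h = np.tanh(z)
    out = (W2 @ h)[0]
    # loss = loss_scale * 0.5 * (out - y)^2, so dL/dout = loss_scale * (out - y)
    dout = loss_scale * (out - y)
    # backward pass (chain rule by hand)
    dW2 = dout * h[None, :]
    dh = dout * W2[0]
    dW1 = ((1 - np.tanh(z) ** 2) * dh)[:, None] * x[None, :]
    return dW1, dW2

g1 = grads(1.0)
g2 = grads(2.0)
# Doubling the loss exactly doubles every weight update:
print(np.allclose(2 * g1[0], g2[0]), np.allclose(2 * g1[1], g2[1]))  # -> True True
```

The nonlinearities (tanh here) only affect the *entries* of the Jacobian matrices; the backward pass itself is still just a product of those matrices, which is why scaling the loss scales every gradient by the same factor.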

~~~
p1esk
Ok, I see what you mean, and you're right. However, I'm not sure I see any
usefulness of this property.

~~~
H8crilA
It helps understand why NaNs or near-zero training speed happen. For
example - say you have an LSTM. The more you unroll it the more matrices get
multiplied on the way back, and the backprop signal can get amplified/muted in
an exponential fashion (as a function of the number of unrolled cells).

~~~
p1esk
I was talking about the property that gradients are linearly proportional to
the loss value. I don't see whether this property helps or hurts the vanishing
gradients problem (e.g. if it were not linear, maybe it would be worse).

