
Nonlinear Computation in Deep Linear Networks - darwhy
https://blog.openai.com/nonlinear-computation-in-linear-networks/
======
dbcurtis
It sounds like the author is ignoring denormals?

-edit-

Yes, the author is ignoring gradual underflow and the resulting denormal
numbers.

So as you move from one binade to the next, the spacing between floating point
numbers doubles or halves depending on whether you are increasing or
decreasing the exponent. When you reach the binade with the most negative
possible exponent, you have two choices: a) flush to zero, which leads to a
huge non-monotonic jump in the spacing of numbers on the floating point
number line. This is a great annoyance to numerical analysts and leads to
convergence instabilities. That is why any modern computer used for numerical
work incorporates choice b) gradual underflow, which means you must allow
non-normalized numbers in the two binades of the most negative exponent (one
for each sign bit), which has the effect of creating another pair of binades
around zero. This keeps the spacing of numbers on the floating point number
line the same in the four binades around zero. Numerical algorithms are then
much more stable.
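
A quick numpy sketch of the distinction (hedged: numpy keeps gradual underflow
on by default, unlike an FTZ build, and the exact values are for float32):

    import numpy as np

    # Smallest normal float32 (~1.18e-38) and its neighbors.
    tiny = np.finfo(np.float32).tiny
    below = np.nextafter(tiny, np.float32(0))          # largest denormal, just below tiny
    print(below != 0)                                  # True: no abrupt jump to zero
    print(np.spacing(tiny) == np.spacing(below))       # True: spacing stays ~1.4e-45
    print(np.nextafter(np.float32(0), np.float32(1)))  # smallest denormal, ~1.4e-45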

I haven't looked at what GPUs do, but I strongly suspect that they flush
denormals to zero: first, it doesn't matter much to graphics applications,
and second, the typical method of handling denormals is to take a trap and
drop into software-emulated floating point, because the cost of the
additional hardware to handle denormals is very large and the hardware
complexity is crazy-making. A GPU isn't going to want to break the pipeline
for a denormal.

~~~
aray
It looks like nvidia GPUs treat denormals as zeros for single-precision
floating point math:
[http://developer.download.nvidia.com/assets/cuda/files/NVIDI...](http://developer.download.nvidia.com/assets/cuda/files/NVIDIA-CUDA-Floating-Point.pdf)
(sections 4.1 and 4.2)

~~~
dbcurtis
In the context of graphics processing that trade-off totally makes sense.

Thanks for doing the homework that I was too lazy to do :)

It seems to me that in the context of NN computations, using the lack of
gradual underflow as a non-linear element is going to severely limit the
dynamic range of the neurons. On the plus side, the non-linear element is a
computational freebie. But in addition to limited dynamic range, it makes the
NN ridiculously non-portable across hardware implementations.

~~~
scott-gray
Actually if you read section 4.6 of that paper you'll see that denormals are
the default on sm_20 and above. But you can see in that same section that this
can easily be disabled with the ftz flag.

I had to give Jakob custom gemm kernels to do this research. Not sure why the
denormal point was left out of this blog as it's pretty critical to the whole
experiment.

~~~
scott-gray
So a minor correction here. We did explore placing ftz on various instructions
inside the matmul ops, but it turns out you don't need anything more than what
is already baked into tf by default. All tf gpu primitives are built with
-nvcc_options=ftz=true. This means you have an implicit non-linearity after
any non-matmul op (provided the scale of computation is near 1e-38). Matmul
ops are called through cublas and have denormals enabled.
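
For anyone who wants to see the shape of that implicit non-linearity, here is
a hedged numpy model of flush-to-zero (a sketch of the ftz behavior, not the
actual TF/cuBLAS code path):

    import numpy as np

    def ftz(x):
        # Model of an ftz=true op: results smaller in magnitude than the
        # smallest normal float32 (~1.18e-38) are flushed to zero.
        x = np.asarray(x, dtype=np.float32)
        tiny = np.finfo(np.float32).tiny
        return np.where(np.abs(x) < tiny, np.float32(0), x)

    print(ftz([1e-37, 1e-39, -1e-39, 0.5]))  # -> [1e-37, 0, 0, 0.5]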

------
BrianMingus
These findings seem to be at odds. The former says that deep linear nets are
useful, non-linear, and trainable with gradient descent. The latter says that
the non-linearity only exists due to quirks in floating point and that
evolution strategies must be used to find extremely small activations that
can exploit the non-linearities in floating point.

Exact solutions to the nonlinear dynamics of learning in deep linear neural
networks

[https://arxiv.org/abs/1312.6120](https://arxiv.org/abs/1312.6120)

"We attempt to bridge the gap between the theory and practice of deep learning
by systematically analyzing learning dynamics for the restricted case of deep
linear neural networks. Despite the linearity of their input-output map, such
networks have nonlinear gradient descent dynamics on weights that change with
the addition of each new hidden layer. We show that deep linear networks
exhibit nonlinear learning phenomena similar to those seen in simulations of
nonlinear networks, including long plateaus followed by rapid transitions to
lower error solutions, and faster convergence from greedy unsupervised
pretraining initial conditions than from random initial conditions."

Nonlinear Computation in Deep Linear Networks

[https://blog.openai.com/nonlinear-computation-in-linear-netw...](https://blog.openai.com/nonlinear-computation-in-linear-networks/)

"Neural networks consist of stacks of a linear layer followed by a
nonlinearity like tanh or rectified linear unit. Without the nonlinearity,
consecutive linear layers would be in theory mathematically equivalent to a
single linear layer. So it’s a surprise that floating point arithmetic is
nonlinear enough to yield trainable deep networks."

~~~
Plough_Jogger
The arxiv paper here is analyzing the nonlinearities in a network's learning
dynamics; exploring why training time / error rates do not vary linearly
throughout the training process.

They note: "Here we provide an exact analytical theory of learning in deep
linear neural networks that quantitatively answers these questions for this
restricted setting. Because of its linearity, the input-output map of a deep
linear network can always be rewritten as a shallow network."

------
dahart
It's super interesting to think that _any_ non-linearity at all can make it
work. This particular non-linearity is surprising since it's clamping to zero
at the _center_ of the response curve. I'd have thought that's right where you
want the linear response, and that clamping in the middle would cause bad
things to happen. Sigmoid and ReLU (and others) clamp at the foot/shoulder.
Perhaps this network just learns negative weights, compared to the traditional
activation functions?

~~~
dnautics
there's a theorem that any nonlinearity works (for sufficiently sized
networks).

~~~
deepnotderp
The universal approximation theorem actually assumes that the nonlinearity is
monotonically increasing, nonconstant and continuous. I don't think floating
point nonlinearities technically satisfy that.

~~~
dnautics
0) nonconstant. Yes: in most cases the floating point nonlinearity maps x =>
x, so it is not constant.

1) bounded. Yes: the nonlinearities are bounded by the range of the FP format.

2) monotonically increasing. Yes. Consider a + b where fp(a + b) < a + b, in
other words the result has been rounded down. Then fp(a + (b - db)) cannot be
rounded up to a number higher than fp(a + b), so the floating point rounding
function fp must be monotonic for the operation +. A similar argument applies
for multiply, and thus for any linear function.

3) continuous function. No. Well, you can't win at everything: no computer
representation can be truly continuous, but it's a reasonable approximation
for the purposes of the approximation theorem, otherwise ML on computers _in
general_ would be hopeless.
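
A brute-force spot check of point 2 in numpy (a sketch, not a proof):

    import numpy as np

    rng = np.random.default_rng(0)
    a = np.float32(rng.standard_normal())
    b = np.sort(rng.standard_normal(100_000)).astype(np.float32)
    s = a + b                       # each float32 add is correctly rounded
    print(np.all(np.diff(s) >= 0))  # True: rounding never breaks monotonicity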

------
SomeStupidPoint
So this exploits the fact that floating point numbers have finite precision
(and perhaps uneven spacing) to generate non-linear operations?

That's actually a really cool usage of the specification!

~~~
dbcurtis
Well, except that the author misunderstands the specification and how it is
typically implemented on modern computers.

~~~
gdb
(I work at OpenAI.)

TensorFlow by default is built with denormals off (ftz=true), so denormals
aren't relevant for the applications we're interested in. We have updated the
post to indicate this — thanks for the feedback!

------
Nokinside
This is one cool hack.

You can't construct a deep neural network from only linear parts because
consecutive layers can always be combined into a single transformation matrix.
That's why you need alternating linear and nonlinear operations.
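
The collapse is easy to check in exact arithmetic; a minimal numpy sketch with
made-up weights:

    import numpy as np

    rng = np.random.default_rng(0)
    W1 = rng.standard_normal((4, 3))
    W2 = rng.standard_normal((2, 4))
    x = rng.standard_normal(3)

    # Two stacked linear layers equal one layer with the combined matrix,
    # up to the rounding slack the blog post exploits.
    print(np.allclose(W2 @ (W1 @ x), (W2 @ W1) @ x))  # True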

I wonder if it's possible to design a special-purpose low-resolution floating
point circuit that maximizes this effect while preserving enough linearity.
Then you would have a fast DNN pipeline constructed from just multiplication
and addition.

~~~
darkmighty
It's also possible to binarize DNNs to use faster bitwise operations:

[https://arxiv.org/pdf/1602.02830.pdf](https://arxiv.org/pdf/1602.02830.pdf)

------
blt
This is cool, but it seems like a crazy hack with no benefit besides "because
it's there". ReLU is already such a cheap function to compute.

------
gugagore
I didn't understand how the gradients are produced to honor this underflow
behavior. Is that the reason why they use "ES" instead of symbolic (or
probably they meant automatic) differentiation?

~~~
dnautics
correct.

------
mark_l_watson
Well, that is cool and not what I would expect. That said, for practical
applications, activation functions like leaky ReLU, tanh, etc. make training
easier.

------
adrianbg
I wonder if this / gradient descent would work with integers, and with mod 2^n
as the nonlinearity.

------
tlarkworthy
I guess the hope is that a round of computation can be shaved off.

------
amelius
Wasn't this already obvious from simple networks which implement e.g. the XOR
function?

~~~
NhanH
A simple network without a nonlinear function (e.g. sigmoid) can't learn XOR.
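
A quick hedged illustration in numpy: even the best purely linear map (with a
bias) predicts 0.5 on all four XOR points.

    import numpy as np

    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    y = np.array([0., 1., 1., 0.])
    A = np.hstack([X, np.ones((4, 1))])        # inputs plus a bias column
    w, *_ = np.linalg.lstsq(A, y, rcond=None)  # best least-squares linear fit
    print(A @ w)                               # ~[0.5 0.5 0.5 0.5]: can't separate XOR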

