
Why Are Eight Bits Enough for Deep Neural Networks? - aburan28
http://petewarden.com/2015/05/23/why-are-eight-bits-enough-for-deep-neural-networks/
======
taliesinb
I had wondered this myself -- it seems reasonable to see limiting the
precision of activations as a form of regularization, as the author alludes to.

For me, the place we'll eventually end up is obviously custom deep learning /
evaluation chips that perform analogue operations using transistors in their
linear regime (like how op-amps work). These chips would be programmed merely
to express the tensor operation graph, essentially analogue tensor FPGAs.

This should bring multiple order-of-magnitude reductions in power consumption
and increases in evaluation speed. And when you don't have a clock, there might
also be interesting ways of dealing with time in which you don't discretize
and unroll, as one currently does with GRUs or LSTMs.

~~~
darkmighty
I agree that this kind of naive analog computing sounds very attractive for
those simple linear operations (linear networks have been exhaustively
studied, and as you noted you essentially need only resistors and amplifiers).
But it's not entirely obvious to me that they ought to be better than digital
electronics at comparable precision (considering their noise) and power
consumption. I think you may get into trouble in the small-current regime due
to quantum mechanics: while you can do digital electronics with only a few
electrons, you may need a large number of them to maintain good linearity.

And then there's the fact that you can deal with exponentially larger numbers
with roughly linearly (or polynomially) increasing memory, while with analog
circuits you have to pay a quadratic cost on the exponential, so ~n^k vs.
~exp(2n) power consumption doesn't look good from this point of view. But who
knows; as the article points out, the nonlinearities of the network may
miraculously make it work even with very poor linearity and poor precision. It
remains to be tested.

~~~
p1esk
_you can deal with exponentially larger numbers with roughly linearly (or
polynomial) increasing memory, while if you use analog circuits you have to
pay a quadratic cost on the exponential_

This does not make sense to me. Can you explain?

I think there might be a misunderstanding of how analog computing is used to
build a neural network. First, a weight is stored as some analog physical
property, typically as charge on a floating gate, or on a capacitor in a
DRAM-type cell. Second, the multiplication is performed by modulating the
analog input signal passing through the floating-gate transistor with the
charge on the floating gate (the weight). Third, the summation is done by
simply summing the currents. Finally, the activation function is implemented
by an op-amp.
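
For intuition, here's a rough numpy sketch of that architecture (my own toy
model, not a circuit simulation; all names are made up). Weights become
conductances, inputs become voltages, output currents sum per Kirchhoff's law,
and an op-amp-like nonlinearity follows:

```python
import numpy as np

def analog_crossbar_layer(v_in, G, noise_std=0.01):
    """Toy model of one analog crossbar layer.

    v_in: input voltages, shape (n_in,)
    G:    weights stored as conductances, shape (n_in, n_out)
          (real conductances are non-negative; signed weights are
          typically realized with differential pairs)
    """
    i_out = v_in @ G                         # Ohm's-law products, currents summed per column
    i_out += np.random.normal(0.0, noise_std, i_out.shape)  # thermal noise on summed currents
    return np.tanh(i_out)                    # op-amp-style saturating activation

rng = np.random.default_rng(0)
v = rng.uniform(-1, 1, size=8)
G = rng.uniform(-1, 1, size=(8, 4))
print(analog_crossbar_layer(v, G))
```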

Regarding power consumption:

1. A digital computer needs thousands of transistors to perform a
multiplication; an analog circuit can do it with a single one.

2. An analog NN stores its parameters (weights) locally, right where they are
needed to perform the computation. A digital NN needs lots of memory transfers
to bring weights from RAM to the ALU and to store intermediate results.

That's why a properly implemented analog NN will always consume much less
power.

~~~
taliesinb
> This does not make sense to me. Can you explain?

I understood the reasoning to be that to increase the range of accurately
representable values in a circuit, you either need to increase the voltage or
current in an analog circuit (to maintain a given accuracy against the noise
baseline), or devote more bits in a digital circuit. The first gives a linear
(or quadratic, for I^2 losses) dependence of power on range; the second,
logarithmic.
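
A quick back-of-the-envelope illustration of that asymmetry (my numbers,
purely illustrative): each extra bit of range costs a digital circuit roughly
one more bit of hardware, while an analog signal must double relative to a
fixed noise floor, so its power (proportional to signal squared) quadruples
per bit:

```python
# digital cost grows ~linearly in bits; analog power grows as (2**bits)**2
for bits in (4, 8, 12, 16):
    digital_cost = bits        # roughly one more wire/flip-flop per extra bit
    analog_power = 4**bits     # relative units: signal 2**bits over noise, power ~ signal**2
    print(f"{bits:2d} bits: digital ~{digital_cost}, analog ~{analog_power:.1e}")
```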

~~~
p1esk
Ah, I see. Well, remember that with analog circuits we are talking about
subthreshold currents. Those are orders of magnitude smaller than the currents
in a digital circuit (nA vs. µA), so the corresponding power consumption will
be negligible in comparison, even if you expand the current range. And that is
only a fraction of the total power consumption: adding more bits in a digital
circuit linearly increases total power, which is dominated by interconnect
capacitance.

~~~
darkmighty
That's an important observation. Fighting noise is one of the primary
reasons the first digital computers were invented.

To give a bit of a dramatic illustration: if your circuit has on the order of
1 nV of thermal noise and you want to do the linear analog equivalent of
64-bit integer arithmetic, you need a signal on the order of 10,000,000,000 V
to have enough precision. In terms of power consumption it's even worse: if
the 1 nV signal consumes something like 1 pW, you would need something like
the total power output of the Sun (on the order of 10^26 W) -- a bit of an
expensive multiplication, no? :) That's how crazy it is!
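
The arithmetic checks out under those assumptions (1 nV noise floor, 1 pW at
the 1 nV level, power scaling with amplitude squared):

```python
noise = 1e-9                         # assumed 1 nV thermal noise floor
signal = noise * 2**64               # amplitude for 2**64 distinguishable levels
power = 1e-12 * (signal / noise)**2  # assumed 1 pW at the 1 nV level

print(f"{signal:.2e} V")   # ~1.84e10 V, i.e. ~10,000,000,000 V
print(f"{power:.2e} W")    # ~3.40e26 W, comparable to the Sun's ~3.8e26 W
```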

Again, if you can get away with less than 8 bits of precision and imperfect
linearity the picture changes, but I wouldn't declare it superior _a priori_
without looking at the numbers.

~~~
p1esk
Or, you could split your 64 bit computation into 8 bit computations, which
could be done with analog circuits, and still save a lot of power! :-)

But yes, I understand your point. Both analog and digital implementations have
their strengths and weaknesses. If you value power over precision, go with
analog. If the opposite - go with digital.

~~~
darkmighty
Right, but note you can't even split it if you're thinking of linear circuits.
Precision necessarily means how large your signal is compared to the thermal
noise floor, and it's possible to show that you can't compose 8-bit-precision
linear units into a result with more than 8 bits of precision. What actually
happens is the opposite: if the noise of the units is uncorrelated, it
propagates and grows on the order of sqrt(number of operations). Avoiding
error propagation is another advantage of digital operations.
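
That sqrt(N) growth is easy to see in simulation (a toy sketch, assuming each
linear operation adds independent Gaussian noise):

```python
import numpy as np

rng = np.random.default_rng(0)

def accumulated_noise(n_ops, sigma=0.01, trials=10_000):
    """Std of total error after chaining n_ops noisy identity operations."""
    return rng.normal(0.0, sigma, size=(trials, n_ops)).sum(axis=1).std()

for n in (1, 4, 16, 64):
    print(f"{n:2d} ops: measured {accumulated_noise(n):.4f}, "
          f"predicted {0.01 * np.sqrt(n):.4f}")  # sigma * sqrt(n)
```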

The reason NNs _don't_ exhibit strong error propagation is the non-linearities
between linear layers, which perform operations analogous to
thresholding/majority voting and therefore have error-correcting properties.

~~~
p1esk
Interesting, but then how do you explain that rectified linear units between
layers work better than sigmoids? By your logic, ReLU should have worse
error-propagation properties than squashing functions.

------
LoSboccacc
did my thesis on this topic (at the time we were searching for the lower bound
on ALU width needed to run NNs on zero-power devices)

it's interesting: NNs degrade at about 6 bits, and that's mostly because the
transfer function becomes stable and training gets stuck in local minima more
often.

we built a two-step training methodology: first you train at 16-bit precision,
finding the absolute minimum, then you retrain at 6-bit precision, and the NN
basically learns to cope with the precision loss on its own.

funny part is, the fewer bits you have, the more robust the network becomes,
because error correction becomes a normal part of its transfer function.

we couldn't make the network converge at 4 bits, however. we tried different
transfer functions, but ran out of time before getting meaningful results
(each function needs its own back-propagation adjustment, and things like that
take time; I'm not a mathematician :D)
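
A minimal numpy sketch of that two-stage idea as I read it (all details
assumed, not the thesis code): stage one trains at full precision, stage two
retrains with the forward pass seeing weights rounded to a 6-bit grid (a
straight-through-style trick), so the network learns to compensate:

```python
import numpy as np

def quantize(w, bits=6, w_max=4.0):
    """Round weights onto a uniform grid of 2**bits levels over [-w_max, w_max]."""
    step = 2 * w_max / (2**bits - 1)
    return np.clip(np.round(w / step) * step, -w_max, w_max)

def train(x, y, w, lr=0.1, epochs=200, bits=None):
    """Toy linear regression; if bits is set, the forward pass uses
    quantized weights while gradients update the full-precision copy."""
    for _ in range(epochs):
        w_eff = quantize(w, bits) if bits else w
        grad = x.T @ (x @ w_eff - y) / len(x)
        w -= lr * grad
    return w

rng = np.random.default_rng(0)
x = rng.normal(size=(256, 8))
y = x @ rng.normal(size=8)

w = train(x, y, rng.normal(size=8) * 0.1)     # stage 1: full precision
w = train(x, y, w, bits=6)                    # stage 2: adapt to 6-bit weights
print(np.abs(x @ quantize(w, 6) - y).mean())  # residual error at 6 bits
```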

~~~
fgimenez
I had similar empirical results on one of my PhD projects for medical image
classification. With small data sets, we got better results with 8-bit data
than with 16-bit. We viewed it as a form of regularization that was extremely
effective on smaller data sets with a lot of noise (x-rays in this case).

~~~
tachyonbeam
When using 8-bit weights, what kind of mapping do you do? Do you map the 8-bit
range into -10 to 10? Do you have more precision near zero or is it a linear
mapping?

~~~
LoSboccacc
Don't know about him, but I was working with [-8, 8] for inputs and [-4, 4]
for weights; atan as the transfer function maps quite well, and there is no
need to oversaturate the next layer.
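
For concreteness, a uniform 8-bit mapping onto ranges like those might look as
follows (my own sketch, not the parent's code):

```python
import numpy as np

def to_code(x, lo, hi, bits=8):
    """Linearly map x in [lo, hi] to an integer code in [0, 2**bits - 1]."""
    levels = 2**bits - 1
    return np.round((np.clip(x, lo, hi) - lo) / (hi - lo) * levels).astype(int)

def from_code(code, lo, hi, bits=8):
    """Map an integer code back to its real value."""
    return lo + code / (2**bits - 1) * (hi - lo)

code = to_code(3.14159, -8.0, 8.0)       # inputs in [-8, 8]
print(code, from_code(code, -8.0, 8.0))  # 178, ~3.17 (step 16/255 ~ 0.063)
```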

------
Houshalter
The problem with digital calculations is that they are deterministic. If a
result is really small, it just gets rounded down to zero, so if you add a
bunch of small numbers you get zero, even when the result should be large.

Stochastic rounding can fix this: you round each step with a probability
chosen so that the expected value is preserved. Usually a small value will
round down to 0, but sometimes it will round up to 1.
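
A minimal sketch of stochastic rounding (my own illustration; the paper linked
below has the real training details):

```python
import numpy as np

rng = np.random.default_rng(0)

def stochastic_round(x):
    """Round up with probability equal to the fractional part, down otherwise,
    so that E[stochastic_round(x)] == x."""
    floor = np.floor(x)
    return floor + (rng.random(np.shape(x)) < (x - floor))

vals = np.full(1000, 0.001)
print(np.round(vals).sum())          # deterministic rounding: 0.0 every time
print(stochastic_round(vals).sum())  # ~1.0 on average, as it should be
```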

Here's a relevant paper using stochastic rounding. Without it, results get
worse and worse before you even reach 8 bits; with stochastic rounding, there
is no performance degradation. You could probably reduce the bits even further
-- I think it may even be possible to get down to 1 or 2 bits:
[http://arxiv.org/abs/1502.02551](http://arxiv.org/abs/1502.02551)

The relevant graph:
[https://i.imgur.com/cOZ4fn3.jpg](https://i.imgur.com/cOZ4fn3.jpg)

------
hyperion2010
Point of interest: if you do the fundamental physics on neuronal membranes,
the number of levels that are actually distinguishable given the noise in the
system is only about 1,000. So even in a biological system there are only
about 4x as many discrete levels as in an 8-bit representation. I realize this
isn't a good match to what is mentioned in the article, but it does put some
constraints on the maximum dynamic range that biological sensors have to work
within.

~~~
tajen
1024 levels = 10 bits. The article mentions 8 bits, which is 256 levels. Now I
get what you mean by 4x.

------
rdlecler1
These networks ought to be robust to minor changes in W. It's the topology
that matters, and frankly most of the weights W_ij != 0 are spurious
connections -- meaning perturbation analysis will show that they play no
causal role in the computation. I wrote a paper on this which has >100
citations (Survival of the Sparsest: Robust Gene Networks are Parsimonious). I
used gene networks, but those are just a special case of neural networks. In
fact, there have been a bunch of papers on gene regulatory networks showing
that topology is the main driver of function -- not surprising: if you show
the circuit diagram of an 8-bit adder to an EE, they'll know exactly what it
computes. Logically it has to be so. In fact, you can model the gene network
of the Drosophila segmentation pattern with Boolean (1-bit) networks.

The problem with ANN research is that few take the time to understand why
things function as they do. We should be reverse engineering these from
biology. Every time a major advance is made in ANNs, neurobiologists say "yes,
we could have told you that ten years ago" -- deep learning is just the latest
example. It will hit its asymptote soon, then people will say that AI failed
to live up to its expectations, then someone will make a new discovery. It's
very frustrating to sit on the sidelines and watch this happen again and
again.
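
The perturbation analysis mentioned above is simple to sketch (hypothetical
code, not from the paper): knock out each weight in turn and flag the ones
whose removal barely changes the outputs:

```python
import numpy as np

def spurious_weights(forward, W, X, tol=1e-3):
    """Flag weights whose removal barely changes outputs over inputs X."""
    baseline = forward(W, X)
    spurious = np.zeros(W.shape, dtype=bool)
    for idx in np.ndindex(*W.shape):
        W_pert = W.copy()
        W_pert[idx] = 0.0                    # remove one connection
        delta = np.abs(forward(W_pert, X) - baseline).max()
        spurious[idx] = delta < tol          # no causal role in the computation
    return spurious

forward = lambda W, X: np.tanh(X @ W)                     # toy one-layer network
rng = np.random.default_rng(0)
W = rng.normal(size=(6, 3)) * (rng.random((6, 3)) < 0.4)  # sparse-ish weights
X = rng.normal(size=(100, 6))
print(spurious_weights(forward, W, X))
```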

~~~
JuliaLang
Care to do it yourself?

------
TD-Linux
>On the general CPU side, modern SIMD instruction sets are often geared
towards float, and so eight bit calculations don’t offer a massive
computational advantage on recent x86 or ARM chips.

This isn't true: modern SIMD instruction sets have tons of operations for
smaller fixed-point numbers, as used heavily in video codecs. Unless the
author meant some sort of weird 8-bit float?

------
afsina
Funny that I just finished the initial implementation of code that uses the
techniques from the paper (Vanhoucke et al.) mentioned in the post.

[https://github.com/ahmetaa/fast-dnn](https://github.com/ahmetaa/fast-dnn)

------
dnautics
Agreed. I'm working on an 8-bit floating point format that is optimized for
learning algos, easy to emulate in software, and also very efficient in
hardware. One of the cool things about this float is that transfer functions
(like the logistic) basically become a lookup table, for really good
performance.

Also, there is no strong need for "zero".
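
That lookup-table point is easy to illustrate (a sketch with an assumed
encoding, not the poster's actual format): with any 8-bit value format, an
activation function reduces to a 256-entry table indexed by the byte:

```python
import numpy as np

# Assume a simple 8-bit fixed-point encoding over [-8, 8) for illustration;
# a different 8-bit float format would only change decode().
def decode(byte):
    return (byte - 128) * (16 / 256)

# Precompute the logistic once for all 256 possible inputs.
LOGISTIC_LUT = np.array([1 / (1 + np.exp(-decode(b))) for b in range(256)])

def logistic8(byte):
    return LOGISTIC_LUT[byte]   # the activation is one table lookup

print(logistic8(128))           # decode(128) = 0.0 -> 0.5
```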

------
Animats
That's fascinating, especially since very slow training, where the weights
don't change much per cycle, is in fashion. One would think that would result
in changes rounding down to zero and nothing happening, but apparently it
doesn't.

------
jokoon
Wouldn't that mean that 8-bit cores would be enough to simulate neural
networks? That could significantly reduce the number of transistors, thus
increasing the number of cores and the amount of parallelism.

------
IgorPartola
I guess this is why rowing crews that regularly practice on choppy water end
up doing better in an average competition (no citation, just something my
coach once told me). Training in adverse conditions results in better built-in
corrections.

------
jsprogrammer
It's not just eight bits; it's 8 bits * the number of nodes.

The bits per node just determine the 'resolution' of an individual node, while
the network as a whole determines how many states can be represented: N nodes
at 8 bits each give up to (2^8)^N = 2^(8N) states.

------
Tobu
Eight bits are enough for me … [https://www.youtube.com/watch?v=GoGpLl-SUfk](https://www.youtube.com/watch?v=GoGpLl-SUfk)

