
Accelerating Neural Networks with Binary Arithmetic - pplonski86
https://software.intel.com/en-us/articles/accelerating-neural-networks-with-binary-arithmetic
======
tc
Geoffrey Hinton described doing this back in 2012. He noted that it learned
more slowly but generalized better, described how it is nearly isomorphic to
dropout, and presented a biological motivation:

[https://www.youtube.com/watch?v=DleXA5ADG78#t=45m32s](https://www.youtube.com/watch?v=DleXA5ADG78#t=45m32s)

[https://www.youtube.com/watch?v=DleXA5ADG78#t=4m24s](https://www.youtube.com/watch?v=DleXA5ADG78#t=4m24s)
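
A minimal sketch of the idea as Hinton describes it there: stochastic binary
units that fire with probability sigmoid(x). The code and names are mine, not
from the talk; the dropout connection is that each unit's output is randomly
zeroed, just with input-dependent odds instead of a fixed rate.

```python
import numpy as np

rng = np.random.default_rng(0)

def stochastic_binary_activation(x):
    """Fire 1 with probability sigmoid(x), else 0."""
    p = 1.0 / (1.0 + np.exp(-x))            # sigmoid firing probability
    return (rng.random(x.shape) < p).astype(x.dtype)

x = np.array([-2.0, 0.0, 2.0])
print(stochastic_binary_activation(x))      # e.g. [0. 0. 1.]
```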

------
quotemstr
It's been known for a while now that quantizing neural networks greatly
increases efficiency without costing much inference accuracy: NNs are great at
dealing with noise, and squashing a float weight into a byte is just adding
noise. These people seem to have quantized "all the way", as it were, which
makes sense.
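
As a rough illustration of squashing a float weight into a byte (a generic
affine quantization sketch of my own, not necessarily the article's exact
scheme):

```python
import numpy as np

def quantize_to_uint8(w):
    """Affine-map float weights onto 0..255; the rounding error
    is exactly the 'noise' the net has to shrug off."""
    lo, hi = float(w.min()), float(w.max())
    scale = (hi - lo) / 255.0 or 1.0        # guard against a constant tensor
    q = np.round((w - lo) / scale).astype(np.uint8)
    return q, scale, lo

def dequantize(q, scale, lo):
    return q.astype(np.float32) * scale + lo

w = np.random.randn(5).astype(np.float32)
q, scale, lo = quantize_to_uint8(w)
print(np.abs(w - dequantize(q, scale, lo)).max())  # noise bounded by scale/2
```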

------
yigitdemirag
IBM's TrueNorth does the same; it's common practice in neural network
inference chips. I think applying those kinds of tricks to training is much
more valuable. Training on hundreds of GPUs and running inference on mobile
chips can only be a temporary arrangement for the deep learning market.

------
deepnotderp
One thing that people conveniently forget when discussing binary nets is that
they take a steep hit in object detection accuracy.

~~~
ktta
How much longer do they have to be trained (the article talked about this but
didn't give numbers)? Do they need more data than neural nets with FP ops? Any
estimates?

One thing that looks attractive to me is that FPGAs can be used very
effectively for this purpose, because these nets will take up far fewer LUTs,
and nets for specific purposes can be hand-written in HDL.

~~~
regularfry
You don't necessarily even need FPGAs for this sort of network. Take a look
at how WISARD worked:
[https://scholar.google.co.uk/scholar?q=igor%20Aleksander%20W...](https://scholar.google.co.uk/scholar?q=igor%20Aleksander%20WISARD)
It's "just" a bunch of RAM.

~~~
ktta
Correct me if I'm wrong, but that looks like it needs specialized hardware.
I'm thinking about how FPGAs, now that they're available on AWS, can be used
for neural networks.

They are also much better than GPUs for in-house deployment because they
consume a lot less power, which in turn reduces cooling costs. They also don't
get 'refreshed' every year, which can be a good or a bad thing: you don't have
to upgrade so frequently, but you may lag behind the rest of the industry on
performance.

FPGAs have often been acknowledged as a low-power solution at the same
precision, but they're tremendously difficult to program. Also, when you want
FP operations on an FPGA you use up a LOT of chip space, because floating
point is what GPUs were built for.

So this kind of lightweight neural net will be much better suited to FPGAs.
I've tried some basic nets, but those are better on GPUs, since you can fit
larger nets without problems _and_ RAM is another limitation of FPGAs.

I'm thinking that these nets would be _perfect_, since they take up much less
space on the FPGA and consume a lot less RAM. (The lower RAM consumption is
talked about here[1].) Also, building the net from simple operations would
ease the programming difficulty.

[1]:
[https://youtu.be/Q17HwA5oY4w?t=1m35s](https://youtu.be/Q17HwA5oY4w?t=1m35s)
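
To make the FPGA fit concrete: with {-1,+1} weights and activations, a dot
product reduces to XNOR plus popcount, which maps directly onto LUTs. A toy
Python sketch of the arithmetic (my illustration, not code from the article):

```python
def binary_dot(a_bits, b_bits, n):
    """Dot product of two {-1,+1} vectors, packed one bit per
    element (+1 -> 1, -1 -> 0): XNOR marks agreeing positions,
    popcount counts them, and matches - mismatches = 2*matches - n."""
    mask = (1 << n) - 1
    xnor = ~(a_bits ^ b_bits) & mask    # 1 wherever the signs agree
    matches = bin(xnor).count("1")      # popcount
    return 2 * matches - n

# a = [+1,-1,+1,-1] -> 0b1010, b = [+1,+1,-1,-1] -> 0b1100
print(binary_dot(0b1010, 0b1100, 4))    # 0, matching the float dot product
```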

~~~
regularfry
The original implementation used custom hardware, but it dates from 1984, so
I think that's forgivable. I don't think the custom hardware is necessary,
though. Fundamentally, the concept is that you replace a thumping great matrix
operation with a set of lookup tables in memory. FPGAs are fun, I'm just not
sure you need them for this. RAM is plenty fast.
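
For a sense of how little machinery that takes, here's a toy n-tuple
discriminator in Python, my simplification of the WISARD scheme (the names and
parameters are mine):

```python
import random

class Discriminator:
    """A toy WISARD discriminator: the binary input is cut into
    random n-tuples, and each tuple addresses its own tiny 'RAM'."""
    def __init__(self, input_bits, tuple_size, seed=0):
        order = list(range(input_bits))
        random.Random(seed).shuffle(order)
        self.tuples = [order[i:i + tuple_size]
                       for i in range(0, input_bits, tuple_size)]
        self.rams = [set() for _ in self.tuples]  # a set stands in for 1-bit RAM

    def _addresses(self, bits):
        for t, ram in zip(self.tuples, self.rams):
            yield ram, tuple(bits[i] for i in t)

    def train(self, bits):
        for ram, addr in self._addresses(bits):
            ram.add(addr)                         # write a 1 at this address

    def score(self, bits):
        return sum(addr in ram for ram, addr in self._addresses(bits))

d = Discriminator(input_bits=8, tuple_size=2)
d.train([1, 0, 1, 1, 0, 0, 1, 0])
print(d.score([1, 0, 1, 1, 0, 0, 1, 0]))  # 4: every RAM recognises its tuple
print(d.score([0, 1, 0, 0, 1, 1, 0, 1]))  # 0 for the inverted pattern
```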

------
deepnet
Binarisation is not for backpropagation, which needs a gradient for the
calculus to work (a binary step is a cliff); that gradient is normally
provided by the slope of the sigmoid.

The OP is only suggesting binarisation for the feedforward use of nets, not
for the learning and training phase.

From the OP: "Real valued gradients are required for SGD to work. The weights
are stored in real valued accumulators and are binarized in each iteration for
forward propagation and gradient computations."
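
A minimal sketch of that loop as I read the quote: real-valued master
weights, a binarized copy for the forward pass, and the gradient passed
straight through the sign function (a simplification, not the article's
actual code):

```python
import numpy as np

rng = np.random.default_rng(0)
w_real = rng.normal(scale=0.1, size=4)  # real-valued accumulator
x = rng.normal(size=4)
target = 1.0

for step in range(200):
    w_bin = np.sign(w_real)             # binarize for this iteration
    y = w_bin @ x                       # forward pass uses binary weights
    grad_y = 2.0 * (y - target)         # d(squared error)/dy
    grad_w = grad_y * x                 # straight-through: treat sign() as identity
    w_real -= 0.01 * grad_w             # SGD updates the real accumulator

print(np.sign(w_real) @ x)              # output of the binarized net
```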

~~~
wpower3
Awesome, thanks.

------
wpower3
I'm new to ANNs. Is the slowdown in training due to the fact that the
activation functions have such a sharp shape? That is the binary version of
activation doesn't look like it has a nice differential to use in something
like gradient decent.
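
To illustrate what I mean (per deepnet's point above): the sigmoid has a
usable slope everywhere, while a hard step's derivative is zero almost
everywhere, so plain gradient descent gets no signal through it. A quick
numerical sketch:

```python
import numpy as np

x = np.linspace(-4, 4, 9)
sigmoid = 1.0 / (1.0 + np.exp(-x))
d_sigmoid = sigmoid * (1.0 - sigmoid)   # nonzero everywhere, peaks at 0.25
d_step = np.zeros_like(x)               # hard threshold: zero slope a.e.
print(d_sigmoid.round(3))
print(d_step)                           # no learning signal for plain SGD
```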

