Hacker News new | past | comments | ask | show | jobs | submit login
XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks (arxiv.org)
74 points by jarmitage on Mar 19, 2016 | hide | past | favorite | 17 comments

Fantastic work. They do rely on floating point numbers for the weight updates (I was secretly hoping that they would have binarized gradient descent).

Seems that CNTK already implemented this.


That quantizes the gradient for compression of the communication between nodes, which is cool, but each node must still calculate a 32 bit floating point gradient locally. What GP is asking for is a way to avoid having any floating point math at all.

If you could implement training with only single-bit operations rather than floating point math, a hardware implementation could be several orders of magnitude faster and more efficient than current CPUs/GPUs. That would certainly usher in a revolution in computer architecture.

It can't work with gradient descent, which requires about 16 bits of precision. The problem is the gradients are small, and so are rounded down to zero. Then when added together they get zero, even though the number should be higher.

However it's been shown this can be solved by using stochastic rounding. But I believe that would require specialized hardware to implement. I'm not sure.

Can’t you just store the logarithm of the value instead of the actual value?

Then rounding should be irrelevant.

Logarithms are both expensive to compute, and don't magically add any more bits of accuracy.

Well, (a) 2-based logarithms are cheap to compute, and (b) add quite some bits of accuracy if the alternative is rounding the value to 0.

They are not cheap to compute, I believe the best way to approximate them is using expensive polynomial approximation which consumes a few cycles and operations.

All they do is map the numbers into a different range. A logarithmic value can represent more numbers near 0, but in order to add two logarithms, they need to be converted back to normal form, where the small numbers are rounded down to zero again.

There's no way around that, as adding a very small number to a very large number will always require many bits of precision to do accurately, regardless what transforms you use on the numbers.

> It can't work with gradient descent, which requires about 16 bits of precision.

Has this been mathematically proven, can you provide a source?

It's more an empirically-verified thing than a mathematical fact, there's nothing magic about 16 bits AFAIK. Empirically 16 bits seems to work well enough for some tasks, taking it down to 8 bits is usually taking it too far, and performance-wise there's not a lot of point playing with values in between e.g. 12 bits.

(Half-float arithmetic is implemented natively in recent CUDA CC5 architectures and is quite convenient, in particular it reduces memory bandwidth by 1/2 which is often the bottleneck.)

Stochastic gradient descent is fairly robust to noisy gradients -- any numerical or quantisation error that you can model approximately as independent zero-mean noise can be 'rolled into the noise term' for SGD without affecting the theory around convergence [0]. It will increase the variance of course, which when taken too far could in practise mean divergence or slow convergence under a reduced learning rate, perhaps to a poorer local minimum.

Extreme quantisation (like binarisation) the error can't really be modelled as independent zero-mean, UNLESS you do the kind of stochastic quantisation mentioned. From what I hear this works well enough to allow convergence, but accuracy can take quite a hit. I don't think it has to be 'implemented natively', although no doubt that would speed it up, a large part of the benefit of quantisation during training is not so much to speed up arithmetic as to reduce memory bandwidth and communication latency.

[0] https://en.wikipedia.org/wiki/Stochastic_approximation#Robbi...

The Nervana 16bit floating point implementations is probably worth noting in this context.

They perform (speed-wise) pretty well: https://github.com/soumith/convnet-benchmarks

What would be the advantage of binarized gradient descent?

For one, more compression during parameter transfer in data parallelism scenarios.

And maybe the possibility of building more efficient hardware. Floating point units take up a lot of transistors (in the ~100K transistor range AFAIK), whereas boolean logic is tiny.

Going to 1-bit precision is probably overkill, but papers have shown how neural networks with 8 bits of precision and simple ALUs can give results equivalent to full FPUs.

58x speedup sounds great, the question that is unanswered here is wheter using more layers can get back the accuracy, and what kind of speedup can be achieved compared to the original version?

Sounds interesting :-) But is there any Github repo or Dockerfile cook book available ? Last year, I have noticed these 2 repos: https://github.com/kevinlin311tw/caffe-cvprw15, https://github.com/kevinlin311tw/Caffe-DeepBinaryCode

How different are they ?

Applications are open for YC Winter 2021

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact