If you could implement training with only single-bit operations rather than floating point math, a hardware implementation could be several orders of magnitude faster and more efficient than current CPUs/GPUs. That would certainly usher in a revolution in computer architecture.
However, it's been shown that this can be solved by using stochastic rounding. I believe that would require specialized hardware to implement efficiently, though -- I'm not sure.
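For what it's worth, stochastic rounding itself is simple enough to do in plain software -- here's a rough NumPy sketch (the function name and grid step are my own choices), contrasted with ordinary round-to-nearest, which leaves a systematic bias:

```python
import numpy as np

def stochastic_round(x, step=1.0 / 256, rng=np.random.default_rng(0)):
    """Round each value of x to a multiple of `step`, picking the upper
    neighbour with probability equal to the remainder, so the rounding
    error is zero-mean (E[round(x)] == x)."""
    scaled = x / step
    lower = np.floor(scaled)
    prob_up = scaled - lower                      # fractional part in [0, 1)
    up = rng.random(np.shape(x)) < prob_up
    return (lower + up) * step

x = np.full(100_000, 0.3)                         # 0.3 is not on the 1/256 grid
print(stochastic_round(x).mean())                 # ~0.3: unbiased on average
print((np.round(x / (1 / 256)) * (1 / 256)).mean())  # nearest rounding: always 0.30078..., biased
```

All it needs is a random number and a comparison per value, so nothing exotic -- hardware support would mainly make it cheap enough to apply on every operation.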
Then rounding should be irrelevant.
All they do is map the numbers into a different range. A logarithmic representation can resolve more values near 0, but in order to add two logarithms they have to be converted back to linear form, where the small numbers get rounded down to zero again.
There's no way around that: adding a very small number to a very large number will always require many bits of precision to do accurately, regardless of what transforms you apply to the numbers.
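A quick illustration of that point in NumPy, using float16 as a stand-in for any narrow format (the particular numbers are mine, chosen to make the rounding visible):

```python
import numpy as np

big   = np.float16(2048.0)
small = np.float16(0.5)

# Direct addition: 0.5 is below half the spacing of float16 around 2048
# (the spacing there is 2.0), so it vanishes from the sum.
print(big + small)                                # 2048.0

# Log-domain "trick": the logs of both numbers are easy to represent,
# but to form the sum we must come back to the linear domain, where the
# same rounding happens on conversion back to float16.
log_big, log_small = np.log(np.float64(big)), np.log(np.float64(small))
print(np.float16(np.exp(log_big) + np.exp(log_small)))  # 2048.0 again
```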
Has this been mathematically proven? Can you provide a source?
(Half-float arithmetic is implemented natively in recent CUDA CC5 architectures and is quite convenient; in particular, it halves memory traffic, which is often the bottleneck.)
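For a rough feel of the storage/accuracy trade-off -- this is host-side NumPy rather than the CUDA fp16 intrinsics, so it only illustrates the halved footprint, not the native arithmetic:

```python
import numpy as np

weights32 = np.random.randn(4096, 4096).astype(np.float32)
weights16 = weights32.astype(np.float16)

print(weights32.nbytes / 2**20, "MiB")   # 64.0 MiB
print(weights16.nbytes / 2**20, "MiB")   # 32.0 MiB: half the bytes moved per load/store

# The cast loses precision, but only at the level of half a unit in the
# last place, which is small for weights of this magnitude.
err = np.abs(weights16.astype(np.float32) - weights32)
print(err.max())                          # on the order of 1e-3
```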
Stochastic gradient descent is fairly robust to noisy gradients -- any numerical or quantisation error that you can model approximately as independent zero-mean noise can be 'rolled into the noise term' for SGD without affecting the theory around convergence. It will increase the variance, of course, which when taken too far could in practice mean divergence, or slow convergence under a reduced learning rate, perhaps to a poorer local minimum.
With extreme quantisation (like binarisation) the error can't really be modelled as independent zero-mean noise, UNLESS you do the kind of stochastic quantisation mentioned. From what I hear this works well enough to allow convergence, but accuracy can take quite a hit. I don't think it has to be 'implemented natively', although no doubt that would speed it up; a large part of the benefit of quantisation during training is not so much to speed up arithmetic as to reduce memory bandwidth and communication latency.
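To make the zero-mean point concrete, here's a rough NumPy sketch (my own naming) of the kind of stochastic binarisation used in, e.g., BinaryConnect-style training: a weight w in [-1, 1] goes to +1 with probability (1 + w) / 2, so the expected binarised value equals w and the quantisation error is zero-mean by construction -- but with large per-sample variance, which is where the accuracy hit comes from.

```python
import numpy as np

def stochastic_binarise(w, rng=np.random.default_rng(0)):
    """Map each weight in [-1, 1] to {-1, +1} such that E[b] == w."""
    p_plus = (1.0 + np.clip(w, -1.0, 1.0)) / 2.0   # P(b = +1)
    return np.where(rng.random(np.shape(w)) < p_plus, 1.0, -1.0)

w = 0.2 * np.ones(100_000)
b = stochastic_binarise(w)
print(b.mean())               # ~0.2: unbiased, so the error looks like zero-mean noise to SGD
print(np.abs(b - w).mean())   # ~0.96: but the per-sample error (variance) is huge
```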
They perform (speed-wise) pretty well: https://github.com/soumith/convnet-benchmarks
How different are they?