
HALP: High-Accuracy Low-Precision Training - chmaynard
http://dawn.cs.stanford.edu/2018/03/09/low-precision/
======
vanderZwan
I have a feeling this could combine really well with John Gustafson's posits.
I bring this up partly because of the paper he wrote with Isaac Yonemoto,
where the latter uses an 8-bit posit variant to build a cheap approximation
of the sigmoid function for neural networks[0], but also because with posits
one can tweak the number of fraction and exponent bits to suit the required
dynamic range and accuracy.

[0]
[http://www.johngustafson.net/pdfs/BeatingFloatingPoint.pdf](http://www.johngustafson.net/pdfs/BeatingFloatingPoint.pdf)
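
For the curious, here's a rough Python sketch of the trick from that paper as I
understand it: with 8-bit, es=0 posits, sigmoid(x) is approximated by flipping
the sign bit of x's bit pattern and shifting it right two places. The decoder
below is my own simplified toy code, not anything from the paper.

```python
# Rough sketch (my own toy code, not from the paper): decode an 8-bit, es=0
# posit and apply the fast sigmoid trick, which operates directly on the
# raw bit pattern: flip the sign bit, then shift right by two.

def decode_posit8(bits):
    """Decode an 8-bit, es=0 posit bit pattern into a Python float."""
    if bits == 0x00:
        return 0.0
    if bits == 0x80:
        return float("nan")            # NaR ("not a real")
    sign = (bits >> 7) & 1
    if sign:
        bits = (-bits) & 0xFF          # negative posits decode via 2's complement
    body = bits & 0x7F
    first = (body >> 6) & 1            # regime is a run of identical bits
    run = 0
    for i in range(6, -1, -1):
        if ((body >> i) & 1) == first:
            run += 1
        else:
            break
    regime = run - 1 if first else -run
    frac_bits = max(7 - run - 1, 0)    # es = 0: everything after the regime is fraction
    frac = body & ((1 << frac_bits) - 1)
    value = (1 + frac / (1 << frac_bits)) * 2.0 ** regime
    return -value if sign else value

def fast_sigmoid_bits(x_bits):
    """sigmoid(x) ~ the posit whose bits are (x_bits XOR 0x80) >> 2."""
    return ((x_bits ^ 0x80) & 0xFF) >> 2

print(decode_posit8(fast_sigmoid_bits(0x00)))   # x = 0.0 -> 0.5 (exact)
print(decode_posit8(fast_sigmoid_bits(0x40)))   # x = 1.0 -> 0.75 (vs. true 0.731)
```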

~~~
dnautics
The sigmoid approximation is nice, but not 100% necessary (I don't know the
cost of computing a sigmoid on a GPU, but I suspect modern Nvidia GPUs have
some way of doing it relatively fast). Note that sigmoid and tanh are still
used in GRUs and LSTMs, and sigmoid, tanh, and their derivatives all have
approximate shortcuts in posits.

There is a single critical step in training that I find gets stuck in the early
phases of learning due to accumulation issues (I solved it by expanding to 16
bits during this phase, which should still be "on-chip"). It's nice to see
another method that does the same! I suspect that batch normalization (which
seems similar to what HALP does) after each step will also help.
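
To make the accumulation issue concrete, here's a toy illustration (my own
sketch, not the actual training code, and the formats are arbitrary): once the
running sum grows large relative to the individual updates, a narrow
accumulator rounds each addition away and the sum stalls, which is exactly the
kind of stall that widening the accumulator fixes.

```python
import numpy as np

# Toy illustration of the accumulation problem (my own sketch, not the actual
# training code; the formats here are arbitrary). Once the running sum grows
# large relative to the individual updates, a narrow accumulator rounds each
# addition away and the sum stalls -- widening the accumulator fixes it.

updates = [1e-4] * 10_000             # updates that should sum to 1.0

acc_narrow = np.float16(0.0)
for u in updates:
    acc_narrow = np.float16(acc_narrow + u)   # narrow accumulator

acc_wide = np.float32(0.0)
for u in updates:
    acc_wide = np.float32(acc_wide + u)       # "expanded" accumulator

print(acc_narrow)   # stalls around 0.25, well short of 1.0
print(acc_wide)     # ~1.0
```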

~~~
vanderZwan
Yes, while I haven't done anything with ML myself, I heard via the 3Blue1Brown
video on Deep Learning[0] that the sigmoid function isn't really used that much
anymore. But I figured that being able to tweak the dynamic range could make
posits a good fit for the rescaling and recentering approach.

You know what, I'll just go ahead and post a link to this article on the Unum
google group, perhaps someone there can add some thoughts[1].

[0]
[https://www.youtube.com/watch?v=aircAruvnKk&t=17m](https://www.youtube.com/watch?v=aircAruvnKk&t=17m)

[1]
[https://groups.google.com/forum/#!topic/unum-computing/RHrQU...](https://groups.google.com/forum/#!topic/unum-computing/RHrQUoyDl5c)

~~~
dnautics
Sigmoid is indeed hardly used as a layer-to-layer transfer function anymore.

But I guarantee you > 20% of the world is activating sigmoid functions in ML
apps every day.

~~~
vanderZwan
Ah, thank you for that correction. I guess the video was accidentally
misleading (they technically didn't say anything about other ML approaches,
but overgeneralising like I did isn't that big of a leap).

------
John_KZ
My takeaway from this article is a worrying trend of AI ASICs being developed
behind closed doors and used only internally in the developer's own
datacenters, or rented out through an abstraction, as a service.

Given that it's 2018 and we still don't have a single FOSS mobile phone in
production, it's very possible that this technology will never leave the thin-
client model, and we'll never see (open/programmable) neuromorphic hardware
sold at retail.

------
deepnotderp
Isn't this equivalent to changing the exponent bias? Like block floating
point?

~~~
yorwba
It's not just changing the exponent bias, but also adjusting the center of the
representable range for gradients to be closer to the current value of the
parameter (which is stored in full precision).
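
Roughly, as I understand the re-centering idea from the article (toy sketch
below, definitely not the authors' code): keep a full-precision center, store
the low-precision value as a small integer offset from that center, and
periodically fold the offset back into the center while shrinking the step
size as training gets closer to the optimum.

```python
# Toy sketch of the re-centering idea as I understand it from the article
# (not the authors' code). A low-precision value is kept as a small signed
# integer offset from a full-precision center; re-centering folds the offset
# into the center and shrinks the step size as training converges.

class BitCentered:
    def __init__(self, center, delta, bits=8):
        self.center = float(center)    # full-precision center
        self.delta = float(delta)      # size of one low-precision step
        self.offset = 0                # signed integer offset, fits in `bits`
        self.max_q = 2 ** (bits - 1) - 1

    def value(self):
        return self.center + self.offset * self.delta

    def add(self, update):
        """Apply an update in low precision: quantize, then saturate."""
        q = int(round(update / self.delta))
        self.offset = max(-self.max_q - 1, min(self.max_q, self.offset + q))

    def recenter(self, new_delta):
        """Fold the offset into the center and switch to a finer scale."""
        self.center += self.offset * self.delta
        self.offset = 0
        self.delta = float(new_delta)

w = BitCentered(center=0.0, delta=1e-2)
w.add(0.37)            # coarse steps while far from the optimum
print(w.value())       # ~0.37
w.recenter(1e-4)       # later: re-center around the full-precision value
w.add(0.0123)          # fine steps are now representable
print(w.value())       # ~0.3823
```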

------
p1esk
CIFAR-10? Is this a joke? Where are the results on ImageNet?

------
senatorobama
So much great work being done at Stanford. Can't wait to see what the next
generation of students will do.

