
Making floating point math highly efficient for AI hardware - probdist
https://code.fb.com/ai-research/floating-point-math/
======
grandmczeb
Here's the bottom line for anyone who doesn't want to read the whole article.

> Using a commercially available 28-nanometer ASIC process technology, we have
> profiled (8, 1, 5, 5, 7) log ELMA as 0.96x the power of int8/32 multiply-add
> for a standalone processing element (PE).

> Extended to 16 bits this method uses 0.59x the power and 0.68x the area of
> IEEE 754 half-precision FMA

In other words, interesting but not earth shattering. Great to see people
working in this area though!

~~~
jhj
At least 69% more multiply-add flops at the same power, iso-process, is nothing
to sneeze at (we're largely power/heat bound at this point), and unlike normal
floating point (IEEE or posit or whatever), multiplication, division/inverse
and square root are more or less free in power, area and latency terms. This is
neither a pure LNS nor a pure floating point design: it is a hybrid of "linear"
floating point (FP being itself a log/linear hybrid, but with a linear
significand) and an LNS log representation, with the summation carried out in
the linear domain.
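
To make the "free" part concrete, here is a toy Python sketch (illustrative
parameters, not the actual encoding): with values stored as fixed-point
log2(x), multiply, divide and square root become integer add, subtract and
shift; addition is the part that needs real work, which is what the hybrid
summation scheme handles.

    import math

    FRAC = 8                                  # fractional bits of the stored log2 value

    def encode(x):
        return round(math.log2(x) * 2**FRAC)  # x > 0; sign/zero handling omitted

    def decode(v):
        return 2.0 ** (v / 2**FRAC)

    a, b = encode(3.0), encode(7.0)
    print(decode(a + b))    # ~21.0  : multiply is an integer add
    print(decode(a - b))    # ~0.43  : divide is an integer subtract
    print(decode(a >> 1))   # ~1.73  : square root is a right shift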

Latency is also a lot lower than an IEEE or posit floating point FMA. This
isn't in the paper: the results were reported at 500 MHz only because the float
FMA couldn't meet timing closure at 750 MHz or higher in a single cycle, and
the paper had to be pretty short for a deadline, so we couldn't explore the
whole frontier and show 1-cycle vs 2-cycle vs N-cycle pipelined
implementations.

The floating point tapering trick applied on top of this can help with the
primary chip power problem, which is moving bits around: you can solve more
problems with a smaller word size because your encoding matches your data
distribution better. Posits are a partial but not complete answer to this
problem, if you are willing to spend more area/energy on the encoding/decoding
(there is a short mention of a learned encoding on this point in the paper).
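
For anyone unfamiliar with what a tapered encoding looks like, here is a rough
sketch of a generic posit decode (an illustrative 8-bit width with es = 1, not
the exact format used in the paper). The variable-length regime run is what
lets the encoding spend bits where the data distribution is dense, and it is
also where the extra decode cost (finding the run length, then shifting) comes
from.

    def decode_posit(bits, n=8, es=1):
        """Decode an n-bit posit: sign, regime run, es exponent bits, fraction."""
        mask = (1 << n) - 1
        if bits == 0:
            return 0.0
        if bits == 1 << (n - 1):
            return float("nan")                      # NaR
        sign = -1.0 if (bits >> (n - 1)) & 1 else 1.0
        if sign < 0:
            bits = (-bits) & mask                    # negative posits: two's complement first
        rest = bits & ((1 << (n - 1)) - 1)           # the n-1 bits after the sign
        lead = (rest >> (n - 2)) & 1
        run, i = 0, n - 2
        while i >= 0 and ((rest >> i) & 1) == lead:  # measure the regime run
            run, i = run + 1, i - 1
        k = run - 1 if lead else -run                # regime value
        remaining = max(i, 0)                        # bits left after the terminating bit
        e_bits = min(es, remaining)
        e = (rest >> (remaining - e_bits)) & ((1 << e_bits) - 1) if e_bits else 0
        e <<= es - e_bits                            # truncated exponent bits read as 0
        f_bits = remaining - e_bits
        frac = rest & ((1 << f_bits) - 1) if f_bits else 0
        sig = 1.0 + (frac / (1 << f_bits) if f_bits else 0.0)
        return sign * 2.0 ** (k * (1 << es) + e) * sig

    print(decode_posit(0b01000000))   # 1.0
    print(decode_posit(0b01010000))   # 2.0
    print(decode_posit(0b01110000))   # 16.0: a longer regime run, coarser steps
    print(decode_posit(0b11000000))   # -1.0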

A floating point implementation that is more efficient than typical integer
math but in which one can still do lots of interesting work is very useful too
(providing an alternative for cases where you are tempted to use a wider bit
width fixed point representation for dynamic range, or a 16+ bit floating
point format).

~~~
grandmczeb
The work is definitely great and I have no doubt we'll see new representations
used in the future. But at least on the chip I work on, this would be a <5%
power improvement in the very best case. For the risk/complexity involved, I
would hope for a lot more.

------
moflome
Not sure why this isn't getting more votes, but it's a good avenue of research
and the authors should be commended. That said, this approach to optimizing
floating point implementations has a lot of history at Imagination
Technologies, ARM and similar low-power inference chipset providers. I
especially like the Synopsys ASIP Designer [0] tool, which leverages the open-
source (although not yet IEEE ratified) LISA 2.0 Architecture Design Language
[1] to iterate on these design issues.

Interesting times...

[0] [https://www.synopsys.com/dw/ipdir.php?ds=asip-designer](https://www.synopsys.com/dw/ipdir.php?ds=asip-designer)

[1] [https://en.wikipedia.org/wiki/LISA_(Language_for_Instruction_Set_Architecture)](https://en.wikipedia.org/wiki/LISA_\(Language_for_Instruction_Set_Architecture\))

------
Geee
A bit off-topic, but I remember some studies about 'under-powered' ASICs, i.e.
running at a lower-than-required voltage and just letting the chip fail
sometimes. I guess the outcome was that you can run with 0.1x the power and get
0.9x the correctness. Usually chips are designed so that they never fail, and
that requires substantially more energy than is needed in the average case. If
the application is probabilistic or noisy in general, additional 'computation
noise' could be tolerated in exchange for better energy efficiency.

~~~
david-gpu
That sounds awful for verification, debugging, reproducibility and safety-
critical systems. Imagine this in a self-driving car. Scary.

~~~
TomMarius
Well, you could simply not use these in a self-driving vehicle.

------
dnautics
Wow! It's kind of a weird feeling to see some research I worked on get some
traction in the real world!! The ELMA lookup problem for 32 bits could be fixed
by using the posit standard, which just has "simple" adders for the section
past the Golomb-encoded section, though you may have to worry about spending
transistors on the barrel shifter.

~~~
jhj
The ELMA LUT problem is in the log -> linear approximation used to perform sums
in the linear domain. This avoids the issue that LNS implementations have had
in the past, which comes from trying to keep the sum in the log domain,
requiring an even bigger LUT or a piecewise approximation of the non-linear sum
and difference functions.
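
As a toy sketch of that mechanism in Python (made-up widths, nothing like the
actual hardware sizing): each log-domain addend goes through a small 2^frac
table into the linear domain, the linear fixed-point values are summed exactly,
and the result is converted back to the log domain once.

    import math

    FRAC = 4                      # fractional bits of the log2 encoding
    ALPHA = FRAC + 1              # fractional bits kept in the linear domain

    # log -> linear table: 2^(f / 2^FRAC), rounded to ALPHA fractional bits
    POW2_LUT = [round(2.0 ** (f / 2**FRAC) * 2**ALPHA) for f in range(2**FRAC)]

    def log_to_linear(v):
        """v encodes log2(x) with FRAC fractional bits; assumes v >= 0 for simplicity."""
        int_part, frac_part = v >> FRAC, v & (2**FRAC - 1)
        return POW2_LUT[frac_part] << int_part   # fixed point, ALPHA fractional bits

    def linear_to_log(x):
        """Round a positive linear fixed-point value back to the log encoding.
        A real design uses a second small table here rather than math.log2."""
        return round(math.log2(x / 2**ALPHA) * 2**FRAC)

    def elma_add(a, b):
        return linear_to_log(log_to_linear(a) + log_to_linear(b))

    one, two = 0, 1 << FRAC                      # log2(1) = 0, log2(2) = 1
    print(elma_add(one, two) / 2**FRAC)          # 1.5625, close to log2(3) ~= 1.585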

This is independent of any kind of posit or other encoding issue (i.e. it has
nothing to do with posits).

(I'm the author)

~~~
dnautics
Thanks for your work!! (And citing us ofc)

Do you think there might be an analytic trick you could use for larger ELMA
sizes that yields semi-accurate results for machine learning purposes?
Although, to be honest, I still think that with a Kulisch FMA and an extra
operation for fused exponent add (e.g. softmax) you can cover most of the
things you'd need 32 bits for with 8.

~~~
jhj
I've thought of that, but the problem is that it needs to linearly interpolate
between the more accurate values, and depending upon how finely grained the
linear interpolation is, you would need a pretty big fixed point multiplier to
do that interpolation accurately.

If you didn't want to interpolate with an accurate slope, and just used a
linear interpolation with a slope of 1 (using the approximations 2^x ~= 1 + x
and log_2(x + 1) ~= x for x in [0, 1)), then there's the issue with the LUTs
that I discuss.
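
As a quick sanity check of how rough those slope-1 approximations are (just an
error scan, nothing specific to the paper): both are exact at the ends of the
interval and off by at most about 0.086 in between.

    import math

    xs = [i / 100000 for i in range(100000)]
    print(max(abs(2.0**x - (1.0 + x)) for x in xs))       # ~0.0861
    print(max(abs(math.log2(1.0 + x) - x) for x in xs))   # ~0.0861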

In the paper I mention that you need at least one more bit in the linear domain
than in the log domain (i.e., the `alpha` parameter in the paper is 1 + the log
significand fractional precision) for the values to be unique, such that
log(linear(log_value)) == log_value, because the slope varies significantly
from 1. If you just took the remainder bits and used them as a linear extension
with a slope of 1 (i.e., pasted the remainder bits on the end, with `alpha` ==
the log significand fractional precision), then log(linear(log_value)) !=
log_value everywhere. Whether or not this is a real problem is debatable, but
it probably has some effect on numerical stability if you don't preserve the
identity.
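
To illustrate the uniqueness point with a toy example (3 fractional log bits,
and accurate rounding rather than the slope-1 paste): with alpha equal to the
log fraction width, adjacent log codes can collapse onto the same linear value
and fail to round-trip, while alpha = frac + 1 round-trips every code.

    import math

    FRAC = 3

    def round_trip_failures(alpha):
        failures = 0
        for code in range(2**FRAC):                                 # log fractions in [0, 1)
            linear = round(2.0 ** (code / 2**FRAC) * 2**alpha)      # log -> linear, alpha frac bits
            back = round(math.log2(linear / 2**alpha) * 2**FRAC)    # linear -> log
            failures += (back != code)
        return failures

    print(round_trip_failures(FRAC))       # 1: one code does not survive the round trip
    print(round_trip_failures(FRAC + 1))   # 0: every code round-trips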

Based on my tests I'm skeptical about training in 8 bits for general problems
even with the exact linear addition; it doesn't work well. If you know what
the behavior of the network should be, then you can tweak things enough to
make it work (as people can do today with simulated quantization during
training, or with int8 quantization for instance), but generally today when
someone tries something new and it doesn't work, they tend to blame their
architecture rather than the numerical behavior of IEEE 754 binary32 floating
point. There are some things even today in ML (e.g., Poincaré embeddings) that
can have issues even at 32 bits (in both dynamic range and precision). It
would be a lot harder to know what the problem is in 8 bits when everything is
under question if you don't know what the outcome should be.

This math type can and should also be used for many more things than neural
network inference or training though.

~~~
nestorD
> It would be a lot harder to know what the problem is in 8 bits when
> everything is under question if you don't know what the outcome should be.

I might have a solution for that: I work on methods to both quantify the
impact of your precision on the result and locate the sections of your code
that introduce significant numerical error (as long as your numeric
representation respects the IEEE standard).

However, my method is designed for testing or debugging the numerical stability
of a codebase, not for use in production (as it impacts performance).

~~~
jhj
None of the representations considered in the paper (log or linear posit or
log posit) respect the IEEE standard, deliberately so :)

~~~
nestorD
You drop denormals and change the distribution, but do you keep the 0.5 ULP
(round to nearest) guarantee from the IEEE standard? And are your rounding
errors exact numbers in your representation (can you build error-free
transforms)?
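
(For reference, on IEEE floats an error-free transform is something like
Knuth's TwoSum below, which recovers the exact rounding error of an addition as
another float; the question is whether an analogue exists in your formats.)

    def two_sum(a: float, b: float):
        """Knuth's TwoSum: a + b == s + err exactly, for IEEE floats."""
        s = a + b                  # rounded sum
        b_virtual = s - a
        a_virtual = s - b_virtual
        err = (a - a_virtual) + (b - b_virtual)
        return s, err

    print(two_sum(1e16, 1.0))      # (1e+16, 1.0): the 1.0 lost by the rounded sum is recovered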

~~~
jhj
For (linear) posits, what the "last place" is varies: relative to a fixed-size
significand, there is no universal 0.5 ulp guarantee. If you are in the regime
of full precision, then there is a 0.5 ulp guarantee. The rounding also becomes
logarithmic rather than linear in some domains (towards 0 and +/- inf); in
those cases it is 0.5 ulp on a log scale rather than a linear one, when the
exponent scale is not 0.

For my log numbers under ELMA (with or without posit-ing), the sum of two
numbers alone cannot be analyzed in a simple ulp framework, I think, given the
hybrid log/linear nature. The two numbers being summed are each approximated in
the linear domain (to 0.5 ulp in the linear domain, assuming alpha >= frac +
1), then summed exactly, but the conversion back to the log domain at the end
is approximate, to 0.5 ulp in the log domain. The result is of course not
necessarily within 0.5 ulp in the log domain. Multiplication, division and
square root always give the exact answer, however (no rounding). The sum of two
log numbers could of course also be done via traditional LNS summation, in
which case there is <0.5 ulp log-domain error.

Kulisch accumulation throws another wrench into the analysis. Summation of many
log-domain numbers via ELMA will usually be far more accurate than traditional
LNS summation with 0.5 ulp (log domain) rounding at each step, because the
compounding of error is minimized, especially when you are summing numbers of
different (or slightly different) magnitudes. Kulisch accumulation for linear
numbers is of course exact, so the sum of any set of numbers rounded back to
traditional floating point is accurate to 0.5 ulp.
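
Here is a minimal sketch of the Kulisch idea for linear numbers (toy widths,
not the paper's sizing): every addend is expanded into one wide fixed-point
accumulator that spans the exponent range, the integer additions are exact, and
rounding happens once at the end.

    ACC_FRAC_BITS = 40                   # toy width; a real Kulisch register covers the full range

    def to_fixed(x: float) -> int:
        # exact for the power-of-two toy values below; hardware expands the
        # decoded significand into the wide register instead
        return int(x * (1 << ACC_FRAC_BITS))

    values = [2.0**30, 2.0**-30, -(2.0**30), 2.0**-30]

    acc = 0
    for v in values:
        acc += to_fixed(v)               # exact integer additions, no intermediate rounding

    print(acc / (1 << ACC_FRAC_BITS))    # 1.86e-09 == 2**-29, the exact sum
    print(sum(values))                   # 9.31e-10 == 2**-30: naive float order loses a small term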

------
sgt101
For those interested in the general area: I saw a good talk about representing
and manipulating floating point numbers in Julia at CSAIL last week by Jiahao
Chen. The code, with some good documentation, is on his GitHub.

[https://github.com/jiahao/ArbRadixFloatingPoints.jl](https://github.com/jiahao/ArbRadixFloatingPoints.jl)

------
davmar
caveat: i haven't finished reading the entire FB announcement yet.

google announced something along these lines at their AI conference last
september and released the video today on youtube. here's the link to the
segment where their approach is discussed:
[https://www.youtube.com/watch?v=ot4RWfGTtOg&t=330s](https://www.youtube.com/watch?v=ot4RWfGTtOg&t=330s)

------
moltensyntax
> Significands are fixed point, and fixed point adders, multipliers, and
> dividers on these are needed for arithmetic operations... Hardware
> multipliers and dividers are usually much more resource-intensive

It's been a number of years since I've implemented low-level arithmetic, but
when you use fixed point, don't you usually choose a power of 2? I don't see
why you'd need multiplication/division instead of bit shifters.

~~~
jhj
Multiplication or division by a power of 2 can be done by bit shift assuming
binary numbers represent base-2 numbers; i.e. not a beta-expansion
([https://en.wikipedia.org/wiki/Non-integer_representation](https://en.wikipedia.org/wiki/Non-integer_representation))
where binary numbers are base-1.5 or base-sqrt(2) or base-(pi-2) or whatever
(in which case multiplication or division by powers of 1.5 or sqrt(2) or (pi-2)
could be done via bit shift).

But when multiplying two arbitrary floating point numbers, your typical case
is multiplying base-2 numbers not powers of 2, like 1.01110110 by 1.10010101,
which requires a real multiplier.

General floating point addition, multiplication and division thus require
fixed-point adders, multipliers and dividers on the significands.
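
A small illustration of the difference, using the two example significands
above held as fixed point with 8 fractional bits: scaling by a power of two is
a single shift, while the general product needs the shift-and-add work that a
hardware multiplier performs.

    a = 0b101110110              # 1.01110110 -> 374/256 ~= 1.461
    b = 0b110010101              # 1.10010101 -> 405/256 ~= 1.582

    print(a << 3)                # multiplying by 2^3 is just a bit shift

    # schoolbook shift-and-add: the work a hardware multiplier does in parallel
    prod = 0
    for i in range(b.bit_length()):
        if (b >> i) & 1:
            prod += a << i
    assert prod == a * b
    print(prod / 2**16)          # ~2.311, the product with 16 fractional bits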

------
saagarjha
I find it interesting that they were able to find improvements even on
hardware that is presumably optimized for IEEE-754 floating point numbers.

~~~
nestorD
It is a trade-off: they find improvements by losing precision where they
believe it is not useful for their use case.

