
Beating Floating Point at Its Own Game: Posit Arithmetic [pdf] - speps
http://www.johngustafson.net/pdfs/BeatingFloatingPoint.pdf
======
Dylan16807
> There are no “NaN” (not-a-number) bit representations with posits; instead,
> the calculation is interrupted, and the interrupt handler can be set to
> report the error and its cause, or invoke a workaround and continue
> computing, but posits do not make the logical error of assigning a number to
> something that is, by definition, not a number. This simplifies the hardware
> considerably.

What a strange claim. Outputting a NaN is easier than raising an interrupt (or
could even be done within an interrupt handler), and detecting a NaN input
requires a handful of gates or less.

This is not to say that NaNs are a good or bad method, but they're definitely
not expensive to implement.
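
For the curious, a minimal sketch of what that detection amounts to, assuming
IEEE 754 single precision (the function name is mine, not any library's):

    #include <stdint.h>
    #include <string.h>

    /* An IEEE 754 single is NaN iff the exponent field is all ones and
     * the mantissa is nonzero. In hardware that's a wide AND over the
     * exponent bits plus an OR over the mantissa bits: a handful of gates. */
    static int is_nan_bits(float f) {
        uint32_t u;
        memcpy(&u, &f, sizeof u);               /* reinterpret the bits */
        return (u & 0x7F800000u) == 0x7F800000u /* exponent == 0xFF */
            && (u & 0x007FFFFFu) != 0u;         /* mantissa != 0 */
    }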

> posits lack a separate ∞ and −∞ [...] “negative zero” is another defiance of
> mathematical logic that exists in IEEE floats.

I will note that the IEEE standard almost had a projective infinity mode, and
x87 has that mode.

> floats are asymmetric and use those bit patterns for a vast and unused
> cornucopia of NaN values

If we want to ignore languages like javascript and lua, sure.
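
(Those languages NaN-box their values. A rough sketch of the trick, with a
payload layout that is illustrative rather than any particular engine's:)

    #include <stdint.h>
    #include <string.h>

    /* A double has roughly 2^52 quiet-NaN bit patterns, enough to smuggle
     * a pointer or small integer inside a value the FPU still sees as NaN. */
    #define QNAN 0x7FF8000000000000ull

    static double box_u32(uint32_t v) {
        uint64_t bits = QNAN | (uint64_t)v; /* value in low payload bits */
        double d;
        memcpy(&d, &bits, sizeof d);
        return d;                           /* still reads as NaN */
    }

    static uint32_t unbox_u32(double d) {
        uint64_t bits;
        memcpy(&bits, &d, sizeof bits);
        return (uint32_t)bits;
    }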

~~~
VikingCoder
> This is not to say that NaNs are a good or bad method, but they're
> definitely not expensive to implement.

I saw him explain that in float, there are too many bit representations that
amount to "NaN". According to a Stack Overflow answer I found:

"IEEE 754 standard defines 16,777,214 32-bit floating point values as NaNs, or
0.4% of all possible values."

That's "expensive" in terms of losing bits of expressiveness that could have
been used to represent actual numbers.
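
A quick sanity check on that figure, assuming the standard single-precision
layout (1 sign bit, 8 exponent bits, 23 mantissa bits):

    #include <stdio.h>

    /* NaN = exponent all ones, mantissa nonzero, either sign:
     * 2 * (2^23 - 1) = 16,777,214 of the 2^32 bit patterns. */
    int main(void) {
        long nans = 2L * ((1L << 23) - 1);
        printf("%ld NaNs = %.2f%% of all patterns\n",
               nans, 100.0 * nans / 4294967296.0);
        return 0;
    }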

~~~
louithethrid
The cost of a NaN operation on chip is the same as any other operation: a NaN
is handled and returned like any other floating point value, always producing
a new NaN, so one error invalidates all of the wrong results downstream of it.

Cost of an interrupt: 100 ns to 1 microsecond (per Quora). Sorry, that
solution is simply not interesting for most settings where floats are used.
There are sensors that hammer out values in real time at such a rate that not
using NaNs means dropping part of your sensor data.

This is a classic case of re-inventing an optimal wheel (fine for racing)
without really looking at the use cases (not fine for a lot of normal
day-to-day driving).

I'm sure it will draw the groupies at conferences.

~~~
Dylan16807
> Cost of a Interrupt: 100 ns to 1 microseconds

That sounds like the time it takes to do a context switch to the OS. A math
error interrupt doesn't need to do a context switch. The cost doesn't need to
be any higher than a branch misprediction at 10-20 cycles.

~~~
louithethrid
I stand corrected, sorry, it was late at night. Yes, of course it's a
floating point operation that goes sour, which would be handled in assembly
and the result then passed back to the program. Still expensive when
encountered en masse, though, on a microcontroller.

Thanks for putting it right before the misinformation could spread.

------
yorwba
The LINPACK benchmark feels a bit like bragging. Of course you can demand an
exact dot-product operation for any implementation of your standard, and then
use it to compute an exact vector-matrix product, but floats could do the same
if it were part of the standard. The fact that it requires a 1024-bit
accumulator makes me doubt that it would be used in a massively parallel
implementation, e.g. on GPUs.
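
For readers who haven't seen the idea, here's a toy of what such an exact
accumulator looks like; this sketch uses GCC/Clang's __int128 over fixed-point
inputs, not the paper's actual 1024-bit quire:

    #include <stdint.h>

    /* If inputs are fixed-point values stored in int64_t, each product
     * fits in 128 bits and a wide accumulator sums them with no
     * intermediate rounding. The posit quire is the same idea, made wide
     * enough (e.g. 1024 bits) to hold any sum of products exactly. */
    static __int128 exact_dot(const int64_t *a, const int64_t *b, int n) {
        __int128 acc = 0;                 /* never rounds */
        for (int i = 0; i < n; i++)
            acc += (__int128)a[i] * b[i]; /* exact 128-bit product */
        return acc;                       /* round once at the very end */
    }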

The overall idea of choosing a number distribution that loses less precision
for common operations and orders of magnitude is pretty compelling, though.

~~~
dnautics
The accumulator is pipelineable. For most graphics and ML applications you
probably don't need the exact dot product. You probably _do_ want it for
scientific applications... There is a trade-off in performance, as always,
but an appeal to calculation speed falls to the retort: sure, you can
calculate wrong answers with higher throughput if you wish.

------
copperx
Here's a C/C++ partial implementation of posits:

[https://github.com/libcg/bfp](https://github.com/libcg/bfp)

~~~
dnautics
bfp uses C++ classes which store the format hyperparameters as fields. If
you'd rather have a C/C++ library whose memory layout reflects the bit width
of the particular posit:

[https://github.com/Etaphase/FastSigmoids.jl](https://github.com/Etaphase/FastSigmoids.jl)

contains both a julia library and a C/C++ library. Posits are implemented as a
C type and a C++ class where the type and class sizes correspond to the
bitwidth.

~~~
libcg
Dev here, I'm planning to transition to a C library with fixed-size posit
implementations. I need all the help I can get :)

FastSigmoids uses float and double types, which would be problematic in
embedded environments.

~~~
dnautics
Shoot me a message. I am planning on making a bitwise/intmath implementation
in C. Things might be accelerated if you know someone interested in sponsoring
it.

------
stephencanon
Posits are actually quite reasonable, but there's a lot of either ignorance or
disingenuousness in this article, which is really too bad. I wish that John
would ditch the hyperbole and solicit feedback from other experts, because
posits are not a bad idea, but the presentation continues to give him the
trappings of a crank.

I'll unpack just the first example that jumped out at me:

> Currently, half-precision (16-bit) IEEE floats are often used for this
> purpose, but 8-bit posits have the potential to be 2-4× faster. An important
> function for neural network training is a sigmoid function, a function f(x)
> that is asymptotically 0 as x → −∞ and asymptotically 1 as x → ∞. A common
> sigmoid function is 1/(1 + e^−x), which is expensive to compute, easily
> requiring over a hundred clock cycles because of the math library call to
> evaluate exp(x), and because of the divide.

"have the potential to be" 2-4x faster? Sure. But until we see an
implementation, 1-2x is _much_ more likely (closer to the 1x end of the
spectrum). Commodity hardware runs 32b IEEE float multiplications with 3-4
cycles latency and single-cycle throughput (or better). 16b can be made faster
if designers care to. There's simply not much "faster" available for posits to
inhabit (2-4x faster than 16b float would be faster than small integer
arithmetic).

Evaluating a sigmoid function requires "over a hundred clock cycles" only in
the most naive possible implementation sitting on top of a lousy math library.
Using 32b floats on a current generation phone or laptop with a decent math
library, a naive scalar implementation of the sigmoid function has a _latency_
of less than 50 cycles. But latency doesn't matter _at all_ in a machine
learning context; we're interested only in throughput. On a machine with
AVX-512, a single core can evaluate a sigmoid function with a throughput of
about 1.25 cycles / input, in full-precision 32b floating-point (i.e. a
relative error of ~10^-7). John's proposed posit implementation has a relative
error of about 10^-1. If we target that error threshold, we can trivially go
below 1 cycle/input in 32b or 16b float on a phone. So IEEE floats are _at
least_ two orders of magnitude faster than he claims. You need to go back more
than 15 years for the numbers that the paper tosses around to even be
plausible.
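
As an illustration of how cheap a ~10^-1-accurate sigmoid is in plain float
(my own sketch, not any particular library's kernel):

    #include <math.h>

    /* A smooth, monotone S-curve from 0 to 1: one divide plus a few
     * adds/multiplies, trivially vectorizable. Worst-case error against
     * 1/(1 + e^-x) is on the order of 0.1, roughly the posit
     * approximation's error level. */
    static inline float sigmoid_cheap(float x) {
        return 0.5f + 0.5f * x / (1.0f + fabsf(x));
    }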

There are several other examples like this in the paper. I don't want to be
too antagonistic, because posits are not a bad idea (actually, I think they're
a pretty good format), but this paper is either ignorant of the state of the
art, or more marketing than science.

~~~
dnautics
I'm just going to be blunt here. John and I have decided that we need to be
more marketing savvy after he's had trouble with several rounds pitching other
floating point formats. Posits are just an intermediate step to try to build
acceptance for valids, so there's a lot of effort put into branding.

A couple of points: 2-4x faster means 2x faster in dot-product-based SIMD and
4x faster in matrix SIMD, assuming that your bottleneck is memory throughput.

The sigmoid function realistically isn't a bottleneck in general, but you
gotta admit it is pretty cool to have a ~zero clock cycle approximation. (I've
tried to rein in John a bit on this one)
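
For reference, the paper's trick amounts to something like this on the raw
8-bit posit bit pattern (es = 0); a sketch, not a full posit implementation:

    #include <stdint.h>

    /* Fast sigmoid for 8-bit posits with es = 0, per the paper: flip the
     * sign bit of the bit pattern and shift right two places. No
     * arithmetic unit, no table lookup. */
    typedef uint8_t posit8;  /* raw bit pattern, no library assumed */

    static inline posit8 fast_sigmoid(posit8 x) {
        return (posit8)((x ^ 0x80u) >> 2);
    }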

~~~
stephencanon
If you're bound by memory throughput, you can't go beyond a 2x speedup
(there's 1/2 as much data to move in an 8b format, whether it's in vectors or
matrices doesn't matter). I still don't see any reasonable expectation for 4x.

~~~
dnautics
If your matrix contents are static during your rate-limiting step (as they
are for most DL applications), your FLOPs scale as O(n^2) relative to the
O(n) memory throughput for the vector operand.

[a b, c d] dot [e, f]

is four multiplies

[a b c, d e f, g h i] dot [j, k, l]

is nine multiplies.
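
In code, the asymmetry looks like this (a toy matvec, with A the static
matrix and x the streamed vector):

    /* An n x n static matrix times a streamed n-vector: n^2 multiply-adds
     * per n values of vector traffic, so arithmetic grows as O(n^2) while
     * per-step memory movement grows as O(n). */
    void matvec(int n, const float A[n][n], const float x[n], float y[n]) {
        for (int i = 0; i < n; i++) {
            y[i] = 0.0f;
            for (int j = 0; j < n; j++)
                y[i] += A[i][j] * x[j];
        }
    }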

------
delhanty
HN discussion from 2 years ago:

[https://news.ycombinator.com/item?id=9943589](https://news.ycombinator.com/item?id=9943589)

~~~
ktta
This is _different_ from unums, as it says in the second line of the
abstract.

~~~
shpx
Actual discussion of posits from 3 months ago

[https://news.ycombinator.com/item?id=14013971](https://news.ycombinator.com/item?id=14013971)

John Gustafson (the author) has an account

[https://news.ycombinator.com/threads?id=jlgustafson](https://news.ycombinator.com/threads?id=jlgustafson)

------
speps
I read the book on Type I unums; it was quite interesting, but I'm not an
expert in this field. However, I appreciate the fact that you don't have to
buy a book to find his new research.

------
CleanCut
That looks really cool. Is there hardware support for posits anywhere?

~~~
jamesaross
John Gustafson is on the board of REX Computing, a small semiconductor
startup. They claim to have taped out last year; I don't know if the chip has
been validated or whether it had posits in silicon, but I know that is one of
their longer-term goals.

I find Gustafson's UNUMS 2.0 more compelling than the first version.

~~~
trsohmers
Founder of REX Computing here, we taped out in July of 2016 and got our
silicon brought up and working back in February. I gave a talk at Stanford
showing our hardware and a tiny bit of software:
[https://www.youtube.com/watch?v=ki6jVXZM2XU](https://www.youtube.com/watch?v=ki6jVXZM2XU)

Type 2 unums are pretty much entirely superseded by Type 3 unums (now given
the name 'posits', as in the OP's linked paper)... they are basically superior
in every way, and I recommend watching John's February 2nd talk at Stanford.

~~~
deepnotderp
Does the Rex Neo have posits?

------
strigeus
What's the smallest positive integer that 64-bit posits cannot represent
exactly?

