
FloatX: A C++ Library for Customized Floating-Point Arithmetic
https://github.com/oprecomp/FloatX
======
ArtWomb
Link to ACM library paper:

[https://dl.acm.org/doi/10.1145/3368086](https://dl.acm.org/doi/10.1145/3368086)

------
helltone
I have encountered many real-world cases where I needed less precision/range
than float/double provide. But usually I found fixed point was a better
solution than reduced-precision floats. I wonder what applications there are
that can deal with reduced precision but somehow still need the range you get
from an exponent?

~~~
CodesInChaos
16-bit floats with an 8-bit exponent and a 7(+1)-bit mantissa are popular for
neural networks, because they have the same range as standard 32-bit floats
while taking half the memory and memory bandwidth.

[https://en.wikipedia.org/wiki/Bfloat16_floating-point_format](https://en.wikipedia.org/wiki/Bfloat16_floating-point_format)
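
The conversion is also trivial to do yourself, since bfloat16 is literally the
top half of a binary32. A rough sketch (helper names are my own, not from any
particular library):

    #include <cstdint>
    #include <cstring>

    // bfloat16 is the top 16 bits of an IEEE binary32 value: same sign and
    // 8-bit exponent, significand cut from 23 to 7 explicit bits.
    std::uint16_t float_to_bfloat16(float f) {
        std::uint32_t bits;
        std::memcpy(&bits, &f, sizeof bits);
        // Round to nearest, ties to even, on the 16 bits being dropped.
        // (NaN payloads can be mangled by this; proper handling omitted.)
        std::uint32_t rounding = 0x7FFFu + ((bits >> 16) & 1u);
        return static_cast<std::uint16_t>((bits + rounding) >> 16);
    }

    float bfloat16_to_float(std::uint16_t b) {
        std::uint32_t bits = static_cast<std::uint32_t>(b) << 16;
        float f;
        std::memcpy(&f, &bits, sizeof f);
        return f;
    }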

~~~
helltone
Interesting, although I believe most neural nets nowadays have moved to
rectified linear (ReLU) activations, which again removes the need for an
exponent and would work really well with fixed point.

Here's a reference using 16-bit fixed point neural nets
[http://ieeexplore.ieee.org/document/7011421/?part=1](http://ieeexplore.ieee.org/document/7011421/?part=1)

------
nwallin
How does this compare to boost::multiprecision?

[https://www.boost.org/doc/libs/1_72_0/libs/multiprecision/doc/html/boost_multiprecision/tut/floats.html](https://www.boost.org/doc/libs/1_72_0/libs/multiprecision/doc/html/boost_multiprecision/tut/floats.html)

~~~
brandmeyer
Ironically, the section labeled "What FloatX is NOT" tells you more about what
it _is_ than what it _isn't_.

It's a system for emulating narrower-precision floating-point arithmetic using
native machine-width floating point. The authors claim that it's much faster
than using the integer unit for this purpose.

Boost::multiprecision and MPFR are libraries to execute higher-precision
arithmetic, commonly using the integer hardware to do so.
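
The core trick is easy to sketch: do each operation on native doubles, then
round the result's significand down to the target width. Just to illustrate
the idea (this is not FloatX's code, and it ignores the narrower exponent
range a real narrow format also has to emulate):

    #include <cmath>

    // Round x to p significand bits, using the FPU's round-to-nearest-even.
    // Purely illustrative: exponent-range clamping and over/underflow are
    // ignored here.
    double round_to_p_bits(double x, int p) {
        if (x == 0.0 || !std::isfinite(x)) return x;
        int e = std::ilogb(x);                       // |x| lies in [2^e, 2^(e+1))
        double scale = std::scalbn(1.0, p - 1 - e);  // put p bits left of the point
        return std::nearbyint(x * scale) / scale;
    }

    // An emulated narrow-precision op: compute in double, then round once more.
    double narrow_add(double a, double b, int p) {
        return round_to_p_bits(a + b, p);
    }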

~~~
jcranmer
Emulating a narrower type using larger floating-point arithmetic can be dodgy,
since you open yourself up to double-rounding scenarios. For the IEEE 754 types
(half, single, double, and quad), the primitive operations (+, -, *, /, sqrt)
are all correctly rounded if you emulate them by converting to the next size
up, doing the math, and converting back down. For non-IEEE 754 types (such as
bfloat16, or the x87 80-bit type), this does not hold, so double rounding is a
real concern.
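
A toy decimal version makes the hazard concrete (purely illustrative, not tied
to any particular format):

    #include <cmath>
    #include <cstdio>

    // Rounding 2.47 straight to an integer gives 2, but rounding it first to
    // one decimal place (2.5) and then to an integer (half away from zero)
    // gives 3. The same effect can bite when a narrow float is emulated
    // through a wider one.
    int main() {
        double x = 2.47;
        double once  = std::round(x);                            // 2
        double twice = std::round(std::round(x * 10.0) / 10.0);  // round(2.5) = 3
        std::printf("once: %g, twice: %g\n", once, twice);
    }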

------
bhuthesh_r
I'm trying to use 16-bit floats for matrix multiplication on x86-64. I found
solutions for ARM and some NVIDIA GPUs but none for any x86-64 chips. Any
pointers in this direction would be helpful.

~~~
integricho
Here is a little sample I threw together; it shows the whole cycle: conversion
to half-floats, conversion back to floats, and a simple multiplication of the
values:

[https://godbolt.org/z/FYu_rK](https://godbolt.org/z/FYu_rK)
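
In case the godbolt link goes stale, the gist is just the F16C conversion
intrinsics; this is a from-memory sketch rather than the exact code behind the
link. Note that before AVX-512 FP16, x86 can only convert to/from half
precision, so the arithmetic itself still runs in float32:

    #include <immintrin.h>   // needs F16C: compile with -mf16c (or -mavx2)
    #include <cstdint>
    #include <cstdio>

    int main() {
        float in[4] = {1.5f, -2.25f, 3.0f, 0.5f};
        float out[4];

        // float32 -> float16, round to nearest even; results are raw 16-bit values
        __m128i half = _mm_cvtps_ph(_mm_loadu_ps(in), _MM_FROUND_TO_NEAREST_INT);

        std::uint16_t raw[8];
        _mm_storeu_si128(reinterpret_cast<__m128i*>(raw), half);
        std::printf("1.5f as half bits: 0x%04x\n", (unsigned)raw[0]);  // 0x3e00

        // float16 -> float32 again; the multiply itself happens in float32
        _mm_storeu_ps(out, _mm_cvtph_ps(half));
        std::printf("%f\n", out[0] * out[1]);                          // -3.375000
    }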

Hope it helps.

~~~
bhuthesh_r
Thank you.

------
igorkraw
Ha, very cool to see this here; I briefly worked with one of the authors in
Zurich ^^

