
Bits in a Float, and Infinity, NaN, and Denormal (2012) - Cieplak
https://www.cs.uaf.edu/2012/fall/cs301/lecture/10_24_weirdfloat.html
======
wjakob
This analysis is somewhat dated and leaves out one important fact: nowadays,
floating point arithmetic is carried out using a set of special scalar SSE
instructions (and not the ancient x87 co-processor, as was done in the
author's benchmark).

SSE instructions remove performance pitfalls related to infinities and NaNs.
The only remaining case where slowdowns are to be expected is denormals (which
can be flushed to zero if desired).
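
For reference, a minimal sketch of how flush-to-zero can be enabled on x86
with the standard SSE intrinsics (FTZ affects results, DAZ affects inputs):

    #include <pmmintrin.h>   // _MM_SET_DENORMALS_ZERO_MODE (SSE3)
    #include <xmmintrin.h>   // _MM_SET_FLUSH_ZERO_MODE
    
    int main() {
        // FTZ: results that would be denormal are flushed to zero.
        _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
        // DAZ: denormal inputs are treated as zero as well.
        _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);
    
        // Subsequent SSE float math on this thread now avoids the
        // slow microcoded denormal path.
    }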

In other words: it's perfectly fine to work with infinities and NaNs in your
code.

------
phkahler
You may also want to read about posits:

https://www.johndcook.com/blog/2018/04/11/anatomy-of-a-posit-number/

I'm a fan of these not because of the claims regarding precision, but because
they drop all the complexity and baggage of IEEE floating point.

~~~
jcranmer
> There’s only one zero for posit numbers, unlike IEEE floats that have two
> kinds of zero, one positive and one negative.

> There’s also only one infinite posit number.

Those two things are deal-breakers for me. Yes, having positive and negative 0
can be useful: there are times when you want to think of 0 not as "this is
exactly 0" but as "this value underflowed our range", and it matters whether
you underflowed from the positive side or the negative side. Of course, using
IEEE-754 to check for exactly one of positive and negative 0 is painful.
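
A quick illustration of that pain, assuming C++: equality cannot tell the two
zeros apart, so the sign bit has to be inspected separately (std::signbit is
the usual tool):

    #include <cmath>
    #include <cstdio>
    
    int main() {
        double pos = 0.0, neg = -0.0;
    
        // operator== considers the two zeros equal...
        std::printf("0.0 == -0.0: %d\n", (int)(pos == neg));       // prints 1
    
        // ...so the sign bit must be checked separately.
        std::printf("signbit(0.0):  %d\n", (int)std::signbit(pos)); // 0
        std::printf("signbit(-0.0): %d\n", (int)std::signbit(neg)); // 1
    }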

Similarly, having NaN as a distinct type can be useful. You get to distinguish
between "this computation shrank too small to be represented", "this
computation grew too large to be represented", and "this computation makes no
mathematical sense". Posits don't give you that. Furthermore, as many language
runtimes have discovered, the sheer number of distinct NaN bit patterns means
you can represent every pointer and integer as a tagged NaN.

The only thing in IEEE-754 I would truly toss in a heartbeat is that x != x
holds true for NaN values.
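
For the curious, a minimal NaN-boxing sketch in C++ (illustrative only, not
any particular runtime's scheme): a quiet NaN pins the exponent bits, leaving
roughly 51 mantissa bits free for a payload, and the x != x quirk doubles as
the NaN test:

    #include <cstdint>
    #include <cstdio>
    #include <cstring>
    
    // Quiet-NaN bit pattern: exponent all ones, top mantissa bit set.
    constexpr uint64_t QNAN = 0x7FF8000000000000ULL;
    
    // Stash a small integer payload in the low mantissa bits.
    double box(uint32_t payload) {
        uint64_t bits = QNAN | payload;
        double d;
        std::memcpy(&d, &bits, sizeof d);  // type-pun without UB
        return d;
    }
    
    uint32_t unbox(double d) {
        uint64_t bits;
        std::memcpy(&bits, &d, sizeof bits);
        return static_cast<uint32_t>(bits);
    }
    
    int main() {
        double v = box(42);
        std::printf("is NaN: %d, payload: %u\n", (int)(v != v), unbox(v));  // 1, 42
    }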

~~~
tehjoker
I recently implemented Dijkstra's shortest path over nearest-neighbor nodes
on a 3D image problem. While IEEE-754 is usually just a pain for me, in this
case it was pretty cool. Instead of allocating a separate boolean array of
"visited" nodes, I just used negative numbers to denote visited. Since
negative zero is a thing, I didn't have to add any exceptions to handle zero
distance.

~~~
phkahler
That's a nice hack, but you're violating what many people consider a tenet of
writing good code. I don't consider that an argument in favor of negative zero
and all the baggage of IEEE 754.

~~~
tehjoker
Fwiw, in my code, I got to write "-0" explicitly, and the check is
std::signbit(x)
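
A sketch of the trick, assuming a plain array of distances (the helper names
are mine, not from the original code):

    #include <cmath>
    #include <cstddef>
    #include <vector>
    
    // The sign bit of each stored distance doubles as the "visited" flag.
    // std::copysign works even when the distance is exactly 0, because
    // -0.0 is a distinct representable value.
    void mark_visited(std::vector<double>& dist, std::size_t i) {
        dist[i] = std::copysign(dist[i], -1.0);  // force the sign bit on
    }
    
    bool visited(const std::vector<double>& dist, std::size_t i) {
        return std::signbit(dist[i]);            // true for -0.0 too
    }
    
    double distance(const std::vector<double>& dist, std::size_t i) {
        return std::fabs(dist[i]);               // strip the flag
    }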

------
kimburgess
Grokking the way that floats work really is a lot of fun.

Years ago I put together a math library
(https://github.com/KimBurgess/netlinx-common-libraries/blob/master/math.axi)
for a domain-specific language that had some "limited" capabilities. All
functionality had to be achieved through a combination of some internal
serialisation functions and bit twiddling.

It was simultaneously one of the most painful and interesting projects I've
done.

------
slivym
On a personal note, this representation annoys me:

value = (-1)^sign * 2^(exponent-127) * 1.fraction

It should be:

value = (-1)^sign * 2^(exponent-127) * (1 + fraction*2^-23)

It sounds trivial, but you can't reason mathematically about the first
equation.
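
The second form can be evaluated directly, which is the point. A sketch
decoding a single-precision float by hand in C++:

    #include <cmath>
    #include <cstdint>
    #include <cstdio>
    #include <cstring>
    
    int main() {
        float f = 6.5f;
        uint32_t bits;
        std::memcpy(&bits, &f, sizeof bits);
    
        uint32_t sign     = bits >> 31;
        uint32_t exponent = (bits >> 23) & 0xFF;
        uint32_t fraction = bits & 0x7FFFFF;
    
        // value = (-1)^sign * 2^(exponent-127) * (1 + fraction*2^-23)
        double value = (sign ? -1.0 : 1.0)
                     * std::pow(2.0, (int)exponent - 127)
                     * (1.0 + fraction * std::pow(2.0, -23));
    
        std::printf("%f\n", value);  // prints 6.500000
    }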

~~~
dbcurtis
I assume what you are saying is to assume a 1 in the MSB of the mantissa. That
has been done. HP minicomputers, I believe, and maybe some others of that era.

Unfortunately, that means that you have no way to represent numbers in the
denormal binade, which leads to severe problems with monotonicity. As you move
to binades with smaller exponents, the distance between representable numbers
halves in all the normalizable binades. Unless you allow for denormals, you
have a GIANT jump from the smallest normalizable number to zero.

This leads to problems in numerical algorithms. Taking differences to find
slopes gets unstable as you approach convergence, causing convergence to fail.
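
To put numbers on that jump (a sketch in C++): the spacing just above the
smallest normal float is 2^-149, but without denormals the next value below
it would be 0, a gap 2^23 (about 8 million) times larger:

    #include <cstdio>
    #include <limits>
    
    int main() {
        float smallest_normal = std::numeric_limits<float>::min();         // 2^-126
        float smallest_denorm = std::numeric_limits<float>::denorm_min();  // 2^-149
    
        std::printf("smallest normal:   %g\n", smallest_normal);
        std::printf("smallest denormal: %g\n", smallest_denorm);
        // Denormals fill the gap between 2^-126 and 0 in steps of 2^-149,
        // preserving monotonic spacing all the way down to zero.
    }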

~~~
ruds
The complaint is about the text in the article, which reads "The hardware
interprets a float as having the value:

    value = (-1)^sign * 2^(exponent-127) * 1.fraction"

"1.fraction" is nonsense that doesn't really mean anything, whereas (1 +
fraction * 2^-23) does mean something.

~~~
kevin_thibedeau
It's just a different notation, same as 1234 is shorthand for

    1*10^3 + 2*10^2 + 3*10^1 + 4*10^0

------
protonfish
In my opinion, the confusion that arises when programmers get results from
floating-point computations that are not what they expect stems from this
misconception:

> Floats represent continuous values.

But as you probably know, this isn't possible. The concept of infinite
precision is interesting in theory, but disappears when any actual calculation
needs to be made, whether on a digital electronic computer or not.

I wonder if this is not a flaw in the crude mechanical representation of
numbers, but a flaw in the decision to base floating-point computation on the
concept of continuous numbers. I believe that a better model for floating-
point computational representation and manipulation would be to reflect the
rules of scientific measurements - that each number includes an explicit
amount of precision that is preserved during mathematical operations.

This would not only keep JavaScript newbies from freaking out when they add
0.1 and 0.2, but prevent problems of thinking calculation results are correct,
when they are not.

If you aren't getting what I am saying, let me give an example. Let's say, for
some reason, you want to measure the diameter of a ball. You have a measuring
tape so you wrap it around the widest part and record that it is 23.5cm. To
calculate the diameter, you divide by π. If you do this in double-precision
floating point, you will get 7.480282325319081, but this is nonsense. You
can't create a result that is magically more precise than your initial
measurement through division or multiplication. The correct answer is 7.48cm.
This preserves the amount of precision of the least precise operand, and is
arguably the most correct result.
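
A back-of-the-envelope sketch of that rule in C++ (round_sig is my own
illustrative helper, not a standard function):

    #include <cmath>
    #include <cstdio>
    
    // Round x to a given number of significant figures.
    double round_sig(double x, int figs) {
        if (x == 0.0) return 0.0;
        double mag   = std::floor(std::log10(std::fabs(x)));
        double scale = std::pow(10.0, figs - 1 - mag);
        return std::round(x * scale) / scale;
    }
    
    int main() {
        const double pi = 3.14159265358979323846;
        double circumference = 23.5;           // 3 significant figures
        double diameter = circumference / pi;  // 7.480282325319081...
        std::printf("%.3g cm\n", round_sig(diameter, 3));  // 7.48 cm
    }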

~~~
im3w1l
I've seen this idea of storing the precision mentioned many times on HN, but I
must say I don't believe in it outside of some few niches.

First reason being that it's much more complex and it's unclear what the
complexity buys us.

Second reason is that it doesn't model how variables co-vary. As a toy example
imagine that I have a number x: 5+-1.

Then I let y = x - 1: 4+-1

Finally I let z = 1 / (x - y).

Now, by construction z will be very close to 1. But a system naively tracking
uncertainties will be very concerned about x - y. If it does a worst-case
analysis it gets 1+-2. If it does an average-case analysis assuming
independent Gaussian errors it gets 1+-sqrt(2). When we perform the division,
the uncertainty goes infinite.
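
A toy sketch of that failure mode in C++ (the Uncertain type and worst-case
rules are illustrative, not a real library):

    #include <cstdio>
    #include <limits>
    
    struct Uncertain { double value, err; };   // value +- err
    
    // Worst-case subtraction: errors add, because the tracker
    // cannot see that the operands co-vary.
    Uncertain sub(Uncertain a, Uncertain b) {
        return { a.value - b.value, a.err + b.err };
    }
    
    Uncertain inv(Uncertain a) {
        // If [value-err, value+err] straddles 0, 1/a is unbounded.
        if (a.value - a.err <= 0.0 && a.value + a.err >= 0.0)
            return { 1.0 / a.value, std::numeric_limits<double>::infinity() };
        double lo = 1.0 / (a.value + a.err), hi = 1.0 / (a.value - a.err);
        return { (lo + hi) / 2.0, (hi - lo) / 2.0 };
    }
    
    int main() {
        Uncertain x{5.0, 1.0};
        Uncertain y = sub(x, {1.0, 0.0});  // y = x - 1: 4 +- 1
        Uncertain d = sub(x, y);           // x - y: 1 +- 2, though truly exactly 1
        Uncertain z = inv(d);
        std::printf("z = %g +- %g\n", z.value, z.err);  // z = 1 +- inf
    }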

~~~
protonfish
I don't see any reason to claim that explicit-precision floating point is more
complex, just different. Yes, change is uncomfortable and takes effort, but
that does not necessarily mean the new way is inherently more complicated. I
worry about objections based on "that's not what we are taught in school." I
think that what we teach can (and should) be improved if need be, and not used
as a motivation to deny criticism of established dogma.

I am not sure if I fully understand your example, but I don't see any problem
with it. Using basic significant figure rules, this is (with an additional
step for clarity):

    x: 5e0
    y = x - 1: 4e0
    z1 = x - y: 1e0
    z = 1/z1 = 1/1e0
    z = 1e0

The answer seems to be simply 1+-1. "Significant figures" are a simplification
of precision, where precision is an integer giving the total digits of the
least precise measurement. A more accurate approach is to represent precision
as a standard deviation, then calculate the precision of the result with basic
statistical techniques.

------
nayuki
The article content is decent, but I can't stand the fact that the author used
this bit of CSS to make all the text unreasonably small: <style> body, div,
... { font-size: x-small } </style>

