
There Are Only Four Billion Floats, So Test Them All (2014) - tosh
https://randomascii.wordpress.com/2014/01/27/theres-only-four-billion-floatsso-test-them-all/
======
cperciva
On the topic of "test them all", I wrote this in my paper about error bounds
and worst cases for complex multiplication
([http://www.daemonology.net/papers/complexmultiply.pdf](http://www.daemonology.net/papers/complexmultiply.pdf)):

 _[For finding worst-case inputs] the approach taken was to perform an
exhaustive search, taking several hours on the second author’s laptop, of IEEE
single-precision inputs, using only a few arguments from Theorem 1 to prune
the search._

Even though (2^32)^4 was far too large for it to be tractable to test _all_
4-tuples of single-precision inputs, a "pruned" search was able to easily
cover or exclude the entire search space.

~~~
gajjanag
Nice!

I have noticed that there is a lot of high quality mathematical/numerical
software coming out of INRIA; and have often wondered of late why we do not
have an equivalent in the US. Do you have any thoughts on why this is the
case; and what the closest equivalent in the USA may be?

~~~
hinkley
It's more a matter of "what have they done for us lately". We had a number of
supercomputing centers in the '90s (at least one has since shuttered), and
among other things NCSA brought most people their first telnet client and
their first web browser via NSF funding that Al Gore pushed for (see also:
"Al Gore invented the Internet").

After that they were working with INRIA on collaboration software fifteen
years before WebEx was a thing. They also had large divisions looking at VR
and data visualization.

~~~
hinkley
Oh, and also Apache web server was a fork of NCSA's httpd.

------
tikhonj
Unfortunately, this doesn't scale well beyond a single 32-bit floating point
number.

The good news is that if your code works with doubles or takes multiple
inputs, you might still be able to check it exhaustively using an SMT solver
like Z3, which supports a "theory of floating point numbers"[1]. A solver like this
symbolically models floating point operations at a _binary_ level, letting it
use an efficient pruning search to solve problems about the _exact_ behavior
of floating point numbers.

I haven't tried it myself so I don't know quite how well it scales, but it
should definitely work better than a loop across all possible inputs!

[1]:
[https://pdfs.semanticscholar.org/db9f/2cc0bf18c4661f5629b088...](https://pdfs.semanticscholar.org/db9f/2cc0bf18c4661f5629b088dfb8c8a2193506.pdf)

~~~
conistonwater
I want to disagree with you about scaling: most formulas that one might
reasonably come up with don't depend in any serious way on the number of bits.
So suppose you have a formula that you can test exhaustively with 32-bit
floats, and you were to implement it on 64-bit floats. If it fails a couple of
times on 32-bit floats, giving it a failure rate of a couple per billion, the
same failures are likely to show up in testing with 64-bit floats, a couple
per billion. A couple per billion is good enough to be triggered by purely
random tests. It's only specifically the _exhaustive_ part that doesn't scale,
but that part is not really necessary for this to be useful.

Z3's support for floating point theories is really cool, but sometimes the
simpler brute-force approach works too!

IOW, you could change the message to "Your formula's failure rate is either
exactly zero or larger than a few per billion, so random testing is good
enough", and I think it would still be good, insufficiently-applied advice.

~~~
tikhonj
That's a good point—even a few hundred random test cases are usually enough to
cover "normal" code. I use QuickCheck for that at work, and it's enough to
both find bugs I wouldn't see otherwise and to make me confident in my code
when it doesn't find any bugs. (As a bonus, it also tries to "reduce" failing
cases, which can really help with debugging.)

It's worth tuning your random input generator to produce inputs that commonly
cause problems like -0.

I can still think of a few situations where exhaustiveness might be worth the
effort:

\- verifying code passed in from the outside (security considerations?)

\- verifying code generated automatically (i.e., as part of program synthesis)

\- verifying very general code you expect to be used in hard-to-predict
contexts (i.e., library routines or compiler optimizations)

~~~
brucedawson
A few hundred random test cases? Double-precision numbers can have 2,048
different exponents. I've seen multiple bugs that only happen with denormals
so you need thousands of tests to have any hope of hitting those.

If you're going to do random testing then you need to have a huge set of
special numbers (denormals, NaNs, infinity, zero, exact powers of two, one
more or less than exact powers of two) and you should do millions or billions
of random numbers. Then you can start to feel confident.
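
A sketch of that setup for doubles (the bit patterns are standard IEEE-754
encodings; the helper and list names are just for illustration):

```python
# Yield guaranteed-troublesome doubles first, then uniform random patterns.
import random
import struct

def bits_to_double(bits: int) -> float:
    """Reinterpret a 64-bit pattern as an IEEE double."""
    return struct.unpack("<d", struct.pack("<Q", bits))[0]

SPECIAL_BITS = [
    0x0000000000000000,  # +0.0
    0x8000000000000000,  # -0.0
    0x0000000000000001,  # smallest positive denormal
    0x000FFFFFFFFFFFFF,  # largest denormal
    0x0010000000000000,  # smallest positive normal
    0x3FF0000000000000,  # 1.0 (an exact power of two)
    0x3FEFFFFFFFFFFFFF,  # just below 1.0
    0x3FF0000000000001,  # just above 1.0
    0x7FEFFFFFFFFFFFFF,  # largest finite double
    0x7FF0000000000000,  # +infinity
    0xFFF0000000000000,  # -infinity
    0x7FF8000000000000,  # a quiet NaN
]

def test_inputs(n_random: int, seed: int = 0):
    """Yield every special value, then n_random uniform bit patterns."""
    rng = random.Random(seed)
    for bits in SPECIAL_BITS:
        yield bits_to_double(bits)
    for _ in range(n_random):
        yield bits_to_double(rng.getrandbits(64))
```

Choosing bit patterns directly means the denormals, NaNs, and near-power-of-two
values are guaranteed to be hit rather than hoped for.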

~~~
im3w1l
>It's worth tuning your random input generator to produce inputs that commonly
cause problems like -0.

This would definitely include denormals, as well as +-inf, NaN, odd and even
integers (both positive and negative), +-1, and some .5 values.

------
simonw
If you're interested in this style of testing and use Python,
[https://github.com/HypothesisWorks/hypothesis-python](https://github.com/HypothesisWorks/hypothesis-python) is a very
powerful library that lets you create parametrized tests which will
automatically be run against a set of arguments designed to exercise those
interesting edge-cases.

~~~
masklinn
Hypothesis is a pretty different style of testing; it is not intended for
_exhaustive_ testing.

Exhaustive testing you can easily do in pytest directly using parametrized
test cases.

------
pjc50
Just don't try testing the nine billion names of God.

~~~
wolf550e
for whoever downvoted 'pjc50:

[https://en.wikipedia.org/wiki/The_Nine_Billion_Names_of_God](https://en.wikipedia.org/wiki/The_Nine_Billion_Names_of_God)

~~~
umanwizard
Just because something was a valid reference doesn't mean it shouldn't be
downvoted. I fail to see how it's relevant to the article.

~~~
planteen
Often humorous and witty comments are my favorite parts of HN. They are far
better than posts on a topic where people seem not to have read the article.

~~~
dasil003
Sure, most people enjoy humor or wit, but most people aren't as funny as they
think, and I think even you'd agree the luster tends to wear off once the top
thread on every post is a chain of memes, in-jokes, and stupid puns nested too
deep to fit comfortably on any screen.

That is the natural entropic end-state of all large social news sites. HN has
anti-bodies against that by downvoting a bit more aggressively than might seem
warranted at first glance, but all in all I think it's a wise choice.

------
mkl
Previous discussion:
[https://news.ycombinator.com/item?id=7135261](https://news.ycombinator.com/item?id=7135261)

~~~
mark-r
Does this need a (2014) on the title? Certainly worth seeing if you missed it
before though.

~~~
brucedawson
Yes! I'm the author of the article but, still, I hate it when old articles are
posted without acknowledgment that they are old.

I mean, it's a _timeless classic_, but the fact that it's three years old is
still worth acknowledging.

------
ericfrederich
I remember doing this. In Python I wanted to know if a float -> str -> float
would result in the same number... it does.

But it wasn't really documented or guaranteed that it would, so I wrote some
bit-bashing code to generate all 2^32 bit patterns.
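
That bit-bashing check can be sketched with the struct module; sweeping all
2^32 patterns is the same loop, just longer (NaN patterns need separate
handling, since repr drops the payload):

```python
# Check that a float32 bit pattern survives a float -> str -> float round trip.
import struct

def roundtrips(bits: int) -> bool:
    """True if the float32 with this bit pattern survives repr() and parsing."""
    (x,) = struct.unpack("<f", struct.pack("<I", bits))
    y = float(repr(x))  # float -> str -> float
    return struct.pack("<f", x) == struct.pack("<f", y)
```

Spot-checking some awkward stretches (denormals, values near 1.0, the finite
extremes, infinities) is a quick sanity test before committing to the full
sweep.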

------
jepler
After reading about the case the author knew was not right (negative zero), I
wonder why the relevant "equality" test wasn't more like

    
    
        if (isfinite(testValue.f) && isfinite(refValue.f)
                ? testValue.i == refValue.i
                : (fpclassify(testValue.f) == fpclassify(refValue.f)
                   && signbit(testValue.f) == signbit(refValue.f)))

i.e., if the numbers are both finite [including denormals], they must have the
same bit pattern; otherwise, they have to be the same kind (i.e., both
infinite, or both nan) and have the same sign (my manpage for signbit says NaN
has a sign, but that's news to me actually).

I doubt it's that I know more than Bruce Dawson about FP; is it about working
in the distant past (2014, cough) without compiler support for C99 stuff like
fpclassify and signbit?

~~~
conistonwater
I don't like testing for NaN's using == and != either, but the only values
that compare the same while having different bit patterns are ±0 and all the
NaN's, so as long as you get those right, the plain equality is fine. Also, I
think your test has a corner case with 0x7ff8000000000000 and
0xfff8000000000000, because NaN's can have sign bits too, while the intention
in the post is to treat any NaN as matching any NaN. So the test for equality
could be something like "isnan(a) == isnan(b) && (isnan(a) || (a == b &&
signbit(a) == signbit(b)))" without any bitwise manipulation.
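
As a quick sketch in Python (with math.copysign standing in for signbit):

```python
# Any NaN matches any NaN; otherwise require equal values and equal signs.
import math

def same_result(a: float, b: float) -> bool:
    if math.isnan(a) or math.isnan(b):
        return math.isnan(a) and math.isnan(b)
    # a == b alone would call +0.0 and -0.0 equal; the sign check splits them
    return a == b and math.copysign(1.0, a) == math.copysign(1.0, b)
```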

------
jheriko
True in 2009, true today...
[http://jheriko-rtw.blogspot.co.uk/2009/04/understanding-and-improving-fast.html](http://jheriko-rtw.blogspot.co.uk/2009/04/understanding-and-improving-fast.html)

------
rocqua
Something is wrong with the encoding of the code snippets on this site. I
spent quite a while staring at "while (i &lt;= stop)" until I realized those
are HTML escape codes.

Weird, because I am running standard Chrome on Windows 7.

~~~
brucedawson
Fixed. Wordpress mangled the code when I made an unrelated change.

------
dmitriid
Also: property-based testing.

\- Haskell, Erlang: QuickCheck

\- Elixir: StreamData

\- Clojure/ClojureScript: clojure.spec

other languages have their own implementations. It may not cover
_everything_, but it will cover more than manually hardcoding values in your
tests.

------
valbaca
Now do doubles.

------
fooker
4 billion is fine. What happens if your program has a few thousand floats?

~~~
masklinn
Then you can't use exhaustive testing?

------
marian0_
Funny he complains so much about incorrect rounding without doing any
research: that's called bankers' rounding.

[http://wiki.c2.com/?BankersRounding](http://wiki.c2.com/?BankersRounding)

~~~
int_19h
The functions called "ceil" and "floor" shouldn't do banker's rounding - the
very names describe what the behavior is supposed to be.

------
Taniwha
2 things here:

\- firstly, binary ops really need not 4G operations to test them all, but
4G^2, which is not a 'few seconds' but 4G times a few seconds

\- secondly, as a chip designer, by the time you've got silicon it's
essentially too late - testing 4G or 4G^2 operations in the original
verilog/vhdl model is very much not a few seconds, or even 4G times a few
seconds

