
There Are Only Four Billion Floats – So Test Them All - thedufer
http://randomascii.wordpress.com/2014/01/27/theres-only-four-billion-floatsso-test-them-all/
======
cperciva
I did something quite similar when I was finding worst-case inputs for complex
multiplication rounding errors. Here I had to search over tuples of four
floating-point values; but I could ignore factors of two (since they don't
affect relative rounding errors) and I had some simple bounds on the inputs
such that I had only a hundred billion tuples left to consider -- a small
enough number that it took a few hours for my laptop to perform the search.

Replicating this for double-precision inputs would have been impossible, of
course; but fortunately knowing that the worst-case single precision rounding
error occurred in computing

    (3/4 + 12582909 / 16777216 i) * (5592409 / 8388608 + 5592407 / 8388608 i)

made it easy to anticipate what the worst cases would be in general -- and as
usual, knowing the answer made constructing a proof dramatically easier.
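
For the curious, here is a minimal sketch (not the actual search code) of
how one might check the rounding error for that specific input pair; a
double-precision product is accurate enough to serve as the reference:

    #include <complex>
    #include <cstdio>

    int main() {
        // The worst-case inputs quoted above, as exact single-precision
        // values.
        std::complex<float> a(3.0f / 4.0f, 12582909.0f / 16777216.0f);
        std::complex<float> b(5592409.0f / 8388608.0f,
                              5592407.0f / 8388608.0f);

        std::complex<float> prod = a * b;  // rounded in single precision

        // Products of 24-bit significands fit exactly in a double, so
        // this reference is essentially exact.
        std::complex<double> ref =
            std::complex<double>(a) * std::complex<double>(b);

        double rel = std::abs(std::complex<double>(prod) - ref) /
                     std::abs(ref);
        std::printf("relative error = %g\n", rel);
    }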

------
maho
> Round to nearest even is the default IEEE rounding mode. This means that 5.5
> rounds to 6, and 6.5 also rounds to 6.

Wat. I had never even heard about this, but the more I read and think about
it, the more this rule makes sense to me.

How did I miss this - is this commonly taught in school and I just did not pay
attention? It seriously worries me that I don't even know how to round
properly - I always thought of myself as a math-and-sciency guy...

~~~
hvidgaard
It's rarely taught in school. Most people know that you round .5 up, but
that gives a slightly uneven distribution, whereas rounding to the even
number produces a perfectly even distribution. It's also the way you round
in the financial sector.
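
You can see the rule in action with std::rint, which rounds according to
the current rounding mode (round-to-nearest-even by default); a minimal
C++ sketch:

    #include <cmath>
    #include <cstdio>

    int main() {
        // Under IEEE round-to-nearest-even, ties go to the even
        // neighbour: both 5.5 and 6.5 round to 6.
        std::printf("%.0f\n", std::rint(5.5));  // 6
        std::printf("%.0f\n", std::rint(6.5));  // 6
        std::printf("%.0f\n", std::rint(7.5));  // 8
    }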

~~~
maho
I don't believe the round-to-nearest-even method will give you a
_perfectly_ even distribution. Benford's law can be generalized to the
second, third, ... digit, and although the distribution is not as skewed
as for the first digit [1], you'd still be rounding down a bit more often
than rounding up.

Benford's law applied to rounding floats is splitting hairs, really - but
it goes to show how even "simple" things like rounding can become really
difficult if you worry about them for long enough.

[1]
[http://en.wikipedia.org/wiki/Benford%27s_law#Generalization_...](http://en.wikipedia.org/wiki/Benford%27s_law#Generalization_to_digits_beyond_the_first)

~~~
robzyb
You're over-intellectualizing the issue.

Benford's law only applies to certain sets of numbers.

Not all sets of numbers follow Benford's law, so it should not be
considered when talking about rounding unless there's a reason to think it
applies to your particular set of numbers.

~~~
cbr
The interesting thing about Benford's law is how widely applicable it is. Most
sets of numbers you come across in your life are likely to follow it, or at
least have leading 1s be more common than leading 9s. As such we should
definitely be considering it when trying to decide which rounding rule to
adopt in the general case.

~~~
mzs
I'll try to make the parent's point a different way. This is about
rounding, which concerns the digits with the least significance. A good
rounding scheme wants a uniform distribution in those least significant
digits, and Benford's law breaks down almost completely after four digits.
That's the over-thinking-it aspect.

------
ElliotH
I did some work for my third-year dissertation on exhaustive testing of
floats. It worked well enough, but I was never convinced that it was an
especially effective way of testing.

If your test oracle is itself code, then it's not unlikely you'll
introduce the same bug twice. I can't find the paper I'm after, but NASA
did some work here.

If your test oracle isn't code, you have the problem of working out the
correct outputs for the entire function space (at which point maybe you'd
be better off with some kind of look-up table anyway).

The next problem is that your input combinations grow much faster than
your ability to test them. Most functions don't just take one float (which
is very feasible to test exhaustively); they might take three or four
streams of input, and keep internal state.

The moment you have these combinations you run into massive problems of test
execution time, even if you can parallelise really well.

Butler and Finelli, also at NASA, did some work
([http://www.cs.stonybrook.edu/~tashbook/fall2009/cse308/butle...](http://www.cs.stonybrook.edu/~tashbook/fall2009/cse308/butler-finelli-infeasibilit.pdf))
investigating whether sampling the input space was worthwhile, but their
conclusion was that it isn't helpful, and that any statistical
verification is infeasible.

In my report I ended up concluding that it isn't a particularly useful
approach. It seems that following good engineering practice has better returns
on your investment.

------
zamalek
The problem with this is that there are only a select few cases where a
machine can (or should) determine the expected value for a test assertion.

- My function does something complicated: chances are you will be
replicating your function in your test code. Hand-code assertions.

- My function can be tested using a function that is provided by the
runtime/framework: great! Remove your function and use the one in the
framework/runtime instead.

- My function can be tested using a function that is provided by the
runtime/framework, but is written for another platform or has some
performance trick: test as described in the article.

- My function should never return NaNs or infinities: test as described in
the article.

> And, you can test every float bit-pattern (all four billion!) in about
> ninety seconds. [...] A trillion tests can complete in a reasonable amount
> of time, and it should catch most problems.

Your developers will never run said tests. Unit tests are supposed to be fast.
However, these exhaustive tests could be run on the CI server.
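
For scale, the exhaustive run is a very small loop. A sketch, where
my_ceil is a hypothetical stand-in for whatever optimized routine is under
test:

    #include <cmath>
    #include <cstdint>
    #include <cstdio>
    #include <cstring>

    // Hypothetical optimized routine under test.
    float my_ceil(float x) { return std::ceil(x); }

    int main() {
        uint64_t failures = 0;
        // Walk all 2^32 float bit patterns.
        for (uint64_t i = 0; i <= UINT32_MAX; ++i) {
            uint32_t bits = static_cast<uint32_t>(i);
            float x;
            std::memcpy(&x, &bits, sizeof x);  // bit-cast to float

            float got = my_ceil(x);
            float want = std::ceil(x);  // slow-but-trusted reference

            // Compare bit patterns so NaNs and signed zeros count too.
            uint32_t g, w;
            std::memcpy(&g, &got, sizeof g);
            std::memcpy(&w, &want, sizeof w);
            if (g != w) ++failures;
        }
        std::printf("%llu mismatches\n", (unsigned long long)failures);
    }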

~~~
chriswarbo
The code in the article is about optimisation, in which case there should
_always_ be a slow reference implementation. Never try to write highly
optimised code on the first go: make it work, make it correct, then make it
fast.

Also, these aren't unit tests; they're property tests.

~~~
zamalek
> The code in the article is about optimization [...]

Very true, but that really should have been stated up-front by the author.
Next thing you know, "some guy at Valve said all unit tests with floats
should test all 4 billion cases." You would be surprised at how much
people take away from skimming something - e.g. a poor understanding of
Apps Hungarian created the monster that is Systems Hungarian.

> in which case there should always be a slow reference implementation.

Good point - however, keep in mind that bug fixes would need to be made in
two places. Not only does it mean the maintainer needs to grok two pieces
of code, but they may have to determine if the original code is providing
bad truths. Hopefully there wouldn't be many places where this type of
testing is required, as the low-hanging fruit is often enough. Keep in
mind, though, that certain industries do optimize to ridiculous degrees -
and in some of those industries optimization can happen first, depending
on the developer's [well-seasoned] intuition, and can be a frequent habit.
The author's industry is a brilliant example of where the typical order of
optimization (make it work, make it correct, make it fast) takes a back
seat - life is incredibly different at 60/120Hz.

~~~
MaulingMonkey
> Good point - however, keep in mind that bug fixes would need to be made in
> two places.

Only if the bug actually occurs in both implementations. This does happen
sometimes, so it's indeed good to keep in mind - but my experience
experimenting with these kinds of comparison tests hasn't borne this out as
the more common case. YMMV, of course.

> Not only does it mean the maintainer needs to grok two pieces of code, but
> they may have to determine if the original code is providing bad truths.

I find automated comparison testing of multiple implementations extremely
helpful for grokking, documenting, and writing more explicit tests of
corner cases, by way of discovering the ones I've forgotten - frequently
making this easier as a whole, even with the doubled function count.

> Hopefully there wouldn't be many places where this type of testing is
> required, as the low-hanging fruit is often enough. Keep in mind,
> though, that certain industries do optimize to ridiculous degrees - and
> in some of those industries optimization can happen first, depending on
> the developer's [well-seasoned] intuition, and can be a frequent habit.

I don't trust anyone who relies on intuition in lieu of constantly profiling
and measuring in this sea of constantly changing hardware design - I've yet to
see those who've honed worthwhile intuitions drop the habit :). Even in the
lower hanging fruit baskets, it's frequently less a matter of optimizing the
obvious and more a matter of finding where someone did something silly with
O(n^scary)... needles in the haystack and where you'd least expect it.

------
zomgbbq
If it takes 90s to test 2^32 floats then it should only take about 12,257
years to test 2^64 doubles.
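
(That's 90 s × 2^32 ≈ 3.9 × 10^11 s, and 3.9 × 10^11 / 31,536,000 seconds
per year ≈ 12,257 years.)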

~~~
Someone
It's perfectly parallelizable. Go to AWS, fire up 2^20 four core instances for
a day, and you are all set :-)
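
(Roughly: 2^64 patterns over 2^20 × 4 = 2^22 cores is 2^42 patterns per
core; if a core does 2^32 in 90 seconds, that's 2^10 × 90 s ≈ 25.6 hours.)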

~~~
croddin
That would only cost 7.5 million dollars using c3.xlarge instances! I am
actually surprised that it is that close to being feasible.

~~~
_delirium
Are there any estimates on how much on-demand capacity Amazon typically has?
If I tried to spin up instances as fast as the API will let me, how long until
I run out of quickly provisioned instances? Can I get thousands on short
notice? Tens of thousands? Millions?

------
wglb
This should be one of the required units of study in Floating Point School.

Quick--why would you sort an array of floating point numbers before adding
them up? And in which order?

~~~
alanh
Ooh, nice question.

Floats are a significand (the digits) and an exponent (the magnitude).
Adding two floats is done at the larger of the two exponents, so the
smaller float's significand gets shifted down to align with it. In an
extreme case, it shifts away entirely to zero. So when adding a very large
number and a very small number, the result will be the same as adding zero
to the very large number.

If you add numbers largest to smallest, then thanks to this, at a certain
point you are simply doing $SUM = $SUM + 0 over and over.

But if you start with the smallest numbers, they have a chance to add up to a
value with a magnitude high enough to survive being added to the largest
numbers.
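
A tiny single-precision sketch of the effect:

    #include <cstdio>

    int main() {
        const int n = 1 << 24;        // 2^24 addends
        const float tiny = 1.0f / n;  // 2^-24, exactly representable

        // Large-to-small: every 1.0f + 2^-24 is a tie that rounds back
        // down to 1.0f, so the small addends all vanish.
        float big_first = 1.0f;
        for (int i = 0; i < n; ++i) big_first += tiny;

        // Small-to-large: the tiny values accumulate exactly to 1.0f
        // before the big value is added.
        float small_first = 0.0f;
        for (int i = 0; i < n; ++i) small_first += tiny;
        small_first += 1.0f;

        std::printf("large-to-small: %f\n", big_first);    // 1.000000
        std::printf("small-to-large: %f\n", small_first);  // 2.000000
    }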

~~~
wglb
Exactly.

------
jwmerrill
Who's using single precision floats these days?

~~~
Negitivefrags
Your comment makes me feel sad. It just reflects a whole mindset that makes me
feel like an oppressed minority somehow.

It sucks that most people don't really care about performance any more. It
sucks when people casually dismiss people who do care about performance with
arguments like "premature optimization" and so on.

It kind of sucks that the web is taking over because it gives an excuse to use
slow languages. It almost makes me feel sick to think that Javascript stores
all numbers as doubles.

So you know what? It makes me feel happy to use the most efficient type that
will work. I like my single precision floats.

I can die happy knowing that I didn't waste everyone's collective network,
disks and CPUs with needless processing. Can't it be good to save just for the
sake of saving?

So you want a practical reason do you?

I shaved over 2 gigabytes off the download size of our product simply by
using more efficient data types. Using _half_-size floats (16 bits) is
good enough for some data.

And let me tell you about a data type that we used to use called a char. 8
bits would you believe! Turns out those x86 CPUs that are sitting there a few
VMs deep still support those things.

~~~
jwmerrill
Sorry for making you feel sad. Good points all around, here, and in the other
comments about GPUs. Regarding CPU, though, my understanding is that on a 64
bit machine, 32 bit arithmetic is no faster than 64 bit arithmetic. Is that
incorrect?

~~~
hayfield
Some work I was doing over the summer indicates that the optimal datatype
depends on the algorithm you're using.

qsort() can get 15% better performance with uint64_ts than uint32_ts (sorting
identical arrays of 8 bit numbers represented differently). On the other hand,
a naive bubble sort implementation was managing 5% better performance with 16
bit datatypes over any of the others.

If you start measuring energy usage as well, it becomes even stranger - using
a different datatype can make it run 5% faster, while using 10%+ less energy
(or in the qsort() case, take less time, but use more energy).
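
Roughly the shape of that measurement (a sketch with made-up sizes, not
the actual benchmark from that work):

    #include <cstdint>
    #include <cstdio>
    #include <cstdlib>
    #include <ctime>

    // qsort's type-erased interface forces an indirect call per
    // comparison, which is part of why results vary with element width.
    static int cmp32(const void* a, const void* b) {
        uint32_t x = *(const uint32_t*)a, y = *(const uint32_t*)b;
        return (x > y) - (x < y);
    }
    static int cmp64(const void* a, const void* b) {
        uint64_t x = *(const uint64_t*)a, y = *(const uint64_t*)b;
        return (x > y) - (x < y);
    }

    int main() {
        const size_t n = 1 << 20;
        static uint32_t a32[1 << 20];
        static uint64_t a64[1 << 20];
        std::srand(42);
        for (size_t i = 0; i < n; ++i) {
            uint8_t v = std::rand() & 0xFF;  // same 8-bit values, two widths
            a32[i] = v;
            a64[i] = v;
        }

        std::clock_t t0 = std::clock();
        std::qsort(a32, n, sizeof a32[0], cmp32);
        std::clock_t t1 = std::clock();
        std::qsort(a64, n, sizeof a64[0], cmp64);
        std::clock_t t2 = std::clock();

        std::printf("uint32_t: %ld ticks, uint64_t: %ld ticks\n",
                    (long)(t1 - t0), (long)(t2 - t1));
    }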

~~~
nly
qsort strikes me as a bit of a crappy benchmark because of the type erasure.
Your compiler likely isn't inlining or applying vectorisation. Vectorisation
would likely benefit 32bit floats on a 64bit platform more than doubles.

~~~
hayfield
Indeed - it probably isn't ideal.

I'd tested a number of sort algorithms (bubble, insertion, quick, merge,
counting), so also testing the sort function from stdlib seemed like a logical
continuation. It was done rather quickly, so there wasn't time to properly
investigate, merely look at the numbers and go "that's strange".

------
nraynaud
I like his notion that IEEE errors result from sloppiness. So here are two
"non-sloppy" versions of the computation of the sign of a determinant:
[https://github.com/bjornharrtell/jsts/blob/master/src/jsts/a...](https://github.com/bjornharrtell/jsts/blob/master/src/jsts/algorithm/RobustDeterminant.js)
[http://www.cs.cmu.edu/afs/cs/project/quake/public/code/predi...](http://www.cs.cmu.edu/afs/cs/project/quake/public/code/predicates.c)

Remember, the determinant value is not the rounded value of the exact
computation with those things unless it is zero; it's still "sloppy" if
it's far from zero. You get only 1 bit of guaranteed accuracy in a 64-bit
FP number if the result is not zero.
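
A quick double-precision illustration of how the naive a*d - b*c can get
even the sign wrong (values chosen so the true determinant is -1):

    #include <cstdio>

    int main() {
        // Exact integers, all representable as doubles.
        double a = 134217729.0;  // 2^27 + 1
        double d = 134217727.0;  // 2^27 - 1
        double b = 134217728.0;  // 2^27
        double c = 134217728.0;  // 2^27

        // True determinant: (2^54 - 1) - 2^54 = -1, i.e. negative.
        // But a*d = 2^54 - 1 needs 54 bits and rounds up to 2^54,
        // so the naive formula reports exactly zero.
        double det = a * d - b * c;
        std::printf("naive det = %g (true value is -1)\n", det);
    }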

I think there is a little bit of truth to the "sloppiness" point, in that
we carry very inexact results around in FP. Even "good enough" FP
computation is PhD-level work.

------
vanderZwan
> However the ceil function gave the wrong answer for many numbers it was
> supposed to handle, including odd-ball numbers like ‘one’.

Given all the special and unique features of the number one, I sincerely
cannot tell if he's being sarcastic or not.

~~~
ghayes
I believe he's (dramatically) saying that you'd expect someone to have checked
that number manually, if not by other means.

~~~
sampo
Maybe they were like "of course it works for simple cases like 1, let's just
test some more difficult numbers".

~~~
ajuc
That's why unit testing is so great.

~~~
tripzilch
Because one is the unit number.

------
jevinskie
> A trillion tests can complete in a reasonable amount of time, and it should
> catch most problems.

When your individual test is a handful of instructions, you can make such
impressive claims. I only wish $work's test suite was so svelte!

------
donpdonp
This post feels particularly applicable since I just came across a huge bug in
ruby's BigDecimal package.

If you use ruby 2.1, be sure to upgrade to BigDecimal 1.2.4.

[https://www.ruby-forum.com/topic/4419577#1133001](https://www.ruby-forum.com/topic/4419577#1133001)

The bug is basically: if the divisor is less than 1, the result will be 0.0.

[https://gist.github.com/donpdonp/c36b33b861cea49c0a86](https://gist.github.com/donpdonp/c36b33b861cea49c0a86)

------
kordless
It seems like a perfect fit for a genetic algorithm. Even for the speed stuff.

~~~
dj-wonk
A perfect fit? Care to expand on your idea? I'd like to see what you had in
mind.

