
Why Bitwise Reproducibility Matters - mindcrime
https://thewinnower.com/papers/954-why-bitwise-reproducibility-matters
======
yarvin9
It seems clear that if we hadn't inherited a bunch of irreproducible languages
and/or instruction sets from the 20th century, no one would bother creating a
new one. There are certainly some ways to get optimization advantages from
underspecified computation, but especially in a networked world this win is
outweighed by the problems it causes.

When you use irreproducible floating-point, you're essentially treating your
CPU as an analog computer. For a subset of very specialized numerical
applications, this is probably okay. For any kind of integration with
interesting, non-numerical system software (let alone scientific publishing!),
it's a disaster.

I always recommend this link on FP determinism:
http://gafferongames.com/networking-for-game-programmers/floating-point-determinism/

~~~
varelse
Neither the instruction sets nor floating point are irreproducible at all.
What are you talking about?

In fact, FP32 is commutative (go ahead, check it out). It's just not
associative. If you can't figure out how to reduce sums of FP32 numbers in a
deterministic manner, you suck, no that's too weak, you really suck. And your
complete lack of numerical analysis has turned a digital computing problem
effectively into an analog nightmare you'll never debug. HPC people love to
use this excuse for why their parallel code isn't deterministic, but it turns
out that making it so is effectively free in an age of fast RDMA and INT64
atomics.
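
To make the commutative-but-not-associative point concrete (numpy is used here
purely for a 32-bit float type; the values are arbitrary):

    import numpy as np

    a, b, c = np.float32(1e8), np.float32(-1e8), np.float32(1.0)

    # Commutative: same bits either way
    print(a + c == c + a)   # True

    # Not associative: grouping changes the result
    print((a + b) + c)      # 1.0
    print(a + (b + c))      # 0.0, because b + c rounds back to -1e8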

To be fair, there are situations where noisy non-deterministic inputs
(cameras, microphones, and other sensors) are part of the deal. However, if
your result isn't deterministic even when given simulated deterministic
inputs, you still suck. That said, I'm willing to relax the constraint to
reproducibility given the same HW/compiler/toolchain, so I think the naysayers
are pretty much out of excuses here.

I speak from experience. And the efforts I've made to achieve bitwise
reproducibility in HPC algorithms are dismissed as "engineering" by the people
who can't do so.

I've decided to take that as a compliment.

~~~
dibanez
I also work in HPC and have been doing a better job of this, but I'm still
struggling in some areas. For example, the sin() and cos() implementations
differ across machines, and their only guarantee is being within some number
of ulps of the right answer. So far I haven't come up with a rounding scheme
that will make those values agree again.

This is worse for our codes because our mesh structure changes topology based
on floating point comparisons, so error in values turns into different
discrete structures.
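
A toy sketch of how that plays out: a one-ulp disagreement between two libm
sin() implementations is enough to flip a discrete decision (the threshold and
values here are made up):

    import numpy as np

    s_a = np.float32(0.70710677)              # sin() result from one machine's libm
    s_b = np.nextafter(s_a, np.float32(1.0))  # same call, one ulp higher elsewhere
    cut = s_b                                 # a cut-plane threshold that falls between them

    print(s_a >= cut, s_b >= cut)   # False True -> different mesh topology downstream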

Parallelism introduces more headaches, making some codes non-deterministic
between runs on the same machine, although I've been able to fix this for the
most part.

It's important to break down "determinism" into consistency across changes in
something, whether it be hardware or time or partitioning.

~~~
varelse
If your parallel reductions are consistent, you're cool in my book. That said,
I've solved a lot of dynamic load-balancing problems with 64-bit fixed-point
reductions and decisions. Maybe that will work for you?

Unless the timings themselves are non-deterministic, sigh. But then I'm far
more sympathetic. I have not found that to be the case for the most part, and
when I have, I've used a deterministic measure rather than timing (i.e. the
number of calculations as opposed to how long they take).

For even compiler revisions are enough to change FP32 operation ordering, let
alone swap in different transcendental approximations (we ought to fix that,
no?).

Where I get angry is when people sloppily use FP32 for everything or FP64
because it's double(tm) and then insist determinism isn't possible. That isn't
even science IMO.

~~~
reality_czech
I don't understand the obsession with bitwise reproducibility. It seems like
if your algorithm is numerically stable and valid, you can just compare
multiple inexact test runs with some margin of error. Even if you hire nothing
but guru-level IEEE floating point experts, a focus on bitwise reproducibility
will close off a lot of opportunities to parallelize the code.

A lot of machine learning tools like random forests and neural networks
inherently inject randomness. Are we just going to throw up our hands and say
we only use classical deterministic algorithms run in single-threaded mode,
because we can't think of any way to compare multiple test runs except memcmp?
Because that's what I'm hearing (maybe I'm missing something).

~~~
varelse
Moore's Law w/r to clock frequency ran out over a decade ago. And that sucks,
because it's trivial to make single-threaded code reproducible.

But parallel code is how one continues to benefit from Moore's law w/r to
multiple cores and ever-increasing SIMD width and SIMD units. And it happened
at exactly the same time as the migration to mostly single-threaded weakly-
typed languages began.

For if your results aren't reproducible, there's no way to detect if your code
has a race condition that is reducing the efficacy of your methods.

For an application like molecular dynamics, it's important to conserve overall
energy. Any such inconsistency is the equivalent of setting off tiny little
hand grenades in the simulation. Inconsistent summation obscures this without
a great deal of work to sample many independent simulations. Compare and
contrast to running things twice to sniff this out in deterministic code.

For machine learning, it can amount to anything from a harmless implicit
regularizer to the AI equivalent of Gary Busey taking the wheel and driving
you straight to crazy town.

I speak from experience in both cases. And speaking from experience, as long
as your reductions are effectively associative, you'll be fine. That can be
achieved with fixed-point atomics, or if you don't have them, reduction
buffers, or finally, a deterministic reduction algorithm if you have neither.
I've used all of the above and they have cost at most 2-3% more than the
non-deterministic alternatives (usually much less).
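
A serial sketch of the fixed-point option (the scale factor is an assumption
you'd pick from your data's dynamic range; in parallel the same trick works
with 64-bit integer atomics):

    import numpy as np

    SCALE = np.float64(2**32)   # assumed scale; choose from your dynamic range

    def fixed_point_sum(values):
        # Integer addition is associative, so any summation order gives
        # identical bits.
        acc = np.int64(0)
        for v in values:
            acc = acc + np.int64(np.rint(np.float64(v) * SCALE))
        return np.float32(np.float64(acc) / SCALE)

    rng = np.random.default_rng(0)
    xs = rng.standard_normal(10_000).astype(np.float32)
    print(fixed_point_sum(xs) == fixed_point_sum(xs[::-1]))    # True
    print(np.float32(xs.sum()) == np.float32(xs[::-1].sum()))  # not guaranteed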

Finally, I inject randomness like crazy, but I do so in a reproducible manner.
Confusing determinism and randomness reminds me of people who don't understand
the difference between precision and accuracy (TLDR: precision is easy,
accuracy is tough).
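
Injecting randomness reproducibly just means owning the seed. A trivial sketch
(the function and seed are made up):

    import numpy as np

    def noisy_step(x, step_index, seed=12345):
        # The stream is fully determined by (seed, step_index), so the same
        # bits come out on every run with the same numpy version.
        rng = np.random.default_rng([seed, step_index])
        return x + rng.standard_normal(x.shape, dtype=np.float32)

    x = np.zeros(4, dtype=np.float32)
    print(noisy_step(x, 7))   # identical output every time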

~~~
reality_czech
I've seen race conditions that only happen 1 time in 1000. Or race conditions
that only happen when the machine is under load. I agree that it's frustrating
to have to debug non-deterministic code, but I feel like there must be a
better way to compare test runs than memcmp on the results.

You are right that deterministic pseudo-random numbers can be used when
"random" inputs are required. I overlooked that... thanks for pointing it out.

~~~
varelse
One way I addressed this exact situation: compute a hash of the state after
each iteration of a reference run and save it. Now run again normally, keeping
several iterations' worth of full state around. When the hash diverges from
the reference, you have hit the race condition.
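
Roughly, assuming you can serialize the per-iteration state to bytes (names
here are made up):

    import hashlib

    def state_hash(state_bytes):
        # One hash per iteration is cheap to keep; the full state is not.
        return hashlib.sha256(state_bytes).hexdigest()

    def check_iteration(i, state_bytes, reference_hashes, recent_states, keep=5):
        # Keep a short window of full states around for post-mortem inspection.
        recent_states.append((i, state_bytes))
        del recent_states[:-keep]
        if state_hash(state_bytes) != reference_hashes[i]:
            raise RuntimeError("diverged from reference run at iteration %d" % i)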

Do this a few times until you've characterized what's happening; seeing the
pattern this reveals will (hopefully) let your brain figure out how and why.

------
littlewing
"The reason is a mixture of widespread ignorance about floating-point
arithmetic and the desire to get maximum performance."

Hopefully developers don't misinterpret this to mean that using floating-point
types in calculations is idiotic or an overly obsessive pursuit of
performance. That's not true.

As an example, I once decided to use double-precision floating point in
calculations that would accumulate enough error that I had to do approximate
equality comparisons throughout the parts of the code that determined these
solutions. This did not involve monetary amounts, and we needed solutions in a
few seconds, not 20-30 seconds. Each step in the solution built on the
previous one, so I could not break it up to solve in parallel any more than it
already was. Throwing faster hardware at it wasn't an option available at the
time. I still think I did the right thing, and it's still an integral part of
production. Just going with the standard "just use BigDecimal because it's
accurate" approach would have been a mistake.

------
jtchang
I've never had to write an RTS game, but FP determinism sounds like a
nightmare. I can imagine someone clicking a button and not having it do the
same thing every time. I'd be cursing the platform all the way down.

------
jmnicholson
Here is another piece by the author: "Reproducibility, replicability, and the
two layers of computational science"
https://thewinnower.com/papers/reproducibility-replicability-and-the-two-layers-of-computational-science

------
dnautics
John Gustafson's unums are an effort to introduce a scheme for bitwise
reproducible floats.

~~~
bsder
This is cool, but we can't seem to get floating point decimal jammed into
languages. And that has an immediate business case.

I would love to have floating point decimal on _embedded_ systems rather than
a binary floating point unit. But the libraries are generally huge.

I use the equivalent of decimals _WAY_ more on embedded hardware. User entry
is always decimal. Percentages--decimal. Time specifications--milliseconds,
microseconds, nanoseconds--decimal with precision.

I wind up writing integer fixed-point arithmetic versions of this crap, over
and over and over.
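
The sort of thing I mean, as a toy Python version of the pattern (the embedded
versions are the same idea in C): store hundredths (or thousandths, or
nanoseconds) as plain integers and only format as decimal at the edges.

    def parse_percent(text):
        # user entry "12.3" -> 1230 hundredths; sign handling omitted
        whole, _, frac = text.partition(".")
        return int(whole) * 100 + int((frac + "00")[:2])

    def format_percent(hundredths):
        return "%d.%02d" % divmod(hundredths, 100)

    print(parse_percent("12.3"))   # 1230
    print(format_percent(1230))    # 12.30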

Maybe since Moore's law broke and memory transistors are now way more
expensive than processor transistors, we'll get decimal floating point as a
feature soon.

I won't hold my breath, though.

~~~
dnautics
Fixed-precision is much better for business uses. What, exactly, is the
benefit of floating-point decimal? If you want to divide a $100 Venmo payment
three ways, you get to know that your 1/3 is carried to 12 decimal places
instead of 2?

~~~
bsder
Fixed-precision is fine. Until your quantum isn't good enough.

We only store to hundredths ... oops ... new regulations require us to round
to thousandths in decimal. Cue a full reload of the database with all your
numbers in it.

Or, even better, try dealing with mm and mils in the same CAD database
simultaneously. (conversion factor of 25.4--a clean number in decimal but very
much _not_ in binary)

In addition, you forget that _simply adding decimals_ doesn't work in binary
floating point. Adding 0.10 10 times doesn't equal 1.00 in binary floating
point no matter how much precision you use.
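
For example, in ordinary 64-bit binary floating point:

    total = 0.0
    for _ in range(10):
        total += 0.1
    print(total)          # 0.9999999999999999
    print(total == 1.0)   # False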

------
JesperRavn
I strongly disagree. Bitwise reproducibility would make it impossible to
change or improve matrix multiplication algorithms, since the order of
operations matters, and will change across implementations.

Do we really expect everyone who does linear regression to say "I did linear
regression, using the following implementation of a linear solver"? Because
that's what is being asked.

A much better alternative is to test that algorithms give expected answers
(within some tolerance) on simulated but non-trivial data. It will still
require some care to distinguish bugs from rounding errors, but it is more,
er, realistic than expecting bitwise reproducibility.
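
Something like the following is what I have in mind (the tolerances would be
chosen per problem, not universally):

    import numpy as np

    def close_enough(result, reference, rtol=1e-6, atol=1e-9):
        # Compare runs or implementations within a tolerance, not bit-for-bit.
        return np.allclose(result, reference, rtol=rtol, atol=atol)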

~~~
yarvin9
Not to be too harsh, but this seems like the kind of thinking that leads to
solutions like the leap second. For the sake of an optimization in an
important but specialized corner case, you abandon a general principle of
universal applicability (repeatable computing, chronological time).

Maybe the optimization is important. In that case, couldn't you add another
layer devoted to that special case? For instance, reorder matrix multiplies in
a source-to-source transformation? Put leap seconds in the presentation layer,
not the chronology layer?

Everyone who does linear regression for the purposes of publishing a
scientific paper should post their code and data, so that anyone else can
recompute it and get the same bits. How is this controversial in 2015?

~~~
JesperRavn
It's not harsh to me, since you are critiquing the work of some of the best
experts in the world, not me.

What do you mean by "reorder matrix multiplies in a source-to-source
transformation?" To be clear, the issue is that matrix multiplication (which
forms the basis for a huge part of numerical computing) is only deterministic
when the order of operations is known. But any kind of fast algorithm is going
to have a highly complex order of operations that depends on the algorithm and
parameters. You can't reorder these back to some canonical order without
completely changing the algorithm (and destroying its performance gains).

Given the above, posting code and data won't be enough for binary
reproducibility. Source code for every dependency would be needed too.

As I noted I do support testing and reproducibility, just not at the level of
binary data.

~~~
yarvin9
Yes, source code for every dependency is needed too!

This is a classic source-to-source transformation problem. You have an
abstract matrix multiplication algebra which needs to be converted into a
deterministic algorithm with a known order of operations. You need a second
algorithm, essentially a macro, which converts the abstract operation into the
concrete operations, which are then compiled to deterministic code.

This macro should itself be deterministic, even if its inputs contain
information about local hardware configuration that's needed to make the
optimized code run as fast as possible. If this information is in your data
set, it's in your data set. Optimization is typically much less important to
the reproduction pass, so a reproduction will probably just use your
configuration details rather than those matched to the reproducer's machine.

Either way, both algorithms are deterministic and the result is reproducible.
What's wrong with this picture? I ask because I genuinely want to know :-)

[Edit: it's disappointing to see "disagree == downvote" applied to the
parent.]

~~~
CJefferson
The main problem is that with floating point numbers, (a+b)+c is different to
a+(b+c), so almost any change can produce slightly different answers.

Also we have the problem of what is "right"? Maybe your first implementation
does (a+b)+c, then someone finds doing (a+c)+b gives a more accurate answer.
Should we change?

Then you are given two CPUs. If I want to do x1 + x2 + ... + xn, the obvious
thing to do is to give each CPU half the input data. But now, however I split
the data, I will get a (slightly) different answer.
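
Concretely (simulating the splits serially; the exact values depend on the
data, but the low bits will typically differ):

    import numpy as np

    rng = np.random.default_rng(1)
    x = rng.standard_normal(100_000).astype(np.float32)

    def split_sum(x, n_chunks):
        # Each "CPU" sums its own chunk; the partial sums are then combined.
        partials = [chunk.sum() for chunk in np.array_split(x, n_chunks)]
        return np.float32(sum(partials, np.float32(0)))

    print(split_sum(x, 2), split_sum(x, 4))   # usually not bit-identical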

With matrix multiplication this is much worse -- all algorithms chop the
matrix into little pieces in different ways, each giving a slightly different
answer. Even if you try to define some plain, simple algorithm, it will
usually give a less accurate answer, due to how the floating point rounds!

With integer matrices, or matrices over finite fields, it is much easier to
get what you want: 100% reproducibility.

~~~
yarvin9
Let me say the same thing perhaps more clearly: there isn't any nondeterminism
here, just failure to capture the inputs and outputs of a deterministic code
transformation.

Suppose CPU count is the configuration input. Then you have a deterministic
macro function M(nCPU, abstract program) => concrete program. And another
deterministic function C(concrete program) => CPU instructions.

Now nCPU is no longer a random piece of information randomly found in your
environment: it's data in your data set. And your results are bitwise
reproducible.

With a reproducible toolchain, I can use a 1-CPU machine to get the same
results as you did on your 32-CPU machine, by using the same data set
(including your nCPU=32). Why? Because I'm reproducing your results and I care
more about precision than performance.
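
A sketch of what I mean, with the chunking standing in for the "macro" (the
names and structure are made up for illustration):

    import numpy as np

    def deterministic_sum(x, n_cpu_recorded):
        # M(nCPU, abstract sum) -> a concrete, fixed order of operations.
        # Replaying with the recorded nCPU reproduces the bits on any machine,
        # even a single-core one.
        partials = []
        for chunk in np.array_split(x, n_cpu_recorded):
            s = np.float32(0)
            for v in chunk:
                s = s + v            # fixed left-to-right order within a chunk
            partials.append(s)
        total = np.float32(0)
        for p in partials:
            total = total + p        # fixed order across chunks
        return total

    # config = {"nCPU": 32} travels with the data set; a reproducer passes the
    # same value regardless of their own core count.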

