
Due to round-off error, correct programs are not a subset of replicable programs. Scientists do trade replicability for speed.



You are perhaps not understanding what I mean by replicability. By replicability, I mean that if you run some statistics on a large dataset several times, you will get the same answer each time. Rounding errors in floating point arithmetic don't harm replicability, because all modern computers have that issue. There is some loss in accuracy, but floating point arithmetic is still deterministic.
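
To make "deterministic but not exact" concrete, here is a minimal sketch (arbitrary values, hypothetical helper): summing the same numbers in the same order always gives the same bits, even though the result differs from the exact mathematical sum.

  #include <stdio.h>

  /* Same data, same order => same result every run, even though it is inexact. */
  float serial_sum( const float *x, int n )
  {
      float s = 0.0F;
      for ( int i = 0; i < n; ++i )
          s += x[i];                      /* fixed left-to-right association */
      return s;
  }

  int main( void )
  {
      float x[3] = { 42.90F, 0.01F, 0.01F };
      float a = serial_sum( x, 3 );
      float b = serial_sum( x, 3 );
      printf( "%.7f %.7f same=%d\n", a, b, a == b );  /* same=1 */
      return 0;
  }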

-----


> There is some loss in accuracy, but floating point arithmetic is still deterministic.

Not in a multi-core setting. When you do a floating point reduce operation the order in which you reduce your floating point numbers matters, but this order is not necessarily deterministic.
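
As a hypothetical illustration (an OpenMP reduction, compiled with OpenMP enabled, e.g. -fopenmp -- not anything from the parent posts): each thread accumulates its own partial sum, and the partial sums are combined in an order the spec leaves unspecified, so the result can differ from a plain serial loop and can change with the thread count or implementation.

  #include <stdio.h>

  int main( void )
  {
      enum { N = 1000000 };
      static float x[N];
      for ( int i = 0; i < N; ++i )
          x[i] = 0.0125F;                 /* arbitrary small increment */

      float serial = 0.0F;
      for ( int i = 0; i < N; ++i )
          serial += x[i];                 /* one fixed association */

      float psum = 0.0F;
      #pragma omp parallel for reduction(+:psum)
      for ( int i = 0; i < N; ++i )
          psum += x[i];                   /* per-thread partials, combined in an unspecified order */

      printf( "serial=%f parallel=%f\n", serial, psum );
      return 0;
  }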

-----


Translation for readers:

Let's say you have some floating point numbers, like...

  float A = 42.90F;
  float B = 00.01F;
  float C = /* ... */;
  float D = /* ... */;
(A + B + C) will not necessarily produce the same result as (A + C + B).
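
Here's a standalone sketch with made-up values chosen to make the difference obvious (1.0F is smaller than the spacing between adjacent floats near 1e8F):

  #include <stdio.h>

  int main( void )
  {
      float A = 100000000.0F;   /* exactly representable */
      float B = 1.0F;           /* smaller than the float spacing near 1e8 */
      float C = -100000000.0F;

      printf( "%f\n", (A + B) + C );   /* 0.000000 -- the 1.0F is lost in A + B */
      printf( "%f\n", (A + C) + B );   /* 1.000000 */
      return 0;
  }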

In fact, if you do (A + B + B + B ...) many times, you will wind up exacerbating the problem -- it can be a source of bugs if you repeatedly increment a counter by small floating point values. For example:

  const int Iterations = 10;
  for ( int i = 0; i < Iterations; ++i )
    A += B;
  printf( "%2.4f\n", A );
The output is 43.0000, as you'd expect. But what if we increase Iterations to 100? We get 43.8998.

Astute readers will note that 00.01F can't be exactly represented in base 2, and wonder whether something like 00.0125F would still suffer from this problem. Alas, yes (0.0125 is 1/80, so it isn't exactly representable in base 2 either):

  A = 42.9000F 
  B = 00.0125F

  8    iterations ==  43.0000 
  80   iterations ==  43.9001
  800  iterations ==  52.9006
  8000 iterations == 142.8820  /* we would expect 142.9 */
So if you want to parallelize some computations, and you want to do 4000 iterations on core #1 and 4000 iterations on core #2 and then sum the results, well... let's see what happens:

  float A1 = 42.9000F;
  float B1 = 00.0125F;

  float A2 = 42.9000F;
  float B2 = 00.0125F;

  const int Iterations = 4000;
  for ( int i = 0; i < Iterations; ++i )
  {
    A1 += B1;
    A2 += B2;
  }
  float A = A1 + A2 - 42.9000F;
  printf( "%2.4f\n", A );
The output is 142.8885 -- which is different from our earlier result of 142.8820!

You may be tempted to try various things to get around this fact... for example, what about counting from zero rather than from 42.9000? (Answer: the error can be even worse, which is a topic worthy of its own separate post) ... but you will quickly discover that this is a fundamental issue. The order of operations matters.

-----


Which is why every good numeric library should have "correct" reduce implementations that guarantee minimal error. In the case of a sum over a floating-point array, that would be a tree-based sum, possibly coupled with sorting the array beforehand.
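
A minimal sketch of such a tree-based (pairwise) sum, assuming a plain float array; the association depends only on the array length, not on thread scheduling, so the same input always gives the same answer, and the rounding error grows much more slowly than for a left-to-right loop:

  /* Pairwise (tree-based) summation: split the array in half, sum each half
     recursively, then add the two partial sums. */
  float pairwise_sum( const float *x, int n )
  {
      if ( n == 0 ) return 0.0F;
      if ( n == 1 ) return x[0];
      int half = n / 2;
      return pairwise_sum( x, half ) + pairwise_sum( x + half, n - half );
  }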

Also, if you do some kind of map-reduce style work distribution where different nodes perform the same operations on different data, it pays off to use a deterministic (pseudo-)random algorithm for work distribution, so that your results can be reproduced.
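
A sketch of that idea, assuming work items are identified by an integer index (the hash and names here are made up for illustration): a fixed seed plus a deterministic integer hash assigns every item to the same node on every run, so a re-run partitions the work -- and therefore performs the per-node reductions -- identically.

  #include <stdint.h>

  /* splitmix64-style finalizer used as a deterministic hash. */
  static uint64_t mix64( uint64_t z )
  {
      z += 0x9E3779B97F4A7C15ULL;
      z = ( z ^ ( z >> 30 ) ) * 0xBF58476D1CE4E5B9ULL;
      z = ( z ^ ( z >> 27 ) ) * 0x94D049BB133111EBULL;
      return z ^ ( z >> 31 );
  }

  /* Same item_id, seed and node count => same node, on every run. */
  int node_for_item( uint64_t item_id, int num_nodes, uint64_t seed )
  {
      return (int)( mix64( item_id ^ seed ) % (uint64_t)num_nodes );
  }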

-----


> When you do a floating point reduce operation the order in which you reduce your floating point numbers matters, but this order is not necessarily deterministic.

There is no such thing as a "floating point reduce operation". You can do a reduction using a floating point calculation as your joining operation, but even in that case, each individual operation is still deterministic. Moreover, a classical reduce is also deterministic, being a fold in one direction or the other based on the associativity of the joining operation. The non-determinism comes only from your choice to specify the behaviour of your "parallel reduce" function in a non-deterministic way, and if you do that without understanding the consequences that's hardly floating point's fault.

-----


In practice, the non-determinism is important for efficiency. Consider computing a norm or dot product in parallel. You sum the owned part of the distributed vector, then use MPI_Allreduce() to get the result of the global sum (the parallel implementation uses some tree structure, usually semi-redundantly, perhaps using dedicated hardware). If you run with a different number of processes, then the reduction will necessarily be done with different associativity. Of course any such result is as good as the others (unless the vector is somehow sorted in which case some orders will commit larger rounding error).
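
For reference, a sketch of that parallel dot product (assuming each rank owns n_local entries of x and y):

  #include <mpi.h>

  /* Each rank sums its owned part; MPI_Allreduce then combines the partial
     sums across ranks.  The order in which the implementation combines them
     (tree shape, number of processes) is where the variability comes from. */
  double parallel_dot( const double *x, const double *y, int n_local, MPI_Comm comm )
  {
      double local = 0.0, global = 0.0;
      for ( int i = 0; i < n_local; ++i )
          local += x[i] * y[i];
      MPI_Allreduce( &local, &global, 1, MPI_DOUBLE, MPI_SUM, comm );
      return global;
  }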

The MPI standard does not require that reductions use deterministic associativity for a given number of processes (independent of the physical network topology), but it does suggest that this is desirable, and (AFAIK) most implementations are deterministic for a fixed number of processes.

-----


Map-reduce type calculations are less likely to run on known or fixed numbers of processes, as they are designed for failing hardware and more flexible setups, unlike most MPI jobs. But most of the calculations are probably less sensitive to errors too.

-----


Exactly. So long as we're talking about accuracy in terms of precision, and not "5% of the time the result will be nonsense", people already make this sacrifice every day.

There are also plenty of existing models where accuracy is ultimately assured, but not always maintained internally. Branch prediction, for example. The computer guesses which way the branch will go! This can give a big speed boost depending on the workload and the predictor, and the computer assures accuracy by fact-checking the guess down the line.

-----



