Hacker News new | past | comments | ask | show | jobs | submit login
IEEE 754 round-off errors (yurichev.com)
78 points by kulikov009 on Aug 27, 2021 | hide | past | favorite | 58 comments

There's been a few articles about rounding floating point numbers correctly in the past few days, but I realized that might not always be what you want.

For instance when adding some small positive number to a floating point value I would typically prefer it round up if correct rounding would mean that it remains unchanged.

b ≠ 0 and a + b = a can be a source of fun bugs.

However a point can be made that if you even need to consider such behavior, you should probably be using something more precise than your current number representation.

> However a point can be made that if you even need to consider such behavior, you should probably be using something more precise than your current number representation.

Or a different algorithm. For instance, when computing a mean, keeping the current mean and count instead of the current total and count is more numerically stable.


Well, you shouldn't do exact equality check on floats anyway, that can be a source of fun bugs regardless of rounding.

I see this advice often and I find that it misses the point. The problem isn't the equality check, the problem is finite precision.

There are situations that call for exact equality checking in floating point.

In which case, you don't use IEEE 754 floating point for math, you use some kind of numerics type or library that correctly models the domain.

That's just trading one class of bugs for another. Adding a million 10^-50s to 1.0 is far still closer to 1.0 than the next floating point number.

Your algorithm would give an answer that's a million values away.

It's a trade off regardless of what you're doing.

One class of bug can make you overwrite values in a map or get your code 'stuck' to a number, the other just gives you a (more) wrong number.

I know which kind of error would be preferred in a video game for instance.

> the other just gives you a (more) wrong number.

This is a common misconception: IEEE754 floating point arithmetic is not wrong or fuzzy or inherently inexact. It is very exact; it just has limited precision.

No. In mathematical terms typical floating point arithmetic has perfect precision - it is repeatable - but limited accuracy - it can deviate from the real value. It's a shortcoming of CS vocabulary that "precision" is also used to refer to the number of bits used.

"Exactness" is a fuzzy term used in common english to refer to the concept of both precision and accuracy, or either really. It's not really used in mathematics.

Oh cool, you want to use different definitions. No matter. Point remains: the floating point representation of 1/4 is not approximate. Similarly, the closest floating point value to 1/10 also precisely defines some value: it's exactly 0.1000000000000000055511151231257827021181583404541015625.

Now whether that's accurate to you depends upon your application. But the numbers don't care.

> Oh cool, you want to use different definitions. No matter.

It's not "different" terminology. It's the correct terminology. Which you should have down if you want to wisecrack at people.

You really just took a dozen words that were referring to the result of a calculation being wrong (as opposed to the real result) out of context to make them about floating point numbers in a vacuum.

Then after construing some new context, you tried to correct the straw man by giving an explanation in which you used pretty much all the terminology incorrectly.

Don't be surprised people come back at you in the same tone of voice if you leave an opening wide as a barn door.

I assume both of us are capable of constructing sentences using only precise mathematical language. This is not the place or time to flaunt that ability and correct others on things that are still perfectly understandable. We're not writing a paper.

The point is simply there’s a method to the madness. It’s not randomly inaccurate. Many operations are perfectly accurate. Those that aren’t are just the closest representable value to the accurate value.

Doing something like what you initially suggested would break that.

Another class of bugs silently transfers fractions of a cent into a bank account...

I did, personally, once have a bug in a script while working in finance, once upon a time. Working on a portfolio of test contracts with trillions of dollars in notional value, and I was off by a single penny. It was an easily found and fixed one-line bug due to truncation instead of correct rounding. But, yes. $1e-2 off on a portfolio worth $1e12 was enough to block release.

If you require a distinct point then you can check if a is smaller than the minimum increment that can be added to b, and add that increment instead, but in the few cases I’ve needed that sort of behaviour it has been easier to do the calculation and then check for equality.

> However a point can be made that if you even need to consider such behavior, you should probably be using something more precise than your current number representation.

I would agree with this. If you're in a place where you actually need to care about how accurately you're representing numbers with IEEE 754, you almost certainly should look at a precise numeric data type instead and should set aside IEEE 754 entirely. It's less common that we think that we need both the range of values that a float or double represent and need those values represented precisely enough to endure aggregation.

It would be nice to be able to give fewer digits to the exponent in a double, though. How often have I really needed 1e308 or 1e-324? How often have I even needed 1e38 or 1e-45? Then again, I don't remember how much precision we'd get back with more digits in the mantissa.

Precision is a really funny thing, too. Remember, pi to ~37 decimal places is precise enough to calculate the circumference of a circle of atoms (~10^-10 m) the size of the observable universe (~10^26 m), as at that point the primary source of imprecision is the fact that you used atoms to construct your circle.

> It would be nice to be able to give fewer digits to the exponent in a double, though. How often have I really needed 1e308 or 1e-324? How often have I even needed 1e38 or 1e-45? Then again, I don't remember how much precision we'd get back with more digits in the mantissa.

I'm not sure that helps. Going from 10^308 to 10^38, approximately, takes 3 bits from the 11-bit exponent and adds 3 bits to the 52-bit mantissa. This gives you _almost_ one extra decimal digit in your precision. It's not worth it, and if it is you certainly need a more special-purpose number type.

What about stochastic rounding as used in many quantized neural network implementations :).

Leads to non-reproducible or non-portable software.

It's better to have well-defined and reproducible behavior, so one can reason how to make correct algorithms.

Neural networks do it for very precise reasons nearly unique to them. If they were built on randomly rounding primitives, and that random was not the precise flavor one model or another needed, then you'd not easily be able to get the distributions needed for the task.

Would not be complete without a link to William Kahan [0]...

[0] http://www.cs.berkeley.edu/~wkahan/Mindless.pdf

I feel lot of unintuitiveness of floats stems from inexact decimal literals. With hex literals the situation would be marginally clearer

When I create my own programming language, the syntax for float literals will be:


This does kinda exist in Scheme and Racket:

    > (/ 1 10)
    > (exact? (/ 1 10))
    > (/ 1.0 10.0)
    > (exact? (/ 1.0 10.0))
    > (/ #e1.0 #e10.0)   ; #e for "exact"
    > (/ #i1 #i10)       ; #i for "inexact"
As mentioned in HN before [1], Pyret (which has the same origin as Racket) went further and made the exact literal (now "Exactnum") default, so the inexact literal (now "Roughnum") always requires a preceding `~`.

[1] https://news.ycombinator.com/item?id=28263486

This does not make sense: floating point numbers are not really "approximate". For example, small integer arithmetic using floats is guaranteed to be exact. Your language would be very confusing for this common use case.

They are approximate in the sense that the float literal 1.1 is not equal to the real number 1.1.

I think it was a joke

I like to consider what it would look like if a programming language forbade floating point literals from containing inexact values. This means that a literal like `0.1` would be disallowed, and you would instead be forced to write out the literal as the exact nearest representable value `13421773e-27`. Profoundly awful for usability, but at least nobody could ever claim to be surprised that floating point values are imprecise. :P

not quite the same, but Scheme has rationals.

Now that I think about it, the Scheme standard (R5RS at least) does not require an implementation to support the full numeric tower. I can't remember which parts are required, but I do wonder now if it permits an implementation that only has rationals and drops floating point. Most implementations, obviously, go the other route and only support floats. But it would be interesting to see.

I like the idea, but on a second thought, it is easily fooled:

    >>> 3.0/10.0 == 1.0/10.0 + 2.0/10.0

No need to even go up to this point. Numbers that are too different are enough 1e300 + 1 will be equal to 1e300. And it would still be even if 1 was added 1e300 times.

And you could not write 1e300 since it is not exact

You need to write 1000000000000000052504760255204420248704468581108159154915854115511802457988908195786371375080447864043704443832883878176942523235360430575644792184786706982848387200926575803737830233794788090059368953234970799945081119038967640880074652742780142494579258788820056842838115669472196386865459400540160

Now, I feel dumb for not noticing the "+ 1" was not needed. Completely forgot 1e300 would not map on a factor of 2, thanks for the correction.

It's funny how closely some implementations can come to IEEE 754 yet not show the same issues. The first example, which takes just 54 iterations to finish on a typical system, runs forever on a VAX, for instance, which is to say that all the world isn't IEEE 754, but accuracy and rounding issues should always be considered.

Definitely also check out interval arithmetic: https://en.wikipedia.org/wiki/Interval_arithmetic

Excellent article about it: http://cs.utep.edu/interval-comp/hayes.pdf

From my limited point of view, interval arithmetic looks particularly wrong for everything.

It is as if we were using Bayes rule to update our knowledge of numbers after arithmetic operations (which is a good idea), but all the distributions were artificially changed to uniform after the update (which is obviously a very bad idea).

Is there any practical application where interval arithmetic makes sense?

The big problem with interval arithmetic, stated in terms of probability distributions, isn't that the distributions are uniform, but that the distributions are independent.

The distribution of x-x is the degenerate distribution 0. But the naive implementation of uncertainty propagation would not take into account the dependence between x and x, and would give an over-approximation.

It's true that probability distributions generalize sets, and sets generalize intervals. But there are many applications where intervals are a great choice for representing uncertainty.

Changing a distribution to uniform is not obviously a bad idea because you can parameterize a uniform distribution and represent the parameters compactly.

There is no compact representation for arbitrary distributions.

It can find the global optimum for a function of a few variables.

Just do everything with integers. RTS games of the 90's did it. They were able to sync 8 PCs across the world to run the same historical simulation (think Age of Kings) by passing nothing more than mouse clicks and keyboard presses over peer to peer. Determinism is king in lockstep programming

> Just do everything with integers. It gets cripplingly inefficient to do many calculations with integers. Simulating weather, or doing nearly any scientific calculation, using only integers would be a massively terrible idea.

Same with neural networks (even though there is some work in that area), or a large part of signal processing, or medical work.... and on and on.

>Determinism is king in lockstep programming

Pretty much all modern games don't do lockstep because it's far too slow, messy, and error prone. It's much better to build a tolerant system that can handle missed frames, lags, and instead do predictive and catchup computing.

Maybe use the right tool for the job, and learn new tools to access new jobs, is a better approach. Learning IEEE 754 nuances is a useful skill, and one can program many things by using this tool well that cannot be done efficiently or cleanly otherwise.

There is a persistent myth that floating point is non-deterministic and therefore cannot be used for "lockstep" multiplayer games.

This may have been true in the 90s because of buggy C compilers, but has long not been the case.

you can use 32 bit or 64 bit floating point and your code will be completely determistic across all modern CPUs and operating systems and even across most programming languages.

The only thing you need to avoid are trig functions (sin, cos) since different platforms have different implementations. so you use your own implementation for these functions(for games you can an approximation which will be even faster than the library implementation). many platforms have an identical sqrt function, but this should be thoroughly tested.

Also if you are running on an intel CPU on 32 bit OS (rare nowadays) you need to call a function at the start of your program to disable 80 bit extended fp.

> There is a persistent myth that floating point is non-deterministic and therefore cannot be used for "lockstep" multiplayer games.

Indeed, on all systems I've used, FP is deterministic in the strict sense that running the same code twice on the same platform (with identical HW and SW stack) will produce the same results.

But the results may change depending on: - programming language - libraries used and their versions - underlying silicon - whether SIMD is used - FPU flags - compiler flags (example: ffast-math affects many things behind the scenes, as does /fp:fast, including things that dramatically affect accuracy like the use of approximate reciprocal instructions) - bugs affecting all of the above

At scale, this matrix becomes difficult to track and control. So it shouldn't be a surprise that reproducibility is an issue, and this myth persists.

> But the results may change depending on:

> - programming language

As I mentioned in my comment, this is part of the myth. most programming languages have the exact same semantics regarding floating point and if you port an algorithm between them you will get identically deterministic results. this is true for at least c/c++, java, c#, python, javascript, and many more.

- libraries used and their versions

As I mentioned in my comment, this is the only true part of the myth. sqrt should be safe, but trig functions will be different. for games, a lot of them use there own "fast" versions anyway, so not a problem.

- underlying silicon

nothing to worry about here. All modern CPUs have identical IEE754 semantics (except intel 32 bit which I addressed in my original comment)

- whether SIMD is used

Again, nothing to worry about here, whether it is hand coded SIMD assembly, or compiler generated, you will always get identical and portable results. the compiler is obligated to ensure this for autogenerated SIMD.

- FPU flags

not relevant on any system from the last 10 years

- compiler flags (example: ffast-math affects many things behind the scenes, as does /fp:fast, including things that dramatically affect accuracy like the use of approximate reciprocal instructions)

ffast-math by definition introduces imprecise semantics so simply don't use it

- bugs affecting all of the above

compilers (and CPUs!) may have been buggy decades ago but in the modern age floating point is extensively tested. I would be very surprised to see a floating point related bug in a compiler or programming language from the last 10 years

I still have these problems

> - programming language

I decided to not use C, because it is unsafe, and write all my code in Pascal.

Now the FreePascal double parsing does not work correctly

For one, it always uses 80-bit float. Even if parsing a double, then it rounds the 80-bit float to double after parsing, so the last bit is usually wrongly rounded. They refuse to fix it, because it is easier to maintain just one 80-bit float parsing function than multiple parsing functions for different types.

>- libraries used and their versions

So I moved to someone's double parsing library.

It just did not work. It simply lacked the algorithms required for full double precision parsing

But they were happy to fix it

>- FPU flags

After fixing it, that double parsing library still did not work ... on 32-bit

Although the library was implementing the correct operations. Turned out FreePascal compiles 64-bit using SSE, and 32-bit using FPU which was set to 80-bit precision

In reverse, formatting the double as string is also hard and ambiguous. FreePascal usually prints two decimal digits more than required (in the new implementation. They still had a broken implementation a handful years ago). I wrote my own implementation that prints the shortest decimal using arbitrary precision arithmetic, but that is really slow. (I tried to use a library, but that did not have enough precision, and after they tried to fix it, it is completely broken.)

Or if you need rationals, use fixed point (unfortunately not native in many languages)

My kid has all 100% in several HS classes yet Canvas shows 99.99%. I tried explaining round off errors but she’ll have none of it.

Oh for the... don't loop until "x == y", loop until "abs(x-y) < MAX_TOLERABLE_ABS_ERROR", that's been the standard advice since forever.

(There's no '==' on the entire page (except in Mathematica code, or for printing the results of an equality test).)

That was my knee jerk reaction as well, but if you look again you'll see the loop is doing a different check than you think.

Took me a second look to realize what's going on. First thought was "only 54 iterations?" as in, (initial error) * 54 >= 1. On the second look, it clicked that no, it's actually more like (initial error) * 2^54 >= 1

There's a simple solution to all this floating point madness: replace floating point arithmetic with fixed point arithmetic, which is associative, more energy efficient in hardware, and much easier to reason about.

I think you're missing the point. floating point isn't madness, but people thinking it represents the real numbers without stopping to consider how it works is madness.

For example, just consider the number of bugs that come down to integer over/under flow. The reality is that we like to work with a simplistic model of what is going on, and ignoring the hard to reason about edge cases really does simplify problems, and if on revisiting the code we can either handle or rule out these cases as ever happening, then we're golden.

As for fixed point arithmetic, this is how DSPs worked for years, and there are still lots of edge cases to worry about. A typical application would see many multiply accumulates for something like an FIR filter, so you still have to consider how large the value can end up, and split your integer into a suitable scaling factor to avoid overflow whilst trading off precision. The whole point of floating point is that this tradeoff is always carried out on your behalf.

But don't get me started on denormals...

Thanks for your reply, this is an interesting perspective - based on your comment, it seems likely that you have more real-world experience with these issues than I do.

Regarding floating point performing this tradeoff for you: yes, I understand that floating point arithmetic trades simplicity for greater dynamic range, giving lots of precision to small values and less precision to large values. You can tackle some of these issues with fixed point arithmetic at the cost of increased memory usage. For example, instead of using a 32 bit float, you could use a 64 bit fixed point value, where the most significant 32 bits represent the integral part and the least significant 32 bits represent the fractional part. Obviously this type offers both much greater range and greater precision (at every scale) than the 32 bit float, at the cost of 2x memory usage. I suspect that the 64 bit fixed point type might still be faster to work with in hardware, due to the relative simplicity of implementing fixed point arithmetic in hardware. What do you think?

Why would you compare a 64 bit fix pint with a 32 bit float? If you can afford twice the space you would use a 64 bit float. Also I'm not an hardware designer but I think that a 32 bit float multiplier is going to be less complex than a 64 bit fix point multiplier.

> floating point isn't madness, but people thinking it represents the real numbers without stopping to consider how it works is madness

This is certainly true in some places, but I would wager that a large number of JS devs are completely ignorant of floating point, and many of them likely lack the background knowledge to understand it.

No need to pick on JS. Most devs in any mainstream language are ignorant of how IEEE754 floating point works. In the 2000s, it was Java developers. In the 90s, Perl programmers. Today it's JS and Python. In another 10 years, it could be Rust. https://0.30000000000000004.com/

Not picking on JS. It's just the best representative of the problem right now.

Fixed point arithmetic numbers have a very limited range which leads to them not playing well with multiplication and division (doing scientific computation with them is a game of searching which division|multiplication got you that 0|+inf and then searching for an alternative formula if one such formula even exists, they are much more fragile than IEEE-754). They are not used that often so people are unfamiliar with their shortcomings but, I would not make them the default.

Here is little bit of information on them: https://pages.tacc.utexas.edu/~eijkhout/istc/html/arithmetic...

If you're doing division and multiplications with fixed point arithmetic and the result must fit in a same size word as the operands, you need to decide for each operation how many fractional bits from each operand you're going to retain in the result.

In other words: you need to understand the range of numbers that you're dealing with every step of the way.

That's not really an improvement...

(I wrote about this here: https://tomverbeure.github.io/rtl/2018/11/26/Racing-the-Beam...)

Applications are open for YC Summer 2023

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact