9999999999999999.0 – 9999999999999998.0 (sdf1.org)
313 points by lelf on Jan 5, 2019 | hide | past | favorite | 260 comments

I don't understand all the crap that IEEE 754 gets. I appreciate that it may be surprising that 0.1 + 0.2 != 0.3 at first, or that many people are not educated about floating point, but I don't understand the people who "understand" floating point and continue to criticize it for the 0.1 + 0.2 "problem."

The fact is that IEEE 754 is an exceptionally good way to approximate the reals in computers with a minimum number of problems or surprises. People who don't appreciate this should try to do math in fixed point to gain some insight into how little you have to think about doing math in floating point.

This isn't to say there aren't issues with IEEE 754 - of course there are. Catastrophic cancellation and friends are not fun, and there are some criticisms to be made with how FP exceptions are usually exposed, but these are pretty small problems considering the problem is to fit the reals into 64/32/16 bits and have fast math.
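For readers who haven't run into it, the 0.1 + 0.2 behavior is easy to reproduce (Python shown here for brevity; any IEEE 754 binary64 implementation behaves the same):

```python
from decimal import Decimal

a = 0.1 + 0.2
print(a)             # 0.30000000000000004
print(a == 0.3)      # False
# Neither operand was ever exactly 0.1 or 0.2 to begin with:
print(Decimal(0.1))  # 0.1000000000000000055511151231257827021181583404541015625
```

The "surprise" is entirely in decimal-to-binary conversion, not in the addition itself, which is correctly rounded.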

> considering the problem is to fit the reals into 64/32/16 bits and have fast math

Floating-point numbers (and IEEE-754 in particular) are a good solution to this problem, but is it the right problem?

I think the "minimum of surprises" part isn't true. Many programmers develop incorrect mental models when starting to program, and get no feedback to correct them until much later (when they get surprised).

It is true that for the problem you mentioned, IEEE 754 is a good tradeoff (though Gustafson has some interesting ideas with “unums”: https://web.stanford.edu/class/ee380/Abstracts/170201-slides... / http://johngustafson.net/unums.html / https://en.wikipedia.org/w/index.php?title=Unum_(number_form... ). But many programmers do not realize how they are approximating, and the "fixed number of bits" may not be a strict requirement in many cases. (For example, languages that have arbitrary precision integers by default don't seem to suffer for it overall, relative to those that have 32-bit or 64-bit integers.)

Even without moving away from the IEEE-754 standard, there are ways languages could be designed to minimize surprises. A couple of crazy ideas: Imagine if typing the literal 0.1 into a program gave an error or warning saying it cannot be represented exactly and has been approximated to 0.100000000000000005551, and one had to type "~0.1" or "nearest(0.1)" or add something at the top of the program to suppress such errors/warnings. At a very slight cost, one gives more feedback to the user to either fix their mental model or switch to a more appropriate type for their application. Similarly if the default print/to-string on a float showed ranges (e.g. printing the single-precision float corresponding to 0.1, namely 0.100000001490116119385, would show "between 0.09999999776482582 and 0.10000000521540642" or whatever) and one had to do an extra step or add something to the top of the program to get the shortest approximation ("0.1").
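A rough sketch of what such range-style feedback could look like, using the stdlib (`math.nextafter` requires Python 3.9+; showing adjacent representable doubles here, though a real implementation would show the midpoint-based rounding interval instead):

```python
import math
from decimal import Decimal

x = 0.1
print(Decimal(x))  # the exact binary64 value actually stored for 0.1
# Adjacent representable doubles - enough to convey "0.1 is really an interval":
lo = math.nextafter(x, -math.inf)
hi = math.nextafter(x, math.inf)
print(f"0.1 is stored between {lo!r} and {hi!r}")
```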

When I went to university in 1982, one of the lower level courses was called "Numerical Methods". It went over all of the issues related to precision, stability, as well as a host of common numerical integration and approximation methods.

I'm just a sample size of one, but isn't this kind of class a requirement for CS majors?

I majored in math and physics. In math, our version of the course was called "numerical analysis," and we covered those things. In physics, the behavior of floating point math was covered in one of our lab courses.

I don't know what's taught to CS majors, and as others have pointed out, programmers don't necessarily study CS.

I believe the issue is just that the pitfalls of floating point are not apparent without a certain level of math education.

But there may be one more pitfall, which is that those of us using FP regularly, also happen to be "scientific" or "exploratory" programmers who haven't learned a lot of formal software engineering discipline (including me). So we understand the math but might be more prone to making mistakes with it.

I do kind of like the idea of flagging any number that is potentially exposed to a FP issue. We all make mistakes. Displaying all floats in exponential notation by default would be a good enough warning to the wary. We only display them as decimals for readability.

It wasn't required at my school a decade or two later. Floating point representations were touched on somewhat in the intro to computer architecture class where we wrote assembly for a MIPS simulator. I got an extra dose because I switched my major to Math, where there was an entire course on Numerical Methods (though it wasn't required by either the Math or CS depts).

I'm a little amazed in retrospect that the Math department was where one had to go to get a class in the mechanical details of computing when as a subject it's usually considered (and often in practice is) notably up the ladder of abstraction. And this was a weird outlier as the single most practical upper division class offered by the Math department at the time...

In my case, the CS class only covered the representation of floating point numbers (a sign bit, exponent bits, fraction bits, issues like bias etc) but not things like numerical approximation or integration methods. Those were in a separate class under the math department. And I think that's fair; after all those are really about scientific computing, not so much about computer science.

> isn't this kind of class a requirement for CS majors?

I think most CS departments dropped numerical analysis from their requirements by the end of 1980s. Nowadays you are more likely to find such a course in some dusty corner of math or engineering departments.

My university moved it from a 100 series course to a 200 series course but it's still being taught to ECE undergrads.

The problem is more that we don't have the tools to track and understand how errors propagate as we do the math (e.g. how would you even begin to represent catastrophic cancellation at compile time?). Doing the numerical error analysis on the abstract math itself is hard once the math gets complex, let alone trying to figure it out after you've optimized the code for performance and tweaked the algorithms for real-world data/discrete space.

Now perhaps it could be possible to do it at runtime in some way but I suspect the performance of that is prohibitive to the point where arbitrary precision math or decimal numbers is going to be a better solution.

CS isn’t really required for the majority of developers.

From a 1998 interview with William Kahan, the “father” of IEEE-754 floating point (emphasis mine):

> My reasoning was based on the requirements of a mass market: A lot of code involving a little floating-point will be written by many people who have never attended my (nor anyone else's) numerical analysis classes. We had to enhance the likelihood that their programs would get correct results. At the same time we had to ensure that people who really are expert in floating-point could write portable software and prove that it worked, since so many of us would have to rely upon it. There were a lot of almost conflicting requirements on the way to a balanced design.

I imagine that the number of people writing code without having taken a numerical methods class has only increased since the late 1970s Kahan is talking about, and even in the two decades since that interview.

Not in many of the CS related majors. Though I guess it is for hard CS.

My CS course at a community college went over the limitations of floats in detail, in the CS class.

How detailed, if I may ask?

I remember we learned how the representation worked bit-for-bit, and how it being stored in binary meant it couldn't perfectly represent everything in decimal. 1.01b meaning 1*2^0 + 0*2^-1 + 1*2^-2, for example.
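The bit-for-bit layout is easy to inspect yourself; here's a small sketch pulling apart the binary64 fields of 1.25 (which is 1.01 in binary):

```python
import struct

# Reinterpret the double 1.25 as its raw 64-bit pattern.
bits = struct.unpack('>Q', struct.pack('>d', 1.25))[0]
sign     = bits >> 63
exponent = (bits >> 52) & 0x7FF    # 11 biased exponent bits
fraction = bits & ((1 << 52) - 1)  # 52 fraction bits (implicit leading 1)
# 1.25 = 1*2^0 + 0*2^-1 + 1*2^-2, so the fraction field is 01 followed
# by 50 zero bits, i.e. 2**50.
print(sign, exponent - 1023, fraction)  # 0 0 1125899906842624
```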

There is a proposal about "unum" or "posit" number system. They give more precision for small numbers (small meaning smaller than about 10^70, for 64bit numbers), less precision for huge numbers, and an overall larger range, than the floating point system.



(They are definitely not any easier to understand than the floating point system, though.)

there is also DEC64


> Even without moving away from the IEEE-754 standard, there are ways languages could be designed to minimize surprises.

Floating point in base 10 is already in the standard since 2008:


The standard is not to blame, the lack of demand for that feature is.

Most potential users don’t know that they can demand this feature from their software and hardware suppliers. If they used it, there would be fewer “surprises.”

Showing that the result of 9999999999999999.0 – 9999999999999998.0 is a number between 1.9999999999999998 and 2.0000000000000124 will not solve the problem. IEEE floating point doesn't keep track of loss of precision.

To be clear, my point was that if programmers always saw floating-point numbers printed out as a range, from their beginning programming days, more of them would be likely to understand floating-point numbers better — or at least avoid the (impossible) idea that they map 1:1 with the real numbers. Having understood floating-point numbers, they would know what to expect from 9999999999999999.0 – 9999999999999998.0 with 64-bit floating-point. So though seeing a range here won't magically restore precision that's been lost, having seen ranges earlier would have helped, before trying to carry out this calculation.

Nevertheless, you have a good point, and what I take away from this is that showing a range for the end result of a computation (instead of for a number directly entered by the user/programmer) can be misleading if the result of the exact computation wouldn't actually have been in that range.
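The thread's headline calculation is itself a two-line demonstration: 9999999999999999 is above 2^53, where binary64 spacing between representable integers is 2, so it rounds (ties-to-even) to 10^16, while 9999999999999998 happens to be exactly representable:

```python
# 9999999999999999.0 can't be stored exactly; it rounds to 1e16.
print(9999999999999999.0 == 1e16)               # True
# Hence the headline result is 2.0, not 1.0:
print(9999999999999999.0 - 9999999999999998.0)  # 2.0
```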

When NASA can't even get it right, because of "surprises", there's no chance in hell I'm blaming us mere mortal programmers... or even 10x wizards. (0)

It's time to look at other ways to depict fractional parts of numbers in a computer. I know that one can express any rational number as an integer fraction. And our computers are incapable of expressing an irrational number exactly - they can only do so to a certain precision... In other words, every number a computer expresses is a rational number.

The exception is if the computer could express irrational numbers as symbolics; then we could work with the symbolic instead. And then, as a last pass, the symbolic could convert to an imprecise rational depiction, or be expressed as its native type.

(0) https://itsfoss.com/a-floating-point-error-that-caused-a-dam...

That failure had nothing to do with floating point specifically. The same failure could have occurred when converting any data type with larger range (including rationals or arbitrary precision numbers) to 16 bit integer.

I would say the fact that floats can have data larger than INT16_MAX is hardly a surprise. That was just a bug, not some great and surprising drawback of floating point.

Rationals get unwieldy quickly, even with the simplest of arithmetic. A couple of additions is enough to get a large denominator.

That has not been my experience using a language with builtin support for rationals. The rational is simplified after each operation so it never grows unwieldy. It is slower than floats, but imo vastly superior for most use cases.

> The rational is simplified after each operation so it never grows unwieldy.

We recently had an exponential memory growth bug because rationals can not always be simplified, for example if you start with (2/3) and repeatedly square it. Fortunately, this was not in user-facing code, so there was no chance of a denial-of-service attack, but that's definitely something to watch out for with rationals.
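That squaring example is worth seeing concretely: the fraction stays in lowest terms the whole time, yet the denominator's digit count roughly doubles each step, so simplification can't save you:

```python
from fractions import Fraction

x = Fraction(2, 3)
for _ in range(5):
    x = x * x          # gcd-reduction happens automatically, but 2/3 is
                       # already reduced, so nothing ever cancels
print(x.denominator)           # 3**32 = 1853020188851841
print(len(str(x.denominator))) # 16 digits after only 5 squarings
```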

"Most" use cases? You mean the use cases that don't involve heavy math, like graphics, physics, and statistics?

Even simplifying after every operation, in the typical case exact rationals grow exponentially in the number of terms in the computation. This means that either: (a) you cannot use them for any non-trivial computation. (b) you have to round them, in which case they are strictly worse than floating-point numbers because they have redundant representations and a very non-uniform distribution.

Ariane 5 is ESA not NASA

You can work directly in reals.


That package approximates reals. It fails to do equality on infinite precision reals, for example.

Yeah, the limitations of FP are well-known to anyone who does much numerical work.

Floating point numbers are the optimal minimum message length method of representing reals with an improper Jeffreys prior distribution. A Jeffreys prior is a prior that is invariant under reparameterization, which is a mandatory property for approximating the reals.

In this case, it is where Prob(log(|x|)) is proportional to a constant.

Thus, we aren't going to ever do better than floats if we are programming on physical computers that exist in this universe. There is a reason why all numerical code uses them. Best to learn their limitations if you are going to use them, otherwise use arbitrary precision.

Outside of the academic world decimals are almost always a better solution if performance isn't critical.

Most logic is multiplicative. For example, apply a 30% tax to a dollar quantity and display both subtotal and grand total. With floats, there are inequalities. With decimal there usually aren't unless you're dividing, but we already have to deal with divide errors in base ten, it is much more likely to need to represent 0.30 than 1/3, and because ten's factors are 5 and 2, every binary fraction is exactly representable in decimal, so binary doesn't really get us anything but headaches anyway. It's true that there are still gotchas, but they happen less often and usually don't end up looking stupid and weird for no reason. That 0.1 + 0.2 = 0.30000000000000004 is dumb and we all know it.
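A quick illustration of both halves of that claim, using Python's stdlib decimal type: the 0.1 + 0.2 case becomes exact, while division by 3 still rounds — base ten has its own divide errors:

```python
from decimal import Decimal

print(Decimal('0.1') + Decimal('0.2') == Decimal('0.3'))  # True
third = Decimal(1) / Decimal(3)   # rounds at the context precision (28 digits)
print(third * 3)                  # 0.9999999999999999999999999999, not 1
```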

> Outside of the academic world decimals are almost always a better solution

Is “academic world” now a shorthand for “all numerical computing”?

Decimals basically never make sense, except possibly in some calculations related to money. Those make up a minuscule part of modern computer use.

Maybe decimals are also better for homework assignments for schoolchildren?

The type of applications where decimals are useful are by and large insensitive to compute speed and need no special hardware support. You can easily write your code for decimal arithmetic on top of integer arithmetic hardware.

Those of us who need binary floating point for graphics, audio, games, engineering, science, .... won’t stop you.

Even with money I use integers. Instead of dollars (or local currency), I store values internally as pennies (or the local equivalent 1/100 of the main currency). Sometimes when working with interest I'll need to work with floats, and in some databases I have values stored as DECIMAL(8,2) instead of INT, but for the most part I've saved quite a few headaches by keeping my values in INTs.

There are currencies whose lowest-value coin is 1/20 of the main coin.

That seems okay until you need to track sub-penny accuracy somewhere, then you have a big problem.

Lots of applications don't need that, but a surprising amount do, so it's not a global solution.

But there's still a minimum significant value which can be defined from the problem space. Do DECIMAL(16,8) or whatever.

In some situations, there may even be industry or legal standards as to what can be considered rounding noise.

No need for snark. You might be correct that, by “computational volume”, handling currency values might be considered a niche; but even something like World of Warcraft has to handle money at some point.

That doesn't seem dumb at all. Making BCD the default would mean floats use 17% more space for the same precision. That might seem like a small loss, but it's also for an incredibly small gain. Programmers would still have to be aware that testing two decimals for exact equality is dangerous. I don't see the problem with 0.1 + 0.2 = 0.30000000000000004 if you aren't testing floats for direct equality.

> it is much more likely to need to represent 0.30 than 1/3

Citation needed, because this isn't really true.

Even if one concedes your (unspoken) idea that only financial transactions aren't "academic" (which also isn't true), in the real world financial transactions will typically include currency conversions, and those will have all sorts of weird non-decimal factors.

You cannot do anything in finance with such reasoning. Take something simple, say a mortgage at 5% compounded 12 times a year. To compute payments using some fixed length representation or decimal is going to lead to more error than to use the usual floating point. This rabbit hole would continue for many applications.

Floating point makes them all much easier to do well.

Have you worked on finance software? I have - we always used ints for everything, so we could avoid rounding suprises

I did a web project in the gambling space ~10 years back - we were legally required to perform all calculations as integers in ten thousandths of a cent (or millionths of a dollar).

We chose to _not_ do _any_ calculations client side in Javascript...

Which regulation is that? I've worked on financial applications, but not gambling, and I've not heard of this regulation. I should probably know about it!

Australian, or possibly Tasmanian state regs. This would have been around 2011 or so (The Samsung Galaxy S2 was the "top of the range Android phone" at the time...)

I worked for a gambling company in the UK for a bit. They did all their maths in pennies, not thousandths of a penny.

Yes, I have. I also have a math PhD, have written scads of scientific and numerical software, and have written articles on floating-point math. So now that we have enough of our personal accolades out of the way, let's focus on facts regarding calculations:

How did you use ints to compute compounded interest on loans? I asked that above, and you avoided it. I ask again.

For example, suppose you have a mortgage where you lent $100K at 5% annual, compounded 12 times a year, for 30 years, and you need basic values regarding this loan.

Often in such calculations you need to compute 100K*(1+0.05/12)^360. How do you do that with integers? Naively you need (1+(1/20)/12)^360, which as a reduced fraction each of the numerator and denominator have over 2800 binary digits. Do you really do this with integers?

Now put that in a mortgage trading or pricing system where it needs to do millions/billions of those per second.

Doing this as double gives enough precision to make the difference between computed and infinitely precise negligible (approx. 10^-17 error).

It's easy to make examples where doing incremental calculations, rounding to pennies and storing, results in long term error. In these cases I don't see how do to it with integers without massive overhead.

And this is a trivial, common example. Hedge-fund-style work, or anything using numerical integration to build pricing models, would be astoundingly hard to do with integer-only math.

What finance software did you write? A simple ledger works fine as integers. Anything more complex will hit performance and scaling issues soon after the basics.
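The mortgage factor above is easy to sanity-check: compare the double-precision result against an exact rational computation (Python shown; the exact value has a huge denominator, but that's fine offline):

```python
from fractions import Fraction

# Double precision: 100K * (1 + 0.05/12)^360
approx = 100_000 * (1 + 0.05 / 12) ** 360
# Exact rational: 1 + 5/100/12 = 241/240, raised to the 360th power
exact = 100_000 * (1 + Fraction(5, 100) / 12) ** 360
rel_err = abs(Fraction(approx) - exact) / exact
print(float(approx))   # roughly 446,774 dollars
print(float(rel_err))  # tiny - far below a penny on a 30-year loan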

I’ve not done any real finance programming, so is this a reasonable explanation?

Currency is stored as a count of cents (millicents if being fancy). Therefore the two main features of floats are not useful:

- Support for very small numbers is not needed. Floats dedicate approx half their range to numbers between -1 and +1, this is wasted when counting whole cents.

- Support for very large numbers at the expense of precision is actively bad, as the precision must always be down to individual cents.

So the useful range of floats is much reduced when using floats for counting, approx 54 bits out of a 64 bit float are used. Instead ints (“counting numbers”) are much better for counting cents than floats (which approximate the continuous real numbers in a finite number of bits).
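The "counting" distinction shows up right at 2^53, the limit of the binary64 significand — above it, consecutive integers are no longer representable:

```python
# Whole numbers are exact in binary64 only up to 2**53.
print(2.0 ** 53 + 1 == 2.0 ** 53)  # True - the +1 is silently lost
print(2 ** 53 + 1 == 2 ** 53)      # False - integer arithmetic is exact
```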

IBM Decimals are the standard for finance applications. Integers, not floating point.

Floats don't have the precision required.

There's no reason for every step of a computation to be confined to the same very small message length. And the necessary error analysis should be built into the language, preferably in the same "advanced users only, here be dragons" package as the imprecise types themselves.

So interestingly, processor makers are on the same page with you re: computations, and lots of processors can internally do computations in "extended precision", e.g. 80-bit floats, only converting to/from 64-bit doubles at the start and end of the computation.


That hasn’t been the case for over a decade.

It may be coming back though.

IBM's new and rising supercomputer architecture, POWER9, supports hardware IEEE binary128 floats (quad precision). Their press claims the current fastest supercomputer in the world uses POWER9.

The ppc64 architecture (still produced by IBM) supports "double-double" precision for the long-double type, which is a bit hacky and software-defined, but has 106 bit mantissa.

And ARM's aarch64 architecture supports IEEE binary128 long-doubles as well, though it is implemented in software now (by compiler). Maybe they plan a hardware implementation in the future?

How do you mean? The x86-64 instruction set / ABI specifies long doubles as 80-bits, and still supports them ...

Essentially there are two different sets of floating point instructions on x86 and x86-64:

- the x87 instructions, which descend from the original 8087 coprocessor (and have 80-bit registers), and

- the SSE instructions, which descend from the Pentium MMX feature set, are faster, support SIMD operations, and can be fully pipelined.

The x87 instructions are basically for legacy compatibility, or if you manually use long doubles on some platforms.

The idea behind extended precision registers was good in theory, but ultimately caused too much hassle in practice.

Yep - and there are absolutely some cases where you do want to manually use it, which is why the x86_64 SysV ABI (used by Linux and OS X) still specifies the long double type as 80-bits, and why GCC and Clang will still emit these instructions when long doubles are used!

(Sorry, this is more for the folks who aren't familiar with this, since it seems like you are familiar, but I didn't want it to seem like this isn't widely supported when they read "legacy" or "some platforms")

Here is a good toy example that runs into the same numbers shown in the parent, showing the two different instruction types, and that long double can give you the correct answer while still running in hardware, vs. going all the way to float128s, which are currently emulated in software!

Code w/ Assembly: https://godbolt.org/z/W3ZmqJ Output: https://onlinegdb.com/Sy_I3Q1ME

I agree that extended precision can be very useful, though I think the failing was on the software side: basically languages and compilers didn't provide useful constructs to control things like register spilling (which caused the truncation of the extended precision).

The current hardware trends seem to be providing instructions for compensated arithmetic, like FMA and "2sum" operations. I think this is ultimately a better solution, and will make it possible to give finer control of accuracy (though there will still be challenges on the software/language side of how to make use of them).

All the floating-point arithmetic that is natively supported these days is the 32- and 64-bit kind in SSE instruction sets and its extensions. The fact that something is "available" doesn't mean much in terms of actual support. As far as I know, long double means 128-bit floats in modern clang/gcc, and they are done in software.

Long doubles are typically 80-bit x87 "extended precision" doubles as far as I've seen. (Except on windows :-P ). It's part of the reason why LLVM has the 80 bit float type.



They are definitely still supported in modern Intel processors. That said, there can be some confusion because they end up being padded to 16 bytes for alignment reasons, so take 128 bits of memory, but they are still only 10 byte types.

They are a distinct type from the "quad" precision float128 type, which is software emulated as you mentioned.

All that being said, you are right that most of the time float math ends up in SSE style instructions, but as soon as you add long doubles to the mix, the compiler will emit x87 style float instructions to gain the extra precision.

Example: https://godbolt.org/z/PMZVdb

And nobody uses this terrible mis-feature in practice, everything runs via 64 bit xmm registers.

Rightly so, because programmers want their optimizing compiler to decide when to put a variable on the stack and when to elide a store/load cycle by keeping it in a register. With 80 bit precision, this makes a semantic difference and you end up in volatile hell.

Yeah, I agree that everything typically runs in XMM registers and that's what people want. I'm not sure what about the availability of extended precision makes it a misfeature? For some cases it IS what you want, and it's nice to be able to opt in to using it.

EDIT: If I had some application where I needed the extended range, like maybe I was going to run into the exact numbers above, I'd appreciate the ability to opt-in to this. Totally agree I wouldn't want the compiler to surprise me with it, but also not terrible, or useless.

Code w/ Assembly: https://godbolt.org/z/W3ZmqJ Output: https://onlinegdb.com/Sy_I3Q1ME

To be fair, the problem you describe isn't inherent to 80-bit floating point values. If you use 80-bit values in your ABI or language definition, it won't occur - it occurs when you try to use a wider type to implement a narrower type, e.g., implementing 64-bit floats (as specified in the ABI or language) with 80-bit operations.

In that case, the extra precision is present and "carried across" operations when registers or the dedicated floating point stack are used, but is discarded when values are stored to a narrower 64-bit location. So the problem is really one of mismatch between the language/ABI size and the supported hardware size. Of course, 80 bits isn't a popular floating point size in modern languages anymore, so this happens a lot.

The x87 ISA does, yes, and they are supported for binary compatibility reasons. However the actual x87 registers are shadowed by the vector registers so you can only use one. Any modern vectorizing compiler uses the vector instructions for FPU arithmetic, even when scalar, with a max precision of 64-bit.

>The x87 ISA does, yes, and they are supported for binary compatibility reasons.

Well x86-64 is not binary compatible with x86 so that's not the reason. It is mostly for software relying on either the rounding quirks or the extended 80 bit precision I guess.

> However the actual x87 registers are shadowed by the vector registers so you can only use one

You are confusing them with the legacy MMX registers, which are deader than the x87 FP stack. XMM registers do not shadow the FP stack.

The x86-64 instruction set does not specify any particular ABI and definitely does not specify the precision of a C language type.

Fair point, I was a little fast and loose with my words there, which is definitely dangerous when it comes to things like C language / ABI standards! :-P

A small message length means small memory, which is important in physical computers. It is information-theoretic optimal. This is a well-defined term.

The limitations should be well known. One of the first things I check when joining a financial software project is how the system represents money. I’m rarely surprised.

(It’s inevitably floats or doubles)

> Floating point numbers are the optimal minimum message length method of representing reals with an improper Jeffreys prior distribution.

Do you have a link to a proof or discussion of this? I haven't heard this before and I would love to have this statement unpacked a little more.

> Thus, we aren't going to ever do better than floats if we are programming on physical computers that exist in this universe.

Maybe, but that doesn't mean the particular implementation of floats being used is the best one. See also: Unums and Posits


To my understanding, experiments with unums showed shortcomings that Gustafson didn't anticipate and led to posits, which drop the fixed length constraint. Doing that makes improving precision a lot easier but at the cost of computation time. Overall I am not convinced that the current implementation is optimal, but it is a very good trade-off between speed and precision.

> Doing that makes improving precision a lot easier but at the cost of computation time.

Not quite. The difference in computation time is due to the current lack of hardware support, not something inherent to the underlying encoding method. So in practice you are right, but in, for example, embedded contexts without floating point hardware, the performance advantages of IEEE floats should disappear (especially if using a 16 or 8 bit posit suffices).

Posits are simpler to implement than IEEE floats (fewer edge cases) and use more bits for actual numbers, whereas IEEE floats waste about half on NaNs. The use of tapered precision is also nice.

Even if hardware support existed, it seems like a variable length encoding has some inherent overhead relative to a fixed length encoding. If you have a "base length" of e.g. 32 bits and occasionally expand to 64, there's an inherent cost there in both computation and memory, presumably for greater precision. Perhaps that overhead could be minimal with hardware support, but it seems it must have some.

Those are type one unums, not posits. What you are saying about variable length encoding may be true, but it does not actually apply to the current comparison. Type 2 unums are also fixed length, but have other issues.

'nestorD was discussing the effects and overheads of "dropping the fixed length constraint" in the comment you replied to.

Oh darn, you're right. My bad!

In my defense, the comment he replied to got downvoted and I thought it was nestorD, so I was "primed" to misinterpret his comment as criticizing unums in general.

People get upset that floating point can’t represent the infinitely many real numbers exactly - I can’t understand how they think that’s going to be possible in a finite 64 bits.

To hit the point home a little harder: you can easily iterate through the entire representable set of float32 on a modern machine within seconds. I've encountered many engineers who don't quite get that.

Right - if you have a monadic function that takes a 32-bit float, your tests should probably literally cover every single input value.
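To make the scale concrete, a small sketch (Python; a pure-Python loop over all patterns takes minutes rather than seconds, but the point about feasibility stands — in C it's seconds):

```python
import struct

# A float32 has exactly 2**32 bit patterns, so a one-argument float32
# function can be tested exhaustively. Reinterpret a pattern as a float:
def f32_from_bits(b):
    return struct.unpack('>f', struct.pack('>I', b))[0]

print(2 ** 32)                    # 4294967296 inputs in total
print(f32_from_bits(0x3F800000))  # 1.0
# Exhaustive loop would be:
# for b in range(2 ** 32):
#     check(f32_from_bits(b))
```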

Wait, where did OP's 64-bit slot go?

> I can’t understand how they think that’s going to be possible in a finite 64 bits.

You apparently stole 32 of them to make your bat.

If you put them back your tests balloon to half a century each.

An alternative calculation: https://news.ycombinator.com/item?id=18109432

"You can rent a Skylake chip on Google Cloud that'll perform 1.6 trillion 64 bit operations per second for $0.96/hr preemptively. That's enough to run one instruction over a 64 bit address space exhaustively over 120 days, or for ~$2800"

It might not make economic sense to actually make this happen for any realistic test, but it's interesting that it might actually be feasible on any kind of human timescale...
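A back-of-envelope check of the quoted figures (assuming the stated 1.6e12 ops/sec at $0.96/hr; the quote's 120 days presumably assumes slightly higher sustained throughput, but it's the same ballpark):

```python
ops = 2 ** 64              # one instruction per 64-bit input
seconds = ops / 1.6e12     # at 1.6 trillion ops/sec
days = seconds / 86400
cost = days * 24 * 0.96    # at $0.96/hr
print(round(days), "days,", round(cost), "dollars")  # ~133 days, ~$3000
```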

At some point, your test switches from testing the code, to testing the machine the code runs on. That likely happens before 120 days.

Given issues like the intel fdiv bug that may make sense to test if you really want to avoid running into hardware specific bugs.

Maybe, but I'd rather a test suite that's designed to test hardware, rather than overloading some code's unit tests.

I think most unit tests are best served by testing key values- e.g. values before and after any intended behavior change, values that represent min/max possible values, values indicative of typical use.

The unit test can serve as documentation of what the code is intended to do, and meaninglessly invoking every unit test over the range of floats obscures that.

There are certainly cases where all values should be tested, but I don't think that's all cases.

Wow that's aggressively snarky.

I presume they were saying 'and for 32-bit floats you also get this property that you can...'

Or on any computer at all, even an “infinite” (at least unbounded) computer like a Turing machine, considering that almost all real numbers are not computable.

Well, you don't need to represent all the real numbers. You can get quite far with just rationals or algebraic numbers, although you'll have trouble with exponentials and trignometry. And computable numbers are basically superior to any other number system for computation.

You of course need an unbounded but finite amount of space to store these numbers, which is perfectly fine.

> And computable numbers are basically superior to any other number system for computation.

I don't think that's really quite true. The point of FP is that you don't get any weird statefulness in your compute complexity as values accumulate: every operation takes O(1) time regardless of how many operations came before. For rationals and algebraics that isn't the case.

You'll have massive performance issues the second you end up with a relatively prime numerator or denominator that ends up in an iterative algorithm.
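The blowup is easy to demonstrate with exact rationals from the standard library. Here the logistic map is just an illustrative iterative algorithm; each step roughly squares the denominator:

```python
from fractions import Fraction

# Iterate x -> r*x*(1-x) with exact rationals. The denominator roughly
# squares every step, so the cost of each operation depends on how many
# operations came before -- exactly the statefulness floats avoid.
x = Fraction(1, 3)
r = Fraction(7, 2)
for _ in range(10):
    x = r * x * (1 - x)

print(len(str(x.denominator)))  # hundreds of digits after only 10 steps
```

With float64, the same ten iterations are ten constant-time multiplications.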

To me the only downside of IEEE 754 is that most languages, including C and C++, do not provide a sensible canonical comparison method. This leads to surprised beginners and then a ton of home-made solutions which are often not appropriate.

Would you really want a default comparison where a == b does not imply a - b == 0?

I think it depends, in languages which have implicit type coercion I think that would hurt. In languages like swift, where you need to explicitly cast even an Int to Double it would be less of a footgun. I'd rather floats have some overloaded operator maybe ~=, for approximate comparison.
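Python's standard library already ships the approximate comparison that an operator like `~=` would sugar over:

```python
import math

print(0.1 + 0.2 == 0.3)              # False: both sides carry rounding error
print(math.isclose(0.1 + 0.2, 0.3))  # True: compares within a relative tolerance
```

`math.isclose` takes `rel_tol` and `abs_tol` keyword arguments, so the tolerance is explicit rather than baked into the operator.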


Very far from a floating point expert here, but what I do is to scale-down by a few odd prime-power factors as appropriate:

Scaling down by powers of 5 is obviously appropriate for decimals, currency etc.

Scaling down by powers of 3 is good for angles measured in the degrees, minutes, seconds system.

If one scales down a lot there is an increased risk of overflow, so one can compensate by scaling up some powers of 2.

The way I think of this is as using my own manual exponent bias [0].

>the exponent is stored in the range 1 .. 254 (0 and 255 have special meanings), and is interpreted by subtracting the bias for an 8-bit exponent (127) to get an exponent value in the range −126 .. +127.

So, for example, even single-precision number are always exact multiples of 1/(2^126), and I'm just changing the denominator to contain powers of 3, 5, 7, ... etc.

[0] https://en.wikipedia.org/wiki/Exponent_bias

Integers are a lot less trouble for many currency problems, but I think some people are afraid of multiplying integer fractions.

In financial calculations I've seen, figures are given in standard magnitudes (per cent, per mille, basis points, integer cents, etc.) which, if you're lucky with your language, can be encoded as types which can be promoted to higher precision (somewhat) transparently.
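A minimal sketch of the integer-cents approach (the 8.75% rate and the rounding rule here are made-up examples; the rounding policy is a business decision, not math):

```python
# Represent money as integer cents so sums and comparisons are exact.
price_cents = 19_99                        # $19.99
qty = 3
subtotal = price_cents * qty               # 5997 cents, exact
tax = (subtotal * 875 + 5_000) // 10_000   # 8.75% in basis points, rounded half up
print(subtotal, tax)                       # 5997 525
```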

Presumably we could actually make decimal floating point computation the default and greatly reduce the amount of surprise. I don't think the performance difference would be an issue for most software.

Decimal floating point won't avoid this issue, for a sufficiently large value the ulp would be 10.

It would solve more common issues like this though:

> I appreciate that it may be surprising that 0.1 + 0.2 != 0.3 at first, or that many people are not educated about floating point, but I don't understand the people who "understand" floating point and continue to criticize it for the 0.1 + 0.2 "problem."

That's not a calculation that should require a high level of precision.

The correct solution is to understand how floating point number systems work and use near comparisons for floats.

Decimal fp is still 'wrong' for, say, 1/3 + 1/3 = 2/3.
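Python's `decimal` module shows both halves of this: the base-10 surprises go away, but repeating decimals still round:

```python
from decimal import Decimal

# Decimal floating point makes decimal literals behave as written:
print(Decimal('0.1') + Decimal('0.2') == Decimal('0.3'))  # True

# ...but 1/3 is a repeating decimal, so it still rounds (28 digits by default):
third = Decimal(1) / Decimal(3)
print(third + third + third == 1)  # False
```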

A lot of real-world data is already in base-10 for obvious reasons, and so an arrangement that lets you add, subtract and multiply those without worrying is worthwhile, even if it can't handle something more exotic.

Would you really call 'any rational whose denominator has factors other than 2 and 5' 'exotic'?

Maybe we really should move back to base-60 like the Babylonians used, then you could at least divide by 3.

Because humans standardized on base-10, and computers are ultimately for humans to use?

Maybe we should also add data types to every language that can convert exactly between inches, feet, miles and every other non-base-10 unit?

The argument "we want to look at base-10 in the end so it should be the internal representation" is really weak and ignores basically every other practical aspect.

The way to avoid this issue is to avoid floating-point numbers that have any implicit zeroes (due to exponent) after its significant digits. Basically restrict the range to only values where it's guaranteed that for any x1 and x2 from the range, (x1-x2) produces a non-zero dx such that x2+dx == x1.

The only example off the top of my head that is floating point is C# "decimal", which actually originates from the Decimal data type in OLE Automation object model (which could be seen in VB6, and can still be seen in VBA):


Note this bit:

"scale: MUST be the power of 10 by which to divide the 96-bit integer represented by Hi32 * 2^64 + Lo64. The value MUST be in the range of 0 to 28, inclusive."

The reason why it's limited to 28 is because the 96-bit mantissa can represent up to 28 decimal digits exactly. The way it's enforced, any operation that produces a result outside of this range is an overflow error (exception in .NET).

I believe IEEE754 floats have that subtraction/addition guarantee (as long as the hardware doesn't map subnormals to zero). The problem in this case is the input numbers are rounded when they are converted from text/decimal to a float, and so aren't exact.

> I believe IEEE754 floats have that subtraction/addition guarantee (as long as the hardware doesn't map subnormals to zero).

They don't - all 11 bits of the exponent (for float64) are in use, so you can have something like 1e300, and then you can't e.g. add 1 to it and get a different number.

    >>> x = 1e100
    >>> x
    1e+100
    >>> y = x + 1
    >>> y
    1e+100
    >>> x - y
    0.0
That's something else: x == y, so the difference must be 0. What you said above is

  if x != y, x - y != 0
which holds for all x and y, but is different to

  if dx != 0, x + dx != x
which fails for some (many!) x and dx.
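Both claims are a couple of lines to verify:

```python
# Adding a nonzero dx can be absorbed entirely...
x, dx = 1e300, 1.0
assert dx != 0 and x + dx == x

# ...but two distinct floats never subtract to exactly zero
# (given gradual underflow, per the comment upthread):
a, b = 10000000000000000.0, 9999999999999998.0
assert a != b and a - b != 0.0
print("ok")
```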

Would ` 9999999999999999.0 – 9999999999999998.0 == 10.0000` really be less surprising?

I think the numbers would need to be something like

9999999999999995.0 – 9999999999999994.0 == 10.0000

Binary-coded decimal formats have more or less already lived and died (both fixed and floating point). They still have areas of applicability, but this idea is very much not a new one - x86 used to have native BCD support, but it was taken out in amd64 IIRC.

I took the table to be a handy guide to where arbitrary precision is the default vs. hw accelerated math.

Filtered by languages I care about, I guess I have no choice but to learn perl 6 if I want correct (but presumably slow) floating point with elegant syntax (my taste might not match yours).

I’d be curious to know what the random GPU languages and new vector instruction sets do with this computation. I don’t think they’re all 754 compliant.

Can't comment on the situation with other GPU languages, but CUDA on GPUs since fermi are 754 compliant, with the exception that certain status flags are unavailable.

Because if there are obvious edge and corner cases, like overflow scenarios, a professional system will either ensure that expectations are lived up to, or flatly deny them as errors.

No surprises.

> exceptionally good way to approximate

You answered your question. 99% of the time being exact is a requirement and calculation speed is utterly unimportant, thus using IEEE 754 results in programs that are fundamentally broken.

Is that really true? In my experience, 99.9% of the time I don't need an exact number; the vanishingly few times when I have such a need (almost entirely calculations involving currency), using a fixed point representation is simple enough.

People do different kinds of work, so there are programmers who experience it both ways, 99% of the time floats are good solution or 99% of the time floats are an incorrect solution. Because of history and language support, classes and other resources for learning to program teach you to use floating-point numbers and don't bother with alternatives. As a result you have a lot of programmers who default to treating every number with a dot in it as floating point number, and they get burned by it, and instead of realizing it's just a gap in their education that they can correct, they treat overuse of floats as a mistaken industry-wide consensus that needs to be overturned.

I think there is an argument to be made for high-level languages defaulting to arbitrary precision math ("make it correct first, fast second"). But considering that we are still fumbling around with fixed-width integers, and that is a much simpler domain, I don't hold my breath on "solving" the problem of reals any time soon.

That's exactly my point: why is the default a lossy format? And consider the distinction between variables and calculations. Formats like IEEE 754 are designed for performing fast, high-accuracy transformations on matrices. I have no complaints about that. But the default arbitrary number format should be able to store exact integers and ratios.

You don't need an exact number, but customer data is universally decimal. As soon as you blindly convert that to IEEE 754, everything is now broken.

Is it really, though? I'm honestly struggling to think of a non-currency situation in which fractional customer data must necessarily be handled as a decimal value -- and, honestly, even if the availability heuristic makes them seem more common than they are, I'd be astonished if even a single percent of the calculations programmers collectively ask computers to perform involve currency. Most real-life situations just don't even inherently _have_ that kind of precision, let alone need it. Seriously, I can't think of a time when I've needed to store a coordinate or a person's height as a decimal value to prevent something from being broken.

My 5/8s wrench disagrees. Happens to store quite nicely in a float, however.

A useful website for these that I ran across recently: https://float.exposed/

For example, entering 9999999999999999.0 into "double" gives https://float.exposed/0x4341c37937e08000 and entering 9999999999999998.0 gives https://float.exposed/0x4341c37937e07fff

My wishlist for such a page would contain two additional features:

1. Allow entering expressions like "a OP b == c", so that one can enter "0.1 + 0.2 == 0.3" or "9999999999999999.0 - 9999999999999998.0 == 1.0" and see the terms on the left-hand side and right-hand side.

2. Show for each float the explicit actual range of real numbers that will be represented by that float. For example, show that every real number in the range [9999999999999999, 10000000000000001] is represented by 10000000000000000, and that every real number in the range (9999999999999997, 9999999999999999) is represented by 9999999999999998.

The author of this one has a blog post about it: https://ciechanow.ski/exposing-floating-point/ and I also like a shorter (unrelated) page that nicely explains the tradeoffs involved in floating-point representations and the IEEE 754 standard, by usefully starting with an 8-bit format: http://www.toves.org/books/float/

The IEEE 754 calculator at http://weitz.de/ieee/ does some of what you ask for. You can enter two numbers, see the details of their representation, and do plus, minus, times, or divide using them as operands and see the result.

> 9999999999999999.0 into "double" gives https://float.exposed/0x4341c37937e08000

Nice that it reformats the input to "10000000000000000.0", gets the point across that a 64 bit double float just doesn't have enough bits to exactly represent 9999999999999999.0, but that it does happen to be able to represent 9999999999999998.0.

An easy rule of thumb is each 3 decimal digits takes 10 bits to represent. 9999999999999999 is 16 (= 15 + 1) decimal digits. And 3 bits can only represent 0-7. So you need more than 3 bits for that final decimal digit. So, 50 + 4 bits.

IEEE 754 64-bit floats have 53 significant bits ("mantissa").
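The rule of thumb checks out numerically, since log2(10) ≈ 3.32 bits per decimal digit:

```python
import math

print(math.ceil(16 * math.log2(10)))    # 54: bits needed for 16 decimal digits
print((9999999999999999).bit_length())  # 54: one more than the 53-bit significand
print(2**53)  # 9007199254740992: beyond this, not every integer fits in a double
```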

This is awesome, I tried to read and understand the 754 float spec before, and I didn't really get it.

Try playing around with half precision, it makes things a lot easier to understand.

OT: What a lovely tld .exposed is. I… really wonder about its majority userbase.

The arithmetic is correct - the problem is that "9999999999999999.0" isn't representable exactly.

9999999999999998.0 in IEEE754 is 0x4341C37937E07FFF

"9999999999999999.0" in IEEE754 is 0x4341C37937E08000 - the significand is exactly one higher.

With an exponent of 53, the ULP is 2 - so parsing "9999999999999999.0" returns 1.0E16 because it's the next representable number.
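The bit patterns above can be reproduced from any Python prompt by reinterpreting a double's bytes:

```python
import struct

def to_bits(x):
    # Reinterpret a double's 8 bytes as an unsigned 64-bit integer.
    return struct.unpack('<Q', struct.pack('<d', x))[0]

print(hex(to_bits(9999999999999998.0)))  # 0x4341c37937e07fff
print(hex(to_bits(9999999999999999.0)))  # 0x4341c37937e08000
print(9999999999999999.0 == 1e16)        # True: the literal rounds to 1.0E16
```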

    Using one of these workarounds requires a certain prescience of the
    data domain, so they were not generally considered for the table above.
Doing arithmetic reliably with fixed-precision arithmetic always requires understanding of the data domain. If you need arbitrary precision, you'll need to pay the overhead costs of arbitrary-precision: either by opting-in by using the right library, or by default in languages like Perl6 and Wolfram.

What is the "right answer"? Is the article claiming that such languages don't respect IEEE-754, or that IEEE-754 is shit?

If you want arbitrary precision, use an arbitrary precision datatype. If you use fixed precision, you'll need to know how those floats work.

Pointless article, imho.

Note that the last example in the list, Soup, handles the expression "correctly", and also happens to be a programming language the author is working on.

> Is the article claiming that such languages don't respect IEEE-754, or that IEEE-754 is shit?

No, I don't think so. Where does that come from? The page doesn't mention FP standards at all.

> If you want arbitrary precision, use an arbitrary precision datatype.

That's the point. Half of them don't offer this feature. The other half make it very awkward, and not the default.

We went through this exercise years ago with integers. These days, there are basically two types of languages. Languages which aim for usability first (like Python and Ruby), which use bigints by default, and languages which aim for performance first (like C++ and Swift), which use fixints by default. It's even somewhat similar with strings: the Rubys and Pythons of the world use Unicode everywhere, even though it's slower. No static limits.

With real numbers, we're in a weird middle ground where every language still uses fixnums by default, even those which aim for usability over performance, and which don't have any other static limits encoded in the language. It's a strange inconsistency.

I predict that in 10 years, we'll look back on this inconsistency the same way we now look back on early versions of today's languages where bigints needed special syntax.
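The inconsistency is visible in one language today. Python ints are arbitrary precision by default, its floats are not:

```python
print(9999999999999999 - 9999999999999998)      # 1   (bignum by default)
print(9999999999999999.0 - 9999999999999998.0)  # 2.0 (fixed 64-bit float)
```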

> Pointless article, imho.

I'm sorry you thought so. It pops up pretty often and always seems to spark a lot of conversation, so I think most programmers that give it any thought can find it a very interesting area of study.

There's an incredible amount of creep: We have what starts with nice notation (like x-y) and have to trade a (massively increased) load in either our minds or in the heat our computer generates. I don't think that's right, and I think the language we use can help us do better.

> What is the "right answer"?

What do you think it is?

Everyone wants the punchline, but this isn't a riddle, and if this problem had a simple answer I suspect everyone would do it. Languages are trying different things here: Keeping access to that specialised subtraction hardware is valuable, but our brains are expensive too. We see source-code-characters, lexicographically similar but with wildly differing internals. We want the simplest possible notation and we want access to the fastest possible results. It doesn't seem like we can have it all, does it?

I think the surprise was that Go uses arbitrary precision for constants.

There are several.

If you subtract two numbers close to each other with fixed precision you don’t know what the revealed digits are. (1000 +/- .5) - (999 +/- .5) = 1 +/- 1.

Thus 0, 1, and 2 are all within the correct range.

What does revile mean in this context?

Edit: revealed

Floating point numbers have X digits of accuracy based on the format. (Using base 10 for simplicity) Let’s say .100 to .999 times 10^x.

But what happens when you have .123x10^3 - .100x10^3? It’s .23?x10^2, but what is that "?" digit? We might prefer to pick 0, but it really could be anything. We can’t even be sure about the 3. If the numbers were .1226x10^3 and .1004x10^3 before being rounded, the correct answer would be .222x10^2.
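This toy 3-digit base-10 format can actually be simulated with Python's `decimal` module by shrinking the context precision (unary `+` rounds a value to the current context):

```python
from decimal import Decimal, getcontext

getcontext().prec = 3     # toy format: 3 significant decimal digits
a = +Decimal('122.6')     # rounds to 123
b = +Decimal('100.4')     # rounds to 100
print(a, b, a - b)        # the true difference is 22.2, but we get 23
```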

Yeah, that's just a limitation of the format. Approximating an uncountably infinite quantity of numbers with only 64 bits is never going to be exact.

However, you aren't going to do any better without using vastly more expensive arbitrary precision.

You could see it as a "limitation of the format", or you could see it as exchanging one type of mathematical object for another.

For example, CPU integers aren't like mathematical integers. CPU integers wrap around. So CPU integers aren't "really" the integers—CPU integers are actually the ring of integers modulo 2^n, with their names changed!

I'm not sure what the name of the ring(?) that contains all the IEEE754 floating-point numbers and their relations is called, but it certainly exists.

And, rather than thinking of yourself as imprecisely computing on the reals, you can think of what you're doing as exact computation on members of the IEEE754 field-object—a field-object where 9999999999999999.0 - 9999999999999998.0 being anything other than 2.0 would be incorrect. Even though the answer, in the reals, is 1.0.

Floating point numbers aren't a ring. In fact, floating point addition isn't even associative. Thus, they don't even rise to the level of a group or even a monoid.

The point is to illustrate a simple fact that most of us know- but maybe some don't.


It doesn't even illustrate that particularly well. As is, the page just seems to be pointing at floating point and yelling "wrong", with no information on what's actually happening.

By all means embrace the surprise and educate today's 10,000, by why not actually explain why these are reasonable answers and the mechanics here behind the scenes?

I was one of those people today! This intrigued me enough to learn more about IEEE 754.

Thank you for the relevant xkcd!

The right answer is to convert to an integer or bignum. If the language reads 9999999999999999.0 as a 32 bit float, you will get 0.0. If it's a double, you'll get 2.0.
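Both claims can be checked by round-tripping the doubles through a 32-bit float:

```python
import struct

def as_float32(x):
    # Round a Python float (a 64-bit double) through 32-bit storage.
    return struct.unpack('<f', struct.pack('<f', x))[0]

a, b = 9999999999999999.0, 9999999999999998.0
print(a - b)                          # 2.0 with 64-bit doubles
print(as_float32(a) - as_float32(b))  # 0.0: both round to the same float32
```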

I don't think there is a "right" answer. Defaulting to bignum makes no more sense than defaulting to float for inputs "1" and "3" if the operation to be performed on the next line is division. Symbolic doesn't make sense all of the time either, what if it's a calculator app and the user enters "2*π", they probably don't want "2π" to be the result.

If we're going to try to find a "right" answer from a language view without knowing the exact program and use cases then the most reasonable compromise is likely "error" because types weren't specified on the constants or parsing functions.

There is a mathematically correct answer for this problem given the decimal representations. That's the correct answer for the math, period. What "good enough" behavior is for a system that uses numbers under the hood depends on context and is only something that the developer can know. Maybe they're doing 3D graphics and single precision floats are fine; maybe they're doing accounting and they need accuracy to 100ths or 1000ths of a whole number.

The appropriate default is, I would argue, the one which preserves the mathematically correct answer (as close as possible) in the majority of cases and enables coders to override the default behavior if they want to specify the exact underlying numerical representation they desire (instead of it being automatic). That goes along with the "principle of least surprise" which is always a good de facto starting point for any human/computer interaction.

Take pen and paper and subtract the two numbers. Whatever number you get for the difference is "the right answer."

Ok, I'll try with pi and e. I'll be right back.

The point is that this reveals a common weakness in most programming languages: not that floating point math has limits, but that those limits aren't well communicated to the user. One of the hallmarks of good programming language design is the "principle of least surprise", which funky floating point problems definitely violate. Very few of the people who use programming languages have taken numerical analysis, and many devs are not well versed in the weaknesses of floating point math. So much so that a very common way for devs to become acquainted with those limits and weaknesses is by simply blundering into them, unknowingly writing bugs, and then finding the sharp corners in the dark the hard way. This is not ideal.

Consider a similar example, pointers. Some languages (like C and C++) use pointers heavily and it's expected that devs using those languages will be experienced with them. However, pointers are very "sharp" tools and have to be used exceedingly carefully to avoid creating programs with major defects (crashes, memory leaks, vulnerabilities, etc.) They are so hard to get right that even software written by the best coders in the world commonly has major defects in it related to pointer use. This problem is so troubling to some that there are many languages (java, javascript, python, C#, rust, etc.) which have been designed to avoid a lot of the most difficult to use aspects of languages like C and C++: they use garbage collection for memory management, they discourage you from using pointers directly, and so on. However, even those languages do very little to protect the user from blundering into a minefield of floating point math.

Consider, for example, simply this statement:

x = 9999999999999999.0

Seems rather straightforward, right? But it's not; it's a lie. In many languages the value of x won't be as above, it'll be (to one decimal digit of precision) 10000000000000000.0 instead, whereas the value of ....98.0 matches its double precision representation to one decimal digit of precision (thus the difference between the two comes out as 2.0 instead of 1.0).

Now, maybe in a "the handle is also a knife" language like C this is fine, but we have so many languages which go to such extremes everywhere else to protect the user from hurting themselves, except when it comes to floating point math. Here is a perfect case where the compiler, runtime, or IDE could raise an error or a warning: you are asking the language for something it can't do in the way you've written it, and that sounds like an error to me.

The string representation of this number implies that you want a precision of at least the 1's place in the decimal representation, and possibly down to tenths. If that's not possible, it would be helpful for your development toolchain to tell you so, as close to the point where you write it as possible, so that you know what's actually going on under the hood and the limitations involved.

Something which would also drive developers towards actually learning the limitations of floating point numbers closer to when they start using them in potentially dangerous ways, instead of having to learn by fumbling around and finding all the sharp edges in the dark. The sharp edges are known already; tools should help you find and avoid them, not help new developers run into them again and again.
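The "lie" in the assignment above can already be made visible today: converting a float to `Decimal` shows the exact value the literal actually stored.

```python
from decimal import Decimal

# Decimal(float) is exact, so it reveals what the literal really became:
print(Decimal(9999999999999999.0))  # 10000000000000000 -- silently rounded
print(Decimal(9999999999999998.0))  # 9999999999999998  -- representable exactly
```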

Using one of these workarounds requires a certain prescience of the data domain

I'm a little concerned if merely knowing the existence of floating point arithmetic constitutes "prescience."

Are there any mainstream languages that consider a decimal number to be a primitive type? I feel like floating point numbers are far less meaningful in everyday programs. Even 2d graphics would be easier with decimal numbers. Unless you're using numbers that scale from very small to very large, like 3d games or scientific calculations, you don't actually want to use floating point.

> Are there any mainstream languages that consider a decimal number to be a primitive type

Mathematica. But it's not particularly fast.

> Unless you're using numbers that scale from very small to very large, like 3d games or scientific calculations, you don't actually want to use floating point.

Unfortunately, we can sometimes only use floats in 3D graphics, and floats aren't even good for semi-large to large 3D scenes. Unity is a particularly bad offender. It's not even necessary for meshes, but having double precision transformation matrices would make life so much easier. You could simply use double precision world and view matrices, then multiply them together; the large terms would cancel out in the resulting worldView matrix, which can then be cast back to single precision floats.

Not to mention z-fighting for distant objects.

Depends how you define 'primitive type.' A decimal number is built-in for C# and comes along with the standard libraries of Ruby, Python, Java, at least.


Julia has built in rationals (as do a few other languages).

I'm not aware of any language (other than Wolfram) that defaults to storing something like 0.1 as 1/10 - i.e. uses the decimal constant notation for rationals, rather than having some secondary syntax or library.

Even in Wolfram, 0.1 is not the same as 1/10.

    In[1]:= Precision[0.1]

    Out[1]= MachinePrecision

    In[2]:= Precision[1/10]

    Out[2]= \[Infinity]

Ah, thanks for the correction.

I don't currently have a license, so out of curiosity is 0.3 == 3/10 in wolfram?

Yes that's true.

According to the documentation of Equal†,

> Approximate numbers with machine precision or higher are considered equal if they differ in at most their last seven binary digits (roughly their last two decimal digits).

Which is why in Mathematica, 0.1+0.2==0.3 is also True.

If you need a kind of equality comparison that returns False for 0.3 and 3/10, use SameQ. Funnily, SameQ[0.1+0.2,0.3] is also True, because SameQ allows two machine precision numbers to differ in their last binary digit.

†: https://reference.wolfram.com/language/ref/Equal.html

Yes, but the Julia REPL produces 2.0 as an answer and casting both these to BigInt doesn't work either.

Thus my second paragraph. You have to opt in to use other formats.

Casting to bigint doesn't work because the problem occurs when converting the decimal constant in the source to floating point. You would have to convince the parser to parse the constant as something besides a float.

Agree - but it's a shame; whatever the reason, this is really, really egregious behaviour.

In Common Lisp it's even standardized. Arithmetic is too important to be left to wrong CPU intrinsics.

There are issues with arbitrary precision decimal numbers. For one, you can't deal with things like 1/3: these are repeating decimals so they need infinite memory to represent.

Do you mean some fixed point decimal number? Cause the normal way to do decimal numbers would still be floating point.

They mean arbitrary-precision decimal arithmetic (i.e. a struct containing bignum x and integer y where the connoted value is x*10^y, such that multiplication can be defined simply as the independent multiplication of the value parts and of the exponent parts.)
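A minimal sketch of that struct (illustrative only; a real library also needs normalization, addition with exponent alignment, and a rounding policy):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Dec:
    # Connoted value: mantissa * 10**exponent, mantissa an arbitrary-precision int.
    mantissa: int
    exponent: int

    def __mul__(self, other):
        # Multiplication really is just independent multiplication of the
        # value parts and addition of the exponent parts.
        return Dec(self.mantissa * other.mantissa,
                   self.exponent + other.exponent)

print(Dec(25, -1) * Dec(4, -1))  # 2.5 * 0.4 -> Dec(mantissa=100, exponent=-2)
```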


There are plenty of languages without the concept of a primitive type...

The linked post is a bit poorly expressed, but I think there is a good point there: fixed-size binary floating-point numbers are a compromise, and they are a poor compromise for some applications, and difficult to use reliably without knowing about numerical analysis. (For example, suppose you have an array of floating-point numbers and you want to add them up, getting the closest representable approximation to the true sum. This is a very simple problem and ought to have a very simple solution, but with floating-point numbers it does not [1].)

Perhaps it is time for the developers of new programming languages to consider using a different approach to representing approximations to real numbers, for example something like the General Decimal Arithmetic Specification [2], and to relegate fixed-size binary floating-point numbers to a library for use by experts.

There is an analogy with integers: historically, languages like C provided fixed-size binary integers with wrap-around or undefined behaviour on overflow, but with experience we recognise that these are a poor compromise, responsible for many bugs, and suitable only for careful use by experts. Modern languages with arbitrary-precision integers are much easier to write reliable programs in.

[1] https://en.wikipedia.org/wiki/Kahan_summation_algorithm [2] http://speleotrove.com/decimal/decarith.html

Do note that UB on integer overflow is (at least nowadays) more a concession to compiler optimizations than a technical necessity: your CPU will indeed just wrap around if you don't live in the 80s anymore, but a C++ compiler may have assumed that overflow won't happen for a signed loop index.

Also worth checking: http://0.30000000000000004.com/

most popular previous discussion: https://news.ycombinator.com/item?id=10558871

There's an easier way to specify long floats in Common Lisp: use the exponent marker "L" e.g. 9999999999999999.0L0. No need to bind or set reader variables.

That said, even in Common Lisp I think it's only CLISP (among the free implementations) that gives the correct answer for long floats.


CLISP:

    [1]> (- 9999999999999999.0L0 9999999999999998.0L0)
    1.0L0
SBCL, CMUCL and Clozure CL:

    * (- 9999999999999999.0L0 9999999999999998.0L0)
    2.0d0
The standard only mandates a minimum precision of 50 bits for both double and long floats, so there's no guarantee that using long floats will give the correct answer, as we can see.


Is floating point math broken?


No, it's just that a lot of people don't understand its limitations.

It’s nice that JavaScript has arbitrary-precision integers now. 9999999999999999n - 9999999999999998n === 1n

Google calculator gives the answer 0 whereas the duckduckgo calculator answers with 2. xD

That's disappointing. The Android calculator has an awesome arbitrary precision engine (implemented by Hans Boehm!).


Interesting thing is that Google calculator will give 2 if you fill in the numbers by clicking on the calculator buttons, instead of writing them in the search bar.

Bing gives 1 though!

So DuckDuckGo uses normal 64-bit floating point, and the clever people at Bing automatically switch to bignums when needed. But I have no idea what Google does to get that 0.

I think Google is truncating the numbers. You get 0 even if you do:

9999999999999999 - 9999999999999990

    9999999999999999 - 9999999999999971 ==  0
    9999999999999999 - 9999999999999970 == 30
    9999999999999999 - 9999999999999969 == 32
    9999999999999999 - 9999999999999966 == 34

This is particularly sucky to solve in C and C++ because you don't get arbitrary precision literals.

    #include <boost/multiprecision/cpp_dec_float.hpp>
    #include <boost/lexical_cast.hpp>
    #include <iostream>

    using fl50 = boost::multiprecision::cpp_dec_float_50;

    int main() {
        auto a = boost::lexical_cast<fl50>("9999999999999999.7");
        auto b = boost::lexical_cast<fl50>("9999999999999998.5");
        std::cout << (a - b) << "\n";
    }

works, while

    int main() {
        fl50 a = 9999999999999999.7;
        fl50 b = 9999999999999998.5;
        std::cout << (a - b) << "\n";
    }

doesn't, even if you swap fl50 out for a quad precision binary float type. Even user-defined literals in C++11 and later don't let you express custom floating point expressions.

> Even user-defined literals in C++11 and later don't let you express custom floating point expressions

Note that in your code sample you're not actually using user-defined literals (https://en.cppreference.com/w/cpp/language/user_literal). This works (based on on your earlier code sample and adding user-defined literals):

    #include <boost/multiprecision/cpp_dec_float.hpp>
    #include <boost/lexical_cast.hpp>
    #include <iostream>
    using fl50 = boost::multiprecision::cpp_dec_float_50;
    fl50 operator"" _w(const char* s) { return boost::lexical_cast<fl50>(s); }
    int main() {
        fl50 a = 9999999999999999.7_w;
        fl50 b = 9999999999999998.5_w;
        std::cout << (a - b) << "\n";
    }

Thanks, it's nice to be wrong! For some reason I had it in my head that you couldn't get the token as a char const* for floating point expressions...

Note in C you can get the correct result if you use long doubles, which normally go to 80 bits[1]:

printf("%Lf\n", 9999999999999999.0L - 9999999999999998.0L);

On my x86_64 computer it breaks when you add enough digits. At this point it started outputting 0.0 as the difference:

printf("%Lf\n", 99999999999999999999.0L - 99999999999999999998.0L);

With 63 bits for the fraction part you more or less get around 19 decimal digits of precision, and the expression above uses 20 significant digits.

[1] https://en.wikipedia.org/wiki/Extended_precision#x86_extende...

Interestingly, SQLite gets it wrong, returning 2.0, but MySQL, MariaDB, Postgres, and Cockroach all get it right at 1.0

I guess this comes down to most of them having implementations of arbitrary precision decimals.

In PostgreSQL, if you specify a decimal literal, it is assumed to be type NUMERIC (arbitrary precision) by default, as opposed to FLOAT or DOUBLE PRECISION.

If you stored your values in table rows as DOUBLE PRECISION, you would of course get the wrong answer.

With python, I get 2 even when using Decimals.

    Python 2.7.3 (default, Oct 26 2016, 21:01:49)
    [GCC 4.6.3] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> from decimal import *
    >>> getcontext().prec
    28
    >>> a=Decimal(9999999999999999.0)
    >>> b=Decimal(9999999999999998.0)
    >>> a-b
    Decimal('2')
That is unexpected.

The issue is that 9999999999999999.0 == 10000000000000000.0. You need to pass a string: Decimal('9999999999999999.0')

This is because you create a python float (64-bit) before passing this float to the Decimal class.

  >>> a = 9999999999999999.0
  >>> a
  1e+16

  >>> Decimal('9999999999999999.0')-Decimal('9999999999999998.0')
  Decimal('1.0')

There is interesting ongoing research on representing exact reals: https://youtu.be/pMDoNfKXYZg

Quick summary of the talk:

A specialist number representation is made for exact representation of values in geometric calculations (think CAD). Numbers are represented as sums of rational multiples of cos(iπ/2n).

Exact summation, multiplication and division (not shown) of these quantities are possible, and certain edge-cases (eg. sqrt) have special-case handling.

The system was integrated into and tested on an existing codebase.

The speaker was also one of the authors of Herbie, if other people remember that.

2 with 64 bit floats, 0 with 32 bit floats.

I could understand 0, but how does it get 2?

FP numbers are (roughly) stored in the form m×2^e (m = mantissa, e = exponent). When numbers can't be represented exactly, m is rounded. My guess is that these numbers end up being encoded as 4999999999999999×2 and 4999999999999999.5×2, where the latter is rounded up to 5000000000000000×2.
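This guess can be checked in Python (assuming 3.9+, where math.ulp reports the spacing between a double and the next representable one):

```python
import math

a = 9999999999999999.0  # not representable: rounds up to 1e16
b = 9999999999999998.0  # an even number, exactly representable

print(a == 1e16)    # True: the literal was rounded when parsed
print(math.ulp(a))  # 2.0: spacing between adjacent doubles at this magnitude
print(a - b)        # 2.0
```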

The nearest doubles to each of these two decimal constants end up being roughly 2 apart. Whereas for fp32 both decimal constants are stored as the same float.

Because 9999999999999999 is rounded to 10000000000000000 before any math even happens. Precision != order of magnitude.

These integers are so large that they cannot be precisely represented by 32 bit or 64 bit floats. So there's a rounding effect. https://stackoverflow.com/a/1848953

because 2 is the interval of precision at that scale. In floating point loss of precision scales with magnitude.

x999 rounds up when stored, x998 stays the same.

$ php -v

PHP 7.2.10 (cli) (built: Oct 9 2018 14:56:43) ( NTS )

$ php -r "echo 9999999999999999.0 - 9999999999999998.0;"

2

$ php -r "echo bcsub('9999999999999999.0', '9999999999999998.0', 1);"

1.0
bcmath - http://php.net/manual/en/function.bcsub.php

Result for all versions of PHP: https://3v4l.org/JYlrp

And with bcmath: https://3v4l.org/AmOQt

What happens with the bcmath extension enabled?

Great point! Edited to add. Decimal looks like a good option in PHP 7, but a bit beyond a command-line implementation: http://php-decimal.io/#introduction

The accounting software we use has a built in calculator which has a similar problem.

5.55 * 1.5 = 8.3249999999999....

26.93 * 3 = 80.7899999999999....

I raised it with the supplier some time ago, they said it's just the calculator app and the main program isn't affected. Quite shocking that they are happy to leave it like this.
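This is the classic argument for doing money in decimal arithmetic. A quick sketch in Python (the stray digits in the float output are platform-typical; the Decimal results are exact):

```python
from decimal import Decimal

# Binary doubles can't represent 26.93 exactly, so the product
# picks up rounding noise in its printed form.
print(26.93 * 3)  # something like 80.78999999999999

# Decimal arithmetic on string literals stays exact.
print(Decimal("26.93") * 3)              # 80.79
print(Decimal("5.55") * Decimal("1.5"))  # 8.325
```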

Pros and cons of the different ways computers can work with fractions: https://softwareengineering.stackexchange.com/a/167166/10934...

The author says, "That Go uses arbitrary-precision for constant expressions seems dangerous to me."

Why?
My thoughts: 1) more inefficient programs because encountering an arbitrary-precision expression requires arbitrarily large memory and computation, 2) more complicated language implementation.

Constant expressions are evaluated at compile time, so any performance penalty is paid during compilation, not at run time. It probably makes the compiler simpler too: no need to implement different arithmetic for different types, and no need to guess the types.

The dangerous bit is that merely extracting a variable from a constant expression might change the result slightly. That shouldn't be a problem unless you depend on exact values.

The dangerous part is that now the value of the computation can change slightly if it's no longer constant. So e.g. if it involves a named constant, and that constant then becomes a variable for some reason (e.g. because it now needs to be computed at runtime, because it varies from platform to platform), you can end up with broken code with no warning.

> Why?

Hint 1: Can you imagine ever moving a magic number from an expression into a constant?

While it is not the default literal type in Haskell, you can use coercion and the Scientific type to compute an (almost) arbitrary precision result. For example this prints 1.0 in the repl:

import Data.Scientific

(9999999999999999.0 :: Scientific) - (9999999999999998.0 :: Scientific)

Java also has BigDecimals for this kind of work.

How do Perl, Wolfram and Soup get the "right" (ahaha...) answer? (I'm not familiar with these languages.)

Of course for others it should be "fixable" where needed:

  ~ python3
  Python 3.6.7 (default, Oct 22 2018, 11:32:17) 
  [GCC 8.2.0] on linux
  Type "help", "copyright", "credits" or "license" for more information.
  >>> from decimal import Decimal
  >>> Decimal('9999999999999999.0') - Decimal('9999999999999998.0')
  Decimal('1.0')

Do any of the languages mentioned give a compiler warning? To help educate?

IMHO it's unfeasible, because the exact same situation as with 9999999999999999.0 (a literal that's impossible to represent accurately as double and will get rounded to something else) applies also to very common cases such as 0.1 (which can't have an exact binary representation at all) - adding a compiler warning for that will mean that the warning will trigger for pretty much every floating point literal.

I don't think that's quite true. For 0.1, the algorithm for printing the number back will still produce 0.1, but the same is not true for 9999999999999999.0, which comes back as 1e16. So compilers could easily warn for this specific situation, where what I'd call the round-trip value of the literal is broken.
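A sketch of that round-trip check in Python: compare the decimal value of the literal with what the nearest double prints back as.

```python
from decimal import Decimal

def survives_roundtrip(literal: str) -> bool:
    # Does formatting the nearest double reproduce the value you wrote?
    return Decimal(repr(float(literal))) == Decimal(literal)

print(survives_roundtrip("0.1"))                 # True: prints back as 0.1
print(survives_roundtrip("9999999999999999.0"))  # False: prints back as 1e+16
```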

But I don't think such a warning would be all that helpful. How often do we use literals with 16 significant digits, expecting exact representation?

The bigger gotcha here is catastrophic cancellation. This is the issue of an insignificant rounding error becoming much more significant due to subtraction of very nearly equal numbers. You can't generally detect this at compile time if you don't know all your numbers in advance (e.g. you're not working with only literals).
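A small Python illustration: (1 - cos x)/x^2 tends to 0.5 as x goes to 0, but the naive form subtracts two nearly equal numbers and loses every significant digit.

```python
import math

x = 1e-8
# cos(x) is so close to 1 that the subtraction cancels all
# significant digits: rounding error dominates the result.
naive = (1 - math.cos(x)) / x**2
# An algebraically identical form that avoids the subtraction.
stable = 2 * math.sin(x / 2)**2 / x**2
print(naive)   # 0.0: all precision lost
print(stable)  # very close to the true value 0.5
```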

You can do abstract interpretation and calculate rounding precision at each line. The problem is that as soon as you have loops, you'll pretty much get that warning everywhere. Being sure that there is no cancellation is even harder, but possible in some cases. I'm sure there are better approaches; there's tons of research papers in that area.

> Several of the results surprised me. Did they surprise you?

Well, the perl6 result surprised me, since that means it's using something more precise than double precision floating point :)

They're called Rat (for Rational number): https://docs.perl6.org/type/Rat , which maintain precision until the denominator exceeds 64 bits: then they're downgraded to doubles. If you want to keep precision still at that level, you can use FatRat: https://docs.perl6.org/type/FatRat
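For comparison (an analogy, not Perl itself): Python's fractions.Fraction is a similar exact rational type, though it never silently downgrades to floats, so denominators can grow without bound.

```python
from fractions import Fraction

# Decimal strings are parsed exactly, with no intermediate float rounding.
a = Fraction("9999999999999999.0")
b = Fraction("9999999999999998.0")
print(a - b)  # 1

# Exact rational arithmetic, no binary-representation surprises.
print(Fraction(1, 3) + Fraction(1, 6))  # 1/2
```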

It surprised me too, since it's not what I got.

    $ perl6 --version
    This is Rakudo version 2018.03 built on MoarVM version 2018.03
    implementing Perl 6.c.
    $ perl6 -e 'print 9999999999999999.0-9999999999999998.0;print "\n";'
    2
(Incidentally, I would have used "say" rather than "print" with an explicit newline.)

Indeed, there was a bug with determining when to switch to floats, that was fixed by Zoffix in August:


  › perl6 -e '.say for $*PERL.compiler, 9999999999999999.0-9999999999999998.0'
  rakudo (2018.11)
  1

Well, one answer is to use IEEE 1788 interval arithmetic, which will at least give you IEEE 754 high and low bounds for a calculation, rather than one answer that's clearly wrong. Otherwise, some inaccuracy is the trade-off for fast floating point calculations.

So.. Can someone better versed in the ways of system level programming tell me why we still use IEEE 754 exponential notation?

I've seen article after article on how "horrible" it is. So, are there default libs to use Binary Coded Decimal (BCD) or something like that?

You need a fixed data size for good performance. If you use fixed precision, you get absolute nonsense when doing very common calculations like `tan(x)`. IEEE 754 was masterfully engineered to produce the fastest, most correct results for the most common operations.

We use it because it's fantastic.

Binary floating points are not for system level programming; they are for scientific programming, where the accuracy of the values is related to the magnitude of the values.

64 bit integers are actually good for a lot of things that floating point gets used for; coordinates already should never have been floating point (the accuracy of measurement is independent of the magnitude for coordinates), but you could represent the entire solar system in millimeters without overflowing 64 bit integers (compared to not even the entire earth in mm for 32 bit integers).
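A rough back-of-the-envelope check (the distance figures below are approximate):

```python
# Pluto's mean distance from the Sun: ~5.9e9 km, i.e. ~5.9e15 mm.
pluto_orbit_mm = 5_900_000_000 * 1_000_000
print(pluto_orbit_mm < 2**63 - 1)  # True: fits comfortably in int64

# Earth's circumference: ~40,075 km, i.e. ~4.0e10 mm.
earth_circumference_mm = 40_075 * 1_000_000
print(earth_circumference_mm < 2**31 - 1)  # False: overflows int32
```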

Also, it's popular to blame javascript for all today's problems, but the double-precision floating point is the only number type it has, which has some effect on its use.

We use it because it's fast, accurate and very useful for many applications. It's not horrible.

In fact, as discussed in another thread, it's the optimal representation for many purposes. The only problems are that some languages over-privilege them to the point where it's difficult to use alternatives, and some programmers don't understand them.

Because ever other alternative is either variable size (which implies allocations, making itb orders of magnitude slower), or has the same problems, just in different places.

Interestingly `bc` as well as `expr` on the command line give the correct result.

"bc - An arbitrary precision calculator language"

bc is doing its job!

bc uses arbitrary precision, not floatys.

Welcome to Apple Swift version 4.2.1 (swiftlang-1000.11.42 clang-1000.11.45.1). Type :help for assistance.

  1> let foo = 9999999999999999.0 - 9999999999999998.0
foo: Double = 2

Actually Wolfram|Alpha does give 1, but Mathematica 11.3 gives 2.

Mathematica interprets the real numbers 9999999999999999.0 and 9999999999999998.0 as having machine precision. To work in arbitrary precision, you need a backtick after the number, followed by the number of significant digits.

In this case,

9999999999999999.0`17-9999999999999998.0`17 does indeed return 1.


In perl5 it's easier:

    perl -Mbignum -e'print 9999999999999999.0 - 9999999999999998.0'
    1
The BigFloat solution on the website is suboptimal.

In Perl 6 even shorter than that:

    perl6 -e'print 9999999999999999.0 - 9999999999999998.0'
    1

But, https://play.golang.org/p/naE55o3_xFP

I guess the first link is converting the float constants to ints at compile-time?

(edit: oh, it's actually mentioned in the article. I should read more carefully)

In gcc, there is software emulation of hardware floating point so that compile-time constants can be evaluated for any target architecture (even if the host architecture does not support that format). Go seems to approximate this as "just evaluate with high precision, then convert to float", which is probably mostly fine, but having arithmetic differ between compile time and run time seems likely to be not fun.

Not that this is specifically a "floating point" problem.

Plenty of languages are going to get upset if you add 2 billion to 2 billion.
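For example, in a language with 32-bit signed ints that sum wraps around. Python's integers don't overflow, so ctypes is used here just to emulate the int32 result:

```python
import ctypes

# INT32_MAX is 2**31 - 1, about 2.147 billion, so
# 2 billion + 2 billion cannot fit in a 32-bit signed int.
total = ctypes.c_int32(2_000_000_000 + 2_000_000_000).value
print(total)  # -294967296: wrapped around modulo 2**32
```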

It's good to educate people about default representations of numerical literals and their corner cases.

Whenever I see these examples I do get annoyed at the Haskell one, because we're never told what type it gets defaulted to; the defaulting only happens silently in the ghci repl, and will trigger a warning if it's in a source file that's being compiled.

sqlite> SELECT 9999999999999998.0 - 9999999999999999.0;
-2.0

MariaDB [(none)]> SELECT 9999999999999998.0 - 9999999999999999.0;
-1.0

I don't get it. Why is the author complaining? He requests 17 digits of accuracy which is not something you should be using floats for in any form. Just import/link a package which can do arbitrary precision arithmetic of your choice and pay the overhead price.

Pascal/Delphi gives the right answer:

writeln(FloatToStr(9999999999999999.0 - 9999999999999998.0));

1

Groovy 2.4.5 gets it right:

println(9999999999999999.0-9999999999999998.0)
1.0

Just tried it in PHP and was given an answer of 2.

The default should have been arbitrary precision bignums with easy opt-in for floats and ints of varying sizes.

But oh well.


I didn't downvote you, but this isn't a problem with computers. It's a problem with the (mis)use of floats.

Floats are not decimals. That's unfortunately a really, really common misconception, owing in part to poor education. Developers reach for floats to represent decimals without thinking about the precision ramifications.

When you're working with decimals that don't need a lot of precision this doesn't generally come up (and naturally, those are the numbers typically used in textbooks). But when you start doing floating point arithmetic with decimals that require significant precision, things get bizarre very fast.

Unfortunately if a developer isn't expecting it, that's likely to happen in production processing code at a very inopportune time. But the computer is just doing what it's told - we have the tools to support safe and precise arithmetic with decimals that need it. It's a matter of knowing how and when to use floating point.

FWIW, a fixed-precision floating-point decimal type would have the same problem. At some point the spacing between two consecutive floating-point values (ULP [1]) simply becomes more than one, no matter the radix.

[1] https://en.wikipedia.org/wiki/Unit_in_the_last_place
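You can see the same effect with Python's decimal module by pinning the context precision (here 16 digits, roughly comparable to a double's 15-17):

```python
from decimal import Decimal, getcontext

getcontext().prec = 16  # emulate a fixed-precision (16-digit) decimal float

a = Decimal(10) ** 17  # 1E+17
b = a - 1              # the exact answer needs 17 digits, so it must round
print(b == a)          # True: subtracting 1 changed nothing
print(a - 10 == a)     # False: 10 is exactly one decimal ULP at this scale
```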

You're probably being downvoted for posting like you're on some other site, more so than for your sentiment that this is just a simple CS 101 thing that people ought to know.

Thing is, a lot of people don't take CS courses, and have to learn this as they go along. More importantly, the naive cases all seem to work fine - it's only when you get to increasing precision / scales that you notice the cracks in the facade, and that's only if you have something that depends on the real accuracy (e.g. real world consequences from being wrong) or if someone bothers to go and check (using some other calculator that gives more precise results).

My own view on it is that it's past bloody time for languages to offer a fully abstracted class of real numbers with correct, arbitrary precision math - obviating the need for the developer to specify integer, float, long, etc. I don't mean that every language should act like this, but ones aimed at business software development, for example, would do well to provide a first-class primary number type that simply covers all of this properly.

Yes, I can understand that the performance will not be ideal in all cases, but the tradeoff in terms of accuracy, starting productivity, and avoiding common problems would probably be worth it for a pretty big subset of working developers.

What is "properly" though? There's many real numbers that don't have finite representation. Arbitrary precision is all well and good, but as long as you're expressing things as binary-mantissa-times-2^x, you aren't going to be able to precisely represent 0.3. You could respond by saying that languages should only have rationals, not reals, but then you lose the ability to apply transcendental functions to your numbers, or to use irrational numbers like pi or e.

Performance is only part of the problem, and what it prevents is more-precise floats (or unums or decimal floats or whatever). The other part of the problem is that we want computers with a finite amount of memory to represent numbers that are mathematically impossible to fit in that memory, so we have to work with approximations. IEEE-754 is a really fast approximator that does a good job of covering the reals with integers at magnitudes that people tend to use, so its longevity makes sense to me.

Exact real arithmetic is an open research problem (and slow, as well). Arbitrary precision has its own can of worms and is slow, too.

Not really. It's been used successfully for over 30 years in all Lisps. gmp is not really slow, and for limited precision (2k bits) there are even faster libs.

gmp is not exact. It's just arbitrary-precision. There's a very large difference. Exact arithmetic handles numbers like pi with infinite precision. When you use gmp, you pre-select a constant for pi with a precision known ahead of time. In the real world, 64 bits of pi is more than enough for almost every purpose, so whatever. It's fine. But there's a huge conceptual gap between that and exact arithmetic.

I never said that. For simple, non-symbolic languages gmp is still the best.

Lisp is of course better, optimizing expressions symbolically as far as possible, e.g. to rationals, and using bignums and bigints internally. As exact as possible. Perl 6 does it too, just 100x slower.
