
Floating point oddity - furcyd
https://www.johndcook.com/blog/2019/11/11/floating-point-oddity/
======
shakna
> Incidentally, bc -l gives the right result, no matter how small you set
> scale.

There's an easy explanation here: bc has its own arbitrary precision number
library embedded in it, so it doesn't make use of floats, doubles or even
quads at all.

The number library is surprisingly readable [1] for such a complex topic, and
doesn't use complex types at all, relying on ints and chars and pointers
inside the struct it uses to store the numbers.

It's been wrapped up for a couple of languages; I stumbled over it via a Lua
wrapper [0]. Useful when you want better precision than you currently have,
but don't want to drag in something like GMP and are okay with minor edge
cases sometimes appearing.
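
For flavor, the kind of representation involved looks roughly like this (an
illustrative sketch only, not the actual struct from number.h [1]; the real
field names differ):

    /* Illustrative only: decimal digits stored one per char, with separate
       sign, integer length, and scale fields, roughly as bc does it. */
    struct fake_bc_num {
        int   sign;     /* +1 or -1 */
        int   len;      /* digits before the decimal point */
        int   scale;    /* digits after the decimal point */
        char *digits;   /* len + scale decimal digits, one per char */
    };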

[0] [https://webserver2.tecgraf.puc-rio.br/~lhf/ftp/lua/#lbc](https://webserver2.tecgraf.puc-rio.br/~lhf/ftp/lua/#lbc)

[1] src/number.c & src/number.h from
[https://www.gnu.org/software/bc/](https://www.gnu.org/software/bc/)

------
namirez
This has nothing to do with C vs. C++. It's a fundamental limitation of
floating point arithmetic. Here is the Taylor series of exp:

exp(x) = 1 + x + higher-order terms.

Sometimes you can represent x as a floating-point number, but 1 + x will still
be 1, because the contribution of x falls below the last bit of the mantissa
of 1. Roughly speaking, this happens whenever x lies in the interval

(0, std::numeric_limits<T>::epsilon() / 2)
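
A two-line check makes this concrete (a minimal sketch; for double, epsilon()
is about 2.2e-16):

    #include <cstdio>
    #include <limits>

    int main() {
        double x = 1e-20;                                       // representable on its own
        std::printf("x == 0      : %d\n", x == 0.0);            // prints 0 (false)
        std::printf("1 + x == 1  : %d\n", 1.0 + x == 1.0);      // prints 1 (true)
        std::printf("epsilon     : %g\n", std::numeric_limits<double>::epsilon());
    }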

~~~
johndcook
I thought there was a difference between a C and C++ implementation, but it
was an error on my part. When I fixed the bug, I changed the post title. But
the post hit HN while I was in the process of fixing it.

~~~
mehrdadn
How does expm1 not work for you? It works for me.

------
mehrdadn
tl;dr: expm1 does work. Something funny is going on if it doesn't work for
you.

How this works out in practice is that once you try your hand at some serious
floating-point work, you _quickly_ learn that an expression like x == 0 (or even
like x <= 0) is a strong code smell. It's an _instant_ red flag that you need
to drop what you're doing and think things through extremely carefully first.
So you'd stop right there and try to get some intuition for what's going on.

Now the intuition you want here is that floating point values are spaced
logarithmically around 0 rather than around 1, so exp(x) for small x doesn't
have adequate resolution to return anything but 1: the next representable
value above 1 is already too far away. So when you subtract 1, you get zero.

But... that's precisely why expm1 exists: it's to evaluate exp(x) - 1
accurately. And in fact, it should work. And it does work for me. Why doesn't
it work here? It sounds like either a bad implementation or the wrong compiler
flags. We really need to know the compiler flags that are being used.
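
For reference, this is what I'd expect from any sane implementation, with no
special compiler flags (a minimal sketch):

    #include <cmath>
    #include <cstdio>

    int main() {
        double x = 1e-20;
        std::printf("exp(x) - 1 = %g\n", std::exp(x) - 1.0);    // 0: the 1 swallows x
        std::printf("expm1(x)   = %g\n", std::expm1(x));        // ~1e-20: accurate
        std::printf("expm1(x)/x = %g\n", std::expm1(x) / x);    // ~1: what the post's e(x) wants
    }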

For what it's worth: floating-point can be a lot worse than this, where
there's no simple red flag or library function like this that you can
recognize. In my experience, a rule of thumb (somewhat of a yellow flag) is
that, if you start doing _nonlinear_ operations (even division), you gotta
roll up your sleeves and get ready for some fun...

~~~
gumby
> x == 0 (or even like x <= 0) is a strong code smell.

Careful: 0.0f is defined by the standard to be 0x0. So such comparisons can be
quite reasonable.

However, testing for 0 as the result of a computation, as in this example, is
indeed problematic.

But there is plenty of reasonable code that initializes a float or double
variable to 0.0 and can expect to be able to inspect it and find its bit
pattern to be 0x0 before that variable is manipulated. After that, all bets
are off.

~~~
gjm11
The point isn't just that you can't tell exactly when something is zero,
although indeed you often can't.

It's that if you have some calculation that needs to be special-cased _at
zero_ (e.g., because you're dividing by something and it might explode) then
it probably actually needs to be special-cased _near zero_ (because near zero
it might overflow or run into other numerical problems), and it's probably
worth seeing if there's another way of organizing the calculation that doesn't
misbehave for small values in the first place.
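
In terms of the post's e(x), that reorganization might look like the following
sketch (the 1e-8 cutoff is my own choice, picked so the dropped series term
x^2/6 stays below double precision):

    #include <cmath>

    // Naive: special-cased only at exactly x == 0; for small nonzero x,
    // exp(x) - 1 cancels to 0 and the result is badly wrong.
    double e_naive(double x)  { return x == 0.0 ? 1.0 : (std::exp(x) - 1.0) / x; }

    // Reorganized: handle a neighbourhood of zero, use expm1 elsewhere.
    double e_robust(double x) {
        if (std::fabs(x) < 1e-8)
            return 1.0 + x / 2.0;     // two series terms suffice at this cutoff
        return std::expm1(x) / x;
    }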

------
Arnavion
>I tried replacing exp(x) - 1 with expm1(x). This made no difference.

Interestingly, in Rust, all four q(x) are non-zero, using f32::exp / f64::exp
in e(x) consistently gives 0, and using f32::exp_m1 / f64::exp_m1 consistently
gives 1. (Given non-zero q(x), this is to be expected.) Both f32::exp_m1 and
f64::exp_m1 are implemented using libm's expm1 [1], so I wonder what the
author's C++ compiler is doing instead.

[https://play.rust-lang.org/?version=stable&mode=release&edit...](https://play.rust-lang.org/?version=stable&mode=release&edition=2018&gist=9b93c17e9effed0f0fe88ca206c02d23)

[1]: [https://doc.rust-lang.org/src/std/f64.rs.html#724-726](https://doc.rust-lang.org/src/std/f64.rs.html#724-726) + [https://github.com/rust-lang/rust/blob/bc0e288ad02ef362b5a6c...](https://github.com/rust-lang/rust/blob/bc0e288ad02ef362b5a6c42aaf61f2901c9b46db/src/libstd/sys/unix/cmath.rs#L19-L20)

Edit: Actually, I can't reproduce the author's situation on godbolt. The expm1
method consistently returns 1.
[https://gcc.godbolt.org/z/RvsdHU](https://gcc.godbolt.org/z/RvsdHU)

------
veselin
There is an interesting tool (with an accompanying paper) that can
automatically rewrite expressions to reduce floating point error.

[http://herbie.uwplse.org/](http://herbie.uwplse.org/)

~~~
mehrdadn
This looks... too good to be true. And indeed, I just tested it out on the
quadratic formula, and the result doesn't seem to be terribly useful? At least
in light of the problems that actually come up when trying to use the
quadratic formula. (Which btw is _quite_ a fun exercise to figure out if this
doesn't ring a bell.)
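
For anyone who hasn't done that exercise: the textbook formula subtracts two
nearly equal numbers for one of the roots whenever b^2 >> 4ac, and the usual
fix is to compute the well-conditioned root first and recover the other from
the product of the roots. A sketch, assuming real roots and a != 0:

    #include <cmath>
    #include <utility>

    // Roots of a*x^2 + b*x + c = 0, assuming real roots and a != 0.
    std::pair<double, double> roots(double a, double b, double c) {
        double d = std::sqrt(b * b - 4.0 * a * c);
        // Pick the sign that adds magnitudes, avoiding cancellation:
        double q = -0.5 * (b + std::copysign(d, b));
        return { q / a, c / q };   // product of the roots is c/a, so the other root is c/q
    }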

------
xondono
I don't want to sound dismissive, but isn't this just a problem with numerical
instability? How is this different from trying to solve the following system?

    
    
      x^2 + (2 + y)*x + 1 = 2
      x^2 + (2 - y)*x + 1 = 3
    

And playing with the value of _y_.

I know that this example involves a sign change while the OP's doesn't, but
it's essentially the same kind of problem, isn't it?

If I have learnt anything from building PID loops on microcontrollers, it's
that floating point can be a PITA when you are near these kinds of situations.

~~~
martincmartin
_isn't this just a problem with numerical instability?_

Yes, but many developers aren't aware of numerical instability. Or maybe
they've heard the term, but don't know how it applies in practice.

People seem to go through stages when learning about floating point:

1\. "When you need fractions, use floating point." At this stage, they expect
"x / 3 * 3 == x" to always be true, and are surprised when things like that
fail.

2\. "Floating point is just approximate. So always use double, which is
accurate to 15 decimal places." So this explains why you don't get exact
answers, but of course doesn't explain the numerical instability in the OP's
post.

3\. "Cancellation can cause results to be arbitrarily far off! We need to
analyze the error propagation of every expression!" This is impractical, and
developers in stage 2 look at you like you're over engineering the system.

4\. "In practice, using doubles 'just works' in almost all situations." If
you're lucky, you can recognize when it might not work, and do error analysis
only in those cases. If not, you just don't sleep well at night, wondering
whether or where your code will break down. Probably some of both.
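
A quick way to see the stage-1 surprise for yourself (a minimal sketch; it
just counts failures rather than claiming which particular values misbehave):

    #include <cstdio>

    int main() {
        // Count how often the stage-1 expectation "x / 3 * 3 == x" holds.
        int failures = 0;
        for (int i = 1; i <= 1000; ++i) {
            double x = i * 0.001;
            if (x / 3.0 * 3.0 != x) ++failures;
        }
        std::printf("x / 3 * 3 != x for %d of 1000 values\n", failures);

        // The classic stage-1 surprise: this prints "no".
        std::printf("0.1 + 0.2 == 0.3 ? %s\n", 0.1 + 0.2 == 0.3 ? "yes" : "no");
    }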

~~~
coldcode
I learned about this back in the early '80s writing a transcendental runtime
library (to be used on the F-16 fighter, in JOVIAL). I was always afraid I'd
read a story about some pilot in a dogfight launching a missile, having it go
nuts, and getting shot down because the code had issues.

------
_nalply
I tried this with Julia and got the answer 1.

    
    
        e(x) = x == 0 ? 1 : (exp(x) - 1) / x
        q(x) = abs(x - sqrt(x^2 + 1)) - 1 / (x + sqrt(x^2 + 1))
        h(x) = e(q(x) ^ 2)
    

What's happening here?

Is Julia switching to arbitrary precision like bc -l, or rewriting exp(x) - 1
to expm1(x), or doing something else? Seems like a lot of magic to me.

~~~
StefanKarpinski
Not sure what’s going on but Julia is definitely not switching to arbitrary
precision without being explicitly asked to do so. That would be very
unjulian.

I’m not at a computer or I’d investigate. One hypothesis based on some other
comments is that getting the wrong answer (especially when using expm1) may be
due to unsafe C compiler flags, which Julia does not apply by default. Just a
guess though.

Update from reading more comments: it seems that Rust, and C via godbolt
without questionable flags, also don't fail like this. So I wonder if John
hadn't compiled the code with fast-math on, which allows the compiler to give
all kinds of wrong answers.

~~~
_nalply
I am using homebrew cask julia version 1.1.0 on macOS 10.13.6.

------
SloopJon
One thing I notice about the C program is that it uses the double-precision
exp(), fabs(), and sqrt(), rather than the "q"-suffixed variants from
quadmath.h, a feature that appears to be specific to GCC. However, changing
the program had no effect for me with GCC 9.2.

I can't get the C++ program to compile.

Edit: the post has been rewritten since I saw it.
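
For what it's worth, the __float128 version of q() would look roughly like
this (a sketch assuming GCC's quadmath extension; link with -lquadmath):

    // GCC-specific: __float128 plus the q-suffixed functions from quadmath.h.
    #include <quadmath.h>
    #include <cstdio>

    static __float128 q(__float128 x) {
        return fabsq(x - sqrtq(x * x + 1.0Q)) - 1.0Q / (x + sqrtq(x * x + 1.0Q));
    }

    int main() {
        char buf[128];
        quadmath_snprintf(buf, sizeof buf, "%.30Qg", q(1.0e8Q));
        std::printf("q(1e8) = %s\n", buf);
    }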

------
shultays
There are many posts about floating point limitations and this is the most
absurd one.

Yes, floating point represents irrational numbers only approximately, and each
operation can drift further from the exact values you would expect.

He could just as well have said "`float a = M_PI;` is not actually equal to
pi!"

------
pavanky
The post no longer exists?

~~~
jandrese
[https://web.archive.org/web/20191112011511/https://www.johnd...](https://web.archive.org/web/20191112011511/https://www.johndcook.com/blog/2019/11/11/floating-point-oddity/)

~~~
colanderman
The original post was missing return types for the C++ functions, so they
defaulted to int. Adding the correct return types (T) produced the same output
as the C version (0), eliminating the discrepancy and making the title
nonsensical; hence the edit.

-Wall -Wextra -Werror is your friend.
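
Concretely, the fix amounts to this kind of change (a reconstruction of the
bug described above, not the post's verbatim code; the body mirrors the Julia
definition quoted elsewhere in this thread):

    #include <cmath>

    // was:   template <typename T> e(T x) { ... }   // no return type
    // fixed: explicit return type T, so the result is not forced through int.
    template <typename T> T e(T x) {
        return x == 0 ? T(1) : (std::exp(x) - T(1)) / x;
    }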

------
saagarjha
Since floating point operations are not associative, I wonder if it is legal
for compilers to rearrange them. Maybe that’s what’s happening here?

~~~
thethirdone
As far as I know, floating point add and multiply are commutative (but not
associative) in IEEE 754. And I believe C and C++ allow you to detect if IEEE
754 is used by the hardware and if so, execute the floating point expressions
as written.

Note: That is of course no longer true under -ffast-math.
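
The detection part is spelled std::numeric_limits<T>::is_iec559 in C++ (C has
the __STDC_IEC_559__ macro), e.g.:

    #include <limits>

    // Compile-time check that float and double are IEEE 754 (IEC 559) types.
    static_assert(std::numeric_limits<float>::is_iec559,  "float is not IEEE 754");
    static_assert(std::numeric_limits<double>::is_iec559, "double is not IEEE 754");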

~~~
cperciva
_As far as I know, floating point add and multiply are commutative (but not
associative) in IEEE 754._

Usually, but not always. (+0) + (-0) yields (+0), while (-0) + (+0) yields
(-0).

~~~
thethirdone
In IEEE 754-1985 and IEEE 754-2008 this is not valid behavior. I have mostly
worked on the 2008 revision, so I didn't know if 0 was commutative in the
original standard. Thanks for making me look it up.

IEEE 754-2008

> Section 6.3 The sign bit

> When the sum of two operands with opposite signs (or the difference of two
> operands with like signs) is exactly zero, the sign of that sum (or
> difference) shall be +0 in all rounding-direction attributes except
> roundTowardNegative; under that attribute, the sign of an exact zero sum (or
> difference) shall be −0. However, x + x = x − (−x) retains the same sign as
> x even when x is zero.

No quote for IEEE 754-1985 because I didn't want to type it out and it is
mostly the same.
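
A quick way to check this on a given machine (a sketch; per the quoted clause,
both sums should come out as +0.0, i.e. sign bit 0, under the default
round-to-nearest mode):

    #include <cmath>
    #include <cstdio>

    int main() {
        volatile double p = +0.0, n = -0.0;   // volatile so the sums aren't folded away
        std::printf("signbit(+0 + -0) = %d\n", (int)std::signbit(p + n));
        std::printf("signbit(-0 + +0) = %d\n", (int)std::signbit(n + p));
    }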

~~~
cperciva
Huh, interesting. I've definitely seen systems where the sum of two zeroes
took the sign of the first value. I assumed they were inheriting 754 behaviour
but I guess not.

------
vesinisa
This post does not seem to compare C++ and C, but rather C++ and the bc
arbitrary precision calculator?

~~~
colanderman
He edited it after fixing an error in the C++ version. (See my comment
elsewhere in this thread.)

~~~
johndcook
Right. Sorry for the confusion. The post hit HN while I was editing.

------
rurban
@dang please rename title to "Floating point oddity" as in the linked post.
It's not about C vs C++.

------
kccqzy

        template <typename T> T q(T x) {
          return fabs(x - sqrt(x*x + 1.)) - 1./(x + sqrt(x*x + 1.));
        }
    

The author doesn't seem to have a lot of numeric experience. He probably
doesn't know that, even when T is float, the compiler uses the double version
of sqrt.

Don't believe me? Read the assembler:

    
    
        0000000000000c3c <float q<float>(float)>:
         c3c:   55                      push   rbp
         c3d:   48 89 e5                mov    rbp,rsp
         c40:   48 83 ec 20             sub    rsp,0x20
         c44:   f3 0f 11 45 fc          movss  DWORD PTR [rbp-0x4],xmm0
         c49:   f3 0f 5a 5d fc          cvtss2sd xmm3,DWORD PTR [rbp-0x4]
         c4e:   f2 0f 11 5d f0          movsd  QWORD PTR [rbp-0x10],xmm3
         c53:   f3 0f 10 45 fc          movss  xmm0,DWORD PTR [rbp-0x4]
         c58:   f3 0f 59 45 fc          mulss  xmm0,DWORD PTR [rbp-0x4]
         c5d:   f3 0f 5a c0             cvtss2sd xmm0,xmm0
         c61:   f2 0f 10 0d 4f 02 00    movsd  xmm1,QWORD PTR [rip+0x24f]        # eb8 <std::piecewise_construct+0x8>
         c68:   00
         c69:   f2 0f 58 c1             addsd  xmm0,xmm1
         c6d:   e8 5e fc ff ff          call   8d0 <sqrt@plt>
         c72:   f2 0f 10 5d f0          movsd  xmm3,QWORD PTR [rbp-0x10]
         c77:   f2 0f 5c d8             subsd  xmm3,xmm0
         c7b:   66 0f 28 c3             movapd xmm0,xmm3
         c7f:   f3 0f 7e 0d 39 02 00    movq   xmm1,QWORD PTR [rip+0x239]        # ec0 <std::piecewise_construct+0x10>
         c86:   00
         c87:   66 0f 54 c1             andpd  xmm0,xmm1
         c8b:   f2 0f 11 45 f0          movsd  QWORD PTR [rbp-0x10],xmm0
         c90:   f3 0f 5a 65 fc          cvtss2sd xmm4,DWORD PTR [rbp-0x4]
         c95:   f2 0f 11 65 e8          movsd  QWORD PTR [rbp-0x18],xmm4
         c9a:   f3 0f 10 45 fc          movss  xmm0,DWORD PTR [rbp-0x4]
         c9f:   f3 0f 59 45 fc          mulss  xmm0,DWORD PTR [rbp-0x4]
         ca4:   f3 0f 5a c0             cvtss2sd xmm0,xmm0
         ca8:   f2 0f 10 0d 08 02 00    movsd  xmm1,QWORD PTR [rip+0x208]        # eb8 <std::piecewise_construct+0x8>
         caf:   00
         cb0:   f2 0f 58 c1             addsd  xmm0,xmm1
         cb4:   e8 17 fc ff ff          call   8d0 <sqrt@plt>
         cb9:   f2 0f 58 45 e8          addsd  xmm0,QWORD PTR [rbp-0x18]
         cbe:   f2 0f 10 0d f2 01 00    movsd  xmm1,QWORD PTR [rip+0x1f2]        # eb8 <std::piecewise_construct+0x8>
         cc5:   00
         cc6:   f2 0f 5e c8             divsd  xmm1,xmm0
         cca:   66 0f 28 c1             movapd xmm0,xmm1
         cce:   f2 0f 10 55 f0          movsd  xmm2,QWORD PTR [rbp-0x10]
         cd3:   f2 0f 5c d0             subsd  xmm2,xmm0
         cd7:   66 0f 28 c2             movapd xmm0,xmm2
         cdb:   f2 0f 5a c0             cvtsd2ss xmm0,xmm0
         cdf:   c9                      leave
         ce0:   c3                      ret
    

And all you need to know is that the call instruction is calling sqrt@plt
instead of sqrtf@plt.

And if you don't believe the compiler either, I'll quote this sentence from
the C++11 standard, section 2.14.4 [lex.fcon]:

> The type of a floating literal is double unless explicitly specified by a
> suffix. The suffixes f and F specify float, the suffixes l and L specify
> long double.

So if you do `sqrt(x*x + 1.)` the right operand of plus is double, therefore
the argument to sqrt is double, and you get the double version.
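
If one actually wanted the float instantiation to stay in float, the change is
to use the std:: overloads and literals of type T (a sketch of that change,
not a claim about what the author should have done):

    #include <cmath>

    // std::sqrt and std::fabs have float overloads, and T(1) keeps the
    // arithmetic in T instead of promoting everything to double.
    template <typename T> T q(T x) {
        return std::fabs(x - std::sqrt(x * x + T(1)))
             - T(1) / (x + std::sqrt(x * x + T(1)));
    }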

~~~
acqq
Just using templates with floating point numerical computation is a sure red
flag. One has to adjust the code to the limits imposed by each internal
representation.

Floating point code is not something you write by just copying some math
formula from somewhere; it requires figuring out much more than that.

In this case the article perhaps attempts to demonstrate that fact, but the
discussion of what's actually happening is surely lacking.

~~~
mkl
> Just using templates with floating point numerical computation is a sure red
> flag.

What do you think of Eigen, then? It seems to do pretty well, and is very
template heavy.

~~~
acqq
I’m not claiming it’s impossible in general, just that it’s much more probable
that the limits have to be defined explicitly for different sizes: template
specializations, etc.

------
DoctorOetker
Somewhat related question: how do I most easily verify / disprove that

(a + b) - b == a exactly

for a, b floating point numbers?

(i.e. I don't want the compiler to optimize the LHS away to a...)

~~~
Doxin
You could prevent the compiler from optimising it by either running with
optimisations turned off or by declaring a and b as volatile. Declaring a
variable as volatile tells the compiler that something external to the program
(and indeed even external to the processor or memory!) might change the value
of that variable at any time. In this case that's not _actually_ true but it
should stop the compiler from optimising the variable access in any case.

The easiest way to disprove your proposed invariant is to set a to 0 and b to
NaN: the output is NaN, not 0.
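
A sketch of the volatile approach, with a finite counterexample thrown in
(a = 0.1 gets absorbed entirely by b = 1e16, since the spacing of doubles near
1e16 is 2):

    #include <cstdio>

    int main() {
        // volatile keeps the compiler from folding (a + b) - b into a.
        volatile double a = 0.1, b = 1e16;
        volatile double lhs = (a + b) - b;
        std::printf("a = %.17g, (a + b) - b = %.17g, equal: %d\n",
                    (double)a, (double)lhs, lhs == a);
    }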

------
geoalchimista
I think it is well known that floating-point numbers have their limitations.
How would the IEEE 754 floating-point system know when to use L'Hôpital's
rule? If you know there is a singularity (0 / 0 or inf / inf) then you don't
have to use floating-point; use calculus instead.

Edit: What I want to say is that this is not a problem with floating-point
numbers (nor with C or C++). It works _as intended_. Trying to increase the
precision for evaluating (exp(x) - 1) / x at x = 0 is like beating a dead
horse. Instead, a robust approach would be to expand (exp(x) - 1) / x around
x = 0 as a Taylor series for very small x:

1 + x/2 + x^2/6 + x^3/24 + x^4/120 + O(x^5)
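
A direct transcription of that expansion in Horner form (a sketch; on its own
it is only valid for small |x|, say |x| <= 1e-3, where the first omitted term
x^5/720 is below double-precision epsilon):

    // (exp(x) - 1) / x via the series above, for small |x| only.
    double e_series(double x) {
        return 1.0 + x * (1.0/2 + x * (1.0/6 + x * (1.0/24 + x * (1.0/120))));
    }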

