
The quadratic formula and low-precision arithmetic - ingve
https://www.johndcook.com/blog/2018/04/28/quadratic-formula/
======
ColinWright
We have, in the past, had code that computes the roots of quadratic equations
in what appears to be a needlessly complex way. It's deliberately uncommented,
and deliberately has an innocuous name.

When a newer employee comes across the code they often "simplify" it using the
quadratic formula they know so well, and which they believe always works.

The tests go red.

They roll back and look to try to understand. Eventually they either work it
out, or they come and ask for advice. We work through the test case with them,
and point out just how important it is that:

* There are tests;

* Routines have useful and enlightening names;

* If the code is complex, there's probably a reason, although not always;

* Sometimes a comment might be useful.

It's proven to be a really useful learning experience, and it gets people up to
speed very quickly. It's then left, again, as a learning experience waiting to
happen.

Of course, this can only work in a culture where people work constructively to
understand what's going on, and support each other - a place where it's OK to
say that you don't understand something, and know that others will work with
you. I've been in places where this wouldn't work.

All in all, I've been lucky.

The tests, by the way, have only a moderate-sized _b,_ but a very small _a._
In this way _b^2-4ac_ is very close to _b^2,_ so its square root is very close
to _b,_ and one of the solution paths subtracts nearly equal quantities,
getting small rubbish. But then you go on to divide that by the very small
number _2a,_ turning small rubbish into big rubbish.

The geometric interpretation is that the parabola is very, very flat, so if
you move it even the tiniest amount up or down the roots move very far, very
fast.
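
Here's a toy version of the failure in JavaScript doubles (illustrative
numbers, not our actual tests):

    // Moderate b, tiny a: sqrt(b^2 - 4ac) is almost exactly b.
    const a = 1e-16, b = 1, c = 1;
    const d = Math.sqrt(b * b - 4 * a * c);

    // Naive formula: -b + d cancels catastrophically, and dividing by
    // the tiny 2a keeps the relative error huge.
    console.log((-b + d) / (2 * a));   // -1.1102230246251565, ~11% off

    // Stable route: take the non-cancelling root first, then use the
    // fact that the product of the roots is c/a.
    const r = (-b - d) / (2 * a);      // large root, no cancellation
    console.log(c / (a * r));          // ~ -1.0000000000000002, correct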

It's a useful example - my thanks to johndcook for the write-up.

~~~
murkle
Also from Colin :) [http://www.solipsys.co.uk/cgi-
bin/sews.py?NumericallyStableS...](http://www.solipsys.co.uk/cgi-
bin/sews.py?NumericallyStableSolutionForTheQuadraticEquation)

------
jostylr
From the book "Real Computing Made Real" by Acton, the suggestion is to first
view the quadratic (not always!) as x^2 + 2bx + c, so its roots are -b +/-
sqrt(b^2 - c). The product of the roots is c, and therefore we can compute the
root that does not do the subtraction and then divide c by that root to get
the other root.

This avoids any of the bad subtraction.

Some sample JS code that corresponds to the quadratic in the post:

    
    
    // The quadratic from the post, viewed as x^2 + 2bx + c = 0:
    // its roots are -b +/- sqrt(b^2 - c), and their product is c.
    const b = 5e7;
    const c = 1;

    const d = b * b - c;

    if (d >= 0) {
        const m = Math.sqrt(d);
        let r, s;
        if (b > 0) {
            r = -b - m;    // same signs add: no cancellation
            s = c / r;     // other root from the product of the roots
        } else if (b < 0) {
            s = -b + m;    // same signs add: no cancellation
            r = c / s;
        } else {
            r = m;         // b == 0: the roots are symmetric
            s = -r;
        }
        console.log(r, s);
    } else {
        console.log("no real roots");
    }
    

This gives the correct roots for this problem in one go.

~~~
murkle
Numerical Recipes says to find q = -(1/2)[b + sgn(b) sqrt(b^2 - 4ac)] and then
find x1 = q/a and x2 = c/q - is that better?

[http://www.solipsys.co.uk/cgi-
bin/sews.py?NumericallyStableS...](http://www.solipsys.co.uk/cgi-
bin/sews.py?NumericallyStableSolutionForTheQuadraticEquation)
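
For concreteness, a minimal sketch of that recipe in JS (my paraphrase; it
assumes a != 0 and treats sgn(0) as +1):

    // Numerical Recipes' stable form: q absorbs the branch that would
    // otherwise cancel, so neither root subtracts nearly equal numbers.
    function quadRoots(a, b, c) {
        const disc = b * b - 4 * a * c;
        if (disc < 0) return null;                   // no real roots
        const sgn = b >= 0 ? 1 : -1;
        const q = -0.5 * (b + sgn * Math.sqrt(disc));
        return [q / a, c / q];                       // x1 * x2 = c / a
    }

    console.log(quadRoots(1, -1e8, 1));   // ~ [1e8, 1e-8], both accurate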

------
venning
> _The quadratic equation violated the cardinal rule of numerical analysis:
> avoid subtracting nearly equal numbers. The more similar two numbers are,
> the more precision you can lose from subtracting them._

Are there other "rules" like this, regarding numerical analysis? I understand
this, but I've never thought of articulating it like that.

~~~
nabla9
If you add together many numbers, do it in ascending order, numbers with the
smallest absolute values first.

Comparing floats is tricky: use an absolute epsilon, an epsilon on the
relative error, or ULPs (all are tricky).
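
A quick illustration of the ordering point:

    // Each 1 is only half an ulp of 1e16, so added one at a time to the
    // big value they all round away; summed first, they survive.
    const big = 1e16;
    const ones = new Array(10000).fill(1);

    let naive = big;
    for (const x of ones) naive += x;    // every add rounds back to 1e16

    const smallFirst = ones.reduce((s, x) => s + x, 0) + big;

    console.log(naive - big);        // 0
    console.log(smallFirst - big);   // 10000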

~~~
jjgreen
That costs a sort; better to use the Kahan summation formula:
[https://en.wikipedia.org/wiki/Kahan_summation_algorithm](https://en.wikipedia.org/wiki/Kahan_summation_algorithm)
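
For reference, a minimal sketch of the compensated sum:

    // Kahan summation: a running correction term recovers the low-order
    // bits that each naive addition rounds away, with no sort needed.
    function kahanSum(xs) {
        let sum = 0, comp = 0;       // comp holds the lost low bits
        for (const x of xs) {
            const y = x - comp;      // apply the previous correction
            const t = sum + y;       // low bits of y may be lost here...
            comp = (t - sum) - y;    // ...but are recovered here
            sum = t;
        }
        return sum;
    }

    console.log(kahanSum([1e16, ...Array(10000).fill(1)]) - 1e16);  // 10000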

~~~
Retric
Look at the accuracy comment; it's faster but less accurate than sorting
first.

------
darkmighty
> So we might be interested in low-precision arithmetic to save energy in a
> battery powered mobile device, or to save clock cycles in a server
> application manipulating a lot of data. This means that the numerical tricks
> that most people have forgotten about are relevant again

The more I experience engineering advances the more I see how cyclical
techniques and technologies are. Almost everything old that was useful will
eventually be needed again. Early computers were severely memory and cpu
restricted. Then we got giant computers. Then we're putting microcontrollers
everywhere and need those frugal techniques again. And so on the cycle will
continue.

A possible conclusion is that the danger of obsolescence from really profound
knowledge in a field is far lower than one would expect, especially if you can
weather the cycles of technology and find new markets to apply your expertise.

~~~
crankylinuxuser
Or do what I do and work with integers and do the heavy stuff on a big
computer with lots of CPU and memory.

Integer math is fast, easy, and not error-prone, unless you cross the limits
in limits.h. But those are simple bounds checks, or you size the variables
large enough that you never need to worry about them. Primarily, if you use
floats, you grab the data and send it on its way. Don't process it if you
don't have to. Drop it onto MQTT/CoAP/AMQP.

I'm thinking of it the same way Klipper offloads 3D printer math to a heavy
computer with GHz and GBs of RAM for trig math and array compensation.

~~~
tripletao
I'm pretty sure integer (i.e., fixed point) math suffers from the same problem
here. The only difference is that (a) for a given size, you get a little more
precision, since you're not spending bits on the exponent, and (b) you have to
hand-code the shift where you lose precision, instead of letting the FPU
automatically make the shift where you lose precision.

Sometimes you can get clever with the modular arithmetic--like, it's okay to
add (or subtract) a list of fixed-point numbers where partial sums overflow
and wrap, as long as the final result doesn't. I don't know any useful analogy
to that here, though.
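
A toy illustration of that wrap-around property, using JavaScript's 32-bit
(x | 0) arithmetic:

    // Intermediate sums overflow and wrap mod 2^32, but because the
    // final total fits in 32 bits it still comes out exactly right.
    let s = 0;
    s = (s + 2000000000) | 0;   //  2000000000
    s = (s + 2000000000) | 0;   // -294967296 (wrapped)
    s = (s - 3000000000) | 0;   // wraps back into range
    console.log(s);             // 1000000000, exact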

~~~
crankylinuxuser
When dealing with floating point, every conversion and computation leads to
'wrong answers' that are only in close proximity to the correct answer.

This paper discusses in great detail the sources of error and how the IEEE
standard does things.
[https://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.h...](https://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html)

Using integers instead bypasses this source of error. The only caveat I pay
attention to is that you have to be able to expand the size of the space so
that fractional values are never needed (think of multiplying by 100 when
dealing with cash). And if you do deal with floats, do the math you have to,
then drop back into ints, or save the values and send them off.
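
The cash example in miniature:

    // Binary floats can't represent most decimal fractions exactly,
    // but integer cents are exact.
    console.log(0.1 + 0.2 === 0.3);   // false
    console.log(10 + 20 === 30);      // true: same amounts, in cents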

~~~
tripletao
Your comment is deeply misleading. Floating point is mostly just fixed point
plus some rules for automatically moving the decimal point. The point is that
depending on how you express the formula, you need to "expand the size of the
space to prevent decimals from being used" (what do you even mean by that, if
your answer has no exact binary representation? how would you express 1/7, or
10000/7, or 65536/7?) by spectacularly more or less.

If you think I'm wrong: Can you code the quadratic formula using integer or
other fixed-point math, and demonstrate why it doesn't need the attention to
bad numerical conditioning noted in the parent?

~~~
crankylinuxuser
> Your comment is deeply misleading. Floating point is mostly just fixed point
> plus some rules for automatically moving the decimal point.

Uhhhhh..... I hope this is a farcical comment. Because if you believe this,
then you might want to read up here: [https://en.wikipedia.org/wiki/Floating-
point_arithmetic](https://en.wikipedia.org/wiki/Floating-point_arithmetic)

~~~
tripletao
Perhaps I'm misunderstanding you somewhere; but this is basically what I do
for a living, so I don't think I'm wrong. I have carefully re-read my comments
above, and would urge you to do the same.

Again: Can you write the quadratic formula in the way that you think is better
(in pseudocode, or a language of your choice), and give examples of inputs
where it performs better than floating point? If you don't do this and I'm
wrong, then you're missing the opportunity to teach me (and all the people I'm
misleading with my comments) something. If I'm right, then you're missing the
opportunity to learn something yourself.

------
drwells
Rounding issues in the quadratic formula have bitten me several times: in
particular, the inverse of a 2D bilinear mapping requires solving a quadratic
equation which (if things are skewed) can be very badly conditioned. On the
other hand, if one uses a bilinear map on a parallelogram then the quadratic
term is gone and one root is zero: also annoying.

In practice I just check the coefficients and then use appropriate versions of
the quadratic formula. I guess that I am luckier than Mr. Cook here since I
only need one root.

code (some of it was written by me, some improvements are by other people):
[https://github.com/dealii/dealii/blob/master/source/fe/mappi...](https://github.com/dealii/dealii/blob/master/source/fe/mapping_q_generic.cc#L93)

This set of cases (which can probably be further improved) is equivalent to
his first answer. I don't think it's possible to do better in this test case,
due to the catastrophic cancellation in subtracting the square root of the
discriminant from b.

~~~
obl
The inverse bilinear map is a good example where, even if there is an
analytical solution, it's often more robust to just do a couple of Newton
iterations until you get down to machine precision.

An added benefit is that it carries over to 3D (although with a larger
Jacobian, of course).

~~~
drwells
You are absolutely right re: robustness (and that code falls back to the
iterative method), but IIRC the Newton scheme was at least one order of
magnitude slower than the formula. It was also slightly less accurate simply
because it did more floating point operations.

I have not yet figured out how to make the 3D formula work so we always do the
iterative method in 3D.

A nice trick: we approximate the (bi/tri)linear map as affine (usually quite
accurate) to get a good initial guess.

------
barbegal
I always try to relate these floating point issues back to real situations. Is
there really a case where you would solve a quadratic such as this, and what
kind of precision would you really need? In the first example, the root is
wrong by 25% in relative terms but by less than 2.5e-9 in absolute terms.

What is important is to tailor your maths to the real problem you are solving
(and remember garbage in = garbage out). Very rarely do you actually need 53
bits of precision (as you get in a double), or even need to store intermediate
results with 53 bits of precision.

~~~
math_and_stuff
This exact issue turns out to be critical in the Hessenberg QR algorithm,
which is at the heart of MATLAB's eig.

~~~
barbegal
Yeah, I understand why a product like MATLAB, which needs accuracy across a
wide range of end use cases, would need to consider things like this.

------
compumike
This is great and reminds me of an issue we ran into developing the numerical
engine in circuit simulation software; user-specified values spanning many
orders of magnitude mean you're likely to run into issues like this. In our
case the answer was to go to 128-bit floating point operations inside our
sparse matrix solvers, which we published here:
[https://www.edn.com/design/analog/4418707/1/Extended-
precisi...](https://www.edn.com/design/analog/4418707/1/Extended-precision-
simulation-cures-SPICE-convergence-problems)

~~~
CamperBob2
_We measured less than 3% slowdown within a typical well-behaved circuit
simulation._

That's pretty amazing. CPU pipelining at work, or was the profile just
dominated by other things?

------
magicalhippo
A raytracer I worked on had an interesting bug, where a manual optimization
caused some unforeseen floating-point issues down the line.

The short story was that we had an expression of the form x = a/b * c/d * e/f
in a fairly hot code path. As most developers know, divisions are expensive,
so this got rewritten to x = (a * c * e) / (b * d * f).

The problem was that in certain edge cases, the factors were all extremely
small. This caused the product b * d to become a denormal number, and when
multiplied with f it became zero. This of course caused the division to return
infinity.

However, closer inspection of the source of the terms a to f revealed that the
pairs a,b etc always had roughly the same magnitude. Thus even though a and b
were almost denormal numbers, the division a / b was well-behaved and gave a
sensible result, and similar for the others.

Undoing the optimization and using the original form made the code well-
behaved again.

Another expression that had to be de-optimized for the same reason was x = a /
(b * b), which had to be rewritten as x = (a / b) / b.

So as with subtraction, division of IEEE floating-point numbers can be tricky.
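
A toy reproduction (made-up magnitudes, and here the numerator underflows too,
so the "optimized" form gives NaN rather than the Infinity we saw):

    // All six values are tiny, but every pairwise ratio is ordinary.
    const a = 2e-162, b = 1e-162,
          c = 2e-162, d = 1e-162,
          e = 2e-162, f = 1e-162;

    // "Optimized" form: the triple products underflow to zero.
    console.log((a * c * e) / (b * d * f));    // NaN (0 / 0)

    // Original pairing: every division is well-behaved.
    console.log((a / b) * (c / d) * (e / f));  // 8

    // The second expression shows the Infinity case directly:
    console.log(a / (b * b));                  // Infinity (b*b underflows)
    console.log((a / b) / b);                  // 2e162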

This[1] page contains a nice set of tables with the results of performing IEEE
floating-point operations on edge-case inputs.

[1]:
[https://www.cs.uaf.edu/2011/fall/cs301/lecture/11_09_weird_f...](https://www.cs.uaf.edu/2011/fall/cs301/lecture/11_09_weird_floats.html)

------
abecedarius
I wonder how much the sqrt function is simplifying our lives here — might it
be simpler to deal with these issues in terms of Newton’s method directly,
just ignoring the quadratic formula?
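
For instance, a hypothetical sketch (not from the article): polish a cheap
starting guess with Newton steps on f(x) = ax^2 + bx + c.

    // Newton's method: x <- x - f(x)/f'(x). Near a simple root it
    // converges quadratically, with no cancellation-prone formula.
    function newtonRoot(a, b, c, x0) {
        let x = x0;
        for (let i = 0; i < 20; i++) {
            const f = (a * x + b) * x + c;   // Horner evaluation
            const fp = 2 * a * x + b;
            const step = f / fp;
            x -= step;
            if (Math.abs(step) <= Math.abs(x) * Number.EPSILON) break;
        }
        return x;
    }

    // Small root of x^2 + 1e8 x + 1, starting from the linear guess -c/b:
    console.log(newtonRoot(1, 1e8, 1, -1 / 1e8));    // ~ -1e-8, accurate
    console.log((-1e8 + Math.sqrt(1e16 - 4)) / 2);   // naive: ~ -7.45e-9, 25% off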

------
jimhefferon
Unums? I bet that many of us here would be very interested in the thoughts of
folks who know about them. Is it a revolution, or is it nonsense, an
engineering mistake?

~~~
johndcook
Posits are a hardware-friendly implementation of unums, a sort of halfway
house on the road to Gustafson's full vision of unums. Posits have a fixed
size and are meant to be a drop-in replacement for IEEE floats. They can be,
and are being, implemented in hardware.

Unums are more ambitious. They would have variable lengths, so implementing
them would require more changes than just changing the way a fixed-length set
of bits is interpreted.

~~~
dnautics
As part of the spec, a true posit implementation includes an exact accumulator
unit. This has been described for decades, is pipelineable, and allows for,
say, 10e20 + 1 - 10e20 == 1. We've shown that this gives a speedup and about a
10^5 improvement in accuracy for LINPACK 1000.

------
dvfjsdhgfv
Whenever I use floating point for anything serious, I always feel tempted to
use modules like decimal or bigfloat from the start, just to eliminate
potential problems that can (and eventually will) crop up.

~~~
maxlybbert
I worked at a small company that had a mathematician on staff. He wanted to
retire, and hired a replacement.

The new hire made some comments that I thought sounded irresponsible, such as:
almost nobody does numerical analysis; instead they use double precision
because, usually, even with catastrophic cancellation there are enough bits
left to give a useful answer. But, apparently, “use a higher precision” is a
common answer in scientific programming, _and_ it’s much easier to get right
than numerical analysis (see
[https://www.davidhbailey.com/dhbtalks/dhb-expm-sun.pdf](https://www.davidhbailey.com/dhbtalks/dhb-expm-sun.pdf),
first and last slides).

If you can afford it, that probably is the best approach. Cook’s article
specifically mentions that the issue crops up with single precision floating
point (which people might want for SIMD or GPU reasons, or on mobile to save
power). In other words, Cook is talking about cases where you can’t afford
extended precision.

------
user51442
Interesting historical piece from George Forsythe:
[http://i.stanford.edu/pub/cstr/reports/cs/tr/66/40/CS-
TR-66-...](http://i.stanford.edu/pub/cstr/reports/cs/tr/66/40/CS-TR-66-40.pdf)

------
rocqua
> CPUs have gotten much faster, but moving bits around in memory has not.

Would it then not suffice to just load 32 bit floats, convert them to 64 bits
on the processor, and then store them again as 32 bits? Essentially, we are
'compressing' our data in memory, whilst doing the computational steps with
full precision.
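
In JS, typed arrays already give you exactly that pattern (a sketch of the
idea, not a benchmark):

    // Keep the working set as 32-bit floats in memory ("compressed"),
    // but do the arithmetic in 64-bit doubles, rounding only on store.
    const data = new Float32Array(1000000);
    data.fill(1.1);                    // each element rounded to 32 bits

    let sum = 0;                       // JS numbers are 64-bit doubles
    for (let i = 0; i < data.length; i++) {
        sum += data[i];                // widened to double on load
    }
    data[0] = sum;                     // rounded back to 32 bits on store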

~~~
static_noise
There is the 80 bit "long double" type that contains excess precision over the
64 bit double type. But if you want to write code that's both, portable, and
reliable you have to turn off a lot of optimizations and use 32 or 64 bit
types with standardised behavior.

------
nearmuse
Am I missing something, or is this just an example from "What Every Computer
Scientist Should Know About Floating-Point Arithmetic"? There are some other
nice ones in there as well. And when did any of this cease being relevant?

------
gerdesj
_Well, there’s an interesting wrinkle. When the linear coefficient b is large
relative to the other coefficients, the quadratic formula can give wrong
results when implemented in floating point arithmetic._

No, there is no wrinkle whatsoever: your computer, guided by your assumptions
and given inadequate input (let alone processing logic), got it wrong.

The day you start to blame the inputs for errors in output over which you have
complete control is the day you should mark as a good point to really consider
whether you are in the right game.

~~~
shadowfacts
He’s not blaming the inputs. He’s simply saying that some inputs can produce
incorrect outputs. He’s not saying the incorrect outputs are the _fault_ of
the inputs. He goes on to explain where the blame actually lies, in floating
point precision.

