
Approximating sin(x) to 5 ULP with Chebyshev polynomials - wallacoloo
http://mooooo.ooo/chebyshev-sine-approximation/
======
gp7
This page crashes my chrome tab!!

Anyway, anyone interested enough in this to be reading this comment thread
should go look up _Approximation Theory and Approximation Practice_ by
Trefethen [0], which is a very, very good book that covers Chebyshev
polynomials in a very clear way. The real party trick you get out of it,
though, is that by sampling a function at the roots of the Chebyshev basis
polynomials and applying the discrete cosine transform, you can get a set of
Chebyshev coefficients approximating that function in O(n log n) time.

[0]
[https://people.maths.ox.ac.uk/trefethen/ATAP/](https://people.maths.ox.ac.uk/trefethen/ATAP/)
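
For example, here's roughly what that trick looks like in practice (a quick
Python/NumPy sketch written from memory, so treat it as illustrative rather
than a reference implementation):

    import numpy as np
    from scipy.fft import dct

    def cheb_coeffs(f, n):
        # Coefficients c_j such that f(x) ~ sum_j c_j * T_j(x) on [-1, 1].
        k = np.arange(n)
        theta = np.pi * (k + 0.5) / n     # angles of the n roots of T_n
        samples = f(np.cos(theta))        # sample f at the Chebyshev roots
        c = dct(samples, type=2) / n      # DCT-II gives 2 * sum_k f_k * cos(j*theta_k)
        c[0] /= 2                         # the j = 0 coefficient is conventionally halved
        return c

    # Example: sin(x) on [-1, 1] -- even coefficients come out ~0, odd ones decay fast
    print(cheb_coeffs(np.sin, 8))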

~~~
m_myers
> This page crashes my chrome tab!!

The problem appears to be the SVG image [http://mooooo.ooo/chebyshev-sine-
approximation/horner_ulp.sv...](http://mooooo.ooo/chebyshev-sine-
approximation/horner_ulp.svg), which has a path of 5000+ points.

~~~
wallacoloo
Ah, my bad. I thought browser support for SVGs was pretty robust at this
point. I've gone ahead and rasterized all the images (save for the inline
equations), but clicking on the images will still take you to the original
SVGs.

------
qooiii2
You can do a little better with a minimax polynomial that you can derive with
the Remez algorithm. If you have Matlab, you can use the excellent Chebfun
package's remez command to do this almost instantly for any well-behaved
function.

The best approximation I could find calculated sin(x) and cos(x) for
[-pi/4,pi/4] and then combined these to cover all other inputs. These
functions need only 4 or 5 terms to get +/-1 ULP accuracy for 32-bit floats.

I thought the harder problem was the argument reduction to enable calculating
something like sin(10^9). It turns out that you need a lot of bits of pi to do
this accurately. While testing my implementation, I was surprised to learn
that the x87 fsin implementation only has 66 bits of pi, so it gives extremely
wrong answers for large inputs.
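
To make the "lots of bits of pi" point concrete, here's a tiny Python
illustration (just a sketch, not my f32 implementation): reduce x = 10^9 with
the double-precision value of 2*pi and compare against the library sine, which
reduces the argument with many more bits of pi internally.

    import math

    x = 1e9
    r = math.fmod(x, 2 * math.pi)   # naive reduction: only ~53 bits of pi available

    print(math.sin(x))   # the C library reduces the argument carefully (on most platforms)
    print(math.sin(r))   # naive reduction: differs from the line above around the 8th decimal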

~~~
wallacoloo
> You can do a little better with a minimax polynomial that you can derive
> with the Remez algorithm.

Yup! This was something I meant to mention in a footnote. If I could get
equivalent [infinite precision] error bounds using a minimax polynomial _of a
lower degree_ , I think this would make a difference. Otherwise, even if the
theoretical error bounds are lower for a minimax polynomial _of the same
degree_ , it wouldn't do anything to combat the noise introduced by rounding
errors, assuming you keep everything else the same - and that noise seems to
be by far the limiting factor. But that's just conjecture - no way to know for
certain without trying it.

I do still like the Chebyshev polynomial approach because it's the only non-
iterative approach I know of. It's something that I could implement without
the aid of a CAS program pretty trivially, if I wanted to.

> The best approximation I could find calculated sin(x) and cos(x) for
> [-pi/4,pi/4] and then combined these to cover all other inputs.

Yes, that also seems to be the way to go (at least for precision). I didn't
mention it in the article, but even though the code I showed was rust, the
actual implementation is in the form of an AST for a plugin system that only
supports f32 addition, multiplication, division, and modulo as operations (no
branching, f32<->int conversions, etc). My entire motivation was to create a
sine function for this system. It's known that the inputs to the function will
be in (-pi, pi), so I don't have to perform reduction over any larger range. I
don't see a practical way to reduce down to (-pi/4, pi/4) using the operations
available to me, but I could be overlooking something.

~~~
qooiii2
Under those constraints, Chebyshev approximation will probably outperform
minimax. Minimax approximations have non-zero error at the endpoints, so they
wouldn't quite hit zero at x = pi, and your relative error there would be
horrible.

~~~
psi-squared
If you want exactly zero at the endpoints, you could do something like the
post does and approximate sin(x) / (x (pi+x) (pi-x)), or similar. You can
still do that with the Remez algorithm.

Also, a while ago I realized that you can tweak the Remez algorithm to
minimize relative error (rather than absolute error) for strictly-positive
functions - it's not dissimilar to how this blog post does it for Chebyshev
polynomials, in fact. I should really write a blog post about it, but it's
definitely doable.

So combining those two, you should be able to get a good "relative minimax"
approximation of sin over (-pi, pi), which might be better than the Chebyshev approximation
depending on your goals. Of course, you still need to worry about numerical
error, and it looks like a lot of the ideas in the original post on how to
deal with that would carry over exactly the same.

------
dnautics
This is really beautiful. If you used a fused dot product [0], you could
probably get to within 0.5 ULP very easily.

[0]
[http://ieeexplore.ieee.org/document/6545890/?arnumber=654589...](http://ieeexplore.ieee.org/document/6545890/?arnumber=6545890)

~~~
jacobolus
(Disclaimer: I’m not a numerical analyst or hardware expert.) I’m not sure if
you can use a dot product unit like the one in your link to make as much of a
difference when handling recurrences like “Clenshaw’s algorithm” for
evaluating Chebyshev polynomials (the method used in the OP, an analog of
Horner’s algorithm), as you can get for taking a big dot product or computing
an FFT or whatever. What I expect you really want is to keep your temporary
variables in higher precision all the way through the recurrence.

You can probably get some improvement vs. separate floating point addition and
multiplication with FMA instructions on existing standard hardware (and also a
slight speed boost). I haven’t ever carefully tested the accuracy improvement
in practice though.

If you are willing to take a hit on performance, you can even save the lower
order bits of higher-precision accumulated floating point numbers by using two
floats (doubles) per variable, and get almost double the precision with the
same hardware (keywords: “compensated arithmetic”, “Kahan summation”).
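
The simplest flavor of that is Kahan's compensated summation; here's a rough
Python sketch (illustrative only, nothing to do with the links below):

    def kahan_sum(values):
        total = 0.0
        comp = 0.0                   # the low-order bits lost so far
        for v in values:
            y = v - comp             # re-inject the previously lost bits
            t = total + y            # big + small: the low bits of y get dropped here...
            comp = (t - total) - y   # ...and recovered here
            total = t
        return total

    # Summing 0.1 ten million times: the naive loop drifts noticeably,
    # the compensated loop stays within an ULP or two of the true sum.
    naive = 0.0
    for _ in range(10_000_000):
        naive += 0.1
    print(naive)                                      # noticeably off from 1000000.0
    print(kahan_sum(0.1 for _ in range(10_000_000)))  # ~1000000.0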

Or see
[http://www.chebfun.org/examples/cheb/Turbo.html](http://www.chebfun.org/examples/cheb/Turbo.html)
[https://arxiv.org/abs/1404.2463](https://arxiv.org/abs/1404.2463) for a kind
of mind-blowing, much mathier trick.

------
shuber_
Note that Chebyshev polynomials have an intimate relationship to trigonometric
functions:

cos(n x) can be expressed as a polynomial of cos(x), namely T_n(cos(x)). In
other words, T_n(x) = cos(n acos(x)).
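
A quick numerical sanity check of that identity (throwaway Python/NumPy):

    import numpy as np
    from numpy.polynomial.chebyshev import Chebyshev

    n = 5
    xs = np.linspace(-1, 1, 7)
    Tn = Chebyshev.basis(n)                                # the polynomial T_5
    print(np.allclose(Tn(xs), np.cos(n * np.arccos(xs))))  # True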

------
kazinator
Here is approximating Fibonacci with polynomials. Evidently I wrote this in
June, 2014:

    
    
      #include <math.h>
      #include <stdio.h>
    
      int fib(int n)
      {
        double out = 0.002579370;
        out *= n; out -= 0.065277776;
        out *= n; out += 0.659722220;
        out *= n; out -= 3.381944442;
        out *= n; out += 9.230555548;
        out *= n; out -= 12.552777774;
        out *= n; out += 7.107142857;
        out *= n;
        return floor(out + 0.5);
      }
    
      int main(void)
      {
        int i;
        for (i = 0; i < 9; i++) {
          printf("fib(%d) = %d\n", i, fib(i));
          switch (i) {
          case 7:
            puts("^ So, is this calculation fib or not?");
            break;
          case 8:
            puts("^ Answer to life, universe & everything responds negatively!");
          }
        }
        return 0;
      }
    

Run:

    
    
      $ ./fib2
      fib(0) = 0
      fib(1) = 1
      fib(2) = 1
      fib(3) = 2
      fib(4) = 3
      fib(5) = 5
      fib(6) = 8
      fib(7) = 13
      ^ So, is this calculation fib or not?
      fib(8) = 42
      ^ Answer to life, universe & everything responds negatively

------
grondilu
It seems to me that it makes more sense to approximate sinPi(x) = sin(pi x) instead:

    
    
        scaling[x_] := 8/3  (x  - x^3)
        indices = {0, 2, 4, 6, 8, 10}
        inner[g_, h_] := NIntegrate[g h/Sqrt[1 - x^2], {x, -1, 1}, WorkingPrecision-> 40, PrecisionGoal-> 20]
        cheb[i_] := inner[ChebyshevT[i, x], Sin[Pi x]/scaling[x]]/inner[ChebyshevT[i,x], ChebyshevT[i, x]]
        coeff = cheb /@ indices
        apxi[i_] := coeff[[1 + i/2]]ChebyshevT[i, x]
        sinPi[x_] := Total[apxi /@ indices]scaling[x]
        Plot[sinPi[x], {x, -1, 1}]
        N[Simplify[sinPi[x]/(1-x^2)], 9]
    

This gives the following polynomial expression for sinPi(x)/(1-x²):

    
    
        3.14159264 x - 2.02611946 x^3 + 0.524036151 x^5 - 0.0751872634 x^7 + 0.00686018743 x^9 - 0.000385937753 x^11
    

So the function becomes:

    
    
        sub sinPi($x) {
            my $x2 = $x²;
            $x * (1 - $x2) *
            reduce * * $x2 + *,
            -0.000385937753,
             0.00686018743,
            -0.0751872634,
             0.524036151,
            -2.02611946,
             3.14159264
        }
    

That being said, cool stuff, definitely.
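
For what it's worth, here's a quick Python cross-check of those coefficients
(a separate sketch, not part of the Mathematica derivation above):

    import math

    # highest-degree coefficient first, as in the Horner/reduce evaluation above
    COEFFS = [-0.000385937753, 0.00686018743, -0.0751872634,
              0.524036151, -2.02611946, 3.14159264]

    def sin_pi(x):
        x2 = x * x
        p = 0.0
        for c in COEFFS:
            p = p * x2 + c
        return x * (1.0 - x2) * p

    worst = max(abs(sin_pi(i / 1000.0) - math.sin(math.pi * i / 1000.0))
                for i in range(-1000, 1001))
    print(worst)   # roughly 1e-7 on this grid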

------
grondilu
Notice that your program could use a fold [1]. Not sure it would be much more
efficient, but that would give it a more functional-programming style, which
is arguably more elegant.

I wrote it in Perl 6 for instance:

    
    
        sub wallace-sin($x) {
            constant pi_major =  3.1415927e0;
            constant pi_minor = -0.00000008742278e0;
            my $x2 = $x²;
            ($x - pi_major - pi_minor) *
            ($x + pi_major + pi_minor) * $x *
            reduce * * $x2 + *,
                0.00000000013291342e0,
               -0.000000023317787e0,
                0.0000025222919e0,
               -0.00017350505e0,
                0.0066208798e0,
               -0.10132118e0;
        }
    
    

1\. [https://en.wikipedia.org/wiki/Fold_%28higher-
order_function%...](https://en.wikipedia.org/wiki/Fold_%28higher-
order_function%29)

~~~
jacobolus
This is silly. You should (almost) never optimize low-level numerical
building-block routines for code style at the expense of performance.

The proper thing to do is make a generic library function for evaluating
Chebyshev polynomials, and then pass in the polynomial coefficients and the
values to evaluate in arrays. That’s a much more “functional” approach than
hard-coding the coefficients could ever be.

Then if you want it to go fast, you can vectorize the code and use FMA
instructions, and run it on multiple CPU cores, as in
[https://github.com/alkashani/numcrunch/blob/master/src/clens...](https://github.com/alkashani/numcrunch/blob/master/src/clenshaw_vector.c)
(this is code my wife wrote for me; I think there’s still a factor of 2 or 4
being left on the table getting data flowing efficiently through the CPU
caches, but we didn’t really deeply profile it to figure out what the issue
was). Of course you can do even better by running this type of highly
parallelizable code on a GPU.
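
The core of such a generic evaluator is only a few lines; here's a rough
vectorized NumPy sketch of Clenshaw's recurrence (illustrative; not the linked
clenshaw_vector.c):

    import numpy as np

    def clenshaw(coeffs, x):
        # Evaluate sum_k coeffs[k] * T_k(x) for an array of points x in [-1, 1].
        x = np.asarray(x, dtype=np.float64)
        b1 = np.zeros_like(x)
        b2 = np.zeros_like(x)
        for c in coeffs[:0:-1]:                   # c_n, c_{n-1}, ..., c_1
            b1, b2 = 2.0 * x * b1 - b2 + c, b1
        return x * b1 - b2 + coeffs[0]

    # cross-check against NumPy's own implementation
    xs = np.linspace(-1.0, 1.0, 5)
    c = [0.25, 0.75, -0.5, 0.125]
    print(clenshaw(c, xs))
    print(np.polynomial.chebyshev.chebval(xs, c))  # should match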

~~~
phkahler
>> This is silly. You should (almost) never optimize low-level numerical
>> building-block routines for code style at the expense of performance.

>> The proper thing to do is make a generic library function for evaluating
>> Chebyshev polynomials, and then pass in the polynomial coefficients and the
>> values to evaluate in arrays.

It's funny that you express concern for performance and then advocate creating
a generic function and passing everything as parameters. For things like the
function in question, doing straight line code is usually the best way.
Optimization should consist of what the author did - find the simplest
mathematical expressions to calculate the answer and just write them down.

On the other hand, if for some reason a program is really doing this
calculation over and over, your approach (worrying about data flow and
parallelism) may become preferable. I just haven't seen a case where sin(x) is
used that much ;-)

~~~
jacobolus
I thought the OP’s code was fine. My main point was that the top-level
poster’s argument about what counted as “functional” seemed silly, and that if
he really wanted to go all the way he could write a more completely generic
function which could also be made faster by getting data flowing through the
CPU more efficiently or using a GPU. (I shouldn’t have phrased this as “the
proper thing to do” without clarification, but I can’t edit it now.)

> _if for some reason a program is really doing this calculation over and
> over_

Most of the time when you want to evaluate sin( _x_ ) for one number _x_ , you
want to also evaluate it for a million other numbers. For example, you might
be updating the velocity vectors of all the rockets in a space game, finding
the map coordinates of a polyline representing the coastline of England,
converting color space coordinates for every pixel in an image from CIELAB to
CIELCh, or whatever.

In cases where you are just calling the sine function as a one-off,
performance is pretty well irrelevant.

* * *

Since sin( _x_ ) has a few steps before and after the polynomial evaluation,
you are right that it usually makes sense to hard-code the whole function with
domain normalization etc. (ideally still vectorized), though those other steps
could also be genericized if necessary, since you could also use them for
various other special functions.

The nice thing about just storing the polynomial coefficients as data is that
you can then do more things with them than evaluation alone. For inspiration,
look at the Chebfun project,
[http://www.chebfun.org/](http://www.chebfun.org/)
[https://people.maths.ox.ac.uk/trefethen/cacm.pdf](https://people.maths.ox.ac.uk/trefethen/cacm.pdf)

* * *

One last thing: I recommend trying to re-frame trigonometry problems as vector
problems whenever possible, and thereby not needing the sine function at all.
Code which avoids transcendental function evaluations altogether is usually
simpler to reason about and faster.
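
A trivial example of what I mean (illustrative): deciding which way one
direction vector turns relative to another with a cross product, instead of
computing and comparing angles.

    def turn_direction(ax, ay, bx, by):
        # +1 if (bx, by) is counter-clockwise from (ax, ay), -1 if clockwise, 0 if parallel
        cross = ax * by - ay * bx
        return (cross > 0) - (cross < 0)

    # the trig version would compare atan2(ay, ax) against atan2(by, bx), paying for
    # two transcendental calls and wrap-around edge cases near +/- pi
    print(turn_direction(1.0, 0.0, 0.0, 1.0))   # 1: a +90 degree turn is counter-clockwise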

~~~
phkahler
>> One last thing: I recommend trying to re-frame trigonometry problems as
>> vector problems whenever possible, and thereby not needing the sine function
>> at all. Code which avoids transcendental function evaluations altogether is
>> usually simpler to reason about and faster.

I couldn't agree with that more!

------
gravypod
Are there other people working on math tricks like this to make complex
functions cheap?

~~~
stephencanon
sin( ) is typically implemented with (more sophisticated versions of) exactly
the same "tricks"; most library implementations trade off a little speed for
more accuracy (and obviously support inputs from the whole number line, not
just (-pi,pi)), but honestly they're pretty fast already.

There are also some libraries that provide vectorized implementations, often
with a few options for different accuracy/performance tradeoffs.

~~~
qntty
Is the typical implementation you're referring to CORDIC[1]?

[1]
[https://en.wikipedia.org/wiki/CORDIC](https://en.wikipedia.org/wiki/CORDIC)

~~~
dnautics
[https://en.wikipedia.org/wiki/Pad%C3%A9_approximant](https://en.wikipedia.org/wiki/Pad%C3%A9_approximant)

~~~
stephencanon
Sometimes Padé, more often Minimax or Carathéodory-Fejer.

------
aeleos
This is pretty cool. I always find it interesting when cool mathematical ideas
are used to help computer science.

I remember a while back I tried implementing something like this in a hobby
operating system I was making, with constants similar to the ones in the post.
I found them in some random paper that included constants for a bunch of
different trig functions. I don't think it ended up working, and I ended up
using a table of values that it interpolated between.

~~~
tanderson92
This is solidly numerical analysis, not really computer science.

~~~
kazinator
Once upon a time, there was no difference. :)

------
pmalynin
The site crashes the Chrome tab on 57.0.2987.133 (64-bit) on Windows when I
scroll down :/

------
ezequiel-garzon
Very basic question: what is the programming language of the code snippets?

~~~
simias
It's Rust

~~~
JorgeGT
Some of them are Mathematica / Wolfram Language.

------
hpcjoe
One of the _major_ concerns I have with this code, and with codes derived from
it, has to do with the numerical loss of precision that can happen before the
series is summed.

1st:

    
    
      let pi_major = 3.1415927f32;
      let pi_minor = -0.00000008742278f32;
    

[...]

    
    
      (x - pi_major - pi_minor) *
      (x + pi_major + pi_minor) * p1 * x
    

Focus on the x, pi_minor and pi_major.

This is fine, pi_major is basically pi. pi_minor is a small constant.

One knows from basic numerical analysis of floating point [0] never to do
additions/subtractions along the lines of x +/- c*epsilon and expect to get
anything close to an accurate result.

What you are really doing is adding/subtracting numbers at widely different
scales. This is a bad thing in general, and the only real way to handle it is
to increase precision. Otherwise you completely swamp the small number with
the larger one, and you lose all the precision of the smaller number.

Simple example in Julia (0.5.1 or higher):

    
    
      julia> deltax = Float16(0.1)
      Float16(0.1)
    
      julia> sum = Float16(0)
      Float16(0.0)
    
      julia> for i=1:1000000 sum+=deltax end
    
      julia> sum
      Float16(256.0)
    

Yes, that is right, according to this, 1/10th as a Float16 summed 1 million
times reports as 256. Which is wildly ... ridiculously ... wrong.

Then redo this as Float32:

    
    
      julia> deltax2 = Float32(0.1)
      0.1f0
    
      julia> sum2 = Float32(0)
      0.0f0
    
      julia> for i=1:1000000
                 sum2+=deltax2
             end
    
      julia> sum2
      100958.34f0
    

Much closer, but you can still see the impact of the catastrophic loss of
precision [1]. This is a very well known phenomenon that people seem to forget
about every now and then, until it comes back to bite them.

I illustrated this for an HPC/parallel programming class I taught a decade ago
[2] using some simple C code. The math isn't the problem. It's the order of
accumulation of error.

This should be fairly sobering to those who are happy running f16 codes ...
you need to be very careful about the order of your operations (far more so
than with f32 or f64), lest you suffer a catastrophic loss of precision and
wind up with a code that "computes faster" or, as we old HPC types like to say
about codes that make these sorts of mistakes, "generates garbage at a
tremendous rate".

During my class a decade ago, I had an astronomer who coded in Fortran
complain to me bitterly that he'd always done sums this way, and there was no
way that this concept was correct. The C code was written to demonstrate the
effect to him.

It is sobering how much garbage numerical code there is out there. It gets
even harder to distinguish good numerical code from bad numerical code when
people get lost in almost irrelevant things, like coding style, or language of
implementation.

If you really want to freak someone out, and you have a calculator app, see if
you can take 1/3 + 1/3 + 1/3 - 1 (or 1/7 + ... + 1/7 - 1). If it uses less
than f32 precision, the 1/3 trick will leave you a few ULP from zero. If it
uses something below f64, the 1/7 version and a number of others will show it.

Basically, do a forward and reverse function application on a non-zero, non-
pathological argument, and subtract the original value from the result. A
compiler may optimize this away. An interpreter won't. And you'll see the
issue of precision right there in the non-zero result.
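
For instance, with Python doubles:

    import math

    print(math.sqrt(2.0) ** 2 - 2.0)   # ~4.4e-16: the round trip leaves a residue
    print(1/3 + 1/3 + 1/3 - 1)         # 0.0 -- doubles happen to round this one back exactly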

Note as well, some apps will try to be "smart" and handle that case (sort of
like optimizing benchmark codes with specialized handling). This is more
troubling than doing the right thing, but I've always been able to find ways
to expose the precision limits with simple tests like this.

The limited precision isn't the problem; it's how you make use of it, and how
you use, compare, and add/subtract numbers with it, that matters.

[0]
[https://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.h...](https://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html)

[1]
[https://en.wikipedia.org/wiki/Loss_of_significance](https://en.wikipedia.org/wiki/Loss_of_significance)

[2] [https://scalability.org/2007/11/why-am-i-surprised-that-
peop...](https://scalability.org/2007/11/why-am-i-surprised-that-people-found-
this-surprising/)

[edited for code clarity]

~~~
jwmerrill
These are good points in general, but I don't think they're relevant to the
author's partitioning of PI to get extra precision near the zeros. The article
does a good job of explaining exactly what's going on here, and has plots to
prove that nothing is going catastrophically wrong.
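
For concreteness, here's a small NumPy sketch (mine, not the article's code) of
what the pi_major/pi_minor split buys near the zero at pi:

    import numpy as np

    pi_major = np.float32(3.1415927)
    pi_minor = np.float32(-0.00000008742278)

    x = np.float32(3.14159)              # an f32 input close to pi
    naive = x - np.float32(np.pi)        # subtract a single f32 constant for pi
    split = (x - pi_major) - pi_minor    # two-constant subtraction, as in the article
    exact = float(x) - np.pi             # double-precision reference

    print(abs(naive - exact), abs(split - exact))   # ~9e-8 vs ~1e-13 or so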

------
rofrol
eli5?

