The implementation of the __cos kernel in musl is actually quite elegant. After reducing the input to the range [-pi/4, pi/4], it applies the best degree-14 polynomial approximation to the cosine on this interval, and it turns out this suffices to keep the error below machine precision. The coefficients of this polynomial can be computed with the Remez algorithm, but even truncating the Chebyshev expansion is going to yield much better results than any of the methods proposed by the author.
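For anyone who wants to try the Chebyshev route, here is a minimal sketch (mine, not musl's): sample the cosine at Chebyshev nodes and recover the coefficients with a discrete cosine sum. The node count and the printed degree are arbitrary choices here.

    #include <math.h>
    #include <stdio.h>

    /* Sketch: Chebyshev coefficients of cos on [-pi/4, pi/4], obtained by
       sampling at Chebyshev nodes. Truncating the series is near-minimax;
       Remez would shave the error a little further. */
    #define NODES 32   /* sample count; keep well above the truncation degree */

    int main(void)
    {
        for (int k = 0; k <= 14; k++) {
            double sum = 0.0;
            for (int j = 0; j < NODES; j++) {
                double t = M_PI * (j + 0.5) / NODES;  /* node angle */
                double x = cos(t) * M_PI / 4.0;       /* map [-1,1] onto the interval */
                sum += cos(x) * cos(k * t);
            }
            /* coefficient of T_k(x / (pi/4)); halve c[0] when evaluating */
            printf("c[%2d] = % .17g\n", k, 2.0 / NODES * sum);
        }
        return 0;
    }

The truncated series is then cos(x) ~ c[0]/2 + sum of c[k] * T_k(4x/pi); the odd coefficients come out essentially zero since cosine is even.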
It is a double-double representation [1], where a logical fp number is represented with a sum of two machine fp numbers x (head, larger) and y (tail, smaller). This effectively doubles the mantissa. In the context of musl this representation is produced from the range reduction process [2].
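The basic building block is tiny; a sketch of Dekker's Fast2Sum, which assumes |a| >= |b|:

    /* Fast2Sum (Dekker): for |a| >= |b|, s + e == a + b exactly.
       s is the head, e the tail of the double-double result. */
    static void fast2sum(double a, double b, double *s, double *e)
    {
        *s = a + b;
        double z = *s - a;  /* the portion of b that made it into s */
        *e = b - z;         /* the rounding error, recovered exactly */
    }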
Does it make sense to use a double-double input when you only have double output? Sine is Lipschitz-continuous with constant 1, so I don't see how this makes a meaningful difference.
The input might be double but the constant pi is not. Let f64(x) denote rounding a real number to double, so that an ordinary expression `a + b` actually computes f64(a + b), and so on. Then in general f64(sin(x)) may differ from f64(sin(f64(x mod 2pi))); since you can't compute x mod 2pi exactly in doubles, you necessarily need more precision during argument reduction so that f64(sin(reduced(x))) still equals f64(sin(x)).
But am I correct in thinking that it's at worst a 0.5 ulp error in this case? The lesser term in a double-double can't be more than 0.5 ulp of the greater term, and the sensitivity of both sine and cosine to an error in the input is at most 1.
Yeah, sine and cosine are not as sensitive (but note that many libms target 1 or 1.5 ulp error for them, so a 0.5 ulp error might still be significant). For tangent however you definitely need more accurate range reduction.
Double rounding can still bite you. You are forced to incur up to half an ulp of error from your polynomial, so taking another half ulp in your reduction can lead to a total error of about 1 ulp.
I might be wrong, but I would think that for something like this, vectorizing wouldn't save time (since you would have to move data around before and afterwards). The real benefit is that it lets you run two FMA operations in parallel.
No. FSIN has accuracy issues as sibling mentions, but is also much slower than a good software implementation (it varies with uArch, but 1 result every ~100 cycles is common; even mediocre scalar software implementations can produce a result every twenty cycles).
It's not just subnormal numbers. As https://randomascii.wordpress.com/2014/10/09/intel-underesti... shows, fsin only uses 66 bits of pi, which means you have roughly no precision whenever abs(sin(x)) < 10^-16, which is way bigger than the biggest subnormal (around 10^-308).
In that range, just returning x would be way better. Maybe even perfect, actually: if x is less than 10^-16, then the omitted x^3/6 term is below the machine precision of x.
FSIN only works on x87 registers which you will rarely use on AMD64 systems -- you really want to use at least scalar SSE2 today (since that is whence you receive your inputs as per typical AMD64 calling conventions anyway). Moving data from SSE registers to the FP stack just to calculate FSIN and then moving it back to SSE will probably kill your performance even if your FSIN implementation is good. If you're vectorizing your computation over 4 double floats or 8 single floats in an AVX register, it gets even worse for FSIN.
Moving between x87 and xmm registers is actually fairly cheap (it's through memory, so it's not free, but it's also not _that_ bad). FSIN itself is catastrophically slow.
Fair enough, and I imagine there may even be some store-to-load forwarding going on? There often is when a load follows a store, if I remember correctly. (Of course this will be microarchitecture-dependent.)
The question is if you wouldn't be better served with double-doubles today. You get ~100 bits of mantissa AND you can still vectorize your computations.
Sure. There should be a gcc flag to make "long double" become quadruple precision.
The thing is, my first programming language was x86 assembler, and the FPU was the most fun part. I spent weeks as a teenager writing almost pure 8087 code. I have a lot of emotional investment in that tiny rolling stack of extended-precision floats.
You're selectively quoting me - I said it's 'legacy emulated'. It's emulated using very long-form microcode, so basically emulated in software. I didn't say it was simply 'legacy'.
I’m completely out of my depth reading through these comments so I don’t have much of value to contribute, but I do think I can gently say I think the selective quoting was harmless, but deliberate to fit the shape of a harmless joke’s setup. I don’t think there was any intent to misrepresent your more salient information.
That's got nothing to do with musl per se; that's just the SunPro code that basically every C library uses. I'm sure the polynomials themselves (there's another one for the "crest" of the sine curve, and still more for log and exp and the rest of the family) date from somewhere in computing prehistory. Optimizing polynomial fits to analytic functions was something you could do on extremely early computers.
I compared several different methods of generating polynomials of different sizes for speed and precision (spoilers: Taylor series were the worst and minimax polynomials (Remez algorithm) were the best).
Another (surprising) thing I learned during the project was that the range reduction was just as important to the accuracy of the implementation as the polynomial, if not more so. If you think about it, you will realize that it's actually pretty difficult to quickly and accurately compute the sine of large numbers like 2^50.
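(Not from the repository; just to illustrate the idea for moderate arguments: Cody-Waite reduction splits pi/2 into a high part with trailing zero bits, so that n times the high part is exact for modest n, and the two subtractions lose far less than a single x - n*(pi/2) would. The constants below follow the well-known fdlibm-style 33-bit split; treat them as an assumption and verify before use.)

    #include <math.h>

    static const double PIO2_HI = 1.57079632673412561417e+00; /* top 33 bits of pi/2 */
    static const double PIO2_LO = 6.07710050650619224932e-11; /* pi/2 - PIO2_HI */

    double reduce(double x, int *quadrant)
    {
        double n = nearbyint(x * (2.0 / M_PI));
        *quadrant = (int)n & 3;
        return (x - n * PIO2_HI) - n * PIO2_LO;  /* r in about [-pi/4, pi/4] */
    }

For truly huge arguments like 2^50 this is no longer enough, and you need something like Payne-Hanek reduction with many bits of 2/pi.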
I also tried to directly optimize the coefficients for the accuracy of the polynomial on the required range, but that experiment was unsuccessful.
It's all there in the repository: the implementations, notes about the different polynomials used, and the accuracy/speed statistics for the different methods.
> I compared several different methods of generating polynomials of different sizes for speed and precision (spoilers: Taylor series were the worst and minimax polynomials (Remez algorithm) were the best).
I would have expected at least an LSQ approximation with a basis of Legendre polynomials thrown into the mix. I got that as basic homework in my numerics class once, after we had shown ourselves in class that [1, x, x², x³, ...] is not a very good basis to project things onto.
Are there libraries/tools that people use to do Remez/Chebyshev/etc. function expansions? I can do a basic Taylor series expansion by hand but I’m out of my depth with more sophisticated techniques.
This is a very nice article, and I think the three initial objectives are fully attained (simple enough, accurate enough, fast enough).
Some of the performance claims have a caveat, though: Lookup tables and micro-benchmarks don't mix well. At all.
I just added a simple loop that thrashes the L3 cache every 500 iterations (diff here: https://goonlinetools.com/snapshot/code/#sm4fjqtvjyn36dyednc...). Now the method recommended at the end (cos_table_0_001_LERP) is slower than glibc's cos() (while still having an accuracy that is more than 10^8 times worse)!
How long does the cache-thrashing loop take compared to 500 iterations? I'm not sure how much you can thrash the cache while still being performance-bottlenecked by a table-based implementation of a trig function.
Yes of course the cache thrashing takes a lot longer. But I modified the code to time only the cos() functions (see diff), using the rdtsc instruction. That's why it had to be every 500 iterations: If it was after each iteration, then rdtsc itself would become the bottleneck!
My point wasn't about how you were timing the code. What I meant is that in the real world, I don't think there are programs where this will be an issue, because programs that thrash their caches that badly won't be bottlenecked on trig. The advantage of table-based functions is that they are really fast if you are calling a lot of them at once, but that's also the only case where they need to be fast. If only 1 in 10,000 instructions is trig, it's OK if the trig is a little slower.
It's all well and good to have a library with fast sin/cos/tan functions, but always remember to cheat if you can.
For instance, in a game, it's common to have a gun/spell or whatever that shoots enemies in a cone shape, like a shotgun or burning hands. One way to code this is to calculate the angle between where you're pointing and the enemies (using arccos), and if that angle is small enough, apply the damage or whatever.
A better way is to take the vector where the shotgun is pointing and the vector to a candidate enemy, and take the dot product of the two. Pre-compute the cosine of the angle of effect and compare the dot product to that: if the dot product is higher, it's a hit; if it's lower, it's a miss. (You can get away with very rough normalization in a game, for instance using the rsqrt instruction in SSE.) You've turned a potentially slow arccos operation into a fast handful of basic float operations.
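A sketch of what that looks like (names made up):

    #include <math.h>

    typedef struct { float x, y; } vec2;

    /* Cone test via dot product: no arccos in sight. aim_unit is the
       (unit-length) facing vector; cos_half_angle is precomputed once. */
    static int in_cone(vec2 aim_unit, vec2 to_enemy, float cos_half_angle)
    {
        /* rough normalization is fine in a game; rsqrt tricks also work */
        float len = sqrtf(to_enemy.x*to_enemy.x + to_enemy.y*to_enemy.y);
        float dot = (aim_unit.x*to_enemy.x + aim_unit.y*to_enemy.y) / len;
        return dot >= cos_half_angle;
    }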
Or for instance, you're moving something in a circle over time. You might have something like
float angle = 0
for (...)
angle += 0.01
<something with sin(angle) cos(angle)>
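The replacement, presumably, hoists the trig out of the loop: precompute (cos 0.01, sin 0.01) once and rotate the running (cos, sin) pair each iteration, 4 multiplies and 2 adds:

    const float cd = cosf(0.01f), sd = sinf(0.01f);  /* one-time cost */
    float c = 1.0f, s = 0.0f;                        /* cos(0), sin(0) */
    for (...) {
        float cn = c * cd - s * sd;   /* cos(angle + 0.01) */
        s        = s * cd + c * sd;   /* sin(angle + 0.01) */
        c = cn;
        /* <something with s, c> */
    }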
And you've replaced a sin/cos pair with 6 elementary float operations.
And there's the ever popular comparing squares of distances instead of comparing distances, saving yourself a square root.
In general, inner loops should never have a transcendental or exact square root in them. If you think you need one, there's almost always a way to hoist it from the inner loop out into an outer loop.
I learned how to write programs from Scratch[0]. Back in the pre-1.4 days (IIRC) there were no native sin/cos functions, so they had to be "implemented" somehow to translate arbitrary angles into delta (x, y). The way I found people did this was as follows (for sin(A), cos(A)):
Invisible sprite:
[ when I receive "sine" ]:
[ go to center ]
[ turn A degrees ]
[ move 1 step forward ]
Now, anyone that wants to know sin(A), cos(A) just has to read that sprite's X and Y position.
I guess Scratch (or Logo) is Turing-complete, so someone could make a CPU built using a little turtle in a box, where all functionality is built out of Logo commands.
This is a perfectly nice Programming 101 example of a recursive function: if an angle is too large, calculate its sine using the sine of the half-angle; otherwise return the angle (or, if you're fancy, some simple polynomial approximation of the sine). I'm sure everyone did this in school (we did, in fact).
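A sketch of the idea; this variant uses the triple-angle identity sin(3h) = 3 sin(h) - 4 sin^3(h) instead of the half-angle form, since it avoids needing a companion cosine:

    #include <math.h>

    double rec_sin(double x)
    {
        if (fabs(x) < 1e-4)
            return x - x*x*x/6.0;      /* base case: small-angle Taylor */
        double h = rec_sin(x / 3.0);   /* recurse on a third of the angle */
        return h * (3.0 - 4.0*h*h);    /* rebuild via the sin(3h) identity */
    }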
I didn’t do this (or any programming, or any trigonometry) in school, but I’m appreciative of tidbits like this from people who did. I definitely feel comfortable with recursion, and have a stalled art project which could benefit from better understanding how to apply trig as I accumulate understanding in a much less structured way.
It's not obvious when you should switch over to an approximation (base case), so I'd say it's not a good example to introduce recursion. I have never seen it, and I did not do it in school.
This takes me back in time, to c. 1982 when I got curious about how computers calculated trig functions. Luckily I was using one of the first OSS systems (modulo AT&T lawyers): Unix V7. So I pulled up the source and took a look. You can still see it here: https://github.com/v7unix/v7unix/blob/master/v7/usr/src/libm...
There's a comment referencing "Hart & Cheney", which is this book: https://www.google.com/books/edition/Computer_Approximations...
Also luckily I had access to a library that had the book which I was able to read, at least in part.
This taught me a few things: that the behavior of a computer can be a white box, not a black box; the value of having access to the source code; to use comments effectively; and that for many difficult tasks, someone has probably written a book on the subject. Also that polynomials can approximate transcendental functions.
So 57 years ago, there weren’t any programming books. Well, there was one. McCracken and Dorn, Numerical Methods and Fortran Programming. A kindly childless couple, Ben and Bluma Goldin, who lived in the same apartment house in Brooklyn, bought me this book as a gift.
An axiomatic lesson from the book was distrust of Taylor expansions, and the necessity of moving the expansion point of any function to a favorable location.
Because harmonic functions are very well behaved, it should also be obvious that even a very coarse lookup table + interpolation should converge to high accuracy quickly. Finally, use of trigonometric identities to zero in on the desired angle is helpful. OK I’ll shut up now.
Going on a tangent: as I understand it, the first programming book was "The Preparation of Programs for an Electronic Digital Computer: With special reference to the EDSAC and the Use of a Library of Subroutines", by Maurice Wilkes, David Wheeler, and Stanley Gill (1951).
They are long dead. It was sad. They doted on me and my sister because they had no kids. He was a salesman. He knew how to get books in Manhattan. He used to get my sister classics with the most beautiful color plate illustrations. In the sixties, these had little value. Today?
The article uses a game movement function as a possible motivation for needing the cosine. If object.rotation is constant as the spiral animation seems to suggest, there is an important optimization for rapidly computing the sequence: cos(theta), cos(theta + delta), cos(theta + 2delta), cos(theta + 3delta), ...
This comes up in movement under a constant rate of rotation as well as in Fourier transforms. See [1] for details, but the basic idea uses the simple trig identities:

    cos(theta + delta) = cos(theta) - [a*cos(theta) + b*sin(theta)]
    sin(theta + delta) = sin(theta) - [a*sin(theta) - b*cos(theta)]
The values for a and b must be calculated, but only once if delta is constant:
a = 2 sin^2(delta/2)
b = sin(delta)
By using these relationships, it is only necessary to calculate cos(theta) and sin(theta) once for the first term in the series in order to calculate the entire series (until accumulated errors become a problem).
[1] Press, William H. et al., Numerical Recipes, Third Edition, Cambridge University Press, p. 219
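A sketch of the resulting loop (my illustration of the recurrence above, not NR's code):

    #include <math.h>

    /* Generates cos(theta + i*delta) for i = 0..n-1: one sin/cos pair up
       front, then a few multiplies and adds per term. This form is more
       resistant to roundoff than rotating by (cos(delta), sin(delta)). */
    void cos_series(double theta, double delta, double *out, int n)
    {
        double c = cos(theta), s = sin(theta);
        double a = 2.0 * sin(delta / 2) * sin(delta / 2);
        double b = sin(delta);
        for (int i = 0; i < n; i++) {
            out[i] = c;
            double cn = c - (a * c + b * s);
            s        = s - (a * s - b * c);
            c = cn;
        }
    }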
I teach lookup tables specifically for sine/cosine (we just use one with a phase shift and modulus) on the Gameboy Advance. The hardware has no floating point unit support, so we do it in 8.8 fixed point with 1-degree resolution.
It turns out the BIOS on the hardware can do sine/cosine, but you have to do that via inline assembly (swi calls) and it's no faster than the simple lookup table.
This is all just to get terms for the affine matrix when rotating sprites/backgrounds.
Horner does not conflict with vectorization in this case. If you read the code, it actually uses Horner; the vectorizability applies when you call the function in a loop.
I have a few cases where I think a Padé approximation could really help me, but I've never been able to figure out, given a function, how to get the coefficients.
People reach for the Taylor series because it’s what they remember from calculus, but there’s the Remez algorithm for finding the polynomial which minimizes the maximum error.
If you’re measuring “how good is my cosine function” by calculating the maximum error, then it makes sense to use an optimization algorithm that minimizes that error.
The Remez algorithm is rather elegant. The basic idea is that you come up with a polynomial, figure out where the local error maximums are, and then calculate a new polynomial with better local error maximums in the same locations. Repeat. With the right conditions it converges quickly and to a global minimum.
For example, let’s say that you try to come up with a quarter-wave cosine approximation. You end up with a polynomial that has local error maximums at x=[0, 0.1, 0.3, 0.7, 1.4, 1.57], and various values of ∆y. You create a new polynomial and design it to have exact errors of ±∆y at the same x locations… the errors will alternate in sign as the polynomial goes above and below the target function. However, when you design the function, you’re just doing a polynomial fit through these (x,y) coordinates, so the new polynomial will have local error maximums at different x coordinates. Note that the boundaries are also error maximums. You end up with exactly the right number of degrees of freedom to solve the equation.
Each time through the loop you get a new set of x coordinates. Under the right conditions, these converge. Eventually the ∆y errors are all nearly equal to ± some global error maximum, with alternating signs.
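To make "a polynomial fit through these (x,y) coordinates" concrete, here's a sketch of the linear solve at the heart of one iteration (my own illustration; the degree is arbitrary). The unknowns are the coefficients plus the levelled error E, whose sign alternates across the points:

    #include <math.h>

    #define DEG 5            /* polynomial degree (arbitrary here) */
    #define PTS (DEG + 2)    /* alternation points */

    /* One Remez step: given alternation points x[0..PTS-1], solve
         a0 + a1*x[i] + ... + aDEG*x[i]^DEG + (-1)^i * E = f(x[i])
       for the coefficients a[] and the levelled error E. A full
       implementation would then relocate the error extrema and repeat. */
    void remez_step(const double x[PTS], double (*f)(double),
                    double a[DEG + 1], double *E)
    {
        double A[PTS][PTS + 1];              /* augmented matrix */
        for (int i = 0; i < PTS; i++) {
            double p = 1.0;
            for (int j = 0; j <= DEG; j++) { A[i][j] = p; p *= x[i]; }
            A[i][DEG + 1] = (i % 2) ? -1.0 : 1.0;  /* alternating sign on E */
            A[i][PTS] = f(x[i]);                   /* right-hand side */
        }
        /* Gauss-Jordan with partial pivoting (fine at this tiny size) */
        for (int c = 0; c < PTS; c++) {
            int piv = c;
            for (int r = c + 1; r < PTS; r++)
                if (fabs(A[r][c]) > fabs(A[piv][c])) piv = r;
            for (int k = 0; k <= PTS; k++) {
                double t = A[c][k]; A[c][k] = A[piv][k]; A[piv][k] = t;
            }
            for (int r = 0; r < PTS; r++) {
                if (r == c) continue;
                double m = A[r][c] / A[c][c];
                for (int k = c; k <= PTS; k++) A[r][k] -= m * A[c][k];
            }
        }
        for (int j = 0; j <= DEG; j++) a[j] = A[j][PTS] / A[j][j];
        *E = A[DEG + 1][PTS] / A[DEG + 1][DEG + 1];
    }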
I used this to approximate sine and exp functions for an audio synthesis tool I'm working on; the nice thing about polynomials over lookup tables is that they are easier to convert to SIMD.
Here is the NumPy code I used to find the coefficients. Note that this is not taken from any kind of numerical recipes cookbook and I can’t really vouch for the algorithm I used from a numerical standpoint—I only checked that it works empirically.
Audio stuff uses different methods depending on the application. Additive oscillators need to be accurate and can use some closed forms and series; mostly people use Taylor and quadratic interpolation, as the article shows. For tables, remember you only really need to compute and store one quadrant and shift it around. But for LFOs, people use all manner of quick-and-dirty methods like Bhaskara.
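Bhaskara I's approximation, for reference (good to about 0.0016 absolute error on [0, pi], which is plenty for an LFO):

    #include <math.h>

    static double bhaskara_sin(double x)   /* x in [0, pi] */
    {
        double y = x * (M_PI - x);
        return 16.0 * y / (5.0 * M_PI * M_PI - 4.0 * y);
    }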
I did the same thing for sqrt() seven years ago. It benchmarked at about five times faster than math.h/libm, but the trade-off was a limited input range: my version would not produce accurate results for inputs greater than 2^16. It did work very well for generating real-time laser-pointing X/Y coordinates for an Archimedean spiral.
Newton's method involves division, which is a problem on an embedded platform with a limited FPU that can only do multiplication. I tried several classic approaches and benchmarked all of them against the function in libm. I reviewed what I did and it turns out that my own memory is not that great. My version (nsqrt) is limited to input values of less than 2^14 and not 2^16 as I said above. Also, it's only 50% faster and not 5x as I said above.
Even a 50% increase in speed was well worth my effort though.
* Replacement sqrtf() function. This function runs about 50% faster than
* the one in the TI math library. It provides identical accuracy, but
* it is limited to input values less than 2^14. This shouldn't be a problem
* because the passed argument will always be in the range of
* 0-(max spiral scan period). The spiral scan period is presently about
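(For reference, a sketch of the standard division-free route, not necessarily what the TI code does: Newton iteration on the reciprocal square root needs only multiplies, and then sqrt(x) = x * (1/sqrt(x)). The magic-constant seed is the well-known bit trick; a small table works too.)

    static float nsqrt(float x)
    {
        union { float f; unsigned u; } v = { x };
        v.u = 0x5f3759df - (v.u >> 1);        /* rough 1/sqrt(x) guess */
        float y = v.f;
        for (int i = 0; i < 3; i++)           /* each step ~doubles the accuracy */
            y = y * (1.5f - 0.5f * x * y * y); /* Newton on 1/sqrt: multiplies only */
        return x * y;
    }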
You are doing the polynomial evaluation wrong for the Taylor series. First, you should be using Horner's method to reduce the number of multiplications, and then you should be using FMA to increase the precision of the result. https://momentsingraphics.de/FMA.html
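A sketch of what that looks like for a generic coefficient array (fma() is C99 <math.h>):

    #include <math.h>

    /* Horner's rule with fma(): one rounding per step instead of two,
       and fewer multiplies than evaluating each power separately. */
    double horner_fma(const double *c, int n, double x)  /* c[0] + c[1]x + ... */
    {
        double p = c[n - 1];
        for (int i = n - 2; i >= 0; i--)
            p = fma(p, x, c[i]);   /* p*x + c[i] with a single rounding */
        return p;
    }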
If you want sinusoidal oscillation over time, then you can integrate the second-order differential equation x'' = -ω² x instead. This only costs a few arithmetic instructions per time step. On the other hand, it adds dependency on a state other than the current time, and you must be careful to use energy-conserving integration.
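A sketch of one such energy-conserving scheme, semi-implicit (symplectic) Euler:

    #include <stddef.h>

    /* Integrates x'' = -w*w*x. Updating v before x keeps the energy bounded,
       so the oscillation neither decays nor blows up the way naive forward
       Euler does. */
    void oscillate(double w, double dt, double *out, size_t n)
    {
        double x = 1.0, v = 0.0;              /* start at phase 0: cos = 1 */
        for (size_t i = 0; i < n; i++) {
            v -= w * w * x * dt;              /* velocity first... */
            x += v * dt;                      /* ...then position with the new v */
            out[i] = x;                       /* ~ cos(w * (i+1) * dt) */
        }
    }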
There are two ways to look at this, from the systems perspective this is computing the impulse response of a critically stable filter with poles at e^(+/-jω). Geometrically, it's equivalent to starting at (1, 0) and rotating around the unit circle by ω radians each time step. You can offset to arbitrary phases.
It's not suitable for numerous cases (notably if the frequency needs to change in real time), but provided that the coefficients sum to `1.0`, it's guaranteed stable and will not alias.
I have been looking for papers with _very clean and serious_ mathematical proofs of accuracy for approximants of transcendental functions (Remez/Chebyshev/etc.).
Any pointers?
I haven’t seen this version mentioned in the thread - if you don’t need a lot of precision, here is a simple 4 line version[1] and here’s how it works[2].
Slightly off-topic, but kudos to the author for explaining the process extremely clearly. As someone without experience in low-level programming, I found this quite understandable (albeit with uni-level maths knowledge).
It’s an old article, so… guess I won’t beat a dead horse as much.
* The range reduction is, well, probably worse than acos. Look for Payne–Hanek–Corbett. Not an issue for games, I assume.
* The LERP is nice and all, but Qt actually has an even better one: instead of doing a LERP, you use the local derivative, which is also available from the lookup table (sketch below). https://stackoverflow.com/a/52841086
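A sketch of that derivative trick (my illustration, assuming a table size divisible by 4 so the quarter-period shift lands on an entry):

    #include <math.h>

    /* sin via table value plus local slope: sin(x0 + h) ~ sin(x0) + cos(x0)*h.
       The cosine is the same table shifted a quarter period, so no second
       table is needed. */
    #define SIZE 1024

    double table_sin(const double tab[SIZE], double x)  /* x in [0, 2*pi) */
    {
        double pos = x * SIZE / (2.0 * M_PI);
        int i = (int)pos;
        double h = (pos - i) * (2.0 * M_PI / SIZE);     /* radians past entry i */
        int j = (i + SIZE / 4) % SIZE;                  /* cos(x0) = sin(x0 + pi/2) */
        return tab[i] + tab[j] * h;
    }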
Others have already mentioned the Remez and the Horner stuff. Won’t repeat here.
A lot of BASIC implementations that had software support for floating-point numbers implemented it like this. Microsoft BASIC used a 6th-degree polynomial, IIRC.
I haven't checked, but off the top of my head this could be because this optimisation changes the precision of the result enough for it to be an illegal optimisation.
Some compilers have options such as -ffast-math that result in ignoring some precision issues. This could be worth a try as an alternative to manual optimisation.
CORDIC beautifully explained; I have used this approach on fixed-point DSP processors in another life. Its simplicity and elegance are beautiful (on controllers with no FP).
Can someone explain to me why calculating cosine seems to be one of the most complicated tasks? I have tried to read about it a few times and never really found an explanation for calculating it which makes sense.
I think an off-zero Taylor series along with a lookup table should at least be tried. You can approximate cos(x+a) using only cos(a) and sin(a), and the series converges really fast.
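A sketch of that, with an arbitrary 256-entry table and two Taylor terms per factor (assumes the argument is already reduced to [0, 2pi)):

    #include <math.h>

    /* Off-zero idea: nearest table angle a, small residual h, then
       cos(a + h) = cos(a)*cos(h) - sin(a)*sin(h). With 256 entries,
       |h| <= pi/256, so two terms per factor already go a long way. */
    #define TN 256
    static double CA[TN], SA[TN];

    void init_tables(void) {
        for (int i = 0; i < TN; i++) {
            CA[i] = cos(2 * M_PI * i / TN);
            SA[i] = sin(2 * M_PI * i / TN);
        }
    }

    double cos_offzero(double x)        /* x in [0, 2*pi) */
    {
        int i = (int)(x * TN / (2 * M_PI) + 0.5);   /* nearest table angle */
        double h = x - 2 * M_PI * i / TN;           /* residual, |h| <= pi/TN */
        double ch = 1 - h * h / 2;                  /* cos(h), 2 terms */
        double sh = h * (1 - h * h / 6);            /* sin(h), 2 terms */
        return CA[i % TN] * ch - SA[i % TN] * sh;
    }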
I remember doing this in Macromedia Flash as I wanted to make stuff go in circles and the language didn't have any higher math functions back then. Good ole times.
>> lookup tables and lerp are the most common solution I've seen
The author missed one nice table method. When you need sin() and cos(), split the angle into a coarse and a fine angle. Coarse might be 1/256 of a circle and fine 1/65536 of a circle (whatever that angle is). You look up the sin/cos of the coarse angle in a 256-entry table. Then you rotate that by the fine angle, which uses another 256 sin & cos entries. The rotation between coarse values is more accurate than linear interpolation and gives almost the same accuracy as a 65536-entry table using only 768 entries. You can use larger tables, or even another rotation by finer angle steps, to get more accuracy.
Are you trying to reduce the error in the linear interpolation by describing the convex curve between the endpoints?
The derivative of the cosine is the sine, and the derivative of the sine is the cosine... so I'd expect the fine-angle values needed to interpolate 0.15 between 0.1 and 0.2 degrees and those needed to interpolate 45.15 between 45.1 and 45.2 degrees to be very different.
Yeah I thought that might not be clear enough. Let's say you want 0.1 degree accuracy but don't want a 3600 entry table. Make one table for every 1 degree. You get to use the same table for sin and cos by changing the index. That is the coarse table with 360 entries. Then make another table with sin and cos values for every 0.1 degrees, or specifically 0.1*n for n going from 0-9. This is another 20 values, 10 sin and 10 cos for small angles.
Take your input angle and split it into a coarse (integer degrees) and fine (fractional degrees) angle. Now take the (sin(x),cos(x)) from the coarse table as a vector and rotate it using the sin & cos values of the fine angle using a standard 2x2 rotation matrix.
You can size these tables however you need. I would not use 360 and 10 for their sizes, but maybe 256 and 256. This can also be repeated with a 3rd table for "extra fine" angles.
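A sketch with the 256/256 sizing (my illustration):

    #include <math.h>

    /* 16-bit angle: top 8 bits coarse, low 8 bits fine. The fine table only
       spans one coarse step, so both tables stay small. */
    #define CN 256
    #define FN 256
    static double Cc[CN], Cs[CN], Fc[FN], Fs[FN];

    void init_tables(void) {
        for (int i = 0; i < CN; i++) {
            double a = 2 * M_PI * i / CN;
            Cc[i] = cos(a); Cs[i] = sin(a);
        }
        for (int i = 0; i < FN; i++) {
            double a = 2 * M_PI * i / (CN * FN);   /* 1/65536 of a circle */
            Fc[i] = cos(a); Fs[i] = sin(a);
        }
    }

    void sincos16(unsigned idx, double *s, double *c)  /* idx in [0, 65536) */
    {
        unsigned hi = (idx >> 8) & 0xff, lo = idx & 0xff;
        *c = Cc[hi] * Fc[lo] - Cs[hi] * Fs[lo];  /* rotate coarse by fine */
        *s = Cs[hi] * Fc[lo] + Cc[hi] * Fs[lo];
    }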
Yeah, lerping into a lookup table is common in audio and (old school) games.
A common trick to make that faster is to not use doubles and radians to represent the phase. Instead, represent phase using an integer in some power-of-two range. That lets you truncate the phase to fit in a single period using a bit mask instead of the relatively slow modulo.
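A sketch, with the frequency and sample rate as assumed inputs:

    #include <stdint.h>

    /* Phase held in a 32-bit unsigned integer, one period = 2^32. Unsigned
       overflow wraps for free, so there is no fmod at all; with a smaller
       power-of-two range you'd mask with (RANGE - 1) instead. */
    static uint32_t phase;

    double next_sample(const double tab[1024], double freq, double sample_rate)
    {
        uint32_t step = (uint32_t)(freq / sample_rate * 4294967296.0);
        double y = tab[phase >> 22];   /* top 10 bits index the 1024-entry table */
        phase += step;                 /* wraps mod 2^32 */
        return y;
    }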
Julia uses a very similar approach to musl here. Reduction mod 2pi followed by polynomial kernel. We use tables for `exp` and `log`, but the trig functions have good reductions so we use those instead. The implementation is here https://github.com/JuliaLang/julia/blob/03af78108afe6058f54c... for anyone interested.
In the Julia repl, just type '@less sin(pi)' to view the source code of sin. I don't know of any other programming language that makes it this easy to look at its internals.
> So we are really only calculating cosine from 0 to pi.
You could go further and only do 0 to pi/2, as it is mirrored and flipped for pi/2 to pi.
Or you could go even further and do only 0 to pi/4, using a simple trig identity and your presumably parallel-developed sin function for the cos of pi/4 to pi/2.
> One of the implementations is nearly 3x faster than math.h
I think you should be able to get good results by precalculating the cosines for a set of angles and then using the formula for the cosine of a sum of two angles, combined with the Taylor series. The series converges very quickly for very small angles, right?
I don't know if this benchmark is trustworthy. Using a lookup table pollutes the cache. It really needs to be measured while interleaved with the other parts of the code to get a good performance picture.
A LERP lookup table is fast; only somewhat slower, and probably quite a bit more accurate, would be a lookup table interpolated with a Catmull-Rom spline.
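For reference, the standard Catmull-Rom kernel between table entries p1 (t=0) and p2 (t=1), using the two neighbors p0 and p3; cubic instead of linear, same table:

    double catmull_rom(double p0, double p1, double p2, double p3, double t)
    {
        return 0.5 * ( 2.0 * p1
                     + (-p0 + p2) * t
                     + (2.0*p0 - 5.0*p1 + 4.0*p2 - p3) * t * t
                     + (-p0 + 3.0*p1 - 3.0*p2 + p3) * t * t * t );
    }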