The implementation of the __cos kernel in musl is actually quite elegant. After reducing the input to the range [-pi/4, pi/4], it applies the best degree-14 polynomial approximation to the cosine on this interval, and it turns out this suffices to keep the error below machine precision. The coefficients of this polynomial can be computed with the Remez algorithm, but even truncating the Chebyshev expansion is going to yield much better results than any of the methods proposed by the author.
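For anyone who wants to try the Chebyshev route, here is a minimal sketch (mine, not musl's): sample the cosine at Chebyshev nodes and recover the coefficients with a discrete cosine sum. The node count and the printed degree are arbitrary choices here.

    #include <math.h>
    #include <stdio.h>

    /* Sketch: Chebyshev coefficients of cos on [-pi/4, pi/4], obtained by
       sampling at Chebyshev nodes. Truncating the series is near-minimax;
       Remez would shave the error a little further. */
    #define NODES 32   /* sample count; keep well above the truncation degree */

    int main(void)
    {
        for (int k = 0; k <= 14; k++) {
            double sum = 0.0;
            for (int j = 0; j < NODES; j++) {
                double t = M_PI * (j + 0.5) / NODES;  /* node angle */
                double x = cos(t) * M_PI / 4.0;       /* map [-1,1] onto the interval */
                sum += cos(x) * cos(k * t);
            }
            /* coefficient of T_k(x / (pi/4)); halve c[0] when evaluating */
            printf("c[%2d] = % .17g\n", k, 2.0 / NODES * sum);
        }
        return 0;
    }

The truncated series is then cos(x) ~ c[0]/2 + sum of c[k] * T_k(4x/pi); the odd coefficients come out essentially zero since cosine is even.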
It is a double-double representation [1], where a logical fp number is represented with a sum of two machine fp numbers x (head, larger) and y (tail, smaller). This effectively doubles the mantissa. In the context of musl this representation is produced from the range reduction process [2].
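The basic building block is tiny; a sketch of Dekker's Fast2Sum, which assumes |a| >= |b|:

    /* Fast2Sum (Dekker): for |a| >= |b|, s + e == a + b exactly.
       s is the head, e the tail of the double-double result. */
    static void fast2sum(double a, double b, double *s, double *e)
    {
        *s = a + b;
        double z = *s - a;  /* the portion of b that made it into s */
        *e = b - z;         /* the rounding error, recovered exactly */
    }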
Does it make sense to use a double-double input when you only have double output? Sine is Lipschitz-continuous with constant 1, so I don't see how this makes a meaningful difference.
The input might be double but the constant pi is not. Let f64(x) denote rounding a real number to double, so that an ordinary expression `a + b` actually computes f64(a + b), and so on. Then in general f64(sin(x)) may differ from f64(sin(f64(x mod 2pi))); since you can't compute x mod 2pi exactly in doubles, you necessarily need more precision during argument reduction so that f64(sin(reduced(x))) still equals f64(sin(x)).
But am I correct in thinking that it's at worst a 0.5 ulp error in this case? The lesser term in a double-double can't be more than 0.5 ulp of the greater term, and the sensitivity of both sine and cosine to an error in the input is at most 1.
Yeah, sine and cosine are not as sensitive (but note that many libms target 1 or 1.5 ulp error for them, so a 0.5 ulp error might still be significant). For tangent however you definitely need more accurate range reduction.
Double rounding can still bite you. You are forced to incur up to half an ulp of error from your polynomial, so taking another half ulp in your reduction can lead to a total error of about 1 ulp.
I might be wrong, but I would think that for something like this, vectorizing wouldn't save time (since you would have to move data around before and afterwards). The real benefit is that it lets you run two FMA operations in parallel.
No. FSIN has accuracy issues as sibling mentions, but is also much slower than a good software implementation (it varies with uArch, but 1 result every ~100 cycles is common; even mediocre scalar software implementations can produce a result every twenty cycles).
It's not just subnormal numbers. As https://randomascii.wordpress.com/2014/10/09/intel-underesti... shows, fsin only uses 66 bits of pi, which means you have roughly no precision whenever abs(sin(x)) < 10^-16, which is way bigger than the biggest subnormal (around 10^-308).
In that range, just returning x would be way better. Maybe even perfect, actually: if x is less than 10^-16, then the omitted x^3/6 term is below the machine precision of x.
FSIN only works on x87 registers which you will rarely use on AMD64 systems -- you really want to use at least scalar SSE2 today (since that is whence you receive your inputs as per typical AMD64 calling conventions anyway). Moving data from SSE registers to the FP stack just to calculate FSIN and then moving it back to SSE will probably kill your performance even if your FSIN implementation is good. If you're vectorizing your computation over 4 double floats or 8 single floats in an AVX register, it gets even worse for FSIN.
Moving between x87 and xmm registers is actually fairly cheap (it's through memory, so it's not free, but it's also not _that_ bad). FSIN itself is catastrophically slow.
Fair enough, and I imagine there may even be some store-to-load forwarding going on? There often is when a load follows a store, if I remember correctly. (Of course this will be microarchitecture-dependent.)
The question is if you wouldn't be better served with double-doubles today. You get ~100 bits of mantissa AND you can still vectorize your computations.
Sure. There should be a gcc flag to make "long double" become quadruple precision.
The thing is, my first programming language was x86 assembler, and the FPU was the most fun part. I spent weeks as a teenager writing almost pure 8087 code. I have a lot of emotional investment in that tiny rolling stack of extended-precision floats.
You're selectively quoting me - I said it's 'legacy emulated'. It's emulated using very long-form microcode, so basically emulated in software. I didn't say it was simply 'legacy'.
I’m completely out of my depth reading through these comments so I don’t have much of value to contribute, but I do think I can gently say I think the selective quoting was harmless, but deliberate to fit the shape of a harmless joke’s setup. I don’t think there was any intent to misrepresent your more salient information.
That's got nothing to do with musl per se; that's just the SunPro code that basically every C library uses. I'm sure the polynomials themselves (there's another one for the "crest" of the sine curve, and still more for log and exp and the rest of the family) date from somewhere in computing prehistory. Optimizing polynomial fits to analytic functions was something you could do on extremely early computers.
I compared several different methods of generating polynomials of different sizes for speed and precision (spoilers: Taylor series were the worst and minimax polynomials (Remez algorithm) were the best).
Another (surprising) thing I learned during the project was that the range reduction was just as important to the accuracy of the implementation as the polynomial, if not more so. If you think about it, you will realize that it's actually pretty difficult to quickly and accurately compute the sine of large numbers like 2^50.
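(Not from the repository; just to illustrate the idea for moderate arguments: Cody-Waite reduction splits pi/2 into a high part with trailing zero bits, so that n times the high part is exact for modest n, and the two subtractions lose far less than a single x - n*(pi/2) would. The constants below follow the well-known fdlibm-style 33-bit split; treat them as an assumption and verify before use.)

    #include <math.h>

    static const double PIO2_HI = 1.57079632673412561417e+00; /* top 33 bits of pi/2 */
    static const double PIO2_LO = 6.07710050650619224932e-11; /* pi/2 - PIO2_HI */

    double reduce(double x, int *quadrant)
    {
        double n = nearbyint(x * (2.0 / M_PI));
        *quadrant = (int)n & 3;
        return (x - n * PIO2_HI) - n * PIO2_LO;  /* r in about [-pi/4, pi/4] */
    }

For truly huge arguments like 2^50 this is no longer enough, and you need something like Payne-Hanek reduction with many bits of 2/pi.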
I also tried to directly optimize the coefficients for the accuracy of the polynomial on the required range, but that experiment was unsuccessful.
It's all there in the repository: the implementations, notes about the different polynomials used, and the accuracy/speed statistics for the different methods.
> I compared several different methods of generating polynomials of different sizes for speed and precision (spoilers: Taylor series were the worst and minimax polynomials (Remez algorithm) were the best).
I would have expected at least an LSQ approximation with a basis of Legendre polynomials thrown into the mix. I got that as basic homework in my numerics class once, after we had shown ourselves in class that [1, x, x², x³, ...] is not a very good basis to project things onto.
Are there libraries/tools that people use to do Remez/Chebyshev/etc. function expansions? I can do a basic Taylor series expansion by hand but I’m out of my depth with more sophisticated techniques.
This is a very nice article, and I think the three initial objectives are fully attained (simple enough, accurate enough, fast enough).
Some of the performance claims have a caveat, though: Lookup tables and micro-benchmarks don't mix well. At all.
I just added a simple loop that thrashes the L3 cache every 500 iterations (diff here: https://goonlinetools.com/snapshot/code/#sm4fjqtvjyn36dyednc...). Now the method recommended at the end (cos_table_0_001_LERP) is slower than glibc's cos() (while still having an accuracy that is more than 10^8 times worse)!
How long does the cache-thrashing loop take compared to 500 iterations? I'm not sure how much you can thrash the cache while still being performance-bottlenecked by a table-based implementation of a trig function.
Yes of course the cache thrashing takes a lot longer. But I modified the code to time only the cos() functions (see diff), using the rdtsc instruction. That's why it had to be every 500 iterations: If it was after each iteration, then rdtsc itself would become the bottleneck!
My point wasn't about how you were timing the code. What I meant is that in the real world, I don't think there are programs where this will be an issue, because programs that thrash their caches that badly won't be bottlenecked on trig. The advantage of table-based functions is that they are really fast if you are calling a lot of them at once, but that's also the only case where they need to be fast. If only 1 in 10,000 instructions is trig, it's OK if the trig is a little slower.
It's all well and good to have a library with fast sin/cos/tan functions, but always remember to cheat if you can.
For instance, in a game, it's common to have a gun/spell or whatever that shoots enemies in a cone shape, like a shotgun or burning hands. One way to code this is to calculate the angle between where you're pointing and the enemies (using arccos), and if that angle is small enough, apply the damage or whatever.
A better way is to take the vector where the shotgun is pointing and the vector to a candidate enemy, and take the dot product of the two. Pre-compute the cosine of the angle of effect and compare the dot product to that: if the dot product is higher, it's a hit; if it's lower, it's a miss. (You can get away with very rough normalization in a game, for instance using the rsqrt instruction in SSE.) You've turned a potentially slow arccos operation into a fast handful of basic float operations.
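A sketch of what that looks like (names made up):

    #include <math.h>

    typedef struct { float x, y; } vec2;

    /* Cone test via dot product: no arccos in sight. aim_unit is the
       (unit-length) facing vector; cos_half_angle is precomputed once. */
    static int in_cone(vec2 aim_unit, vec2 to_enemy, float cos_half_angle)
    {
        /* rough normalization is fine in a game; rsqrt tricks also work */
        float len = sqrtf(to_enemy.x*to_enemy.x + to_enemy.y*to_enemy.y);
        float dot = (aim_unit.x*to_enemy.x + aim_unit.y*to_enemy.y) / len;
        return dot >= cos_half_angle;
    }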
Or for instance, you're moving something in a circle over time. You might have something like
float angle = 0
for (...)
angle += 0.01
<something with sin(angle) cos(angle)>
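The replacement, presumably, hoists the trig out of the loop: precompute (cos 0.01, sin 0.01) once and rotate the running (cos, sin) pair each iteration, 4 multiplies and 2 adds:

    const float cd = cosf(0.01f), sd = sinf(0.01f);  /* one-time cost */
    float c = 1.0f, s = 0.0f;                        /* cos(0), sin(0) */
    for (...) {
        float cn = c * cd - s * sd;   /* cos(angle + 0.01) */
        s        = s * cd + c * sd;   /* sin(angle + 0.01) */
        c = cn;
        /* <something with s, c> */
    }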
And you've replaced a sin/cos pair with 6 elementary float operations.
And there's the ever popular comparing squares of distances instead of comparing distances, saving yourself a square root.
In general, inner loops should never have a transcendental or exact square root in them. If you think you need one, there's almost always a way to hoist it from the inner loop out into an outer loop.
I learned how to write programs from Scratch[0]. Back in the pre-1.4 days (IIRC) there were no native sin/cos functions, so they had to be "implemented" somehow to translate arbitrary angles into delta (x, y). The way I found people did this was as follows (for sin(A), cos(A)):
Invisible sprite:
[ when I receive "sine" ]:
[ go to center ]
[ turn A degrees ]
[ move 1 step forward ]
Now, anyone that wants to know sin(A), cos(A) just has to read that sprite's X and Y position.
I guess Scratch (or Logo) is Turing-complete, so someone could make a CPU built using a little turtle in a box, where all functionality is built out of Logo commands.
This is a perfectly nice Programming 101 example of a recursive function: if an angle is too large, calculate its sine using the sine of the half-angle; otherwise return the angle (or, if you're fancy, some simple polynomial approximation of the sine). I'm sure everyone did this in school (we did, in fact).
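A sketch of the idea; this variant uses the triple-angle identity sin(3h) = 3 sin(h) - 4 sin^3(h) instead of the half-angle form, since it avoids needing a companion cosine:

    #include <math.h>

    double rec_sin(double x)
    {
        if (fabs(x) < 1e-4)
            return x - x*x*x/6.0;      /* base case: small-angle Taylor */
        double h = rec_sin(x / 3.0);   /* recurse on a third of the angle */
        return h * (3.0 - 4.0*h*h);    /* rebuild via the sin(3h) identity */
    }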
I didn’t do this (or any programming, or any trigonometry) in school, but I’m appreciative of tidbits like this from people who did. I definitely feel comfortable with recursion, and have a stalled art project which could benefit from better understanding how to apply trig as I accumulate understanding in a much less structured way.
It's not obvious when you should switch over to an approximation (base case), so I'd say it's not a good example to introduce recursion. I have never seen it, and I did not do it in school.
This takes me back in time, to c. 1982 when I got curious about how computers calculated trig functions. Luckily I was using one of the first OSS systems (modulo AT&T lawyers): Unix V7. So I pulled up the source and took a look. You can still see it here: https://github.com/v7unix/v7unix/blob/master/v7/usr/src/libm...
There's a comment referencing "Hart & Cheney", which is this book: https://www.google.com/books/edition/Computer_Approximations...
Also luckily I had access to a library that had the book which I was able to read, at least in part.
This taught me a few things: that the behavior of a computer can be a white box, not a black box; the value of having access to the source code; to use comments effectively; and that for many difficult tasks, someone has probably written a book on the subject. Also that polynomials can approximate transcendental functions.
So 57 years ago, there weren’t any programming books. Well, there was one. McCracken and Dorn, Numerical Methods and Fortran Programming. A kindly childless couple, Ben and Bluma Goldin, who lived in the same apartment house in Brooklyn, bought me this book as a gift.
An axiomatic lesson from the book was distrust of Taylor expansions, and the necessity of moving the expansion point of any function to a favorable location.
Because harmonic functions are very well behaved, it should also be obvious that even a very coarse lookup table + interpolation should converge to high accuracy quickly. Finally, use of trigonometric identities to zero in on the desired angle is helpful. OK I’ll shut up now.
Going on a tangent: as I understand it, the first programming book was "The Preparation of Programs for an Electronic Digital Computer: With special reference to the EDSAC and the Use of a Library of Subroutines", by Maurice Wilkes, David Wheeler, and Stanley Gill (1951).
They are long dead. It was sad. They doted on me and my sister because they had no kids. He was a salesman. He knew how to get books in Manhattan. He used to get my sister classics with the most beautiful color plate illustrations. In the sixties, these had little value. Today?
The article uses a game movement function as a possible motivation for needing the cosine. If object.rotation is constant as the spiral animation seems to suggest, there is an important optimization for rapidly computing the sequence: cos(theta), cos(theta + delta), cos(theta + 2delta), cos(theta + 3delta), ...
This comes up in movement under a constant rate of rotation as well as in Fourier transforms. See [1] for details, but the basic idea uses the simple trig identities:

    cos(theta + delta) = cos(theta) - [a*cos(theta) + b*sin(theta)]
    sin(theta + delta) = sin(theta) - [a*sin(theta) - b*cos(theta)]
The values for a and b must be calculated, but only once if delta is constant:
a = 2 sin^2(delta/2)
b = sin(delta)
By using these relationships, it is only necessary to calculate cos(theta) and sin(theta) once for the first term in the series in order to calculate the entire series (until accumulated errors become a problem).
[1] Press, William H. et al., Numerical Recipes, Third Edition, Cambridge University Press, p. 219
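A sketch of the resulting loop (my illustration of the recurrence above, not NR's code):

    #include <math.h>

    /* Generates cos(theta + i*delta) for i = 0..n-1: one sin/cos pair up
       front, then a few multiplies and adds per term. This form is more
       resistant to roundoff than rotating by (cos(delta), sin(delta)). */
    void cos_series(double theta, double delta, double *out, int n)
    {
        double c = cos(theta), s = sin(theta);
        double a = 2.0 * sin(delta / 2) * sin(delta / 2);
        double b = sin(delta);
        for (int i = 0; i < n; i++) {
            out[i] = c;
            double cn = c - (a * c + b * s);
            s        = s - (a * s - b * c);
            c = cn;
        }
    }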
I teach lookup tables specifically for sine/cosine (we just use one with a phase shift and modulus) on the Gameboy Advance. The hardware has no floating point unit support, so we do it in 8.8 fixed point with 1-degree resolution.
It turns out the BIOS on the hardware can do sine/cosine, but you have to do that via inline assembly (swi calls) and it's no faster than the simple lookup table.
This is all just to get terms for the affine matrix when rotating sprites/backgrounds.
Horner does not conflict with vectorization in this case. If you read the code, it actually uses Horner; the vectorizability applies when you call the function in a loop.
I have a few cases where I think a Padé approximation could really help me, but I've never been able to figure out, given a function, how to get the coefficients.
People reach for the Taylor series because it’s what they remember from calculus, but there’s the Remez algorithm for finding the polynomial which minimizes the maximum error.
If you’re measuring “how good is my cosine function” by calculating the maximum error, then it makes sense to use an optimization algorithm that minimizes that error.
The Remez algorithm is rather elegant. The basic idea is that you come up with a polynomial, figure out where the local error maximums are, and then calculate a new polynomial with better local error maximums in the same locations. Repeat. With the right conditions it converges quickly and to a global minimum.
For example, let’s say that you try to come up with a quarter-wave cosine approximation. You end up with a polynomial that has local error maximums at x=[0, 0.1, 0.3, 0.7, 1.4, 1.57], and various values of ∆y. You create a new polynomial and design it to have exact errors of ±∆y at the same x locations… the errors will alternate in sign as the polynomial goes above and below the target function. However, when you design the function, you’re just doing a polynomial fit through these (x,y) coordinates, so the new polynomial will have local error maximums at different x coordinates. Note that the boundaries are also error maximums. You end up with exactly the right number of degrees of freedom to solve the equation.
Each time through the loop you get a new set of x coordinates. Under the right conditions, these converge. Eventually the ∆y errors are all nearly equal to ± some global error maximum, with alternating signs.
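To make "a polynomial fit through these (x,y) coordinates" concrete, here's a sketch of the linear solve at the heart of one iteration (my own illustration; the degree is arbitrary). The unknowns are the coefficients plus the levelled error E, whose sign alternates across the points:

    #include <math.h>

    #define DEG 5            /* polynomial degree (arbitrary here) */
    #define PTS (DEG + 2)    /* alternation points */

    /* One Remez step: given alternation points x[0..PTS-1], solve
         a0 + a1*x[i] + ... + aDEG*x[i]^DEG + (-1)^i * E = f(x[i])
       for the coefficients a[] and the levelled error E. A full
       implementation would then relocate the error extrema and repeat. */
    void remez_step(const double x[PTS], double (*f)(double),
                    double a[DEG + 1], double *E)
    {
        double A[PTS][PTS + 1];              /* augmented matrix */
        for (int i = 0; i < PTS; i++) {
            double p = 1.0;
            for (int j = 0; j <= DEG; j++) { A[i][j] = p; p *= x[i]; }
            A[i][DEG + 1] = (i % 2) ? -1.0 : 1.0;  /* alternating sign on E */
            A[i][PTS] = f(x[i]);                   /* right-hand side */
        }
        /* Gauss-Jordan with partial pivoting (fine at this tiny size) */
        for (int c = 0; c < PTS; c++) {
            int piv = c;
            for (int r = c + 1; r < PTS; r++)
                if (fabs(A[r][c]) > fabs(A[piv][c])) piv = r;
            for (int k = 0; k <= PTS; k++) {
                double t = A[c][k]; A[c][k] = A[piv][k]; A[piv][k] = t;
            }
            for (int r = 0; r < PTS; r++) {
                if (r == c) continue;
                double m = A[r][c] / A[c][c];
                for (int k = c; k <= PTS; k++) A[r][k] -= m * A[c][k];
            }
        }
        for (int j = 0; j <= DEG; j++) a[j] = A[j][PTS] / A[j][j];
        *E = A[DEG + 1][PTS] / A[DEG + 1][DEG + 1];
    }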
I used this to approximate sine and exp functions for an audio synthesis tool I'm working on; the nice thing about polynomials over lookup tables is that they are easier to convert to SIMD.
Here is the NumPy code I used to find the coefficients. Note that this is not taken from any kind of numerical recipes cookbook and I can’t really vouch for the algorithm I used from a numerical standpoint—I only checked that it works empirically.
Audio stuff uses different methods depending on the application. Additive oscillators need to be accurate and can use some closed forms and series; mostly people use Taylor and quadratic interpolation, as the article shows. For tables, remember you only really need to compute and store one quadrant and shift it around. But for LFOs, people use all manner of quick-and-dirty methods like Bhaskara.
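Bhaskara I's approximation, for reference (good to about 0.0016 absolute error on [0, pi], which is plenty for an LFO):

    #include <math.h>

    static double bhaskara_sin(double x)   /* x in [0, pi] */
    {
        double y = x * (M_PI - x);
        return 16.0 * y / (5.0 * M_PI * M_PI - 4.0 * y);
    }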
I did the same thing for sqrt() seven years ago. It benchmarked at about five times faster than math.h/libm, but the trade-off was a limited input range: my version would not produce accurate results for inputs greater than 2^16. It did work very well for generating real-time laser-pointing X/Y coordinates for an Archimedean spiral.
Newton's method involves division, which is a problem on an embedded platform with a limited FPU that can only do multiplication. I tried several classic approaches and benchmarked all of them against the function in libm. I reviewed what I did and it turns out that my own memory is not that great. My version (nsqrt) is limited to input values of less than 2^14 and not 2^16 as I said above. Also, it's only 50% faster and not 5x as I said above.
Even a 50% increase in speed was well worth my effort though.
* Replacement sqrtf() function. This function runs about 50% faster than
* the one in the TI math library. It provides identical accuracy, but
* it is limited to input values less than 2^14. This shouldn't be a problem
* because the passed argument will always be in the range of
* 0-(max spiral scan period). The spiral scan period is presently about
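(For reference, a sketch of the standard division-free route, not necessarily what the TI code does: Newton iteration on the reciprocal square root needs only multiplies, and then sqrt(x) = x * (1/sqrt(x)). The magic-constant seed is the well-known bit trick; a small table works too.)

    static float nsqrt(float x)
    {
        union { float f; unsigned u; } v = { x };
        v.u = 0x5f3759df - (v.u >> 1);        /* rough 1/sqrt(x) guess */
        float y = v.f;
        for (int i = 0; i < 3; i++)           /* each step ~doubles the accuracy */
            y = y * (1.5f - 0.5f * x * y * y); /* Newton on 1/sqrt: multiplies only */
        return x * y;
    }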
You are doing the polynomial evaluation wrong for the Taylor series. First, you should be using Horner's method to reduce the number of multiplications, and then you should be using FMA to increase the precision of the result. https://momentsingraphics.de/FMA.html
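A sketch of what that looks like for a generic coefficient array (fma() is C99 <math.h>):

    #include <math.h>

    /* Horner's rule with fma(): one rounding per step instead of two,
       and fewer multiplies than evaluating each power separately. */
    double horner_fma(const double *c, int n, double x)  /* c[0] + c[1]x + ... */
    {
        double p = c[n - 1];
        for (int i = n - 2; i >= 0; i--)
            p = fma(p, x, c[i]);   /* p*x + c[i] with a single rounding */
        return p;
    }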
If you want sinusoidal oscillation over time, then you can integrate the second-order differential equation x'' = -ω² x instead. This only costs a few arithmetic instructions per time step. On the other hand, it adds dependency on a state other than the current time, and you must be careful to use energy-conserving integration.
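A sketch of one such energy-conserving scheme, semi-implicit (symplectic) Euler:

    #include <stddef.h>

    /* Integrates x'' = -w*w*x. Updating v before x keeps the energy bounded,
       so the oscillation neither decays nor blows up the way naive forward
       Euler does. */
    void oscillate(double w, double dt, double *out, size_t n)
    {
        double x = 1.0, v = 0.0;              /* start at phase 0: cos = 1 */
        for (size_t i = 0; i < n; i++) {
            v -= w * w * x * dt;              /* velocity first... */
            x += v * dt;                      /* ...then position with the new v */
            out[i] = x;                       /* ~ cos(w * (i+1) * dt) */
        }
    }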
There are two ways to look at this, from the systems perspective this is computing the impulse response of a critically stable filter with poles at e^(+/-jω). Geometrically, it's equivalent to starting at (1, 0) and rotating around the unit circle by ω radians each time step. You can offset to arbitrary phases.
It's not suitable for numerous cases (notably if the frequency needs to change in real time), but provided that the coefficients sum to `1.0`, it's guaranteed stable and will not alias.
I have been looking for papers with _very clean and serious_ mathematical proofs of accuracy for approximants of transcendental functions (Remez/Chebyshev/etc.).
Any pointers?
I haven’t seen this version mentioned in the thread - if you don’t need a lot of precision, here is a simple 4 line version[1] and here’s how it works[2].
Slightly off-topic, but kudos to the author for explaining the process extremely clearly. As someone without experience in low-level programming, I found this quite understandable (albeit with uni-level maths knowledge).
It’s an old article, so… guess I won’t beat a dead horse as much.
* The range reduction is, well, probably worse than acos. Look for Payne–Hanek–Corbett. Not an issue for games, I assume.
* The LERP is nice and all, but Qt actually has an even better one: instead of doing a LERP, you use the local derivative, which is also available from the lookup table (sketch below). https://stackoverflow.com/a/52841086
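A sketch of that derivative trick (my illustration, assuming a table size divisible by 4 so the quarter-period shift lands on an entry):

    #include <math.h>

    /* sin via table value plus local slope: sin(x0 + h) ~ sin(x0) + cos(x0)*h.
       The cosine is the same table shifted a quarter period, so no second
       table is needed. */
    #define SIZE 1024

    double table_sin(const double tab[SIZE], double x)  /* x in [0, 2*pi) */
    {
        double pos = x * SIZE / (2.0 * M_PI);
        int i = (int)pos;
        double h = (pos - i) * (2.0 * M_PI / SIZE);     /* radians past entry i */
        int j = (i + SIZE / 4) % SIZE;                  /* cos(x0) = sin(x0 + pi/2) */
        return tab[i] + tab[j] * h;
    }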
Others have already mentioned the Remez and the Horner stuff. Won’t repeat here.
A lot of BASIC implementations that had software support for floating-point numbers implemented it like this. Microsoft BASIC used a 6th-degree polynomial, IIRC.
I haven't checked, but off the top of my head this could be because this optimisation changes the precision of the result enough for it to be an illegal optimisation.
Some compilers have options such as -ffast-math that result in ignoring some precision issues. This could be worth a try as an alternative to manual optimisation.
CORDIC beautifully explained; I have used this approach on fixed-point DSP processors in another life. Its simplicity and elegance are beautiful (on controllers with no FP).
Can someone explain to me why calculating cosine seems to be one of the most complicated tasks? I have tried to read about it a few times and never really found an explanation for calculating it which makes sense.
I think an off-zero Taylor series along with a lookup table should at least be tried. You can approximate cos(x+a) using only cos(a) and sin(a), and the series converges really fast.
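A sketch of that, with an arbitrary 256-entry table and two Taylor terms per factor (assumes the argument is already reduced to [0, 2pi)):

    #include <math.h>

    /* Off-zero idea: nearest table angle a, small residual h, then
       cos(a + h) = cos(a)*cos(h) - sin(a)*sin(h). With 256 entries,
       |h| <= pi/256, so two terms per factor already go a long way. */
    #define TN 256
    static double CA[TN], SA[TN];

    void init_tables(void) {
        for (int i = 0; i < TN; i++) {
            CA[i] = cos(2 * M_PI * i / TN);
            SA[i] = sin(2 * M_PI * i / TN);
        }
    }

    double cos_offzero(double x)        /* x in [0, 2*pi) */
    {
        int i = (int)(x * TN / (2 * M_PI) + 0.5);   /* nearest table angle */
        double h = x - 2 * M_PI * i / TN;           /* residual, |h| <= pi/TN */
        double ch = 1 - h * h / 2;                  /* cos(h), 2 terms */
        double sh = h * (1 - h * h / 6);            /* sin(h), 2 terms */
        return CA[i % TN] * ch - SA[i % TN] * sh;
    }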
I remember doing this in Macromedia Flash as I wanted to make stuff go in circles and the language didn't have any higher math functions back then. Good ole times.
>> lookup tables and lerp are the most common solution I've seen
The author missed one nice table method. When you need sin() and cos(), split the angle into a coarse and a fine angle. Coarse might be 1/256 of a circle and fine 1/65536 of a circle (whatever that angle is). You look up the sin/cos of the coarse angle in a 256-entry table. Then you rotate that by the fine angle, which uses another 256 sin & cos entries. The rotation between coarse values is more accurate than linear interpolation and gives almost the same accuracy as a 65536-entry table using only 768 entries. You can use larger tables, or even another rotation by finer angle steps, to get more accuracy.
Are you trying to reduce the error in the linear interpolation by describing the convex curve between the endpoints?
The derivative of the cosine is the sine, and the derivative of the sine is the cosine... so I'd expect the fine-angle values needed to interpolate 0.15 between 0.1 and 0.2 degrees and those needed to interpolate 45.15 between 45.1 and 45.2 degrees to be very different.
Yeah I thought that might not be clear enough. Let's say you want 0.1 degree accuracy but don't want a 3600 entry table. Make one table for every 1 degree. You get to use the same table for sin and cos by changing the index. That is the coarse table with 360 entries. Then make another table with sin and cos values for every 0.1 degrees, or specifically 0.1*n for n going from 0-9. This is another 20 values, 10 sin and 10 cos for small angles.
Take your input angle and split it into a coarse (integer degrees) and fine (fractional degrees) angle. Now take the (sin(x),cos(x)) from the coarse table as a vector and rotate it using the sin & cos values of the fine angle using a standard 2x2 rotation matrix.
You can size these tables however you need. I would not use 360 and 10 for their sizes, but maybe 256 and 256. This can also be repeated with a 3rd table for "extra fine" angles.
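A sketch with the 256/256 sizing (my illustration):

    #include <math.h>

    /* 16-bit angle: top 8 bits coarse, low 8 bits fine. The fine table only
       spans one coarse step, so both tables stay small. */
    #define CN 256
    #define FN 256
    static double Cc[CN], Cs[CN], Fc[FN], Fs[FN];

    void init_tables(void) {
        for (int i = 0; i < CN; i++) {
            double a = 2 * M_PI * i / CN;
            Cc[i] = cos(a); Cs[i] = sin(a);
        }
        for (int i = 0; i < FN; i++) {
            double a = 2 * M_PI * i / (CN * FN);   /* 1/65536 of a circle */
            Fc[i] = cos(a); Fs[i] = sin(a);
        }
    }

    void sincos16(unsigned idx, double *s, double *c)  /* idx in [0, 65536) */
    {
        unsigned hi = (idx >> 8) & 0xff, lo = idx & 0xff;
        *c = Cc[hi] * Fc[lo] - Cs[hi] * Fs[lo];  /* rotate coarse by fine */
        *s = Cs[hi] * Fc[lo] + Cc[hi] * Fs[lo];
    }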
Yeah, lerping into a lookup table is common in audio and (old school) games.
A common trick to make that faster is to not use doubles and radians to represent the phase. Instead, represent phase using an integer in some power-of-two range. That lets you truncate the phase to fit in a single period using a bit mask instead of the relatively slow modulo.
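A sketch, with the frequency and sample rate as assumed inputs:

    #include <stdint.h>

    /* Phase held in a 32-bit unsigned integer, one period = 2^32. Unsigned
       overflow wraps for free, so there is no fmod at all; with a smaller
       power-of-two range you'd mask with (RANGE - 1) instead. */
    static uint32_t phase;

    double next_sample(const double tab[1024], double freq, double sample_rate)
    {
        uint32_t step = (uint32_t)(freq / sample_rate * 4294967296.0);
        double y = tab[phase >> 22];   /* top 10 bits index the 1024-entry table */
        phase += step;                 /* wraps mod 2^32 */
        return y;
    }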
Julia uses a very similar approach to musl here. Reduction mod 2pi followed by polynomial kernel. We use tables for `exp` and `log`, but the trig functions have good reductions so we use those instead. The implementation is here https://github.com/JuliaLang/julia/blob/03af78108afe6058f54c... for anyone interested.
In the Julia repl, just type '@less sin(pi)' to view the source code of sin. I don't know of any other programming language that makes it this easy to look at its internals.
> So we are really only calculating cosine from 0 to pi.
You could go further and only do 0 to pi/2, as it is mirrored and flipped for pi/2 to pi.
Or you could go even further and do only 0 to pi/4, using a simple trig identity and your presumably parallel-developed sin function for the cos of pi/4 to pi/2.
> One of the implementations is nearly 3x faster than math.h
I think you should be able to get good results by precalculating the cosines for a set of angles and then using the formula for the cosine of a sum of two angles, combined with the Taylor series. The series converges very quickly for very small angles, right?
I don't know if this benchmark is trustworthy. Using a lookup table pollutes the cache. It really needs to be measured while interleaved with the other parts of the code to get a good performance picture.
A LERP lookup table is fast; only somewhat slower, and probably quite a bit more accurate, would be a lookup table interpolated with a Catmull-Rom spline.
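For reference, the standard Catmull-Rom kernel between table entries p1 (t=0) and p2 (t=1), using the two neighbors p0 and p3; cubic instead of linear, same table:

    double catmull_rom(double p0, double p1, double p2, double p3, double t)
    {
        return 0.5 * ( 2.0 * p1
                     + (-p0 + p2) * t
                     + (2.0*p0 - 5.0*p1 + 4.0*p2 - p3) * t * t
                     + (-p0 + 3.0*p1 - 3.0*p2 + p3) * t * t * t );
    }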