The Taylor expansion locally fits a polynomial based on the first n derivatives.
If you want to find the "best" nth-degree polynomial to approximate the sine function, functional analysis gives you the tools to solve that optimization in closed form.
By selecting an appropriate norm (in function space) you can minimize either the maximum error or the error integral over some range (e.g. the 0-\pi range).
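In symbols (just restating those two options; here P_n is the set of polynomials of degree at most n, and the interval is only the example range from above):

\min_{p \in P_n} \; \max_{x \in [0,\pi]} |f(x) - p(x)| \qquad \text{or} \qquad \min_{p \in P_n} \; \int_0^{\pi} |f(x) - p(x)|\,dx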
Here's a video on the subject. You might want to watch earlier ones also for more context: https://www.youtube.com/watch?v=tMlKZZf2Kac&list=PLdkTDauaUn...
Full disclosure, this is my university lecture on optimization that was recorded during Covid.
It’s worth noting that for fixed-precision math libraries, we usually want the best approximation in the (possibly weighted) L-inf sense, which doesn’t generally have a closed-form solution, so we use iterative methods to generate approximations instead.
Seems also worth noting that although the usual algorithm for finding the minimax polynomial is the very old Remez algorithm, I'm playing with a minimax finder that relies on what I think is a novel algorithm. My algorithm seems better than Remez, and certainly beats it (as in, it finds the solution when Remez cannot) in many cases, even though I don't have a formal analysis of the algorithm.
The main idea is to use linear programming with some chosen points from the interval to look for the polynomial that approximates the chosen function over that interval. Unlike Remez, this gives control over which individual points from the interval are chosen as representatives, which makes it possible to avoid ill-behaved points. An example of where this leads to improvements over Remez is optimizing relative error: Remez trips over zeros of the function, because they cause singularities in the relative error, whereas my algorithm works (by avoiding the ill-behaved points) as long as the discontinuities are removable.
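My reading of that setup, written out as the standard discretized minimax LP (the sample points x_i, weight w, and degree n are placeholders, not the poster's actual formulation):

\min_{c_0,\dots,c_n,\ \varepsilon} \varepsilon \quad \text{s.t.} \quad -\varepsilon \le w(x_i)\Big(\textstyle\sum_{j=0}^{n} c_j x_i^{\,j} - f(x_i)\Big) \le \varepsilon, \qquad i = 1,\dots,m

Both constraint rows are linear in the coefficients c_j and in ε, so an off-the-shelf LP solver applies; choosing w(x_i) = 1/|f(x_i)| turns it into a relative-error fit, which is exactly where being able to skip sample points at the zeros of f matters.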
My algorithm is also a lot more flexible than Remez's; for example, it allows optimizing over multiple intervals instead of a single one.
1. Their LP constraint matrix is just a Vandermonde matrix according to the linked paper, while my code formulates the LP problem in a smarter way that should make the job much easier for the LP solver. I guess this improvement would be easy to adopt on their end.
2. The linked paper focuses on tiny types like BFloat16, but they mention scalability with large data types as a future goal. My focus was on larger types like IEEE-754 binary64 (double in C) from the start.
3. They account for range reduction and output compensation! This seems really nice, I hope that whatever they did can be applied to my approach to improve it.
I’d like to add to this that iterative solutions are wonderful, and often better than closed-form solutions in a lot of ways. Of course you know that already :-)
oh... also, the sine table isn't just a performance thing. Hedgewars does fixed-point custom math for deterministic lockstep, so it keeps the math simple and predictable across compilers/hardware.
One of the things I loved most about reading kernel and libc (or rather, libm) sources was the floating-point and FP emulation code. I thought it was immensely cool to see that what others just expected an FPU instruction to do, someone had not only written out in C, but had also documented with comments explaining the mathematics (rendered in ASCII), often with numerical analysis about error propagation, accuracy, etc.
So in the implementation of cos_table_*_LERP, you technically did a two-step range reduction:
1. Reduce x = x mod 2*pi
2. Reduce index = floor(x / 10^-n), with the interpolation fraction i - index = 10^n * (x mod 10^-n), where i = 10^n * x
With limited input range and required precision as in the tests, you can combine these 2 range reduction steps:
1. Choose the reduced range as a power of 2 instead of a power of 10 for a cheaper modulus operation, say 2^-N = 2^-7.
2. Avoid the division in modd(x, CONST_2PI) by multiplying by 2^N / pi.
3. Avoid the round trip double -> int -> double by using the floor function / instruction.
Here is the updated version of cos_table_*_LERP which should have higher throughput and lower latency:
double cos_table_128_LERP(double x) {
    x = fabs(x);                            /* cos is even */
    double prod = x * TWO_TO_SEVEN_OVER_PI;
    double id = floor(prod);
    x = prod - id;                          /* after this step, 0 <= x < 1 */
    int i = ((int)id) & 0xFF;               /* i = id mod 2^8 */
    /* assumes COS_TABLE holds 2^8 + 1 samples covering one full period */
    return lerp(x, COS_TABLE[i], COS_TABLE[i + 1]);
}
You can also optimize lerp a bit more with the formula:
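One common candidate (an assumption on my part) is to fold the interpolation into a single multiply-add, e.g. with fma():

#include <math.h>

/* Two hypothetical lerp variants; a and b are the table endpoints, t is in [0, 1).
 * The second form replaces the two multiplies of (1-t)*a + t*b with one
 * multiply-add, and fma() keeps it to a single rounded operation where the
 * hardware supports it. */
static inline double lerp_basic(double t, double a, double b) {
    return (1.0 - t) * a + t * b;
}

static inline double lerp_fma(double t, double a, double b) {
    return fma(t, b - a, a);    /* a + t*(b - a) in one fused step */
}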
We do employ this range reduction strategy in a more accurate way for trig functions in LLVM libc:
- with FMA instructions: https://github.com/llvm/llvm-project/blob/main/libc/src/math/generic/range_reduction_fma.h
- without FMA instructions: https://github.com/llvm/llvm-project/blob/main/libc/src/math/generic/range_reduction.h
> We can do better since the cosine values are equivalent every multiple of pi, except that the sign flips.
There is one more step to take: each π/2-long segment has an identical shape; they just point up or down (as you already noticed), and left or right. So you can reduce your basic domain not just to 0...π, but to 0...π/2.
And you can keep going with this logic, building cos from 0-x by recursively calculating cos from 0-x/2 and then mirroring over some line if needed. Eventually you hit the machine precision limit, a.k.a. the base case cos(0) = 1.
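As a concrete sketch of that folding (my own code, assuming a hypothetical cos_kernel() that is only accurate on [0, pi/2], and M_PI/M_PI_2 from math.h):

#include <math.h>

double cos_kernel(double x);    /* assumed accurate on [0, pi/2] only */

double cos_folded(double x) {
    x = fmod(fabs(x), 2.0 * M_PI);      /* cos is even and 2*pi periodic */
    if (x > M_PI)
        x = 2.0 * M_PI - x;             /* fold [pi, 2*pi] onto [0, pi] (mirror) */
    if (x > M_PI_2)
        return -cos_kernel(M_PI - x);   /* fold [pi/2, pi] onto [0, pi/2], flip the sign */
    return cos_kernel(x);
}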
And still further: the function can be piecewise decomposed into separate polynomials, which is what the pervasive SunPro code in everyone's libc does. There's one "mostly parabolic" function to handle the curvy top, and another "mostly linear" one to handle the zero crossing region.
Pretty much every implementation does reduction mod pi/2. Switching polynomials is basically free (and halving the domain significantly reduces the number of terms needed).
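A minimal sketch of that mod-pi/2 scheme (my own illustration, with hypothetical kernels in the spirit of fdlibm's k_cos/k_sin; the naive reduction below loses accuracy for large |x|):

#include <math.h>

double k_cos_approx(double r);  /* assumed accurate on [-pi/4, pi/4]: the "mostly parabolic" piece */
double k_sin_approx(double r);  /* assumed accurate on [-pi/4, pi/4]: the "mostly linear" piece */

double cos_dispatch(double x) {
    int q = (int)floor(x / M_PI_2 + 0.5);   /* nearest multiple of pi/2 (naive for large x) */
    double r = x - q * M_PI_2;              /* remainder lands in [-pi/4, pi/4] */
    switch (q & 3) {                        /* quadrant selects kernel and sign */
    case 0:  return  k_cos_approx(r);
    case 1:  return -k_sin_approx(r);
    case 2:  return -k_cos_approx(r);
    default: return  k_sin_approx(r);
    }
}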
If doing signal processing, an optimisation is to use an N-bit integer to represent the range 0 to 2*pi. Some example points in the mapping: 0 -> 0, 2^(N-2) -> pi/2, 2^(N-1) -> pi, 2^N (wraps to 0) -> 2*pi (wraps to 0).
If your lookup table has an M-bit index, the index into the lookup table is calculated with: index = (unsigned)theta >> (N-M), where theta is the N-bit integer representing the angle.
The fractional part, which can be used for interpolation, is: theta & ((1<<(N-M))-1).
If you choose M = N = word size (16 bits is often nice), the angle can be used directly as an index. With 16 bits, accuracy is typically good enough without interpolation, and the entire trig operation reduces to an array indexing operation.
This minimises the overhead of converting an angle to a table index.
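A sketch of that indexing with concrete numbers (N = 16, M = 8; the table name and contents are assumptions, and interpolation is left out):

#include <stdint.h>

#define N 16
#define M 8
extern const int16_t SIN_TABLE[1 << M];   /* assumed: 256 precomputed sine samples over one period */

int16_t sin_bam(uint16_t theta) {         /* theta: 0x0000 = 0, wraps back to 0 at 2*pi */
    uint16_t index = theta >> (N - M);              /* top M bits pick the table entry */
    /* uint16_t frac = theta & ((1u << (N - M)) - 1);  low bits could drive interpolation */
    return SIN_TABLE[index];
}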
This method has 2 error terms - phase error and amplitude error. You can only represent sinusoids with a resolution of 2pi/2^N rad/samp (same phase error if you don’t do phase truncation). Maybe that’s enough for most applications, but it’s probably significantly more error than math libraries that calculate cos. Totally depends on your application.
Why just signal processing? Surely this is good for storing e.g. rotations of characters in games, longitude, in fact almost anywhere where you want an angle where a full rotation is the same as no rotation at all, and more precision as you approach 0° is not useful.
Accumulated error can matter, which is where you'd want that precision near 0°. If something is making tiny turns each frame for thousands of frames, floating-point inaccuracy can add up. (A smarter mathematical approach might remedy this, or might not; think of some game component that is supposed to jitter randomly each frame, maybe some particle system.) It's not common, but it's within hypothetical feasibility that the precision would be useful.
He is describing deterministic uniform round-to-nearest quantization. The most common smarter mathematical approach would be randomized rounding, which solves this issue.
(Issue being that the deterministic rounding is a biased estimator of the sin function)
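A minimal sketch of that randomized rounding (names are mine; rand() is used only for brevity, a real implementation would pick a better RNG): round up with probability equal to the fractional part, so the expected value of the output equals the input and the quantization is unbiased.

#include <math.h>
#include <stdlib.h>

double stochastic_round(double x) {
    double lo = floor(x);
    double frac = x - lo;                                  /* in [0, 1) */
    double u = (double)rand() / ((double)RAND_MAX + 1.0);  /* uniform in [0, 1) */
    return (u < frac) ? lo + 1.0 : lo;                     /* E[result] = x */
}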
There are a bunch of places where randomized rounding works better. But is there any computer hardware that implements this? Obviously the internal precision (of the random element) can't be infinite, but a couple of extra bits would seem worth the complexity.
Tried entering it in Grapher (on the Mac) and got what looked more or less like a parabola. My algebra is rusty — but do you kind of have an x-squared in there?
The author's Taylor series looked salvageable to me. The range between -Pi/2 and Pi/2 looked fine. That usable range can be re-used to serve all other portions of the circle.
Add a conditional that negates the result (and reflects the argument) for the other quadrants and you're golden.
Taylor expansion is usually not so good for this; you'll fare better with either Legendre or Chebyshev polynomials. Robin Green has some excellent material on the subject, which I am linking below.
Trig functions are where accuracy and performance go to die. Accumulated error is a thing, so when optimizing always consider exactly how you're going to be using those functions so you make the 'right' tradeoffs for your application. One size definitely doesn't fit all here, so test and experiment.
A good idea is to not compute the values of cos from 0-2pi, but to further reduce the range, using
cos(a) = cos(-a), and cos(2a) = 2cos^2(a) - 1, or cos(a+pi/4) =...
So we really only ever need to be able to compute cos in the range 0-pi/4.
Then for further accuracy we can do the Taylor expansion around pi/8 (or other approximations).
Finally, the number of terms needed for a fixed accuracy grows with the distance from pi/8, which is why expanding about the midpoint of the range keeps that count small.
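A sketch of how the double-angle identity extends a [0, pi/4] kernel to [0, pi/2] (cos_quarter() is a hypothetical kernel; M_PI_4 comes from math.h):

#include <math.h>

double cos_quarter(double x);   /* assumed accurate on [0, pi/4] only */

double cos_half(double x) {     /* x in [0, pi/2] */
    if (x <= M_PI_4)
        return cos_quarter(x);
    double c = cos_quarter(0.5 * x);   /* x/2 lands in (pi/8, pi/4] */
    return 2.0 * c * c - 1.0;          /* cos(x) = 2*cos(x/2)^2 - 1 */
}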
I think all real implementations use a lookup table... Small lookup tables (e.g. 8 elements) work out far smaller in silicon area than even one multiplier...
The math.h cos from the article is a pretty trivial counterexample, so can you give an example? I doubt linear interpolation between the 8 values is enough, but cubic probably would be. This would also be close to e.g. 8 third-order Taylor approximations of the function, and a comparison would be interesting.
This is a fine introduction to how you even approach the problem of computing transcendental functions.
It would have been nice to have some discussion of accuracy requirements rather than just eyeballing it and saying “more accuracy than I need”. This is a spot where inexperienced devs can easily get tangled up. An attitude like “ten digits? That’s so many! I’m only making a game, after all” is exactly the sort of thing that gets you into trouble if you start accumulating errors over time, and this is particularly easy to do with trig functions.
6.7948900e-9 which has only the first 5 digits correct (if you assume it would have to have been 6.79490 to get 6 digits correct since that is what the correct value rounds to)
It's unusual, outside x86, for these functions to be hardware accelerated. So just about every libm has to do it in software.
Nearly all libm implementations around are based on Sun's fdlibm from the '80s and '90s, including a bunch of the variants cited below, the *BSDs, etc. They are basically updated slightly, but you can see their heritage in the structure of the code. The original is found on netlib these days: https://www.netlib.org/fdlibm/k_cos.c
14th order polynomial, but only uses ~7 terms. It's supposed to have error < 1 ULP. For fdlibm, it's pretty readable compared to some of the other fun ones. I seem to remember sqrt being a bit gnarly.
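The overall shape is roughly this (a sketch only: the coefficients below are plain Taylor terms for illustration, whereas fdlibm uses specially fitted minimax coefficients and extra correction terms to stay under 1 ULP):

static double cos_kernel_sketch(double x) {   /* |x| <= pi/4 after range reduction */
    double z = x * x;                         /* even polynomial, so work in x^2 */
    double p = 1.0 / 40320.0;                 /* x^8 term */
    p = p * z - 1.0 / 720.0;                  /* x^6 term */
    p = p * z + 1.0 / 24.0;                   /* x^4 term */
    return 1.0 + z * (-0.5 + z * p);          /* 1 - x^2/2 + x^4/24 - x^6/720 + x^8/40320 */
}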
I would say the benchmarks with lookup tables are misleading. In the benchmark the CPU will keep the table in cache, but in a real program that cache would have to be shared with the rest of the code/data. This would either kill the performance of cosine or of the rest of the app.
If you like this sort of thing there’s a great book: Methods and Programs for Mathematical Functions by Stephen Moshier. It covers the computation of all sorts of special functions. The code is available in the cephes library but the book may be out of print.
Nice. I'd love to see how this changes when you have SIMD and multiple cosines to compute in parallel. Also when you have to compute sine and cosine simultaneously which is often the case, and then you may be more interested in polar error than cartesian error.
You can do this with: https://github.com/RJVB/sse_mathfun
Those functions do not have the same safety as libc, but even for scalar sin/cos they can be faster.
Note that you can combine the lookup table with the Taylor series expansion...
And you can even use the same lookup table for each. That means with 2 table lookups in a 32 entry table, a single multiply and add (and a few bit shifts), you can get ~9 bits of precision in your result, which is fine for most uses. It also probably makes a sin operation take ~1 clock cycle on most superscalar architectures, as long as your workload has sufficient instruction parallelism.
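Here is how I read that (my own sketch, not the poster's code; the table name, its size, and the assumption that x is already reduced to [0, 2*pi) are mine): split the angle into a coarse table index and a small remainder b, then use cos(a + b) ≈ cos(a) - b*sin(a), pulling sin(a) from the same cosine table a quarter period ahead.

#include <math.h>

#define TABLE_SIZE 32
extern const double COS_TABLE[TABLE_SIZE];   /* assumed: COS_TABLE[k] = cos(2*pi*k/TABLE_SIZE) */

double cos_table_plus_taylor(double x) {     /* x assumed already reduced to [0, 2*pi) */
    double t = x * (TABLE_SIZE / (2.0 * M_PI));
    int k = (int)t;                                      /* coarse index */
    double b = (t - k) * (2.0 * M_PI / TABLE_SIZE);      /* small remainder, in radians */
    double c = COS_TABLE[k & (TABLE_SIZE - 1)];
    double s = COS_TABLE[(k + 3 * TABLE_SIZE / 4) & (TABLE_SIZE - 1)];  /* sin(a) = cos(a + 3*pi/2) */
    return c - b * s;                                    /* first-order Taylor correction */
}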
Note that a smaller table typically works out faster because 32 entries fit in the cache, whereas repeated random entry into a 1024 entry table would probably kick a bunch of other stuff out of the cache that you wanted.
Why not store the velocity as a 2D vector instead? Then you still have to use cos/sin to compute this vector, but at least you don't need it every frame. Plus, often you don't need cos/sin to compute this vector either, since the forces that act on the velocity can themselves have an x and y component you can directly add to it.
Sometimes it's just easier to work and think in polar coordinates. If that's the case, then (speed, rotation) is just as valid a choice as (vx, vy). For smaller games, the cost of computing sin and cos every frame is so astronomically small that it's not worth concerning yourself over.
Hmm, I usually find polar coordinates harder to reason with due to the gimbal lock :)
In 2D there's no gimbal lock of course, but there's still a number that behaves differently than the others, the angle wrapping around in range 0..2pi!
And of course the article is about fast computation of sin/cos so the option of not computing it should be considered
The equivalent of polar coordinates in 3d are quaternions. Quaternions do not have gimbal lock; the problem only occurs for Euler angles (which should never ever be used for anything, ever).
> The equivalent of polar coordinates in 3d are quaternions
I’d disagree. Quaternions are the 3D equivalent (yes, I’m aware of the subtleties here, but speaking in a rough sense) to using complex numbers to represent rotations.
Euler angles are much closer in spirit to polar coordinates, but unfortunately have gimbal lock.
A single number to represent a rotation in 2D, vs a single number to represent rotations in 4D, of which 2D and 3D are special cases.
Compared to 3 numbers and a series of operations which happen to rotate a 3D vector.
Regardless, I think Euler angles should not be taught except as a "don't do this, it's a bad idea". Mostly because when used in practice, in order to interpret them you must convert them to a quaternion to do so, which is very telling. Gimbal lock is a secondary problem to that.
A physical analogy is an azimuth-elevation mounted antenna tracking an object that flies over the zenith. When the object is near the horizon, the angular velocities of both the azimuth and elevation actuators stay small. Near zenith, the angular velocity of the elevation actuator may still be small, but the azimuth actuator's goes to near infinity.
The main thing is that when representing the 3D looking orientation as multiple angles (pitch = up/down, yaw = compass direction), you lose control over the yaw axis when looking straight up.
Of course gimbal lock is a thing, and there are ways to avoid it, even without quaternions, but I'm curious as to when it should be avoided. I think gamers are now so used to it that removing it might cause more problems than it solves.
I guess that's more for airplane simulators then. But I'd also think any physics engine where things can rotate arbitrarily need to work correctly here.
I also think, despite various sources on the internet saying "quaternions are the solution to gimbal lock", that classical 3D vectors and 3D matrices without quaternions also perfectly solve it.
Agreed absolutely. I've used quaternions when someone else made the decision that was what the physics engine was going to do! But when I'm the decision maker it's 3d vectors and matrices all the way
Looks to me like you could compute the top half, and then just repeat the rest as a kind of mirror function, that repeats with some set translation. Am I wrong here?
I thought the same thing as well. When I saw the Taylor sum diverging at the "edges" I thought: well, a truncated Taylor sum is only accurate on a bounded interval around the expansion point. An easy solution to this limitation is to use the periodicity of the cos/sin function...
Strange. This article was easy to read, simple formatting, not driven by trends or weird internet culture. There weren't any bombastic claims, aggrandizing statements or dramatic opinions. Just interesting information presented clearly without being verbose. Almost like it was written by an adult. How did this end up on HN?
Marz's Taylor series implementation as-is is significantly faster (~60% of the runtime) than the glibc implementation at 6 terms on My Machine(tm), rather than about the same as in TFA. I haven't compared to the LERP lookup table yet though.
Niklaus Wirth also has some cosine code (probably meant more for software floating point, or floating-point FPGA blocks); I don't know how it compares with these approximations, but his seems to be within 1e-6 of Python's math.cos ...