While CORDIC is great for fixed point, it has limitations for floating point. The original 8087 fsin and fcos instructions used CORDIC, but later versions of the architecture switched to polynomial approximations; see https://software.intel.com/sites/default/files/managed/f8/9c.... Today it's possible to implement these elementary functions on x86 CPUs with better precision and better performance using ordinary multiply/add/fused-multiply-add instructions than even the current, improved post-CORDIC fsin and fcos.
The main issue is that having an instruction executing a fixed-function block with a given (high) latency and little if any pipelining tends to be far worse than having many more fully pipelined multiply/add instructions. The other issue is that argument reduction and approximation over the reduced domain are not independent. For some parts of the domain, such as computing the sine of a number very close to a multiple of pi, you may need to spend more cycles reducing the argument accurately to counter cancelation effects. However, as the reduced argument is then very close to zero, a simple polynomial suffices.
So, for most modern systems, I'd put the effort into an efficient pipelined fused multiply-add and use that for all elementary functions. Fixed-function hardware for elementary functions has generally proven to be sub-optimal.
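To make the cancellation point concrete, here's a small Python sketch (mine, not from the parent) that reduces an argument sitting very close to a multiple of pi. The naive reduction with double-precision pi loses essentially all of its accurate bits, while a reduction carried out with extra precision (via mpmath here) recovers the tiny reduced argument; and since that argument is tiny, sin(r) ≈ r, so a short polynomial suffices afterwards.

    import math
    from mpmath import mp, mpf, pi as mp_pi

    mp.prec = 200                        # plenty of guard bits for the reference
    k = 2 ** 20                          # power of two, so k * math.pi rounds only once
    x = float(k * mp_pi)                 # the double closest to k * (true pi)

    r_naive = x - k * math.pi            # reduction with double-precision pi
    r_good = float(mpf(x) - k * mp_pi)   # reduction carried out with 200-bit pi

    # The two reduced arguments differ by about k * (pi - math.pi) ~ 1.3e-10,
    # which is as large as the reduced argument itself: near-total cancellation.
    print(r_naive, r_good)
    print(math.sin(x), r_good)           # libm reduces carefully, so these should agree closely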
> For some parts of the domain, such as computing the sine of a number very close to a multiple of pi, you may need to spend more cycles reducing the argument accurately to counter cancelation effects.
I'm curious about this. How would you compute FLOAT32_MAX (or FLOAT64_MAX) mod PI with correct rounding?
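One way to at least see what the correctly rounded answer ought to be is to do the reduction with lots of guard bits, e.g. with mpmath. This is just a sketch of the brute-force check, not a production routine (those typically use something like Payne-Hanek reduction with a precomputed high-precision representation of pi):

    import sys
    from mpmath import mp, mpf, pi, floor

    mp.prec = 1200                  # DBL_MAX needs ~1024 bits just for the integer part of x/pi
    x = mpf(sys.float_info.max)     # FLOAT64_MAX, converted exactly
    q = floor(x / pi)               # number of whole multiples of pi in x
    r = x - q * pi                  # remainder, still in high precision
    print(float(r))                 # rounded back to double; with this many guard bits the
                                    # rounding is right unless x mod pi sits on a tie, which
                                    # a rigorous implementation would have to check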
I think the author was assuming that the reader was already familiar with FPGA design; in such designs it is straightforward to synthesize adders and shifters, but you cannot (for example) synthesize a multiplier.
Oh wow, I literally just finished my quadrature sinusoid DDS generator using MyHDL last night. I didn't use CORDIC but rather a LUT. I found out I can optimize generating quadrature sinusoids by having two separate LUTs, one storing the values from 0 to pi/2 and the other from pi/2 to pi. This has the advantage that when the sine output reads from the first LUT, the cos output reads from the second LUT and vice versa, so nothing is duplicated.
I'm still cleaning up the testbench code and I plan to put out a blog post here if y'all are interested: hatsunearu.github.io
I'm not sure I followed, but I thought that the standard approach was to only store a quarter of a cycle (0 to pi/2) in the LUT, and use symmetry to generate the rest of the cycle. Is the advantage of having two LUTs that you can index into each of them separately?
edit: and yeah, the two LUTs let me index into them separately, and also guarantee that for every quadrature combination the two lookups never land in the same LUT.
OK so the problem is kinda twofold with that...
First of all, my application needs two sinusoids in quadrature (it's for downconversion in my SDR); with that approach I believe I'd need a dual-read-port RAM, which takes up two RAM slices in my FPGA, so I'm not losing anything by taking my approach.
Secondly, I tried doing that but I seem to get an "off by one" error--basically if you subdivide 0 to pi/2 into, say, 2^10 points, then either:
a) the last point (index=2^10-1) corresponds to sin(pi/2)
b) the last point + 1 corresponds to sin(pi/2), and the actual last point is sin(pi/2 - (pi/2) / (2^10))
with the a) approach, when you sweep up through 0 to 2^10 - 1 and then mirror back down, the index after 2^10 - 1 isn't 2^10 - 1 again but 2^10 - 2, which means you'd need some annoying logic rather than simple bit slicing.
with the b) approach you'd have to make a special case for sin(pi/2), and it's all too annoying.
Maybe I'm just not creative enough but I haven't been able to come up with an elegant way of doing it with 0 to pi/2 LUTs (just by using simple bit slices)
I also haven't looked at any actual implementations (actually I couldn't even find them lol) other than the basic description on wikipedia. Maybe that's why.
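FWIW, the trick I've seen in DDS designs for exactly this off-by-one (it doesn't address the dual-port question) is to store the quarter wave sampled at the midpoints of the phase steps, i.e. sin((k + 0.5) * delta). Then the mirrored quadrant is reached by plain bitwise inversion of the index, with no correction term, at the cost of a constant half-LSB phase offset in the output (harmless for a DDS). A quick Python check of the index mapping, with example widths:

    import math

    PHASE_BITS = 12                      # full-cycle phase width (example value)
    QUARTER = 1 << (PHASE_BITS - 2)      # 2^10-entry quarter-wave LUT
    STEP = (math.pi / 2) / QUARTER

    # quarter wave sampled at midpoints: sin((k + 0.5) * STEP)
    lut = [math.sin((k + 0.5) * STEP) for k in range(QUARTER)]

    def dds_sin(phase):
        quadrant = phase >> (PHASE_BITS - 2)
        idx = phase & (QUARTER - 1)
        if quadrant & 1:                 # falling quarters: mirror by bit inversion
            idx = (QUARTER - 1) - idx    # same as ~idx masked to 10 bits
        val = lut[idx]
        return val if quadrant < 2 else -val

    # exact match (up to float rounding) against sin at the half-offset phases
    worst = max(abs(dds_sin(p) - math.sin((p + 0.5) * 2 * math.pi / (1 << PHASE_BITS)))
                for p in range(1 << PHASE_BITS))
    print("worst-case error:", worst)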
When calculating K, the author says “It can be shown through the use of trigonometric identities that:” and proceeds to show a formula. How exactly does this happen?
After calculating K, the author assigns it to c in the cordic function, but not to s. Why?
To your first: in the CORDIC iteration tan β_i = 2^{-i}, so cos β_i = 2^i sin β_i. Square and use Pythagoras to get cos^2 β_i = 2^{2i} (1 - cos^2 β_i).
Solve for cos^2 β_i and take the positive square root to get cos β_i = 2^i / √(1 + 2^{2i}) = 1 / √(2^{-2i} + 1); K is just the product of these cosines over all i.
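A quick numeric check of that product (my own throwaway snippet):

    import math

    # K = prod_i cos(beta_i) with cos(beta_i) = 1 / sqrt(1 + 2^(-2i))
    K = 1.0
    for i in range(32):
        K *= 1.0 / math.sqrt(1.0 + 2.0 ** (-2 * i))
    print(K)   # converges quickly to ~0.6072529350

The product settles after a few dozen terms because the later factors are essentially 1.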
To your second:
I presume the author multiplied 1 by K to get c <- K. When you multiply 0 by K you still have s <- 0 though. I haven't figured out the justification for moving the multiplication from after to before the loop but if it works for c it works for s too.
edit:
If you write
c_new ← cos α[i] × (c - tan α[i] × s)
s_new ← cos α[i] × (s + tan α[i] × c)
c ← c_new
s ← s_new
with a complex variable z = c + is (changing the index variable to j) you get
z ← cos α[j] × (1 + i × tan α[j]) × z
which makes it more obvious (to me) that you can pull the multiplication by cosines out of the loop.
z <- 1 # c=1, s=0
z <- K × z # c=K, s=0
...loop...
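Here's a small floating-point sketch (mine, in Python rather than the article's code) of exactly that rearrangement: every loop iteration multiplies z by (1 + i·d·2^{-j}), and the product of all the cos α[j] factors is applied once up front as c ← K:

    import math

    N = 40
    K = 1.0
    for j in range(N):
        K *= 1.0 / math.sqrt(1.0 + 2.0 ** (-2 * j))

    def cordic(beta):
        c, s = K, 0.0                       # z starts as K * (1 + 0i), i.e. c=K, s=0
        angle = 0.0
        for j in range(N):
            d = 1.0 if angle < beta else -1.0
            t = 2.0 ** -j                   # tan of the j-th micro-rotation
            # multiply z by (1 + i*d*t): the un-normalised rotation
            c, s = c - d * t * s, s + d * t * c
            angle += d * math.atan(t)
        return c, s

    print(cordic(1.0))                      # ~ (0.5403, 0.8415)
    print(math.cos(1.0), math.sin(1.0))

The reason the factor can be moved out at all: every step scales z by sqrt(1 + 2^{-2j}) regardless of the direction d, so the total scaling is the same fixed 1/K on every run, and it can be paid once before (or after) the loop instead of inside it.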
To the second question, I am also struggling to follow the step "Let’s move the multiplications by cos β_i into a separate loop".
The recursive construction of `c` doesn't allow the cosines to be pulled out: the i-th term is multiplied by cos β_i through cos β_n, but not by cos β_0 through cos β_{i-1}.
Clearly the method must work, but I'm not sure what I'm missing.
Edit: ignoring `direction`, I can prove this step by induction for both c(i) and s(i) together. At least to me, though, it isn't a simple algebraic simplification.
Isn't it simpler (and more efficient) to build an interpolating polynomial approximation for sin(x) over the range [0, pi/8) (using Chebyshev instead of Lagrange interpolation, for example)?
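Something along these lines (a quick sketch of mine, interpolating at Chebyshev nodes with numpy; the degree and range are just example choices):

    import numpy as np

    a, b = 0.0, np.pi / 8
    deg = 5
    k = np.arange(deg + 1)
    # Chebyshev nodes mapped onto [a, b]
    nodes = 0.5 * (a + b) + 0.5 * (b - a) * np.cos((2 * k + 1) * np.pi / (2 * (deg + 1)))
    coeffs = np.polyfit(nodes, np.sin(nodes), deg)   # polynomial through those nodes

    xs = np.linspace(a, b, 10001)
    print(np.max(np.abs(np.polyval(coeffs, xs) - np.sin(xs))))   # ~1e-9 for degree 5

Even a low degree is plenty on such a narrow range; the trade-off versus CORDIC is that evaluating the polynomial needs multipliers, which is exactly what the no-multiplier FPGA comments above are trying to avoid.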
Is this just an example? Because if I wanted to rotate a vector, I'd do it with vectors, not trig. Which requires multiplication and addition, right? What am I missing?
Indeed, I have used CORDIC for rectangular-to-polar on a little micro without multiply (and power-constrained by a coin cell). I made a code-generator for this, see https://github.com/markrages/cordic_polar
You mean if you insert x->ix and N->2^N and use exponentiation by squaring? Then you get
fun cis(x) {
  c, s <- 1, x >> N
  for 0 .. N-1 {
    c, s <- c*c - s*s, c*s << 1
  }
  return c, s
}
By comparison, CORDIC has only additions and shifts in the loop. CORDIC is also online in the sense that you don't need to know N up front, you can keep iteratively improving your approximation until you reach the limit of numeric precision. But with this function you need to know N up front.
If we're talking about fixed-point numbers, the initial assignment of s <- x >> N here is also scary, since it is the only place in which x feeds into the algorithm, and the shift means that only a small number of the bits of x are actually used in the calculation. In the worst case, if you increased N a lot in the hope of getting more accuracy, you could shift all the bits of x off and get the answer 1, 0 for every angle.
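For what it's worth, here's the same doubling idea in floating point (my sketch, not the fixed-point version above), which shows both how it works and why N has to be fixed in advance: you commit to the initial angle x / 2^N before the loop starts.

    import cmath

    def cis_by_squaring(x, n=30):
        # approximate e^{ix} as (1 + i*x/2^n)^(2^n) by repeated squaring
        z = complex(1.0, x / 2.0 ** n)
        for _ in range(n):
            z = z * z          # (c + i*s)^2 = (c*c - s*s) + i*(2*c*s), as in the loop above
        return z

    print(cis_by_squaring(1.0))    # ~ (0.5403 + 0.8415j), good to ~9 digits for n=30
    print(cmath.exp(1j))           # reference

The leftover error is mostly the magnitude drift of (1 + i*x/2^n), which shrinks as n grows; but in fixed point, growing n also shifts more bits of x away, which is the problem described above.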
What you need to do is make 32 instances of this in series. Then you can compute the sin and cos without any flip-flops, which counts as asynchronous. beta is an array indexed by count, so this will instead be replaced with `BETA_0, `BETA_1, etc. for each instance.
A pipelined version is more useful: you still make 32 instances, but leave flip-flops between each stage. This way you produce one result per cycle, but with a 32-cycle latency. The cycle time is fast since there is not much logic in each stage. This might be useful, for example, to make the logic part of a DDS.
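As a behavioural sketch of what those 32 unrolled stages compute (plain Python, not MyHDL/Verilog; widths and constants are just example values), each stage is a pure shift-and-add function of the previous stage's outputs, and in the pipelined version you'd simply put registers on the three values between stages:

    import math

    FRAC = 30                                  # fixed-point fraction bits (example)
    SCALE = 1 << FRAC
    ATAN = [int(round(math.atan(2.0 ** -i) * SCALE)) for i in range(32)]   # the `BETA_i constants
    K = int(0.6072529350088813 * SCALE)        # CORDIC gain folded into the initial x

    def stage(i, x, y, z):
        # one combinational stage: a shift, an add/sub, and an angle constant
        d = 1 if z >= 0 else -1
        return x - d * (y >> i), y + d * (x >> i), z - d * ATAN[i]

    def sincos(angle):
        x, y, z = K, 0, int(round(angle * SCALE))
        for i in range(32):                    # fully unrolled in hardware: 32 instances in series
            x, y, z = stage(i, x, y, z)
        return x / SCALE, y / SCALE            # (cos, sin)

    print(sincos(1.0))                         # ~ (0.5403, 0.8415)

With registers between the stage() calls you get one result per clock at 32 cycles of latency, as described above.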
This is way too complex; a purely combinatorial version would never meet timing in a design where you cared enough to put it in an FPGA. The question doesn't make sense on its face. You want a pipelined version.
I asked why to get to the bottom of what he's really asking.
You could use multi-cycle paths so this doesn't limit your global timing; then you would save the registers. This is much more useful for ASIC design, because on an FPGA the registers are there whether you use them or not.
However, total latency will be lower than for a pipelined version, which is nice if that's what you're optimizing for.