
Computing sin and cos in hardware with synthesisable Verilog - Cieplak
http://kierdavis.com/cordic.html
======
GeertB
While CORDIC is great for fixed point, it has limitations for floating point.
The original 8087 fsin and fcos instructions used CORDIC, but later versions
of the architecture switched to polynomial approximations, see
[https://software.intel.com/sites/default/files/managed/f8/9c...](https://software.intel.com/sites/default/files/managed/f8/9c/x87TrigonometricInstructionsVsMathFunctions.pdf).
Today it's possible to develop implementations of these elementary functions
on x86 CPUs that are more precise and more performant, using regular multiply,
add, and fused multiply-add instructions, than even the current improved post-
CORDIC fsin and fcos instructions.

The main issue is that having an instruction executing a fixed-function block
with a given (high) latency and little if any pipelining tends to be far worse
than having many more fully pipelined multiply/add instructions. The other
issue is that argument reduction and approximation over the reduced domain are
not independent. For some parts of the domain, such as computing the sine of a
number very close to a multiple of pi, you may need to spend more cycles
reducing the argument accurately to counter cancellation effects. However, as
the reduced argument is then very close to zero, a simple polynomial suffices.

So, for most modern systems, I'd put the effort in efficient pipelined fused-
multiply-add and use that for all elementary functions. Fixed-function
hardware for elementary functions has generally proved sub-optimal.
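
As a concrete illustration of the multiply/add approach (a hypothetical Python
sketch, not Intel's actual implementation): a short odd polynomial for sin over
a reduced domain, evaluated with Horner's rule so that every step is a
multiply-add. Taylor coefficients are used here for brevity; a production libm
would use minimax coefficients instead.

```python
import math

# Odd-polynomial approximation of sin(x) on a reduced domain, evaluated
# with Horner's rule -- each step is one multiply and one add, which maps
# directly onto fused multiply-add hardware.
SIN_COEFFS = [
    1.0,
    -1.0 / 6.0,        # -1/3!
    1.0 / 120.0,       #  1/5!
    -1.0 / 5040.0,     # -1/7!
    1.0 / 362880.0,    #  1/9!
]

def poly_sin(x):
    """sin(x) for |x| <= pi/4 via an odd polynomial in x^2."""
    x2 = x * x
    acc = SIN_COEFFS[-1]
    for c in reversed(SIN_COEFFS[:-1]):
        acc = acc * x2 + c   # the FMA-shaped step: a*b + c
    return x * acc
```

On this domain the truncation error is below 1e-8, and each loop iteration is
one pipelined FMA rather than a long fixed-function latency.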

~~~
im3w1l
> For some parts of the domain, such as computing the sine of a number very
> close to a multiple of pi, you may need to spend more cycles reducing the
> argument accurately to counter cancelation effects.

I'm curious about this. How would you compute FLOAT32_MAX (or FLOAT64_MAX) mod
PI with correct rounding?

~~~
stephencanon
The most approachable explanation I know of is K C Ng’s:
[https://www.scribd.com/document/64949170/Ng-Argument-Reducti...](https://www.scribd.com/document/64949170/Ng-Argument-Reduction-for-Huge-Arguments-Good-to-the-Last-Bit)
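
To illustrate the cancellation problem that paper addresses, here is a Python
sketch: naive reduction against a double-precision pi loses the result for
huge arguments, because the rounding error in pi gets multiplied by n. The
extended-precision reference here hard-codes ~50 digits of pi via `Decimal`;
real implementations (Payne-Hanek style) use a precomputed binary table of
2/pi instead.

```python
import math
from decimal import Decimal, getcontext

def naive_reduce(x):
    """Subtract the nearest multiple of double-precision pi -- the error
    in math.pi is amplified by n, so this is useless for large x."""
    n = round(x / math.pi)
    return x - n * math.pi

# Reference reduction using ~50 digits of pi.
getcontext().prec = 80
PI_HI = Decimal("3.14159265358979323846264338327950288419716939937510")

def precise_reduce(x):
    """x mod pi, reduced to [-pi/2, pi/2], using extended precision."""
    d = Decimal(x)
    n = int((d / PI_HI).to_integral_value())
    return float(d - n * PI_HI)
```

For x = 1e16, `naive_reduce` is off by a sizeable fraction of a radian (n is
about 3e15 and math.pi is wrong by about 1.2e-16), while `precise_reduce`
still lands in [-pi/2, pi/2] with nearly full double accuracy.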

------
Y_Y
What a well explained and simple article. Though it would be nice to know why
each optimisation is made wrt the eventual hardware.

~~~
ChuckMcM
I think the author was assuming that the reader was already familiar with FPGA
design. In such designs it is straightforward to synthesize adders and
shifters, but you cannot synthesize a multiplier (for example).

~~~
mechagodzilla
You can synthesize simple integer multipliers, just not floating point
multipliers.

------
hatsunearu
Oh wow, I literally _just_ finished my quadrature sinusoid DDS generator using
MyHDL last night. I didn't use CORDIC but rather a LUT. I found I can optimize
generating quadrature sinusoids by having two separate LUTs, one storing 0 to
pi/2 and the other pi/2 to pi. This has the advantage that when the sine
output takes inputs from the first LUT, the cos output takes inputs from the
second LUT and vice versa, thus saving duplicates.

I'm still cleaning up the testbench code and I plan to put out a blog post
here if y'all are interested: hatsunearu.github.io

~~~
gugagore
I'm not sure I followed, but I thought that the standard approach was to only
store a quarter of a cycle (0 to pi/2) in the LUT, and use symmetry to
generate the rest of the cycle. Is the advantage of having two LUTs that you
can index into each of them separately?

~~~
hatsunearu
edit: and yeah, the two LUTs allow me to index into them separately, and also
guarantee that in every quadrature combination the two outputs never index
into the same LUT.

OK so the problem is kinda twofold with that... First of all, my application
needs two sinusoids in quadrature (it's for downconversion in my SDR). With
that approach I believe I'd need dual-read-port RAM, which takes up two RAM
slices in my FPGA, so I'm not losing anything by taking my approach.

Secondly, I tried doing that but I seem to get an "off by one" error--
basically if you subdivide 0 to pi/2 with, let's say, 2^10 points, then
either:

a) the last point (index=2^10-1) corresponds to sin(pi/2), or

b) the last point + 1 corresponds to sin(pi/2), and the actual last point is
sin(pi/2 - (pi/2) / (2^10)).

With the a) approach, when you sweep up through 0 to 2^10 - 1 and hit the top,
the next index on the way back down isn't 2^10 - 1 again but 2^10 - 2, which
means you'd need some annoying logic rather than simple bit slicing.

With the b) approach you'd have to make a special case, and it's all too
annoying.

Maybe I'm just not creative enough but I haven't been able to come up with an
elegant way of doing it with 0 to pi/2 LUTs (just by using simple bit slices)

I also haven't looked at any actual implementations (actually I couldn't even
find them lol) other than the basic description on wikipedia. Maybe that's
why.
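
For what it's worth, one standard trick (sketched here as a Python model, not
necessarily a fit for your design) is to put a half-sample offset in a
quarter-wave table: entry i holds sin((i + 0.5)·(pi/2)/2^10). The fold at pi/2
then lands exactly between samples, so mirroring is just the bitwise
complement of the index and both endpoint cases disappear.

```python
import math

N_BITS = 10
N = 1 << N_BITS

# Quarter-wave table with a half-sample offset: entry i holds
# sin((i + 0.5) * (pi/2) / N). The offset makes the fold at pi/2 land
# exactly between samples, so mirroring is the bitwise complement of
# the index -- no special case at the endpoints.
QUARTER_LUT = [math.sin((i + 0.5) * (math.pi / 2) / N) for i in range(N)]

def lut_sin(phase):
    """sin of a phase given as an integer, where 4*N counts == one cycle
    and phase p represents the angle (p + 0.5) * (pi/2) / N."""
    quadrant = (phase >> N_BITS) & 0b11
    index = phase & (N - 1)
    if quadrant & 0b01:          # quadrants 1 and 3: mirror the index
        index = (N - 1) - index  # == bitwise complement within N_BITS
    value = QUARTER_LUT[index]
    if quadrant & 0b10:          # quadrants 2 and 3: negate
        value = -value
    return value

def lut_cos(phase):
    # cos is sin advanced by a quarter cycle
    return lut_sin((phase + N) & (4 * N - 1))
```

In hardware the mirror step is an XOR of the index bits with the quadrant bit,
so the whole address path stays simple bit slicing.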

------
kkaranth
Nice read! I have 2 questions:

When calculating K, the author says “It can be shown through the use of
trigonometric identities that:” and proceeds to show a formula. How exactly
does this happen?

After calculating K, the author assigns it to c in the cordic function, but
not to s. Why?

~~~
edflsafoiewq
To your first question:

tan β_i = sin β_i / cos β_i = 2^{-i}.

So cos β_i = 2^i sin β_i. Squaring and using Pythagoras gives cos^2 β_i =
2^{2i} (1 - cos^2 β_i).

Solve for cos^2 β_i and take the positive square root to get cos β_i = 2^i /
√(1 + 2^{2i}) = 1 / √(2^{-2i} + 1).
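
A quick numeric check of the resulting product (a Python sketch):

```python
import math

def cordic_gain(n):
    """Product of the cos(beta_i) terms derived above, with
    cos(beta_i) = 1 / sqrt(1 + 2^{-2i})."""
    k = 1.0
    for i in range(n):
        k /= math.sqrt(1.0 + 2.0 ** (-2 * i))
    return k
```

The product converges very quickly; by 32 terms it has settled on the usual
CORDIC gain constant K ≈ 0.60725293500888.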

To your second:

I presume the author multiplied 1 by K to get c ← K. When you multiply 0 by K
you still have s ← 0 though. I haven't figured out the justification for
moving the multiplication from after to before the loop, but if it works for c
it works for s too.

edit:

If you write

    c_new ← cos α[i] × (c - tan α[i] × s)
    s_new ← cos α[i] × (s + tan α[i] × c)
    c ← c_new
    s ← s_new

with a complex variable z = c + is (changing the index variable to j) you get

    z ← cos α[j] × (1 + i × tan α[j]) × z

which makes it more obvious (to me) that you can pull the multiplication by
the cosines out of the loop.

    z ← 1      # c=1, s=0
    z ← K × z  # c=K, s=0
    ...loop...
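
Putting the whole discussion together, here is a floating-point Python model
of the rotation-mode loop with K applied up front. (Real hardware would use
fixed-point adds and arithmetic shifts where this model multiplies by 2^-i;
this is a sketch, not the article's Verilog.)

```python
import math

N_ITERS = 32
# Precomputed arctangent table: alpha[i] = atan(2^-i)
ALPHAS = [math.atan(2.0 ** -i) for i in range(N_ITERS)]
# The gain pulled out of the loop, as discussed: K = prod cos(alpha[i])
K = 1.0
for a in ALPHAS:
    K *= math.cos(a)

def cordic(theta):
    """Return (cos(theta), sin(theta)) for theta in [-pi/2, pi/2]."""
    c, s = K, 0.0            # start from K * (1 + 0i)
    angle = 0.0
    for i in range(N_ITERS):
        d = 1.0 if angle < theta else -1.0   # which way to rotate
        # rotate by (1 + i*d*2^-i); in hardware these are shifts and adds
        c, s = c - d * s * 2.0 ** -i, s + d * c * 2.0 ** -i
        angle += d * ALPHAS[i]
    return c, s
```

After 32 iterations the accumulated angle is within atan(2^-31) of theta, so
the result is good to roughly 9 decimal digits.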

~~~
seedless-sensat
To the second question, I am also struggling to follow the step "Let’s move
the multiplications by cos β_i into a separate loop".

The recursive construction of `c` doesn't obviously allow the cosines to be
pulled out. The i-th term is multiplied by cos β_i through cos β_n, but not by
cos β_0 through cos β_{i-1}.

Clearly the method must work, but I'm not sure what I'm missing.

Edit: ignoring `direction`, I can prove this step by induction on both c(i)
and s(i) together. At least to me, though, it isn't a simple algebraic
simplification.

~~~
edflsafoiewq
I got it. In terms of a complex variable, mathematically the loop before that
step is

    z = Π_j cos β[j] (1 + i direction_j 2^{-j})
    

So the step corresponds to the trivial rearrangement

    z = ( Π_j cos β[j] ) ( Π_j (1 + i direction_j 2^{-j}) )

~~~
edflsafoiewq
This is what's happening mathematically:
[https://i.imgur.com/MclfpEN.png](https://i.imgur.com/MclfpEN.png).

------
toolslive
Isn't it simpler (and more efficient) to build an interpolating polynomial
approximation for sin(x) on the range [0, pi/8), using Chebyshev instead of
Lagrange interpolation, for example?
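
A sketch of that idea in Python (assumed details: sin is sampled at Chebyshev
nodes on [0, pi/8] and the interpolant is evaluated in Newton form; degree 7
already gives roughly 1e-12 accuracy on this narrow range):

```python
import math

def chebyshev_nodes(n, a, b):
    """Chebyshev points of the first kind, mapped from [-1, 1] to [a, b]."""
    return [(a + b) / 2 + (b - a) / 2 *
            math.cos((2 * k + 1) * math.pi / (2 * n))
            for k in range(n)]

def newton_coeffs(xs, ys):
    """Divided differences for the Newton form of the interpolant."""
    coeffs = list(ys)
    for j in range(1, len(xs)):
        for i in range(len(xs) - 1, j - 1, -1):
            coeffs[i] = (coeffs[i] - coeffs[i - 1]) / (xs[i] - xs[i - j])
    return coeffs

def make_sin_approx(degree=7, a=0.0, b=math.pi / 8):
    xs = chebyshev_nodes(degree + 1, a, b)
    cs = newton_coeffs(xs, [math.sin(x) for x in xs])
    def approx(x):
        # Horner-style evaluation of the Newton form
        acc = cs[-1]
        for c, xk in zip(reversed(cs[:-1]), reversed(xs[:-1])):
            acc = acc * (x - xk) + c
        return acc
    return approx
```

Each evaluation is again just multiplies and adds, which is exactly the trade
GeertB describes above.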

------
JoeAltmaier
Is this just an example? Because if I wanted to rotate a vector, I'd do it
with vectors, not trig. Which requires multiplication and addition, right?
What am I missing?

~~~
gcb0
missing the "in hardware" part?

target may be a simple microcontroller or fpga that lacks multiplication.

~~~
JoeAltmaier
Yet it has trig? Come on.

~~~
Khoth
It doesn't have trig. The algorithm just requires 32 precomputed trig results
that are stored in a lookup table.
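
Concretely, the stored values are just the arctangents of the negative powers
of two (a Python sketch of the table):

```python
import math

# The only trig values CORDIC ever stores: atan(2^-i) for each iteration.
# In hardware these are fixed-point constants baked into a ROM.
ATAN_TABLE = [math.atan(2.0 ** -i) for i in range(32)]
```

The first entry is pi/4, each later entry is roughly half the previous one,
and their sum (~1.743 rad) bounds the range of angles the iteration can reach.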

~~~
markrages
Indeed, I have used CORDIC for rectangular-to-polar on a little micro without
multiply (and power-constrained by a coin cell). I made a code-generator for
this, see
[https://github.com/markrages/cordic_polar](https://github.com/markrages/cordic_polar)

~~~
JoeAltmaier
Very sweet!

------
man-and-laptop
How is CORDIC different from e^x ~= (1+x/N)^N where N is a power of 2?

~~~
edflsafoiewq
You mean if you insert x->ix and N->2^N and use exponentiation by squaring?
Then you get

    fun cis(x) {
        c, s <- 1, x >> N
        for 0 .. N-1 {
            c, s <- c*c - s*s, c*s << 1
        }
        return c, s
    }

By comparison, CORDIC has only additions and shifts in the loop. CORDIC is
also online in the sense that you don't need to know N up front, you can keep
iteratively improving your approximation until you reach the limit of numeric
precision. But with this function you need to know N up front.

If we're talking about fixed-point numbers, the initial assignment of s <- x
>> N here is also scary, since it is the only place where x feeds into the
algorithm, and the shift means that only a small number of the bits of x are
actually used in the calculation. In the worst case, if you increased N a lot
in the hope of getting more accuracy, you could shift all the bits of x off
and get the answer 1, 0 for every angle.
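
A floating-point Python model of the same idea (fixed N, repeated squaring of
1 + ix/2^N), just to make the comparison concrete:

```python
import cmath

def cis_by_squaring(x, n=20):
    """Approximate e^{ix} as (1 + ix/2^n)^(2^n) by squaring n times.
    Each loop step needs full multiplications, unlike CORDIC's
    shift-and-add steps, and n must be fixed up front."""
    z = complex(1.0, x / 2.0 ** n)
    for _ in range(n):
        z = z * z
    return z
```

With n = 20 the error for |x| around 1 is on the order of x^2 / 2^(n+1), i.e.
about 5e-7, and the magnitude also drifts slightly above 1; both effects shrink
as n grows, but only if x keeps enough bits after the initial division.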

------
gravypod
How would one build an asynchronous implementation of this circuit in an HDL

~~~
monocasa
Why?

~~~
monocasa
Replying because I can't edit.

This is way too complex; a purely combinatorial version would never meet
timing in a design where you cared enough to put it in an FPGA. The question
doesn't make sense on its face. You want a pipelined version.

I asked why to get to the bottom of what he's really asking.

~~~
slededit
You could use multi-cycle paths so this doesn't limit your global timing; then
you would save the registers. This is much more useful for ASIC design,
because on an FPGA the registers are there whether you use them or not.

However, total latency will be lower than a pipelined version, which is nice
if that's what you're optimizing for.

~~~
monocasa
Yeah, but if you were using a more advanced technique like that, you wouldn't
be asking on HN how to unroll that loop.

------
andrewflnr
This is incredibly slick. Is it used in real hardware?

~~~
userbinator
CORDIC? Yes. It's what most if not all scientific calculators use.

