While CORDIC is great for fixed point, it has limitations for floating point. The original 8087 fsin and fcos instructions used CORDIC, but later versions of the architecture switched to polynomial approximations; see https://software.intel.com/sites/default/files/managed/f8/9c.... Today it's possible to implement these elementary functions on x86 CPUs with better precision and better performance using ordinary multiply/add/fused-multiply-add instructions than even the current, improved post-CORDIC fsin and fcos.
The main issue is that having an instruction executing a fixed-function block with a given (high) latency and little if any pipelining tends to be far worse than having many more fully pipelined multiply/add instructions. The other issue is that argument reduction and approximation over the reduced domain are not independent. For some parts of the domain, such as computing the sine of a number very close to a multiple of pi, you may need to spend more cycles reducing the argument accurately to counter cancelation effects. However, as the reduced argument is then very close to zero, a simple polynomial suffices.
So, for most modern systems, I'd put the effort into an efficient pipelined fused multiply-add and use that for all elementary functions. Fixed-function hardware for elementary functions has generally proven to be sub-optimal.
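To make the cancellation point concrete, here's a small Python sketch (mine, not from the parent) that reduces an argument sitting very close to a multiple of pi. The naive reduction with double-precision pi loses essentially all of its accurate bits, while a reduction carried out with extra precision (via mpmath here) recovers the tiny reduced argument; and since that argument is tiny, sin(r) ≈ r, so a short polynomial suffices afterwards.

    import math
    from mpmath import mp, mpf, pi as mp_pi

    mp.prec = 200                        # plenty of guard bits for the reference
    k = 2 ** 20                          # power of two, so k * math.pi rounds only once
    x = float(k * mp_pi)                 # the double closest to k * (true pi)

    r_naive = x - k * math.pi            # reduction with double-precision pi
    r_good = float(mpf(x) - k * mp_pi)   # reduction carried out with 200-bit pi

    # The two reduced arguments differ by about k * (pi - math.pi) ~ 1.3e-10,
    # which is as large as the reduced argument itself: near-total cancellation.
    print(r_naive, r_good)
    print(math.sin(x), r_good)           # libm reduces carefully, so these should agree closely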
> For some parts of the domain, such as computing the sine of a number very close to a multiple of pi, you may need to spend more cycles reducing the argument accurately to counter cancelation effects.
I'm curious about this. How would you compute FLOAT32_MAX (or FLOAT64_MAX) mod PI with correct rounding?
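One way to at least see what the correctly rounded answer ought to be is to do the reduction with lots of guard bits, e.g. with mpmath. This is just a sketch of the brute-force check, not a production routine (those typically use something like Payne-Hanek reduction with a precomputed high-precision representation of pi):

    import sys
    from mpmath import mp, mpf, pi, floor

    mp.prec = 1200                  # DBL_MAX needs ~1024 bits just for the integer part of x/pi
    x = mpf(sys.float_info.max)     # FLOAT64_MAX, converted exactly
    q = floor(x / pi)               # number of whole multiples of pi in x
    r = x - q * pi                  # remainder, still in high precision
    print(float(r))                 # rounded back to double; with this many guard bits the
                                    # rounding is right unless x mod pi sits on a tie, which
                                    # a rigorous implementation would have to check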
I think the author was assuming that the reader was already familiar with FPGA design; in such designs it is straightforward to synthesize adders and shifters, but you cannot (for example) synthesize a multiplier.
Oh wow, I literally just finished my quadrature sinusoid DDS generator using MyHDL last night. I didn't use CORDIC but rather a LUT. I found out I can optimize generating quadrature sinusoids by having two separate LUTs, one storing the values from 0 to pi/2 and the other from pi/2 to pi. This has the advantage that when the sine output reads from the first LUT, the cos output reads from the second LUT and vice versa, so nothing is duplicated.
I'm still cleaning up the testbench code and I plan to put out a blog post here if y'all are interested: hatsunearu.github.io
I'm not sure I followed, but I thought that the standard approach was to only store a quarter of a cycle (0 to pi/2) in the LUT, and use symmetry to generate the rest of the cycle. Is the advantage of having two LUTs that you can index into each of them separately?
edit: and yeah, the two LUTs let me index into them separately, and also guarantee that for every quadrature combination the two lookups never land in the same LUT.
OK so the problem is kinda twofold with that...
First of all, my application needs two sinusoids in quadrature (it's for downconversion in my SDR); with that approach I believe I'd need a dual-read-port RAM, which takes up two RAM slices in my FPGA, so I'm not losing anything by taking my approach.
Secondly, I tried doing that but I seem to get an "off by one" error--basically if you subdivide 0 to pi/2 into, say, 2^10 points, then either:
a) the last point (index=2^10-1) corresponds to sin(pi/2)
b) the last point + 1 corresponds to sin(pi/2), and the actual last point is sin(pi/2 - (pi/2) / (2^10))
with the a) approach, when you sweep up through 0 to 2^10 - 1 and then mirror back down, the index after 2^10 - 1 isn't 2^10 - 1 again but 2^10 - 2, which means you'd need some annoying logic rather than simple bit slicing.
with the b) approach you'd have to make a special case for sin(pi/2), and it's all too annoying.
Maybe I'm just not creative enough but I haven't been able to come up with an elegant way of doing it with 0 to pi/2 LUTs (just by using simple bit slices)
I also haven't looked at any actual implementations (actually I couldn't even find them lol) other than the basic description on wikipedia. Maybe that's why.
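FWIW, the trick I've seen in DDS designs for exactly this off-by-one (it doesn't address the dual-port question) is to store the quarter wave sampled at the midpoints of the phase steps, i.e. sin((k + 0.5) * delta). Then the mirrored quadrant is reached by plain bitwise inversion of the index, with no correction term, at the cost of a constant half-LSB phase offset in the output (harmless for a DDS). A quick Python check of the index mapping, with example widths:

    import math

    PHASE_BITS = 12                      # full-cycle phase width (example value)
    QUARTER = 1 << (PHASE_BITS - 2)      # 2^10-entry quarter-wave LUT
    STEP = (math.pi / 2) / QUARTER

    # quarter wave sampled at midpoints: sin((k + 0.5) * STEP)
    lut = [math.sin((k + 0.5) * STEP) for k in range(QUARTER)]

    def dds_sin(phase):
        quadrant = phase >> (PHASE_BITS - 2)
        idx = phase & (QUARTER - 1)
        if quadrant & 1:                 # falling quarters: mirror by bit inversion
            idx = (QUARTER - 1) - idx    # same as ~idx masked to 10 bits
        val = lut[idx]
        return val if quadrant < 2 else -val

    # exact match (up to float rounding) against sin at the half-offset phases
    worst = max(abs(dds_sin(p) - math.sin((p + 0.5) * 2 * math.pi / (1 << PHASE_BITS)))
                for p in range(1 << PHASE_BITS))
    print("worst-case error:", worst)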
When calculating K, the author says “It can be shown through the use of trigonometric identities that:” and proceeds to show a formula. How exactly does this happen?
After calculating K, the author assigns it to c in the cordic function, but not to s. Why?
To your first: in the CORDIC iteration tan β_i = 2^{-i}, so cos β_i = 2^i sin β_i. Square and use Pythagoras to get cos^2 β_i = 2^{2i} (1 - cos^2 β_i).
Solve for cos^2 β_i and take the positive square root to get cos β_i = 2^i / √(1 + 2^{2i}) = 1 / √(2^{-2i} + 1); K is just the product of these cosines over all i.
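A quick numeric check of that product (my own throwaway snippet):

    import math

    # K = prod_i cos(beta_i) with cos(beta_i) = 1 / sqrt(1 + 2^(-2i))
    K = 1.0
    for i in range(32):
        K *= 1.0 / math.sqrt(1.0 + 2.0 ** (-2 * i))
    print(K)   # converges quickly to ~0.6072529350

The product settles after a few dozen terms because the later factors are essentially 1.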
To your second:
I presume the author multiplied 1 by K to get c <- K. When you multiply 0 by K you still have s <- 0 though. I haven't figured out the justification for moving the multiplication from after to before the loop but if it works for c it works for s too.
edit:
If you write
c_new ← cos α[i] × (c - tan α[i] × s)
s_new ← cos α[i] × (s + tan α[i] × c)
c ← c_new
s ← s_new
with a complex variable z = c + is (changing the index variable to j) you get
z ← cos α[j] × (1 + i × tan α[j]) × z
which makes it more obvious (to me) that you can pull the multiplication by cosines out of the loop.
z <- 1 # c=1, s=0
z <- K × z # c=K, s=0
...loop...
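Here's a small floating-point sketch (mine, in Python rather than the article's code) of exactly that rearrangement: every loop iteration multiplies z by (1 + i·d·2^{-j}), and the product of all the cos α[j] factors is applied once up front as c ← K:

    import math

    N = 40
    K = 1.0
    for j in range(N):
        K *= 1.0 / math.sqrt(1.0 + 2.0 ** (-2 * j))

    def cordic(beta):
        c, s = K, 0.0                       # z starts as K * (1 + 0i), i.e. c=K, s=0
        angle = 0.0
        for j in range(N):
            d = 1.0 if angle < beta else -1.0
            t = 2.0 ** -j                   # tan of the j-th micro-rotation
            # multiply z by (1 + i*d*t): the un-normalised rotation
            c, s = c - d * t * s, s + d * t * c
            angle += d * math.atan(t)
        return c, s

    print(cordic(1.0))                      # ~ (0.5403, 0.8415)
    print(math.cos(1.0), math.sin(1.0))

The reason the factor can be moved out at all: every step scales z by sqrt(1 + 2^{-2j}) regardless of the direction d, so the total scaling is the same fixed 1/K on every run, and it can be paid once before (or after) the loop instead of inside it.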
To the second question, I am also struggling to follow the step "Let’s move the multiplications by cos β_i into a separate loop".
The recursive construction of `c` doesn't allow the cosines to be pulled out: the i-th term is multiplied by cos β_i through cos β_n, but not by cos β_0 through cos β_{i-1}.
Clearly the method must work, but I'm not sure what I'm missing.
Edit: ignoring `direction`, I can prove this step by induction for both c(i) and s(i) together. At least to me, though, it isn't a simple algebraic simplification.
Isn't it simpler (and more efficient) to build an interpolating polynomial approximation for sin(x) over the range [0, pi/8) (using Chebyshev instead of Lagrange interpolation, for example)?
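Something along these lines (a quick sketch of mine, interpolating at Chebyshev nodes with numpy; the degree and range are just example choices):

    import numpy as np

    a, b = 0.0, np.pi / 8
    deg = 5
    k = np.arange(deg + 1)
    # Chebyshev nodes mapped onto [a, b]
    nodes = 0.5 * (a + b) + 0.5 * (b - a) * np.cos((2 * k + 1) * np.pi / (2 * (deg + 1)))
    coeffs = np.polyfit(nodes, np.sin(nodes), deg)   # polynomial through those nodes

    xs = np.linspace(a, b, 10001)
    print(np.max(np.abs(np.polyval(coeffs, xs) - np.sin(xs))))   # ~1e-9 for degree 5

Even a low degree is plenty on such a narrow range; the trade-off versus CORDIC is that evaluating the polynomial needs multipliers, which is exactly what the no-multiplier FPGA comments above are trying to avoid.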
Is this just an example? Because if I wanted to rotate a vector, I'd do it with vectors, not trig. Which requires multiplication and addition, right? What am I missing?
Indeed, I have used CORDIC for rectangular-to-polar on a little micro without multiply (and power-constrained by a coin cell). I made a code-generator for this, see https://github.com/markrages/cordic_polar
You mean if you insert x->ix and N->2^N and use exponentiation by squaring? Then you get
fun cis(x) {
  c, s <- 1, x >> N
  for 0 .. N-1 {
    c, s <- c*c - s*s, c*s << 1
  }
  return c, s
}
By comparison, CORDIC has only additions and shifts in the loop. CORDIC is also online in the sense that you don't need to know N up front, you can keep iteratively improving your approximation until you reach the limit of numeric precision. But with this function you need to know N up front.
If we're talking about fixed-point numbers, the initial assignment of s <- x >> N here is also scary, since it is the only place in which x feeds into the algorithm, and the shift means that only a small number of the bits of x are actually used in the calculation. In the worst case, if you increased N a lot in the hope of getting more accuracy, you could shift all the bits of x off and get the answer 1, 0 for every angle.
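For what it's worth, here's the same doubling idea in floating point (my sketch, not the fixed-point version above), which shows both how it works and why N has to be fixed in advance: you commit to the initial angle x / 2^N before the loop starts.

    import cmath

    def cis_by_squaring(x, n=30):
        # approximate e^{ix} as (1 + i*x/2^n)^(2^n) by repeated squaring
        z = complex(1.0, x / 2.0 ** n)
        for _ in range(n):
            z = z * z          # (c + i*s)^2 = (c*c - s*s) + i*(2*c*s), as in the loop above
        return z

    print(cis_by_squaring(1.0))    # ~ (0.5403 + 0.8415j), good to ~9 digits for n=30
    print(cmath.exp(1j))           # reference

The leftover error is mostly the magnitude drift of (1 + i*x/2^n), which shrinks as n grows; but in fixed point, growing n also shifts more bits of x away, which is the problem described above.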
What you need to do is make 32 instances of this in series. Then you can compute the sin and cos without any flip-flops, which counts as asynchronous. beta is an array indexed by count, so this will instead be replaced with `BETA_0, `BETA_1, etc. for each instance.
A pipelined version is more useful: you still make 32 instances, but leave flip-flops between each stage. This way you produce one result per cycle, but with a 32-cycle latency. The cycle time is fast since there is not much logic in each stage. This might be useful, for example, to make the logic part of a DDS.
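As a behavioural sketch of what those 32 unrolled stages compute (plain Python, not MyHDL/Verilog; widths and constants are just example values), each stage is a pure shift-and-add function of the previous stage's outputs, and in the pipelined version you'd simply put registers on the three values between stages:

    import math

    FRAC = 30                                  # fixed-point fraction bits (example)
    SCALE = 1 << FRAC
    ATAN = [int(round(math.atan(2.0 ** -i) * SCALE)) for i in range(32)]   # the `BETA_i constants
    K = int(0.6072529350088813 * SCALE)        # CORDIC gain folded into the initial x

    def stage(i, x, y, z):
        # one combinational stage: a shift, an add/sub, and an angle constant
        d = 1 if z >= 0 else -1
        return x - d * (y >> i), y + d * (x >> i), z - d * ATAN[i]

    def sincos(angle):
        x, y, z = K, 0, int(round(angle * SCALE))
        for i in range(32):                    # fully unrolled in hardware: 32 instances in series
            x, y, z = stage(i, x, y, z)
        return x / SCALE, y / SCALE            # (cos, sin)

    print(sincos(1.0))                         # ~ (0.5403, 0.8415)

With registers between the stage() calls you get one result per clock at 32 cycles of latency, as described above.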
This is way too complex; a purely combinatorial version would never meet timing in a design where you cared enough to put it in an FPGA. The question doesn't make sense on its face. You want a pipelined version.
I asked why to get to the bottom of what he's really asking.
You could use multi-cycle paths so this doesn't limit your global timing; then you would save the registers. This is much more useful for ASIC design, because on an FPGA the registers are there whether you use them or not.
However, total latency will be lower than for a pipelined version, which is nice if that's what you're optimizing for.