Hacker News new | past | comments | ask | show | jobs | submit login
Why the CORDIC algorithm lives rent-free in my head (github.com/francisrstokes)
461 points by todsacerdoti 13 days ago | hide | past | favorite | 91 comments





While the author mentions this is mostly applicable to things like FPGAs, there's also an application in gamedev (or any distributed physics simulation). Floating point calculations are tricky to get deterministic across platforms[0]. One solution is to skip floats entirely and implement a fixed-point physics engine. You'd need something like CORDIC to implement all the trig functionality.

I started working on such a thing as a fun side project a few years ago but never finished it. One of these days I hope to get back to it.

[0]: https://randomascii.wordpress.com/2013/07/16/floating-point-...


That blog post is now a decade old, but includes an important quote:

> The IEEE standard does guarantee some things. It guarantees more than the floating-point-math-is-mystical crowd realizes, but less than some programmers might think.

Summarizing the blog post, it highlights a few things though some less clearly than I would like:

  * x87 was wonky

  * You need to ensure rounding modes, flush-to-zero, etc. are consistently set

  * Some older processors don't have FMA

  * Approximate instructions (mmsqrtps et al.) don't have a consistent spec

  * Compilers may reassociate expressions

For small routines and self-written libraries, it's straightforward, if painful to ensure you avoid all of that.

Briefly mentioned in the blog post is IEEE-754 (2008) made the spec more explicit, and effectively assumed the death of x87. It's 2024 now, so you can definitely avoid x87. Similarly, FMA is part of the IEEE-754 2008 spec, and has been built into all modern processors since (Haswell and later on Intel).

There are still cross-architecture differences like 8-wide AVX2 vs 4-wide NEON that can trip you up, but if you are writing assembly or intrinsics or just C that inspect with Compiler Explorer or objdump, you can look at the output and say "Yep, that'll be consistent".


> but if you are writing assembly or intrinsics or just C that inspect with Compiler Explorer or objdump, you can look at the output and say "Yep, that'll be consistent".

Surely people have written tooling for those checks for various CPUs?

Also, is it that ‘simple’? Reading https://github.com/llvm/llvm-project/issues/62479, calculations that the compiler does and that only end up in the executable as constants can make results different between architectures or compilers (possibly even compiler runs, if it runs multi-threaded and constant folding order depends on timing, but I find it hard to imagine how exactly that would happen).

So, you’d want to check constants in the code, too, but then, there’s no guarantee that compilers do the same constant folding. You can try to get more consistent by being really diligent in using constexpr, but that doesn’t guarantee that, either.


Years ago, I was programming in Ada and ran across a case where the value of a constant in a program differed from the same constant being converted at runtime. Took a while to track that one down.

The same reasoning applies though. The compiler is just another program. Outside of doing constant folding on things that are unspec'ed or not required (like mmsqrtps and most transcendentals), you should get consistent results even between architectures.

Of course the specific line linked to in that GH issue is showing that LLVM will attempt constant folding of various trig functions:

https://github.com/llvm/llvm-project/blob/faa43a80f70c639f40...

but the IEEE-754 spec does also recommend correctly rounded results for those (https://en.wikipedia.org/wiki/IEEE_754#Recommended_operation...).

The majority of code I'm talking about though uses constants that are some long, explicit number, and doesn't do any math on them that would then be amenable to constant folding itself.

That said, lines like:

https://github.com/llvm/llvm-project/blob/faa43a80f70c639f40...

are more worrying, since that may differ from what people expect dynamically (though the underlying stuff supports different denormal rules).

Either way, thanks for highlighting this! Clearly the answer is to just use LLVM/clang regardless of backend :).


> There are still cross-architecture differences like 8-wide AVX2 vs 4-wide NEON that can trip you up

Does those differences produce different results when running cross-platform simd code?


Depends on what you're doing. The main issue here is reductions / accumulations.

That is, if you have a bunch of floats like:

  float sum = 0.f;
  for (int i = 0; i < N; i++) {
    sum += x[i];
  }
and that you vectorize that to something like (typing in this comment, errors are likely):

  __mm256 sum_8wide = _mm256_setzero_ps();
  for (int i = 0; i < N/8; i++) {
    sum_8wide = _mm256_add_ps(sum_8wide, _mm256_load_ps(x[8*i]);
  }
  // Now sum up the 8 values to get the final sum
  float sum = _mm256_hadd_ps(...);
that will result in a different accumulation than if you did those are 4-wide and then a reduction. The usual solution is to either use the lowest common denominator (e.g., use AVX instead of AVX2) or more performance oriented, use the 4-wide SIMD units on ARM to "emulate" an 8-wide virtual vector (~15 years since I wrote NEON... and again, this is in a comment):

  float32x4_t sum_lo = ; // zero;
  float32x4_t sum_hi = ; // zero;
  for (int i = 0; i < N/8; i++) {
    sum_lo = vaddq_f32(sum_lo, vload1q_f32(x[8*i));
    sum_hi = vaddq_f32(sum_hi, vload1q_f32(x[8*i + 4]));
  }
  // Reduce the sum in the same order
You would want to write a "virtual SIMD" wrapper library, so you don't do this manually in lots of places.

The author did mention about fixed point was very popular for gamedev before floating point becoming popular due the increased in hardware capability, and most likely CORDIC was being used as well together with fixed point.

> In fact, before IEEE 754 became the popular standard that it is today, fixed point was used all the time (go and ask any gamedev who worked on stuff between 1980 and 2000ish and they'll tell you all about it).


I believe that was mostly for performance reasons, not determinism, right?

Was gamedev between 1980 and 2000ish, can confirm. PS1 had no floating point unit.

This was the cause of the signature jiggly textures that were pervasive in PS1 games

This is a common misconception, but is not the case. For example, look at the Voodoo 1, 2, and 3, which also used fixed point numbers internally but did not suffer from this problem.

The real issue is that the PS1 has no subpixel precision. In other words, it will round a triangle coordinates to the nearest integers.

Likely the reason why they did this is because then you can completely avoid any division and multiplication hardware, with integer start and end coordinates line rasterization can be done completely with addition and comparisons.


Didn’t PS1 also lack perspective corrected texture mapping? That would definitely make textures wobbly. AFAIK they compensated for it simply by using as finely subdivided geometry as possible (which wasn’t very finely, really).

The folk that made Crash Bandicoot were pretty clever. They figured out that the PlayStation could render untextured, shaded triangles a lot faster than textured triangles, so they "textured" the main character with pixel-scale geometry. This in turn saved them enough memory to use a higher resolution frame buffer mode.

https://all-things-andy-gavin.com/2011/02/04/making-crash-ba...


It wasn't Pixel scale, more paint by numbers.

The nphysics physics simulation library for gamedev used this approach of using fixed point math whenever cross-platform determinism was wanted, with CORDIC. Nphysics however was deprecated.

The newer Rapier library (which is a rewrite of the nphysics) instead relies on the guarantees of IEEE-754 2008 to provide cross-platform determinism, which means that it doesn't work with old platforms, but it is deterministic across modern platforms, including wasm. And yes, you can't rely on the transcedental routines provided by each platform (like sine, cosine, etc), those need to be implemented in a way to work the same everywhere. But, this is possible if avoid running on non-compliant platforms.

https://www.rustsim.org/blog/2020/06/01/this-month-in-rustsi...

https://rapier.rs/docs/user_guides/rust/determinism/


Fun facts, in addition to sine and cosine calculation and generation, CORDIC can also be used for many operations including log, exponent, square root, vector magnitude, polar-cartesian conversion and vector rotation. Spoiler alerts, the authors provided the teaser regarding these propects in the conclusions.

I've got the feeling that by using quaternion instead of conventional orthonormal matrices, CORDIC based operations can be executed more efficiently (less compute cycles and memory) and effectively (less errors) [1].

[1] CORDIC Framework for Quaternion-based Joint Angle Computation to Classify Arm Movements:

https://core.ac.uk/works/8439118


It can be extended to arbitrary Lie groups IIRC

I remember in high school precalc, we learned about the Taylor series, and my teacher told us that it was how trig functions were actually implemented on calculators. Well I looked it up and found that it was actually CORDIC, and went and had some fun implementing it in TI Basic

I thought you might be interested to read how the remarkable Sinclair scientific calculator did it's trig, log etc. It wasn't Cordic but the algorithm has similarities.

http://files.righto.com/calculator/sinclair_scientific_simul...


Well, are there _any_ calculators out there that use Taylor expansions?

I've done some work with fast approximate math libraries,and went down this path. Taylor was a dead-end, but Chebyshev polynomials work great. They have the nice property of having (close to) the minimum maximum (minimax) error over the entire range you are approximating.

well, from memory (this was > 20y ago)

  - Taylor gives you polynomial optimized for the evaluation of one point.
  - naive Newton gives you a polynomial with 0 error in the interpolation points [but the system has a bad condition]
  - Chebychev gives you a polynomial with minimal error over the interval of choice.
So there is no real reason to ever use the Taylor series to approximate a function over an interval. The high school math teachers are lying.

not lying, probably just mistaken

Some related articles with hardware implementations: https://arxiv.org/pdf/2211.04053 https://hal.science/hal-01327460/document https://archive.ll.mit.edu/HPEC/agendas/proc05/Day_1/Abstrac...

I would love to see how it compares to regular software and hardware trigonometric implementations on various kinds of hardware through the ages.


Thanks for the links.

It is a very strange that for a very pervasive and hugely popular computer techniques, CORDIC do not get a proper detailed treatment and coverage in books. Since IoT and machine-to-machine communication are taking off, and with the efficiency of CORDIC implementation and operation, it usages will most probably increase exponentially, hence good references are needed for correct and optimized implementation.

The two anomalies are books by Prof. Omondi, and Prof. Deschamps [1],[2].

[1]Computer-Hardware Evaluation of Mathematical Functions:

https://www.worldscientific.com/worldscibooks/10.1142/p1054

[2]Guide to FPGA Implementation of Arithmetic Functions:

http://www.arithmetic-circuits.org/guide2fpga/vhdl_codes.htm


sin and cos are often used to rotate vectors. In these cases, a trick with a CORDIC is to avoid the traditional sin/cos/multiply calculation by using the vector to be rotated as the input to the CORDIC. The CORDIC will directly produce the rotated vector, without having to compute sin/cos or do a complex multiplication.

CORDIC really shines when latency doesn't matter so much. You can pipeline every stage of the computation and get a huge throughput. It's well suited for digital mixing in radio systems.


Side note for 2023: Some modern MCUs are low cost, but have FPUs. A good example is the STM32G4. So, unlike on an M0 MCU or w/e, you can freely use f32, if you don't want fixed point. And you can get these for ~$1-2 / MCU.

That said... The G4 also has a hardware CORDIC peripheral that implements this algo for fixed-point uses. Is this mainly to avoid floating point precision losses? You program it using registers, but aren't implementing CORDIC yourself on the CPU; dedicated hardware inside the IC does it.


the cheapest cortex-m4fs in stock on digi-key, trying to omit duplicates, are the 3-dollar nuvoton m481le8ae https://www.digikey.com/en/products/detail/nuvoton-technolog..., the 3-dollar maxim max32660 https://www.digikey.com/en/products/detail/analog-devices-in..., and the 5-dollar atmel atsamd51 https://www.digikey.com/en/products/detail/microchip-technol.... their cheapest stm32g4 is the stm32g441kbt6 which rounds to 4 dollars https://www.digikey.com/en/products/detail/microchip-technol...

where do you get them for under 2 dollars?

(digi-key does list the nuvoton chip for barely under 2 dollars in quantity 500)


https://jlcpcb.com/partdetail/Stmicroelectronics-STM32G431CB...

JLC: $2.1 for qty=1. $1.3 at 1k qty.

Loads of stock. It's a low flash and low RAM variant, but sufficient for some purposes. 170Mhz CPU with many onboard periphs.


fantastic, thanks!

The second Parallax Propeller chip has a CORDIC engine in silicon. It's fast, and handles 64bit intermediate products, which makes the divide and trig stuff more than adequate precision for most things. One can always increase precision in software too.

I was late to the game learning about CORDIC. I had used fixed point a lot in 8 and 16 bit assembly land for performance, and determinism.

When I found out about it, I was stunned! It was fast, and only required basic math capability to be useful.


I found that a lack of barrel shifter or at least micro-coded shift 'n' hampers Cordic in an 8 bit CPU but I coded a routine up for fun in 6502. I never found a use for it myself as there always seems to be a faster way to do things for games programming at least.

https://forums.atariage.com/blogs/entry/3385-atan2-in-6502/


Yes a barrel shifter would have been a nice addition to the 6502. L

Yes, general purpose code is rarely the answer. I think all of us see that after a while.

Then, after that process, one starts coding differently. Chain stuff together, using common ops, pulling off intermediate values, etc... For most things, a fast way exists, but it will often be a tweak or two to make sense in the context.


I looked it over and frankly, I could not do better, or maybe even as good. Nice work!

Thanks!

I got reminded of a rather cute snippet of code I had a part in.

One needed to find the coordinates of the bisector of an angle subtended by an arc of a unit circle. They (x,y) coordinates of the 'arms' were available. The existing implementation was a mass of trigonometry - going from the x,y coordinates to polar (r,θ), then making sure that the θ computed was in the correct quadrant, halving the θ and converting back to x,y coordinates. All in all, many calls to trigonometric functions and inverse functions.

This was in Python and thus we had first class access to complex numbers. So all that was needed was to define two complex numbers. z1 from (x1,y1) z2 from (x2,y2) and then take the geometric mean of the product: √(z1*z2). Done.

No explicit trigonometry and no explicit conversion and back in the new code.


Reminds me of this article which I'm often drawn back to.

https://fgiesen.wordpress.com/2010/10/21/finish-your-derivat...


That was a good read, thanks. Bookmarked

> Now, it's fairly obvious that rotating by say 22.75˚ is the same as rotating by 45˚ and then -22.5˚ - i.e. we can break up a rotation into smaller parts, with both positive and negative components.

Wouldn't that be rotating by 22.5°? Is this an error in the article or my understanding?


An error in the article.


Meagher's octree system notably used only integer arithmetic with no integer multiplication/division:

> Efficient (linear time) algorithms have been developed for the Boolean operations (union, intersection and difference), geometric operations (translation, scaling and rotation), N-dimensional interference detection, and display from any point in space with hidden surfaces removed. The algorithms require neither floating-point operations, integer multiplications, nor integer divisions.

https://doi.org/10.1016/0146-664X(82)90104-6

This facilitated creation of fast, custom VLSI graphics acceleration hardware for octree representation.


Nice to read this.

Curious how CORDIC performs against cubic or other polynomial interpolation with a small table. I was taught that resource constrained synthesizers sometimes used cubic interpolation, and this was presumably when CORDIC was relatively new.

At a glance, the single bit of precision you get with each iteration of CORDIC means it would be more expensive in compute but cheaper in space than a polynomial.

But on the space note I feel we should highlight that it can be cheaper than the article might suggest with the 4096 entry lookup tables for sin(x) - Only a quarter of the full circle is needed due to symmetry.


Back in the day gamedevs (and demosceners) used a merely 256-entry LUT for sin and cos – it was convenient to have byte-sized angles that auto-wraparound, and for rotation in 2D games, 2^8 was quite enough. Wouldn’t get you very far in 3D though if you want smooth motion.

The first place I personally saw CORDIC mentioned was:

"How Many Ways Can You Draw a Circle?", the first *Jim Blinn's Corner* article:

https://ieeexplore.ieee.org/document/4057251

He found 15 ways, which generalized to methods like CORDIC, etc.

A couple years ago I was wondering if SDF shape representation could use something like a taxi cab distance metric to move floating point operations to integer, although I haven't spent too much time investigating it.


so, it's a https://en.wikipedia.org/wiki/Farey_sequence (aka mediant, aka naive fraction sum for approximation) but for angles?

It is not, because the subdivisions in CORDIC do not possess the best approximation properties of Farey fractions. For that, you would have to partition the circle into major/minor arcs instead, in the sense of Hardy-Littlewood's circle method. But that would be computationally very expensive, whereas this binary subdivision is fast, and is made faster using the shift-add tricks.

i see. thanks. assumed it was just bit shift tricks implementation. but indeed, no mediant.

More like a binary search, but instead of adjusting an array index you're iterating an orthonormal matrix which represents rotation about the origin.

Or a second order step response.

That’s super cool

Thank you for that link, very interesting diagrams and structures

They look very similar to what a neural network might be in certain cases, and it seems like the structures are trying to map some sort of duality


Most important keywords are:

BLDC + motor control + sensorless

CORDIC is a de facto standard algorithm in the relevant application area.

Infineon explains algorithms compactly and clearly:

https://www.infineon.com/dgdl/AP1610510_CORDIC_Algorithm.pdf

Microchip gives an idea of where and what the algorithm is used for:

https://ww1.microchip.com/downloads/aemDocuments/documents/O...

CORDIC hardware accelerators are found in many microcontrollers designed to control electric motors.


I took a stab at a Cordic ATAN2 function in 6502 years ago. This was the first pass. I meant to go back and write a faster version.

https://forums.atariage.com/blogs/entry/3385-atan2-in-6502/


I think the love for this algorithm feels similar to my love for Raycasting. You get a 3D game on barely any hardware with no trigonometry. Like… how?! It makes complete sense once I implemented it, but it still feels like wizardry.

I have a few vintage programmable HP calculators that implement CORDIC on their 4-bit CPUs. You can also program them to calculate the Taylor expansion of sin() as another approximate method.

If you liked this, it’s worth reading Donald Knuth seminal “The Art of Computer Programming” which explains a number of mathematical algorithms by example.


This used to be all the rage when folks were into DSP

cos + jsin baby.

If, having it in your head, you could use it to calculate trig functions in your head, that would be a way of it paying rent.

So, yeah.


Cool algorithm. Might come in handy soon for running neural networks on less powerful hardware :)

The primary advantage of CORDIC is that it saves space in the design.

Lookup tables with interpolation are much faster


Polynomial interpolation for sin/cos is even better when you have access to an FPU: https://gist.github.com/publik-void/067f7f2fef32dbe5c27d6e21...

Neural networks usually don't use trigonometry or complex numbers.

From the neural network Wikipedia page:

> The signal each neuron outputs is calculated from this number, according to its activation function

The activation function is usually a sigmoid, which is also usually defined in terms of trigonometric functions

Which neural networks don’t use trigonometric functions or equivalent?


In current neural networks the activation function usually is not sigmoid but something like ReLU ( y = 0 if x<0 else x ), and in any case the computation of activations is not meaningful part of total compute - for non-tiny networks, almost all the effort is spent on the the large matrix multiplications of the large layers before the activation function, and as the network size grows, they become even less relevant, as the amount of activation calculations grows linearly with layer size but the core computation grows superlinearly (n^2.8 perhaps?).

Really?

It's literally been decades, but the last I learned about neural networks, a non-linear activation function was important for Turing completeness.


Yes, some non-linearity is important - not for Turing completeness, but because without it the consecutive layers effectively implement a single linear transformation of the same size and you're just doing useless computation.

However, the "decision point" of the ReLU (and it's everywhere-differentiable friends like leaky ReLU or ELU) provides a sufficient non-linearity - in essence, just as a sigmoid effectively results in a yes/no chooser with some stuff in the middle for training purposes, so does the ReLU "elbow point".

Sigmoid has a problem of 'vanishing gradients' in deep networks, as the sigmoid gradients of 0 - 0.25 in standard backpropagation means that a 'far away' layer will have tiny, useless gradients if there's a hundred sigmoid layers in between.


ReLU are indeed non-linear, despite the confusing name.

The nonlinearity (plus at least two layers) are required to solve nonlinear problems (like the famous XOR example).


neural networks aren't turing complete (they're circuits, not state machines) and relu is not just nonlinear but in fact not even differentiable at zero

Most activations functions are based on exponential functions.

I don't immediately see how cordic would help you calculate e^(x) unless x is complex valued (which it is usually not).


see section B of https://eprints.soton.ac.uk/267873/1/tcas1_cordic_review.pdf. Generalized cordic can compute hyperbolic trig functions which gives you exponentials. That said, it's still not useful for ML because it's pretty hard to beat polynomials and tables.

Interesting, thanks.

You just need a vague approximation. Even CORDIC is overkill.

clamp(x/2, -1, 1) is basically a sigmoid.


“Lives rent-free in my head” is a horrible cliche. By now, it has the same effect as saying “it’s nice” but uses lots more letters.

And what does this pedantry solve, exactly? It clearly is something the author thinks often enough about and enjoys talking about.

Let people enjoy things.


I thought it was more often "negative" like you cant stop thinking about some dumbass getting disproportionate affect on the world

I take it as short for the "why are you still carrying her?" parable https://medium.com/@soninilucas/two-monks-and-a-woman-zen-st... - something you should have let go and forgotten.

Doesn't seem to be what the author meant in this case.


Having never heard the expression before, I thought it meant it wouldn't be forgotten, like learning to ride a bicycle kinda thing.

It's a fine phrase, evocative, that I agree has become cliché. But that's not the problem here. The problem here is that it's being misused.

It works when there's something that is problematic for taking up space in your head. It colors the way you think in a way you wish was not the case.

While I could strain and make that sort of make sense here (eg it uses up some capacity that would otherwise allow the author to remember other more immediately useful algorithms), it's clear from the article that the author only means "I still remember the CORDIC algorithm" and nothing more.

Not a big deal, but it momentarily annoyed me as a "clickbait and switch" headline.


I never liked that phrase either. So what exactly lives in your head and pays rent?

Kubernetes, which I would evict if it stopped paying rent.

the "rent" in the saying is the effort to learn and upkeep a certain piece of knowledge in your head, to the point where you can recall it as needed

living rent free generally means that you are able to easily recall a piece of knowledge (and often do, even if you don't want to).

something that pays rent might be like an important concept for an upcoming school test, for a class you're not interested in.


I always thought rent was utility, and a piece of knowledge or an opinion that didn't pay rent was one that provided little or no practical benefit to you.

The term in its original use described an obsession with a particular group or thing and called it out as pointless. I don't think it would make sense for rent to mean effort in this original sense, nor in the broader way it's used here.


I have no idea what pseudo-intellectual discourse this is trying to be but the phrase "X lives rent free in my head" means only that "I think of X a lot because I enjoy thinking about it". Nothing more.

Go read https://knowyourmeme.com/memes/rent-free and learn the origins and predominant usage of the phrase.

In addition, it literally says "rent free". There's nothing intellectual or pseudo-intellectual about reading the phrase "rent free" and concluding that the information does not pay anything to live in your head. This is basic reading comprehension.


JFC nothing pays rent to live in your head. I don't see the point of analyzing something so literally, completely sidestepping the content of the article in question.

Get a hobby.


Knowledge work

I didn't finish yet but I was thinking he's talking about memorizing the algorithm to calculate sines in your head.

[flagged]


What kind of Marxist keeps their ideas in line by collecting rent from some of them but not others?

> “Lives rent-free in my head” is a horrible cliche.

Yes it is, much like "dumpster fire" or "toxic" or "gaslighting" or any other contemporary Reddit-isms that are associated with hyper-reductionist cultural hot takes. My personal distaste for them, however, has no bearing on the fact that they have very real meaning and people enjoy using them.

> By now, it has the same effect as saying “it’s nice” but uses lots more letters.

The commonly-held meaning of this is that it describes a pervasive, intrusive thought, not a nice one. This particular turn of phrase is often used to describe how politicians devote energy to talking about other politicians that they don't like, for instance.


"Lives rent free in my head" lives rent free in my head.

Hey it's a quine, too!




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: