
Fast Inverse Square Root - bertzzie
http://blog.quenta.org/2012/09/0x5f3759df.html
======
ColinWright
In case anyone is interested in reading other accounts and the HN discussions
that accompany them, here are a few previous submissions of other articles:

<http://news.ycombinator.com/item?id=213056> <- Some comments

<http://news.ycombinator.com/item?id=419166> <- Some comments

<http://news.ycombinator.com/item?id=573912>

<http://news.ycombinator.com/item?id=896092> <- Some comments

<http://news.ycombinator.com/item?id=1599635>

<http://news.ycombinator.com/item?id=2332793> <- Some comments

<http://news.ycombinator.com/item?id=3115168> <- Some comments

<http://news.ycombinator.com/item?id=3259199>

------
psykotic
I remember re-deriving this a few years ago. Short answer: If x = 2^n then
rsqrt(x) = 2^(-n/2), so the exponent just gets negated and divided by 2. If x
= (1 + m) 2^n where |m| < 1 is the fractional part of the mantissa, then
rsqrt(1 + m) = 1 + rsqrt'(1) m + O(m^2) = 1 - m/2 + O(m^2) is the first-order
Taylor expansion around 1. Thus rsqrt(x) ~= (1 - m/2) 2^(-n/2) so both m and n
get negated and divided by 2.
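
A quick numeric check of that first-order step (my own illustration, done with
the ordinary math library rather than any bit tricks):

    
    
      #include <math.h>
      #include <stdio.h>
      
      /* Decompose x = (1 + m) * 2^n with 0 <= m < 1, then compare the
       * first-order approximation (1 - m/2) * 2^(-n/2) to the true 1/sqrt(x). */
      int main(void)
      {
          int e;
          float x = 4.5f;
          float f = frexpf(x, &e);        /* x = f * 2^e, with f in [0.5, 1) */
          float m = 2.0f * f - 1.0f;      /* rewrite as x = (1 + m) * 2^(e-1) */
          int   n = e - 1;
      
          float approx = (1.0f - 0.5f * m) * powf(2.0f, -0.5f * (float)n);
          printf("approx = %f, exact = %f\n", approx, 1.0f / sqrtf(x));
          return 0;
      }
    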

This immediately suggests getting an initial guess for Newton's method by just
shifting the bitwise form of an IEEE-754 floating point number by 1 and then
doing some fix-ups for the negation.

The only hitch is that the exponent is stored in biased form rather than 2's
complement form, so you have to first subtract out the bias, do the shift, and
add back in the bias. Also, shifting can cause a least-significant 1 bit from
the exponent to shift down into the most significant bit of the mantissa. But
that turns out to be useful since it amounts to using the approximation
sqrt(2) ~= 1.5. But doing all these bitwise operations is a lot of extra work
compared to just doing a single shift. It turns out you can get almost the
same effect with just a single addition of an immediate operand. Most of the
bits of this addend are already predetermined by the need to handle the
biasing and unbiasing in the exponent, and the remainder can easily be
determined by brute force (the search space is only of size 2^24 or so) by
just minimizing the approximation error over some test set. Doing that, I was
able to converge to almost exactly the same 'magic number' as in Carmack's
method.
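
Here is roughly what that brute-force step looks like (a sketch with my own
choice of search window and test set, not the code I actually used):

    
    
      #include <math.h>
      #include <stdint.h>
      #include <stdio.h>
      #include <string.h>
      
      /* Shift-and-subtract first guess for 1/sqrt(x), parameterized by the
       * magic constant we are searching for. */
      static float rsqrt_guess(float x, uint32_t magic)
      {
          uint32_t i;
          memcpy(&i, &x, sizeof i);   /* grab the bits without aliasing trouble */
          i = magic - (i >> 1);       /* negate-and-halve exponent and mantissa */
          memcpy(&x, &i, sizeof x);
          return x;
      }
      
      int main(void)
      {
          uint32_t best = 0;
          double best_err = 1e9;
      
          /* Scan the low bits of the constant; the high bits are pinned down by
           * the exponent bias handling, so only a window needs to be searched. */
          for (uint32_t lo = 0; lo < (1u << 20); lo++) {
              uint32_t magic = 0x5f300000u + lo;
              double worst = 0.0;
              for (float x = 0.25f; x < 4.0f; x += 0.01f) {
                  double err = fabs(rsqrt_guess(x, magic) * sqrt(x) - 1.0);
                  if (err > worst) worst = err;
              }
              if (worst < best_err) { best_err = worst; best = magic; }
          }
          printf("best constant 0x%08x, worst relative error %.4f\n",
                 best, best_err);
          return 0;
      }
    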

My coworker Charles Bloom has a series of posts on this and related matters:
[http://cbloomrants.blogspot.com/2010/11/11-20-10-function-ap...](http://cbloomrants.blogspot.com/2010/11/11-20-10-function-approximation-by_20.html)

Here's an old email exchange I had with another coworker last time I was
thinking about this stuff, where I tried to derive as many bits of the
constant as possible by reasoning:
<https://gist.github.com/4e1d0dff97193fe5d745>

Historical aside: How this algorithm came to be associated with Carmack is an
amusing and roundabout tale. My partner in crime on Telemetry at RAD is Brian
Hook. Hook was one of the first employees at 3dfx, where he designed and
implemented the GLIDE graphics API. Gary Tarolli was the co-founder and CTO at
3dfx, and Hook got the rsqrt code from him. When Hook left 3dfx to work with
Carmack on Quake 2 and 3 at id software, he brought that code with him and
copied it into the codebase. Carmack never had anything to do with it. It's my
recollection Tarolli didn't develop the algorithm either--he had gotten it
from someone else in turn.

~~~
packetslave
The more-or-less definitive history of the "Carmack" fast inverse sqrt:
<http://www.beyond3d.com/content/articles/8/>

------
jimminy
I've always liked the post provided by BetterExplained[0], but this article is
more thorough while remaining simple. Would probably still use the
BetterExplained post for a first introduction, I think.

[0]: [http://betterexplained.com/articles/understanding-quakes-fas...](http://betterexplained.com/articles/understanding-quakes-fast-inverse-square-root/)

------
boas
This line:

> x = *(float*)&i;

breaks strict aliasing and has undefined behavior. If you compile it with g++
-O2, you may get unexpected results. The solution is to use a union, or to
compile with -fno-strict-aliasing.
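
For example, a minimal sketch of the union version (my own code, assuming the
integer holding the bit pattern is 32 bits wide):

    
    
      #include <stdint.h>
      
      /* Store through one union member, read through the other, instead of
       * casting an int pointer to a float pointer. */
      union pun { float f; int32_t i; };
      
      static float float_from_bits(int32_t i)
      {
          union pun u;
          u.i = i;
          return u.f;
      }
    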

I used to have a lot of code like this, which worked fine on older compilers,
but not on newer compilers. It's perfectly reasonable code, so I'm not sure
why the newer compilers don't produce the expected behavior by default.

~~~
mistercow
>I'm not sure why the newer compilers don't produce the expected behavior by
default.

The reason is that there are optimizations a compiler can do when it knows
that two particular pointers will not reference the same location in memory at
a given time (when two pointers do, they are said to alias each other). These
optimizations have to do with memory access. If the pointers cannot be proven
not to alias, then the order in which the pointed-to values are loaded,
modified, and stored stays very rigid, since it is impossible to tell whether
changing the order will change the end result. Sometimes that's just life; you
might have some complex interactions going on with your pointers, and it may
actually be that the order of operations can't be changed.

But if you can assume that the pointers _don't_ alias each other, then the
compiler is free to change the order of operations for better efficiency. It
can often remove unnecessary loads and stores (if x and y don't alias, you
don't have to reload y just because you changed x), and it can also group
loads together at the beginning and stores together at the end, which gives
the instruction scheduler more freedom and keeps intermediate values in
registers.

The catch is that in general, there is no way for the compiler to know whether
two pointers can alias each other. In C99, you can inform the compiler of this
with the restrict keyword, but that requires that you do some careful analysis
of your code to make sure you aren't lying to the compiler. If you mess up,
you may end up with bugs that are very difficult to track down.
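
A toy illustration of restrict (my own example, not from the article):

    
    
      /* By marking dst and src restrict, the programmer promises they never
       * overlap, so the compiler may reorder, batch, or vectorize the loads
       * and stores freely. */
      void scale(float *restrict dst, const float *restrict src, float k, int n)
      {
          for (int i = 0; i < n; i++)
              dst[i] = k * src[i];
      }
    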

So that's where strict aliasing comes in. Usually, two pointers of different
types won't point at the same memory location. So they added the "strict
aliasing" rule to make that fact official, because it means that in the vast
majority of cases, code gets a nice speedup without any changes or difficult
reasoning. Using a union is basically a way of informing the compiler that you
need it to make an exception this one time.

I used to use a macro something like this to make the process concise:

    
    
      #define PUN(toType, x) (((union {__typeof__(x) __a; toType __b;})(x)).__b)
    

Then you'd do:

    
    
      float xhalf = 0.5f*x;       // from the original routine
      int i = PUN(int, x);        // evil floating point bit level hacking
      i = 0x5f3759df - (i >> 1);  // what the fuck?
      x = PUN(float, i);
      x = x*(1.5f-(xhalf*x*x));

~~~
premchai21
I actually use macros along the lines of:

    
    
      #define LOAD(TYP, ptr) (*((TYP *)memcpy((TYP[1]){ }, ptr, sizeof(TYP))))
      #define REINTERPRET(AS_TYP, FROM_TYP, val) LOAD(AS_TYP, (FROM_TYP[1]){ val })
    

In practice (that I've found, with GCC), the memcpy and single-use temporaries
get optimized away entirely. In strict C99, writing one member of a union and
then reading a different member of the same union is undefined in the general
case, last I checked.
[http://cellperformance.beyond3d.com/articles/2006/06/underst...](http://cellperformance.beyond3d.com/articles/2006/06/understanding-strict-aliasing.html)
seems to agree, but suggests that every major compiler recognizes it as a _de
facto_ idiom and supports it anyway.
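
For what it's worth, usage of the macros above looks something like this (my
own example, assuming <stdint.h> is included and a 32-bit float):

    
    
      uint32_t bits = REINTERPRET(uint32_t, float, 1.0f);   /* 0x3f800000 */
      float    back = REINTERPRET(float, uint32_t, bits);   /* 1.0f */
    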

~~~
mistercow
Yep, type punning through a union is nonstandard and unportable, but if I'm
not mistaken, it is a documented feature in GCC.

~~~
stephencanon
Type punning through a union is explicitly defined to work in C99 and C11
(footnote 95 in C11):

> If the member used to read the contents of a union object is not the same as
> the member last used to store a value in the object, the appropriate part of
> the object representation of the value is reinterpreted as an object
> representation in the new type as described in 6.2.6 (a process sometimes
> called ‘‘type punning’’).

Please don't help spread the myth that compiler writers can break this idiom.
That said, I use memcpy in my own code.

~~~
premchai21
All right, I see it in C99 as well now (§6.5.2.3 footnote 82, if I'm not
mistaken). Thanks for the correction.

------
prezjordan
Finally an article that really breaks it down for someone like me (college
calculus knowledge) to understand. Thanks for sharing this, super interesting.

------
jgeralnik
If anybody wants to further play around with this, I wrote the following
script in python using matplotlib:

    
    
      import numpy
      import matplotlib.pyplot as plt
      import struct
      
      def power(x, p):
          # reinterpret the float's bits as a 32-bit int
          i = struct.unpack("i", struct.pack("f", x))[0]
          # generalized magic-constant approximation for x**p
          i = int((1 - p) * 0x3f7a3bea + p * i)
          # reinterpret the bits back as a float
          return struct.unpack("f", struct.pack("i", i))[0]
      
      p = -0.5
      xs = numpy.arange(0.1, 100, 0.1)
      ys1 = [x**p for x in xs]
      ys2 = [power(x, p) for x in xs]
      
      plt.plot(xs, ys1, label="Real function")
      plt.plot(xs, ys2, label="Approximation")
      
      plt.title("Graph of x^%g" % p)
      plt.legend(loc="upper left")
      plt.savefig("graph.png")
    

Change p to whatever you want (even values greater than 1) to see different
powers.

------
smoyer
A long time ago (the late '80s or early '90s) I needed a square root routine
for a floating point number and found an article by Jack Crenshaw (a real
rocket scientist) describing a routine that always converged in less than 5
iterations. First you normalized your FP number (converted the mantissa to
1.xxxxx) while adjusting the exponent, and then multiplied by 1.38, which gave
a surprisingly close answer.
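
Something along these lines, maybe (a rough sketch of the general shape, not
Crenshaw's actual routine; the initial-guess constants here are placeholders,
not his 1.38):

    
    
      #include <math.h>
      #include <stdio.h>
      
      /* Normalize x so the mantissa is small and the exponent even, seed with a
       * crude linear guess, then let Newton's method converge (assumes x > 0). */
      static float newton_sqrt(float x)
      {
          int e;
          float m = frexpf(x, &e);            /* x = m * 2^e, m in [0.5, 1) */
          if (e & 1) { m *= 2.0f; e--; }      /* make the exponent even */
      
          float y = 0.59f + 0.42f * m;        /* rough first guess for sqrt(m) */
          for (int i = 0; i < 4; i++)
              y = 0.5f * (y + m / y);         /* Newton step for y*y = m */
      
          return ldexpf(y, e / 2);            /* reattach half the exponent */
      }
      
      int main(void)
      {
          printf("%f vs %f\n", newton_sqrt(2.0f), sqrtf(2.0f));
          return 0;
      }
    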

I'm guessing that (if I could find the article) the constant used in this
floating point technique and the one used in this article are somehow
inversely related too!

In any case, it's a very nice article ... thanks!

~~~
Nick_C
Gee, you've triggered a memory for me too. I remember doing something very
similar (1.38 was the trigger) around the same time. I was probably writing
floating point routines for processors that didn't have them, perhaps the
Z-80.

------
IgniteTheSun
Not to be too pedantic, but this is the reciprocal of the square root. (The
inverse of the square root operation would be the square operation.)

~~~
bo1024
On the other hand, you could parse it as inverse (square root). Inverse of x
is 1/x.

------
angersock
There was a nifty article on Flipcode long ago (
[http://www.flipcode.com/archives/Fast_Approximate_Distance_F...](http://www.flipcode.com/archives/Fast_Approximate_Distance_Functions.shtml)
) in a similar vein.

------
nicholassmith
I remember seeing the article about the history of this pop up and when it
started getting into the maths my brain wasn't keen, but this is a really
simple and great explanation of even the heavier aspects of the math involved.

------
SeanLuke
If only there were an equivalent fast atan2.

------
k2xl
i've seen this multiple times in the past few years, each time i've seen it
i've understood just a tiny bit more... i would love to see an interview with
the original author of this...

~~~
andrewcooke
my impression (from reading the clang/llvm posts a while back) was that they
tend to assume that anything with undefined behaviour in the spec returns a
value that makes the code faster (eg. by simply ignoring the statement
altogether).

edit: [http://blog.llvm.org/2011/05/what-every-c-programmer-should-...](http://blog.llvm.org/2011/05/what-every-c-programmer-should-know.html)
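
A classic illustration of that (my own example, not from the linked post):

    
    
      /* Signed overflow is undefined, so the compiler may assume x + 1 never
       * wraps; at -O2 this "overflow check" is typically folded to return 0. */
      int will_overflow(int x)
      {
          return x + 1 < x;
      }
    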

~~~
jtlienwis
I am wondering whether, in a lot of code, taking this inverse square root is
even necessary. When calculating length = sqrt(x^2+y^2), often you just want
to find a minimum or maximum value; since sqrt is monotonic, taking the min or
max of x^2+y^2 gives the same answer, so calculating the sqrt is not
necessary.
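
A minimal sketch of that idea (my own example): to pick the shortest vector
you can compare squared lengths directly, since sqrt is monotonic.

    
    
      #include <stddef.h>
      
      /* Returns the index of the shortest (x, y) vector; assumes n >= 1.
       * No sqrt (or rsqrt) is needed until you want the actual length. */
      size_t shortest(const float *xs, const float *ys, size_t n)
      {
          size_t best = 0;
          float best_d2 = xs[0] * xs[0] + ys[0] * ys[0];
          for (size_t i = 1; i < n; i++) {
              float d2 = xs[i] * xs[i] + ys[i] * ys[i];
              if (d2 < best_d2) { best_d2 = d2; best = i; }
          }
          return best;
      }
    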

------
FredBrach
_why?_

Also because x * (1/sqrt(x)) == sqrt(x), so on a Pentium a sqrt was only 3
cycles more than an inv-sqrt: pretty cheap for a sqrt.
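
In code, that is just one extra multiply on top of whatever fast rsqrt you
already have (sketch below; the rsqrt body is the standard magic-constant
routine from the article, written with memcpy to stay aliasing-safe):

    
    
      #include <stdint.h>
      #include <string.h>
      
      static float fast_rsqrt(float x)
      {
          float xhalf = 0.5f * x;
          uint32_t i;
          memcpy(&i, &x, sizeof i);           /* bits of x */
          i = 0x5f3759df - (i >> 1);          /* initial guess */
          memcpy(&x, &i, sizeof x);
          return x * (1.5f - xhalf * x * x);  /* one Newton-Raphson step */
      }
      
      static float fast_sqrt(float x)
      {
          return x * fast_rsqrt(x);           /* sqrt(x) == x * (1/sqrt(x)) */
      }
    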

