
Fast inverse square root - r11t
http://en.wikipedia.org/wiki/Fast_inverse_square_root
======
DarkShikari
Curiously enough, today, such a hack is completely pointless (at least on
x86): the _rsqrtps_ instruction can do 4 of these in just 3 clock cycles, with
higher accuracy to boot.

Most modern instruction sets with remotely decent floating point support have
a similar instruction, in large part _because_ of the prevalence of hacks like
this in the past.

~~~
andreyf
This particular hack, yes, but the general point stands. Sometimes hacks make
for great code. The point is that _this_ is good code:

    
    
        float h = 0.5f*x;
        int i = *(int*)&x;
        i = 0x5f3759df - (i>>1);
        x = *(float*)&i;
        return x*(1.5f - h*x*x);
    

That's 3 local variables, 5 short statements... makes me wonder how quickly
one could search/test all combinations of sensible 5-statement functions to
find other such "hacks"...

~~~
DarkShikari
When you're willing to make an approximation you can come up with all sorts of
very short functions for common operations.

Here's one from my project, which takes an integer x and calculates a (float)
log2(x) in an absurdly low number of operations.

    
    
        static inline float x264_log2( uint32_t x )
        {
            int lz = __builtin_clz( x );
            return x264_log2_lut[(x<<lz>>24)&0x7f] + x264_log2_lz_lut[lz];
        }
    
        const float x264_log2_lut[128] = {
            0.00000, 0.01123, 0.02237, 0.03342, 0.04439, 0.05528, 0.06609, 0.07682,
            0.08746, 0.09803, 0.10852, 0.11894, 0.12928, 0.13955, 0.14975, 0.15987,
            0.16993, 0.17991, 0.18982, 0.19967, 0.20945, 0.21917, 0.22882, 0.23840,
            0.24793, 0.25739, 0.26679, 0.27612, 0.28540, 0.29462, 0.30378, 0.31288,
            0.32193, 0.33092, 0.33985, 0.34873, 0.35755, 0.36632, 0.37504, 0.38370,
            0.39232, 0.40088, 0.40939, 0.41785, 0.42626, 0.43463, 0.44294, 0.45121,
            0.45943, 0.46761, 0.47573, 0.48382, 0.49185, 0.49985, 0.50779, 0.51570,
            0.52356, 0.53138, 0.53916, 0.54689, 0.55459, 0.56224, 0.56986, 0.57743,
            0.58496, 0.59246, 0.59991, 0.60733, 0.61471, 0.62205, 0.62936, 0.63662,
            0.64386, 0.65105, 0.65821, 0.66534, 0.67243, 0.67948, 0.68650, 0.69349,
            0.70044, 0.70736, 0.71425, 0.72110, 0.72792, 0.73471, 0.74147, 0.74819,
            0.75489, 0.76155, 0.76818, 0.77479, 0.78136, 0.78790, 0.79442, 0.80090,
            0.80735, 0.81378, 0.82018, 0.82655, 0.83289, 0.83920, 0.84549, 0.85175,
            0.85798, 0.86419, 0.87036, 0.87652, 0.88264, 0.88874, 0.89482, 0.90087,
            0.90689, 0.91289, 0.91886, 0.92481, 0.93074, 0.93664, 0.94251, 0.94837,
            0.95420, 0.96000, 0.96578, 0.97154, 0.97728, 0.98299, 0.98868, 0.99435,
        };
    
        /* Avoid an int/float conversion. */
        const float x264_log2_lz_lut[32] = {
            31,30,29,28,27,26,25,24,23,22,21,20,19,18,17,16,15,14,13,12,11,10,9,8,7,6,5,4,3,2,1,0
        };
    

The resulting asm, just 7 instructions:

    
    
          bsr     ecx, eax
          xor     ecx, 0x1f
          movss  xmm0, [ecx*4+x264_log2_lz_lut]
          shl     eax, cl
          shr     eax, 0x18
          and     eax, 0x7f
          addss  xmm0, [eax*4+x264_log2_lut]

~~~
axod
Could duplicate the x264_log2_lut table, then you wouldn't need the 'and eax,
0x7f'.... 6 instructions.

~~~
DarkShikari
Yeah, though at that point I'd be start to worry about whether that's a fair
L1 cache tradeoff, given that it only saves a single (integer!) instruction
and requires 512 bytes of memory.

~~~
axod
Sure, always a trade off when you get to that sort of point, and not _much_
payoff.

------
dkersten
I remember when this was being discussed over at gamedev.net, where Chris
Lomont came up with his version. The wikipedia article explains it much
clearer though :-P

The history of this particular hack can be found over at Beyond3D:

Part 1: <http://www.beyond3d.com/content/articles/8/>

Part 2: <http://www.beyond3d.com/content/articles/15/>

They trace it all the way back to Greg Walsh, who apparently came up with the
hack while working on the Titan graphics computer at Ardent Computer in the
late 80's, who got the idea while working with Cleve Moler, the author of
matlab.

------
sophacles
I hope someday to actually understand this. It seems to come up once a year or
so, and each time I understand a bit more... so hopefully in a few years...

~~~
frisco
It's really just Newton's method for successive approximations:
<http://en.wikipedia.org/wiki/Newtons_method>

~~~
eru
And a very clever zeroth approximation.

~~~
jacquesm
Exactly, that's the key to the whole thing.

After all if your first approximation is 'in the ballpark' you get a lot of
precision in one go. Newtons method would run 'infinitely long' to reach an
arbitrary precision, normally the cutoff is some acceptable error.

By getting within the ballpark on the first try you can get within acceptable
error with one more approximation step.

(acceptable depends on the use of the result of course).

------
sandGorgon
I think the best paper that explains the magic number (0x5f3759df) is
<http://www.lomont.org/Math/Papers/2003/InvSqrt.pdf>

Lomont traces the history of the algorithm and the possible way that the magic
number was derived - including more magic numbers (0x5f375a86)

------
leif
"isn't that just x*x?"

~~~
wglb
Inverse here means "one divided by". At first I thought the same thing.

