Fast inverse square root 37 points by r11t on Oct 22, 2009 | hide | past | web | favorite | 20 comments

 Curiously enough, today, such a hack is completely pointless (at least on x86): the rsqrtps instruction can do 4 of these in just 3 clock cycles, with higher accuracy to boot.Most modern instruction sets with remotely decent floating point support have a similar instruction, in large part because of the prevalence of hacks like this in the past.
 Yes, but what a beautiful little hack it is. The fact that modern processors implement it in hardware is actually a tribute to this tricks ingenuity and elegance.After all, to have major chip manufacturers implement your software hack in hardware is quite a compliment, assuming they use the same mechanism.
 This particular hack, yes, but the general point stands. Sometimes hacks make for great code. The point is that this is good code:`````` float h = 0.5f*x; int i = *(int*)&x; i = 0x5f3759df - (i>>1); x = *(float*)&i; return x*(1.5f - h*x*x); `````` That's 3 local variables, 5 short statements... makes me wonder how quickly one could search/test all combinations of sensible 5-statement functions to find other such "hacks"...
 When you're willing to make an approximation you can come up with all sorts of very short functions for common operations.Here's one from my project, which takes an integer x and calculates a (float) log2(x) in an absurdly low number of operations.`````` static inline float x264_log2( uint32_t x ) { int lz = __builtin_clz( x ); return x264_log2_lut[(x<>24)&0x7f] + x264_log2_lz_lut[lz]; } const float x264_log2_lut[128] = { 0.00000, 0.01123, 0.02237, 0.03342, 0.04439, 0.05528, 0.06609, 0.07682, 0.08746, 0.09803, 0.10852, 0.11894, 0.12928, 0.13955, 0.14975, 0.15987, 0.16993, 0.17991, 0.18982, 0.19967, 0.20945, 0.21917, 0.22882, 0.23840, 0.24793, 0.25739, 0.26679, 0.27612, 0.28540, 0.29462, 0.30378, 0.31288, 0.32193, 0.33092, 0.33985, 0.34873, 0.35755, 0.36632, 0.37504, 0.38370, 0.39232, 0.40088, 0.40939, 0.41785, 0.42626, 0.43463, 0.44294, 0.45121, 0.45943, 0.46761, 0.47573, 0.48382, 0.49185, 0.49985, 0.50779, 0.51570, 0.52356, 0.53138, 0.53916, 0.54689, 0.55459, 0.56224, 0.56986, 0.57743, 0.58496, 0.59246, 0.59991, 0.60733, 0.61471, 0.62205, 0.62936, 0.63662, 0.64386, 0.65105, 0.65821, 0.66534, 0.67243, 0.67948, 0.68650, 0.69349, 0.70044, 0.70736, 0.71425, 0.72110, 0.72792, 0.73471, 0.74147, 0.74819, 0.75489, 0.76155, 0.76818, 0.77479, 0.78136, 0.78790, 0.79442, 0.80090, 0.80735, 0.81378, 0.82018, 0.82655, 0.83289, 0.83920, 0.84549, 0.85175, 0.85798, 0.86419, 0.87036, 0.87652, 0.88264, 0.88874, 0.89482, 0.90087, 0.90689, 0.91289, 0.91886, 0.92481, 0.93074, 0.93664, 0.94251, 0.94837, 0.95420, 0.96000, 0.96578, 0.97154, 0.97728, 0.98299, 0.98868, 0.99435, }; /* Avoid an int/float conversion. */ const float x264_log2_lz_lut[32] = { 31,30,29,28,27,26,25,24,23,22,21,20,19,18,17,16,15,14,13,12,11,10,9,8,7,6,5,4,3,2,1,0 }; `````` The resulting asm, just 7 instructions:`````` bsr ecx, eax xor ecx, 0x1f movss xmm0, [ecx*4+x264_log2_lz_lut] shl eax, cl shr eax, 0x18 and eax, 0x7f addss xmm0, [eax*4+x264_log2_lut]``````
 Could duplicate the x264_log2_lut table, then you wouldn't need the 'and eax, 0x7f'.... 6 instructions.
 Yeah, though at that point I'd be start to worry about whether that's a fair L1 cache tradeoff, given that it only saves a single (integer!) instruction and requires 512 bytes of memory.
 Sure, always a trade off when you get to that sort of point, and not much payoff.
 What kind of project is it?
 I was hoping the names would make it slightly obvious, but if not, it's the x264 video encoder: http://www.videolan.org/developers/x264.html
 Actually it is not completely pointless today: Some months ago I used it to speed up an force-directed algorithm which spent 80% of its time in calculating normalized vectors between two points - and the fast inverse sqrt actually helped to speed up the program by a factor of 2.5 to 3.0. The reason why i used it is quite simple: In (unsafe) .NET code you can execute all the necessary arithmetic operations (pointer casts, bit shifts) but you are not allowed to execute arbitrary x86 operations.
 I remember when this was being discussed over at gamedev.net, where Chris Lomont came up with his version. The wikipedia article explains it much clearer though :-PThe history of this particular hack can be found over at Beyond3D:They trace it all the way back to Greg Walsh, who apparently came up with the hack while working on the Titan graphics computer at Ardent Computer in the late 80's, who got the idea while working with Cleve Moler, the author of matlab.
 I hope someday to actually understand this. It seems to come up once a year or so, and each time I understand a bit more... so hopefully in a few years...
 It's really just Newton's method for successive approximations: http://en.wikipedia.org/wiki/Newtons_method
 And a very clever zeroth approximation.
 Exactly, that's the key to the whole thing.After all if your first approximation is 'in the ballpark' you get a lot of precision in one go. Newtons method would run 'infinitely long' to reach an arbitrary precision, normally the cutoff is some acceptable error.By getting within the ballpark on the first try you can get within acceptable error with one more approximation step.(acceptable depends on the use of the result of course).
 If you Google for that magic number, you will find a large number of explanations, requiring various levels of understanding. No doubt there is one clear enough for you; perhaps stack a few of them.
 Good tip, thanks -- now on to making my brain hurt!
 I think the best paper that explains the magic number (0x5f3759df) is http://www.lomont.org/Math/Papers/2003/InvSqrt.pdfLomont traces the history of the algorithm and the possible way that the magic number was derived - including more magic numbers (0x5f375a86)
 "isn't that just x*x?"
 Inverse here means "one divided by". At first I thought the same thing.

Search: