Hacker News new | past | comments | ask | show | jobs | submit login

When you're willing to make an approximation you can come up with all sorts of very short functions for common operations.

Here's one from my project, which takes an integer x and calculates a (float) log2(x) in an absurdly low number of operations.

    static inline float x264_log2( uint32_t x )
    {
        int lz = __builtin_clz( x );
        return x264_log2_lut[(x<<lz>>24)&0x7f] + x264_log2_lz_lut[lz];
    }

    const float x264_log2_lut[128] = {
        0.00000, 0.01123, 0.02237, 0.03342, 0.04439, 0.05528, 0.06609, 0.07682,
        0.08746, 0.09803, 0.10852, 0.11894, 0.12928, 0.13955, 0.14975, 0.15987,
        0.16993, 0.17991, 0.18982, 0.19967, 0.20945, 0.21917, 0.22882, 0.23840,
        0.24793, 0.25739, 0.26679, 0.27612, 0.28540, 0.29462, 0.30378, 0.31288,
        0.32193, 0.33092, 0.33985, 0.34873, 0.35755, 0.36632, 0.37504, 0.38370,
        0.39232, 0.40088, 0.40939, 0.41785, 0.42626, 0.43463, 0.44294, 0.45121,
        0.45943, 0.46761, 0.47573, 0.48382, 0.49185, 0.49985, 0.50779, 0.51570,
        0.52356, 0.53138, 0.53916, 0.54689, 0.55459, 0.56224, 0.56986, 0.57743,
        0.58496, 0.59246, 0.59991, 0.60733, 0.61471, 0.62205, 0.62936, 0.63662,
        0.64386, 0.65105, 0.65821, 0.66534, 0.67243, 0.67948, 0.68650, 0.69349,
        0.70044, 0.70736, 0.71425, 0.72110, 0.72792, 0.73471, 0.74147, 0.74819,
        0.75489, 0.76155, 0.76818, 0.77479, 0.78136, 0.78790, 0.79442, 0.80090,
        0.80735, 0.81378, 0.82018, 0.82655, 0.83289, 0.83920, 0.84549, 0.85175,
        0.85798, 0.86419, 0.87036, 0.87652, 0.88264, 0.88874, 0.89482, 0.90087,
        0.90689, 0.91289, 0.91886, 0.92481, 0.93074, 0.93664, 0.94251, 0.94837,
        0.95420, 0.96000, 0.96578, 0.97154, 0.97728, 0.98299, 0.98868, 0.99435,
    };

    /* Avoid an int/float conversion. */
    const float x264_log2_lz_lut[32] = {
        31,30,29,28,27,26,25,24,23,22,21,20,19,18,17,16,15,14,13,12,11,10,9,8,7,6,5,4,3,2,1,0
    };
The resulting asm, just 7 instructions:

      bsr     ecx, eax
      xor     ecx, 0x1f
      movss  xmm0, [ecx*4+x264_log2_lz_lut]
      shl     eax, cl
      shr     eax, 0x18
      and     eax, 0x7f
      addss  xmm0, [eax*4+x264_log2_lut]



Could duplicate the x264_log2_lut table, then you wouldn't need the 'and eax, 0x7f'.... 6 instructions.


Yeah, though at that point I'd be start to worry about whether that's a fair L1 cache tradeoff, given that it only saves a single (integer!) instruction and requires 512 bytes of memory.


Sure, always a trade off when you get to that sort of point, and not much payoff.


What kind of project is it?


I was hoping the names would make it slightly obvious, but if not, it's the x264 video encoder: http://www.videolan.org/developers/x264.html




Consider applying for YC's Spring batch! Applications are open till Feb 11.

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: