
While I applaud the effort to document the optimization process, the approaches described are fundamentally slow. The easiest portable optimization would be to convert 8 full bits at a time:

    for (int i = 0; i < 256; ++i) {
        const char c[4] = {
            encode_lookup_tbl[i >> 0 & 3],
            encode_lookup_tbl[i >> 2 & 3],
            encode_lookup_tbl[i >> 4 & 3],
            encode_lookup_tbl[i >> 6 & 3],
        };
        encode_lookup_tbl2[i] = *(uint32_t*)c;
    }

    for (x = 0; x < bytes; x++) {
        ((uint32_t*)ws_out)[x] = encode_lookup_tbl2[bytes_in[x]];
    }
Note that the resulting encode_lookup_tbl2 depends on the endianness, but as long as we build the table on demand it remains portable. Also, `ws_out` might have to be aligned for portability (unaligned memory access can be slow at best and trap at worst), but that is easy to guarantee when these functions are only used internally. In my measurement this alone made it ~3.5x faster.
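
For example, in C11 a suitably aligned output buffer can be obtained with aligned_alloc (just an illustration; 32-byte alignment also covers the aligned SIMD stores used further below, and aligned_alloc wants the size rounded up to a multiple of the alignment):

    #include <stdlib.h>

    /* round the output size up to a multiple of the 32-byte alignment */
    size_t out_len = (bytes * 4 + 31) / 32 * 32;
    char *ws_out = aligned_alloc(32, out_len);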

The next obvious approach is to use SIMD. Pretty much every modern CPU gives you a 16-byte shuffle, which can be used to convert 4 bits at a time. (EDIT: This turned out not to be quite right, see below.) I tried to write a quick example but it took quite a bit of time to write and debug [1] ;-) (I may update this post if I get a working version.)
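
To at least sketch the idea (untested, and not necessarily the SSSE3 version benchmarked further down): two 16-entry tables map each nibble to the characters for its two bit pairs, and the missing byte shift [1] is worked around with a 16-bit shift plus a mask:

    #include <tmmintrin.h>  /* SSSE3 */

    const __m128i MASK = _mm_set1_epi8(0xf);

    _Alignas(16) char c[16];
    for (int i = 0; i < 16; ++i) c[i] = encode_lookup_tbl[i & 3];
    const __m128i MAPLO = _mm_load_si128((void*)c);  /* nibble -> char for bits 0-1 */
    for (int i = 0; i < 16; ++i) c[i] = encode_lookup_tbl[i >> 2 & 3];
    const __m128i MAPHI = _mm_load_si128((void*)c);  /* nibble -> char for bits 2-3 */

    for (x = 0; x < bytes; x += 16) {
        __m128i block = _mm_loadu_si128((void*)(bytes_in + x));
        /* no _mm_srli_epi8, so shift 16-bit lanes and mask off the stray bits */
        __m128i hi = _mm_and_si128(_mm_srli_epi16(block, 4), MASK);
        __m128i lo = _mm_and_si128(block, MASK);
        __m128i m1 = _mm_shuffle_epi8(MAPLO, lo);  /* chars for bits 0-1 */
        __m128i m2 = _mm_shuffle_epi8(MAPHI, lo);  /* chars for bits 2-3 */
        __m128i m3 = _mm_shuffle_epi8(MAPLO, hi);  /* chars for bits 4-5 */
        __m128i m4 = _mm_shuffle_epi8(MAPHI, hi);  /* chars for bits 6-7 */
        /* interleave so every input byte expands to its 4 chars in order */
        __m128i s1 = _mm_unpacklo_epi8(m1, m3);
        __m128i s2 = _mm_unpackhi_epi8(m1, m3);
        __m128i s3 = _mm_unpacklo_epi8(m2, m4);
        __m128i s4 = _mm_unpackhi_epi8(m2, m4);
        _mm_store_si128((void*)(ws_out + x * 4),      _mm_unpacklo_epi8(s1, s3));
        _mm_store_si128((void*)(ws_out + x * 4 + 16), _mm_unpackhi_epi8(s1, s3));
        _mm_store_si128((void*)(ws_out + x * 4 + 32), _mm_unpacklo_epi8(s2, s4));
        _mm_store_si128((void*)(ws_out + x * 4 + 48), _mm_unpackhi_epi8(s2, s4));
    }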

Lastly, it might be possible to use mmap to eliminate the intermediate buffer altogether. I think the actual performance gain with and without mmap can vary a lot and you need to profile it, but it's worth considering.
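
For illustration (assuming the input comes from a file; `path` and the error handling are elided):

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int fd = open(path, O_RDONLY);
    struct stat st;
    fstat(fd, &st);
    /* map the input read-only and run any of the encoding loops above directly over it */
    const unsigned char *bytes_in =
        mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    /* ... encode from bytes_in ... */
    munmap((void *)bytes_in, st.st_size);
    close(fd);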

[1] I realized that there is no bytewise shift operator in x86 SIMD (_mm_srli_epi8). Oops!

---

Added later: So the following is my crude AVX2 implementation. It is 20% faster than the bytewise version and 10% faster than the original SSSE3 version. It's not that satisfying, though, because it has to do a lot of unpacking to get the bytes into the correct positions (I'm not a great SIMD coder after all).

    const __m256i MASK = _mm256_set1_epi8(0xf);

    _Alignas(32) char c[32];
    for (int i = 0; i < 32; ++i) c[i] = encode_lookup_tbl[i & 3];
    const __m256i MAPLO = _mm256_load_si256((void*)c);
    for (int i = 0; i < 32; ++i) c[i] = encode_lookup_tbl[i >> 2 & 3];
    const __m256i MAPHI = _mm256_load_si256((void*)c);

    for (x = 0; x < bytes; x += 32) {
        // abcdefghijklmnop ABCDEFGHIJKLMNOP
        __m256i block = _mm256_loadu_si256((void*)(bytes_in + x));
        // ah bh ch dh ... mh nh oh ph
        __m256i hi = _mm256_and_si256(_mm256_srli_epi16(block, 4), MASK);
        // al bl cl dl ... ml nl ol pl
        __m256i lo = _mm256_and_si256(block, MASK);
        // MAPLO[al] ... MAPLO[pl]
        // MAPHI[al] ... MAPHI[pl]
        // MAPLO[ah] ... MAPLO[ph]
        // MAPHI[ah] ... MAPHI[ph]
        __m256i mapped1 = _mm256_shuffle_epi8(MAPLO, lo);
        __m256i mapped2 = _mm256_shuffle_epi8(MAPHI, lo);
        __m256i mapped3 = _mm256_shuffle_epi8(MAPLO, hi);
        __m256i mapped4 = _mm256_shuffle_epi8(MAPHI, hi);
        // MAPLO[al] MAPLO[ah] ... MAPLO[hl] MAPLO[hh]
        // MAPLO[il] MAPLO[ih] ... MAPLO[pl] MAPLO[ph]
        // MAPHI[al] MAPHI[ah] ... MAPHI[hl] MAPHI[hh]
        // MAPHI[il] MAPHI[ih] ... MAPHI[pl] MAPHI[ph]
        __m256i shuf1 = _mm256_unpacklo_epi8(mapped1, mapped3);
        __m256i shuf2 = _mm256_unpackhi_epi8(mapped1, mapped3);
        __m256i shuf3 = _mm256_unpacklo_epi8(mapped2, mapped4);
        __m256i shuf4 = _mm256_unpackhi_epi8(mapped2, mapped4);
        // MAPLO[al] MAPHI[al] MAPLO[ah] MAPHI[ah] ... MAPLO[dl] MAPHI[dl] MAPLO[dh] MAPHI[dh] etc.
        // bytewise:
        // aaaabbbbccccdddd AAAABBBBCCCCDDDD
        // eeeeffffgggghhhh EEEEFFFFGGGGHHHH
        // iiiijjjjkkkkllll IIIIJJJJKKKKLLLL
        // mmmmnnnnoooopppp MMMMNNNNOOOOPPPP
        __m256i out1 = _mm256_unpacklo_epi8(shuf1, shuf3);
        __m256i out2 = _mm256_unpackhi_epi8(shuf1, shuf3);
        __m256i out3 = _mm256_unpacklo_epi8(shuf2, shuf4);
        __m256i out4 = _mm256_unpackhi_epi8(shuf2, shuf4);
        _mm_store_si128((void*)(ws_out + x * 4),       _mm256_extracti128_si256(out1, 0));
        _mm_store_si128((void*)(ws_out + x * 4 + 16),  _mm256_extracti128_si256(out2, 0));
        _mm_store_si128((void*)(ws_out + x * 4 + 32),  _mm256_extracti128_si256(out3, 0));
        _mm_store_si128((void*)(ws_out + x * 4 + 48),  _mm256_extracti128_si256(out4, 0));
        _mm_store_si128((void*)(ws_out + x * 4 + 64),  _mm256_extracti128_si256(out1, 1));
        _mm_store_si128((void*)(ws_out + x * 4 + 80),  _mm256_extracti128_si256(out2, 1));
        _mm_store_si128((void*)(ws_out + x * 4 + 96),  _mm256_extracti128_si256(out3, 1));
        _mm_store_si128((void*)(ws_out + x * 4 + 112), _mm256_extracti128_si256(out4, 1));
    }



Well, it seems that I missed an obvious optimization with 256-bit stores:

        _mm256_store_si256((void*)(ws_out + x * 4),      _mm256_permute2x128_si256(out1, out2, 0x20));
        _mm256_store_si256((void*)(ws_out + x * 4 + 32), _mm256_permute2x128_si256(out3, out4, 0x20));
        _mm256_store_si256((void*)(ws_out + x * 4 + 64), _mm256_permute2x128_si256(out1, out2, 0x31));
        _mm256_store_si256((void*)(ws_out + x * 4 + 96), _mm256_permute2x128_si256(out3, out4, 0x31));
The relative time to encode 1 MB input (timed with 8,192 iterations):

    5.51x   original
    1.00x   bytewise (baseline)
    0.83x   SSSE3 (the original version I've posted)
    0.74x   AVX2 (parent)
    0.60x   AVX2 (updated)
Unfortunately my machine (i7-7700) can't run anything beyond AVX2.


> Unfortunately my machine (i7-7700) can't run anything beyond AVX2.

Your machine has BMI2. It's not SIMD, but it handles 8 bytes at a time and is very suitable for packing and unpacking bits like these.

https://godbolt.org/z/xcT3exenr
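
Roughly, the idea: _pdep_u64 spreads each 2-bit group into its own byte, and the 4-entry table can then be applied branch-free with SWAR masks, 8 output bytes per step (a simplified sketch, not the exact code at the link):

    #include <immintrin.h>  /* _pdep_u64, compile with -mbmi2 */
    #include <stdint.h>
    #include <string.h>

    /* one 16-bit chunk of input -> one 64-bit chunk of output (8 chars) */
    static inline uint64_t encode8_bmi2(uint16_t v, const char tbl[4]) {
        const uint64_t LSB = 0x0101010101010101ULL;
        uint64_t idx = _pdep_u64(v, LSB * 3);           /* one 2-bit index per byte */
        uint64_t m0  = (idx & LSB) * 0xFF;              /* 0xFF where index bit 0 is set */
        uint64_t m1  = ((idx >> 1) & LSB) * 0xFF;       /* 0xFF where index bit 1 is set */
        uint64_t m01 = (idx & (idx >> 1) & LSB) * 0xFF; /* 0xFF where both bits are set */
        /* XOR decomposition of the 4-entry table, broadcast to all 8 bytes */
        uint64_t a  = LSB * (uint8_t)tbl[0];
        uint64_t x0 = LSB * (uint8_t)(tbl[1] ^ tbl[0]);
        uint64_t x1 = LSB * (uint8_t)(tbl[2] ^ tbl[0]);
        uint64_t x2 = LSB * (uint8_t)(tbl[3] ^ tbl[2] ^ tbl[1] ^ tbl[0]);
        return a ^ (m0 & x0) ^ (m1 & x1) ^ (m01 & x2);
    }

    for (x = 0; x < bytes; x += 2) {
        uint16_t v;
        memcpy(&v, bytes_in + x, 2);                    /* 2 input bytes */
        uint64_t enc = encode8_bmi2(v, encode_lookup_tbl);
        memcpy(ws_out + x * 4, &enc, 8);                /* 8 output chars (x86 is little-endian) */
    }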


Ugh, you were correct. I did copy and paste your code into my testing framework and it instantly crashed at the time, but it turns out I had put the wrong offset into the output. The resulting code was slightly faster (by 2-4%) than my AVX2 code.



