_mm256_store_si256((void*)(ws_out + x * 4), _mm256_permute2x128_si256(out1, out2, 0x20));
_mm256_store_si256((void*)(ws_out + x * 4 + 32), _mm256_permute2x128_si256(out3, out4, 0x20));
_mm256_store_si256((void*)(ws_out + x * 4 + 64), _mm256_permute2x128_si256(out1, out2, 0x31));
_mm256_store_si256((void*)(ws_out + x * 4 + 96), _mm256_permute2x128_si256(out3, out4, 0x31));
5.51x original
1.00x bytewise (baseline)
0.83x SSSE3 (the original version I've posted)
0.74x AVX2 (parent)
0.60x AVX2 (updated)
Your machine has BMI2. It's not SIMD, but it handles 8 bytes at a time and is very well suited to packing and unpacking these bits.
https://godbolt.org/z/xcT3exenr
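To make the BMI2 suggestion concrete, here is a minimal sketch of the general idea, not the code behind the link. It assumes the simplest possible layout, one bit per output byte, which may not match the actual format discussed in this thread. The helper names (pdep64, pext64, unpack_bits_to_bytes, pack_bytes_to_bits, check_round_trip) are made up for illustration. When built with -mbmi2 on x86-64 it uses the real _pdep_u64 / _pext_u64 intrinsics; otherwise it falls back to a slow portable loop so it still compiles.

```c
#include <assert.h>
#include <stdint.h>
#if defined(__BMI2__) && defined(__x86_64__)
#include <immintrin.h>
#endif

/* Deposit the low bits of src into the positions selected by mask (PDEP). */
static uint64_t pdep64(uint64_t src, uint64_t mask) {
#if defined(__BMI2__) && defined(__x86_64__)
    return _pdep_u64(src, mask);
#else
    /* Portable fallback: walk the set bits of mask from lowest to highest. */
    uint64_t out = 0;
    for (uint64_t bit = 1; mask != 0; bit <<= 1) {
        uint64_t lowest = mask & (~mask + 1); /* lowest set bit of mask */
        if (src & bit) out |= lowest;
        mask ^= lowest;
    }
    return out;
#endif
}

/* Gather the bits of src selected by mask into the low bits (PEXT). */
static uint64_t pext64(uint64_t src, uint64_t mask) {
#if defined(__BMI2__) && defined(__x86_64__)
    return _pext_u64(src, mask);
#else
    uint64_t out = 0;
    for (uint64_t bit = 1; mask != 0; bit <<= 1) {
        uint64_t lowest = mask & (~mask + 1);
        if (src & lowest) out |= bit;
        mask ^= lowest;
    }
    return out;
#endif
}

/* Assumed layout: bit i of the input becomes byte i (0x00 or 0x01). */
static uint64_t unpack_bits_to_bytes(uint8_t bits) {
    return pdep64(bits, 0x0101010101010101ULL);
}

/* Inverse: take the low bit of each of the 8 bytes and pack them into one byte. */
static uint8_t pack_bytes_to_bits(uint64_t bytes) {
    return (uint8_t)pext64(bytes, 0x0101010101010101ULL);
}

/* Round-trip check over all 256 inputs; aborts via assert on a mismatch. */
static void check_round_trip(void) {
    for (unsigned v = 0; v < 256; ++v)
        assert(pack_bytes_to_bits(unpack_bits_to_bytes((uint8_t)v)) == v);
}
```

Changing the mask constant is all it takes to target a different bit layout, which is why a single pdep or pext can replace a chain of shifts and masks. One caveat: on some older AMD cores (before Zen 3) these two instructions are microcoded and slow, so measure before relying on them.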