They’re getting better over time, especially for floats/doubles, but I still find them limited even for simple use cases.

Here’s an example of the auto-vectorizer in clang 12, which I believe represents the state of the art at the moment: https://godbolt.org/z/6Pe33187W It automatically vectorized the loop and even unrolled it. However, I think the code bottlenecks on shuffles, not on memory loads: there are just too many instructions in the loop, and that vpmovzxbq instruction can only run on port 5 on Skylake.

Compare the assembly with the manually vectorized version from an answer on Stack Overflow: https://godbolt.org/z/do5e3-
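
To illustrate the kind of difference involved without clicking through, here is a minimal sketch. It assumes the task is summing unsigned bytes into a 64-bit total, which is what the vpmovzxbq zero-extension suggests; it is not necessarily the code behind either link. The function names are made up for illustration, and the manual version uses the well-known psadbw-against-zero trick (`_mm_sad_epu8`), which may or may not be what the Stack Overflow answer does.

```cpp
#include <cstddef>
#include <cstdint>
#if defined(__SSE2__) && defined(__x86_64__)
#include <emmintrin.h>  // SSE2 intrinsics
#endif

// Plain scalar loop: the kind of code one hands to the auto-vectorizer.
// (Hypothetical name, assumed workload.)
uint64_t sum_bytes_scalar(const uint8_t* p, size_t n) {
    uint64_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += p[i];
    return sum;
}

// Manually vectorized version. psadbw against zero adds up each group of
// 8 bytes into a 64-bit lane, so no per-byte zero-extension is needed.
// Falls back to the scalar loop when SSE2 on x86-64 is not available.
uint64_t sum_bytes_sad(const uint8_t* p, size_t n) {
    uint64_t sum = 0;
    size_t i = 0;
#if defined(__SSE2__) && defined(__x86_64__)
    const __m128i zero = _mm_setzero_si128();
    __m128i acc = _mm_setzero_si128();
    for (; i + 16 <= n; i += 16) {
        // Unaligned load of 16 bytes.
        __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(p + i));
        // Two 64-bit partial sums, one per 8-byte half.
        acc = _mm_add_epi64(acc, _mm_sad_epu8(v, zero));
    }
    // Horizontal add of the two 64-bit lanes.
    acc = _mm_add_epi64(acc, _mm_unpackhi_epi64(acc, acc));
    sum = static_cast<uint64_t>(_mm_cvtsi128_si64(acc));
#endif
    // Remaining tail bytes (or everything, on the fallback path).
    for (; i < n; i++)
        sum += p[i];
    return sum;
}
```

In this sketch the hand-written inner loop is one load, one psadbw and one 64-bit add per 16 bytes, instead of a chain of zero-extending shuffles feeding the adds.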