
Chunking Optimizations: Let the Knife Do the Work - ingve
https://nullprogram.com/blog/2019/12/09/
======
klodolph
I will say that most of the time, when you want to pull out the intrinsics,
you really want to use the autovectorizer instead and "restrict" will get you
everything you want.

The main use cases for vector intrinsics, in my opinion, are when you are
doing something clever shuffling the vectors around or packing / unpacking
them. If you’re not being especially clever, just let the compiler do the
heavy lifting and make it happen by putting "restrict" in a couple places.

If you are being especially clever, at least give the compiler a shot and
benchmark the different versions against each other.

~~~
nwallin
Or if you're me and your project is compiled with MSVC.

My personal projects are beautifully simple loops that compile to
vectorized AVX or SSE instructions. My work projects have a lot of `//todo:
rewrite this with intrinsics` or `//don't fuck up the alignment there's a
bunch of SIMD intrinsics under here`.

------
balnaphone
Using gcc-9.2 with the usual AVX-512 flags, I get

    
    
        xor512d:
                vmovdqu8        zmm1, ZMMWORD PTR [rsi]
                vpxorq  zmm0, zmm1, ZMMWORD PTR [rdi]
                vmovdqu8        ZMMWORD PTR [rdi], zmm0
                vzeroupper
                ret
    

which is very close to the reported Clang-generated code.

Flags used: -O3 -mavx512f -mavx512cd -mavx512bw -mavx512dq -mavx512vl
-mavx512ifma -mavx512vbmi
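For reference, and hedging that I'm reconstructing it from memory of the post (the exact signature there may differ), the C source behind that assembly is something like a fixed-width xor over restrict-qualified pointers:

```c
#include <stdint.h>

/* 64 bytes xor'd as eight 64-bit lanes; with restrict and AVX-512
 * enabled at -O3, the whole loop collapses to one zmm load/xor/store. */
void xor512d(uint64_t *restrict dst, const uint64_t *restrict src)
{
    for (int i = 0; i < 8; i++)
        dst[i] ^= src[i];
}
```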

~~~
balnaphone
By the way, godbolt.org is great for testing this kind of stuff. Here is a
link to verify my claim:
[https://godbolt.org/z/u68_9m](https://godbolt.org/z/u68_9m) .

------
Narishma
As one of the comments in the article points out, there's an issue of
mismatched compiler optimization flags for the different examples shown. The
first was compiled with -Os, the last with -O3, and it's not mentioned for the
ones in between. Using the same flags for both will produce the same code, so
it doesn't seem like restrict helps much in this case.

~~~
lilyball
When testing on clang, the first and last examples produce radically different
code even when both using -Os or -O3. The choice of flag does affect the code
emitted, but it doesn't change the article's point as xor512d does it 16 bytes
at a time and xor512a does it 1 byte at a time.

~~~
tom_mellior
> xor512d does it 16 bytes at a time and xor512a does it 1 byte at a time.

No, it doesn't. The compiler chooses how many bytes to do at a time. And in
fact it defers that choice to run time. For reference, here are both functions
compiled with GCC and Clang at -O3:
[https://godbolt.org/z/WmduxJ](https://godbolt.org/z/WmduxJ)

For xor512a both compilers insert an aliasing check and fall back to a scalar
loop if the input arrays overlap. But for the case where they do _not_
overlap, both compilers produce vector code that is essentially equivalent to
xor512d.

If we expect aliasing to be rare, the two functions will usually both execute
essentially the same vector code, with the performance difference only being
the slight cost of the aliasing check. This holds for both GCC and Clang.
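A sketch (my naming, not the actual generated code) of what that compiler-inserted check morally looks like if written back out as source:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical source-level view of the non-restrict case: a cheap
 * runtime overlap test chooses between the vector path and a
 * conservative fallback. Both branches are the same loop in C; the
 * optimizer vectorizes only the branch it has proven safe. */
void xor_sketch(unsigned char *dst, const unsigned char *src, size_t n)
{
    uintptr_t d = (uintptr_t)dst, s = (uintptr_t)src;
    if (d + n <= s || s + n <= d) {
        /* no overlap: the compiler's wide SIMD path */
        for (size_t i = 0; i < n; i++)
            dst[i] ^= src[i];
    } else {
        /* possible overlap: byte-at-a-time scalar loop */
        for (size_t i = 0; i < n; i++)
            dst[i] ^= src[i];
    }
}
```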

------
api
One unfortunate issue / question: are compilers smart enough to emit tests
for CPU support for these extensions? I see no mention of that, and if the
answer is "no", you will still find yourself forced to hand-roll this stuff.

Let's say I build a binary with AVX support. Will it run on a CPU without AVX
support?

It's less of an issue with older SSE stuff that is effectively universal
across all X64 chips, but it's a big issue with late SSE and AVX extensions.

~~~
klodolph
Hand-rolling is unnecessary and error-prone. It is better to leverage the
compiler.

There are a couple of relevant GCC flags: -march and -mtune. The -march flag
specifies which features the compiler is allowed to assume are present. The
blog post is using SSE instructions, I’m thinking SSE2 but that’s just from
memory, and these days most people consider it eccentric to try to support
CPUs without SSE2.

There are a couple ways you can do the feature testing in your own program.
You can make multiple copies of the function and compile them with different
CPU targets, and then use something like

    
    
        __attribute__((target("avx")))
        void xor512a_avx(...) { ... }
    
        if (__builtin_cpu_supports("avx")) {
            xor512a_avx(dst, src);
        }
    

Alternatively, you can use something called function multiversioning, which
will write the branch for you. I think this feature is present in both GCC and
Clang.
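A minimal multiversioning sketch using GCC's target_clones attribute (the function and the target list are my choices, for illustration); the compiler emits one clone per target plus a resolver that picks the best one at load time:

```c
#include <stddef.h>

/* One source function, several compiled variants; on glibc targets the
 * dynamic loader dispatches to the best clone for the running CPU via
 * an ifunc resolver. */
__attribute__((target_clones("avx2", "default")))
void xor_mv(unsigned char *dst, const unsigned char *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] ^= src[i];
}
```

Clang also accepts target_clones, though the ifunc dispatch mechanism needs loader support, so this is mainly a Linux/glibc feature.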

Also note that there are some unusual edge cases you may not have considered
with feature detection. There’s a strong case to say that instructions like
CPUID are the wrong way to go about it, and you should do things like use
sysctl() on macOS. The risks in using CPUID are a bit esoteric in my opinion,
though.

------
RossBencina
Sadly C++ doesn't have the `restrict` keyword.

~~~
dodobirdlord
That's surprising, is there a technical reason? I would have guessed that a
non-aliasing declaration was table stakes for any performance critical
language, so I just always assumed it was in there somewhere.

~~~
toolslive
With C++, you often wonder why some things are done in such an ugly way (or
haven't been cleaned up). The standard reply is frequently "backward
compatibility". But if you investigate that backward compatibility, you find
C++ is not really backward compatible with C at all. FWIW, 'restrict' was
introduced in C99, so after the big rift. But you can use '__restrict'.
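A hedged example (my function, not from the thread): GCC, Clang, and MSVC all accept the '__restrict' spelling in C++ mode as well as C, with the same meaning as C99 restrict, so this compiles identically under both languages:

```c
#include <stddef.h>

/* __restrict promises y and x don't alias, letting the optimizer
 * vectorize this loop without a runtime overlap check. */
void saxpy(float *__restrict y, const float *__restrict x,
           float a, size_t n)
{
    for (size_t i = 0; i < n; i++)
        y[i] += a * x[i];
}
```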

