Chunking Optimizations: Let the Knife Do the Work (nullprogram.com)
83 points by ingve 41 days ago | 13 comments

I will say that most of the time, when you want to pull out the intrinsics, you really want the autovectorizer instead, and "restrict" will get you everything you want.

The main use cases for vector intrinsics, in my opinion, are when you are doing something clever like shuffling the vectors around or packing / unpacking them. If you're not being especially clever, just let the compiler do the heavy lifting and make it happen by putting "restrict" in a couple of places.

If you are being especially clever, at least give the compiler a shot and benchmark the different versions against each other.
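A minimal sketch of that pattern (the function name and signature here are mine, not from the article):

```c
#include <stddef.h>

/* restrict promises the compiler that dst and src never overlap, so
   the autovectorizer can emit SIMD loads/stores with no aliasing guard. */
void xor_blocks(unsigned char *restrict dst,
                const unsigned char *restrict src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] ^= src[i];
}
```

With GCC or Clang at -O2/-O3 this loop typically compiles to vector XORs; dropping restrict forces the compiler to add a runtime overlap check first.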

Or if you're me and your project is compiled with MSVC.

My personal projects are beautifully simple loops that compile to vectorized AVX or SSE instructions. My work projects have a lot of `//todo: rewrite this with intrinsics` or `//don't fuck up the alignment there's a bunch of SIMD intrinsics under here`.

Using gcc-9.2 with the usual AVX-512 flags, I get

            vmovdqu8        zmm1, ZMMWORD PTR [rsi]
            vpxorq  zmm0, zmm1, ZMMWORD PTR [rdi]
            vmovdqu8        ZMMWORD PTR [rdi], zmm0
which is very close to the reported Clang-generated code.

Flags used: -O3 -mavx512f -mavx512cd -mavx512bw -mavx512dq -mavx512vl -mavx512ifma -mavx512vbmi

By the way, godbolt.org is great for testing this kind of stuff. Here is a link to verify my claim: https://godbolt.org/z/u68_9m .

As one of the comments in the article points out, there's an issue of mismatched compiler optimization flags for the different examples shown. The first was compiled with -Os, the last with -O3, and it's not mentioned for the ones in between. Using the same flags for both will produce the same code, so it doesn't seem like restrict helps much in this case.

When testing on clang, the first and last examples produce radically different code even when both using -Os or -O3. The choice of flag does affect the code emitted, but it doesn't change the article's point as xor512d does it 16 bytes at a time and xor512a does it 1 byte at a time.

> xor512d does it 16 bytes at a time and xor512a does it 1 byte at a time.

No, it doesn't. The compiler chooses how many bytes to do at a time. And in fact it defers that choice to run time. For reference, here are both functions compiled with GCC and Clang at -O3: https://godbolt.org/z/WmduxJ

For xor512a both compilers insert an aliasing check and fall back to a scalar loop if the input arrays overlap. But for the case where they do not overlap, both compilers produce vector code that is essentially equivalent to xor512d.

If we expect aliasing to be rare, the two functions will usually both execute essentially the same vector code, with the performance difference only being the slight cost of the aliasing check. This holds for both GCC and Clang.
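Conceptually, the guard the compilers insert for xor512a looks something like this hand-written sketch (an illustration of the idea, not the code either compiler actually emits):

```c
#include <stddef.h>
#include <stdint.h>

/* Sketch of the compiler-inserted aliasing check: if the two 64-byte
   ranges are disjoint, take the vectorizable path; otherwise fall back
   to a byte-at-a-time loop that is safe under overlap. */
void xor512a_sketch(unsigned char *dst, const unsigned char *src)
{
    uintptr_t d = (uintptr_t)dst, s = (uintptr_t)src;
    if (s + 64 <= d || d + 64 <= s) {
        for (size_t i = 0; i < 64; i++)   /* no overlap: vectorized */
            dst[i] ^= src[i];
    } else {
        for (size_t i = 0; i < 64; i++)   /* overlap: scalar fallback */
            dst[i] ^= src[i];
    }
}
```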

One unfortunate issue / question: are compilers smart enough to emit tests for CPU support for these extensions? I see no mention of that, and if the answer is "no" you'll still be forced to hand-roll this stuff.

Let's say I build a binary with AVX support. Will it run on a CPU without AVX support?

It's less of an issue with the older SSE stuff that is effectively universal across x86-64 chips, but it's a big issue with later SSE and AVX extensions.

Hand-rolling is unnecessary and error-prone. It is better to leverage the compiler.

There are a couple relevant GCC flags: -march and -mtune. The -march flag specifies which features the compiler is allowed to assume are present (-mtune only tunes for a given CPU without changing that feature baseline). The blog post is using SSE instructions, I'm thinking SSE2 but that's just from memory, and these days most people consider it eccentric to try to support CPUs without SSE2.
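A concrete sketch of the distinction (the flag values here are my assumptions, not from the thread):

```shell
# -march sets the ISA baseline the compiler may assume; -mtune only
# tunes instruction selection/scheduling without changing that baseline.
gcc -O3 -march=x86-64-v3 -mtune=generic -c xor.c   # may assume AVX2 etc.
gcc -O3 -march=x86-64 -c xor.c                     # baseline: SSE2 only
```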

There are a couple ways you can do the feature testing in your own program. You can make multiple copies of the function and compile them with different CPU targets, and then use something like

    __attribute__((target("avx")))
    void xor512a_avx(...) { ... }

    if (__builtin_cpu_supports("avx")) {
        xor512a_avx(dst, src);
    }
Alternatively, you can use something called function multiversioning, which will write the branch for you. I think this feature is present in both GCC and Clang.

Also note that there are some unusual edge cases you may not have considered with feature detection. There’s a strong case to say that instructions like CPUID are the wrong way to go about it, and you should do things like use sysctl() on macOS. The risks in using CPUID are a bit esoteric in my opinion, though.

Intel compilers have an -ax flag which compiles functions into several versions for different instruction sets. GCC has target_clones, which can be set for particular functions. Both will choose which version to use at runtime based on the CPU.
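A sketch of GCC's target_clones (the attribute is a real GCC feature; the function name is mine). Note it relies on ifunc support, so in practice it's a glibc/ELF feature:

```c
#include <stddef.h>

/* GCC emits one clone of this function per listed target plus the
   default, and an ifunc resolver selects the best supported clone at
   load time based on the running CPU. */
__attribute__((target_clones("avx2", "default")))
void xor_blocks(unsigned char *dst, const unsigned char *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] ^= src[i];
}
```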

Sadly C++ doesn't have the `restrict` keyword.

That's surprising; is there a technical reason? I would have guessed that a non-aliasing declaration was table stakes for any performance-critical language, so I just always assumed it was in there somewhere.

With C++, you often wonder why some things are done in such an ugly way (or haven't been cleaned up). The standard reply is frequently "backward compatibility". But if you investigate that backward compatibility, you find C++ is not really backward compatible with C at all. FWIW, 'restrict' was introduced in C99, so after the big rift. But you can use '__restrict'.
