
Function multi-versioning in GCC 6 - arthur2e5
https://lwn.net/Articles/691932/
======
ChuckMcM
The fact that different CPUs have different features was one of the original
reasons you never ran a generic kernel. The generic kernel was just that,
generic. As a result it made no assumptions. Now in Windows and later Linux
and FreeBSD kernels there are kernel modules that can change behaviors but
sometimes that still doesn't give you the max performance for the integer
code. Fortunately it isn't that much of a burden.

That said, I'm sure HPC folks compile with all the bells and whistles for the
exact model, options, and probably CPU stepping enabled.

~~~
jabl
HPC guy here. Nope, we don't bother recompiling the kernel for every piece of
hardware we have. HPC code tends to spend 99.99% of its time in user space, so
kernel performance doesn't matter that much.

End user applications, that's another matter, and here using all the latest
vector instructions etc. can make a difference. Usually less so than what one
might hope, though. The really big deal tends to be using optimized libraries
such as OpenBLAS, FFTW, MKL instead of doing numerical linear algebra yourself
in a naive fashion, or using the reference netlib BLAS.

Another very common problem we see is poor application I/O patterns. Yes,
every HPC site loves to brag how many GB/s their Lustre system does, but if
you divide that by the number of CPU cores in a cluster, that ratio is quite
low. Additionally, like other clustered file systems, Lustre metadata
performance is relatively poor, so applications banging on lots of small files
can easily tank the performance of the entire Lustre system.

------
JoshTriplett
This seems particularly interesting as a contrast with
[https://news.ycombinator.com/item?id=13145245](https://news.ycombinator.com/item?id=13145245).

~~~
mcbain
Solves and creates different issues. With FMV you essentially have to build
the compile unit with the highest level of microarch support so that the
preprocessor interprets the intrinsics headers and doesn't eliminate intrinsics
you might need.

(ICC solves this in a differently annoying way - where all the intrinsics are
available, even if they are incompatible with the platform you are on.)

For some light entertainment, think about what happens with static
initializers and compiling for different microarch flavours. If your C++
static init function happens to generate an AVX insn, and you've only got
SSE2, welcome to SIGILL before main().
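
For anyone who hasn't seen FMV in source form, here is a minimal sketch of the `target_clones` flavour. It assumes GCC 6+ (or a recent Clang) targeting x86-64 on a Linux/glibc system; `scale_add` and the `FMV_CLONES` macro are made-up names for illustration, and on any other toolchain the attribute is compiled out so only one generic version is built.

```c
#include <stddef.h>

/* Assumption: GCC 6+ or a recent Clang, x86-64, Linux with GNU ifunc support.
   Anywhere else FMV_CLONES expands to nothing and one generic copy is built. */
#if defined(__GNUC__) && defined(__x86_64__) && defined(__linux__) && defined(__has_attribute)
#  if __has_attribute(target_clones)
#    define FMV_CLONES __attribute__((target_clones("avx2", "default")))
#  endif
#endif
#ifndef FMV_CLONES
#  define FMV_CLONES
#endif

/* Hypothetical example function. With the attribute active, the compiler
   emits one clone per listed target plus a resolver that picks a clone
   when the program is loaded. No intrinsics are used here; any speedup
   would come from the compiler vectorizing the loop differently per clone. */
FMV_CLONES
void scale_add(float *dst, const float *src, float k, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] += k * src[i];
}
```

Callers just call `scale_add(dst, src, 2.0f, n)` as usual; the dispatch is invisible at the call site.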

~~~
burntsushi
Can't you get the same behavior as icc in gcc/clang by just using
target-specific optimization options at the function level?

See for example stage 1 here:
[https://gcc.gnu.org/wiki/FunctionSpecificOpt](https://gcc.gnu.org/wiki/FunctionSpecificOpt)
(that document appears dated, but do things still work that way?) Afaik,
clang/llvm have similar functionality.
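
To make the question concrete, a rough sketch of what I mean, assuming GCC 4.9 or later (or a recent Clang) on x86, where as far as I know the intrinsics headers can be used inside a function carrying its own `target` attribute without passing `-mavx2` for the whole file. The function names are made up for the example, and the dispatch is done by hand with `__builtin_cpu_supports`.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <immintrin.h>
#define HAVE_X86_TARGET_ATTR 1

/* Only this function is compiled with AVX2 enabled; the rest of the
   file keeps the baseline ISA chosen on the command line. */
__attribute__((target("avx2")))
static int32_t sum_avx2(const int32_t *p, size_t n)
{
    __m256i acc = _mm256_setzero_si256();
    size_t i = 0;
    for (; i + 8 <= n; i += 8)
        acc = _mm256_add_epi32(acc, _mm256_loadu_si256((const __m256i *)(p + i)));

    /* Horizontal sum of the eight 32-bit lanes. */
    int32_t lanes[8];
    _mm256_storeu_si256((__m256i *)lanes, acc);
    int32_t total = 0;
    for (int j = 0; j < 8; j++)
        total += lanes[j];

    for (; i < n; i++)          /* scalar tail */
        total += p[i];
    return total;
}
#endif

/* Portable fallback. */
static int32_t sum_scalar(const int32_t *p, size_t n)
{
    int32_t total = 0;
    for (size_t i = 0; i < n; i++)
        total += p[i];
    return total;
}

/* Manual runtime dispatch: check the CPU, then call the matching version. */
int32_t sum_i32(const int32_t *p, size_t n)
{
#ifdef HAVE_X86_TARGET_ATTR
    if (__builtin_cpu_supports("avx2"))
        return sum_avx2(p, n);
#endif
    return sum_scalar(p, n);
}
```

The check runs on every call here; caching the result in a function pointer would avoid that, at the cost of a little more boilerplate.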

~~~
mcbain
That stage1 example is pretty ugly - using __builtin_ia32<x> would work, but
they are the only things harder to read than the intrinsics themselves!

Plus there are some intrinsics that are just macros, (sets, masks, etc), and
you don't get them from the preprocessor just by setting the function target.

As an aside, that page really is dated - it is just early proposals after all -
as SSE5 didn't see light of day like that. VPCMOV ended up in AMD's XOP set.

ICC also has its auto as well as manual dispatch options:

auto: [https://software.intel.com/en-us/node/682440](https://software.intel.com/en-us/node/682440)

manual: [https://software.intel.com/en-us/node/684505](https://software.intel.com/en-us/node/684505)

I believe this is the area where Intel had their knuckles rapped for only
working on "GenuineIntel" processors, and why there are big disclaimers on
everything now. I've not tried using these myself as they aren't portable
solutions.

------
akkartik
I wonder how long we have until we run out of x86 opcodes. Even if it's a
variable-length instruction encoding, there's going to come a point where the
size of the instruction outweighs any performance benefits, not to mention the
effort that goes into designing new instruction extensions in a chip.

Then again, perhaps that's all that's left for Intel to do now. Evolution of
marketing ploys: transistors -> clock speed -> #cores -> instruction
extensions.

~~~
userbinator
_there's going to come a point where the size of the instruction outweighs
any performance benefits_

Except for the fact that many of these new instructions perform some huge
nontrivial operation in hardware that would've required hundreds or more
regular instructions previously --- AES is a good example. It seems like a
general principle that instruction sets tend to become more CISC-y over time,
as dedicated hardware and instructions designed to operate on such replace
slower software implementations.

------
eriknstr
Does clang / llvm have anything similar?

~~~
mcbain
Clang doesn't have function multi-versioning (FMV), but it now supports the
ifunc attribute for runtime resolution:

[http://clang.llvm.org/docs/AttributeReference.html#ifunc-gnu...](http://clang.llvm.org/docs/AttributeReference.html#ifunc-gnu-ifunc)
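
A rough sketch of what that looks like, assuming x86-64 Linux with glibc (ifunc is a GNU ELF extension, so this won't work on macOS or with a libc that lacks ifunc support). All names here are invented for the example; on other platforms the sketch falls back to a plain function.

```c
#include <stddef.h>

typedef long sum_fn(const int *, size_t);

/* Baseline version, always available. */
static long sum_generic(const int *p, size_t n)
{
    long total = 0;
    for (size_t i = 0; i < n; i++)
        total += p[i];
    return total;
}

#if defined(__GNUC__) && defined(__x86_64__) && defined(__linux__) && defined(__has_attribute)
#  if __has_attribute(ifunc)
#    define HAVE_IFUNC 1
#  endif
#endif

#ifdef HAVE_IFUNC
/* Same loop, but the compiler may use AVX2 when optimizing this copy. */
__attribute__((target("avx2")))
static long sum_avx2(const int *p, size_t n)
{
    long total = 0;
    for (size_t i = 0; i < n; i++)
        total += p[i];
    return total;
}

/* Resolver: run by the dynamic loader, before constructors, so the CPU
   feature data has to be initialised explicitly with __builtin_cpu_init. */
static sum_fn *resolve_sum(void)
{
    __builtin_cpu_init();
    return __builtin_cpu_supports("avx2") ? sum_avx2 : sum_generic;
}

/* The symbol sum_ints is bound to whatever resolve_sum returns. */
long sum_ints(const int *p, size_t n) __attribute__((ifunc("resolve_sum")));
#else
/* No ifunc support: just forward to the baseline version. */
long sum_ints(const int *p, size_t n)
{
    return sum_generic(p, n);
}
#endif
```

Unlike FMV you write the resolver yourself, which is more typing but also means you control the selection logic.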

------
hitlin37
Pretty nice. I like the code in the article; having separate functions for
different archs seems much cleaner to look at. This will come in handy during
code maintenance.

