
Major Update of Vector Class Library - EvgeniyZh
https://www.agner.org/optimize/blog/read.php?i=1013
======
jey
From the docs:

    
    
        1.2 Features of VCL
        ∙ Vectors of 8-, 16-, 32- and 64-bit integers, signed and unsigned
        ∙ Vectors of single and double precision floating point numbers
        ∙ Total vector size 128, 256, or 512 bits
        ∙ Defines almost all common operators
        ∙ Boolean operations and branches on vector elements
        ∙ Many arithmetic functions
        ∙ Standard mathematical functions
        ∙ Permute, blend, gather, scatter, and table look-up functions
        ∙ Fast integer division
        ∙ Can build code for different instruction sets from the same source code
        ∙ CPU dispatching to utilize higher instruction sets when available
        ∙ Uses metaprogramming to find the optimal implementation for the selected instruction set and parameter values of a given operator or function
        ∙ Includes extra add-on packages for special purposes and applications
    

(TL;DR: this is for CPU SIMD vectors and isn't directly comparable to a linear
algebra library like Eigen.)

~~~
throwaway542134
> ∙ CPU dispatching to utilize higher instruction sets when available

What's the advantage of hand-rolled CPU dispatching over compiler features
like function multiversioning in GCC/Clang?

~~~
pdovy
FMV is pretty neat but there are scenarios where it's not ideal, so I'm not
surprised to see it not used in what is meant to be a lightweight high
performance library.

Notably, because the dispatching is done at runtime, you pay for the
convenience with larger code size and extra dispatching code in your critical
path. Additionally, I've seen anecdotally that on modern Intel hardware the
power heuristics can penalize you for even _speculatively_ running some of the
wider instruction sets.

~~~
throwaway542134
Am I missing something? The CPU dispatching in this library is done at runtime
too... I thought FMV only has a penalty on the first call?

~~~
pdovy
Ah apologies, I only took a cursory look at his dispatching logic. It looks to
me like it supports both modes, but you have to hand roll the dispatching
logic if you want to use it at runtime (but he provides an example). If you
_really_ need the runtime dispatch then yeah I'd agree FMV is probably
cleaner.
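
For readers who have not seen hand-rolled runtime dispatch, here is a minimal sketch of the general pattern discussed above, not VCL's actual code. All names (`sum`, `sum_ptr`, `sum_dispatch`, `sum_generic`, `sum_wide`, `dispatch_calls`) are hypothetical, and `sum_wide` is only a stand-in for a kernel that would really be compiled for a wider instruction set. The entry point is a function pointer that initially targets the dispatcher; the first call picks a kernel and overwrites the pointer, so later calls cost only one indirect call. The CPU check assumes the GCC/Clang `__builtin_cpu_supports` builtin on x86 and falls back to the generic kernel elsewhere.

```cpp
#include <cassert>
#include <cstddef>

// Baseline kernel (hypothetical): plain scalar loop.
static float sum_generic(const float* data, std::size_t n) {
    float total = 0.0f;
    for (std::size_t i = 0; i < n; ++i) total += data[i];
    return total;
}

// Stand-in for a kernel built for a wider instruction set (assumption:
// in real code this would live in a file compiled with e.g. AVX2 flags).
static float sum_wide(const float* data, std::size_t n) {
    float acc[4] = {0.0f, 0.0f, 0.0f, 0.0f};
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4)
        for (std::size_t k = 0; k < 4; ++k) acc[k] += data[i + k];
    float total = acc[0] + acc[1] + acc[2] + acc[3];
    for (; i < n; ++i) total += data[i];  // leftover elements
    return total;
}

// Runtime CPU check; GCC/Clang on x86 only, otherwise report "no AVX2".
static bool cpu_has_avx2() {
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    return __builtin_cpu_supports("avx2");
#else
    return false;
#endif
}

using SumFn = float (*)(const float*, std::size_t);

static float sum_dispatch(const float* data, std::size_t n);

// Starts at the dispatcher, then is replaced by the chosen kernel.
// Note: this simple version is not thread-safe on the first call.
static SumFn sum_ptr = sum_dispatch;

static int dispatch_calls = 0;  // demonstration only: counts dispatcher runs

static float sum_dispatch(const float* data, std::size_t n) {
    ++dispatch_calls;
    sum_ptr = cpu_has_avx2() ? sum_wide : sum_generic;
    return sum_ptr(data, n);
}

// Public entry point: one indirect call after the first invocation.
inline float sum(const float* data, std::size_t n) {
    assert(data != nullptr || n == 0);
    return sum_ptr(data, n);
}
```

Calling `sum(v, n)` repeatedly runs the dispatcher only once. The indirect call stays on every invocation, which is the critical-path overhead mentioned earlier in the thread.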

------
snowAbstraction
Why would I choose this over
[http://eigen.tuxfamily.org/](http://eigen.tuxfamily.org/) or
[https://bitbucket.org/blaze-lib/blaze/src/master/](https://bitbucket.org/blaze-lib/blaze/src/master/)?

I am just curious.

~~~
nn3
I think those are much higher level and work with larger vectors.

By "vectors" he means the low-level SIMD vectors of the CPU.

So this is for when you want to write your own low-level algorithms without
going down to the actual intrinsics.

------
gameswithgo
This is a great SIMD resource. Even if you don't use it directly, the source
can be a great guide to implementing the various tricky things you often need
when programming with SIMD intrinsics.

------
vortico
Interesting. I've pieced together something like this for
[https://github.com/VCVRack/Rack/tree/v1/include/simd](https://github.com/VCVRack/Rack/tree/v1/include/simd)
in C++11, but it only works for SSE2.

------
CogitoCogito
He mentions using new metaprogramming techniques enabled by C++14/17 to
choose optimal tuning parameters at compile time. What is the main upshot
of this? Are there benchmarks showing they improve performance? Do they
improve maintainability? Possibly both?

~~~
jepler
I haven't digested just how it works, but "if constexpr" is used to choose
different, more efficient instructions, when "special" permutations are
chosen. Here is the implementation for a 128-bit permutation:
[https://github.com/vectorclass/version2/blob/master/vectorf1...](https://github.com/vectorclass/version2/blob/master/vectorf128.h#L2402)

Implementation and comments related to "perm_flags" function here:
[https://github.com/vectorclass/version2/blob/6b16b1aaa388067...](https://github.com/vectorclass/version2/blob/6b16b1aaa38806704dd308fb731af2ccdfe632c9/instrset.h#L564)
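
To illustrate the idea without reading the real source, here is a rough C++17 sketch of compile-time selection with "if constexpr". It is not VCL's implementation: `Vec4`, `Path`, `choose_path`, `permute4` and `check_permute_examples` are hypothetical names, and a plain `std::array` stands in for a SIMD register. The point is only that the index pattern is a template parameter, so the "special" cases (identity, broadcast) are detected during compilation and the unused branches are discarded.

```cpp
#include <array>
#include <cassert>

using Vec4 = std::array<int, 4>;  // stand-in for a 4-element SIMD vector

enum class Path { Identity, Broadcast, General };

// Classify the permutation pattern entirely at compile time.
template <int i0, int i1, int i2, int i3>
constexpr Path choose_path() {
    static_assert(i0 >= 0 && i0 < 4 && i1 >= 0 && i1 < 4 &&
                  i2 >= 0 && i2 < 4 && i3 >= 0 && i3 < 4,
                  "permutation index out of range");
    if constexpr (i0 == 0 && i1 == 1 && i2 == 2 && i3 == 3) {
        return Path::Identity;   // nothing to do
    } else if constexpr (i0 == i1 && i1 == i2 && i2 == i3) {
        return Path::Broadcast;  // one element copied to every lane
    } else {
        return Path::General;    // arbitrary shuffle
    }
}

// Only the branch matching the pattern is instantiated.
template <int i0, int i1, int i2, int i3>
constexpr Vec4 permute4(const Vec4& a) {
    constexpr Path path = choose_path<i0, i1, i2, i3>();
    if constexpr (path == Path::Identity) {
        return a;
    } else if constexpr (path == Path::Broadcast) {
        const int x = a[i0];
        return Vec4{x, x, x, x};
    } else {
        return Vec4{a[i0], a[i1], a[i2], a[i3]};
    }
}

// Usage example: each call below compiles down to a different branch.
inline void check_permute_examples() {
    const Vec4 v{10, 20, 30, 40};
    assert((permute4<0, 1, 2, 3>(v) == v));
    assert((permute4<2, 2, 2, 2>(v) == Vec4{30, 30, 30, 30}));
    assert((permute4<3, 2, 1, 0>(v) == Vec4{40, 30, 20, 10}));
}
```

In a real SIMD library each branch would presumably map to a different instruction or short instruction sequence rather than to array indexing, with the choice still made at compile time.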

