1.2 Features of VCL
∙ Vectors of 8-, 16-, 32- and 64-bit integers, signed and unsigned
∙ Vectors of single and double precision floating point numbers
∙ Total vector size 128, 256, or 512 bits
∙ Defines almost all common operators
∙ Boolean operations and branches on vector elements
∙ Many arithmetic functions
∙ Standard mathematical functions
∙ Permute, blend, gather, scatter, and table look-up functions
∙ Fast integer division
∙ Can build code for different instruction sets from the same source code
∙ CPU dispatching to utilize higher instruction sets when available
∙ Uses metaprogramming to find the optimal implementation for the selected instruction set and parameter values of a given operator or function
∙ Includes extra add-on packages for special purposes and applications
What's the advantage of hand-rolled CPU dispatching over compiler features like function multiversioning in GCC/Clang?
Notably, because the dispatching is done at runtime, you are trading larger code size and extra dispatching code in your critical path for convenience. Additionally, I've seen anecdotally that on modern Intel hardware the power heuristics can penalize you for even _speculatively_ running some of the wider instruction sets.
I am just curious.
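For anyone unsure what "hand-rolled dispatching" looks like in practice, here is a minimal sketch. It is not VCL's dispatcher: the names `sum`, `sum_generic`, `sum_avx2` and `pick_sum` are made up for illustration, and it assumes GCC or Clang on x86 for the `target` attribute and `__builtin_cpu_supports`. Elsewhere it falls back to the generic version.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Baseline version, compiled for whatever the default target is.
static std::int64_t sum_generic(const std::int32_t* p, std::size_t n) {
    std::int64_t s = 0;
    for (std::size_t i = 0; i < n; ++i) s += p[i];
    return s;
}

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#define SKETCH_X86_DISPATCH 1
// Same loop, but the compiler is allowed to use AVX2 inside this function.
__attribute__((target("avx2")))
static std::int64_t sum_avx2(const std::int32_t* p, std::size_t n) {
    std::int64_t s = 0;
    for (std::size_t i = 0; i < n; ++i) s += p[i];
    return s;
}
#endif

using SumFn = std::int64_t (*)(const std::int32_t*, std::size_t);

// Ask the CPU once which version it can run.
static SumFn pick_sum() {
#ifdef SKETCH_X86_DISPATCH
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx2")) return sum_avx2;
#endif
    return sum_generic;
}

// Public entry point: the choice is made on the first call, then cached.
std::int64_t sum(const std::int32_t* p, std::size_t n) {
    assert(n == 0 || p != nullptr);
    static const SumFn impl = pick_sum();
    return impl(p, n);  // indirect call on every use
}
```

The indirect call in `sum` is the "extra dispatching code in your critical path", and carrying both kernels in the binary is the code-size cost. Function multiversioning generates roughly this pattern for you.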
By vectors he means the CPU's low-level SIMD vectors.
So it is for when you want to write your own low-level algorithms without going down to the actual intrinsics.
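To make that concrete, here is a rough sketch of the idea behind such a wrapper. It is not VCL code: `Float4` and `add_arrays` are hypothetical names, and VCL's real classes are far more complete. The intrinsics are hidden behind a small class with overloaded operators, and there is a scalar fallback for builds without SSE.

```cpp
#include <cassert>
#include <cstddef>
#if defined(__SSE__) || defined(_M_X64)
#include <xmmintrin.h>
#define SKETCH_HAVE_SSE 1
#endif

// Hypothetical 4 x float vector type; the intrinsics live only in here.
struct Float4 {
#ifdef SKETCH_HAVE_SSE
    __m128 v;
    static Float4 load(const float* p) { Float4 r; r.v = _mm_loadu_ps(p); return r; }
    void store(float* p) const { _mm_storeu_ps(p, v); }
    friend Float4 operator+(Float4 a, Float4 b) { a.v = _mm_add_ps(a.v, b.v); return a; }
#else
    float v[4];
    static Float4 load(const float* p) {
        Float4 r;
        for (int i = 0; i < 4; ++i) r.v[i] = p[i];
        return r;
    }
    void store(float* p) const { for (int i = 0; i < 4; ++i) p[i] = v[i]; }
    friend Float4 operator+(Float4 a, Float4 b) {
        for (int i = 0; i < 4; ++i) a.v[i] += b.v[i];
        return a;
    }
#endif
};

// The algorithm itself reads like ordinary arithmetic on values.
void add_arrays(const float* a, const float* b, float* out, std::size_t n) {
    assert(n == 0 || (a && b && out));
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4)                   // four elements per step
        (Float4::load(a + i) + Float4::load(b + i)).store(out + i);
    for (; i < n; ++i) out[i] = a[i] + b[i];     // leftover tail elements
}
```

The loop body never mentions an intrinsic, which is the point: the user-level code stays readable while the wrapper decides how the operation maps to the hardware.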
The implementation and comments related to the "perm_flags" function are here: https://github.com/vectorclass/version2/blob/6b16b1aaa388067...
Maintainability is a bit harder, and compilers suck. You always have to check for new compiler regressions, esp. with the broken restrict/noalias feature in newer GCC versions, or in Clang with -O3.