Major Update of Vector Class Library (agner.org)
59 points by EvgeniyZh 79 days ago | 14 comments

From the docs:

    1.2 Features of VCL
    ∙ Vectors of 8-, 16-, 32- and 64-bit integers, signed and unsigned
    ∙ Vectors of single and double precision floating point numbers
    ∙ Total vector size 128, 256, or 512 bits
    ∙ Defines almost all common operators
    ∙ Boolean operations and branches on vector elements
    ∙ Many arithmetic functions
    ∙ Standard mathematical functions
    ∙ Permute, blend, gather, scatter, and table look-up functions
    ∙ Fast integer division
    ∙ Can build code for different instruction sets from the same source code
    ∙ CPU dispatching to utilize higher instruction sets when available
    ∙ Uses metaprogramming to find the optimal implementation for the selected instruction set and parameter values of a given operator or function
    ∙ Includes extra add-on packages for special purposes and applications
(Tldr: this is for CPU vectors and isn't directly comparable to a linear algebra library like Eigen.)

> ∙ CPU dispatching to utilize higher instruction sets when available

What's the advantage of hand-rolled CPU dispatching over compiler features like function multiversioning in GCC/Clang?

FMV is pretty neat but there are scenarios where it's not ideal, so I'm not surprised to see it not used in what is meant to be a lightweight high performance library.

Notably, the fact that the dispatching is done at runtime means you are trading the convenience factor for code size and extraneous dispatching code in your critical path. Additionally, I've anecdotally seen that on modern Intel hardware the power heuristics can penalize you for even _speculatively_ running some of the wider instruction sets.

Am I missing something? The CPU dispatching in this library is done at runtime too... I thought FMV only has a penalty on the first call?

Ah apologies, I only took a cursory look at his dispatching logic. It looks to me like it supports both modes, but you have to hand roll the dispatching logic if you want to use it at runtime (but he provides an example). If you _really_ need the runtime dispatch then yeah I'd agree FMV is probably cleaner.

Why would I choose this over http://eigen.tuxfamily.org/ or https://bitbucket.org/blaze-lib/blaze/src/master/ ?

I am just curious.

I think these are much higher level with larger vectors.

By vectors he means the low-level SIMD registers of the CPU.

So it's useful if you want to write your own low-level algorithms without going down to the actual intrinsics.

This is a SIMD library (vector in the sense of doing many things at once); what you linked are maths libraries (vector as in the mathematical concept). There's clearly a bunch of overlap, both are used to do calculations, and afaik the libraries you link also use SIMD when they can, so the practical difference is that this library is lower level and more generally applicable (but presumably more complex or difficult to work with).

You'd choose this if you don't actually need higher-level math functions, or if it's prohibitive to implement something at the higher level. For example, if you have an operation where you know the sparsity of a matrix at compile time but the values of the matrix can change, you may want to implement the matrix multiplication by hand using SIMD ops.

This is a great SIMD resource; even if you don't use it directly, the source can be a great guide on how to implement various tricky things you often need when doing SIMD intrinsics programming.

Interesting. I've pieced together something like this for https://github.com/VCVRack/Rack/tree/v1/include/simd in C++11, but it only works for SSE2.

He mentions using new metaprogramming techniques enabled by C++14/17 to choose optimal tuning parameters at compile time. What is the main upshot of this? Are there benchmarks showing they improve performance? Do they improve maintainability? Possibly both?

I haven't digested just how it works, but "if constexpr" is used to choose different, more efficient instructions, when "special" permutations are chosen. Here is the implementation for a 128-bit permutation: https://github.com/vectorclass/version2/blob/master/vectorf1...

Implementation and comments related to "perm_flags" function here: https://github.com/vectorclass/version2/blob/6b16b1aaa388067...

Whether things are constexpr or not makes a huge difference in large-scale vectorizable operations for me. In my memcpy it's either two times slower without it or two times faster with it.

Maintainability is a bit harder, and compilers suck. You always have to check for new compiler regressions, esp. the broken restrict/noalias handling in newer GCCs or in Clang with -O3.
