
Design of a low-level C++ template SIMD library [pdf] - lainon
https://www.ti.uni-bielefeld.de/downloads/publications/templateSIMD.pdf
======
Lanedo
This would deserve an upvote if the code was legally usable.

Since it's not too easy to spot, the paper refers to [http://www.ti.uni-
bielefeld.de/html/people/moeller/tsimd_war...](http://www.ti.uni-
bielefeld.de/html/people/moeller/tsimd_warpingsimd.html) and that page has a
"Software Download" section with a custom license that has significant
arbitrary restrictions:

* "agrees not to transfer [...] to other individuals"

* "agrees not to use the software [...] where property of humans [is] endangered"

* etc.

I.e. this "contribution" would only be relevant if it were usable under a really
_free_ license, e.g. one of: [https://www.gnu.org/licenses/license-
list.html#SoftwareLicen...](https://www.gnu.org/licenses/license-
list.html#SoftwareLicenses)

~~~
Tomte
It is "legally usable". Just not under your pet requirements.

~~~
rootlocus
Don't know what you mean by "pet requirements", but this restriction:

> (3) The software and the databases will only be used for the licensee's own
> scientific study, scientific research, or academic teaching. Use for
> commercial or business purposes is not permitted. [...]

is quite limiting.

~~~
Tomte
Yes, it is. Still, it is usable in those cases.

------
exDM69
I've had success using C vector extensions in GCC and Clang [0]. With a simple
typedef, you get a portable SIMD vector type and basic arithmetic operators
working. It's compatible with platform-specific intrinsics (SSE, NEON, etc.);
check out this small example [1] with some basic arithmetic, a few uses of
intrinsics, and the kind of compiler output it produces (warning: I'm pretty
sure the rcp/rsqrt/sqrt functions are wrong; this was just an experiment).

Here's the gist of it:

    
    
        typedef float vec4f __attribute__((vector_size(4 * sizeof(float))));
        vec4f a = { 1.0, 2.0, 3.0, 4.0 }, b = { 5.0, 6.0, 7.0, 8.0 };
        vec4f c = (2.0f * a) + (a + b * b);  // with -ffast-math, this will emit a fused multiply-add (FMA)
    

Note: if you look inside the intrinsics headers (xmmintrin.h, arm_neon.h, etc)
supplied by GCC, you'll find that it uses these internally. E.g. _mm_add_ps(a,
b) is defined as a+b.

I work with basic 3d math and physics, so I don't need that much and just
having 4-wide vectors is good enough for me.

I've also found out that you can use vector widths that are not available on
the target machine. E.g. 4 x double vectors work fine even without 256-bit
registers; the compiler will split the vector across two 128-bit registers
and emit two instructions. This might also work for using 16 x float vectors
for 4x4 matrices.

Some C++ overloading magic would be useful for naming things (e.g. no need for
dot4f vs dot4d).

I've been trying to get some time to write an article about the ins and outs
of using vector extensions, but haven't got there yet. Some effort would also
be required to put together a decent library of basic arithmetic (dot, cross,
quaternion product, matrix product & inverse, etc) as well as basic libm
functions (sin, cos, log, exp). I haven't had the time to put together a
comprehensive (and well tested) collection of these, nor have I found any open
source library that does.

[0] [https://gcc.gnu.org/onlinedocs/gcc/Vector-
Extensions.html](https://gcc.gnu.org/onlinedocs/gcc/Vector-Extensions.html)
[1] [https://godbolt.org/g/N9VvXZ](https://godbolt.org/g/N9VvXZ)

------
cottonseed
We've used libsimdpp to good effect:
[https://github.com/p12tic/libsimdpp](https://github.com/p12tic/libsimdpp)

"libsimdpp is a portable header-only zero-overhead C++ low level SIMD
library." Not yet sure how it compares to the linked library.

~~~
marmaduke
Do you find it’s easier to write code with that than rely on autovec?

~~~
exDM69
You can't rely on autovectorization because it's a really brittle optimization
that only works at the best of times, and generally only with simple loops.

For anything more complex, you need to write SIMD code explicitly. Getting
good performance requires writing code that uses the full width of the
registers. If the compiler falls back to scalar arithmetic (i.e. only the
first component of the xmm0 register is used), it tends to pollute the
surrounding code with register spilling.

Writing SIMD code is quite a bit of effort if you need to get it working well.

~~~
imtringued
You also can't rely on things like tail-call optimization happening
automatically. That is why in Scala, for example, you usually annotate the
function with @tailrec. The annotation doesn't do anything by itself; the
compiler just raises an error/warning if the function is not optimized into a
tail call.

Autovectorised SIMD code would probably need something like an "AUTOVEC"
annotation on every single line to be effective, which defeats the purpose of
autovectorisation in the first place.

~~~
Const-me
> Autovectorised SIMD code would probably need something like an "AUTOVEC"
> annotation

If you only need SIMD for stream processing, autovectorisation is OK.

There are multiple autovectorizers in C, though. The default one is indeed very
fragile. But the one in OpenMP 4 is better: [http://www.hpctoday.com/hpc-
labs/explicit-vector-programming...](http://www.hpctoday.com/hpc-
labs/explicit-vector-programming-with-openmp-4-0-simd-extensions/)

But even the OpenMP 4 one is quite limited.

One reason is that many SSE operations don't map to C: approximate math (rcpps,
rsqrtps), composite operations (FMA, AES), and saturated math (there are dozens
of instructions for manipulating 8 and 16 bit numbers with saturation, i.e. on
over/underflow the numbers don't wrap around by dropping the highest byte[s] but
stay at the min/max 8/16 bit value).

Another reason is that some SSE instructions operate horizontally (phminposuw,
pmaddubsw, psadbw, dpps), or are advanced swizzle instructions (shufps,
pshufb, pshuflw, pshufhw, pslldq); both kinds are very hard to autogenerate
from these #pragma omp simd loops.

------
marmaduke
It’s not C++ but worth mention ISPC:

[https://ispc.github.io](https://ispc.github.io)

which extends C with data parallel constructs. Notably, it can generate generic
wide code as long as implementations of the different intrinsics are provided.

------
rbx
Also worth mentioning here is Vc: portable, zero-overhead C++ types for
explicitly data-parallel programming
[https://github.com/VcDevel/Vc/](https://github.com/VcDevel/Vc/)
([http://www.open-
std.org/jtc1/sc22/wg21/docs/papers/2017/p021...](http://www.open-
std.org/jtc1/sc22/wg21/docs/papers/2017/p0214r7.pdf)).

~~~
onan_barbarian
Vc also looks rather more conventional in terms of license and distribution
method (3-clause BSD and github respectively).

Custom licenses are a headache, and I always wonder whether the academics who
promulgate them ever ask themselves why no one uses their software.

------
paulsutter
I recommend Agner Fog’s Vector Class Library instead

[http://www.agner.org/optimize/vectorclass.pdf](http://www.agner.org/optimize/vectorclass.pdf)

~~~
cozzyd
I agree, although I get some (perhaps benign?) warnings with gcc 7.2.

------
marmaduke
Another interesting one

[https://bitbucket.org/eschnett/vecmathlib/src](https://bitbucket.org/eschnett/vecmathlib/src)

It also includes implementations of functions like sin and cos.

------
chuckcode
Would be nice if the authors compared it to some of the existing linear algebra
packages with SIMD support, like Eigen [1], which I've found to be very useful
and easy to use. It's header-only, and offers additional functionality and
cache-sensitive algorithms on top of SIMD.

[1]
[http://eigen.tuxfamily.org/index.php?title=Main_Page](http://eigen.tuxfamily.org/index.php?title=Main_Page)

~~~
stagger87
What kind of comparison are you looking for? The two libraries offer different
levels of abstraction and functionality altogether. The paper linked here
offers abstractions over the registers/intrinsics themselves, while Eigen
offers abstractions at the array and matrix level, plus all the built-in
algorithms etc. Eigen is the type of library that might use a library like the
one linked here to implement its functionality/algorithms. Maybe you want to
know what's going on under the hood of Eigen?

[https://eigen.tuxfamily.org/dox/TopicInsideEigenExample.html](https://eigen.tuxfamily.org/dox/TopicInsideEigenExample.html)

------
malkia
Almost any game studio, company, etc. has several of these... and GitHub hosts
many more, like
[https://github.com/google/dimsum](https://github.com/google/dimsum)

~~~
Const-me
Even some individuals do: [https://github.com/Const-
me/VectorMath](https://github.com/Const-me/VectorMath)

------
boulos
Sadly, this punts on a couple of very important things: masking and
gather/scatter. I tried handling these in Syrah [1] many years ago (oh, LRBni)
and realized that the masking schemes for NEON, AVX2 and LRBni were all
annoyingly different (particularly for mixed precision, like, say, floats and
doubles). Now that Skylake is actually available en masse nearly 10 years
later, I should probably clean this up and do a proper AVX-512 specialization.

[1] [https://github.com/boulos/syrah](https://github.com/boulos/syrah)

~~~
BeeOnRope
Unfortunately, Skylake doesn't include AVX-512, only the recently released
Skylake-X, which is available on server chips and a handful of high end
consumer parts.

------
droelf
There is also xsimd, which is a pretty cool reimplementation of a lot of
Boost.SIMD!

[https://github.com/QuantStack/xsimd](https://github.com/QuantStack/xsimd)

------
br1
Watch
[https://www.youtube.com/watch?v=GzZ-8bHsD5s](https://www.youtube.com/watch?v=GzZ-8bHsD5s)
to learn how RISC-V does SIMD without hardcoding vector lengths or needing
loop peeling.

------
jackmott
A work in progress that uses Nim macros to make a nice SIMD library:

[https://github.com/jackmott/nim_simd](https://github.com/jackmott/nim_simd)

