Design of a low-level C++ template SIMD library [pdf] (uni-bielefeld.de)
96 points by lainon on Jan 2, 2018 | 29 comments

This would deserve an upvote if the code was legally usable.

Since it's not too easy to spot, the paper refers to http://www.ti.uni-bielefeld.de/html/people/moeller/tsimd_war... and that page has a "Software Download" section with a custom license that has significant arbitrary restrictions:

* "agrees not to transfer [...] to other individuals"

* "agrees not to use the software [...] where property of humans [is] endangered"

* etc.

I.e. this "contribution" would only be relevant if it was usable under a really free license, e.g. one of: https://www.gnu.org/licenses/license-list.html#SoftwareLicen...

The code's meant to be of use educationally and the license is meant to limit it to that purpose. There is not some kind of bait-and-switch here, the fact that the code accompanies a fifty page academic paper is a pretty big hint as to what is going on.

> I.e. this "contribution" would only be relevant if it was usable under a really free license, e.g. one of: https://www.gnu.org/licenses/license-list.html#SoftwareLicen...

The ironic thing about your comments is that half the open source projects out there would reject the code anyhow if it was GPLed. I'm not sure why we should expect the author of such a paper to pick the magic combination of licenses (because you'd have to have multiple licenses, and that's a pain in the butt) to make everyone happy, when making everyone happy is not the purpose, writing an academic paper is.

Sometimes the value you're going to get from code comes simply from reading it or using it as a reference, and that is okay.

It's still very interesting even if I can't just use the code. Not everyone wants to share their code and that's ok, but they are still sharing knowledge which is great.

It is "legally usable". Just not under your pet requirements.

Don't know what you mean by "pet requirements", but this restriction:

> (3) The software and the databases will only be used for the licensee's own scientific study, scientific research, or academic teaching. Use for commercial or business purposes is not permitted. [...]

is quite limiting.

Yes, it is. Still, it is usable in those cases.

Absolutely agreed. As I posted elsewhere on the thread these insane custom licenses guarantee that you can't use this code for anything serious. Even if you wanted to work on a project as a hobbyist, you can't redistribute this code. So if this guy gets hit by a bus or decides that he doesn't want to release this code any more, tough.

What's more is that as this HN post conveys, there are plenty of libraries like this, many being developed under much less bizarre licenses.

The thing that I find particularly bizarre is that I have a lot of understanding for people who craft custom licenses to make money to support a project (or do dual-license stuff). But this just seems self-defeating. I honestly just stopped reading at LICENSE.

There's also a ton of frisky legal bullshit:

"(11) Should any provision of this license agreement be or become invalid, this shall not affect the validity of the remaining provisions. Any invalid provision shall be replaced by a valid provision which corresponds to the meaning and purpose of the invalid provision."

Ah yes, the old "if my clause is legal garbage, magically replace it by what I meant and enforce that".

I've had success using C vector extensions in GCC and Clang [0]. With a simple typedef, you get a portable SIMD vector type and basic arithmetic operators working. It's compatible with platform-specific intrinsics (SSE, NEON, etc), check out this small example with some basic arithmetic and a few uses of intrinsics with it and what kind of compiler output it produces [1] (warning: I'm pretty sure the rcp/rsqrt/sqrt functions are wrong, this was just an experiment).

Here's the gist of it:

    typedef float vec4f __attribute__((vector_size(4 * sizeof(float))));
    vec4f a = { 1.0, 2.0, 3.0, 4.0 }, b = { 5.0, 6.0, 7.0, 8.0 };
    vec4f c = (2.0 * a) + (a + b * b);  // with -ffast-math, this will emit a fused multiply-and-add (FMA)

Note: if you look inside the intrinsics headers (xmmintrin.h, arm_neon.h, etc) supplied by GCC, you'll find that it uses these internally. E.g. _mm_add_ps(a, b) is defined as a+b.

I work with basic 3d math and physics, so I don't need that much and just having 4-wide vectors is good enough for me.

I've also found out that you can use vector widths that are not available in the target machine. E.g. 4 x double vectors work fine even without 256 bit registers, the compiler will split the vector and use two 128 bit registers and emit two instructions. This might also work for using 16 x float vectors for 4x4 matrices.
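A minimal sketch of the wider-than-hardware case, assuming GCC or Clang. The names `vec4d` and `axpy4` are made up for this illustration, and whether the compiler emits one 256-bit instruction or two 128-bit ones depends on the target flags. The vectors are passed by reference here only to sidestep ABI warnings on targets without 256-bit registers.

```cpp
#include <cassert>

// GCC/Clang vector extension: four doubles (32 bytes) treated as one value.
// On a target without 256-bit registers the compiler may split each
// operation into two 128-bit halves.
typedef double vec4d __attribute__((vector_size(4 * sizeof(double))));

// Hypothetical helper: out = a * x + y, element-wise over four lanes.
inline void axpy4(double a, const vec4d& x, const vec4d& y, vec4d& out) {
    const vec4d av = {a, a, a, a};  // explicit broadcast of the scalar
    out = av * x + y;               // one expression covers all four lanes
}
```

Comparing the compiler output with and without something like `-mavx2` shows whether the vector was split.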

Some C++ overloading magic would be useful for naming things (e.g. no need for dot4f vs dot4d).
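As a rough sketch of the overloading idea, assuming GCC or Clang vector extensions: the `float` and `double` vector typedefs are distinct types, so one name can serve both. The name `dot` and both typedefs are illustrative, not taken from any existing library.

```cpp
#include <cassert>

typedef float  vec4f __attribute__((vector_size(4 * sizeof(float))));
typedef double vec4d __attribute__((vector_size(4 * sizeof(double))));

// Single-precision dot product: multiply lane-wise, then sum the lanes.
inline float dot(const vec4f& a, const vec4f& b) {
    const vec4f p = a * b;
    return p[0] + p[1] + p[2] + p[3];
}

// Double-precision overload: same name, picked by argument type.
inline double dot(const vec4d& a, const vec4d& b) {
    const vec4d p = a * b;
    return p[0] + p[1] + p[2] + p[3];
}
```

A template over the element type would also work, but plain overloads keep the error messages readable.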

I've been trying to get some time to write an article about the ins and outs of using vector extensions, but haven't got there yet. Some effort would also be required to put together a decent library of basic arithmetic (dot, cross, quaternion product, matrix product & inverse, etc) as well as basic libm functions (sin, cos, log, exp). I haven't had the time to put together a comprehensive (and well tested) collection of these nor have I found any open source library that would do.

[0] https://gcc.gnu.org/onlinedocs/gcc/Vector-Extensions.html [1] https://godbolt.org/g/N9VvXZ

We've used libsimdpp to good effect: https://github.com/p12tic/libsimdpp

"libsimdpp is a portable header-only zero-overhead C++ low level SIMD library." Not yet sure how it compares to the linked library.

Do you find it’s easier to write code with that than rely on autovec?

You can't rely on autovectorization because it's a really brittle optimization that only works at the best of times, and generally only with simple loops.

For anything more complex, you need to write SIMD code explicitly. Getting good performance requires writing code where the full width of the registers is used. If the compiler falls back to scalar arithmetic, it tends to pollute the surrounding code with register spilling, because vector registers are then needed for scalar work (i.e. only the first component of the xmm0 register is used).

Writing SIMD code is quite a bit of effort if you need to get it working well.

You can also not rely on things like tailcall optimization to automatically happen. That is why you usually annotate the function with @tailrec in Scala for example. The annotation doesn't do anything by itself. The compiler will just show an error/warning if the function is not optimized with a tail call.

Autovectorised SIMD code would probably need something like an "AUTOVEC" annotation at every single line to be effective which defeats the purpose of autovectorisation in the first place.

> Autovectorised SIMD code would probably need something like an "AUTOVEC" annotation

If you only need SIMD for stream processing, autovectorisation is OK.

But there are multiple autovectorizers for C. The default one is indeed very fragile, but the one in OpenMP 4 is better: http://www.hpctoday.com/hpc-labs/explicit-vector-programming...
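For reference, a minimal sketch of what the OpenMP 4 route looks like on a simple streaming loop. The function name `saxpy` is just an illustration. The pragma only takes effect when the compiler is invoked with OpenMP SIMD support enabled (for GCC and Clang, `-fopenmp` or `-fopenmp-simd`); otherwise it is ignored and the loop stays ordinary scalar code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// y[i] += a * x[i] for every element: the classic stream-processing loop.
inline void saxpy(float a, const std::vector<float>& x, std::vector<float>& y) {
    assert(x.size() == y.size());
    const float* xp = x.data();
    float* yp = y.data();
    const std::size_t n = y.size();

    // Asks the compiler to vectorize this loop rather than leaving the
    // decision to its own heuristics.
    #pragma omp simd
    for (std::size_t i = 0; i < n; ++i) {
        yp[i] += a * xp[i];
    }
}
```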

But even the OpenMP 4 one is very limited.

One reason is that many SSE operations don’t map to C: approximate math (rcpps, rsqrtps), composite operations (FMA, AES), and saturated math (there are dozens of instructions for manipulating 8 and 16 bit numbers with saturation, i.e. on over/underflow the numbers don’t wrap around by stripping the highest byte[s] but stay at the min/max 8/16 bit value).
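To make the wrap-versus-saturate distinction concrete, here is a scalar sketch of both behaviours for one unsigned 8-bit lane. The function names are made up; the saturating version models what an instruction such as paddusb does per lane, but it is plain C++, not an intrinsic.

```cpp
#include <cassert>
#include <cstdint>

// Ordinary C/C++ arithmetic narrowed back to 8 bits wraps modulo 256.
inline std::uint8_t add_wrap_u8(std::uint8_t a, std::uint8_t b) {
    return static_cast<std::uint8_t>(a + b);
}

// Saturating add: on overflow the result stays at the maximum value (255).
inline std::uint8_t add_sat_u8(std::uint8_t a, std::uint8_t b) {
    const unsigned sum = static_cast<unsigned>(a) + static_cast<unsigned>(b);
    return static_cast<std::uint8_t>(sum > 255u ? 255u : sum);
}
```

With inputs 200 and 100, the wrapping version yields 44 and the saturating one yields 255.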

Another reason is that some SSE instructions operate horizontally (phminposuw, pmaddubsw, psadbw, dpps), or are advanced swizzle instructions (shufps, pshufb, pshuflw, pshufhw, pslldq), both of which are very hard to autogenerate from these #pragma omp simd loops.

It’s not C++, but it's worth mentioning ISPC, which extends C with data-parallel constructs. Notably, it can generate generic wide code as long as an implementation of the different intrinsics is provided.

Also worth mentioning here is Vc - portable, zero-overhead C++ types for explicitly data-parallel programming https://github.com/VcDevel/Vc/ (http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2017/p021...).

Vc also looks rather more conventional in terms of license and distribution method (3-clause BSD and github respectively).

Custom licenses are a headache and I always wonder whether the academics who promulgate them wonder why no-one uses their software.

I recommend Agner Fog’s Vector Class Library instead


I agree, although I get some (perhaps benign?) warnings with gcc 7.2.

Would be nice if the authors compared it to some of the existing linear algebra packages with SIMD support, like Eigen[1], which I've found to be very useful and easy to use. It's header-only, with additional functionality and cache-sensitive algorithms in addition to SIMD.

[1] http://eigen.tuxfamily.org/index.php?title=Main_Page

What kind of comparison are you looking for? The two libraries offer different levels of abstraction and functionality altogether. The paper linked here offers abstractions over the registers/intrinsics themselves, and Eigen offers abstractions at the array and matrix level, plus all the built-in algorithms, etc. Eigen is the type of library that might use a library like the one linked to implement its functionality/algorithms. Maybe you want to know what's going on under the hood of Eigen?


Almost any game studio, company, etc. would have several of these... and github would have many there, like https://github.com/google/dimsum

Even some individuals do: https://github.com/Const-me/VectorMath

Another interesting one


Also includes implementations of functions like sin and cos

Sadly this punts on a number of things that are too important: masking and gather/scatter. I tried doing this in Syrah [1] many years ago (oh LRBni) and realized that the masking for NEON, AVX2 and LRBni were all annoyingly different (particularly for mixed precision, like say floats and doubles). Now that Skylake is actually available en masse nearly 10 years later, I should probably clean this up and do a proper AVX-512 specialization.

[1] https://github.com/boulos/syrah
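For readers unfamiliar with masking, a rough sketch of the full-width-mask style using GCC/Clang vector extensions. This is not Syrah's API and it does not model AVX-512's dedicated mask registers; `blend4` and `max4` are names invented for the example. A lane-wise comparison yields an integer vector with all bits set in lanes where it holds, and that vector then selects between two inputs.

```cpp
#include <cassert>
#include <cstdint>

typedef float        vec4f __attribute__((vector_size(16)));
typedef std::int32_t vec4i __attribute__((vector_size(16)));

// Per lane: take a where the mask is all ones, b where it is zero.
// The casts reinterpret the bits; they do not convert values.
inline vec4f blend4(vec4i mask, vec4f a, vec4f b) {
    return (vec4f)((mask & (vec4i)a) | (~mask & (vec4i)b));
}

// Lane-wise maximum built from a comparison mask.
inline vec4f max4(vec4f a, vec4f b) {
    const vec4i mask = a > b;  // -1 (all bits set) where a > b, else 0
    return blend4(mask, a, b);
}
```

Mixed precision is where this gets awkward: a mask produced by comparing doubles has 64-bit lanes and cannot be applied directly to a float vector.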

Unfortunately, Skylake doesn't include AVX-512, only the recently released Skylake-X, which is available on server chips and a handful of high end consumer parts.

Skylake doesn't do AVX512. Only "Skylake Server" (aka: Xeons and "extreme" i7 or i9 processors) do AVX512.

IceLake is expected to do AVX512.

there is also xsimd, which is a pretty cool reimplementation of a lot of Boost.SIMD!


Watch https://www.youtube.com/watch?v=GzZ-8bHsD5s to learn how RISC-V does SIMD, without hardcoding vector lengths or needing loop peeling.

work in progress that uses Nim macros to make a nice SIMD library:

