
Performance: SIMD, Vectorization and Performance Tuning [video] - espeed
https://www.youtube.com/watch?v=_OJmxi4-twY
======
exDM69
Watching this makes me sad that there are no languages with first-class SIMD
vector types that would enable writing portable SIMD code across different CPU
instruction sets (SSE, AVX, NEON, etc.). The closest thing to what I want is
the C vector extensions available in GCC and Clang [0] (you still need some
compiler-specific #ifdefs). GPU and shader languages (GLSL, OpenCL C) have
somewhat better support, but I want that on the CPU too.

Here's a list of my requirements:

1. Built-in types for floating point and integer vectors (compile-time
constant width), e.g. float32x4_t or int64x2_t. _Maybe_ have some matrix types
too.

2. Normal infix operators for arithmetic (+, -, /). You can do this with C
[1]. Built-in syntax for vector shuffles (can't do this in C) [2].

3. Compile-time polymorphism to make vector-width-agnostic code. If you write
sin4f and sin8f (in C), they are line-by-line identical except for the types.
You should be able to write a single sin() function that works for any vector
width.

4. A standard library that has all the usual libm math functions (sin, cos,
log, exp, asin, atanh, etc.). I could live with less-than-perfect precision
for performance (at least when -ffast-math is enabled).

5. A standard library with some basic vector and matrix operations for
statically sized vectors and matrices. E.g. dot product, matrix-matrix
product, matrix-vector product, matrix inverse, etc.

I have some hope for Rust, which has been working on SIMD support. But the
current iteration doesn't fulfill most of my requirements.

[0] [https://gcc.gnu.org/onlinedocs/gcc/Vector-Extensions.html](https://gcc.gnu.org/onlinedocs/gcc/Vector-Extensions.html)

[1] You can do this:

    typedef float float32x4_t __attribute__((vector_size(16)));
    float32x4_t a = { 1, 2, 3, 4 }, b = { 5, 6, 7, 8 }, c = (a+b)*(a-b);

[2] You'll need some #ifdefs around __builtin_shuffle (GCC) and
__builtin_shufflevector (Clang). Something like my_vec.xxyy, similar to GLSL,
would be nicer.

~~~
nuntius
Take a look at Halide. I think it is an excellent DSL that covers the core of
what you describe and is well positioned to be extended to the rest. If
nothing else, its documentation is a good summary of a wide range of
optimization techniques. Written as a C++ library, it also supports dumping
an object file with a C-style header.

[http://halide-lang.org/](http://halide-lang.org/)

Another option in this area is OpenMP. It has a wider base of users and
contributors, but I think its abstractions are not as good.

[http://www.openmp.org/](http://www.openmp.org/)

If you can move to a completely new programming language, then Chapel is built
to easily scale up across large supercomputer clusters.

[http://chapel.cray.com/](http://chapel.cray.com/)

~~~
exDM69
These are all great options for massive parallelism, but that's not what I'm
after.

I want _explicit_ SIMD with 2/4/8/16-wide vectors, primarily for 3D graphics
and physics calculations.

~~~
trendia
I use SIMDPP [0], which allows you to explicitly write SIMD instructions in a
portable way. See the documentation [1] for the available commands.
Specifically, I write code to be used on both x86 and ARM systems.

> libsimdpp is a portable header-only zero-overhead C++ wrapper around single-
> instruction multiple-data (SIMD) intrinsics found in many compilers. The
> library presents a single interface over several instruction sets in such a
> way that the same source code may be compiled for different instruction
> sets. The resulting object files then may be hooked into internal dynamic
> dispatch mechanism.

> The library resolves differences between instruction sets by implementing
> the missing functionality as a combination of several intrinsics. Moreover,
> the library supplies a lot of additional, commonly used functionality, such
> as various variants of matrix transpositions, interleaving loads/stores,
> optimized compile-time shuffling instructions, etc. Each of these are
> implemented in the most efficient manner for the target instruction set.
> Finally, it's possible to fall back to native intrinsics when necessary,
> without compromising maintainability.

[0] [https://github.com/p12tic/libsimdpp](https://github.com/p12tic/libsimdpp)

[1]
[http://p12tic.github.io/libsimdpp/v2.0%7Erc2/libsimdpp/](http://p12tic.github.io/libsimdpp/v2.0%7Erc2/libsimdpp/)

------
theparanoid
Mike Acton's CppCon talk "Data-Oriented Design and C++" [1] is also good.

[1]
[https://www.youtube.com/watch?v=rX0ItVEVjHc](https://www.youtube.com/watch?v=rX0ItVEVjHc)

