
Auto-vectorization for the masses (2011) - lelf
https://leiradel.github.io/2011/05/05/Auto-Vectorization-1.html
======
nwallin
If you're in C++ land, constexpr can be a godsend for auto vectorization.

Traditionally, you have an SoA of all your data. So if you have an array of
triangles (not meshes) you might have three structs p1, p2, p3, each with
three arrays of x, y, z. Then you write a Fortran-esque loop over your triangles.
Usually the vectorizer will vectorize that, but Fortran-esque code is
obnoxious to write IMHO.

Instead, you have the same thing, but with constexpr operator[] methods. The
inner structs return a constexpr-constructed vec3, the outer struct returns a
constexpr-constructed triangle, and then you plug that into a constexpr
processing function/method. GCC and Clang will vectorize this. And you can
reuse normal linear algebra functions like dot, cross, etc. (so long as they're
constexpr) and it doesn't look like Fortran.
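Roughly, the pattern looks like this (a minimal sketch; the names and the per-triangle kernel are illustrative, not from any particular codebase):

```cpp
#include <array>
#include <cstddef>

constexpr std::size_t N = 1024;

struct vec3 { float x, y, z; };

constexpr vec3 cross(vec3 a, vec3 b) {
    return { a.y * b.z - a.z * b.y,
             a.z * b.x - a.x * b.z,
             a.x * b.y - a.y * b.x };
}
constexpr float dot(vec3 a, vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }

struct Points {                      // SoA: one array per coordinate
    std::array<float, N> x, y, z;
    constexpr vec3 operator[](std::size_t i) const { return { x[i], y[i], z[i] }; }
};

struct Triangle { vec3 p1, p2, p3; };

struct Triangles {                   // SoA of triangles: three Points structs
    Points p1, p2, p3;
    constexpr Triangle operator[](std::size_t i) const {
        return { p1[i], p2[i], p3[i] };
    }
};

// Per-triangle kernel written against the ordinary AoS-looking types.
constexpr float twice_area_sq(const Triangle& t) {
    vec3 e1 { t.p2.x - t.p1.x, t.p2.y - t.p1.y, t.p2.z - t.p1.z };
    vec3 e2 { t.p3.x - t.p1.x, t.p3.y - t.p1.y, t.p3.z - t.p1.z };
    vec3 n = cross(e1, e2);
    return dot(n, n);
}

void process(const Triangles& tris, std::array<float, N>& out) {
    // After inlining, the vec3/Triangle temporaries evaporate and the loads
    // are contiguous SoA reads, so GCC/Clang can vectorize this at -O2/-O3.
    for (std::size_t i = 0; i < N; ++i)
        out[i] = twice_area_sq(tris[i]);
}
```

The loop body reads like normal AoS code, but the memory layout underneath is still the Fortran-friendly SoA.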

Matt Godbolt missed this in his "path tracing three ways" presentation. His
data-oriented-design path tracer stored an array of vec3s, which can't be auto-
vectorized (and isn't data-oriented design). I keep meaning to submit a
PR to his repo, but the project has some weird dependencies and I can't get it
to compile.

MSVC won't autovectorize either of the above. Not sure if it doesn't have a
vectorizer or if it's just insufficiently powerful. Or if I'm just using the
wrong compiler flags.

~~~
tboerstad
Do you have a code example of the constexpr operator[] you mention? I am
having a hard time following it.

MSVC has an auto-vectorizer, with /Qvec-report:2 it will give you information
on why it doesn't auto-vectorize a specific loop.

It's well documented, though I'm not particularly fond of it. Here is an
example where the auto-vectorizer works:
[https://godbolt.org/z/crTLUY](https://godbolt.org/z/crTLUY)

Here is the documentation: [https://docs.microsoft.com/en-us/cpp/parallel/auto-parallelization-and-auto-vectorization?view=vs-2019](https://docs.microsoft.com/en-us/cpp/parallel/auto-parallelization-and-auto-vectorization?view=vs-2019)
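For reference, the kind of loop MSVC's vectorizer does handle is a simple counted loop over restrict-qualified pointers; this is an illustrative sketch, not the godbolt example above:

```cpp
#include <cstddef>

// Build with: cl /O2 /Qvec-report:2 example.cpp
// /Qvec-report:2 prints a message per loop, including a reason code
// when a loop is NOT vectorized.
void add(float* __restrict dst, const float* __restrict a,
         const float* __restrict b, std::size_t n) {
    // Simple counted loop, no aliasing, unit stride: MSVC vectorizes this.
    for (std::size_t i = 0; i < n; ++i)
        dst[i] = a[i] + b[i];
}
```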

------
marklacey
I only barely skimmed the post and the follow-on posts so this is less about
that and more about autovectorizers.

Autovectorization is the wrong approach for data-parallelization. You don’t
want to rely on a brittle unpredictable code transformation for performance in
this case. You want to bake it into the programming model.

ispc uses this approach and it results in performance predictability to a
large degree. You can imagine other approaches as well, like explicitly data-
parallel loops, or a declarative approach.
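One explicitly data-parallel model that already ships with GCC's libstdc++ is `<experimental/simd>` from the Parallelism TS v2; a sketch, assuming a standard library that provides that header:

```cpp
#include <cstddef>
#include <experimental/simd>

namespace stdx = std::experimental;

// The vector width is part of the type, so the data-parallelism is in the
// programming model rather than left to an auto-vectorizer's pattern matching.
void saxpy(float a, const float* x, float* y, std::size_t n) {
    using V = stdx::native_simd<float>;
    std::size_t i = 0;
    for (; i + V::size() <= n; i += V::size()) {
        V xv(&x[i], stdx::element_aligned);   // load a full vector lane-set
        V yv(&y[i], stdx::element_aligned);
        yv += a * xv;                         // scalar broadcasts across lanes
        yv.copy_to(&y[i], stdx::element_aligned);
    }
    for (; i < n; ++i)                        // scalar tail
        y[i] += a * x[i];
}
```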

Most of these (and the GPU data-parallel models) rely to a very large extent
on the programmer to manage data dependencies to ensure correctness.

~~~
tom_mellior
> You don’t want to rely on a brittle unpredictable code transformation for
> performance in this case.

That's somewhat true, but much of the unpredictability could be removed if
compilers provided annotations saying "I expect this loop to be vectorized"
where the compiler would be forced to report an error if it didn't manage to
do it.

~~~
jcranmer
Such annotations exist in all major C/C++/Fortran compilers, although not all
will error if it goes poorly. Most of the ones with an HPC focus do have some
output where they will tell you how they optimized your loop for you.
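Clang's variant of this, for example: the loop pragma requests vectorization, and the `-Wpass-failed` diagnostic (promotable with `-Werror=pass-failed`) fires when the request can't be honored. A sketch:

```cpp
// Compile with: clang++ -O2 -Werror=pass-failed example.cpp
// If the optimizer cannot vectorize the annotated loop, the build fails
// instead of silently emitting scalar code.
void scale(float* a, const float* b, int n) {
    #pragma clang loop vectorize(enable)
    for (int i = 0; i < n; ++i)
        a[i] = b[i] * 2.0f;
}
```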

~~~
tom_
> although not all will error if it goes poorly

So... these annotations don't actually exist? ;)

The whole point of such a feature is that the compiler will fail if it can't
do what you want.

~~~
marklacey
And then what do you do when you upgrade compilers and the build starts
failing?

This isn’t a hypothetical; it happens in real life. Your only option at that
point is to roll back compilers and hope someone cares about the regression
enough to fix it.

The point is the better model is to build the semantics into the language
rather than relying on the whims of implementation.

Of course the semantic guarantees will likely be somewhat weak because of
differences in ISAs, memory hierarchies, etc.

~~~
Fronzie
As a fallback, I have unit tests that check both correctness and the
instruction count (and the number of memory allocations in hot loops). Whereas
the CPU cycle count varies between runs, the instruction count does not.

Linux has good support for performance counters, Windows requires a bit of
work.
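The counting half of such a test can be sketched on Linux with `perf_event_open` (the helper names are mine; the counter may be unavailable in containers or unprivileged environments, in which case this sketch degrades to returning 0):

```cpp
#include <cstdint>
#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

// Count retired userspace instructions for one call to fn.
std::uint64_t count_instructions(void (*fn)()) {
    perf_event_attr attr{};                 // value-init zeroes the struct
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_INSTRUCTIONS;
    attr.disabled = 1;                      // start stopped; enable explicitly
    attr.exclude_kernel = 1;
    attr.exclude_hv = 1;

    // Measure the calling thread (pid=0), any CPU (cpu=-1).
    int fd = static_cast<int>(
        syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0));
    if (fd < 0) return 0;                   // counters unavailable

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    fn();
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    std::uint64_t count = 0;
    if (read(fd, &count, sizeof(count)) != sizeof(count)) count = 0;
    close(fd);
    return count;
}
```

A regression test then asserts the hot function's count stays below a recorded baseline.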

So in this case, the compiler update would show a regression in the test
suite, which needs to be addressed as part of the compiler upgrade.

Of course, having this in the compiler would remove the need for unit tests.

------
rsp1984
This has been done by Intel: [https://ispc.github.io](https://ispc.github.io)

~~~
tom_
More about ispc, from Matt Pharr:
[https://pharr.org/matt/blog/2018/04/30/ispc-
all.html](https://pharr.org/matt/blog/2018/04/30/ispc-all.html) \- includes
some discussion of Intel's corporate culture. Interesting throughout.

~~~
joe_the_user
Great quote here:

 _" The problem with an auto-vectorizer is that as long as vectorization can
fail (and it will), then if you’re a programmer who actually cares about what
code the compiler generates for your program, you must come to deeply
understand the auto-vectorizer. Then, when it fails to vectorize code you want
to be vectorized, you can either poke it in the right ways or change your
program in the right ways so that it works for you again. This is a horrible
way to program; it’s all alchemy and guesswork and you need to become deeply
specialized about the nuances of a single compiler’s implementation—something
you wouldn’t otherwise need to care about one bit."_

The thing about this kind of work is that it's a nightmare to do, but the people
who can do it wind up seeming like wizards and alchemists, and so they won't
necessarily say "this is a nightmare, never do this".

------
tom_mellior
So... skimming this post and its successors, I didn't see any actual examples
of generated vector code, especially not examples that GCC can't do although
they are supposedly "easy". And no benchmarks. Did I miss anything or did this
project really die before it got to vectorization (or anything more
interesting than constant folding)?

------
epistasis
Very interesting and useful to see.

And on an entirely different approach to vectorization for the masses: I do
wish it were easier to access vectorization through BLAS, a library that is
well supported across nearly all languages and gets massively optimized, but
is hard to install correctly.

~~~
chewxy
Good news is that the Gonum team has been working on an optimized pure Go
version of BLAS. It's at parity with netlib blas for some of the important
functions (GEMV, GEMV, etc).

Why is this good news? Go is a very easy-to-use language, and it supports many
compile targets, making it available across different platforms. To install,
one simply does `go get gonum.org/v1/gonum`

~~~
jedbrown
Netlib BLAS is a very low bar [1], and not at all how one should go about
writing a performance-portable BLAS. BLIS
([https://github.com/flame/blis/](https://github.com/flame/blis/)) is a much
better approach, and underlies vendor implementations on AMD
([https://developer.amd.com/amd-aocl/blas-library/](https://developer.amd.com/amd-aocl/blas-library/))
and many embedded systems.

[1] GEMV is entirely limited by memory bandwidth, thus quite uninteresting
from a vectorization standpoint. Maybe you meant GEMM?
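Back-of-envelope numbers behind that footnote (single precision, counting a multiply-add as two flops; the GEMM figure is the usual 2n³ flops over 3n² matrix elements, assuming an ideal cache):

```cpp
// Arithmetic intensity in flops per byte.
// GEMV: 2n^2 flops over one n*n matrix plus two n-vectors -> bounded by ~0.5.
constexpr double gemv_intensity(double n) {
    return (2.0 * n * n) / (4.0 * (n * n + 2.0 * n));
}
// GEMM: 2n^3 flops over three n*n matrices -> grows linearly with n.
constexpr double gemm_intensity(double n) {
    return (2.0 * n * n * n) / (4.0 * 3.0 * n * n);
}
```

So GEMV stays pinned near half a flop per byte regardless of size (memory-bound on any modern CPU), while GEMM's intensity grows with n, which is what makes it worth vectorizing and blocking aggressively.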

------
polskibus
Does anyone know if JVM, golang or .NET have autovectorization?

~~~
MrBuddyCasino
JIT compilers are in general not great at auto-vectorization because it is an
expensive optimization; they do not have as much time as, e.g., C++ ahead-of-time
compilers, so the JVM and the CLR only handle very simple cases. .NET has an
explicit API to guarantee vectorization; Java does not, but there is a JEP:
[https://openjdk.java.net/jeps/338](https://openjdk.java.net/jeps/338)

I don’t know about Golang, but given the compiler's speed it probably can’t
afford the more advanced techniques.

