
Towards fearless SIMD - fanf2
https://raphlinus.github.io/rust/simd/2018/10/19/fearless-simd.html
======
ndesaulniers
> Code written for too high a SIMD capability will generally crash (it’s
> undefined behavior), while code written for too low a SIMD capability will
> fall short of the performance available.

I can't help but think that SIMD intrinsics are the wrong level of
abstraction. They're literally assembly moved into a high-level language, but
without the ability for the compiler to optimize them well. Hand-written
intrinsics can really be fine-tuned for a given microarchitecture, but I think
XLA's approach of having a higher level of abstraction, then JIT compiling to
the device in question (whether it be SIMD of X width, or a GPU, or a TPU), is
the right way to go.

Whenever I see hand coded assembly, I can't help but think to myself:

* Did the author actually have instruction scheduling information on hand when writing this?

* For what generation of chips was this found to be optimal, and is it still? Invariably, you wind up with hand-coded assembly routines that were written a long time ago still in use, without anyone revisiting the code or its performance.

~~~
m_mueller
As someone coming from CUDA I think you're on the right path. CUDA unifies
multicore and vectorization parallelism. Crucial here is hardware support for
branching (including early returns), even if it degrades performance. This
allows a CUDA kernel to be programmed in a programmer-friendly way that is
often already fairly performant, while the actual mapping to hardware is
deferred to kernel invocation, where it's straightforward to specialize. I
think this concept is something that CPU programming could learn from GPUs.
Why not adopt the concept of kernels?
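
A minimal sketch of that idea on the CPU, in Rust rather than CUDA (all names here are hypothetical, not from any real library): the per-element logic, branches included, is written once as a scalar "kernel", and how it maps onto SIMD lanes is left to the compiler, much as a CUDA kernel defers its mapping to threads and warps until launch.

    // Hypothetical CPU analogue of a CUDA kernel: plain scalar per-element
    // logic, branching allowed.
    #[inline(always)]
    fn clamp_scale_kernel(x: f32) -> f32 {
        // A branch like this can compile down to a SIMD select/blend.
        if x < 0.0 { 0.0 } else { x * 2.0 }
    }

    // "Launching" the kernel over a buffer; the loop shape is friendly to
    // the autovectorizer.
    fn run_kernel(data: &mut [f32]) {
        for x in data.iter_mut() {
            *x = clamp_scale_kernel(*x);
        }
    }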

~~~
marmaduke
OpenCL on CPU is exactly this, and it does very well. In the last real-world
test I did, an 8-core Haswell and a GTX 970 were similar in performance for
the kernel I was running. Caveats apply, of course, but it was refreshing to
have a single programming model.

~~~
tpolzer
I did some evaluation of (Linux-based) OpenCL implementations two years ago
(but I don't think anything has changed significantly since then).

I had a few takeaways:

- OpenCL on GPUs and OpenCL on CPUs have little to do with each other (in
terms of performance characteristics), and if you tune well for one of them,
the other one will suffer.

- Vectorization of work items doesn't really work well unless your kernel is
so simple that normal compiler autovectorization of a loop would probably have
worked just as well, if not better.

- Nvidia intentionally makes OpenCL a second-class citizen vs. CUDA. I had
nearly identical (simple) kernels running on both platforms and only the CUDA
one managed to saturate memory throughput.

- The whole ecosystem is mostly more effort than it's worth. Portability
between different OpenCL implementations is a gamble; some will even silently
compute invalid results (I'm looking at you, Intel...). I had kernel hangs with
both Nvidia and AMD.

~~~
twtw
> if you tune well for one of them, the other one will suffer.

This is one of the things that bothers me the most about OpenCL. It attempts
to offer this uniform abstraction over a generic compute accelerator, which
can be CPU vector extensions, GPUs, or FPGAs, but these things are different
enough that you have to develop for a specific type if you want reasonable
performance. So you get none of the benefits of an accelerator-specific
abstraction while still writing accelerator-specific code.

There is a real cost to a generic abstraction, and distinct
languages/platforms would in my mind be better than different "dialects" of
the same language/platform that pretend to be compatible but really aren't.

I like that CUDA is very clearly designed to run only on GPUs - it provides a
clarity that OpenCL lacks.

------
gameswithgo
Another Rust project with similar goals, different approach:

[https://github.com/jackmott/simdeez](https://github.com/jackmott/simdeez)

full disclosure, that's mine.

~~~
Bjartr
Your library is actually mentioned in the linked article

> Another crate, with considerable overlap in goals, is simdeez. This crate is
> designed to facilitate runtime detection, and writing the actual logic
> without duplication, but leaves the actual writing of architecture specific
> shims to the user, and still requires nontrivial unsafe code.

~~~
gameswithgo
Cool, he updated it; it wasn't mentioned there originally. We have chatted a
bit and exchanged ideas.

------
stochastic_monk
I’ve found template metaprogramming to be an efficient, generic way to
generate SIMD-accelerated code. (See [0].) By associating the functions for
each operation with the vector type, the same interface can support 128-,
256-, or 512-bit SIMD, depending on hardware.

It’s C++, so while it’s outside of the Rust ecosystem, it’s still a workable
solution I’ve used in a half dozen projects.

[0] [https://github.com/dnbaker/vec](https://github.com/dnbaker/vec)
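
For the Rust side of the comparison, a rough sketch of the same idea (all names here are hypothetical): a trait plays the role of the C++ template parameter, so the algorithm is written once against an abstract vector type and monomorphized per width at compile time.

    // Hypothetical trait standing in for the C++ template parameter.
    trait SimdVec: Copy {
        const LANES: usize;
        fn splat(x: f32) -> Self;
        fn add(self, other: Self) -> Self;
        fn mul(self, other: Self) -> Self;
    }

    // One-lane scalar fallback; 128/256/512-bit impls would wrap the
    // corresponding intrinsics behind the same interface.
    impl SimdVec for f32 {
        const LANES: usize = 1;
        fn splat(x: f32) -> Self { x }
        fn add(self, other: Self) -> Self { self + other }
        fn mul(self, other: Self) -> Self { self * other }
    }

    // Written once; monomorphized for whichever width the caller picks.
    fn axpy<V: SimdVec>(a: f32, x: V, y: V) -> V {
        V::splat(a).mul(x).add(y)
    }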

~~~
grandinj
That doesn't give you run-time dispatch based on available hardware
capabilities. So you either need virtual functions (and take the associated
perf hit), or some other solution like the linker tricks mentioned, or cached
JIT, or....

~~~
stochastic_monk
It’s not runtime, it’s compile-time. That means it needs to be separately
compiled for each hardware backend.

~~~
grandinj
How do you combine the separately compiled libraries into a single deliverable
library/executable?

Because that is what the original article is aiming for.

~~~
stochastic_monk
It doesn’t do the runtime dispatch this person wants, but it eliminates the
worry of running into an illegal instruction error.

I think there are use cases for a fat binary, but I’m less certain as to its
necessity.

------
wyldfire
The simplest target-independent fearless SIMD is autovectorization[1]. But
taking maximum advantage of that probably means writing some code that feels a
little unnatural. Also, IIRC bounds checks thwart some autovectorization.

[1]
[https://llvm.org/docs/Vectorizers.html](https://llvm.org/docs/Vectorizers.html)
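
A hedged illustration of the bounds-check point, in Rust: the indexed form carries a bounds check (and potential panic path) per element, which can keep the loop from vectorizing, while the zipped form leaves no per-element checks in the body.

    // Indexed form: ys[i] is bounds-checked every iteration.
    fn scale_indexed(xs: &mut [f32], ys: &[f32]) {
        for i in 0..xs.len() {
            xs[i] *= ys[i];
        }
    }

    // Zipped form: the iterator encodes the shared length, so the loop
    // body has no checks and LLVM vectorizes it reliably.
    fn scale_zipped(xs: &mut [f32], ys: &[f32]) {
        for (x, &y) in xs.iter_mut().zip(ys) {
            *x *= y;
        }
    }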

~~~
gameswithgo
In an LLVM context it also means you don't get runtime feature detection. You
would need to build N dlls and then write code to load the proper one at
runtime based on feature detection.

JITs could solve that problem, but few JITs currently do very much auto
vectorization, because they don't have time.

~~~
blattimwind
> In an LLVM context it also means you don't get runtime feature detection.
> You would need to build N dlls and then write code to load the proper one at
> runtime based on feature detection.

Does LLVM not support GCC-style function multiversioning?

~~~
Rebelgecko
Does GCC use that when auto-vectorizing? It's been a while, but in the past
when I built binaries via gcc with autogenerated SSE/AVX I don't think they
fell back to the C code for CPUs that lacked those SIMD instructions. They
just crashed.

~~~
davrosthedalek
You have to tell gcc it should also build a generic version. Then, it should
work.
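
As far as I know rustc has no built-in equivalent of gcc's multiversioning attributes, but the same effect can be hand-rolled in one binary: compile the body once with a feature enabled and once generically, then dispatch on a runtime check. A hedged sketch (function names hypothetical):

    #[inline(always)]
    fn add_arrays(xs: &[f32], ys: &mut [f32]) {
        for (y, &x) in ys.iter_mut().zip(xs) {
            *y += x;
        }
    }

    // Same body, recompiled with AVX2 enabled so the autovectorizer can
    // use 256-bit instructions in this copy.
    #[cfg(target_arch = "x86_64")]
    #[target_feature(enable = "avx2")]
    unsafe fn add_arrays_avx2(xs: &[f32], ys: &mut [f32]) {
        add_arrays(xs, ys)
    }

    pub fn add_arrays_dispatch(xs: &[f32], ys: &mut [f32]) {
        #[cfg(target_arch = "x86_64")]
        {
            if is_x86_feature_detected!("avx2") {
                // Safe: the CPU was checked for AVX2 just above.
                return unsafe { add_arrays_avx2(xs, ys) };
            }
        }
        add_arrays(xs, ys) // generic fallback, like gcc's default clone
    }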

------
davidhyde
> Note that this is a performance of approximately 470 picoseconds per sample.
> Modern computers are fast when running optimized code.

Or about half the time it takes a photon from your phone screen to hit your
eyeball.

------
melonakos
We’ve been working on this problem for over a decade and have some support for
Rust. Let’s work together if that makes sense to you! I’m John at ArrayFire.

------
skemper911
In Rust this looks rather painful. Might I suggest trying this in Julia? It
might make for a good comparison of performance, ease of use, and readability.
Julia does a very nice job of compiling directly to SIMD instructions and lets
you inspect the low-level code generated.

    @inline function sin9_shaper(x)
        c0 = 6.28308759
        c1 = -41.33318707
        c2 = 81.39900205
        c3 = -74.66884436
        c4 = 33.15324345
        a = abs(x - round(x)) - 0.25
        a2 = a * a
        ((((a2 * c4 + c3) * a2 + c2) * a2 + c1) * a2 + c0) * a
    end

    function gen_sinwave(freq, init=0.0, step=0.1)
        wave = [sin9_shaper(x) for x = init:step:freq]
    end

    julia> @code_native gen_sinwave(1113.0)
    ; (native assembly listing omitted)

~~~
coder543
Your formatting didn't work, but no, it doesn't "look painful."

What you wrote does not guarantee vectorization, it just relies on
autovectorization.

Rust already does autovectorization magically behind the scenes thanks to LLVM
(which Julia also uses), but explicit SIMD makes it a guarantee.

~~~
byt143
Here is how to do explicit SIMD in Julia:
[https://github.com/eschnett/SIMD.jl](https://github.com/eschnett/SIMD.jl)

------
amluto
Why is SIMD considered unsafe? I thought that safe code was permitted to cause
panics, and the worst thing that will happen if unsupported SIMD is used is a
panic.

~~~
burntsushi
No, it's not about panicking. It's undefined behavior to run code compiled
with CPU features that aren't supported by the current CPU. See:
[https://github.com/rust-lang/rfcs/blob/master/text/2045-target-feature.md#make-target_feature-safe](https://github.com/rust-lang/rfcs/blob/master/text/2045-target-feature.md#make-target_feature-safe)

There are some other ideas for making it easier to reason about safety at this
level: [https://github.com/rust-lang/rfcs/pull/2212](https://github.com/rust-lang/rfcs/pull/2212)

Can you point to where you heard about unsupported SIMD causing a panic? I'd
like to fix that because it's really wrong!

~~~
amluto
I mean that it really shouldn’t be UB to call a function compiled for an
unsupported target feature. Following that link, I see two arguments that it’s
UB:

1. A multibyte NOP might be used. Supposedly there might be a multibyte NOP
that older CPUs will decode as a jump. I am not sure I believe this. Is there
an example?

2. int3 might happen, causing SIGTRAP. I see no explanation of how this would
occur.

So I think that, if LLVM really has UB if the wrong target is used, it should
be fixed. Arguably the old Knights Landing instructions are an exception, but
those are basically dead. Maybe non-x86 targets are different.

Also, I have a suggestion for a potentially much nicer way to deal with
safety: use the type system instead of magic annotations. Have a function like
GetAVX2() -> Option<AVX2>. Teach Rust that code that statically has a live
AVX2 object (as a parameter, say) can use AVX2. Other than the code
generation, this could be done in stable Rust right now.
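
A hedged sketch of that suggestion in today's stable Rust (all names hypothetical): a zero-sized token whose only constructor is a successful runtime check, so any function demanding the token is statically unreachable without detection.

    // Zero-sized evidence that AVX2 was detected at runtime.
    #[derive(Clone, Copy)]
    pub struct Avx2 {
        _private: (), // prevents construction outside this module
    }

    #[cfg(target_arch = "x86_64")]
    pub fn get_avx2() -> Option<Avx2> {
        if is_x86_feature_detected!("avx2") {
            Some(Avx2 { _private: () })
        } else {
            None
        }
    }

    // Taking the token by value makes the detection a static precondition.
    pub fn sum_with_avx2(_proof: Avx2, xs: &[f32]) -> f32 {
        // ...AVX2 intrinsics would go here, behind #[target_feature] shims...
        xs.iter().sum()
    }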

~~~
raphlinus
_This_ is why I linked to my undefined behavior post. Your comment is about as
clear an example of being in the semi-portable camp as any I've seen. And I'm
not blaming you, because in C you _can't_ do SIMD in the standard camp, so
it's actually one of the more compelling reasons to remain in semi-portable.
Rust is different though.

Also: the linked crate does use the type system in pretty much this way so
that client code can be safe. However, there are limitations; it's not
just whether a particular instruction can be used, which remains immutable
once it's detected, but also which _registers_ (and, by extension, calling
convention) can be used. That varies from function to function, and requires
the `#[target_feature(enable)]` annotation to control, so just having an `Avx`
type in hand is not quite enough to ensure that you're in a context where
using the ymm registers is ok, and the intrinsic will be inlined to a "V"
variant asm instruction. This is discussed in some detail in the "caveats"
section.
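
Concretely (a hedged illustration, not the crate's actual code): the intrinsic has to be called from a function compiled with the feature enabled for it to inline as the VEX-encoded form; a token argument alone doesn't change how the body is code-generated.

    #[cfg(target_arch = "x86_64")]
    use std::arch::x86_64::{__m256, _mm256_add_ps};

    // Compiled with AVX enabled, so the intrinsic inlines to a single
    // VEX-encoded vaddps operating on ymm registers.
    #[cfg(target_arch = "x86_64")]
    #[target_feature(enable = "avx")]
    unsafe fn add_f32x8(a: __m256, b: __m256) -> __m256 {
        _mm256_add_ps(a, b)
    }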

------
neogodless
What is SIMD?

~~~
lossolo
Single Instruction Multiple Data

[https://en.wikipedia.org/wiki/SIMD](https://en.wikipedia.org/wiki/SIMD)

ELI5: Your processor can process more data in parallel.

If you have, for example, a loop with something like this in the body:

c[i] = a[i] + b[i]

Using SIMD, the processor can do all of these operations at the same time:

c[i] = a[i] + b[i]

c[i+1] = a[i+1] + b[i+1]

c[i+2] = a[i+2] + b[i+2]

c[i+3] = a[i+3] + b[i+3]

Normally you would only get one of these additions done per iteration; thanks
to SIMD you can have multiple.

~~~
neogodless
Thanks! I get that if you don't already know the acronym, you're not the
audience, but it is helpful to not have to search externally for a reference.
Could you imagine if that was in a general programming magazine before global
internet search engines were popular? You'd have to cross your fingers that
your local encyclopedia has an entry!

ETA: My bad - I did not think to click that. Embarrassing. Thanks for pointing
it out!

~~~
tempay
In fairness the first word of the article is SIMD with a hyperlink to the
Wikipedia page.

------
user_1234
> The next-level challenge is compiling multiple alternates into the same
> binary, and selecting the best at runtime.

What are the use cases for having multiple alternates in the same binary? Why
not decide them during the compile time for the given architecture?

If portability is the concern here, wouldn't it lead to sub-optimal code
anyway?

~~~
rcxdude
If you're shipping binaries, you don't know the exact architecture in advance
(because there are many extensions to x86 and you don't know if the end user
is running a new enough processor to use all of them). If you don't use them
you are likely leaving performance on the table. So you want to select the
fastest option supported by the processor you happen to be running on. You can
do this with fairly minimal performance impact by linking in different versions
of the function at runtime, but this requires some support from your compiler
and runtime environment.

~~~
user_1234
I think the question here is which one is better:

1. A portable binary where only individual SIMD operations are optimized for
all targets.

2. Building an optimized binary for every target architecture when needed
(either by the user or by the binary distributor).

The concern with (1) is that as the number of dynamically called functions (or
if-else nests deciding between them) increases, the quality of the generated
code drops for _any_ architecture. Basically, the compiler is left with
opaque, unrecognizable functions, which restricts even the target-independent
optimizations (like GVN, CSE, constant propagation, etc.).

Say the user writes a SIMD program full of dynamically called functions (which
are opaque to the compiler); doesn't that impact performance heavily?

Isn't compiler support for optimizing the SIMD operations necessary, rather
than writing wrapper libraries? For example, lowering SIMD operation calls to
existing vectorized math libraries that compilers recognize (e.g. sin(),
cos(), pow() in libm).

------
ammaskartik
Nice write up!

