
SIMD Made Easy with Intel Implicit SPMD Program Compiler - truth_seeker
https://software.intel.com/en-us/articles/simd-made-easy-with-intel-ispc
======
wmu
The hand-coded AVX2 procedure is far from optimal: it wastes time on a
horizontal addition in every iteration.

The First Rule of SIMD-ization says: keep all the intermediate results in
vector(s), do horizontal reduction at the end.
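
A minimal sketch of that rule, not the article's code: it assumes an x86 / x86-64 target and uses 128-bit SSE2 intrinsics so it builds without extra compiler flags. The function name `sum_i32` is made up for illustration. The same shape applies to the 256-bit AVX2 equivalents.

```cpp
#include <cstddef>
#include <cstdint>
#include <emmintrin.h> // SSE2 intrinsics; assumes an x86 / x86-64 target

// Sums n 32-bit integers. Partial sums stay in a vector accumulator for the
// whole loop; the horizontal reduction happens once, after the loop.
inline std::int32_t sum_i32(const std::int32_t* data, std::size_t n)
{
    __m128i acc = _mm_setzero_si128();
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        const __m128i v =
            _mm_loadu_si128(reinterpret_cast<const __m128i*>(data + i));
        acc = _mm_add_epi32(acc, v); // vertical add: one partial sum per lane
    }

    // Single horizontal reduction at the end.
    alignas(16) std::int32_t lanes[4];
    _mm_store_si128(reinterpret_cast<__m128i*>(lanes), acc);
    std::int32_t total = lanes[0] + lanes[1] + lanes[2] + lanes[3];

    // Scalar tail for the elements that did not fill a whole vector.
    for (; i < n; ++i)
        total += data[i];
    return total;
}
```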

Converting the comparison result into a vector of integers can be done a bit
more simply: just one bitwise AND is needed, followed by a cast to __m256i
(the cast doesn't emit any code, as SIMD registers are untyped).
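
A hedged sketch of what that could look like, shown on 128-bit SSE2 registers so it compiles on any x86-64 toolchain without extra flags. The helper name `less_than_as_01` is hypothetical. The 256-bit version would presumably use `_mm256_cmp_ps`, `_mm256_and_ps` and `_mm256_castps_si256` in the same pattern.

```cpp
#include <cstdint>
#include <emmintrin.h> // SSE2 intrinsics; assumes an x86 / x86-64 target

// Returns a vector holding integer 1 in every lane where a < b, 0 elsewhere.
inline __m128i less_than_as_01(__m128 a, __m128 b)
{
    // The comparison yields all-ones or all-zeros bits in each lane.
    const __m128 mask = _mm_cmplt_ps(a, b);
    // Bit pattern of integer 1 in every lane, viewed as a float vector.
    const __m128 one = _mm_castsi128_ps(_mm_set1_epi32(1));
    // One bitwise AND; the casts only change the C++ type and emit no code.
    return _mm_castps_si128(_mm_and_ps(mask, one));
}
```

The result can be added straight into an integer accumulator vector to count matching elements.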

------
truth_seeker
If someone is seeking more insight into it, follow this link:

[https://www.slideshare.net/IntelSoftware/simple-single-instr...](https://www.slideshare.net/IntelSoftware/simple-single-instruction-multiple-data-simd-with-the-intel-implicit-spmd-program-compiler-intel-ispc)

------
Const-me
> if you want to target multiple ISAs, you need to write multiple algorithms

In my experience, these algorithms are similar to each other. More often than
not, they don't require much extra time: a few macros here and there, a few
templates, a couple of versions of a small low-level function, etc.
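
One way this might be structured, as an illustration only: the algorithm is written once as a template, and each ISA supplies a small struct of low-level wrappers. The names `ScalarOps`, `Sse2Ops` and `sum_with` are made up for this sketch, and an x86 / x86-64 target is assumed. An AVX2 variant would be one more struct of the same shape.

```cpp
#include <cstddef>
#include <cstdint>
#include <emmintrin.h> // SSE2 intrinsics; assumes an x86 / x86-64 target

// Hypothetical per-ISA wrappers: only these small functions differ by target.
struct ScalarOps {
    using Vec = std::int32_t;
    static constexpr std::size_t lanes = 1;
    static Vec zero() { return 0; }
    static Vec load(const std::int32_t* p) { return *p; }
    static Vec add(Vec a, Vec b) { return a + b; }
    static std::int32_t reduce(Vec v) { return v; }
};

struct Sse2Ops {
    using Vec = __m128i;
    static constexpr std::size_t lanes = 4;
    static Vec zero() { return _mm_setzero_si128(); }
    static Vec load(const std::int32_t* p) {
        return _mm_loadu_si128(reinterpret_cast<const __m128i*>(p));
    }
    static Vec add(Vec a, Vec b) { return _mm_add_epi32(a, b); }
    static std::int32_t reduce(Vec v) {
        v = _mm_add_epi32(v, _mm_srli_si128(v, 8)); // fold upper half onto lower
        v = _mm_add_epi32(v, _mm_srli_si128(v, 4)); // fold lane 1 onto lane 0
        return _mm_cvtsi128_si32(v);
    }
};

// The algorithm itself is written once and instantiated per ISA.
template <class Ops>
std::int32_t sum_with(const std::int32_t* data, std::size_t n)
{
    typename Ops::Vec acc = Ops::zero();
    std::size_t i = 0;
    for (; i + Ops::lanes <= n; i += Ops::lanes)
        acc = Ops::add(acc, Ops::load(data + i));
    std::int32_t total = Ops::reduce(acc);
    for (; i < n; ++i) // scalar tail
        total += data[i];
    return total;
}
```

Calling `sum_with<Sse2Ops>(data, n)` or `sum_with<ScalarOps>(data, n)` selects the target at compile time.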

> _mm256_hadd_epi32

That instruction is slow; e.g. on Ryzen it has a latency of 7 cycles.
_mm256_slli_si256 and bitwise ops have a latency of 1 and can often do the
same job faster.
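
As a rough illustration of the idea, here is a horizontal sum built only from byte shifts and vertical adds, with no horizontal-add instruction. It is a 128-bit SSE2 sketch assuming an x86 / x86-64 target; `hsum_i32` is a hypothetical name. Note that the 256-bit byte shifts such as `_mm256_slli_si256` operate within each 128-bit half, so an AVX2 version needs one extra step to combine the two halves.

```cpp
#include <cstdint>
#include <emmintrin.h> // SSE2 intrinsics; assumes an x86 / x86-64 target

// Adds the four 32-bit lanes of v using shifts and vertical adds only.
inline std::int32_t hsum_i32(__m128i v)
{
    // [a, b, c, d] + [c, d, 0, 0] -> lane 0 = a + c, lane 1 = b + d
    v = _mm_add_epi32(v, _mm_srli_si128(v, 8));
    // lane 0 += lane 1 -> lane 0 = a + b + c + d
    v = _mm_add_epi32(v, _mm_srli_si128(v, 4));
    return _mm_cvtsi128_si32(v); // extract lane 0
}
```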

> readability is reduced when compared to the original scalar implementation

Solvable with a library, for example:
[https://github.com/Const-me/IntelIntrinsics](https://github.com/Const-me/IntelIntrinsics)

------
KenanSulayman
Very interesting and I can't wait to try it out.

It's a pity though that, based on what I understood from the website, it only
produces binaries that can be linked against, rather than actually generating
C / C++ code. Generating source would be great for LTO, and would also allow
better inspection of the generated SIMD code prior to compilation, as well as
ensuring that all code is compiled by the same compiler. I guess the best way
to inspect the artefacts prior to assembly is via LLVM IR.

I'm pretty happy that Intel chose to implement this based on LLVM. I'd have
expected this to be sitting on top of icc.

~~~
BubRoss
ISPC can actually produce C++ that is full of intrinsics.

Also, you can already try it out on godbolt.org - ISPC has been around for a
number of years now and works very well.

~~~
KenanSulayman
Thanks!

For anyone interested -- after spending multiple hours playing with it I found
that it can emit C++ via the following flags:

`--emit-c++ --target=generic-$WIDTH`

where $WIDTH is

- 4 (sse4),
- 16 (16 * 32 = 512 bit aligned),
- 32 (32 * 32 = 1024 bit aligned),
- or 64 (64 * 32 = 2048 bit aligned)

Example with generic-4 on godbolt CE for the ISPC mandelbrot example program:
[https://ispc.godbolt.org/z/0x47SP](https://ispc.godbolt.org/z/0x47SP).

