
Intel SPMD Program Compiler: A Compiler for High-Performance SIMD Programming - kick
https://ispc.github.io/
======
rrss
Matt Pharr wrote a series of posts telling the story of ispc:
[https://pharr.org/matt/blog/2018/04/18/ispc-origins.html](https://pharr.org/matt/blog/2018/04/18/ispc-origins.html)

I found them extremely interesting - highly recommended.

~~~
joe_the_user
That is an interesting read - compiler writers got hung up on creating auto-
vectorization, where CUDA is essentially manual vectorization. And that's the
thing: once you find ways that writing a massively vectorized program on a GPU
makes sense, why would you write a program where you had to _hope_ your
program gets vectorized?

That said, as I understand things, vectorization can fail with CUDA if you
allocate more kernels than exist on the chip, in which case the chip may run
the kernels serially, producing surprising results.

~~~
rrss
That isn't really a failure case in CUDA (or OpenCL). It's very common to
launch more blocks/workgroups than can be resident simultaneously on the GPU.

~~~
joe_the_user
Neither an autovectorization failure nor CUDA executing kernels serially is a
"fail-fail"; both are fall-back behaviors that accomplish a given task a bit
more slowly than the explicit instructions imply. That said, executing kernels
serially can supposedly create problems if a programmer writes logic that
assumes kernels are always moving in lock-step.

------
corysama
I only played with ISPC a little bit. What I found is that it is really great
if you need to write a large volume of SIMD code and that code sticks to one
lane size - like 4 32-bit floats or ints. But if you want to switch mid-stream
to 8 shorts or 16 bytes, you're gonna have a hard time. Or, if you just need a
few instructions, it's easier to just use intrinsics.

~~~
BubRoss
I would have to see an example of what you mean, but it should be completely
possible, though it might require converting without using vectorization.

Switching lane size doesn't make much sense to me because ideally you would
want lanes that are as wide as possible and mostly be agnostic to their size.

~~~
corysama
I had some code that tried to stay 16x8-bit, but would occasionally use
_mm_unpacklo_epi8/_mm_unpackhi_epi8 to widen to two 8x16-bit vectors to keep
precise intermediate results during some fixed-point math.

Writing it out like that, it sounds like it should have been easy. Don't
remember what I ran into. Maybe didn't bang on it long enough.

~~~
BubRoss
The original AVX instructions didn't have all the integer operations that the
most modern chips have. It might have been Haswell that added small-integer
operations at the 256-bit vector width.

~~~
stephencanon
Right. AVX (the original extension) only added 256b floating-point and non-
destructive 128b integer. The 256b integer SIMD ops are all in AVX2 or later.

------
gnufx
This would benefit from a comparison with current OpenMP/OpenACC (which
support offloading to attached processors in a standard way for C and
Fortran, at least). Also, comparing with gcc 4.2 in the performance examples
doesn't seem very useful; it didn't support AVX, regardless of auto-
vectorization. (That's not meant to dismiss ISPC.)

------
yarg
It's open source, so it should be fine?

But Intel's history when it comes to compilers and applied optimisations
leaves this making me immediately uncomfortable.

The sort of PR work that these guys would need to do in order for me to
consider them even remotely trustworthy is beyond even their budget.

~~~
wahern
You don't have to guess. The process of upstreaming and the [then] current
state of ARM support is described here:
[https://pharr.org/matt/blog/2018/04/29/ispc-retrospective.html](https://pharr.org/matt/blog/2018/04/29/ispc-retrospective.html)

Not sure what conclusions to draw from that, but it looks like ARM support was
finally made first class this past August:
[https://github.com/ispc/ispc/blob/cf90189/docs/ReleaseNotes.txt](https://github.com/ispc/ispc/blob/cf90189/docs/ReleaseNotes.txt)

I think it might be difficult to purposefully cripple AMD in an open source
project.

~~~
loeg
> I think it might be difficult to purposefully cripple AMD in an open source
> project.

It's not as explicit as it has been in the past, but the CPUID checks target
very specific feature sets aligned with particular Intel models; those may
not match AMD models, producing worse code on AMD parts that support feature
sets above baseline AVX2:

[https://github.com/ispc/ispc/blob/master/check_isa.cpp#L106-L140](https://github.com/ispc/ispc/blob/master/check_isa.cpp#L106-L140)

That said, I don't assume malice here and I haven't investigated thoroughly.
Most likely they just want to support their own silicon well and that's what
they know. It's possible they would accept similar support for AMD µarchs in
the OSS project (or maybe not).

I wouldn't draw too much inference from the ARM example, as I don't see ARM as
an Intel competitor. AMD, on the other hand, is currently very competitive
with Intel.

~~~
mcbain
I’ve spent time writing CPU detection code for previous projects, and there is
nothing that jumps out at me as biased in the linked ISA check. In fact that
is really the bare minimum required to split the AVX variants, and will detect
AMD support just the same as Intel.

You can compare it to other detection functions - one relatively easy-to-read,
non-vendor-biased example that does dig into all the extensions is this Go
implementation (not mine):
[https://github.com/klauspost/cpuid/blob/master/cpuid.go](https://github.com/klauspost/cpuid/blob/master/cpuid.go)

~~~
loeg
Right, it looks pretty reasonable to me too. Zen 2 still doesn't have
AVX-512, so the super-parallel paths this really aims to help aren't
applicable anyway.

Zen 1-2 should land on the "AVX 2 (Haswell)" path in the linked excerpt --
they have AVX/AVX2, F16C, OSXSAVE, and RDRAND -- which is the best ISA
without AVX-512 implemented in the compiler. That's entirely reasonable on
Intel's part.

(I don't know why they look for RDRAND in a compiler, but whatever.)

~~~
moonbug
Because it has an rdrand() function.

