
What Can We Learn from the Intel SPMD Program Compiler? - matt_d
https://software.intel.com/en-us/blogs/2018/12/17/what-can-we-learn-from-the-intel-spmd-program-compiler
======
dzdt
Interesting to see Intel boosting ISPC again. My impression had been that it
was a skunkworks project they never really supported or accepted.

See Matt Pharr's very interesting blog posts on the development and open-
sourcing of ISPC.

[1] [https://pharr.org/matt/blog/2018/04/30/ispc-all.html](https://pharr.org/matt/blog/2018/04/30/ispc-all.html)

~~~
xiii1408
I second that.

Matt Pharr's posts on ispc are an extremely interesting read, both from a
technical perspective (programming abstraction and hardware design) and a
human perspective (Intel politics, managing style, and career decisions).

~~~
scott_s
Thirded. One of the best take-aways from his series of posts is that a
programming model that a programmer can reason about is far better than
compiler optimizations that a programmer can only guess about.

------
dragontamer
ISPC vs OpenCL is a very interesting discussion. I think the discussion
warrants a bit more introduction.

The ideal is to deliver the illusion of true threads to the programmer, where
each "thread" is mapped to a SIMD lane. This matters because on x86 systems an
AVX2 SIMD unit (a YMM register) holds 8x 32-bit values (256 bits), and a ZMM
AVX512 register holds 16x 32-bit values. So you get a huge boost in
parallelism if you treat each SIMD lane as a separate thread.
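To make that concrete, here is a rough C++ sketch (my own illustration, not
from the article) of the same loop written scalar and with AVX2 intrinsics,
where one YMM register carries 8 lanes of 32-bit floats. An SPMD compiler like
ISPC generates something like the second form while letting you write
something like the first:

    #include <immintrin.h>
    
    // Scalar: one 32-bit float per iteration.
    void scale_scalar(const float* in, float* out, int n, float k) {
        for (int i = 0; i < n; ++i)
            out[i] = in[i] * k;
    }
    
    // AVX2: eight 32-bit lanes per iteration, one YMM register at a time.
    // (Illustration only; assumes n is a multiple of 8.)
    void scale_avx2(const float* in, float* out, int n, float k) {
        __m256 vk = _mm256_set1_ps(k);
        for (int i = 0; i < n; i += 8) {
            __m256 v = _mm256_loadu_ps(in + i);
            _mm256_storeu_ps(out + i, _mm256_mul_ps(v, vk));
        }
    }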

In effect: it is far cheaper to make SIMD-"threads" than to make "real
threads". That's how you get the Vega64 (4096 shaders) or the NVidia V100
(5120 CUDA cores). Even on Intel, you effectively get 8 SIMD lanes per actual
core with AVX2.

The question is how to handle this mapping so that things stay efficient and
reasonable for the programmer. OpenCL and ISPC have different strategies for
dealing with this abstraction and for presenting this illusion to the
programmer efficiently. The "Work Group" (OpenCL), "Wavefront" (AMD), and
"Warp" (CUDA) abstractions exist primarily to capture this SIMD-width
question.

But there's one additional thing on GPUs that doesn't exist on CPUs: LDS
memory (AMD) or Shared Memory (NVidia). Shared memory is effectively a
manually-managed L1 cache on GPUs.

GPUs have very weak cache hierarchies. In high-performance GPGPU compute, it
is preferred that the programmer manually manage transfers between global RAM
and the ~64kB shared memory region. And of course, only wavefronts / thread
groups are guaranteed to share a portion of shared memory.

Intel systems don't need this abstraction: Intel CPUs have a full and proper
cache which is unified with global RAM.

So I guess Intel is "taking a victory lap" with this blog post: noting that
all of the complexity of Wavefronts / etc. etc. isn't really necessary on ISPC
/ Intel systems.

---------

Still, Intel never really achieved the performance that an NVidia V100 or AMD
Vega64 can achieve. Even Intel's Xeon Phi has far lower raw performance than
what NVidia / AMD were able to put out.

It's certainly more difficult to write high-performance code on GPUs. But with
so much more raw power available on GPUs, many people have been able to
extract higher performance in practice.

~~~
marmaduke
> Intel CPUs have a full and proper cache which is unified with global RAM

> more difficult to write high-performance code on GPUs. But with so much more
> raw power on GPUs

These seem like two sides of the same coin; of course you can extract higher
performance with looser coupling.

I don’t think the Xeon Phi was given a fair chance; they did two iterations on
the idea before canning it. NVidia has been doing video cards forever.

~~~
bonzini
Xeon Phi's origins are in Larrabee. They sold two iterations, but they had
several more in house.

~~~
pjmlp
I sat in on an Intel session at GDCE 2009 introducing Larrabee programming and
how it would take over the GPU programming world for game developers.

Making it a niche product only available to a select few, instead of a
mainstream GPGPU, meant most of us hardly cared about it.

------
KMag
I worked on a project that compiled a declarative DSL to both vectorized CPU
code and GPU code.

For the vectorized CPU code, I found ISPC generally pleasant to use, but don't
be fooled by its similarities to C. It's not a C dialect, and I got burned by
assuming that implicit type conversion rules which are identical across C,
C++, and Java would also hold for ISPC. The code in question was a pretty
simple conversion of a pair of uniformly distributed uint64_ts to a pair of
normally distributed doubles (the Box-Muller transform). As I remember,
operations between a double and an int64_t resulted in the double being
truncated rather than the int64_t being converted. I wrote some C++ code and
ported it to Java more or less without modification, but was scratching my
head as to why the ISPC version was buggy. I remember the feeling of the hair
standing up on the back of my neck as it dawned on me that the implicit cast
rules might be different from those of C/C++/Java.
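A tiny C++ sketch of the rule I was (wrongly) assuming carried over; the
numbers are made up, but the C/C++/Java behavior is the standard one:

    #include <cstdint>
    #include <cstdio>
    
    int main() {
        int64_t i = 3;
        double d = 0.75;
        // C, C++ and Java convert i to double here, so r == 3.75.
        // As I recall, ISPC instead truncated the double operand,
        // which is what produced the bug described above.
        double r = i + d;
        std::printf("%f\n", r);
        return 0;
    }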

Seemingly arbitrary deviations from C/C++ behavior in a language that's so
syntactically close are a big footgun. Honestly, I think C made a mistake
here. An operation between an integer and a floating-point number should
result in the smallest floating-point type whose mantissa and exponent range
are both at least as large as those of the floating-point operand and which
can losslessly represent every value in the range of the integer operand's
type. If no such type is supported, then C should have forced the programmer
to choose what sort of loss is appropriate via explicit casts. Disallowing
implicit type conversion entirely is also reasonable. However, if your
language looks and feels so close to C, you really need good reasons to change
these sorts of details about implicit behavior.

Similarly, I think C should have made operations between signed and unsigned
integers of the same size result in the next larger signed integer type
(uint32_t + int32_t = int64_t), or simply disallowed operations between signed
and unsigned types without explicit casts.
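For concreteness, here's the kind of surprise the current rules allow (a
minimal C++ example of my own, not from the code I described above):

    #include <cstdint>
    #include <cstdio>
    
    int main() {
        uint32_t u = 1;
        int32_t s = -2;
        // Under C's usual arithmetic conversions, s is converted to uint32_t,
        // so the sum wraps to 4294967295 instead of the mathematically
        // expected -1 (which the int64_t rule above would preserve).
        uint32_t r = u + s;
        std::printf("%u\n", r);
        return 0;
    }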

~~~
Tuna-Fish
I very strongly think that Rust went the right route here:

    
    
        error[E0277]: cannot add a float to an integer
         --> src/main.rs:2:6
          |
        2 |     1+1.0;
          |      ^ no implementation for `{integer} + {float}`
    

Casts have historically been such a massive source of bugs and instability
that the correct casting rule for every numerical computation is always: "make
the programmer choose explicitly."

~~~
renox
I disagree, while C signed/unsigned type conversion is a source of bug and
should be explicit, it doesn't mean that all implicit conversion are bad..

I think that implicit conversion from int to float is mostly harmless: the
result is a float, and since the float-to-int conversion stays explicit, this
is unlikely to create bugs.

intX to intY, or uintX to uintY with Y >= X, are two other conversions that
are unlikely to create bugs but save a lot of boilerplate.
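A small C++ illustration of what I mean (my own toy example):

    #include <cstdint>
    #include <cstdio>
    
    int main() {
        int32_t n = 7;
        float half = n / 2.0f;        // implicit int -> float: 3.5f, no surprise
        int64_t wide = n;             // implicit widening int32 -> int64, lossless
        int32_t back = (int32_t)half; // narrowing float -> int stays explicit
        std::printf("%f %lld %d\n", half, (long long)wide, back);
        return 0;
    }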

------
petermcneeley
The primary advantage of ISPC over the GPU is latency. If you issue a small
amount of parallelizable work to the GPU, you are still likely waiting ~500
microseconds for the result.

~~~
berbec
I wonder if that is possible to optimize at compile or execution time. Is it
possible to determine whether the gain from sending the work to the GPU is
worth the latency hit?

~~~
Athas
It would be possible to generate both CPU and GPU versions at compile-time,
and pick the best one at run-time based on the data encountered. I do research
on a similar technique for the Futhark language.
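A toy C++ sketch of that dispatch idea (the names and threshold are made up,
and this is not how Futhark actually does it):

    #include <cstddef>
    
    // CPU path: plain loop (a real version might be ISPC- or intrinsics-based).
    static void saxpy_cpu(float a, const float* x, float* y, std::size_t n) {
        for (std::size_t i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];
    }
    
    // GPU path: placeholder standing in for an OpenCL/CUDA kernel launch.
    static void saxpy_gpu(float a, const float* x, float* y, std::size_t n) {
        // ... copy buffers, enqueue kernel, wait for completion ...
        saxpy_cpu(a, x, y, n);  // fallback so the sketch stays self-contained
    }
    
    // Run-time choice: below some empirically tuned size, GPU launch and
    // transfer latency dominates, so stay on the CPU.
    void saxpy(float a, const float* x, float* y, std::size_t n) {
        const std::size_t kGpuThreshold = 1u << 20;  // made-up cutoff
        if (n < kGpuThreshold)
            saxpy_cpu(a, x, y, n);
        else
            saxpy_gpu(a, x, y, n);
    }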

