
Intel SPMD: Compiler for High-Performance SIMD Programming - vmorgulis
http://ispc.github.io/
======
scott_s
They link to a conference paper for the full technical details: "ispc: A SPMD
Compiler for High-Performance CPU Programming",
[https://cloud.github.com/downloads/ispc/ispc/ispc_inpar_2012...](https://cloud.github.com/downloads/ispc/ispc/ispc_inpar_2012.pdf)

~~~
wyldfire
> GPU-oriented languages like OpenCL support SIMD but lack capabilities needed
> to achieve maximum efficiency on CPUs and suffer from GPU-driven constraints
> that impair ease of use on CPUs.

IMO these constraints are overstated and OpenCL offers a good abstraction for
CPUs too.

~~~
Ono-Sendai
Maybe in theory, but in practice we see poor performance using OpenCL to
target CPU devices (much worse than our reference C++ code).

~~~
keldaris
This used to be a fairly universally held opinion, but recently I've begun to
see fairly optimistic results reported in the literature (see [1] for an
example study, unfortunately not open source). As someone who writes numerical
simulation code for a living, I've started to get curious about OpenCL for the
use case of rapidly developing code that's reasonably performance portable
between Xeon-based and GPU workstations.

Since you seem like you have some experience with testing performance
portability with OpenCL and other solutions, I'd be curious to hear if you
have any comments about the reference I linked or more general suggestions for
alternative means to achieving the same end (performance portability between
CPU/GPU architectures at the workstation level).

[1] [http://iwocl.org/wp-content/uploads/iwocl-2014-tech-presenta...](http://iwocl.org/wp-content/uploads/iwocl-2014-tech-presentation-Simon-Mcintosh-Smith.pdf)

~~~
Ono-Sendai
Hi, I would say it depends on the kind of code you are working with. Our code
is quite branchy and somewhat complicated (3D rendering software). For simpler
code that does, e.g., lots of the same thing on a regular grid, OpenCL might
work better than it does on our code (targeting CPUs).

------
eggy
I wouldn't know how to utilize this yet, but say I write a C program that
uses SIMD similar to the way you can in the Dart programming language with
its Float32x4 and Int32x4 types [1]. Could that same program be expanded to
use the 4 cores with SPMD, doing 4 (SIMD) lanes x 4 (SPMD) parallel tasks on
my quad-core 2013 i7-4700MQ, by using the Intel compiler and specially-written
C code, for a maximum 1600% speedup of 4x4 matrix operations? I am guessing it
would be more like 800% to 1200% in reality if lucky, but still promising.

    
    
[1] https://www.dartlang.org/articles/simd/
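
The arithmetic in the question is just lanes times cores; spelled out,
treating both factors as ideal upper bounds (real code rarely hits either):

```python
# Theoretical upper bound for combining 4-wide SIMD with 4-core SPMD.
# Both factors are best cases; memory bandwidth and sync overhead eat into them.
simd_lanes = 4   # e.g. Float32x4-style 4-wide vectors
cores = 4        # quad-core i7-4700MQ
max_speedup = simd_lanes * cores
print(f"max speedup: {max_speedup}x ({max_speedup * 100}%)")  # 16x (1600%)
```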

------
joe_the_user
So this compiler is targeting the SIMD units of CPUs rather than GPUs. Can
anyone contrast what the performance of this would be relative to CUDA or
OpenCL for various applications, for example neural nets?

~~~
alkonaut
For large data and trivial algorithms, such as multiplying matrices (which
means any problem you can express as a set of operations on large matrices),
GPUs do really well, so it's hard to compete with something that has 1000
cores (Edit: "compute units", not "cores"). Neural nets are essentially
matrix multiplication.
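
For instance, the kernel at the heart of a dense neural-net layer is just
this regular, branch-free loop nest (a minimal pure-Python sketch), which is
exactly the shape GPUs excel at:

```python
# Dense matrix multiply: every output element runs the same instructions,
# with no data-dependent branching -- ideal for wide SIMD/SIMT hardware.
def matmul(a, b):
    rows, inner, cols = len(a), len(b), len(b[0])
    return [[sum(a[i][k] * b[k][j] for k in range(inner))
             for j in range(cols)]
            for i in range(rows)]

print(matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19, 22], [43, 50]]
```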

However, a lot of interesting problems are seemingly parallel but highly
branching and nonlinear. Take path tracing as an example: it's very little
code and highly parallel, as each ray/pixel is independent, yet it's not an
easy problem for a GPU: each time a ray bounces it will diverge and not do
whatever the ray next to it was doing in terms of which geometry it will hit,
etc.

It might seem like today, if a problem can benefit from 8 CPU cores, then it
benefits 100x more from being run on a GPU, but this is far from true. A
great machine for general computing could do well with a board of 100 x86
CPUs in addition to a big GPU with a thousand cores for brute-forcing the
"simpler" problems.

~~~
vardump
> ... so it's hard to compete with something that has 1000 cores.

Any references for what has "1000 cores"? Nvidia GPUs usually have about 12
or so cores that can be compared to x86 cores, meaning they can branch
independently.

For example, the high-end Nvidia GTX 980 GPU has only 16 such comparable SIMD
execution cores (SMXs, or whatever Nvidia calls them).

GPU marketing materials confusingly refer to something like x86 CPU SIMD
lanes as "cores" (and that's being _very_ generous to GPUs), which
artificially inflates the numbers.

Or, put differently, one CUDA core can compute up to 1 FMA per cycle
@1196-1300 (?) MHz. One recent Intel x86 core can compute at least up to 16
FMAs per cycle @2800-4000 MHz.
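
Working those figures through (the clock speeds below are assumptions picked
from within the ranges quoted above, not measurements):

```python
# Rough per-core peak throughput, counting 2 flops per FMA.
flops_per_fma = 2
cuda_core = 1 * 1.2e9 * flops_per_fma    # 1 FMA/cycle @ ~1.2 GHz
intel_core = 16 * 3.0e9 * flops_per_fma  # 16 FMAs/cycle @ ~3.0 GHz
print(cuda_core / 1e9, intel_core / 1e9)  # 2.4 vs 96.0 GFLOPS per core
```

At these clocks one x86 core is worth roughly forty "CUDA cores", which is
why counting CUDA cores against CPU cores inflates the GPU numbers.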

~~~
pandaman
I am not very familiar with Nvidia hardware, but I imagine an SMX is not the
smallest unit that can branch. A "warp" can branch independently, and it's 32
lanes wide, so I figure an SMX with 192 "CUDA cores" can run 6 warps. That's
still hundreds of cores and not thousands, but much more than a dozen.

~~~
vardump
I think a warp is more like a hardware thread, and one SMX processes one
particular warp per clock cycle. So on any given clock cycle you still have
just as many independent _simultaneous_ control paths as you have SMX units.

~~~
pandaman
Not quite. All warps are running in parallel (otherwise you wouldn't get the
performance numbers) and each has its own control path (actually, each has
its own code), but, indeed, only one can execute control flow instructions at
a time, since the control unit is shared within the SMX.

~~~
vardump
Well, GPUs don't have any branch prediction or out of order capabilities, so
you need to have a way to keep execution units (mainly floating point units)
busy.

A WARP is really nothing more than a way to have work for SMXs (and
computational units it controls) at as many clock cycles as possible. You need
some way for masking FPU pipeline and memory latency.

> All warps are running in parallel (otherwise you won't get the performance
> numbers) and each has its own control path (actually each has its own code)

It's not that different from x86 hyperthreading, just with more hardware
threads. Pipelined execution units are fed each clock cycle by the core.
Multiple FP operations are in flight in parallel; otherwise CPUs wouldn't hit
their performance numbers either.

~~~
pandaman
Sure, an SMX can also switch between warps in a manner similar to
hyperthreading on x86, but that does not mean it executes only a single warp
at a time. Consider the Tesla K40, a GK110 with 15 SMXs. It runs at 750 MHz
and has a peak performance of 4.29 TFLOPS. If each SMX could only execute one
warp at a time, it could get, at most, 15 (number of SMXs) x 32 (warp width)
x 750M (frequency) x 2 (two flops per FMA) = 720 GFLOPS.
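
Checking that bound numerically (using exactly the figures in the comment):

```python
# One-warp-per-SMX-per-cycle upper bound for a 15-SMX GK110 (Tesla K40).
smx_count = 15
warp_width = 32
clock_hz = 750e6
flops_per_fma = 2
bound = smx_count * warp_width * clock_hz * flops_per_fma
print(bound / 1e9)  # 720.0 GFLOPS -- well below the quoted 4.29 TFLOPS peak
```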

~~~
0x07c0
The Tesla K40 has a peak double-precision performance of ~1.4 TFLOPS. Each
SMX has 64 DP cores, and the warp scheduler can schedule four warps per SMX
per cycle, so it can have two warps executing double-precision instructions
at the same time. But that number is not very interesting; the memory
bandwidth, on the other hand, is. A GK110 has 288 GB/s: take your code, work
out its arithmetic intensity, and you have an upper bound for your
performance, assuming you are memory bound of course.

[https://www.nvidia.com/content/PDF/kepler/NVIDIA-Kepler-GK11...](https://www.nvidia.com/content/PDF/kepler/NVIDIA-Kepler-GK110-Architecture-Whitepaper.pdf)

[https://www.nvidia.com/content/tesla/pdf/nvidia-tesla-k40-20...](https://www.nvidia.com/content/tesla/pdf/nvidia-tesla-k40-2014mar-lr.pdf)

------
fsaintjacques
What benefits do we gain versus a standard compiler with an intrinsics
approach?

~~~
CyberDildonics
With intrinsics you have to write each instruction by hand. Not only is this
a lot of work and compiler specific, but if you aren't familiar with all of
the instructions at your disposal, it is unlikely you will get the same
performance. Not only that, but ISPC can compile to multiple different SIMD
lane widths, so that code doesn't need to be rewritten when the width
increases or decreases.

One example would be the n-body simulation from the Computer Language
Benchmarks Game. The C++ version uses intrinsics but wouldn't benefit from
hardware that can do 4 doubles at a time instead of only two.

~~~
fsaintjacques
I would argue that learning the 'basics' of intrinsics is less of a burden
than learning a new C extension and modifying your existing code base to
include additional new build tools.

OTOH, I like the idea of automatic lane width detection.

~~~
CyberDildonics
I've learned ISPC and didn't find it too difficult. It largely boils down to
the varying and uniform keywords, along with one more loop syntax. I've also
looked into intrinsics, and I'm extremely skeptical that there is much
benefit there, other than getting to play with the actual CPU instructions.
ISPC produces tiny .o / .obj files and a header file; the integration has not
been a hurdle that I can remember.
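
To give a feel for what varying and uniform mean, here is a toy Python model
of the SPMD-on-SIMD idea (illustrative only -- not real ISPC syntax; the gang
width of 8 is an assumption, e.g. one AVX register of floats):

```python
# Toy model of ISPC's execution model: a "gang" of program instances runs the
# same code. A "uniform" value is shared by all lanes; a "varying" value holds
# one element per lane.
GANG_SIZE = 8  # hypothetical gang width

def spmd_scale(data, scale):
    """uniform float scale; varying float data -- each lane scales one element."""
    out = []
    for base in range(0, len(data), GANG_SIZE):
        varying = data[base:base + GANG_SIZE]   # one value per lane
        out.extend(x * scale for x in varying)  # all lanes run the same op
    return out

print(spmd_scale([1.0, 2.0, 3.0, 4.0], 10.0))  # [10.0, 20.0, 30.0, 40.0]
```

Recompiling real ISPC for a wider target just changes the gang width; the
source stays the same, which is the lane-width portability mentioned above.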

------
davidf18
Not certain why the Mac binary is for Mavericks and not El Capitan.

------
billconan
I just tried using this to accelerate my neural network. I got about a 4.5x
speedup. Not bad.

~~~
billconan
I thought this was a new compiler that had just been released; it turns out
it's been around for a while. How come I've never heard of it!

~~~
eggy
I caught that too, and I was surprised, but like a lot of under-publicized
research, it was there all along. I am guessing it will help reinvigorate
some uses for older Intel chips, and certainly newer ones. All good for
Intel, and for those who need the speedup.

