
Ispc: A SPMD Compiler for High-Performance CPU Programming - xstartup
http://ispc.github.io/documentation.html
======
gonewest
Used in the implementation of a production raytracer at Dreamworks Animation:

[http://tabellion.org/et/paper17/index.html](http://tabellion.org/et/paper17/index.html)

------
meuk
I think SIMD programming is a simple way to improve the performance of
applications that use the CPU. The real solution for applications that are
computation-bound and not inherently sequential is to perform the computations
on an architecture that allows massive parallelism (a manycore architecture
like Xeon Phi, a GPU, or even an FPGA).

~~~
rurban
Sure, but you still need the software to manage this massive parallelism. It
cannot just be OpenMP or OpenCL with C++; it needs a better and faster system,
such as ispc or pony. ispc is faster, pony is safer, and both are better and
faster than C++.

~~~
MaxBarraclough
At a glance, ISPC looks a lot like OpenCL. Does it outperform Intel's OpenCL-
on-CPU engine? If so, why?

~~~
mattpharr
Original ispc author here.

I have no idea how performance compares to OpenCL on CPUs today, but it was in
the same ballpark a few years ago.

The big difference is that OpenCL imposes a device model which is (IMHO)
ridiculous if you're running everything on the CPU. With ispc, you have:

* Ahead-of-time compilation to binary code (no driver compiler in the way, so you can look at the ASM and know that's what will run.)

* Straightforward interop with C/C++ code: it compiles to the C ABI, so going from whatever other language to ispc code is just a function call. (Similarly, ispc can call out to C code.)

* Straightforward interop with application data structures: you can (and should!) pass pointers back and forth between C/C++ code and ispc code, do computation using the same data structures, etc.

All three of those are much uglier / more painful with the device model.

~~~
MaxBarraclough
I'll play OpenCL's advocate:

> a device model which is (IMHO) ridiculous if you're running everything on
> the CPU

How so? It's rather GPU-flavoured, sure, but is this a problem? My
understanding is that it all maps down to CPUs just fine... even if no-one's
really using OpenCL purely for parallel CPU work.

> Ahead-of-time compilation to binary code

OpenCL offers this too - `clGetProgramInfo` lets you access the compiled
binary, and `clCreateProgramWithBinary` lets you make use of that binary.

> no driver compiler in the way, so you can look at the ASM and know that's
> what will run

Intel's OpenCL development tooling is really pretty good - it's not hard to
inspect the assembly. Same goes for AMD's tooling.

> going from whatever other language to ispc code is just a function call.

Neat. OpenCL can't do either, as everything it does has to work sensibly with
GPUs.

> ispc can call out to C code

Same again.

> Straightforward interop with application data structures

Fair point. I don't know if/how OpenCL handles the question of struct layouts,
or memory compatibility more generally.

I've passed structs between CPU and GPU with OpenCL, and it worked, but I
think that was a hail-Mary situation: there's really no assurance that the two
compilers' data layouts will match.

Even the definition of 'int' must be free to vary. I can't see how it couldn't
be.

~~~
mattpharr
IMHO, the problem with the device model is that it imposes a bunch of
unnecessary overhead on the programmer for cases where memory is shared and
you're running on the same processor.

If I just want to call a function, pass some parameters, have it do some work,
and get a result, things like OpenCL require all sorts of annoying boilerplate
just to pass parameter values, map buffers, copy results out, etc. Sure it's
all straightforward to write, but it's friction, and it's annoying.

Regarding clGetProgramInfo: does that return actual native executable code or
IR? (I assume it's free to do either but in practice returns the latter, and
that there's the usual "final driver compiler" between that code and what runs
on the hardware, but I don't know.) An issue with that is that you can't be
sure of what will run on users' systems; you're at the mercy of the version of
the driver they've got installed.

~~~
MaxBarraclough
> annoying boilerplate just to pass parameter values, map buffers, copy
> results out, etc. Sure it's all straightforward to write, but it's friction

Agree. It's quite a lot of work to orchestrate even a simple kernel.

> I assume it's free to do either

Looks that way - it seems AMD's engine lets you configure it. There are a
bunch of 'non-native' representations:

* the OpenCL C source itself (which may end up getting stored in the ELF)

* LLVM IR

* AMDIL (based on LLVM IR but not identical)

* HSAIL (again, like LLVM IR but not identical)

* SPIR (yet again, except that later versions of this IR aren't directly based on LLVM)

[http://openwall.info/wiki/john/development/AMD-IL](http://openwall.info/wiki/john/development/AMD-IL)

The poorly-documented "-fbin-exe" flag gets you the real native code.

[http://developer.amd.com/wordpress/media/2013/07/AMD_Acceler...](http://developer.amd.com/wordpress/media/2013/07/AMD_Accelerated_Parallel_Processing_OpenCL_Programming_Guide-rev-2.7.pdf)

I believe there's a way to get it to build for GPUs other than your own.
Whether it's exposed through the API, I'm not sure, but I'm fairly sure it can
be done with the dev tools.

(That took quite a bit of digging, which I suppose proves your point.)

