
The story of ISPC: origins (part 1) - jeffreyrogers
https://pharr.org/matt/blog/2018/04/18/ispc-origins.html
======
dragontamer
Whoever wrote ISPC "gets it", that SIMD lanes can be treated as if they were
threads. You almost have to program in OpenCL or CUDA before you realize that
SIMD code can be like normal serial code (with just a few, really really weird
performance quirks).

I haven't really been able to use ISPC myself yet, but it's on my to-do list.
I've found that a lot of code fails to run well on a GPU and still needs a
normal CPU for computation... but some SIMD bits here and there can really
benefit from a higher-level language (rather than writing in intrinsics or
relying on auto-vectorizers).

GPUs are still very far away from the CPU. You gotta push your data to RAM,
then over PCIe lanes... and spend many microseconds to milliseconds pushing
the data to-and-from the GPU. While SIMD on a CPU benefits from L1 cache
locality.

Intrinsics definitely work, but are hard to program. Auto-vectorizers only
fire when the program is structured just right... and figuring out that
structure takes a ton of study.

ISPC seems like the real way to get well-"threaded" SIMD code written
productively for CPUs. It hides things behind relatively easy to understand
abstractions and is the best 'CUDA / OpenCL for the CPU' I've found so far.
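For flavor, here's a hedged C++ sketch of the SPMD idea ISPC is built on: you write the kernel body for one element ("one lane"), and the compiler maps that body across all SIMD lanes at once. ISPC's actual syntax uses `foreach` and `varying` qualifiers, which this plain-C++ approximation doesn't show.

```cpp
#include <cstddef>

// Sketch of the SPMD style: the loop body is "the program" a single lane
// runs. A SIMD-aware compiler (ISPC, or an auto-vectorizer on a good day)
// executes this body for 4/8/16 elements per instruction.
void saxpy(float a, const float* x, const float* y,
           float* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = a * x[i] + y[i];  // one lane's worth of work
    }
}
```

The point is that the per-element body reads like normal serial code; the lane-parallelism is in how the compiler executes it, not in how you write it.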

---------

Why not use CUDA? Well, CUDA doesn't share L1 cache with my CPU cores. That's
why. You can't beat AVX in terms of latency. CUDA / NVidia GPUs are way
faster, but you have to spend a LOT of time transferring data before CUDA code
even begins to compute.

In contrast, ispc compiles to a .o that links into your standard C or C++ code
you've been writing.

~~~
CyberDildonics
SIMD lanes can absolutely not be treated as threads. Not only can they not
diverge in the instructions they execute (obviously), they can't efficiently
load from different addresses (gather) or write to different addresses
(scatter) on AVX.

Thinking of SIMD lanes as threads is not likely to be a path that pays off.

~~~
dragontamer
Performance wise, you are correct.

But from a programming point of view, CUDA and OpenCL PROVE that you can use
SIMD-cores as if they were a bunch of threads. Yes, there is a performance
penalty (one that compounds with every level of nested divergence) on every
thread divergence. However, a HUGE set of problems (deep learning, matrix
multiplication, graphics, ray-tracing, encoding, databases: merge-join,
hash-join, etc. etc.) has successfully been implemented as "SIMD as if it
were threads".

How to deal with divergence? Just don't use if-statements! If you can figure
out how to cheat if-statements out of your code (or to group "threads" so that
they take if-statements all together at the same time), you get HUGE
performance wins on these architectures. And GPU programmers have consistently
found more and more ways to "reconverge" threads so that they can be "ganged
up" as SIMD-threads in the most efficient manner possible.
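A minimal sketch of "cheating the if-statement out": instead of branching per element, compute a mask and blend. On SIMD hardware the blend lowers to a compare-and-select instruction, so every lane stays in lockstep regardless of its data.

```cpp
// Divergent version: each element takes its own branch.
float clamp_branchy(float x) {
    if (x < 0.0f) return 0.0f;
    return x;
}

// Branchless version: compute a 0/1 mask and select by multiplication.
// Compilers lower this to a compare + blend (no jump), which is exactly
// what keeps SIMD lanes in lockstep.
float clamp_branchless(float x) {
    float mask = (x >= 0.0f) ? 1.0f : 0.0f;  // compare, not a taken branch
    return mask * x;
}
```

Both return the same results; the branchless form is the one that vectorizes cleanly.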

No, it doesn't work for Chess or other "innately divergent" problems. It
doesn't work for many problems. But this "gang up SIMD into threads" works
very well for many, many problems. And it's easy to do.

Case in point: Ray-tracing. Every ray starts off as a camera-ray, but then
when it hits an object, it may turn into a diffuse, specular, or sub-surface
scattering ray (or many other types, depending on what object you hit). Seems
like we have to use an if-statement to continue...

But that's wrong! Raytracing becomes GPU-friendly by grouping all specular-
rays together, and computing SIMD-gangs on all specular-rays. Then, group all
diffuse-rays together, and gang up the threads across those. Etc. etc.

Specular rays may bounce off and turn into diffuse, specular, sub-surface, or
other kinds of rays on each bounce. But just "re-configure" the SIMD-gang on
each bounce, and you get huge SIMD-based benefits.
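Here's a hedged C++ sketch of that regrouping trick. The names (`Ray`, `RayType`) are illustrative, not from any real renderer: between bounces, sort the ray stream by type so that each SIMD gang processes a contiguous run of same-typed rays with zero per-lane divergence.

```cpp
#include <algorithm>
#include <vector>

// Illustrative ray types; a real renderer would have more state.
enum class RayType { Specular, Diffuse, SubSurface };

struct Ray {
    RayType type;
    float payload;  // stand-in for real ray state (origin, direction, ...)
};

// Regroup rays so same-typed rays are contiguous. Each type's shading
// kernel then runs over a divergence-free range. stable_sort keeps the
// original order within each type, which helps memory coherence.
void regroup_by_type(std::vector<Ray>& rays) {
    std::stable_sort(rays.begin(), rays.end(),
                     [](const Ray& a, const Ray& b) {
                         return static_cast<int>(a.type)
                              < static_cast<int>(b.type);
                     });
}
```

After each bounce you re-run the regrouping, which is the "re-configure the SIMD-gang" step described above.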

Again: get rid of your if-statements, and you scale. I admit it doesn't work
all the time, but it does work a LOT of the time.

> Thinking of SIMD lanes as threads is not likely to be a path that pays off.

Every supercomputer that is buying a V100 disagrees with you (e.g. Summit).
There are plenty of useful problems that work under the SIMD-as-threads model.

[https://en.wikipedia.org/wiki/Summit_(supercomputer)](https://en.wikipedia.org/wiki/Summit_\(supercomputer\))

We're talking $325 million into a NVidia V100-based supercomputer. It seems
like Oak Ridge National Laboratory believes in this methodology.

> they can't efficiently load from different addresses (gather) or write to
> different addresses (scatter) on AVX.

AVX2 added gather instructions, while scatter instructions arrived with AVX-512.

See AVX2 instruction: VPGATHERDD
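For concreteness, here's a scalar C++ model of what a gather like VPGATHERDD does: load several 32-bit elements from base-plus-per-lane-index in one instruction (the intrinsic form is `_mm256_i32gather_epi32`; this sketch stays portable and just shows the semantics).

```cpp
#include <cstddef>
#include <cstdint>

// Scalar model of an AVX2 gather: each "lane" loads from a different
// address, base[idx[i]]. Hardware does all lanes in one VPGATHERDD;
// this loop is the per-lane semantics.
void gather32(const std::int32_t* base, const std::int32_t* idx,
              std::int32_t* out, std::size_t lanes) {
    for (std::size_t i = 0; i < lanes; ++i) {
        out[i] = base[idx[i]];  // per-lane load from a distinct address
    }
}
```

Whether the hardware gather is actually *fast* is a separate question (early AVX2 implementations microcoded it), but the instruction does exist.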

~~~
CyberDildonics
> But from a programming point of view, CUDA and OpenCL PROVE that you can use
> SIMD-cores as if they were a bunch of threads.

This is nonsense - to beat a CPU you have to understand that you can't have
every thread doing something different.

> How to deal with divergence? Just don't use if-statements!

Individual threads don't have to avoid if statements. Trying to group threads
and SIMD lanes together, only to make all sorts of concessions as to how they
are different, is silly and counterproductive.

> We're talking $325 million into a NVidia V100-based supercomputer. It seems
> like Oak Ridge National Laboratory believes in this methodology.

Just because they are using recent GPUs doesn't mean they 'treat SIMD lanes as
threads'. It just is not correct to say 'SIMD lanes are just like threads
except you can't diverge, you have to run the same instruction, it's better if
you don't load from different memory addresses and any non-trivial branching
kills performance, including looping a different number of times on different
cores'

------
tom_
Even if you don't care about ispc, the description of Intel's culture might be
interesting.

~~~
ncmncm
By all reports, it seems astonishing that Intel ever manages to ship
anything.

The subtext might be that Intel upper management's main job is to kill
projects that might distract from things that absolutely must ship. And maybe
the wholesale killing frequently gores stuff that would have succeeded in
shipping despite it all.

~~~
hyperman1
Isn't this basically every company larger than a few hundred workers? The top
chooses some random direction, the middle managers feud, and the bottom does
some work which gets thrown away half the time. Even so, these orgs are so big
they dominate their industry despite all this.

------
jeffreyrogers
Link to all posts: [https://pharr.org/matt/blog/2018/04/30/ispc-
all.html](https://pharr.org/matt/blog/2018/04/30/ispc-all.html)

