
SPMD Is Not Intel’s Cup of Tea - panic
http://www.joshbarczak.com/blog/?p=1120
======
dzdt
Twenty years ago assembly language was losing its last battles with higher
level languages as a viable choice for game development. Skilled assembly
language programmers insisted they could give a factor of two more performance
on the same hardware. It wasn't enough: everyone moved on.

Now the author is again saying that hand-coded assembly can outperform
higher-level languages by a factor of two for some test cases.

I don't see the economics working out any better for the assembly language
solution this time than last time. It won't port across different hardware
generations, is hell to write and maintain, and requires super-specialist
knowledge.

What this might point to is room for a new high-level language that maps
better to the reality of current hardware.

~~~
lambda

      What this might point to is room for a new high-level 
      language that maps better to the reality of current 
      hardware.
    

Isn't that exactly what the author said in his last two paragraphs, including
a hand-wavy sketch of what he thinks that high-level language might look like?

~~~
vvanders
Since the site is down and I can't read the article I can only guess here.

For me it's something with value types & memory coherency semantics. That's
what you need to get the most out of your hardware.

Good candidates for that today are Rust, C, C++, and maybe C# (I don't have
enough experience with its value types to make a judgement yet).

------
mtgx
Site's already dead:
[https://webcache.googleusercontent.com/search?q=cache:UthQNd...](https://webcache.googleusercontent.com/search?q=cache:UthQNdCvd1cJ:www.joshbarczak.com/blog/%3Fp%3D1120+&cd=1&hl=en&ct=clnk&gl=us)

~~~
breakingcups
It's up over here

~~~
X-Istence
It's flip-flopping...

------
Zenst
I'm wondering if this highlights that compiler optimization for
GPU/SPMD/vector processors lags a generation behind conventional CPU
optimization.

Sure, with CPUs you have variety, but less of it, and the SPMD space sees
far more dynamic change over time, with GPU architectures still changing
more from generation to generation than CPU architectures do.

Given that perhaps this highlights that area as an opportunity.

But the whole area is so dynamic, with GPUs from 5 years ago differing far
more at the hardware level than CPUs of the same age, that I suspect any
investment in low-level optimization will in many cases cost more than
simply staying portable, running easily on the next generation, and gaining
accordingly.

With this in mind, the architectures and associated hardware change more
often than core CPUs do, and perhaps that alone keeps such optimization
processes from maturing and coming into play. That may change once things
settle down.

Though I'm mindful that this is in many ways the old Betamax vs. VHS story,
from which we learned that it is not always the best technology that wins
with the masses, but the one that is easiest and cheapest to attain and run.
In short, perfection rarely wins out against just good enough for a lot
less effort.

------
TickleSteve
"optimal" and "standard" do not go together....

You will always be more efficient if you can tailor your code to a
particular device; a standard abstraction will never let you get to that
level.

What you need to ask is: 1) Do you _really_ need the extra 90% effort for
the 10% extra performance? 2) Are the tools good enough? Are the
abstractions wrong?

This article seems to be saying that the abstractions are wrong... how do
these techniques fare on other devices?

~~~
eggy
I don't understand SPMD as fully as I understood SIMD (from Dart!), but it
sounds like you can write just one version using SPMD instead of explicit
per-width coding, since it runs equally well on 64-wide or 8-wide SIMD
units, no?

From the article:

      "SPMD has a serious advantage in that it’s able to target AMD’s
      massive 64-wide SIMDs just as easily as Intel’s itty-bitty 8-wide
      ones, at least for simple usage patterns like do this operation
      on 1 million pixels. For simple stream processing applications,
      SPMD makes perfect sense, and it’s relatively easy to write one
      source code that will run close to optimal on all the
      architectures."

------
vardump
From the article:

> These are not rock-solid results. I admit I’ve cut a lot of corners and left
> some room for doubt, but it looks like significant speedups are possible by
> using explicit thread-level programming. There are optimizations that can be
> done at this level which are simply not possible using a conventional SPMD
> model.

Yeah. This is the issue with abstracting using OpenCL, CUDA, etc.

If you need absolute performance, sometimes explicit control, even if it means
using SIMD intrinsics, is better than a compiler that can't understand what
you are really trying to do. Of course, the other side of the coin is that
using explicit-width intrinsics is more work and less portable.
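
To make that concrete, here is a sketch of my own (plain C++ with AVX
intrinsics assumed; not from the article) of a trivial kernel hard-wired
to one width:

    // The same out[i] = a[i] * 2 + b[i] loop, committed to 8-wide AVX.
    // Assumes n is a multiple of 8 and an AVX-capable CPU.
    #include <immintrin.h>
    #include <cstddef>

    void scale_add_avx(const float* a, const float* b, float* out,
                       std::size_t n) {
        const __m256 two = _mm256_set1_ps(2.0f);
        for (std::size_t i = 0; i < n; i += 8) {
            __m256 va = _mm256_loadu_ps(a + i);
            __m256 vb = _mm256_loadu_ps(b + i);
            // Hard-wired to an 8-lane width; retargeting this to a 16-
            // or 64-wide machine means rewriting the loop by hand.
            _mm256_storeu_ps(out + i,
                             _mm256_add_ps(_mm256_mul_ps(va, two), vb));
        }
    }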

For example, compilers don't understand memory access patterns and it's hard
to share work between execution lanes (work units).

Think of an algorithm that processes images in rectangular blocks, where
each block also needs the rightmost column of pixels from the block to its
left. I know the swizzling performed by the cache/memory controller would
make this example less bad, but humor me; I'm trying to make a generic
point, not one specific to spatial image processing.

If the order of execution is suitable, the code that computes the first
block could leave its rightmost column of pixels sitting in registers,
ready for the next block to the right.

So if you know and control the processing order, you can pull that single
column of pixels from registers (set while the block to the left was being
processed) and issue just one memory read for the current block.

Compare that with just reading both blocks again, which is what a compiler
would do. Sure, the data would likely come from cache, but it means more
cache pressure: 95% of those pixels might not be needed anymore, just the
right edge of the left-side block. In other words, in the worst case you
need almost twice as much data in cache as otherwise.
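
In code, the trick looks roughly like this (a C++ sketch of my own; the
names and the stand-in filter are hypothetical, just to make the reuse
concrete):

    // Blocks in a row are processed left to right; the rightmost column
    // of the block just finished stays in a small local array (i.e. in
    // registers), so the next block never re-reads its left neighbour.
    #include <cstddef>

    constexpr int B = 4; // block size, matching the article's 4x4 tiles

    void filter_row_of_blocks(const float* img, float* out,
                              std::size_t width, std::size_t y0) {
        float left_edge[B] = {}; // right column of the previous block
        for (std::size_t x0 = 0; x0 + B <= width; x0 += B) {
            for (int y = 0; y < B; ++y) {
                float prev = left_edge[y];
                for (int x = 0; x < B; ++x) {
                    float cur = img[(y0 + y) * width + x0 + x];
                    // The work itself is a stand-in: a horizontal
                    // difference that needs the pixel to the left,
                    // held in a register instead of re-read.
                    out[(y0 + y) * width + x0 + x] = cur - prev;
                    prev = cur;
                }
                left_edge[y] = prev; // stash this block's right edge
            }
        }
    }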

> Example 1: Block Min/Max

> My first test case is taking a scalar image and computing the min and max
> over 4×4 tiles. This is the first half of a BC4 compressor. ...
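
For reference, the operation being measured is roughly the following (a
scalar C++ sketch of my own, not the article's code):

    // Min and max over each 4x4 tile of a single-channel 8-bit image;
    // the first half of a BC4 compressor. Scalar reference only;
    // assumes w and h are multiples of 4.
    #include <algorithm>
    #include <cstddef>
    #include <cstdint>

    void block_min_max(const std::uint8_t* img, std::size_t w, std::size_t h,
                       std::uint8_t* mins, std::uint8_t* maxs) {
        for (std::size_t ty = 0; ty + 4 <= h; ty += 4)
            for (std::size_t tx = 0; tx + 4 <= w; tx += 4) {
                std::uint8_t lo = 255, hi = 0;
                for (int y = 0; y < 4; ++y)
                    for (int x = 0; x < 4; ++x) {
                        std::uint8_t p = img[(ty + y) * w + tx + x];
                        lo = std::min(lo, p);
                        hi = std::max(hi, p);
                    }
                std::size_t tile = (ty / 4) * (w / 4) + tx / 4;
                mins[tile] = lo;
                maxs[tile] = hi;
            }
    }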

I'm surprised the first example's running time dropped only from 1.7M to
1.15M. Maybe the code is limited by cache/memory access (or some other
resource; I'm not familiar with Intel's GPUs).

This might be a good place to try random evolution (genetic algorithms,
Monte Carlo methods, etc.) on instruction and memory access scheduling,
code variations, and so on. I suspect hardware-level details of Intel's
GPUs are not public, and some instruction stream orderings might just
perform faster.

~~~
marcosdumay
> Instead of just reading both blocks again, which is what a compiler would
> do.

We have better compilers than that already. If you have static data flow,
there's no reason your compiler cannot take it into account when optimizing
your code.

Optimizing dynamic data flow is the hard case (still possible, but not done
in practice), but people suck at that too.

~~~
vardump
> We have better compilers than that already. If you have static data flow,
> there's no reason your compiler cannot take it into account when
> optimizing your code.

That's right, but this is not about static data flow.

This is about choosing a particular execution strategy and caching a small
amount of data in registers to maximize the usefulness of the cache by
keeping less "hot memory". That reduces expensive memory traffic, which is
very often the bottleneck.

More generally, you can often do a significant part of the work in another
lane, but you can't communicate the result efficiently to where it's
needed, because that lane might be executing simultaneously. So you have
to, for example, access memory unnecessarily or do extra computation for
little benefit. When you can control execution order, this is no longer an
issue.

I know how amazing compilers are nowadays, but I have never seen any of
them optimize this type of case.

------
nimos
He mentions 64-wide SIMD units from AMD; does anyone know what he's
talking about?

~~~
varelse
Yes: NVIDIA GPU SIMD units are 32-wide (an execution group also called a
warp), AMD GPU units are 64-wide, and Intel's are anything from 4- to
16-wide depending on which processor you're using. CUDA/OpenCL allow one to
query this width and optimize one's kernels to avoid pointless
synchronization within a vector unit. CUDA 8 now has a NOP synchronization
that only fires between warps to make this sort of programming canonical
going forward.
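
A minimal sketch of that width query via the CUDA runtime API (host-side
C++; error handling omitted, and the OpenCL analogue would be querying
CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE):

    // Query the hardware SIMD (warp) width: 32 on NVIDIA GPUs.
    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, /*device=*/0);
        std::printf("SIMD (warp) width: %d\n", prop.warpSize);
        return 0;
    }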

