
Swizzle Inventor: Data Movement Synthesis for GPU Kernels [pdf] - ingve
https://lenary.co.uk/publications/swizzle_inventor.pdf
======
adampk
In short, they created a tool that makes it easier to optimize data movement
in GPU kernels.

Currently you need a good grasp of register-cache and warp-shuffle techniques
([https://devblogs.nvidia.com/register-
cache-warp-cuda/](https://devblogs.nvidia.com/register-cache-warp-cuda/)) to
optimize your pipeline, which is no pedestrian task.
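
For a flavor of what that hand-tuning looks like, here is a minimal sketch of
the register-cache idea from that post: a 3-point stencil where each lane
keeps its value in a register and fetches its neighbors with warp shuffles
instead of going through shared memory (the kernel name and boundary handling
are my own, not from the post):

    
    
        __global__ void stencil3(const float *in, float *out, int n) {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            float v = (i < n) ? in[i] : 0.0f;
            // Neighbors come from adjacent lanes' registers,
            // not from shared memory.
            float left  = __shfl_up_sync(0xffffffff, v, 1);   // lane i-1
            float right = __shfl_down_sync(0xffffffff, v, 1); // lane i+1
            // At warp edges the shuffle returns the lane's own value,
            // so a real kernel still patches the boundaries explicitly.
            if (i < n) out[i] = (left + v + right) / 3.0f;
        }
    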

Now you can define what you want to compute and let Swizzle Inventor come up
with the best way to exploit the register cache, apparently as well as top
CUDA engineers do.

Great for real-time computer vision applications that might need to solve
large nonlinear least-squares problems.

------
cs702
Their tool automates the creation of new GPU kernels that are 1.1x to 3.2x
faster than expert-optimized code on the tested computations, including 2D
convolutions (which are widely used in deep learning models). This tool thus
has the potential to boost GPU performance for many computationally intensive
applications, including AI applications:

    
    
      "We develop Swizzle Inventor to help programmers implement
       swizzle programs, by writing program sketches that omit
       swizzles and delegating their creation to an automatic
       synthesizer. Our synthesis algorithm scales to real-world
       programs, allowing us to invent new GPU kernels for stencil
       computations, matrix transposition, and a finite field
       multiplication algorithm (used in cryptographic applications).
       The synthesized 2D convolution and finite-field multiplication
       kernels are on average 1.5-3.2x and 1.1-1.7x faster,
       respectively, than expert-optimized CUDA kernels."

------
dragontamer
Swizzles are among the most powerful instructions on any CPU or GPU.

AVX's "vpshufb", CUDA PTX's "shfl.sync", and AMD's DPP / ds_permute /
ds_bpermute instructions can all move data around very quickly and
efficiently.

It's not so much that these instructions are hard: it's that they are so
flexible. Everyone who studies them instinctively knows that they are among
the best data-movement operators ever invented.

It seems like this "Swizzle Inventor" provides computer-based optimization to
search for ideal swizzles. That's really cool.

------
avinium
This looks really interesting - but I have to admit it's gone completely over
my head. Can someone break this down into simpler terms?

~~~
dragontamer
In CUDA PTX, there is an instruction called "shfl.sync.idx", which allows you
to shuffle data between threads arbitrarily. However, this shuffling is only
efficient if you follow a variety of rules: avoiding bank conflicts and so
forth.

Let's say you have the 32 threads of a warp running: Thread0, Thread1,
Thread2, and so on. Now let's say they each have a variable named "x".

Thread0.x, Thread1.x, Thread2.x, etc. etc.

What "shfl.sync" does is allow these "x" variables to be shuffled between all
threads in one step.

    
    
        // Parallel and synchronous: All threads swap 
        // values arbitrarily. This is a simple "rotate", 
        // but all combinations are legal for a shuffle
        
        Thread0.x = Thread13.x
        Thread1.x = Thread14.x
        Thread2.x = Thread15.x
        Thread3.x = Thread16.x
        ...
        Thread30.x = Thread11.x
        Thread31.x = Thread12.x
    

That's a swizzle. Now the source lanes on the right? The 13, 14, 15, etc. can
be any combination. You can broadcast (have Thread1.x copied to all 32
threads). You can rotate, you can shuffle. All 32^32 possible specifications
of CUDA-thread transfers are legal.
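
In actual CUDA, that rotate is a single intrinsic call per thread (a minimal
sketch; the kernel and variable names are mine):

    
    
        __global__ void rotate13(int *data) {
            int lane = threadIdx.x & 31;  // lane ID within the warp
            int x = data[threadIdx.x];    // each thread's register copy of x
            // Every lane pulls x from the lane 13 ahead of it (mod 32);
            // all 32 transfers happen in one instruction.
            x = __shfl_sync(0xffffffff, x, (lane + 13) & 31);
            data[threadIdx.x] = x;
        }
    

(Launch with a multiple of 32 threads per block; __shfl_sync expects every
lane named in the mask to participate.)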

Now, every pattern may be legal, but some patterns are slower in practice
(for instance, when data has to be staged through shared memory, bank
conflicts come into play). So following the special rules that make swizzles
fast is an annoying task, for sure. This paper helps the programmer create
optimal swizzles for their program.
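
For example, one of the classic rules: when staging a tile through shared
memory, pad it so that column accesses spread across the 32 banks instead of
piling onto one. A sketch of a conflict-free 32x32 transpose (assumes a
single 32x32 thread block and a 32x32 row-major matrix; padding is just one
of the hand-tuning tricks that tools like this aim to automate):

    
    
        __global__ void transpose32(const float *in, float *out) {
            __shared__ float tile[32][33]; // 33, not 32: the extra column
                                           // staggers rows across the banks
            int x = threadIdx.x, y = threadIdx.y;
            tile[y][x] = in[y * 32 + x];   // coalesced, conflict-free write
            __syncthreads();
            out[y * 32 + x] = tile[x][y];  // column read: the padding keeps
                                           // every lane on a different bank
        }
    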

For more information:

* [https://docs.nvidia.com/cuda/parallel-thread-execution/index...](https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#data-movement-and-conversion-instructions-shfl-sync)

* [https://devblogs.nvidia.com/using-cuda-warp-level-primitives...](https://devblogs.nvidia.com/using-cuda-warp-level-primitives/)

TL;DR: GPUs have a rich set of highly efficient data-movement operations.
They are easy to use but hard to optimize. The paper presents techniques that
help optimize them automatically.

~~~
avinium
Great explanation, appreciate the effort. Will dig deeper into the provided
links.

------
lostmsu
Too bad the source code hasn't been published yet.

