Hacker News new | past | comments | ask | show | jobs | submit login
C-for-Metal: High Performance SIMD Programming on Intel GPUs (arxiv.org)
126 points by lelf 35 days ago | hide | past | favorite | 49 comments



Compiling from high-level lang to GPU is a huge problem, and we greatly appreciate efforts to solve it.

If I understand correctly, this (CM) allows for C-style fine-level control over a GPU device as though it were a CPU.

However, it does not appear to address data transit (critical for performance). Compilation and operator fusing to minimize transit is possibly more important. See Graphcore Poplar, Tensorflow XLA, Arrayfire, Pytorch Glow, etc.

Further, this obviously only applies to Intel GPUs, so investing time in utilizing low-level control is possibly a hardware dead-end.

Dream world for programmers is one where data transit and hardware architecture are taken into account without living inside a proprietary DSL Conversely, it is obviously against hardware manufacturers' interests to create this.

Is MLIR / LLVM going to solve this? This list has been interesting to consider:

https://github.com/merrymercy/awesome-tensor-compilers


I'm not a hardware engineer, but I am a GPU-focused graphics engineer.

> C-style fine-level control over a GPU device as though it were a CPU.

Personally, I think this is a fool's errand, and this has nothing to do with my desire for job security or anything. When I look at how code in the ML world is written for a GPU for example, it's really easy to see why it's so slow. The CPU and GPU architectures are fundamentally different. Different pipelining architecture, scalar instead of vector, 32/64-wide instruction dispatches, etc. HLSL/GLSL and other such shader languages are perfectly "high level" with other needed intrinsics needed to perform relevant warp level barriers, wave broadcasts/ballots/queries, use LDS storage, execute device level barriers, etc. This isn't to say that high level shader language improvements aren't welcome, but that trying to emulate a CPU is an unfortunate goal.


What kinds of improvements would you like to see?


From an ergonomics point of view, it would be nice to have pointers directly into a descriptor heap, but this is coming already with SM 6.6 any day now. If I'm allowed to request hardware capabilities, I would like a non-divergent warp jmp instruction based on a function pointer table (much like is done with DXR-class raytracing hardware/shaders) to eliminate pipelining introduced only to mitigate shader permutation explosion.


> is possibly a hardware dead-end.

I'm thinking the opposite, there been an unending succession of different accelerators for doing this, and that which eventually been obsoleted, and forgotten when general purpose CPU caught up to them in performance, or comp-sci learned how to do calculations more efficiently on mainstream hardware.

Just by seeing how morbid are the sales of the new "NPUs," I can guess it's already happening.

A number of cellphone brands experimented with such to run selfie filters, or do speech recognition, but later found that those work on CPUs not any worse at all if competent programmers are hired, and then threw the NPU hardware out, or disabled it.


> I'm thinking the opposite, there been an unending succession of different accelerators for doing this, and that which eventually been obsoleted, and forgotten when general purpose CPU caught up to them in performance, or comp-sci learned how to do calculations more efficiently on mainstream hardware.

There's only been two styles:

* SIMD style compute (which has been around since Cray and CM-2, and continues today with AMD GPUs and NVidia GPUs).

* Systolic-style compute -- which has been around since CD-ROM ASICs for decoding the CD-ROM realtime as it was spinning. Apparently Reed-Solomon decoding needed a ton of matrix-multiplications.

Modern "NPU Hardware" is simply systolic compute.

------------

Traditional CPUs load data and then store data. The core itself can remain unchanging in the demands of different applications.

Systolic Compute always loads/stores data in the same order, as it traverses across your systolic array. The systolic array designed for CD-ROM playback is quite different than the systolic array for NPUs.

You gain faster speeds by building the hardware itself to handle the load/store pattern of the algorithm. That way the CPU-core itself doesn't waste time on loads/stores, nor do you have to worry about memory bandwidth issues or whatever.

But by doing so: you lock yourself into a particular application. Modern neural-net NPUs can only do FP16 matrix multiplications.

That's fine. If you want the fastest neural net in the world, systolic arrays will get you there. But that thing will ONLY be able to do neural net calculations, nothing else.

Maybe you build the systolic array into an FPGA so that you can reconfigure the hardware later. Or maybe you just go with an ASIC because you're Google and can afford to build out clusters of ASICs of Systolic Arrays that can only do tensor-calculations.


When you look at how mind-bendingly precise surrogate NNs can be, sometimes you build your pipeline with 'slow' FP16 NNs and it's still faster than the 'direct' operation.

I also like how using tensor cores at the same time as cuda cores is how you get 'more' performance. It's a bit like CPU execution ports, if you /don't/ use them you're leaving performance on the floor. Only here nvidia (cuda, tensor, rt) 'cores' are a bit higher level than exec ports. Also it's kind of hard (i.e. fun) to find a use for rt cores. Digging up all those 90's papers about exotic applications of Raytracing (I mean weed must have been great then :-)

Interesting times.


Domain-specific compilers that generate explicit SIMD code from a high-level specification are even nicer. These can fully exploit the capabilities of the instruction set (e.g., fast permutes, masking, reduced precision floats, large register file, etc.) for a particular domain

For example, generating AVX-512 code for convnet inference: https://NN-512.com

NN-512 does four simultaneous 8x8 Winograd tiles (forward and backward) in the large AVX-512 register file, accelerates strided convolutions by interleaving Fourier transforms (again, with knowledge of the large register file), makes heavy use of the two input VPERMI2PS permutation instructions, generates simplified code with precomputed masks around tensor edges, uses irregular/arbitrary tiling patterns, etc. It generates code like this:

https://nn-512.com/example/11

This kind of compiler can be written for any important domain


This is great, but it doesn't address GPUs. If you built it for GPUs, from what I understand, that outcome would basically look like tensorflow, or maybe tensorflow XLA. Is that right?


My point is that a less general compiler can yield better SIMD code for a particular domain, and be easier to use for a particular domain. And I gave a concrete illustration (NN-512) to support that claim

Consider NN-512 (less general) versus Halide (more general). Imagine how hard it would be to make Halide generate the programs that NN-512 generates. It would be a very challenging problem


Understood: NN-512 is a local optimum in an optimization of hardware and problem structure.


> The SIMT execution model is commonly used for general GPU development. CUDA and OpenCL developers write scalar code that is implicitly parallelized by compiler and hardware. On Intel GPUs, however, this abstraction has profound performance implications as the underlying ISA is SIMD and important hardware capabilities cannot be fully utilized

What? That makes no sense.

GPU processor cores are basically just SIMD with a different color hat. The SASS assebly simply has _only_ SIMD instructions - and with the full instrunction set being SIMD'ized, it can drop the mention of "this is SIMD" and just pretend individual lanes are instruction-locked threads .

So, an OpenCL compiler would do very similar parallelization on a GPU and on an Intel CPU. (It's obviously not exactly the same since the instruction sets do differ, and the widths are not the same, and Intel CPUs has different widths which could all act at the same time etc.)

So, the hardware capabilities can be utilized just fine.


Modern NVIDIA GPUs (since Volta) drop that pretence at the ISA level, each thread has its own instruction pointer there.

Your GPU ISA is scalar, not vector on modern NVIDIA machines.


The compiler is open source, available at https://github.com/intel/cm-compiler

More documentation at https://01.org/c-for-metal-development-package


Another interesting reference from a few years ago: http://www.joshbarczak.com/blog/?p=1028

Also read the followups (1120 and 1197), as they go into considerably more detail about the SPMD programming model and some use cases.

The author is now at Intel working on ray tracing.


Intel has a previous SPMD compiler here: https://ispc.github.io

Although the author seemed to have fled Intel soon after releasing it, and apparently spent the whole development process terrified that corporate politics would make him cancel it.


Skimming through the paper, it seems they don't referece or review other recent GPU languages, just OpenCL and CUDA. Seems curious as it's an active area.


Why are Intel GPUs designed in such a way that typical GPU languages don’t fully exploit it? Is the new Xe architecture still SIMD?


OpenCL works on Intel GPUs, while CUDA doesn't because CUDA is an NVidia technology.

> Is the new Xe architecture still SIMD?

SIMD is... pretty much all GPUs do. There's a few scalar bits here and there to speed up if-statements and the like, but the entire point of a GPU is to build a machine for SIMD.


OpenCL is basically dead at this point, too. The de facto standard is CUDA and there aren’t currently any real challengers. Maybe eventually AMD’s ROCm or Intel’s oneAPI will get traction.


For them to get traction, they need to invest in debugger tooling that allows the productivity as on CPUs, and to help language communities other than C and C++ to target GPGPUs.

NVidia started doing both around CUDA 3.0, whereas Khronos, AMD and Intel only started paying attention that not everyone wanted to do printf() style debugging with a C dialect until it was too late to get people's attention back.


AMD had a good Visual Studio plugin for OpenCL, complete with debugging support, although I believe it's since been discontinued.


oneAPI uses DPC++ (Data-Parallel C++), which is pretty much just SYCL, which itself is a C++ library on top of OpenCL.

From my understanding, the Khronos group realized OpenCL 2.x was much too complicated so vendors just weren’t implementing it, or only implementing parts of it, so they came up with OpenCL 3.0 which is slimmed-down and much more modular. It’s hard to say how much adoption it’ll get, but with Intel focused on DPC++ and oneAPI now, there will definitely be more numerical software coming out in the next few years that compiles down to and runs on OpenCL.

For example, Intel engineers are building a numpy clone on top of DPC++, so unlike regular numpy it’ll take advantage of multiple CPU cores: https://github.com/IntelPython/dpnp


Kind of right.

DPC++ has more stuff than just SYSCL, some of it might find its way back to SYSCL standardization, some of it might remain Intel only.

OpenCL 3.0 is basically OpenCL 1.2 with a new name.

Meanwhile people are busy waiting for Vulkan compute to take off, got to love Khronos standards.


Some people are working on being able to run OpenCL kernels on Vulkan: https://github.com/google/clspv


Sure, but it will take off to actually matter?

So far I am only aware of Adobe using it to port their shaders to Vulkan on Android.


> From my understanding, the Khronos group realized OpenCL 2.x was much too complicated so vendors just weren’t implementing it, or only implementing parts of it, so they came up with OpenCL 3.0 which is slimmed-down and much more modular.

Something like this also happened to OpenGL 4.3. It added a compute shader extension which was essentially all of OpenCL again, except different, so you had 2x the implementation work. This is about when some people stopped implementing OpenGL.


OpenGL compute shaders are a natural step if you have unified programmable shader cores, and less complicated than adding new pipeline stages for everything (tessellation shaders, geometry shaders, …).

Khronos could have chosen only to add OpenCL integration, but OpenCL C is a very different language to GLSL, the memory model (among other things) is different, and so on. I don't see why video game developers should be forced to use OpenCL when they want to work with the outputs of OpenGL rendering passes, to produce inputs to OpenGL rendering passes, scheduled in OpenGL, to do things that don't fit neatly into vertex or fragment shaders?


On the contrary, for example on Android OpenGL ES you get compute shaders, but no OpenCL.

By the way, the latest version is OpenGL 4.6, it is also available on the Switch.


I realize the size of the ecosystem is important, but if you're writing all the GPU code yourself, I wouldn't say OpenCL is dead.

I recently had a chance to learn the basics for a work project, never having touched the field before. I picked OpenCL, because I knew I was writing all my non-BLAS code myself, and there's no way in hell I'll voluntarily lock myself into a closed ecosystem. (PS: CLBlast, which is different from CLblas, is a joy!)

I was pleasantly surprised. I found OpenCL very nice to work with indeed! And my code runs on any modern GPU out there. I've tested it on Intel integrated GPUs, AMD GPUs, and Nvidia's fancy datacenter devices. And even CPUs. Seamlessly, through a runtime switch fully controlled by the application itself!

Now, could I have gotten more performance out of CUDA? Yeah, I estimate about a factor 2. For the cost of tying myself to a proprietary, locked in technology from a hostile vendor, throwing out two major classes of devices, and losing the ability to test out code anywhere. Not worth it.

I hope OpenCL has life in it still. The stuff I keep reading that CUDA is far easier to approach definitely did not ring true to this beginner.


GPUs don't need to have SIMD instructions; if you give one a fully scalar program it just needs to run a lot of copies of it at once. Every architecture is different here, including within the same vendor.


> GPUs don't need to have SIMD instructions;

Except NVidia Ampere (RTX 3xxx series) and AMD RDNA2 (Navi / 6xxx series) are both SIMD architectures with SIMD-instructions.

And the #3 company: Intel, also has SIMD instructions. I know that some GPUs out there are VLIW or other weird architectures, but... the "big 3" are SIMD-based.

> if you give one a fully scalar program it just needs to run a lot of copies of it at once.

Its emulated on a SIMD processor. That SIMD processor will suffer branch-divergence as you traverse through if-statements and while-statements, because its physically SIMD.

The compiler / programming model is scalar. But the assembly instructions are themselves vector. Yeah, NVidia now has per-SIMD core instruction pointers. But that doesn't mean that the hardware can physically execute different instructions: they're still all locked together with SIMD-style at the physical level.


> with SIMD-instructions

You assume so, for Volta onwards that's not true.

Each SIMT thread on NVIDIA GPUs for Volta onwards has its own instruction pointer. And yes, it's scalar instructions at the ISA level.


> You assume so, for Volta onwards that's not true.

https://arxiv.org/pdf/1804.06826.pdf

NVidia's SASS on Volta is pretty clearly SIMD-instructions. The PTX "virtual machine assembly" is fully documented. SASS is not documented, but it ain't a secret either. SASS follows closely with PTX, with exception of some "barrier bits" that seem to track dependencies (probably compiler-managed read/write dependency chains).

> Each SIMT thread on NVIDIA GPUs for Volta onwards has its own instruction pointer. And yes, it's scalar instructions at the ISA level.

When your "scalar" instruction executes 32-threads in parallel (subject to an execution mask), that's called SIMD.

The instruction pointer is to resolve deadlock conditions with locks, and the 32-wide SIMD cores will execute one-at-a-time to prevent deadlocks. But your goal as a GPU programmer is to get as much 32-wide execution going on as possible.

1-at-a-time serialization is very, very bad for performance. Its POSSIBLE to do on Volta, but highly recommended you stay away from that cornercase (you lose over 95% of your performance).


Apart from tensor cores and some compressed instructions on small data types, they are not SIMD instructions.

Each individual SIMT lane can actually diverge from the other on the underlying hardware. Active threads are dynamically mapped to SIMT units. You can use __syncwarp() to force reconvergence, to stall until all the threads in a warp are at the same location.

The disassembled code targeting the underlying ISA for a given CUDA kernel is also easily accessible through nvdisasm.


A 32-wide SIMD core executing 16 or 8 or 1-thread at a time is just SIMD execution that's running at 50%, 25%, or 3% utilization.

See figure 22 and figure 23 in: https://images.nvidia.com/content/volta-architecture/pdf/vol...

Thread divergence is bad: very very bad. Running 2x, 4x, or 32x slower (or less parallel technically). You want to avoid those situations as much as possible.

What NVidia noticed is that thread divergence is necessary for many classical locking algorithms: where serialized code must execute one-at-a-time to run an algorithm correctly. Under these conditions, tracking the instruction pointer, and turning off 97% of your cores to execute 1-at-a-time (instead of 32-at-a-time SIMD) is done.

That's WHY __syncwarp() exists, so that you can return to 32-at-a-time execution as soon as possible. Its not always possible for the compiler to figure it out, so the programmer can put a __syncwarp() as a compiler hint that 32-at-a-time is safe again.


You claimed that SASS was a SIMD instruction set, which is false.

(through "NVidia's SASS on Volta is pretty clearly SIMD-instructions")

Yes, thread divergence on GPUs can come with performance downsides and is a tool to use carefully.


> You claimed that SASS was a SIMD instruction set, which is false.

Well, they execute 32-at-a-time SIMD, do they not?

There are tricks to split up the execution mask and execute 1-at-a-time, 2-at-a-time, 4, 8, 16, or 31 at a time based on if-statements or thread-locks or whatever. But a "SASS" instruction of "R1 = R2 + R3" is implicitly executed across a 32-wide warp if its execution mask is set as such.

----------

EDIT:

Lets put it this way: If you see a "SASS" assembly for R1 = R2 + R3 in isolation, how many adds take place on that clock tick?

Somewhere between 1 add, and 32-adds. No more than 32-at-a-time. Seems pretty SIMD to me.


There is no execution mask from the ISA perspective, each thread has its own instruction pointer, and the hardware deals with which threads get executed, instead of a mask being set/unset.

I don't know why you continue to argue this...


Saying there is no execution mask from the ISA perspective isn't quite true either: `__activemask` in CUDA translates to a simple fetch of that mask, and the Vulkan subgroup operations also access it - I believe subgroupBallot(true) is equivalent.

The way I would put this: the hardware is SIMD, and the number of operations executed per clock is the same as pure SIMD, but the independent instruction pointer per thread gives the scheduler considerably more flexibility, which is useful for avoiding deadlocks on synchronization, and also helpful for hiding memory access latency.

On top of that, there's an abstraction of a large number of scalar threads running on the hardware. It's a leaky abstraction, though, as performance really requires sympathy with the underlying SIMD reality, and also the subgroup operations also expose a good deal of the implementation details of that abstraction.


Look, lets say we have:

    if( threadIdx.x % 2 == 0){ // True for 16-threads
        doA(); // 100 clock ticks
    } else { // The other 16 threads
        doB(); // 150 clock ticks
    }
Lets say doA() takes 100 clock ticks, and doB() takes 150 clock ticks on NVidia Volta. How much time does the above code take to run?

Answer: 250 clock ticks: doA() is run with 16-threads, then doB() is run on the 16-other threads afterwards.

That's how SIMD systems work. If this were non-SIMD 32-core turing machine, it'd take 150 clock ticks (And the threads doing doA() would have spend 50-clock ticks doing something else). The most important thing about SIMD from a performance perspective is that you "add" both halves of the branch from a performance point of view.


Unlike a vanilla SIMD machine, you can diverge more cheaply when you need to. Scheduler flow control is different from a more conventional masks approach.

The underlying implementation is still just lane masking and walking down separate flow control branches sequentially of course.

The advantage of such a model however is that the complexity of this is abstracted away from even the compiler. This makes it a change in the programming model. That's why it's not _just_ named SIMD.

(I fear that this discussion went too far away arguing semantics... sigh)


> The advantage of such a model however is that the complexity of this is abstracted away from even the compiler. This makes it a change in the programming model. That's why it's not _just_ named SIMD.

That changes the history of the term SIMT. SIMT was first used for NVidia Tesla in 2006: over a decade older than the incremental changes made between Pascal -> Volta/Turing.

There has been no name change from Pascal -> Volta/Turing, as far as I'm aware. Furthermore, the PTX is substantially similar. The per-thread instruction pointers is pretty transparent in most code.

> The advantage of such a model however is that the complexity of this is abstracted away from even the compiler. This makes it a change in the programming model. That's why it's not _just_ named SIMD.

Have you looked at AMD's GPU ISA? Its just jump instructions, extremely similar to NVidia's PTX / SASS instruction set. AMD SIMD doesn't juggle execution masks explicitly either: its handled at a lower level (probably the decoder or something).

You still pull the execution mask for things like ballot instructions, but... both AMD and NVidia SIMD are pretty similar. (Similarly, NVidia PTX can still access the execution mask for ballot instructions as well)

------------

Although, I guess both of those GPUs can laugh at AVX512, where the execution masks are explicitly handled by the assembly programmer. But I don't know if explicit execution masks is necessarily a bad thing (its the job of the compiler instead of the decoder or whatever...)

> Unlike a vanilla SIMD machine, you can diverge more cheaply when you need to

If we take CM2 from 1985 as a "vanilla SIMD machine", it had execution masks and diverged extremely cheaply, just like modern machines.

C-Star and star-Lisp even had a programming model very similar to modern CUDA.

http://bitsavers.informatik.uni-stuttgart.de/pdf/thinkingMac...

Back then, you'd use "when" statements to do a parallel divergent branch, while "if" was only for uniform branches. But it wasn't like "when" statements were expensive, they just diverged.


SIMT became an abstraction that was lowered to SIMD at compile time to instead being the abstraction used when targeting the underlying hardware.

> Furthermore, the PTX is substantially similar

PTX is just a (forwards-compatible) intermediate representation.

> The per-thread instruction pointers is pretty transparent in most code

Except control flow divergence, which is what changes there, yes.

> Although, I guess both of those GPUs can laugh at AVX512, where the execution masks are explicitly handled by the assembly programmer. But I don't know if explicit execution masks is necessarily a bad thing (its the job of the compiler instead of the decoder or whatever...)

That's a very good question. AVX-512 with its masking abilities was substantial progress over AVX2.

For a CPU, having secondary instruction flows just for the vector units just isn't a (reasonable) option though.

If there wasn't the 10nm issues, the next Xeon Phi would have been very interesting on that front. You might also want to look at the Fujitsu A64fx on the Arm side, used in Fugaku. (building a supercomputer with just CPUs, no GPUs)

We'll see what will be there in the future... will certainly be very interesting.


That's partly true, but there are exceptions, of which the subgroup operations are the most obvious. These are roughly similar to broadcast and permute SIMD instructions, and in some cases can lead to dramatic speedups.


The Intel Gen GPU architecture, which includes the newest incarnation Gen12 aka Xe, is a SIMD architecture as opposed to Nvidia and AMD SIMT architecture. The reasons are historical, i.e. CPU centric design: x86 CPU, Larrabee, Xeon-Phi, etc.


The paper covers the reasons, and they aren't new or specific to Intel.

There was a talk by Andrew Lauritzen basically saying the same thing a few years back.

https://www.ea.com/seed/news/seed-siggraph2017-compute-for-g...

DX etc have been slowly exposing some of the underlying SIMD but it is still not really on the level of what is available in a full SIMD model like we have with AVX etc.

GPU languages like HLSL/GLSL are designed for ease of use and leave some performance on the ground.


The path forward seems to be CUDA style programming, it is what Optix does.

However I would rather prefer some kind of "SQL" for GPU programming.


ain't no one gonna use that.




Applications are open for YC Summer 2021

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: