I helped make a really cursed RISC-V version of this for a class project last year! The idea was to first compile each program to WASM using clang, then lower the WASM back to C, but this time with all opcodes implemented in terms of the RISC-V vector intrinsics. That was a hack to be sure, but a surprisingly elegant one, since
1. WASM's structured control flow maps really well to lane masking
2. Stack and local values easily use "structure of arrays" layout
3. Heap values easily use "array of structures" layout
In addition to ISPC, some of this is also done in software fallback implementations of GPU APIs. In the open source world we have SwiftShader and Lavapipe, and on Windows we have WARP[1].
It's sad to me that Larrabee didn't catch on, as that might have been a path to a good parallel computer, one that has efficient parallel throughput like a GPU, but also agility more like a CPU, so you don't need to batch things into huge dispatches and wait RPC-like latencies for them to complete. Apparently the main thing that sunk it was power consumption.
The recent "AI chip" proposals (Tenstorrent, Esperanto Technologies etc.) seem to be quite similar to the old Larrabee design, except based on RISC-V as opposed to x86. So we might see that happen after all.
Yes, I've got my eye on those and am hopeful. Do you know of any meaty technical description of the programming model? All I've been able to find so far is fairly high level marketing material. At least for Tenstorrent Jim Keller has promised that the software stack will be open sourced, something I'm looking forward to.
A colleague's Ph.D. thesis was on how to achieve high-performance CPU implementations of bulk-synchronous programming models ("GPU programming").
This so-called GPU programming model existed many decades before the first GPUs appeared, but at that time the compilers were not as good as the CUDA compilers, so the burden on the programmer was greater.
As another poster has already mentioned, there exists a compiler for CPUs which has been inspired by CUDA and which has been available for many years: ISPC (Implicit SPMD Program Compiler), at https://github.com/ispc/ispc .
NVIDIA has the very annoying habit of using a lot of terms that differ from those that had already been used in computer science for decades. The worst part is that NVIDIA has not invented new words, but has frequently reused words that were already widely used with other meanings.
SIMT (Single-Instruction Multiple Thread) is not the worst term coined by NVIDIA, but there was no need for yet another acronym. For instance they could have used SPMD (Single Program, Multiple Data), which dates from 1988, two decades before CUDA.
Moreover, SIMT is the same thing that was called an "array of processes" by C.A.R. Hoare in August 1978 (in "Communicating Sequential Processes"), "replicated parallel" in Occam in 1985, "PARALLEL DO" in OpenMP Fortran (October 1997), and "parallel for" in OpenMP C and C++ (October 1998).
Each so-called CUDA kernel is just the body of a "parallel for" (which is multi-dimensional, like in Fortran).
The only (but extremely important) innovation brought by CUDA is that the compiler is smart enough that the programmer does not need to know the structure of the processor, i.e. how many cores it has and how many SIMD lanes each core has. The CUDA compiler automatically distributes the work over the available SIMD lanes and cores, and in most cases the programmer does not care whether two executions of the per-item function happen on two different cores or on two different SIMD lanes of the same core.
This distribution of the work over SIMD lanes and over cores is simple when the SIMD operations are maskable, like in GPUs or in AVX-512 a.k.a. AVX10 or in ARM SVE. When masking is not available, like in AVX2 or Armv8-A, the implementation of conditional statements and expressions is more complicated.
SIMT and SIMD are different things. It's fortunate that they have different names.
A GPU is a single instruction multiple data machine. That's what the predicated vector operations are. 32 floats at a time, each with a disable bit.
Cuda is a single instruction multiple thread language. You write code in terms of one float and branching on booleans, as if it was a CPU, with some awkward intrinsics for accessing the vector units in the GPU.
That is, the programming model of a GPU ISA and that of Cuda are not the same. The GPU gives you vector instructions. Cuda gives you (mostly) scalar instructions and a compiler that deals with this mismatch, lowering branches to changes in exec mask and so forth.
With my numerical library hat on, I hate this. Programming a simd machine through a simt language means trying to get the compiler to transform the control flow into the thing you could easily write using vector instructions.
With my compiler implementer hat on, I hate this. It gives you two control flow graphs intertwined and a really bad time in register allocation.
It's not totally clear to me why simt won out over writing the vector operations. I'm certainly in the minority opinion here.
I think it boils down to which GPGPU camp you are in, AI or HPC.
AI has relatively simple workloads with little thread divergence, so the SIMT abstractions add very little value. HPC workloads, on the other hand, are a lot more complex. Writing a good simulation program, for example, would get inhumanly complex with just SIMD.
I guess that a lot of people are uncomfortable thinking about vector instructions and dealing with masks manually? And for vector instructions you need to align things properly and pad the arrays to the right size, which people are not used to, I guess.
I agree. I work on numerical simulation software that involves very sparse, very irregular matrices. We hardly use SIMD because of the challenges of maintaining predicates, bitmasks, mapping values into registers, and so on. It's not bad if we can work on dense blocks, but dealing with sparsity is a huge headache. Now that we are implementing these methods with CUDA using the SIMT paradigm, that complexity is largely taken care of. We still need to consider how to design algorithms to have parallelism, but we don't have to manage all the minutiae of mapping things into and out of registers.
Modern GPGPUs also have more hardware dedicated to this beyond the SIMD/SIMT models. In NVIDIA's CUDA programming model, besides the group of threads that represents a vector operation (a warp), you also have groups of warps (thread blocks) that are assigned to the same processor and can explicitly address a fast shared memory. Each processor has many registers that are automatically mapped to each thread, so that each thread has its own dedicated registers. Scheduling is done in hardware at the instruction level, so you effectively get single-cycle context switches between warps. Starting with Volta, it will even assemble vectors from threads in different warps in the same thread block, so lanes that are predicated off in a warp don't have to go to waste - they can take lanes from other warps.
There are many other hardware additions that make this programming model very efficient. Similar to how C and x86 each provide abstractions over the actual micro ops being executed that hides complexity like pipelining, out of order execution, and speculative execution, CUDA and the PTX ISA provide abstractions over complex hardware implementations that specifically benefit this kind of SIMT paradigm.
> It's not totally clear to me why simt won out over writing the vector operations.
From the user side, it is probably simpler to write an algorithm once without vectors, and have a compiler translate it to every vector ISA it supports, rather than to deal with each ISA by hand.
Besides, in many situations, having the algorithm executed sequentially or in parallel is irrelevant to the algorithm itself, so why introduce that concern?
> I'm certainly in the minority opinion here.
There are definitely more userland programmers than compiler/numerical library ones.
I think it's easier to write the kernel in terms of the scalar types if you can avoid the warp-level intrinsics. If you need to mix that with the cross-lane operations it gets very confusing. In the general case Volta requires you to pass the current lane mask into intrinsics, so you get to try to calculate, as a bitmask, what the compiler will turn your branches into.
So if you need/want to reason partly in terms of warps, I think the complexity is lower to reason wholly in terms of warps. You have to use vector types and that's not wonderful, but in exchange you get predictable control flow out of the machine code.
The argument is a bit moot, though, since right now you can't program either vendor's hardware using vectors, so you'd also need to jump the barrier to assembly. None of the GPUs are very easy to program in assembly.
The same is true for CUDA/OpenCL kernels. They can include conditionals and they can execute independent instructions.
The difference is that the software "kernels" (i.e. software threads) may be mapped by the compiler either onto hardware threads or onto hardware SIMD lanes, and this mapping is not controlled by the programmer. Divergent instructions cause inefficient (serial) execution when they happen to land on SIMD lanes of the same core, so they must be avoided.
This, however, is only an optimization problem: if speed were irrelevant, all the kernels could execute divergent instructions.
The reason for the existence of CUDA is to hide the SIMD lanes from the programmer and to allow programming as if the software threads mapped only to hardware threads. Nevertheless, for optimum performance the programmer must be aware that this abstraction is not really true, so programs should be written with awareness of the limitations introduced by SIMD.
On conventional SIMT implementations (pre-Volta), the programmer also has to be aware of it to not cause deadlocks in the atomics across different lanes in the same warp.
On NV Volta onwards, each SIMT lane has its own instruction pointer with opportunistic reconvergence when possible.
You've continued to post aggressively, attacking other users and generally posting in the flamewar style, even though we've asked you many times to stop. If this keeps up, we're going to have to ban you. I don't want to ban you, because you obviously have a lot of knowledge to share, but your aggressiveness destroys more than your knowledge contributes—a lot more, actually. Fortunately, there are plenty of other knowledgeable users here who treat their fellow users respectfully. Please be like them instead.
It's pointless trying to abide by your rules of decorum - I call a thing a rant, as if it isn't one, and that makes me aggressive? In reality you're just mad I pointed out your Kafkaesque moderation policies.
> Please be like them instead.
Be an arrogant martinet? No thanks. But I'll stop posting so congrats on keeping your community homogeneously obnoxious (and ignorant).
I don't remember anything you pointed out about moderation, and in any case we don't ban accounts for criticizing the mods.
This isn't a response to just one comment, let alone one word, but to the pattern of your comments in general: they stand out as nasty, abusive towards other commenters, and not in the intended spirit of the site; not just by a little, but a lot. It's not a borderline call.
there's a single questionable word in the flagged comment (the word `rant`) and it's not a borderline call? yup, totally makes sense. Like, do you even read the flagged comment, or do you just go by the fact that it offended someone?
Are you even reading the text you cite? That compares SIMD to SIMT. SIMD, not SPMD. SIMD is a more restrictive hardware paradigm; SPMD is just multiple copies of a single program running on multiple processes/threads, i.e., exactly what CUDA does.
Most MPI programs are SPMD, and really, the only conceptual difference between MPI and CUDA is that a thread's id is a point in a grid space that's relevant for scheduling and memory locality. Oh wait, that's true for MPI as well, only the scheduling and memory locality relate to CPU cores and nodes instead of CUDA's SMs and blocks.
The only real justification for a different programming model is that SIMT creates opportunities for implicit SIMD on a thread scheduling level if you've written your program in a way that the warps don't diverge.
"Please don't comment on whether someone read an article. "Did you even read the article? It mentions that" can be shortened to "The article mentions that.""
Yes I have read it enough times that I knew exactly where it was in the book.
> That compares SIMD to SIMT.
Yes and so does the op I responded to.
> SIMD, not SPMD. SIMD is a more restrictive hardware paradigm,
Thanks but I was already aware. The op was the one that conflated SPMD and SIMD, not me.
> SPMD is just a mutiple copies of a single program running on multiple processes/threads, i.e., exactly what CUDA does.
CUDA is a C language extension that enables you to write host and device code in the same source. That's it. SIMT is the model of compute on NVIDIA SMs.
> The only real justification for a different programming model is that SIMT creates opportunities for implicit SIMD on a thread scheduling level if you've written your program in a way that the warps don't diverge.
No that's incorrect. SIMT is implemented using predicated execution and instruction replay exactly so that warps can diverge.
Again, it's amazing you guys are so sure that NVIDIA pulled a fast one on just the whole community/industry and didn't actually innovate but it turns out you're actually missing some pieces of the puzzle.
Subgroup operations[1]. In SIMT, they are fairly easily modeled[2] as communication between the different threads running on the same SIMD unit, and in most cases they explicitly expose the predication mask. In fact, subgroupBallot(true) is a common idiom to extract that mask, and also to determine the subgroup size if run in subgroup-uniform control flow. Generally they have good performance characteristics.
In SPMD, subgroups aren't easy to model, and would generally be emulated by inter-thread communication if it's important to match the semantics of a source program that includes them. Performance in that case would be terrible.
[1] In Vulkan, OpenGL, OpenCL, and WebGPU they are called "subgroups". In Nvidia including CUDA they are called "warps". In D3D and AMD they are called "waves". In Metal (often running on the same hardware) they are called "simdgroups". This fragmentation of terminology is part of the flavor of working with GPUs.
[2] "Fairly easily" by the standards of GPU infrastructure. In fact, the exact semantics have never been nailed down, and in particular "reconvergence" is poorly defined. Working through this is one of the things blocking subgroups in WebGPU (https://github.com/gpuweb/gpuweb/issues/4306). Even so, they're used commonly in practice, especially for things like fast matrix multiplication where you need to shuffle lots of data into place.
This is true to some extent, although certainly SPMD has subgroups and reduction operations on those are fairly common.
And you are essentially repeating my point from before, that the fact that an operation among the SIMT threads can be scheduled as one SIMD op is the only special behavior. My example was regular vectorized SIMD ops, but your example of reductions is essentially the same point. Shuffles and horizontal adds are present in SIMT just like they are in regular SIMD.
These same reductions exist in contexts where the model would traditionally be called SPMD. But because the concepts map onto a different level of the system we might not easily recognize them as the same.
The same kind of subgrouping happens in a multinode SPMD context in terms of NUMA regions and nodes, and an MPI implementation can handle reductions on a within-node or within-NUMA-region group of ranks more efficiently than a group containing ranks assigned to multiple meaningful hardware regions.
So I agree with you that the one thing SIMT has to differentiate it from SPMD is the SIMD-like scheduling of multiple threads -- and this was exactly my point from before -- but even here we find that SPMD has the same concepts, just not as explicitly, because the stack typically used in SPMD is more flexible, and SIMT can take advantage of specialized hardware ops.
Optimizations like this do not necessarily mean that the model isn't SPMD though. It's still a single program (a kernel) processing multiple data (indexed by the thread id).
The one key idea of SIMT is really all about the batch scheduling of threads, the fact that multiple threads from a warp can be executed on one SM partition (using CUDA terms here) using a single instruction and a mask.
SIMT emphasizes the GPU scheduler, but it still fits completely within SPMD. All SIMT technologies are SPMD technologies, but not all SPMD technologies are SIMT technologies. I'm not convinced there is enough new in SIMT to warrant the branding using a new architecture type in the taxonomy. They could have just as well emphasized the thread scheduling as an optimization in an SPMD context.
and again, did you even read my reply? You're disagreeing by stating my point back to me. The point about divergence is that IF the threads DON'T diverge, you get SIMD scheduling: the GPU can schedule the equivalent of a SIMD op across the warp.
Obviously they can diverge, because this is SPMD/SIMT
> This is in contrast to SIMD, or "single instruction multiple data," where the programmer explicitly uses vector types and operations in their program. The SIMD approach is suited for when you have a single program that has to process a lot of data, whereas SIMT is suited for when you have many programs and each one operates on its own data
This statement is comparing the SIMT model to SIMD. Can anyone explain the last part about SIMT being better for many programs operating on its own data? Are they just saying you can have individual “threads” executing independently (via predication/masks and such)?
SIMT lets a scheduler get clever about memory accesses. SIMD can practically only access memory linearly (scatter/gather can do better, but it's still usually quite linear), whereas SIMT can be much smarter about keeping lots of similar bits of work in flight in ways that use the bandwidth maximally and don't overlap.
I don't think so. I was quite interested in getting emulated GPUs to work for CI workers to be able to test CUDA stuff without needing actual GPU passthrough, but all I could find in this space were abandoned, early stage hobby projects on GitHub.
No, QEMU only supports up to AVX2. Based on the time it took to implement AVX and AVX2, I expect it would be about a month of full-time work to implement AVX512F, AVX512BW and AVX512VL (it would not be much simpler to implement only one of them).
It never went anywhere but the code is still online if anyone wants to stare directly at the madness: https://gitlab.com/samsartor/wasm2simt