SIMD < SIMT < SMT: Parallelism in Nvidia GPUs (2011) (yosefk.com)
138 points by shipp02 10 months ago | 37 comments



I think the principal thing that has changed since this article was written is each category taking inspiration from the others.

For example, SIMD gained gather/scatter and even per-instruction masking for divergent flow (in AVX-512, which consumers rarely get to play with). These can really simplify writing explicit SIMD and make it more GPU-like.

Conversely, GPUs gained a much higher emphasis on caching, sustained divergent flow via independent program counters, and subgroup instructions which are essentially explicit SIMD in disguise.

SMT on the other hand... seems like it might be on the way out completely. While still quite effective for some workloads, it seems like quite a lot of complexity for only situational improvements in throughput.


The basic architecture still matters. GPUs still lose throughput upon divergence, regardless of their increased ability to run more kinds of divergent flows correctly thanks to separate PCs, and SIMD still has more trouble with instruction latency (including latency from bank conflict resolution in scatter/gather) than barrel-threaded machines, etc. This is not to detract from the importance of the improvements made to the base architectures over time.
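
To make the divergence cost concrete, here's a minimal sketch (my own hypothetical CUDA kernel, not from the article): odd and even lanes take different branches, so within a 32-wide warp the two paths are serialized, each running with roughly half the lanes masked off, even though independent PCs keep the semantics correct.

    // Hypothetical kernel, for illustration only: odd and even lanes take
    // different paths, so each path runs with ~half the warp masked off.
    __global__ void divergent(float* out, const float* in, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        if (threadIdx.x & 1)
            out[i] = in[i] * 2.0f;        // odd lanes active here
        else
            out[i] = sqrtf(in[i]) + 1.0f; // even lanes active here
    }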


agreed! The basic categories remain, just blurring a bit at the edges.


After primarily using AVX2, I don't think masked instructions and scatter/gather are particularly useful. Emulating masked computations with a blend is cheap. Emulating compress and some missing shuffles is expensive. Masked stores and loads don't really help with anything except for an edge case where they don't cause page faults on the part that was masked out.


On the GPU a masked-out load is a no-op, which is certainly better. And scatter functionality is probably quite painful to emulate without the intrinsics.


> How many register sets does a typical SMT processor have? Er, 2, sometimes 4

Way more of them. Pipelines are deep, and different in-flight instructions need different versions of the same registers.

For example, my laptop has an AMD Zen 3 processor. Each core has 192 scalar physical registers, while only ~16 general-purpose scalar registers are defined in the ISA. This gives 12 register sets; they are shared by both threads running on the core.

It's similar with SIMD vector registers: apparently each core has 160 32-byte vector registers. Because the AVX2 ISA defines 16 vector registers, this gives 10 register sets per core, again shared by the 2 threads.


I meant register sets visible to software. The fact that there's even more hardware not visible to software that you need for a thread to run fast just means that the cost of adding another thread is even higher


Depends on whether that hardware is shared between threads or not. Physical registers are usually shared.


I believe the author is referring to how many logical threads/hyperthreads a core can run (for AMD and Intel, two; I believe POWER can do 8 and SPARC 4).

The extra physical registers are there for superscalar execution, not for SMT/hyperthreading.


Quite interesting framing. A couple of things have changed since 2011:

- SIMD (at least Intel's AVX-512) does have usable gather/scatter, so "Single instruction, multiple addresses" is no longer a flexibility win for SIMT vs SIMD

- likewise for pervasive masking support and "Single instruction, multiple flow paths"

In general, I think of SIMD as more flexible than SIMT, not less, in line with this other post https://news.ycombinator.com/item?id=40625579. SIMT requires staying more towards the "embarrassingly" parallel end of the spectrum, while SIMD can be applied in cases where understanding the opportunity for parallelism is very non-trivial.


One of the other major things that's changed is that Nvidia now has independent thread scheduling (as of Volta, see [1]). That allows things like individual threads to take locks, which is a pretty big leap. Essentially, it allows you to program each individual thread as if it's running a C++ program, but of course you do have to think about the warp and block structure if you want to optimize performance.

I disagree that SIMT is only for embarrassingly parallel problems. Both CUDA and compute shaders are now used for fairly sophisticated data structures (including trees) and algorithms (including sorting).

[1]: https://developer.nvidia.com/blog/inside-volta/#independent_...


It's important that GPU threads support locking and control flow divergence, and I don't want to minimize that, but threads within a warp diverging still badly loses throughput, so I don't think the situation is fundamentally different in terms of what the machine is good/bad at. We're just closer to the base architecture's local maximum of capabilities, as one would expect for a more mature architecture; various things it could be made to support, it now actually supports, because there was time to add that support.


I intentionally said "more towards embarrassingly parallel" rather than "only embarrassingly parallel". I don't think there's a hard cutoff, but there is a qualitative difference. One example that springs to mind is https://github.com/simdjson/simdjson - afaik there's no similarly mature GPU-based JSON parsing.


I'm not aware of any similarly mature GPU-based JSON parser, but I believe such a thing is possible. My stack monoid work [1] contains a bunch of ideas that may be helpful for building one. I've thought about pursuing that, but have kept focus on 2D graphics as it's clearer how that will actually be useful.

[1]: https://arxiv.org/abs/2205.11659


> That allows things like individual threads to take locks, which is a pretty big leap.

Does anyone know how those get translated into SIMD instructions? Like, how do you do a CAS loop for each lane where each lane can individually succeed or fail? What happens if the lanes point to the same location?


There's a bit more information at [1], but I think the details are not public. The hardware is tracking a separate program counter (and call stack) for each thread. So in the CAS example, one thread wins and continues making progress, while the other threads loop.
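
For what it's worth, here is a rough sketch of the software side (my own example, not from the linked material; atomicCAS/atomicExch/__threadfence are the standard CUDA intrinsics, while the acquire/release/add_one names and lock layout are hypothetical). Each thread spins on its own CAS; when lanes target the same lock, exactly one CAS succeeds per round and the rest retry, which is exactly where the independent program counters matter.

    // Hedged sketch: per-thread spinlock on a device-side int (0 = free).
    __device__ void acquire(int* lock) {
        // Each thread loops until its own CAS wins; contenders see 1 and retry.
        while (atomicCAS(lock, 0, 1) != 0) { }
        __threadfence();      // make the previous owner's writes visible
    }

    __device__ void release(int* lock) {
        __threadfence();      // make this critical section's writes visible
        atomicExch(lock, 0);  // unlock
    }

    __global__ void add_one(int* lock, int* counter) {
        acquire(lock);        // threads in the same warp may all contend here
        *counter += 1;        // critical section, one thread at a time
        release(lock);
    }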

There seems to be some more detail in a Bachelor's thesis by Phillip Grote [2], with lots of measurements of different synchronization primitives, but it doesn't go too deep into the hardware.

[1]: https://arxiv.org/abs/2205.11659

[2]: https://www.clemenslutz.com/pdfs/bsc_thesis_phillip_grote.pd...


Thanks!


Last time I looked at Intel scatter/gather I got the impression it only works for a very narrow use case, and getting it to perform wasn't easy. Did I miss something?


The post says, about SIMT / GPU programming, "This loss results from the DRAM architecture quite directly, the GPU being unable to do much about it – similarly to any other processor."

I would say that for SIMD the situation is basically the same. gather/scatter don't magically make the memory hierarchy a non-issue, but they're no longer adding any unnecessary pain on top.


Barrel-threaded machines like GPUs have an easier time hiding the latency of bank conflict resolution when gathering/scattering against local memory/cache than a machine running a single instruction thread. So I'm pretty sure they have a fundamental advantage in scatter/gather throughput, one that gets bigger with a larger number of vector lanes.


vpgatherdd - I think that for newer CPUs it is faster than many loads + inserts, but if you are going to fault a lot, then it becomes slow.

> The VGATHER instructions are implemented as micro-coded flow. Latency is ~50 cycles.

https://www.intel.com/content/www/us/en/content-details/8141...


Modern GPUs are exposing the SIMD behind the SIMT model and heavily investing in SIMD features such as shuffles, votes, and reduces. This leads to an interesting programming model. One interesting challenge is that flow control is done very differently on different hardware. AMD has a separate scalar instruction pipeline which can set the SIMD mask. Apple uses an interesting per-lane stack counter approach, where a value of zero means that the lane is active and a non-zero value indicates how many blocks need to be exited for the thread to become active again. Not really sure how Nvidia does it.
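
To illustrate what "exposing the SIMD behind the SIMT model" looks like in practice, here's a small sketch in CUDA (my own example; __shfl_down_sync and __ballot_sync are the standard warp intrinsics, the wrapper functions are hypothetical): a warp-wide sum and a warp-wide vote written directly against the 32 lanes rather than the one-thread-per-lane fiction.

    // Hedged sketch: warp-level reduce and vote via CUDA warp intrinsics.
    __device__ float warp_sum(float v) {
        // Tree reduction across the 32 lanes of a warp; after the loop,
        // lane 0 holds the sum of all lanes.
        for (int offset = 16; offset > 0; offset >>= 1)
            v += __shfl_down_sync(0xffffffffu, v, offset);
        return v;
    }

    __device__ bool warp_any_negative(float v) {
        // Vote: each active lane contributes one bit to the ballot mask.
        return __ballot_sync(0xffffffffu, v < 0.0f) != 0u;
    }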


For a SIMD architecture that supports scatter/gather and instruction masking (like Arm SVE), could a compiler or language allow you to write "Scalar-style code" that compiles to SIMD instructions? I guess this is just auto-vectorization, but I'd be interested in explicit tagging of code regions, possibly in combination with restrictions on what operations are allowed.


Check out ISPC (SPMD on SIMD) by Matt Pharr; it's a very interesting take on this subject.


Yes, have a look at ISPC - it's amazing. I especially like that it can generate code for multiple architectures and then select the best implementation at runtime for the CPU it's running on.


Do you know any good tutorial for ISPC? Documentation is a bit sparse.


A couple of related questions:

- It has been claimed that several GPU vendors behind the covers convert the SIMT programming model (graphics shaders, CUDA, OpenCL, whatever) into something like a SIMD ISA that the underlying hardware supports. Why is that? Why not have something SIMT-like as the underlying HW ISA? Seems the conceptual beauty of SIMT is that you don't need to duplicate the entire scalar ISA for vectors like you need with SIMD, you just need a few thread control instructions (fork, join, etc.) to tell the HW to switch between scalar or SIMT mode. So why haven't vendors gone with this? Is there some hidden complexity that makes SIMT hard to implement efficiently, despite the nice high level programming model?

- How do these higher level HW features like Tensor cores map to the SIMT model? It's sort of easy to see how SIMT handles a vector: each thread handles one element of the vector. But if you have HW support for something like a matrix multiplication, what then? Or does each SIMT thread have access to a 'matmul' instruction, and all the threads in a warp that run concurrently can concurrently run matmuls?


It is the same reason you sometimes batch operations in software:

When you add two numbers, the GPU needs to do a lot more stuff besides the addition.

If you implemented SIMT by having multiple cores, you would need to do the extra stuff once per core, so you wouldn't save power (and you have a fixed power budget). With SIMD, you get $NUM_LANES additions, but you do the extra stuff only once, saving power.

(See this article by OP, which goes into more details: https://yosefk.com/blog/its-done-in-hardware-so-its-cheap.ht... )


How would you envision that working at the hardware level? GPUs are massively parallel devices; they need to keep the scheduler and ALU logic as simple and compact as possible. SIMD is a natural way to implement this. In the real world, SIMT is just SIMD with some additional capabilities for control flow and a programming model that treats SIMD lanes as threads of execution.

What’s interesting is that modern SIMT is exposing quite a lot of its SIMD underpinnings, because that allows you to implement things much more efficiently. A hardware-accelerated SIMD sum is way faster than adding values in shared memory.


> GPUs are massively parallel devices; they need to keep the scheduler and ALU logic as simple and compact as possible

The simplest hardware implementation is not always the most compact or the most efficient. This is a misconception; example below.

> SIMT is just SIMD with some additional capabilities for control flow ..

In the Nvidia uarch, it is not. The key part of the Nvidia uarch is the "operand collector" and the emulation of a multi-ported register file using SRAM (single- or dual-port) banking. In a classical SIMD uarch, you just retrieve the full vector from the register file and execute each lane in parallel. In the Nvidia uarch, each ALU has an "operand collector" that tracks and collects the operands of multiple in-flight operations. This enables reading from the register file in an asynchronous fashion (by "asynchronous" here I mean not all at the same cycle) without introducing any stalls.

When a warp is selected, the instruction is decoded, an entry is allocated in the operand collector of each ALU used, and the list of registers to read is sent to the register file. The register file dispatches register reads to the proper SRAM banks (probably with some queuing when read collisions occur). All operand collectors independently wait for their operands to come from the register file; when an operand-collector entry has received all the required operands, the entry is marked as ready and can then be selected by the ALU for execution.

That's why (or one of the reasons) you need to sync your threads in the SIMT programming model and not in a SIMD programming model.

Obviously you can emulate a SIMT uarch using a SIMD uarch, but I think that misses the whole point of the SIMT uarch.

Nvidia does all of this because it allows designing a more compact register file (memories with a high number of ports are costly) and probably because it helps make better use of the available compute resources with masked operations.


In an operand-collector architecture the threads are still executed in lockstep. I don't think this makes the basic architecture less "SIMD-y". Operand collectors are a smart way to avoid multi-ported register files, which enables a more compact implementation. Different vendors use different approaches to achieve a similar result: Nvidia uses operand collectors, Apple uses explicit cache control flags, etc.

> This enables reading from the register file in an asynchronous fashion (by "asynchronous" here I mean not all at the same cycle) without introducing any stalls.

You can still get stalls if an EU is available in a given cycle but not all operands have been collected yet. The way I understand the published patents is that operand collectors are a data gateway to the SIMD units. The instructions are already scheduled at this point and the job of the collector is to signal whether the data is ready. Do modern Nvidia implementations actually reorder instructions based on feedback from operand collectors?

> That's why (or one of the reasons) you need to sync your threads in the SIMT programming model and not in a SIMD programming model.

It is my understanding that you need to synchronize threads when accessing shared memory. Not only can different threads execute on different SIMDs, but threads on the same SIMD can also access shared memory over multiple cycles on some architectures. I do not see how thread synchronization relates to operand collectors.


> In an operand-collector architecture the threads are still executed in lockstep. [...] It is my understanding that you need to synchronize threads when accessing shared memory.

Not sure what you mean by lockstep here. When an operand-collector entry is ready, it is dispatched for execution as soon as possible (write arbitration aside), even if other operand-collector entries from the same warp are not ready yet (so not really what I would call "thread lock-step"). But it's possible that Nvidia enforces that all threads from a warp complete before issuing the next warp instruction (I would call that something like "instruction lock-step"). This can simplify data dependency hazard checks. But that's an implementation detail; it's not required by the SIMT scheme.

And yes, it's hard to expose de-synchronization without memory operations, so you only need to sync for memory operations. (The load/store units also have operand collectors.)

> You can still get stalls if an EU is available in a given cycle but not all operands have been collected yet

That's true, but you have multiple operand-collector entries to minimize the probability that no entry is ready. I should have said "to minimize bubbles".

> The way I understand the published patents is that operand collectors are a data gateway to the SIMD units. The instructions are already scheduled at this point and the job of the collector is to signal whether the data is ready. Do modern Nvidia implementations actually reorder instructions based on feedback from operand collectors?

Calling the EU a "SIMD unit" in a SIMT uarch adds a lot of ambiguity, so I'm not sure I understand your point correctly. But yes, the (warp) instruction is already scheduled; the (ALU) operations, however, are re-scheduled by the operand collector and its dispatch. In the Nvidia patent they mention the possibility of dispatching operations in an order that prevents write collisions, for example.


> Not sure what you mean by lockstep here. When an operand-collector entry is ready, it is dispatched for execution as soon as possible (write arbitration aside), even if other operand-collector entries from the same warp are not ready yet (so not really what I would call "thread lock-step"). But it's possible that Nvidia enforces that all threads from a warp complete before issuing the next warp instruction (I would call that something like "instruction lock-step"). This can simplify data dependency hazard checks. But that's an implementation detail; it's not required by the SIMT scheme.

Hm, the way I understood it is that a single instruction is executed on a 16-wide SIMD unit, thus processing 16 elements/threads/lanes simultaneously (subject to execution mask of course). This is what I mean by "in lockstep". In my understanding the role of the operand collector was to make sure that all register arguments are available before the instruction starts executing. If the operand collector needs multiple cycles to fetch the arguments from the register file, the instruction execution would stall.

So you are saying that my understanding is incorrect and that the instruction can be executed in multiple passes with different masks depending on which arguments are available? What is the benefit as opposed to stalling and executing the instruction only when all arguments are available? To me it seems like the end result is the same, and stalling is simpler and probably more energy efficient (if EUs are power-gated).

> But, yes (warp) instruction is already scheduled, but (ALU) operation are re-scheduled by the operand-collector and it's dispatch. In the Nvidia patent they mention the possibility to dispatch operation in an order that prevent write collision for example.

Ah, that is interesting, so the operand collector provides a limited reordering capability to maximize hardware utilization, right? I must have missed that bit in the patent, that is a very smart idea.

> But it's possible that Nvidia enforces that all threads from a warp complete before issuing the next warp instruction (I would call that something like "instruction lock-step"). This can simplify data dependency hazard checks. But that's an implementation detail; it's not required by the SIMT scheme.

Is any existing GPU actually doing superscalar execution from the same software thread (I mean the program thread, i.e., warp, not a SIMT thread)? Many GPUs claim dual-issue capability, but that either refers to interleaved execution from different programs (Nvidia, Apple), a SIMD-within-SIMT arrangement, or maybe even a form of long instruction word (AMD). If I remember correctly, Nvidia instructions contain some scheduling information that tells the scheduler when it is safe to issue the next instruction from the same wave after the previous one started execution. I don't know how others do it, probably via some static instruction timing information. Apple does have a very recent patent describing dependency detection in an in-order processor; no idea whether it is intended for the GPU or something else.

> you have multiple multiple operand-collector entry to minimize the probability that no entry is ready. I should have say "to minimize bubbles".

I think this is essentially what some architectures describe as the "register file cache". What is nice about Nvidia's approach is that it seems to be fully automatic and can really make the best use of a constrained register file.


> The way I understood it is that a single instruction is executed on a 16-wide SIMD unit, thus processing 16 elements/threads/lanes simultaneously (subject to execution mask of course). This is what I mean by "in lockstep".

OK, I see; that's definitely not what I understood from my study of the Nvidia SIMT uarch. And yes, I will claim that "the instruction can be executed in multiple passes with different masks depending on which arguments are available" (using your words).

> So the operand collector provides a limited reordering capability to maximize hardware utilization, right?

Yes, that's my understanding, and that's why I claim it's different from "classical" SIMD.

> What is the benefit as opposed to stalling and executing the instruction only when all arguments are available?

That's a good question. Note that I think the Apple GPU uarch does not work like the Nvidia one; my understanding is that the Apple uarch is way closer to a classical SIMD unit. So it's definitely not a killer to move away from the original SIMT uarch from Nvidia.

That said, I think the SIMT uarch from Nvidia is way more flexible and better maximizes hardware utilization (executing instructions as soon as possible always helps utilization). And say you have 2 warps with complementary masking: with Nvidia's SIMT uarch it is natural to issue both warps simultaneously, and they can be executed in the same cycle within different ALUs/cores. With a classical SIMD uarch it may be possible, but you need extra hardware to handle warp execution overlapping, and even more hardware to enable overlapping more than 2 threads.

Also, Nvidia's operand collector allows emulating a multi-ported register file, which probably helps with register sharing. There are actually multiple patents from Nvidia about non-trivial register allocation within the register-file banks, depending on how the registers will be used, to minimize conflicts.

> Is any existing GPU actually doing superscalar execution from the same software thread (I mean the program thread, i.e., warp, not a SIMT thread)?

It's not obvious what "superscalar" would mean in a SIMT context. For me, a superscalar core is a core that can extract instruction parallelism from sequential code (associated with a single thread) and therefore dispatch/issue/execute more than 1 instruction per cycle per thread. With SIMT, most of the instruction parallelism is very explicit (as thread parallelism), so it's not really "extracted" (and not from the same thread). But anyway, if your question is whether multiple instructions from a single warp can be executed in parallel (across different threads), then I would say probably yes for Nvidia (not sure, there is very little information available...); at least 2 instructions from the same thread block (from the same program, but different warps) should be able to be executed in parallel.

> I think this is essentially what some architectures describe as the "register file cache"

I'm not sure about that; there are actually some published papers (and probably some patents) from Nvidia about a register-file cache for the SIMT uarch. And those came after the operand-collector patent. But in the end it really depends what concept you are referring to with "register-file cache".

In the Nvidia case, a "register-file cache" is a cache placed between the register file and the operand collector. And it makes sense in their case since the register file has variable latency (depending on collisions) and because it saves SRAM read power.


> Yes, that's my understanding, and that's why I claim it's different from "classical" SIMD.

I understand, yes, it makes sense. Of course, other architectures can make other optimizations, like selecting warps that are more likely to have data ready etc., but Nvidia's implementation does sound like a very smart approach

> And say you have 2 warps with complementary masking: with Nvidia's SIMT uarch it is natural to issue both warps simultaneously, and they can be executed in the same cycle within different ALUs/cores

That is indeed a powerful technique

> It's not obvious what "superscalar" would mean in a SIMT context. For me, a superscalar core is a core that can extract instruction parallelism from sequential code (associated with a single thread) and therefore dispatch/issue/execute more than 1 instruction per cycle per thread.

Yes, I meant executing multiple instructions from the same warp/thread concurrently, depending on the execution granularity of course. Executing instructions from different warps in the same block is slightly different, since warps don't need to be at the same execution state. Applying the CPU terminology, a warp is more like a "CPU thread". It does seem like Nvidia indeed moved quite far in the SIMT direction, and their threads/lanes can have independent program state. So I think I can see the validity of your argument that Nvidia can remap SIMD ALUs on the fly to suitable threads in order to achieve high hardware utilization.

> In the Nvidia case, a "register-file cache" is a cache placed between the register file and the operand collector. And it makes sense in their case since the register file has variable latency (depending on collisions) and because it saves SRAM read power.

Got it, thanks!

P.S. By the way, wanted to thank you for this very interesting conversation. I learned a lot.


This type of parallelism is sort of like a flops metric. Optimizing the amount of wall time the GPU is actually doing computation is just as important (if not more). There are some synchronization and pipelining tools in CUDA and Vulkan but they are scary at first glance.


> Programmable NVIDIA GPUs are very inspiring to hardware geeks, proving that processors with an original, incompatible programming model can become widely used.

Got me laughing at the first line.



