How simultaneous multithreading works under the hood (codingconfessions.com)
334 points by rbanffy 3 months ago | 129 comments



A grossly over-simplified argument for SMT that resonated with me was that it could keep a precious ALU busy while a thread stalls on a cache miss.

I gather that in the early days the LPDDR used on laptops was slower too, and since cores were scarce, this was more valuable there. Lately, though, we often have more cores than we can scale with and the value is harder to appreciate. We even avoid scheduling work on a core shared with an important thread, to avoid cache contention, because we know the single-threaded performance will be the bottleneck.

A while back I was testing Efficient/Performance cores and SMT cores for MT rendering with DirectX 12; on my i7-12700K I found no benefit to either: just using P-cores took about the same time to render a complex scene as P+SMT and P+E+SMT. It's not always a wash, though: on the Xbox Series X we found the same test marginally faster when we scheduled work for SMT too.
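
For concreteness, here is a minimal sketch of that kind of comparison (not the original DirectX 12 test): busy_work and the thread counts are placeholders, and a real P-core vs. E-core comparison also needs explicit thread affinity, which this omits.

  // Time the same total amount of work split across k worker threads,
  // for several values of k (e.g. P-cores only, P + SMT, all logical CPUs).
  #include <chrono>
  #include <cstdio>
  #include <thread>
  #include <vector>

  // Placeholder workload; the real test rendered a complex scene instead.
  static void busy_work(std::size_t iters) {
      volatile double x = 1.0;
      for (std::size_t i = 0; i < iters; ++i) x = x * 1.0000001 + 1e-9;
  }

  static double run_with_threads(unsigned k, std::size_t total_iters) {
      auto start = std::chrono::steady_clock::now();
      std::vector<std::thread> pool;
      for (unsigned t = 0; t < k; ++t)
          pool.emplace_back(busy_work, total_iters / k);
      for (auto& th : pool) th.join();
      return std::chrono::duration<double>(
          std::chrono::steady_clock::now() - start).count();
  }

  int main() {
      const std::size_t total = 2'000'000'000;
      for (unsigned k : {8u, 16u, 20u})   // e.g. 8 P-cores, P+SMT, P+E+SMT
          std::printf("%u threads: %.2fs\n", k, run_with_threads(k, total));
  }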


Rendering is one of the scenarios that has been either the same or slower with SMT since the beginning. This is because rendering is already math heavy, and your FPU is always active, especially the dividers (division is always the most expensive operation for processors).

SMT shines while waiting for I/O or doing some simple integer stuff. If both your threads can saturate the FPU, SMT is generally slower because of the extra tagging added to the data inside the CPU to note what belongs where.


But the way you make rendering embarrassingly parallel is the way you make web servers parallel; treat the system as a large number of discrete tasks with deadlines you work toward and avoid letting them interact with each other as much as possible.

You don’t worry about how long it takes to render one frame of a digital movie, you worry about how many CPU hours it takes to render five minutes of the movie.


Yes, however in an SMT-enabled processor there is one physical FPU per two logical cores. The FPU is already busy with the other thread's work, so the threads on that SMT-enabled core take turns for their computation in the FPU, creating a bottleneck.

As a result, you get no speed boost in the best case, and you lose some time in the worst case.

Since SMT doesn't magically increase the number of FPUs available in a processor, if what you're doing is math heavy, SMT just doesn't help. The same is true for scientific simulation. I observed the same effect, and verified that saturating the FPU with a single thread indeed makes SMT moot.


If you have contention around a single part of the CPU then yes, SMT will not help you. The single FPU was an issue on the first Niagara processor as well, but it still had great throughput per socket unless all processes were fighting for the FPU.

If, however, you have multiple FPUs on your processor, then it might be useful to enable SMT. As usual, it pays to tune hardware to the workload you have. For integer-heavy workloads, you might prefer SMT (there are options for up to 8 threads per physical core out there) up to the point where either cache misses or backend exhaustion become the bottleneck.


Current processors contain one FPU per core. When you have a couple of FPU-heavy programs in a system, SMT makes sense, because it allows you to keep the FPU busy while other, lighter threads play in the sand elsewhere.

OTOH, when you run a HPC node, everything you run wants the FPU, Vector Units and the kitchen sink in that general area. Enabling SMT makes the queue longer, and nothing is processed faster in reality.

So, as a result, SMT makes sense sometimes, and is detrimental to performance in other times. Benchmarking, profiling and system tuning is the key. We generally disable SMT in our systems because it lowers performance when the node is fully utilized (which is most of the time).


I'm not really sure why you say "one FPU per core". are you talking about the programmer-visible model? all current/mainstream processors have multiple and parallel FP FUs, with multiple FP/SIMD instructions in flight. afaik, the inflight instrs are from both threads if SMT is enabled (that is, tracked like all other uops). I'm also not sure why you say that enabling SMT "makes the queue longer" - do you mean pipeline? Or just that threads are likely to conflict?


Yes, but well-optimized math-heavy software will already max out the superscalarity of the FPU, i.e. one CPU thread can already schedule multiple FPU-heavy instructions at the same time. If you run such software twice on the same FPU you will only gain overhead. I guess by queue he meant the processor-internal work queue; the processor pipeline is only half of the picture. Processors keep a small data-dependency graph of micro-instructions they have to perform, which is used to implement the machine-code instructions that are currently in flight.
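
A minimal sketch of that point: a single dependent chain leaves FP units idle (slack a sibling SMT thread could use), while one thread with a few independent accumulators can already keep several FP pipelines busy on its own. The four-accumulator count is just illustrative, and a compiler may do this unrolling or vectorization itself with the right flags.

  #include <vector>

  // One dependency chain: each add must wait for the previous one to finish.
  double sum_dependent(const std::vector<double>& v) {
      double s = 0.0;
      for (double x : v) s += x;
      return s;
  }

  // Four independent chains: a single thread can keep several FP adders busy.
  double sum_independent(const std::vector<double>& v) {
      double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
      std::size_t i = 0, n = v.size();
      for (; i + 4 <= n; i += 4) {
          s0 += v[i]; s1 += v[i + 1]; s2 += v[i + 2]; s3 += v[i + 3];
      }
      for (; i < n; ++i) s0 += v[i];
      return (s0 + s1) + (s2 + s3);
  }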


> I guess by queue he meant the processor internal work queue...

Yes, I meant the internal one. Also, when you enable SMT, a small tag is added in front of every instruction, noting which logical core owns this instruction for a given physical core. So instead of tagging every instruction with a core-ID, you add a longer tag in the form of core-ID/logical_core-ID.

This extra tagging also makes instructions bigger, so the queue can hold fewer instructions, adding fuel to the already chaotic and choked FPU logistics.

As a result, if you're saturating your FPU(s), SMT can't save you. In fact, it can make you slower.


If you're waiting for IO, you're likely getting booted off the processor by the OS anyway. SMT is most useful when your code doesn't have enough instruction-level parallelism but is still mostly compute bound.


I believe "I/O" here is referring to data movement between DRAM and registers. Not drives or NICs.


Yes, exactly. One exception can be InfiniBand, since it can put the received data into RAM directly, without CPU intervention.


DMA is a much older technology. It's just that at some point you do need the CPU to actually look at it.


InfiniBand uses RDMA, which is different from ordinary DMA. Your IB card sends the data to the client point to point, and the receiving IB card writes it directly to RAM. The IB driver notifies you that the data has arrived (generally via IB-accelerated MPI), and you directly LOAD your data from the memory location [0].

IOW, your data magically appears in your application's memory, at the correct place. This is what makes Mellanox special, and what led NVIDIA to acquire them.

From the linked document:

Instead of sending the packet for processing to the kernel and copying it into the memory of the user application, the host adapter directly places the packet contents in the application buffer.

[0]: https://docs.redhat.com/en/documentation/red_hat_enterprise_...


Linux has had zero copy network support for 15 years. No magic.


It's not "zero copy networking" only.

In an IB network, two cards connect point to point over the switch and "beam" one node's RAM contents to the other. On top of that, with accelerated MPI, certain operations are offloaded to the IB cards and IB switches (like broadcast, sum, etc.), so the MPI library running on the host doesn't have to handle or worry about these operations, leaving time and processor cycles for the computation itself.

This is the magic I'm talking about.
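
For concreteness, this is roughly the kind of collective being described, sketched with plain MPI from C++; the application code is the same whether or not the fabric offloads the reduction to the NICs/switches, which depends on the interconnect and MPI implementation rather than on anything shown here.

  #include <mpi.h>
  #include <vector>

  int main(int argc, char** argv) {
      MPI_Init(&argc, &argv);

      std::vector<double> local(1024, 1.0), global(1024, 0.0);

      // Element-wise sum across all ranks; every rank receives the result.
      // On an offload-capable fabric this reduction can run in the network.
      MPI_Allreduce(local.data(), global.data(),
                    static_cast<int>(local.size()),
                    MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

      MPI_Finalize();
      return 0;
  }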


IB didn't invent RDMA, and it's not even the only way to do it today.

it's also not amazingly great, since it only solves a small fraction of the cluster-communication problem. (that is, almost no program can rely on magic RDMA getting everything where it needs to be - there will always be at least some corresponding "heavyweight" messaging, since you still need locks and other synchronization.)


I’ve used other peripherals that did this. Under the hood you would have a virtual mapping to a physical address and extent, where the virtual mapping is in the address space of your process. This is how DMA works in QNX, because drivers are userspace processes. The special thing here is essentially doing the math in the same process as the driver.

I agree that sounds very nice for distributed computation.


> The special thing here is essentially doing the math in the same process as the driver.

No, you're doing MPI operations on the switch fabric and the IB ASIC itself. The CPU doesn't touch these operations; it only sees the result. NVIDIA's DPU is just a more general-purpose version of this.


Intel’s hyperthreading is really a write pipe hack.

It’s not so much cache misses as allowing the core to run something else while the write completes.

This is why some code scales poorly and other code achieves near linear speed ups.


Why would the core have to wait for the write to complete?

A core stalls on a write only if the store buffer is full. As hyperthreads share the store buffer, SMT makes store stalls more likely, not less (but they are still unlikely to be the bottleneck).


At this point, especially with backside power delivery, I wonder how much cache stalls on one core result in less thermal throttling, both on that core and on neighboring ones.

Maybe we should just be letting these procs take their little naps?


This leads, in the extreme, to the idea of a huge array of very simple cores, which I believe is something that has been tried but never really caught on.


That description reminds me of GreenArrays' (https://www.greenarraychips.com) Forth chips that have 144 cores – although they call them "computers" because they're more independent than regular CPU cores, and eg. each has its own memory and so on. Each "computer" is very simple and small – with a 180nm geometry they can cram 8 of them in 1mm^2, and the chip is fairly energy-efficient.

Programming for these chips is apparently a bit of a nightmare, though. Because the "computers" are so simple, even eg. calculating MD5 turns into a fairly tricky proposition, as you have to spread the algorithm across multiple computers with very small amounts of memory, so something that would be very simple on a more classic processor turns into a very low-level multithreaded ordeal.


Worth noting that the GreenArrays chip is 15 years old. 144 cores was a BIG DEAL back then. I wonder what a similar architecture compiled with a modern process could achieve. 1440 cores? More?


those weren't "real" cores. you know what current chip has FUs that it falsely calls "cores"? that's right, Nvidia GPUs. I think that's the answer to your question (pushing 20k).


In what way were they not “real” cores? They had their own operating environment completely independent of other cores. GPU execution units on the other hand are SIMD--a single instruction stream.


How are GA144's nodes / "computers" not real cores? They're fully independent, each has its own memory (RAM and ROM), stacks, and registers, its own I/O ports and GPIO pins (some of them), and so on.

https://www.greenarraychips.com/home/documents/greg/PB003-11...


it's a very interesting product - sort of the lovechild of a 1980s bitsliced microcode system optimized for application-specific pipelines (systolic array, etc).

do you know whether it's had any serious design wins? I can easily imagine it being interesting for missile guidance, maybe high-speed trading, password cracking/coin mining. you could certainly do AI and GPUs with it, but I'm not sure it would have an advantage, even given the same number of transistors and memory resources.


I think they were more meant to minimize power use than be very performant. Honestly no clue how widely the GA144 is used in real-world applications, but I guess that since the company still exists they have to be getting money from somewhere


Calling all TIS-100 fans


Sounds like gpu to me.


The Xeon Phi was a “manycore” x86 design with lots of tiny CPU cores, something like the original Pentium, but with the addition of 512-bit SIMD and hyperthreading:

https://en.m.wikipedia.org/wiki/Xeon_Phi


IIRC the first Phi had SMT4 in a round-robin fashion, similar to the Cell PPUs. To make a core run at full speed, you had to schedule 4 threads on it.


The very, very first Phi still had its ROPs and texture units, being essentially a failed GPU with identifying marks filed off (yes, the first units were Larrabee prototypes with video outputs unpopulated)


> the first units were Larrabee prototypes with video outputs unpopulated

Such a shame. I'd love to base a workstation on those.

Seems to be a hobby I never quite act upon - to misuse silicon and make it work as a workstation.


GPUs are actually SMT'd to the extreme. For example, Intel's Xe-HPG has 8-wide SMT. Other vendors have even bigger SMT: RDNA2 can have up to 16 threads in flight per core.


Transputers.


>I gather in the early days the LPDDR used on laptops was slower too

Oddly, the latency hasn't improved too much - CAS latency is often 5-10ns across DDR2/3/4/5. Bus width, transfers per second, queueing, and power per bit transferred and stored have improved, but if a program depends on something that's not in cache and was poorly predicted, RAM latency is the issue.


I wonder if instead of having SMT, processors could briefly power off the unused ALUs/FPUs while waiting for something further up the pipeline, and focus on reducing heat and power consumption rather than maximizing utilization.


They basically do: it's pretty common to clock gate inactive parts of the ALU, which reduces their power consumption greatly. Modern processor power usage is very workload-dependent for this reason.


I consider SMT a relic left over from the days when CPU design was all about performance per square millimeter. We are in the process of substituting that goal with that of performance per watt, or in the process of slowly realizing that our goals have shifted quite a while ago.

I really don't expect SMT to stay much longer. Even more so with timing visibility crosstalk issues lurking and big/small architectures offering more parallelism per chip area where single thread performance isn't in the spotlight. Or perhaps the marketing challenge of removing a feature that had once been the pride of the company is so big that SMT stays forever.


Intel is removing SMT from their next gen mobile processor.

My guess is this will help them improve ST perf. We will see how well it works, and whether AMD will follow.


Could you, do they, put the “extra” LUs right next to the parts of the chip with the highest average thermal dissipation to even out the thermal load across the chip?

Or stack them vertically, so the least consistently used parts of the chip are farthest away from the heat sink, delaying throttling.


Intel has internal papers that investigated the use of the third dimension and the effect it would have on power consumption and performance. Of course it improves things, but it is very difficult to implement in the real world. The first real use of this technique by Intel is coming soon in the form of backside power delivery.

AMD's 3D V-Cache technology shows that stacking an additional layer of transistors has a significant effect on the thermal limits of a modern CPU. The extra cache is strategically placed over parts of the CPU die that use less power, yet those CPUs still have to run at lower temperatures and power settings compared to their non-V-Cache models. Just because you can build it doesn't mean that it will be a good fit for the mass market.


> on the Xbox Series X we found the same test marginally faster when we scheduled work for SMT too.

This makes sense: the Series X has GDDR RAM, and so has substantially worse latency than DDR/LPDDR. SMT can help cover that latency, and the higher GDDR bandwidth mitigates the extra memory bandwidth needed to feed both threads.


Anecdotally, mkp224o (.onion vanity address miner, supposedly compute-bound with little memory access) runs about 5-10% faster on my 24-core AMD with 48 threads than with 24 threads. However, I haven't tried the same benchmark with SMT disabled in firmware.


Intel’s next generation Arrow Lake CPUs are supposed to remove hyperthreading (i.e. SMT) completely.

The performance gains were always heavily application-dependent, so maybe it’s better to simplify.

Here’s a recent discussion of when and where it makes sense: https://news.ycombinator.com/item?id=39097124


Most programs end up with some limit on the number of threads they can reasonably use. When you have far fewer cores than that, SMT makes a lot of sense to better utilise the resources of the CPU. However, once you get to the point where you have enough cores, SMT no longer makes any sense. I am not convinced we are necessarily there yet, but the P/E cores Intel is using are an alternative route towards a similar goal, and they make a lot of sense on the desktop given how many workloads are single- or low-threaded. I can see the value in not having to deal with both SMT and E-core distinctions in application optimisation.

AMD, on the other hand, intends to keep mostly homogeneous cores for now and continue to use SMT. I doubt it's going to be simple to work out which strategy is best in practice; it's going to vary widely by application.


It is my understanding that SMT should be beneficial regardless of core count, as it lets two threads that stall waiting for memory fetches fully utilize a single ALU; i.e. SMT improves ALU utilization in memory-bound applications with multiple threads by interleaving ALU usage while each thread is waiting on memory. Maybe larger caches are reducing the benefits of SMT, but it should be beneficial as long as there are many threads that are generally bound by memory latency.


In a CPU with many cores, when some cores stall by waiting for memory loads, other cores can proceed by using data from their caches and this is even more likely to happen than for the SMT threads that share the same cache memory.

When there are enough cores, they will keep the common memory interface busy all the time, so adding SMT is unlikely to increase the performance in a memory-throughput limited application when there already are enough cores.

Keeping busy all the ALUs in a compute-limited application can usually be done well enough by out-of-order execution, because the modern CPUs have very big execution windows from which to choose instructions to be executed.

So when there already are many cores, in many cases SMT may provide negligible advantages. On server computers there are many more opportunities for SMT to improve efficiency, but on non-server computers I have encountered only one widespread application for which SMT is clearly beneficial, which is the compilation of big software projects (i.e. with thousands of source files).

The big cores of Intel are optimized for single-thread performance. This optimization criterion results in bad multi-threaded performance. The reason is that the MT performance is limited by the maximum permissible chip area and by the maximum possible power consumption. A big core has very poor performance per area and performance per power ratios.

Adding SMT to such a big core improves the multi-threaded performance, but it is not the best way of improving it, because in the same area and power consumption used by a big core one can implement 3 to 5 efficient cores, so replacing a big core with multiple efficient cores increases the multi-threaded performance much more than adding SMT. So unlike in a CPU that uses only big cores, in hybrid CPUs SMT does not make sense, because better MT performance is obtained by keeping only a few big cores, to provide high single-thread performance, and by replacing the other big cores with smaller, more efficient cores.


> Maybe larger caches are reducing the benefits of SMT, but it should be beneficial as long as there are many threads who are generally bound by memory latency.

I thought the reason SMT sometimes resulted in lower performance was that it halved the available cache per thread though - shouldn't larger caches make SMT more effective?


My understanding is that a larger cache can make SMT more effective, but like usual, only in certain cases.

Let’s imagine we have 8 cores with SMT, and we’re running a task that (in theory) scales roughly linearly up to 16 threads. If each thread’s working set is around half of the cache available to that thread, but each working set is only used briefly, then SMT is going to be hugely beneficial: while one hyperthread is committing and fetching memory, the other one’s cache is already filled with a new working set and it can begin computing. Increasing cache will increase the allowable working set size without causing cache contention between hyperthreads.

Alternatively, if the working set is sufficiently large per thread (probably >2/3 the amount of cache available), SMT becomes substantially less useful. When the first hyperthread finishes its work, the second hyperthread has to still wait for some (or all) of its working set to be fetched from main memory (or higher cache levels if lucky). This may take just as long as simply keeping hyperthread #1 fed with new working sets. Increasing cache in this scenario will increase SMT performance almost linearly, until each hyperthread’s working set can be prefetched into the lowest cache levels while the other hyperthread is busy working.

Also consider the situation where the working set is much, much smaller than the available cache, but lots of computing must be done to it. In this case, a single hyperthread can continually be fed with new data, since the old set can be purged to main memory and the next set can be loaded into cache long before the current set is processed. SMT provides no benefit here no matter how large you grow the cache (unless the tasks use wildly different components of the core and they can be run at instruction-level parallelism - but that’s tricky to get right and you may run into thermal or power throttling before you can actually get enough performance to make it worthwhile).

Of course the real world is way more complicated than that. Many tasks do not scale linearly with more threads. Sometimes running on 6 “real” cores vs 12 SMT threads can result in no performance gain, but running on 8 “real” cores is 1/3 faster. And sometimes SMT will give you a non-linear speedup but a few more (non-SMT) cores will give you a better (but still non-linear) speedup. So short answer: yes, sometimes more cache makes SMT more viable, if your tasks can be 2x parallelized, have working sets around the size of the cache, and work on the same set for a notable chunk of the time required to store the old set and fetch the next one.

And of course all of this requires the processor and/or compiler to be smart enough to ensure the cache is properly fed new data from main memory. This is frequently the case these days, but not always.


Let's say your workload consists solely of traversing a single linked list. This list fits perfectly in L1.

As an L1 load takes 4 cycles and you can't start the next load until you have completed the previous one, the CPU will stall, doing nothing, for 3/4 of the cycles. A 4-way SMT could in principle make use of all the wasted cycles.

Of course no load is even close to purely traversing a linked list, but a lot of non-HPC real-world loads do spend a lot of time in latency-limited sections that can benefit from SMT, so it is not just cache misses.
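
A minimal sketch of the latency-limited loop described above (the 4-cycle L1 figure is the parent's assumption): each load's address depends on the previous load's result, so the core idles between loads even when everything hits in L1, and a sibling SMT thread can use those cycles.

  struct Node {
      Node* next;
      long  payload;
  };

  long walk(const Node* head) {
      long sum = 0;
      // The next iteration cannot start until n->next has been loaded,
      // so throughput is bounded by load latency, not by ALU capacity.
      for (const Node* n = head; n != nullptr; n = n->next)
          sum += n->payload;
      return sum;
  }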


> so it is not just cache misses.

Agreed 100%. SMT is waaaay more complex than just cache. I was just trying to illustrate in simple scenarios where increasing cache would and would not be beneficial to SMT.


Depends greatly on the work load.


I’d like to see the math for why it doesn’t work out to have a model where n real cores share a set of logic units for rare instructions and a few common ones - where, say, the average number of instructions per clock is 2.66, so four cores each have 2 apiece and then share 3 between them.

When this whole idea first came up that’s how I thought it was going to work, but we’ve had two virtual processors sharing all of their logic units in common instead.


It is difficult to share an execution unit between many cores, because in that case it must be placed at a great distance from at least the majority of those cores.

The communication between distant places on a chip requires additional area and power consumption and time. It may make the shared execution unit significantly slower than a non-shared unit and it may decrease the PPA (performance per power and area) of the CPU.

Such a sharing works only for a completely self-contained execution unit, which includes its own registers and which can perform a complex sequence of operations independently of the core that has requested it. In such cases the communication between a core and the shared unit is reduced to sending the initial operands and receiving the final results, while between these messages the shared unit operates for a long time without external communication. An example of such a shared execution unit is the SME accelerator of the Apple CPUs, which executes matrix operations.


Aside from large vector ALUs, execution units are cheap in transistor count. Caches, TLBs, memory for the various predictors, register files, and reorder buffers probably cost a significantly larger number of transistors.

In any case, execution units are clustered around ports, so, AFAIK, you wouldn't really be able to share at the instruction level, but only groups of related instructions.

Still, some sharing is possible: AMD tried to share FPUs in Bulldozer, but it didn't work well. Some other CPUs share cryptographic accelerators. IIRC Apple shares the matrix unit across cores.


Are predictors and TLBs shared between SMTs?


As far as I know, yes: most structures are dynamically shared between hyperthreads.


I'm creating a game + engine, and speaking from personal experience/my use case, hyperthreading was less performant than (praying to the CPU thread allocation god) each thread utilizing its own core. I decided to max out the number of threads by using std::thread::hardware_concurrency() / 2 - 1 (i.e. number of cores - 1).

I'm working with a std::vector
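
A sketch of that thread-count heuristic (names are illustrative). Note that hardware_concurrency() counts logical processors, so dividing by two only maps to physical cores on a part where every core is 2-way SMT (it undercounts on hybrid chips with non-SMT E-cores), and it may return 0, so a guard is worth adding.

  #include <algorithm>
  #include <thread>

  unsigned worker_thread_count() {
      unsigned logical = std::thread::hardware_concurrency(); // may be 0
      unsigned physical_guess = logical / 2;  // assumes 2-way SMT on every core
      // Keep one core free for the main/render thread, but never go below 1.
      return std::max(1u, physical_guess > 1 ? physical_guess - 1 : 1u);
  }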


On common industry benchmarks, at least every second generation of Intel hyperthreading ended up being slower than turning it off. Even when it worked, it was barely double-digit percent improvements, and there were periods when it was worse for consecutive generations. Why do they keep trying?


On the other hand a lot of use cases get high speedups from SMT (eg "AMD’s implementation of simultaneous multithreading appears to be power efficient as it achieves 41% additional requests per second while consuming only 7% more power compared to the skewed baseline" in https://blog.cloudflare.com/measuring-hyper-threading-and-tu...)

Seems Intel is maybe just not very good at it.


Because the benchmarks don't measure what people do with computers.


Most benchmarks are fine-tuned code that provides pretty much the perfect case for running with SMT off, because the real-world use cases that benefit from SMT are absent in those benchmarks.


Even on server parts?


Reading about how some of these low-level CPU features work always blows my mind.

Back in college, I took a class that I think was called Intro to Computer Hardware, but it really should have been called Intro to CPU Design. We learned how to turn logic gates into adders, latches, flip-flops, etc, and by the end of the course, we were able to design a very rudimentary processor at the gate level.

How someone could come up with something like register renaming or figure out out-of-order execution is beyond me. They're not designing these at the gate level, are they? Or is there a language they use and a "compiler" that arranges the gates and transistors?


I took the next level of that course and they taught up to SMT and a few other things. We did all of the homework assignments in a hardware description language called Verilog which allowed us to write abstractions for all the things.


One of the biggest mistakes users have is a mental model of SMT that imagines the existence of one "real core" and one inferior one. The threads are coequal in all observable respects.


I suspect that’s a result of the performance. While both threads are capable of the same tasks, you don’t get the 2x performance you would with a “real” second thread, really a second core.

So conceptually it’s a little more like having 1.25 single threaded cores, or whatever the ratio is for your application, if you only care about performance in the end.


If you're compressing a video or a similar highly optimised compute-hungry task, your computer's fans are roaring like jet engines, and task manager says you're at 50% CPU utilisation - I can see how people formed that opinion.


I mean, Intel's new CPUs certainly have both real cores ("P-cores") and inferior cores ("E-cores"). I suspect the reason they introduced E-cores is primarily thermal- and die-space-related, not actually power usage or performance. I always make sure to buy the chips without E-cores because they are better.


That's weird, I buy the one with the most e-cores. They're great for batch processes. For example the 14900K builds the Linux kernel faster than the Ryzen 9 7950X, even though it only has 8 "real cores" versus 16.


Until it degrades...


Does anyone know how to search for this type of detailed technical article?

I've searched for this exact topic. As expected with this kind of end-user-facing tech, the search results are always end-user-oriented articles that explain nothing.


https://hn.algolia.com, assuming that most articles of that sort end up getting submitted to or mentioned on HN.


I find that LLMs with web access are a good fit for this kind of search, at least to point me in the right direction. The URLs provided are mostly hallucinations, however.


I don't know if google is somehow tracking the surge of interest in this article due to this HN post, but this very blog post is like search result 5 for me when looking up "how does simultaneous multi threading work" (results from a fresh firefox private tab on a different device, which I understand doesn't necessarily stop all tracking / caching but should be a fairly reasonable approx).


> As we have seen, enabling SMT on a CPU core requires sharing many of the buffers and execution resources between the two logical processors. Even if there is only one thread running on an SMT enabled core, these resources remain unavailable to that thread which reduces its potential performance.

This isn't true (anymore?). We've seen a variety of SMT cores which partition the ROB, fetch/decode bandwidth, etc. when running in SMT mode but allow full use when not in SMT mode.


This is exactly how the x200-series Phi processors work: you get far more resources per thread in non-SMT mode than in 4-way SMT mode.


The whole point of SMT is to maximize utilization of a superscalar execution engine.

I wonder if that trend means people think superscalar is less important than it used to be.


Good summary overall, although it seemed a little muddled in places.

Would love to know some of the tricks of the trade from insiders (not relating to security at least)


Poor AMD: their Bulldozer architecture got so much flak for not including SMT, and now everyone's moving away from it.

Yes, yes, I know Bulldozer had a bunch more issues than just no SMT. It actually had the exact opposite, with multiple cores sharing the same FPU or something like that. But still, they could have been onto something if they had made it marginally more performant.


> for not including SMT and now everyone's moving away from it.

The PowerXX architecture isn't moving away from it. Power10 currently supports SMT8 (8 threads per core) effectively and I can't imagine them moving away from it given all of the work they've done over the years to continue to evolve their design around SMT.


What I think is worth knowing is that compute units in GPUs also use SMT, usually at a level of 7 to 10 threads per CU. This helps to hide latency.


Most GPUs do not use SMT, but the predecessor of SMT, fine-grained multi-threading. In every clock cycle the instruction that is initiated is selected from the many threads that are available, depending on which of them need resources that are not busy. Most GPUs do not initiate multiple instructions per clock cycle (even if multiple instructions may proceed concurrently after initiation) or if they initiate multiple instructions per clock cycle those may have to belong to distinct classes of instructions, which use distinct execution resources, for example scalar instructions and vector instructions.

SMT, i.e. simultaneous multi-threading, means that in every clock cycle many instructions are simultaneously initiated from all threads and then those instructions compete for the multiple execution units of a superscalar CPU, in order to keep busy as many of those execution units as possible. For each of the concurrent execution units, e.g. for each of the 6 integer adders available in the latest CPUs, a decision is made separately from the others about which instruction to be executed from the queues that hold instructions belonging to all simultaneous threads.


As far as I know, hyperthreads on Intel share fetchers and decoders, so each clock cycle only one thread is feeding instructions into the pipeline. That's no different even from a simple barrel processor.

It is true that, once instructions are fetched, an OoO CPU does a significant amount of scheduling, and it is possible that in a given clock cycle instructions from both threads are fed to execution units. But I don't think that's the essence of SMT.

For example, the original Larrabee is described as 4-way SMT, but being P5-derived it was a simple in-order design with very limited superscalar capabilities. I very much doubt that at any time instructions from more than one thread were at the execution stage.


The fetchers and decoders are shared, but they fetch and decode many instructions per clock cycle (up to 8 or 9 instructions per clock cycle in the latest Intel cores, i.e. Lion Cove and Skymont).

While the shared Intel decoders alternate between the threads and the queue that stores micro-operations before they are dispatched is also partitioned between threads, this front-end is decoupled from the schedulers that select micro-operations for execution, which may choose in any clock cycle as many uops as there are execution units and in any combination between the SMT threads.

Even in the first Intel CPU with SMT, Pentium 4, up to 3 instructions were fetched and decoded in each clock cycle and there were places where up to 7 instructions in any combination between the 2 SMT threads were executed during the same clock cycle.

In modern CPUs the concurrency is much greater.


They fetch and decode many instructions, but only from a single cache line fetched by the fetcher, hence they can't decode instructions for more than one hyperthread at a time.

Except the newer *mont cores, which have truly separate decoders and fetchers and could indeed decode for two hyperthreads separately.


As a pre-read you should also go through https://www.lighterra.com/papers/modernmicroprocessors/.


It seems a bit high-level, kind of skimming over a bunch of architecture concepts with a couple references to the fact that this might be duplicated when hyperthreading, this might not…

IMO a blog post should be more actionable. This isn’t a textbook chapter. For example we go through the frontend. When discussing the trace cache we have:

> … Instruction decoding is an expensive operation and some instructions need to be executed frequently. Having this cache helps the processor cut down the instruction execution latency.

> Trace cache is shared dynamically between the two logical processors on an as needed basis.

> Each entry in the cache is tagged with the thread information to distinguish the instructions of the two threads. The access to the trace cache is arbitrated between the two logical processors each cycle.

So the threads share a trace cache, but keep track of which hyperthread used which instructions—but we don’t really know, practically, if we prefer threads that are running very similar computations or if that is a non-issue (that is, does the fact that they share the trace cache mean one thread can benefit from things the other has cached? Or does the tagging keep them separated?).

In general, often they say “this is split equally between the two threads” or “this is shared,” which makes me wonder “if I disable SMT does the now single-thread get access to twice as much of this resource, and are there cases where that matters.”

This is somewhat covered in:

> As we have seen, enabling SMT on a CPU core requires sharing many of the buffers and execution resources between the two logical processors. Even if there is only one thread running on an SMT enabled core, these resources remain unavailable to that thread which reduces its potential performance.

But this seems a bit fuzzy to me, I mean, we talk about caches which are shared dynamically between the two threads so at least some resources will be more readily available if only a single thread is running.

It also could be interesting—if the author is an expert, perhaps they could share their experience as to which pipeline stages are often bottlenecks that get tighter with hyperthreads on, and which aren’t? We have a sort of even focus on each stage without many hints as to which practically matter. Or how we can help them out. Also it is largely based on a 2002 whitepaper so I guess the specific pipeline stages must have evolved a bit since then.

Or maybe they could share some battle stories, favorite tools, some examples of applications and why they put pressure on particular stages, things which surprisingly didn’t scale when hyperthreads were enabled (I’m not asking for all these things, just any would be good).


Disabling SMT gives about +5% fps in the games I play.


What CPU?


7950X3D


Tangent: Ever since I became familiar with Erlang and the impressive BEAM, any other async method seems subpar and contrived, and that includes Python, Go, Rust, etc.

It's just weird how there's a correct way to do async and parallelism (which erlang does) and literally no other language does it.


Other languages do sometimes implement this at the library level. Clojure's core.async comes to mind (though there are subtle differences). There's downsides to this approach though.

The data going into each "mailbox" either needs to be immutable, or deep copied to be thread safe. This obviously comes at a cost. Sometimes you just have a huge amount of state that different threads need to work on, and the above solution isn't viable. Erlang has ets/dets to help deal with this. You will notice ets/dets looks nothing like the mailbox/process pattern.

Erlang is great, but it is hardly the "one true way". As with most things, tradeoffs are a thing, and usually the right solution comes down to "it depends".


> The data going into each "mailbox" either needs to be immutable, or deep copied to be thread safe

Or moved. The mailbox/process pattern works great in Rust because you can simply move ownership of a value. Kind of like if in C you send the pointer and then delete your copy.

Of course doing this across threads doesn't work with every type of value (what Rust's type system encodes as a value being `Send`). For example you can't send a reference-counted value if the reference counter isn't thread-safe. But that's rarely an issue and easily solved with a good type system.


The actor model (share-nothing) is one way to address the problem of shared, mutable state.

But what if I want to have my cake and eat it too? What if I want to have thread-safe, shared, mutable state. Is it not conceivable that there's a better approach than share-nothing?


You can always share immutable data. So shared nothing is a bit strong choice of words imo.

> But what if I want to have my cake and eat it too? What if I want to have thread-safe, shared, mutable state.

No, it’s not possible. Shared mutable state invokes ancient evils that violate our way of thinking about traditional imperative programming. Let’s assume you have a magical compiler & CPU that solves safety and performance. You still have unsynchronized reads and writes on your shared state. This means a shared variable you just read can change anytime “under your feet”, so in the next line it may be different. It’s a race condition, but technically not a data race. The classical problem is if multiple threads increment the same counter, which requires a temporary variable. A magical runtime can make it safe, but it can’t make it correct, because it cannot read your mind.

This unfortunately leaves you with a few options, that all violate our simple way of life in some manner: you can explicitly mark critical sections (where the world stands still for a short amount of time). Or you can send messages, which introduces new concepts and control flow constructs to your programming environment (which many languages do, but Erlang does perhaps the most seriously). Finally, you can switch to entirely different paradigms like reactive, dataflow, functional, etc, where the idea is the compiler or runtime parallelizes automatically. For instance, CSS or SQL engines.
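
A minimal sketch of the counter example and the "mark the critical section" option (the names are illustrative): without synchronization the read-modify-write races and loses increments; the mutex makes it happen as a unit.

  #include <mutex>
  #include <thread>

  long counter = 0;
  std::mutex counter_mutex;

  void unsafe_increment() { ++counter; }   // load, add, store: a data race

  void safe_increment() {
      std::lock_guard<std::mutex> lock(counter_mutex);  // critical section:
      ++counter;                                        // the world "stands still"
  }

  int main() {
      std::thread a(safe_increment), b(safe_increment);
      a.join(); b.join();
      return counter == 2 ? 0 : 1;   // always 2 with the lock; not guaranteed without it
  }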

I like message passing for two reasons: (1) it is already a requirement for networked applications so you can reuse design patterns and even migrate between them easily and (2) it supports and encourages single ownership of data which has proven to work well in applications when complexity grows over time.

OTOH, I am still using all of the above in my day to day. Hopefully in the future we will see clearer lines and fewer variations across languages and runtimes. It’s more complex than it needs to be, and we’re paying a large price for it.


There isn't anything inherently evil with mutable shared state. If you think about it, what is a database if not mutable shared memory? Concurrency control (or the lack of it) is the issue. But you can build concurrency control abstractions on top of shared memory.

Also remember that shared memory and message passing are duals.


Not if you put it into moralising terms like that. There's nothing wrong with crime either - lack of law enforcement is the issue.

Shared, mutable state means that your correct (single-threaded) reasoning ceases to be correct.

Databases are a brilliant example of safe, shared, mutable state. You run your transaction, your colleague runs his, the end result makes sense, and not a volatile in sight (not that it would have helped.)


> If you think about it, what is a database if not mutable shared memory?

Eh --- I send the database a message, and it often sends me a message back. From outside, it's communicating processes. Most databases do mutable shared state inside their box (and if you're running an in-process database, there's not necessarily a clear separation).

I don't think shared mutable state is inherently evil; but it's a lot harder to think about than shared-nothing. Of course, even in Erlang, a process's mailbox is shared mutable state; and so is ets. Sometimes, the most effective way forward is shared mutable state. But having to consider the entire scope of shared mutation is exhausting; I find it much easier to write a sequential program with async messaging to interface with the world; it's usually really clear what to do when you get a message, although it moves some complexity towards 'how do other processes know where to send those messages' and similar things. It's always easy to make async send/response feel synchronous, after you send a request, you can wait for the response (best to have a timeout, too); it's very painful to take apart a synchronous api into separate send and receive, so I strongly prefer async messaging as the basic building block.


As much as I hate to say it, I think Java probably has the best have-your-cake-and-eat-it implementation by far. Volatile makes sure variables stay sane on the memory side; if you only write to a variable from one thread and read it from others, then it just sort of magically works? Plus there are the executors to handle thread reuse for async tasks. I assume C# has the same concepts, given that it's just a carbon copy with title-case naming.

Python can't execute on two cores at once, so it functionally has no multithreading. JS can share data between threads, but must convert it all to strings, because pointless performance penalties are great to have. Golang has that weird FIFO channel thing (probably sockets in disguise for these last two). C/C++ has a segfault.


> JS can share data between threads, but must convert it all to string

To be more precise, you can send data to web workers and worker threads by copying via the structured clone algorithm (unlike JSON this supports almost all data types), and you can also move certain transferable objects between threads which is a zero-copy (and therefore much faster) operation.


Ah yeah, DataViews, but you still need to convert from JSON to those, and that takes about as much overhead; plus they're much harder to deal with complexity-wise, being annoying single-type buffers and all. For any other language it would work better, but because JS mainly deals with data arriving from elsewhere, it means it needs to be converted every single time instead of just maintaining a local copy for thread comms.


> Ah yeah dataviews, but you still need to convert from json to those and that takes about as much overhead

You don't necessarily need to have an intermediate JSON representation. Many of the built in APIs in Node and browsers return array buffers natively. For example:

  const buffer = await fetch('foo.wav').then(res => res.arrayBuffer())
  new Worker('worker.js').postMessage(buffer, [buffer])
This completely transfers the buffer to the worker thread, after which it is detached (unusable from the sending side) [1][2].

[1] https://developer.mozilla.org/en-US/docs/Web/API/Worker/post...

[2] https://developer.mozilla.org/en-US/docs/Web/JavaScript/Refe...


Hmm, well this I have to try sometime, thanks for the heads up :)


C# has Interlocked https://learn.microsoft.com/en-us/dotnet/api/system.threadin... which is strictly better than volatile because it lets you write lockfree code that actually observes the memory semantics.

"Volatile" specifies nothing about memory barrier semantics in either Java or C++, if I remember correctly?


std::atomic is the C++ equivalent of Java's volatile. C and C++ volatile is for other things, such as memory-mapped I/O.
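
To illustrate the distinction in C++ terms (a sketch; the names are illustrative): volatile only prevents the compiler from optimizing the access away and gives no inter-thread ordering, while std::atomic provides the acquire/release semantics that make publishing data between threads well-defined.

  #include <atomic>

  std::atomic<int> ready{0};
  int payload = 0;                 // ordinary data, published via 'ready'

  void producer() {
      payload = 42;
      ready.store(1, std::memory_order_release);   // makes 'payload' visible
  }

  int consumer() {
      while (ready.load(std::memory_order_acquire) == 0) { /* spin */ }
      return payload;              // guaranteed to observe 42
  }
  // Swapping std::atomic<int> for 'volatile int' here would be a data race:
  // volatile gives no atomicity and no memory-ordering guarantees in C++.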


> I assume C# has the same concepts given that it's just a carbon copy with title case naming.

Better not comment than look clueless. Moreover, this applies to the use of volatile keyword in Java as well.


> Volatile makes sure variables stay sane on the memory side

This doesn't get you from shared-mutable-hell to shared-mutable-safe, it gets you from shared-mutable-relaxedmemorymodel-hell to shared-mutable-hell. It's the kind of hell you don't come across until you start being too smart for synchronisation primitives and start taking a stab at lockfree/lockless wizardry.

> if you only write to a variable from one thread and read it from others, then it just sort of magically works

I'm not necessarily convinced by that - but either way that's a huge blow to 'shared' if you are only allowed one writer.

> Plus the executors to handle thread reuse for async tasks.

What does this solve with regard to the shared-mutable problem? This is like "Erlang has BEAM to handle the actors" or something - so what?


Well it doesn't get you there because shared-mutable-safe doesn't exist, at least I doubt it can without major tradeoffs. You either err on the side of complete safety with a system that is borderline unusable for anything practical, or you let people do whatever they want and let them deal with their problems once they actually have them.

> either way that's a huge blow to 'shared' if you are only allowed one writer

Yeah for full N thread reading and editing you'd need N vars per var which is annoying, but that kind of every-thread-is-main setup is something that is exceedingly rare. There's almost always a few fixed main ones and lots running specific tasks that don't really need to know about all the other ones.


I'm a huge fan of the BEAM, but I wonder if you're not overselling it a little? Surely there are trade-offs here that sometimes aren't worth it and alternative ways are better?

For example, the BEAM isn't optimized for throughput and if that's a high priority for you then you might want to choose something else (maybe Rust).


> For example, the BEAM isn't optimized for throughput...

weird take, given that Erlang powers telecom systems with literally millions of connections at a time.


The original use that Erlang was built for was controlling telecom switches. In that application, Erlang supervises phone lines, responding to on/off hook and dialed numbers etc, but does not touch the voice path at all --- it only controls the relays/etc to connect the voice path as desired. That's not a high throughput job at all.

I used Erlang at WhatsApp. Having a million chat connections was impressive, but that's not high throughput either. Most of those connections were idle. As we added more things to the chat connection, we ended up with a significantly lower target per machine (and then at Facebook, the servers got a lot smaller and the target connections per machine was way less).

We did have some Erlang machines that pushed 20gbps for https downloads, but I don't think that's that impressive; serving static files with https at 20gbps with a 2690v4 isn't hard to do if you have clients to pull the files.

IMHO, Erlang is built for reliability/fault tolerance first. Process isolation happens to be good for both fault tolerance and enabling parallel processing. I find it to be a really nice environment to run lots of processes in, but it's clearly not trying to win performance prizes. If you need throughput, you need to at least process the data outside of Erlang (TLS protocol happens in Erlang, TLS crypto happens in C), and sometimes it's better to keep the data out of Erlang all together. Erlang is better suited for 'control plane' than 'data plane' applications; but it's 2024 and we have an abundance of compute power, so you can shoehorn a lot into a less than high performance environment ;)


Thanks, this is a really informative comment.


that's not throughput


You need to stop reading the Erlang/BEAM propaganda from the 20th century and thinking it's still up-to-date. The BEAM ecosystem is still a viable choice today, but what you're spouting is massively out of date. BEAM is merely one viable alternative in a rich set of alternatives today.

It wasn't necessarily wrong then either. I'm not saying it was "wrong". I'm saying, you're running around giving very, very old talking points. I recognize them.

AIUI it's not clear that Erlang actually powers any telecoms systems anymore. Ericsson deprioritized it a long time ago. Fortunately it has found success in other niches.

And I will highlight one more time, this is not a criticism of Erlang. I can, but I'm not doing that now. You're rolling into a discussion about which cloud provider to use while extolling the virtues of "virtual machines", or about which web framework to use while waxing poetic about these "frames" things. It's not even that you're wrong necessarily, but it's not relevant.


but the traffic on each connection doesn't need bandwidth in the throughput sense mentioned


I worked in telecoms; no Erlang found. Perhaps we were too modern?


I worked in telecoms. We had a third party component written in Erlang. I'd say it was about the second most reliable component of the system. It was susceptible to some memory leaks, but they usually turned out to be caused by misuse of the C client library.

The most reliable component was written in C, with what must have been a space shuttle level of effort behind it. No memory allocation, except in code paths that can return an error to the user who asked for something that caused the system to need more memory (we probably got this wrong on our side of the API, and didn't test out of memory scenarios, and they probably would have just resulted in OOM kills anyway). Every line of code commented, sometimes leading to the infamous "//add 1 to i" but most times showing deep design thought. State machines documented with a paragraph for every state transition explaining non-obvious concerns.


Even Go? For me, its main redeeming point [1] was its so-called "goroutines" (which are basically Erlang lightweight processes, with the mailbox replaced with Concurrent ML-style channels).

That being said, I've gotten used to Rust's async. It's a bit unusual, but it also adds a layer of clarity which I appreciate.

[1] Sorry, I've really tried to click with Go, but too many things don't work with my mental models.


What are you talking about?

In C++ you have ASIO (Boost) that’s mostly used for IPC but can be used as a general purpose async event queue. There is io_uring support too. You can sit a pool of threads as the consumers for events if you want to scale.

C++ has had de facto support for threads for ages (Boost), and it has been rolled into the standard library since 2011.

If you’re using compute clusters you also have MPI in Boost. That’s a scatter/gather model.

There’s also OpenMP to parallelize loops if you’re so inclined.
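
For the OpenMP route, a minimal sketch (compile with -fopenmp or the equivalent flag; the function is illustrative): one pragma parallelizes the loop and combines the per-thread partial sums.

  #include <vector>

  double dot(const std::vector<double>& a, const std::vector<double>& b) {
      double sum = 0.0;
      // Each thread accumulates a private partial sum; OpenMP combines them.
      #pragma omp parallel for reduction(+:sum)
      for (long i = 0; i < static_cast<long>(a.size()); ++i)
          sum += a[i] * b[i];
      return sum;
  }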


I should look this up but I’m lazy.

I’m familiar with MPI in Fortran/C, and IIRC they had some MPI C++ primitives that never really got a ton of traction (that’s just the informal impression I got skimming their docs, though).

How’s MPI in C++ boost work? MPI communicates big dumb arrays best I think, so maybe they did all the plumbing work for translating Boost objects into big dumb arrays, communicating those, and then reconstructing them on the other side?


It’s still a dumb way to do computation. The Boost library hides the implementation away, though; on Linux and Macintosh it requires gomp.

Really, it’s a wrapper but makes it “easier” for scientists to use who are, in general, not good coders.

Source: I’ve seen CERN research code.


What’s a dumb way to do computation? Using objects? I’m generally suspicious of divergence from the ideal form of computation (math, applied to a big dumb array) but C++ is quite popular.


Could you share more intel about this? Links or whatever.

I'd like to learn more


Go does something similar, and ponylang is basically compiled Erlang.



