But the way you make rendering embarrassingly parallel is the way you make web s...

bayindirh · 2024-07-29T08:56:33 1722243393

Yes, however in an SMT-enabled processor there is one physical FPU per two logical cores. The FPU is already busy with the other thread's work, so the threads on that SMT-enabled core take turns for their computation in the FPU, creating a bottleneck.

As a result, you get no speed boost in the best case, or lose some time in the worst case.

Since SMT doesn't magically increase the number of FPUs available in a processor, if what you're doing is math-heavy, SMT just doesn't help. The same is true for scientific simulation. I observed the same effect, and verified that saturating the FPU with one thread indeed makes SMT moot.
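To make the bottleneck argument concrete, here is a minimal back-of-the-envelope sketch. It assumes a deliberately simplified model that is not from the thread: one FPU shared by identical SMT threads, no scheduling overhead, and a hypothetical parameter `fpu_fraction` for the share of a lone thread's cycles that occupy the FPU. Real cores are more complicated, so treat the numbers as illustrative only.

```python
def smt_aggregate_throughput(fpu_fraction: float, threads: int = 2) -> float:
    """Aggregate throughput of `threads` identical SMT threads sharing one FPU,
    relative to a single thread running alone (= 1.0).

    fpu_fraction: share of a lone thread's cycles that occupy the FPU (0 < f <= 1).
    This is a toy model: it ignores caches, front-end limits, and SMT overhead.
    """
    if not 0.0 < fpu_fraction <= 1.0:
        raise ValueError("fpu_fraction must be in (0, 1]")
    # Without contention, each thread would contribute 1.0 of throughput.
    # The single shared FPU caps the total at 1 / fpu_fraction.
    return min(float(threads), 1.0 / fpu_fraction)


if __name__ == "__main__":
    for f in (0.25, 0.5, 0.8, 1.0):
        gain = smt_aggregate_throughput(f)
        print(f"FPU share {f:.2f}: SMT aggregate = {gain:.2f}x")
```

In this model a thread that already keeps the FPU busy all the time (`fpu_fraction=1.0`) gains nothing from a sibling thread, which is the best case described above; the worst case, where overhead makes things slower, is not modeled.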

rbanffy · 2024-07-29T09:34:25 1722245665

If you have contention around a single part of the CPU then yes, SMT will not help you. The single FPU was an issue on the first Niagara processor as well, but it still had great throughput per socket unless all processes were fighting for the FPU.

If, however, you have multiple FPUs on your processor, then it might be useful to enable SMT. As usual, it pays to tune hardware to the workload you have. For integer-heavy workloads, you might prefer SMT (there are options for up to 8 threads per physical core out there) up to the point where either cache misses or backend exhaustion set in.

bayindirh · 2024-07-29T09:56:20 1722246980

Current processors contain one FPU per core. When you have a couple of FPU-heavy programs in a system, SMT makes sense, because it allows you to keep the FPU busy while other, lighter threads play in the sand elsewhere.

OTOH, when you run an HPC node, everything you run wants the FPU, the vector units and the kitchen sink in that general area. Enabling SMT makes the queue longer, and nothing is processed faster in reality.

So, as a result, SMT makes sense sometimes, and is detrimental to performance at other times. Benchmarking, profiling and system tuning are key. We generally disable SMT on our systems because it lowers performance when the node is fully utilized (which is most of the time).
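For anyone wanting to benchmark with and without SMT siblings before changing firmware settings, a sketch along these lines may help. It assumes Linux, where each logical CPU exposes a `thread_siblings_list` file under `/sys/devices/system/cpu/cpuN/topology/`; the helper names (`parse_cpu_list`, `one_cpu_per_core`, `read_siblings`) are made up for this example.

```python
from pathlib import Path


def parse_cpu_list(text: str) -> list[int]:
    """Parse a Linux CPU list such as '0,4' or '0-3,8' into sorted CPU numbers."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def one_cpu_per_core(siblings: dict[int, str]) -> list[int]:
    """Given {cpu: thread_siblings_list text}, keep the lowest-numbered
    logical CPU of each physical core."""
    return sorted({parse_cpu_list(text)[0] for text in siblings.values()})


def read_siblings(root: str = "/sys/devices/system/cpu") -> dict[int, str]:
    """Read thread_siblings_list for every CPU that exposes it (Linux only)."""
    result = {}
    for path in Path(root).glob("cpu[0-9]*/topology/thread_siblings_list"):
        cpu = int(path.parts[-3][3:])  # directory name is 'cpuN'
        result[cpu] = path.read_text()
    return result


if __name__ == "__main__":
    siblings = read_siblings()
    if siblings:
        print("one logical CPU per core:", one_cpu_per_core(siblings))
    else:
        print("no topology information found (not Linux?)")
```

On Linux, the resulting list could be passed to `os.sched_setaffinity(0, cpus)` before launching the workload, so that the same benchmark can be timed once on one thread per core and once on all logical CPUs.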

markhahn · 2024-07-29T15:51:43 1722268303

I'm not really sure why you say "one FPU per core". are you talking about the programmer-visible model? all current/mainstream processors have multiple and parallel FP FUs, with multiple FP/SIMD instructions in flight. afaik, the inflight instrs are from both threads if SMT is enabled (that is, tracked like all other uops). I'm also not sure why you say that enabling SMT "makes the queue longer" - do you mean pipeline? Or just that threads are likely to conflict?

funcDropShadow · 2024-07-31T14:02:06 1722434526

Yes, but well-optimized, math-heavy software will already max out the super-scalarity of the FPU. I.e. one CPU thread can already schedule multiple FPU-heavy instructions at the same time. If you run such software twice on the same FPU you will only gain overhead. I guess by queue he meant the processor internal work queue; the processor pipeline is only half of the picture. Processors keep a small data-dependency graph of the micro-instructions they have to perform. That is used to implement the machine code instructions that are currently in flight.
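The saturation argument can be sketched with a small steady-state model. The numbers here are assumptions chosen for illustration, not figures for any specific CPU: a hypothetical core with `units=2` pipelined FP units and an FP operation latency of `latency=4` cycles, where each thread supplies some number of independent dependency chains.

```python
def fp_ops_per_cycle(chains_per_thread: int, threads: int,
                     latency: int = 4, units: int = 2) -> float:
    """Steady-state FP ops per cycle for a core with `units` pipelined FP units,
    each op taking `latency` cycles, when every thread supplies
    `chains_per_thread` independent dependency chains.

    Toy model: ignores the front end, memory, and scheduler capacity.
    """
    # A chain has at most one op in flight, so it retires 1/latency ops per cycle.
    demand = threads * chains_per_thread / latency
    # The FP units can start at most `units` ops per cycle in total.
    return min(demand, float(units))


if __name__ == "__main__":
    for chains in (1, 4, 8):
        one = fp_ops_per_cycle(chains, threads=1)
        two = fp_ops_per_cycle(chains, threads=2)
        print(f"{chains} chains/thread: 1 thread = {one:.2f}, "
              f"2 SMT threads = {two:.2f} ops/cycle")
```

Under these assumptions, a thread with a single dependency chain leaves the FP units mostly idle, so a second SMT thread helps; a thread with eight independent chains already fills both units, and a second thread adds nothing but contention.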

bayindirh · 2024-08-02T20:15:58 1722629758

> I guess by queue he meant the processor internal work queue...

Yes, I meant the internal one. Also, when you enable SMT, a small tag is added in front of every instruction, noting which logical core owns this instruction for a given physical core. So instead of tagging every instruction with a core-ID, you add a longer tag in the form of core-ID/logical_core-ID.

This extra tagging also makes instructions bigger, so the queue can hold fewer instructions, adding fuel to the already chaotic and choked FPU logistics.

As a result, if you're saturating your FPU(s), SMT can't save you. In fact, it can make you slower.