a little annoying to see the one-core-compared-to-whole-GPU comparisons - we're now decades past the point when this was an innocent mistake.

compare a 500W GPU to all the cores of a 500W CPU, please. I'm not expecting the CPU (say, a 192-core AMD that does fast AVX512) to beat the GPU on all data-parallel workloads, but the results won't be the silly sort of graphs shown in this blog.

or compare one SM to one CPU core - that has merit as well.

better yet, we're finally getting some CPUs (well, APUs...) with in-package RAM. that makes the comparison more interesting as well.




The first example plot compares a 9950X using all threads with AVX-512 against a 4090. The 9950X has a 170W TDP, which doesn’t include any other components like the RAM or motherboard. The 4090’s total max power is ~450W. The chart shows the 4090 burying the 9950X by far more than the 450/170 (~2.6x) power ratio would explain.

Comparing SMs to CPU cores 1:1 also makes no sense. They don’t do the same things.


It should be kept in mind that a 4090 only buries a 9950X for FP32 computations.

For FP64 computations, the reverse happens: a 9950X buries a 4090, despite the latter having a 3-times higher price and a 2.5-times higher power consumption.

For FP64 operations, a 4090 and a 9950X can do a similar number of operations per clock cycle (288 vs. 256), but the 9950X can do them at roughly double the clock frequency, and it is easier to reach a high fraction of the maximum theoretical throughput on a 9950X than on a 4090.
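
As a rough back-of-the-envelope, peak FP64 throughput is just ops/cycle times clock. A minimal sketch using the per-cycle counts above, with clock frequencies that are my own assumptions (~2.5 GHz boost for the 4090, ~5.7 GHz boost for the 9950X), not figures from the article:

    /* Back-of-the-envelope peak FP64 throughput: ops/cycle * clock.
       Per-cycle counts are from the comment above; the clock values
       below are assumptions, not measurements. */
    #include <stdio.h>

    int main(void) {
        double ops_4090  = 288, clk_4090  = 2.5e9;  /* ~2.5 GHz boost (assumed) */
        double ops_9950x = 256, clk_9950x = 5.7e9;  /* ~5.7 GHz boost (assumed) */
        printf("4090  peak FP64: ~%.0f GFLOP/s\n", ops_4090 * clk_4090 / 1e9);
        printf("9950X peak FP64: ~%.0f GFLOP/s\n", ops_9950x * clk_9950x / 1e9);
        return 0;
    }

With numbers in that ballpark, the roughly 2x clock advantage is what flips the FP64 comparison in the CPU's favor.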


What about FP8? It is a format that is very popular for LLM inference.


AMD Zen 5 has the so-called “Vector Neural Network Instructions” (VNNI), which can be used for inference with INT8 quantization, and it also has instructions for inference with BF16 quantization.
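
To illustrate, the INT8 path is reachable through the AVX-512 VNNI intrinsics; here is a minimal sketch of one accumulation step using _mm512_dpbusd_epi32 (the helper name and build flags are my own, not from the thread):

    /* Minimal AVX-512 VNNI sketch: multiply unsigned 8-bit activations by
       signed 8-bit weights, sum groups of four products, and accumulate
       into 32-bit lanes. Build with e.g.: gcc -O2 -mavx512f -mavx512vnni */
    #include <immintrin.h>
    #include <stdint.h>

    /* Illustrative helper: one VNNI step over 64 byte pairs. */
    static inline __m512i int8_dot_step(__m512i acc, const uint8_t *a, const int8_t *b) {
        __m512i va = _mm512_loadu_si512(a);   /* 64 unsigned 8-bit activations */
        __m512i vb = _mm512_loadu_si512(b);   /* 64 signed 8-bit weights */
        /* vpdpbusd: per 32-bit lane, acc += sum of 4 adjacent u8*s8 products */
        return _mm512_dpbusd_epi32(acc, va, vb);
    }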

FP8 is a more recent quantization format and AFAIK no CPU implements it.

I do not know the throughput of these instructions on Zen 5. It must be higher than on older CPUs, but lower than on the Intel Xeon models that support AMX (which are much more expensive, so despite their higher absolute inference performance they might have lower performance per dollar), and obviously lower than that of the tensor cores of a big NVIDIA GPU.

Nevertheless, for models that do not fit inside the memory of a GPU, inference on a Zen 5 CPU may become competitive.
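
As a quick sanity check of when a model stops fitting in a single consumer GPU (the parameter counts and bytes-per-parameter below are illustrative assumptions, not from the article):

    /* Rough weights-only memory arithmetic, ignoring KV cache and activations.
       Model sizes and quantization width are illustrative assumptions. */
    #include <stdio.h>

    int main(void) {
        double params_billion[] = { 8.0, 70.0 };  /* assumed model sizes */
        double bytes_per_param  = 1.0;            /* INT8-style quantization */
        double vram_gb          = 24.0;           /* e.g. a 4090 */
        for (int i = 0; i < 2; i++) {
            double gb = params_billion[i] * bytes_per_param;  /* billions of params * bytes/param = GB */
            printf("%.0fB params @ %.0f byte/param: ~%.0f GB -> %s %.0f GB of VRAM\n",
                   params_billion[i], bytes_per_param, gb,
                   gb <= vram_gb ? "fits in" : "exceeds", vram_gb);
        }
        return 0;
    }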



