a little annoying to see the one-core-compared-to-whole-GPU comparisons - we're now decades past the point when this was an innocent mistake.

compare a 500W GPU to all the cores of a 500W CPU, please. I'm not expecting the CPU (say, a 192-core AMD that does fast AVX512) to beat the GPU on all data-parallel workloads, but the results won't be the silly sort of graphs shown in this blog.

or compare one SM to one CPU core - that has merit as well.

better yet, we're finally getting some CPUs (well, APUs...) with in-package RAM. that makes the comparison more interesting as well.




The first example plot compares a 9950X using all threads with AVX-512 against a 4090. The 9950X has a 170W TDP, which doesn’t include any other components like the RAM or motherboard. The 4090’s total max power is ~450W. The chart shows the 4090 burying the 9950X by far more than the 450/170 (~2.6x) power ratio would explain.

Comparing SMs to CPU cores 1:1 also makes no sense. They don’t do the same things.


It should be kept in mind that a 4090 only buries a 9950X for FP32 computations.

For FP64 computations, the reverse happens: a 9950X buries a 4090, despite the latter having a 3-times higher price and a 2.5-times higher power consumption.

For FP64 operations, a 4090 and a 9950X can do a similar number of operations per clock cycle (288 vs. 256), but the 9950X can do them at roughly double the clock frequency, and it is easier to reach a high fraction of the maximum theoretical throughput on a 9950X than on a 4090.
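
As a rough back-of-the-envelope, peak FP64 throughput is just ops/cycle times clock. A minimal sketch using the per-cycle counts above, with clock frequencies that are my own assumptions (~2.5 GHz boost for the 4090, ~5.7 GHz boost for the 9950X), not figures from the article:

    /* Back-of-the-envelope peak FP64 throughput: ops/cycle * clock.
       Per-cycle counts are from the comment above; the clock values
       below are assumptions, not measurements. */
    #include <stdio.h>

    int main(void) {
        double ops_4090  = 288, clk_4090  = 2.5e9;  /* ~2.5 GHz boost (assumed) */
        double ops_9950x = 256, clk_9950x = 5.7e9;  /* ~5.7 GHz boost (assumed) */
        printf("4090  peak FP64: ~%.0f GFLOP/s\n", ops_4090 * clk_4090 / 1e9);
        printf("9950X peak FP64: ~%.0f GFLOP/s\n", ops_9950x * clk_9950x / 1e9);
        return 0;
    }

With numbers in that ballpark, the roughly 2x clock advantage is what flips the FP64 comparison in the CPU's favor.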


What about FP8? It is a format that is very popular for LLM inference.


AMD Zen 5 has the so-called “Vector Neural Network Instructions” (VNNI), which can be used for inference with INT8 quantization, and it also has instructions for inference with BF16 quantization.
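
To illustrate, the INT8 path is reachable through the AVX-512 VNNI intrinsics; here is a minimal sketch of one accumulation step using _mm512_dpbusd_epi32 (the helper name and build flags are my own, not from the thread):

    /* Minimal AVX-512 VNNI sketch: multiply unsigned 8-bit activations by
       signed 8-bit weights, sum groups of four products, and accumulate
       into 32-bit lanes. Build with e.g.: gcc -O2 -mavx512f -mavx512vnni */
    #include <immintrin.h>
    #include <stdint.h>

    /* Illustrative helper: one VNNI step over 64 byte pairs. */
    static inline __m512i int8_dot_step(__m512i acc, const uint8_t *a, const int8_t *b) {
        __m512i va = _mm512_loadu_si512(a);   /* 64 unsigned 8-bit activations */
        __m512i vb = _mm512_loadu_si512(b);   /* 64 signed 8-bit weights */
        /* vpdpbusd: per 32-bit lane, acc += sum of 4 adjacent u8*s8 products */
        return _mm512_dpbusd_epi32(acc, va, vb);
    }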

FP8 is a more recent quantization format and AFAIK no CPU implements it.

I do not know the throughput of these instructions on Zen 5. It must be higher than on older CPUs, but lower than on the Intel Xeon models that support AMX (which are much more expensive, so despite their higher absolute inference performance they might have lower performance per dollar), and obviously lower than that of the tensor cores of a big NVIDIA GPU.

Nevertheless, for models that do not fit inside the memory of a GPU, inference on a Zen 5 CPU may become competitive.
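
As a quick sanity check of when a model stops fitting in a single consumer GPU (the parameter counts and bytes-per-parameter below are illustrative assumptions, not from the article):

    /* Rough weights-only memory arithmetic, ignoring KV cache and activations.
       Model sizes and quantization width are illustrative assumptions. */
    #include <stdio.h>

    int main(void) {
        double params_billion[] = { 8.0, 70.0 };  /* assumed model sizes */
        double bytes_per_param  = 1.0;            /* INT8-style quantization */
        double vram_gb          = 24.0;           /* e.g. a 4090 */
        for (int i = 0; i < 2; i++) {
            double gb = params_billion[i] * bytes_per_param;  /* billions of params * bytes/param = GB */
            printf("%.0fB params @ %.0f byte/param: ~%.0f GB -> %s %.0f GB of VRAM\n",
                   params_billion[i], bytes_per_param, gb,
                   gb <= vram_gb ? "fits in" : "exceeds", vram_gb);
        }
        return 0;
    }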



