Threadripper 7000 Storm Peak CPU Surfaces with 64 Zen 4 Cores (tomshardware.com)
70 points by ekoutanov on Sept 29, 2022 | 53 comments



I recently ran some machine learning benchmarks on CPU versus GPU. The gap for a multicore CPU was much smaller than I expected. I, for one, am excited by a 64-core CPU in a way I wouldn't have been a year ago.

Between Hugging Face, Stable Diffusion, and Whisper, I'm using ML workloads a lot more. Being able to do so:

* with a standard instruction set

* with open-source software

* with my full system RAM

* without having to worry about what is in VRAM versus main RAM

is a big step up. I see about a 10x speed difference between an older 16-core CPU and a hot-off-the-press high-end Ampere card costing 3x as much as the CPU. If 64 cores could bring that to within 2x, or even 4x, I'd dump the GPU entirely.


Inference or training? I think with full training you are out of luck with CPUs; the gap is much bigger. A 64-core TR could only get to roughly 1 TFLOPS.


Eh?

My 5950X's measured throughput is ~2 TFLOPS in single precision and ~1 TFLOPS in double precision (obviously, due to half the SIMD vector size). This is a desktop-class 16-core machine.



Go and measure it yourself, if you have one :)

https://github.com/Mysticial/Flops/

You can also work out the theoretical FLOPS, which matches the experimental measurement nicely. You have to take into account:

- the clock frequency (~3.9 GHz on multithreaded workloads on my machine)

- the number of cores (16)

- the reciprocal throughput of the FMA instruction (~.5, that is, 2 instructions per clock cycle)

- the number of flops per instruction (2 for the FMA instruction, that is, 1 multiply + 1 add)

- the SIMD vector width (4 for double, 8 for float).

Putting it together:

3.9e9 * 16 * 2 * 2 * 4 = 998.4 GFlops (double)

3.9e9 * 16 * 2 * 2 * 8 = 1996.8 GFlops (single)

The measured values on my machine are a bit different, but close (1070 and 2151 respectively).
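
A minimal sketch of that calculation in Python; the constants simply mirror the figures listed above for my 5950X and would need adjusting for other chips:

    # Theoretical peak FLOPS = clock * cores * FMAs/clock * flops/FMA * SIMD width
    clock_hz      = 3.9e9   # sustained all-core clock on my machine
    cores         = 16
    fma_per_clock = 2       # reciprocal throughput ~0.5, i.e. 2 FMA instructions/cycle
    flops_per_fma = 2       # 1 multiply + 1 add
    simd_width    = {"double": 4, "single": 8}   # elements per 256-bit vector

    for precision, width in simd_width.items():
        peak = clock_hz * cores * fma_per_clock * flops_per_fma * width
        print(f"{precision}: {peak / 1e9:.1f} GFLOPS")

    # double: 998.4 GFLOPS
    # single: 1996.8 GFLOPS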

References:

https://www.agner.org/optimize/instruction_tables.pdf

https://www.agner.org/forum/viewtopic.php?t=56

https://gadgetversus.com/processor/amd-ryzen-9-5950x-gflops-...


I've tried it on a 10980XE (18-core), which got between 600 GFLOPS and 1.6 TFLOPS depending on the instruction, in quad-channel mode. I'll try a 32-core Threadripper later. The challenge there, I guess, is keeping all cores busy during training without repeating the same gradient computation (both a scheduling and a memory problem).


2 TFLOPS or 5 TFLOPS does not matter much. A 3090 Ti does 160 TFLOPS, i.e. at least 30x (!) faster.


Those are tensor FLOPS; the numbers for the Zen CPU are "general-purpose" FLOPS (sometimes called "vector FLOPS" in marketing material).

The vector FLOPS for the 3090 Ti are 33 TFLOPS for single precision and 0.5 TFLOPS for double precision. So, 16x faster than the 5950X in single precision and 2x slower in double precision. At almost 3x the price and >4x the power consumption.

Of course, if all you care about is AI, then there's no argument - but then we are not really talking about a general-purpose device any more.

The narrative of GPUs being "hundreds of times" faster than CPUs is vastly blown out of proportion for general-purpose computing.
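
A quick back-of-envelope check of those ratios in Python, using the spec-sheet vector figures here and the 5950X measurements from earlier in the thread (all numbers approximate):

    # Non-tensor ("vector") throughput, in TFLOPS
    gpu = {"single": 33.0, "double": 0.5}    # 3090 Ti spec-sheet figures quoted above
    cpu = {"single": 2.15, "double": 1.07}   # 5950X measurements from earlier in the thread

    for prec in ("single", "double"):
        ratio = gpu[prec] / cpu[prec]
        print(f"{prec}: GPU/CPU = {ratio:.2f}x")

    # single: GPU/CPU = 15.35x  (roughly the "16x faster" above)
    # double: GPU/CPU = 0.47x   (i.e. the GPU is ~2x slower)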


I think you missed that this whole discussion is in the context of deep learning; therefore your comment does not apply. It is 30x slower than the 3090 Ti for that purpose.


My initial comment was correcting a factually inaccurate statement regarding CPU performance.

It is you who barged into the thread with unrelated GPU performance numbers, but whatever :)


You are missing the forest for the trees.

Here's the comment you are allegedly trying to "correct":

> with full training you are out of luck with CPUs; the gap is much bigger. A 64-core TR could only get to roughly 1 TFLOPS

1 TFLOPS is not the main part of that statement, and it is qualified with "roughly", which I suppose is not too far from the truth in this context. And the context is "training ... the gap is much bigger", and in this case "much" is at least 30x, even with the updated number.


Zen 4 will do 16-bit BFloat FP, so one would expect it to do a lot better than Threadripper on ML training applications?


This Threadripper is Zen 4.


Fair enough. My benchmark was inference. I care about inference much more than training for most of what I do.


Recently, I've been implementing my custom inference code in C for various models (GPT, Whisper) and am interested to see how it compares to various GPUs in terms of performance. So far, I've been running it only on my MacBook M1 as I don't have the necessary hardware.


It seems super easy for someone to fake these identifiers sent back to projects like Folding@Home...

People might do it just for fun, or maybe to manipulate the share price (make performance better or worse than expected), or maybe even for marketing.


Is anyone using 64 cores besides Linus? :) I'm much more excited for the 12-core 7900X than for 64 cores. But I understand that the limited number of people who need this much power on a desktop can also be excited.


If I'm not lighting up my Windows Task Manager, what was even the point of making money?


People trying to cram 500 VPS customers onto a single box!


I could use an almost unlimited number of cores for fuzzing and compiling. Currently I have to limit my fuzzing runs to 12 cores because the 3-year-old AMD machine can't handle more without impacting other development work.


But then, if 64 is not useful to you, why the 7900X instead of the 7700X and its 8 cores? The 7700X is way less power-hungry and boosts to nearly the same speed as the 7900X.

Genuinely asking as I plan to replace my Ryzen 3700X with a 7700X.


I'm on a 3900X and was planning to upgrade to a 7900X; I tend to run a few VMs. But I'm not sure yet. It would be cool to get on DDR5, but it feels like this time it would be an upgrade just to have an upgrade. So, not sure.


Some common dev workloads that benefit: huge builds (especially for C++ and Rust), running lots of VMs to run a copy of a cloud infrastructure locally, emulating foreign hardware for testing (QEMU), and large-scale data analytics done locally instead of paying some ridiculously expensive SaaS to do it.


Parallel builds of C++.


It would be handy for people who use Gentoo :)


You might want to wait for the 7900X3D, based on what we're seeing with the 5800X3D vs. the latest Zen 4...


I'm actually considering upgrading to a 5800X or 5800X3D as a cheap temporary upgrade, since the newest generation (which I initially planned on) just seems too expensive given the need for new DDR5 and new, very expensive motherboards, which will likely need at least another year to mature. So far I've been leaning towards the fairly cheap 5800X (280€ vs. 450€ for the 5800X3D), since the difference doesn't actually look that big for real workloads (and since it's an upgrade for a shorter-than-usual timeframe). Is the 5800X3D actually 60% better in non-gaming workloads, enough to be worth it? If not (as it seems to me), I'm not sure why waiting specifically for the next 3D part makes sense.


X3D seems to shine in gaming, but it likely helps with other code that is not computation-heavy as well. If you don't need AVX-512 or higher memory bandwidth, either 5800 CPU is probably good. The X3D is going to be a bit more future-proof, pushing your next upgrade further out, based on how well the 5775C holds up even today.


I wondered whether you meant Torvalds or Sebastian.


I'm reminded of ... that ... was it a POWER CPU or RISC? The one with the 1024 CPU cores. I did a search but can't find it.

Essentially a 1024-core SoC server, affordable. Compared to that, 64 cores sounds rather unimpressive, IMHO.


Having 64 powerful cores sounds more impressive than 1024 weak cores.

Also, why even mention that 1024-core CPU instead of UPMEM? 128 cores per DIMM slot and up to 2560 cores in a single machine, and they are fast precisely because they are attached directly to memory, with a total memory bandwidth of 2.56 TB/s.



Are you thinking of the old SPARC Niagara?


Parallella?


That was the platform; the company was Adapteva. Cavium and Tilera also had more cores than this approximately a decade ago. "Manycore" would be the generic term to search for.


IMHO, if CPU manufacturers figure out how to slap a large cache on the same die (something like AMD's 3D V-Cache, but much more of it), we may actually see graphics cards become obsolete in favor of software rendering.


Specialized silicon will always beat general purpose silicon.

It is true that a chip like this could probably render pretty decent 3D in software, though. I wonder if combining this with the GPU in a clever way could allow more people to experience real-time ray tracing?


> Specialized silicon will always beat general purpose silicon.

The whole history of PCs is repeatedly proving otherwise. The NES had hardware sprites. Then Carmack & Romero showed up and proved you can have smooth side scrolling in software, on an underpowered CPU. The whole concept of a PPU was thus rendered obsolete. Repeat for discrete FPUs, discrete sound cards, RAID cards (ZFS), and so on.

Specialised silicon will beat general purpose silicon at the given task, until general purpose silicon + software catches up. You need to keep pouring in proportional R&D effort for the specialised silicon to stay ahead.

What keeps GPUs relevant is that they're in fact much more general than what the "G" originally stood for.


CPUs have integrated a lot of specialized silicon as transistor budgets increased. x86 treats integer and floating-point arithmetic as separate things because the math coprocessor used to be a separate and optional chip. Nowadays it's GPU cores making the migration, but that's hardly going to be the end of it.


When the second generation of EPYC came out, Linus ran a "software rendered" version of Crysis that did all rendering on CPU cores instead of GPU shader units. At 640x480 it ran alright.


Possibly - there are a lot of ray tracing algorithms that don't really work well on GPUs (anything MCMC, for instance). But context- and time-aware denoising seems to be able to compensate.


The GPU is a bandwidth monster; the CPU is a latency monster. You can't have both on the same silicon.


Every console SoC designed by AMD proves you wrong.


An APU has two parts; there is no GPU/CPU hybrid in the same functional package. They are separate modules.


Finally I can compile Chrome and Firefox faster


Can I use it in the base of my kettle to boil water?


The CPU is certified by AMD to run at up to 105 °C, but it thermal-throttles automatically at 95 °C, so out of the box it's probably not enough to boil water, but just barely :P

The fun fact is that if you manually reduce the power limit to 65 W, the initial single-thread results show virtually no loss in ST performance vs. 170 W, and it appears that the original AMD slides claiming 75% more efficient cores at that level are not too far off.


The previous generation of Intel and AMD CPUs could not consume more than 20 to 30 W with a single active core (non-overclocked).

So with the power limit set to 65 W or more, the single-thread performance was always limited by the maximum turbo frequency (which may depend on the temperature of the CPU) and never by the power limit.

I have not yet seen any published value for the single-core power consumption of Zen 4, but it is likely no higher. It is certainly much less than 65 W, even at 5.85 GHz.

So the expected behavior is that single-thread performance does not depend on whether you set the steady-state power limit in the BIOS to 170 W, 105 W, or 65 W. Only multi-threaded performance is modified by the power limit, because when the limit is reached, the clock frequency is decreased until the power consumption matches it.


Note that anyone trying to boil a kettle with a 170 W heater would be waiting around for tens of minutes.
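
A rough worked example, assuming 1 litre of water heated from 20 °C to 100 °C and every one of the 170 W ending up in the water (no losses):

    # Energy to heat water: E = m * c * dT
    mass_kg = 1.0      # 1 litre of water (assumption)
    c_water = 4186     # J/(kg*K), specific heat capacity of water
    delta_t = 80       # 20 degC -> 100 degC
    power_w = 170

    energy_j = mass_kg * c_water * delta_t   # ~335 kJ
    minutes  = energy_j / power_w / 60
    print(f"{minutes:.0f} minutes")          # ~33 minutes, ignoring all losses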


Those are the consumer variants; the Threadrippers will almost certainly not be rated at a lower TDP than the current gen's 280 W. If they increased it by the same percentage as they did for the consumer parts, it'd be 450 W, but that's unlikely; 350 W might be in the cards, though.


Although if you just want to make hot drinks, you probably don't need 100 °C anyway; kettles just use that as an easy off switch.


It's AMD, not Intel


Great for Monero mining

edit: seethe



