AMD Ryzen Threadripper 3000 32-core CPU is more bad news for Intel (zdnet.com)
277 points by rbanffy on Sept 20, 2019 | 156 comments

Worth noting AMD just announced that the first third-gen Threadripper will come out in Nov with 24 cores, and the planned 16-core mainstream chip (the 3950X) is delayed to November: https://www.tomshardware.com/news/amd-ryzen-9-3950x-delay-la...

A 32-core chip is still almost certain to show up, since benchmarks have leaked, and folks have also leaked some pics of 3950X's and packaging, but I guess supply/demand have kept everything from coming out yet.

As the Tom's post notes, server and client chips use the same chiplets, so it could be that most of the higher-binned ones are going to server parts; some are higher-margin (the 7742 is almost $100/core, vs. the 3950X under $50/core), and it's plausible some enterprise orders (AWS etc.) are getting priority. I wonder if they tightened the binning for the 3950X to respond to the fuss over turbo, too.

I had never really considered everything that goes into getting to that SKU/price list and launch date: you don't really know how much supply you'll have at various perf levels, or what demand there will be for what at what price, and if you don't exactly match them you might end up losing money by underpricing, by having to nerf good silicon to fill highly-demanded lower-end SKUs, or by having a shortage that makes customers go elsewhere. (Plus vendors don't just have the public SKU list, they're working out contracts with OEMs/other big customers. And who knows what the competition will do.) High stakes, and practically speaking no backsies; seeing how upset folks are about turbo clocks, imagine if AMD announced a price hike. Glad it's not my job.

> As the Tom's post notes, server and client chips use the same chiplets, so it could be that most of the higher-binned ones are going to server parts; some are higher-margin (7742 is almost $100/core, vs. 3950X under $50/core)

The 7742 is the "halo" chip, though. Or was, rather, until the Epyc 7H12 was announced.

Other Epycs have much lower $/core. For example the 24-core EPYC 7352 is $1350, making it $56/core. The 16-core EPYC 7282 is actually even cheaper than the 3950X at $650, or $40/core.

No doubt bulk orders are going to get the priority, but the margins may not actually be that different depending on what companies are actually bulk-ordering. The $/core drops pretty quickly even just going down slightly in the stack. The 32c 7452 is $65/core.

And we don't entirely know which aspect of the binning is the limiting factor. If it's just functional cores, that's going to be different from whether they run at the right frequency/voltage. Epyc is all 225 W or lower. For Threadripper 3000, AMD could pretty easily slap a 300W TDP on it and ram voltage through the chips that couldn't cut it at Epyc specs. The 2990WX is, after all, a 250W TDP part. The existing socket is already spec'd for more power capability than the top-end Epyc.

I don't really hard-disagree with any of that; we're all guessing. The co. sending chiplets where they fetch the most also doesn't seem outlandish as a factor, though.

FWIW, here's the price-per-core chart for the whole second-gen server line:


(Doesn't factor in voltage/freq needs for different SKUs, some of them needing fully-working chiplets, etc. Still.)

The low pricing on the 7282 and a few others is interesting, too. Wonder if it's a factor that it can use partially-working chiplets since the server I/O die can take up to eight chiplets and the client die only two, or if that's totally unrelated.

They also get cheaper if you only buy the single socket versions (assuming you don't need dual socket). The 7702P (the single socket 64 core) is $4425, or $70/core.
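FWIW, the $/core figures quoted in this subthread are easy to recompute (list prices and core counts exactly as stated in the comments above):

```python
# Quick sanity check of the per-core prices quoted above.
skus = {
    "EPYC 7352":  (1350, 24),   # "$1350, making it $56/core"
    "EPYC 7282":  (650, 16),    # "$650, or $40/core"
    "EPYC 7702P": (4425, 64),   # "$4425, or $70/core"
}
for name, (price, cores) in skus.items():
    print(f"{name}: ${price / cores:.2f}/core")
```

The quoted figures check out: $56.25, $40.62, and $69.14 per core respectively.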

TSMC has also had issues meeting demand directly; they're at a 6-month backlog for 7nm. It seems like demand has been huge, and a lot of server vendors who have been waiting to upgrade servers are going AMD and pulling the trigger now.

The 3900X has been pretty much sold out since launch, and sells out within hours of inventory showing on Newegg/Amazon. Been using a 3600 on a new build, waiting for the 3950X. And likewise a lot of the aftermarket-design RX 5700 XT cards have been selling out quickly. Took 4 tries to get the Gigabyte one. First two times, it sold out before I saw the notice; the third time it sold out while it was in my cart going through checkout. Then I got the order in.

AMD is in a really good position right now.

Intel still has the lead in low idle power which is good in laptops.

Ryzen lets different cores have different max boost frequencies, so if your code is single-threaded and your operating system isn't new enough to schedule threads onto the fastest cores, that could be a reason to go Intel. Likewise if it's single-threaded and can take advantage of AVX-512 or the Intel Math Kernel Library.

But otherwise?

The main selling point for Intel isn't speed anymore.

Intel still has superior performance counters and debugging features. Mozilla's rr (Record and Replay framework) only works on Intel for example, and Intel vTune is a very good tool. AVX512 is also an advantage, as you've noted.

There are other instruction set advantages: I think Intel has faster division / modulus operator, and also has single-clock pext / pdep (useful in chess programs).

For most people however, who might not be using those tools, I'd argue that AMD's offerings are superior.

It also has superior vulnerabilities

For the consumer market, these small advantages are not worth considering, imho. AMD processors do more and cost less, and power consumption is being optimised at each iteration.

For the server/pro market they might be worth considering, but again, the huge BOM cut that you get by choosing an AMD processor might be worth the performance penalty.

Is AVX512 really an advantage though?

Intel is infamous for severely downclocking the processor for these and other AVX/SSE family instructions, to a point where sometimes using them makes the program slower than it would be otherwise, especially if you're constantly provoking frequency switches between them and regular instructions.

AMD might not have implemented AVX512 specifically yet (there's nothing legally keeping them from doing so however, they have patent sharing agreements with Intel regarding the entire x86/x64 ISA and extensions), but what they currently DO have is all common SIMD extensions implemented (up to SSE4 and AVX2 if I'm not mistaken) without incurring any frequency penalties on clock speeds for using them.

I can live without AVX512 for now, even though I'd be happier to have it. But I would really rather not have it if it came out in the same crap implementation that Intel has.

You can look at binning statistics for non-avx/avx2/avx512 clock speeds: https://siliconlottery.com/pages/statistics

For example, the worst 7980XEs do 4.1/3.8/3.6 GHz for each of these respectively. 0.5 GHz down clock isn't too bad. You can change these settings in your bios however you'd like on the unlocked CPUs (ie, the HEDT lineup + W3175X). I do find those down clocks are necessary; I've passed 100C with a 360mm radiator and roaring fans. AVX2 loads don't get anywhere near that hot.

But for all that, I do see a substantial benefit on many workloads from avx512 -- at least 50% better performance than what I'd get from avx2.

I definitely think it's nice to have, especially if you enjoy vectorizing code and looking at assembly. With much bigger performance wins on the line, it's more rewarding and more fun -- and you have more tools to play with, like vector masking (unfortunately, gather/scatter have been disappointing). Fun or not though, if you offer me avx512 on one hand vs twice the cores with full avx2 for the same price on the other, I'd have a hard time rationalizing avx512.

> I've passed 100C with a 360mm radiator and roaring fans

Damn, which radiator? Was there a GPU under load in your water loop?

Celsius S36. My GPU is on a separate loop and was idle at the time.

I was running benchmarks of Intel MKL's zgemm vs zgemm3m because of a Julia PR that recommended replacing the former with the latter. I don't think anything hits a CPU quite as hard as a good BLAS.

I think my thermal paste may be bad, because the CPU idles hot -- nearly 35 C. I ordered a Direct Die-X and MO-RA-420 radiator, so I'm planning on swapping the AIO for an open loop with way more radiator area and flow through the fins.

Running dgemm, that CPU would hit just a tad below 2 teraflops. I'd like to get it just over that (and run much cooler).

I personally run AMD at home, but in all the benchmarks I have seen, Intel wins by a large margin in AVX enabled tasks - often double the speed. Downclocking did not seem to be a factor here.

I think saying that Intel wins in AVX tasks is absolutely fair.

For example, I had a simulation that had to run on the CPU for reasons, but made use of AVX. Intel was consistently faster on any system I tested.

The 7nm Ryzen parts should have double the AVX2 throughput of the older parts. Zen1 has half the throughput per cycle (or twice the reciprocal throughput) when using ymm (256-bit) registers vs xmm (128-bit) in general.


If you want to look at `vmov`s or arithmetic like `vadd` or `vmul`: particularly glaring is that for moves between memory and a register, Zen1 has a reciprocal throughput of 1 for xmm but 2 for ymm, i.e. on average it can complete an xmm-memory move once per cycle, but a 256-bit move only once every two cycles. Skylake-X instead has 0.5 for xmm/ymm/zmm-memory. That is, it can move up to 512 bits between a register and memory twice per cycle, which is eight times the throughput.

Arithmetic isn't as bad, but Zen1's reciprocal throughput goes from 0.5 on xmm to 1 on ymm, while Intel stays at 0.5 independent of vector size.

I haven't seen data on the 7nm Ryzen parts, but their marketing claimed it was supposed to have full width avx2, so I imagine things are different now, and that 7nm Ryzen will do just as well for avx workloads per core and clock as all the Intel parts without avx512.

EDIT: Some instructions on Intel get slower with wider vectors, like vdiv, vsqrtpd, vgather...
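The "eight times" figure falls straight out of the reciprocal throughputs quoted above (this is just arithmetic on the quoted numbers, not a measurement):

```python
# Sustained bits moved per cycle = vector width / reciprocal throughput.
def bits_per_cycle(vector_bits, recip_throughput):
    return vector_bits / recip_throughput

zen1_ymm = bits_per_cycle(256, 2.0)  # one 256-bit memory op per 2 cycles
skx_zmm = bits_per_cycle(512, 0.5)   # two 512-bit memory ops per cycle
print(skx_zmm / zen1_ymm)            # 8.0
```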

> sometimes using them makes the program slower than it would be otherwise

oh no, that program will be super fast.

It would make other programs that run concurrently with the AVX one slower.

Even if you do, in most cases you won't use them on all your workloads. This means not all your boxes need to be Intel.

I would hope that your production hardware matches your developer / staging / testing hardware.

Let's say production is 50% slower than what's tested in staging / developer test cases. Is it the data in production that causes this performance loss? Or is it hardware differences?

If you are using Intel tools to debug performance problems in developer / testing stages, you probably want to keep using those Intel tools in staging / production. There are enough cache, instruction-level (speed of the division instruction, PEXT vs PDEP), branch-predictor, and TLB differences between the chips.

Intel has interesting optimizations: an Intel Ethernet card can drop the data off in L3 cache (bypassing DDR4 RAM entirely). These little differences in the driver / motherboard / CPU can make a huge difference in performance, and complicate performance testing / performance debugging.

If you are deploying to AMD hardware for production, you probably want to be running AMD hardware in testing / developer stages as well. You want all your hardware performing as similarly as possible.

Said interesting networking optimization is a gaping security hole that has already been exploited in the wild.

From my understanding, that vulnerability exists only if RDMA is also enabled.

RDMA, the ability to share RAM as if it were local RAM (through a memory-mapped IO mechanism) across Ethernet is not a common setup. The fact that you can perform cache-timing attacks over RDMA + Intel L3 cache is a testament to how efficient the system is if anything.

Consider this interpretation: RDMA + DDIO is so fast, you can perform cache-timing attacks over Gigabit Ethernet(!!). NetCAT (the "vulnerability" you describe) is proof of it.

Cache-timing / side-channel attacks aren't exactly the kind of vulnerabilities that most people think of, though. It's kinda cool, but it's nothing as crazy as Meltdown / Spectre were.

>Mozilla's rr (Record and Replay framework) only works on Intel for example

Can you source that? I can't find it on the Wikipedia page[1] or its homepage[2].

[1]: https://en.wikipedia.org/w/index.php?title=Rr_(debugging)&ol... [2]: https://rr-project.org/

I looked for some numbers on low idle power; apparently the 3000 series significantly improved things there: https://www.reddit.com/r/AMD_Stock/comments/ado0ix/ryzen_mob.... This reviewer seems impressed, something about only using 10W at idle on the desktop: https://youtu.be/M5pHUHGZ7hU?t=363.

And a head-to-head test of a laptop available in AMD and Intel variants says it has better battery life, although the screen panel could be the reason: https://www.notebookcheck.net/Lenovo-ThinkPad-T495-Review-bu...

I'd like more info, and Intel still probably has a lot of firmware tweaks etc. that AMD has to implement to win microbenchmarks, but to a first order it's not clear Intel has a lead there anymore.

The BLAS & Lapack subset of the API of the Intel Math Kernel Library (MKL) is very well implemented in open source projects such as OpenBLAS and BLIS:


Both are well optimized for AMD CPUs.

I work in this space... and let's just say that MKL is definitely NOT well optimized for AMD's chips. You'll be lucky to get 10-20% efficiency. Never mind OpenBLAS.

This is well documented: https://www.agner.org/optimize/blog/read.php?i=49#49, https://www.agner.org/optimize/blog/read.php?i=49#112.

It goes very far back to MMX: https://yro.slashdot.org/comments.pl?sid=155593&cid=13042922

tldr: Intel's compiler doesn't optimize using standardized instructions on non-Intel hardware.

Intel optimizes these libraries down to the stepping level of the CPUs, so it's not surprising if they are not optimized at all for AMD.

Is it anything like the way their compiler detected SSEn in a way that guaranteed it wouldn't use those instructions on AMD processors even if they supported them?

Of course. It's very much intentional, and "not optimized for AMD" is putting it very very mildly. They don't need to optimize purely for stepping level, they could provide sane codepaths for when the CPU flags indicate certain features.

See my other comment on this topic.
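To make the distinction concrete, here's a toy sketch of model-based vs feature-bit dispatch (hypothetical function names and a deliberately simplified rule set, not Intel's actual dispatcher code):

```python
# Model-based dispatch, as described in this thread: only known
# (vendor, family) pairs ever get the fast path.
def dispatch_by_model(vendor, family):
    fast_models = {("GenuineIntel", 6)}
    return "fast_avx2" if (vendor, family) in fast_models else "generic"

# Feature-bit dispatch, the "sane codepath" alternative: trust the
# advertised CPUID feature flags, regardless of vendor.
def dispatch_by_features(features):
    if "avx2" in features:
        return "fast_avx2"
    if "sse2" in features:
        return "fast_sse2"
    return "generic"

# An AMD chip that supports AVX2 gets the slow path under the first
# scheme and the fast path under the second.
print(dispatch_by_model("AuthenticAMD", 23))        # generic
print(dispatch_by_features({"avx2", "sse2"}))       # fast_avx2
```

The first scheme also fails forward in time: a future Intel family number falls off the list too, which is exactly the complaint quoted below.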

Yes, but it wasn't a simple `if (not AMD)`. It was a series of checks based on specific families of Intel CPUs, such as Haswell, Sandy Bridge, etc. So it was never actually querying whether the CPU supported instruction X; it was asking what family it belonged to and then applying static rules based on that. Maybe a nuance, but it also has the potential to hurt their own processors if not kept up on, so maybe less malice and more convenience?

They've been explicit about their motivations in this regard (claiming innocence). Then they backtracked when convenient (surprise!), but in a way that still broke AMD processors. See here: https://www.agner.org/optimize/blog/read.php?i=49#49

By the way, it's interesting to note that Intel has a disclaimer on every MKL documentation page about this; my speculation: this was required by terms of a settlement.

From the above link:

>The Intel CPU dispatcher does not only check the vendor ID string and the instruction sets supported. It also checks for specific processor models. In fact, it will fail to recognize future Intel processors with a family number different from 6. When I mentioned this to the Intel engineers they replied:

> > You mentioned we will not support future Intel processors with non-'6' family designations without a compiler update. Yes, that is correct and intentional. Our compiler produces code which we have high confidence will continue to run in the future. This has the effect of not assuming anything about future Intel or AMD or other processors. You have noted we could be more aggressive. We believe that would not be wise for our customers, who want a level of security that their code (built with our compiler) will continue to run far into the future. Your suggested methods, while they may sound reasonable, are not conservative enough for our highly optimizing compiler. Our experience steers us to issue code conservatively, and update the compiler when we have had a chance to verify functionality with new Intel and new AMD processors. That means there is a lag sometime in our production release support for new processors.

> In other words, they claim that they are optimizing for specific processor models rather than for specific instruction sets. If true, this gives Intel an argument for not supporting AMD processors properly. But it also means that all software developers who use an Intel compiler have to recompile their code and distribute new versions to their customers every time a new Intel processor appears on the market. Now, this was three years ago. What happens if I try to run a program compiled with an old version of Intel's compiler on the newest Intel processors? You guessed it: It still runs the optimal code path. But the reason is more difficult to guess: Intel have manipulated the CPUID family numbers on new processors in such a way that they appear as known models to older Intel software. I have described the technical details elsewhere.

Parent said OpenBLAS and Blis, not MKL, are optimized for AMD.

I feel like at this point if you use an Intel library or compiler you should know it's Intel-only. If you aren't using it in a controlled environment, stick to clang/gcc.

I can’t really blame them. Why support your competitor?

does this remain true on the zen2 cpus which finally do avx properly?

Intel is famous for checking for 'IntelInside' instead of cpu feature bits, and taking a generic and slow code path if it's not IntelInside.

I think most server operators look at overall performance. Once you start buying hardware specifically for one purpose you're cornering yourself.

Besides, who spends $15,000 on a mid-high end server to run single threaded applications anyway?

I happen to know of several companies doing physics problems that scale poorly across cores that spend far north of that, usually building out small clusters. Then you run 100s of independent simulations since each individual one doesn't really scale.

You seem to be saying that both

a) Single instance of application doesn't scale over multiple cores, and

b) Multiple instances of application scales well over multiple independent servers

Can you explain why they are unable to efficiently run multiple instances of the application on the same CPU (with multiple cores)?

The only thing I could think of would be running up against IO/Memory bandwidth limits.

They can, what I'm saying is that a single application doesn't scale well over multiple cores. Multiple instances on a single cpu generally works fine, but the biggest impact on performance is per core speed.

Edit: I was really just responding to "who spends $15,000 on a mid-high end server to run single threaded applications anyway?". I would absolutely consider this a "single threaded application".

What are the physics problems?

Fluid flow and most particle simulations with a large number of particles. The limiting factor is the inter particle interactions, so all the calculations have to feed back into each other.

Both of those problems are well worn and can scale to as many cores as we can put in a single computer.

Whether it is a navier-stokes grid/image fluid simulation, arbitrary points in space that work off of nearest neighbors or a combination of both (by rasterizing into a grid and using that to move the particles), there are many straightforward ways to use lots of CPUs.

Fork-join parallelism is a start. Sorting particles into a kd-tree is done by recursively partitioning, and the partitions can be distributed among cores. The sorting structure can be read but not written by as many cores as you want, and thus their neighbors can be searched and found by all cores at once.
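A minimal sketch of that fork/join kd-tree build (my own toy code, not from any particular simulation package; in CPython the threads only illustrate the fork/join structure, since the GIL serializes the sorts, so a real build would use processes or native threads):

```python
from concurrent.futures import ThreadPoolExecutor

def build_kdtree(points, depth=0, pool=None):
    """Recursively partition 2-D points by the median along alternating
    axes; the two halves at the top level are built in parallel."""
    if not points:
        return None
    axis = depth % 2
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    left, right = points[:mid], points[mid + 1:]
    if pool is not None and depth == 0:
        # Fork: each partition goes to its own worker; join on the results.
        fl = pool.submit(build_kdtree, left, depth + 1)
        fr = pool.submit(build_kdtree, right, depth + 1)
        children = (fl.result(), fr.result())
    else:
        children = (build_kdtree(left, depth + 1),
                    build_kdtree(right, depth + 1))
    return {"point": points[mid], "axis": axis, "children": children}

def collect(node):
    """Flatten the tree back into a list of points (for checking)."""
    if node is None:
        return []
    return (collect(node["children"][0]) + [node["point"]]
            + collect(node["children"][1]))

if __name__ == "__main__":
    pts = [(3, 1), (0, 4), (2, 2), (5, 0), (1, 3), (4, 5)]
    with ThreadPoolExecutor(max_workers=2) as pool:
        tree = build_kdtree(pts, pool=pool)
    assert sorted(collect(tree)) == sorted(pts)
```

Once built, the tree is read-only, so every core can run neighbor queries against it concurrently without locking, which is the point being made above.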

simulations that don't scale, that do scale after all.

If you spawn 100 independent instances, it's not really the problem itself scaling. The point is that given a single set of operating conditions you won't see any meaningful gains going from 2 to 100 cores. Using idle resources for other simulations doesn't make the problem itself scale.

I would say most apps don't benefit from multiple cores. So single threaded performance is still important.

I hear a lot about AVX-512 being really good.

Is there any software that's commonly used that has a measurable performance boost with it? Or is it more specialised stuff?

> I hear a lot about AVX-512 being really good.

It's a great instruction set. Absolutely great. AVX512 supports gather/scatter, a whole slew of efficient processing instructions, etc. etc.

However, AVX512 has poor implementations right now. Skylake-X is one of the only implementations, and running it drops the clock-rate in ways that are difficult to predict. (One core running AVX512 drops the clock of other cores, slowing down the throughput of the entire server).

Traditionally, the first implementation of these instruction sets is always to a degree "emulated". For example, the gather/scatter instructions aren't much faster than scalar loads/stores in practice.

So while the AVX512 instruction set could theoretically be efficiently implemented, it seems like Skylake-X's implementation leaves much to be desired. Hopefully future implementations will be better.


The other major implementation of AVX512 is Xeon Phi, which has been deemed end-of-life. I like the idea of Xeon Phi, but it just didn't seem to work out in practice.

the clock rate has nothing to do with a bad implementation, all the computation just makes a lot of heat. the performance increase is still massive, often more than 2x avx, which also throttles btw

The clock rate issue isn't the fact that it downclocks a core when moving to AVX512, it's that it downclocked all the other cores on the processor at the same time.

From what I have read, AVX512 only affects the one core (downclocking License level L1 or L2), it is older CPUs with AVX2 that affected all cores.

Independently, thermal throttling can occur, which would affect all cores; although presumably the CPU generates heat per numeric operation, so AVX512 is neutral versus other instructions per numeric operation.

on intel cpus the license levels are basically discrete thermal throttles, vs amd, which doesn't do that and just constantly monitors thermals and adjusts the clock.

intels method makes benchmarking simpler! but may leave performance on the table.

i understand why downclocks are an issue, and i understand that on some intel cpus the whole chip downclocks with certain instructions. i was commenting on the supposed reason the downclocks happen, and mentioning that performance is spectacular despite them (assuming you schedule your workload appropriately).

One challenge with AVX-512 is that it can actually _slow down_ your code. It's so power hungry that if you're using it on more than one core it almost immediately incurs significant throttling. Now, if everything you're doing is 512 bits at a time, you're still winning. But if you're interleaving scalar and vector arithmetic, the drop in clock speeds could slow down the scalar code quite substantially.


Also see https://lemire.me/blog/2018/09/07/avx-512-when-and-how-to-us...

* The processor does not immediately downclock when encountering heavy AVX512 instructions: it will first execute these instructions with reduced performance (say 4x slower), and only when there are many of them will the processor change its frequency. Light 512-bit instructions will move the core to a slightly lower clock.

* Downclocking is per core and lasts for a short time after you have used particular instructions (e.g., ~2ms).

* The downclocking of a core is based on: the current license level of that core, and also the total number of active cores on the same CPU socket (irrespective of the license level of the other cores).

The latest kernel has an API for knowing at runtime whether AVX-512 creates throttling, allowing you to dynamically disable it when it decreases performance.

How fast does the CPU step up and down the throttling caused by AVX-512?

Basically, if you are interleaving like you suggest, does the processor detect this and reduce the throttling by the "duty cycle" of 512-bit operations?

If not, could there be a way to tell the CPU to do this?

>How fast does the CPU step up and down the throttling caused by AVX-512?

It's actually really, really slow. On newer (I think around Skylake-X, which is when AVX-512 was introduced) CPUs it takes up to 500 microseconds, i.e. millions of cycles, to activate AVX-512. This can't really be made faster because they actually need to give the voltage regulators time to adjust or the chip literally browns out. During this time AVX-512 instructions execute on the AVX-256 datapath¹.

Once AVX-512 is activated the clock of that core is reduced by about 25% and it starts a 2ms timer which is reset whenever another AVX-512 instruction is issued. AFAIK Intel doesn't say how long it takes to raise the frequency again once the timer expires.

(This is something of a simplification because there are actually two AVX power licenses, the first allowing AVX-256 and a limited set² of AVX-512 instructions, reducing clock by about 15%, and the second allowing everything. Also, executing a single AVX-512 instruction doesn't immediately request a higher power license, you have to execute a certain number of them.)

This is actually the better version. On Haswell executing any AVX-256 instruction would reduce the frequency of every core by about 15-20%. But hey, at least it only takes about 150k cycles to activate (not much of a consolation, I know). Beats me how long it stays throttled for.

(I don't know what exactly Broadwell did. I don't think it throttled all cores, but it didn't have the additional power license with reduced throttling that Skylake has.)

¹ Or the 128-bit datapath if the core is at the lowest power license (which still lets you use 128-bit SSE instructions, and basic AVX-256 instructions).

² Basically anything that doesn't execute on the floating-point unit, which means no floating point and no integer multiplication (which uses the FPU). This is actually kinda the saving grace of the whole thing, since it means you can vectorize things like memcmp and strlen without requiring the highest power license.

From what I've read on AVX-512, the big disadvantage is that the AVX-512 instructions are very CPU intensive so the max clock speeds is reduced if you heavily use AVX-512 instructions.

Another disadvantage is that you have to recompile your code to use AVX512 but it seems general enough that compilers will use the instructions (to an extent) without specialized code [1].

[1] https://www.phoronix.com/scan.php?page=news_item&px=GCC-8-AV...

The problem appears to be that whether you get a performance increase or a performance penalty depends on the duty cycle of your AVX512 instructions.

What is especially deceiving is profiling a function in a loop for more than say 50ms, when the normal function execution takes say 0.5ms. Long running functions get the most gain, while short running functions cause the most pain.

That is because AVX512 downclocking lasts 2ms (with a 0.5ms setup). Certain instruction mixes will cause a general slowdown (10% degradation measured by Cloudflare under actual usage) even though test profiling might predict a performance gain. Single AVX512 instructions while the CPU is running at full speed have a counterintuitive, perverse performance penalty, apparently running 4x slower than when the CPU has changed to the slower L1 or L2 clocks.

Sustained AVX512 usage has predictable performance.

“ Intel made more aggressive use of AVX-512 instructions in earlier versions of the icc compiler, but has since removed most use unless the user asks for it with a special command line option.” is a strong indication that you need to be very careful about where you use the instructions.

Running an encoder for 1 second - likely candidate. Occasional 1ms functions or single AVX512 instructions on a web server - likely penalty.
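This trade-off is easy to model. Here's a toy steady-state calculation, using the rough figures quoted in this thread (~25% downclock while the AVX-512 power license is active, license lingering ~2 ms past the last AVX-512 instruction, ~2x speedup on the vectorized code itself). It's a simplification, not a hardware-accurate simulation:

```python
# Toy model: a loop runs an AVX-512 burst of `burst_ms` once every
# `period_ms`, with scalar code the rest of the time. We compute work
# delivered per period relative to an all-scalar, full-clock baseline.
def net_speedup(burst_ms, period_ms, vector_speedup=2.0,
                clock_penalty=0.25, linger_ms=2.0):
    throttled_ms = min(period_ms, burst_ms + linger_ms)
    # Vector burst: faster per cycle, but at the reduced clock.
    vector_work = burst_ms * vector_speedup * (1 - clock_penalty)
    # Scalar code caught inside the lingering throttle window.
    scalar_throttled = (throttled_ms - burst_ms) * (1 - clock_penalty)
    # Scalar code after the clock has recovered.
    scalar_full = period_ms - throttled_ms
    return (vector_work + scalar_throttled + scalar_full) / period_ms

print(net_speedup(burst_ms=8.0, period_ms=10.0))   # long burst: a win
print(net_speedup(burst_ms=0.5, period_ms=10.0))   # short burst: a loss
```

With these numbers, a sustained 8 ms burst nets about a 1.35x speedup, while an occasional 0.5 ms function comes out slightly slower than not vectorizing at all, matching the "long running functions get the most gain, short running functions cause the most pain" observation above.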

I've used it once but on Intel the clock is throttled with AVX-512 so the overall program performance improved only 1% because the non-SIMD code was running at a slower clock.

Newer versions of the x265 encoder for h.265 / HEVC video standard get significant speed-ups. I believe Handbrake was recently updated for it.

I think RPCS3 also uses it in the JIT

I just built a new workstation and went AMD. I'm amazed by the performance. I have a Threadripper 1920 and it performs really well under many different loads (compiling, gaming, video editing). It also handles my dev VMs very well. Of course this is purely based on my feeling, but I had an i7-9700 before and while I had faster FPS in some games, it was really bad at handling VMs.

This is subjective, but I have a really good feeling about AMD, both on CPU and GPU side.

Only thing I could ask is for more open source (open source CPU firmware), but I think this is only a dream.

I have the original Threadripper 1950X overclocked to 4.1ghz and I’ve been pretty happy with it. As a developer I can’t find much reason to upgrade 2 years after its release. Single core improvements (from AMD or Intel) aren’t game-changing. More cores won’t do anything for me (.NET Core, angular, etc) I’ve considered switching back to intel for their 5ghz processors but based on benchmarks, I wouldn’t see anything but marginal improvements. All in all, competition is good, but I wish they’d focus on single core performance improvements.

They're doing both. AMD has ramped single-core performance, at least from an IPC perspective, massively. From Excavator (2015) to Ryzen we saw a 52% increase in IPC. From Ryzen 1000 to 2000 (Zen+) we saw a 3% increase in IPC. From Ryzen 2000 to 3000 (Zen2) we saw a 15% increase in IPC.

From 2015 to 2019 we saw a total IPC boost on the AMD side of over 80%, and an increase in max core count in the desktop line of 433%, from 6 to 32.

I'm okay with this progress :)
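The ~80% total falls out of compounding the per-generation figures quoted above:

```python
# Compounding the generational IPC gains: Excavator->Zen (+52%),
# Zen->Zen+ (+3%), Zen+->Zen2 (+15%).
gains = [0.52, 0.03, 0.15]
total = 1.0
for g in gains:
    total *= 1 + g
print(f"{(total - 1) * 100:.0f}%")  # 80%
```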

To be fair, Excavator was pretty poor compared to the Intel offerings that generation.

But 15% the last generation is definitely impressive, especially when compared to the 3-5% gains from 14++++ that Intel has been offering.

The progress is ok, just incremental. AMD had been playing catch up, until Ryzen/Threadripper.

Just for comparison, here are the numbers from my GeekBench 4 run:

Single-Threaded: 4,746
Multi-Threaded: 34,586

According to a leaked benchmark, the Threadripper 3000's numbers are:

Single-Threaded: 5,519
Multi-Threaded: 68,279

The multithreaded benchmark is 2x, that's a no-brainer since it's likely to have 32 cores vs the 1950x's 16 cores. Now, I will say that TR3000 benchmark is not overclocked (3.6ghz.) But from what I've read, it seems like there's not much room for these latest chips to be overclocked. So, despite being the 3rd iteration, the TR3000 is only 16% faster in single-threaded benchmarks than my 1950X.

Intel's 9900K (overclocked) gets a single-threaded GB4 score of ~7,000. That (or its successor) may be my next machine.
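The ratios in the comment above check out against the posted scores:

```python
# Geekbench 4 scores as posted in this subthread.
st_1950x, mt_1950x = 4746, 34586     # 1950X (overclocked)
st_tr3000, mt_tr3000 = 5519, 68279   # leaked Threadripper 3000

print(f"multi: {mt_tr3000 / mt_1950x:.2f}x")               # multi: 1.97x
print(f"single: +{(st_tr3000 / st_1950x - 1) * 100:.0f}%") # single: +16%
```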

Honestly, that seems rather low. It looks like a lot of Ryzen 3900X's are doing well over 6000 on Geekbench? Ie. https://browser.geekbench.com/v4/cpu/14151649

Is the Threadripper 3000 test also from Geekbench V4? I noticed they added V5.

I don't see why it should be so far behind the Ryzen in single thread... unless the boost isn't working properly or something, could be disabled if it's a test chip.

Yep, it's from GBv4. It could be that the 3900X is clocked higher than the TR3000. Given that they're packing more cores on the chip, it's possible that they can't dissipate heat as well and limit the single-core top-end speed on the TR3000s more than on the Ryzen 3000 series.

The biggest upgrade over Threadripper 1950X is that the new ones are no longer NUMA. So performance across the cores will be much more consistent even if the peaks are not that much bigger.

If your development involves compiling code then the 2x larger L3 is also going to result in huge improvements to code compilation speed as seen in the Epyc Rome reviews. Example: https://www.phoronix.com/scan.php?page=article&item=amd-epyc... - compare the 7601 vs. the 7502. Both are 32c/64t parts. Both have close-enough base clocks & turbo clocks. But the new one (7502) absolutely smokes the old one (7601) at both kernel compile in GCC and LLVM compile. Compilers love that L3 it seems.

All signs point to Threadripper 3000 still being sTR4 compatible, so that means you could drop in a CPU that gives you +15% single core performance vs. previous gen's best case, no more NUMA domains bringing more consistent performance, lower power usage, and double the L3 for much faster compiles.

That's pretty awesome particularly given we're only talking a ~2 year gap between the products.

Wait in what sense are the new Threadrippers no longer NUMA?

Edit: Found the Zen 2 Epyc marketing materials that describe this. Yes, apparently memory access on a single socket is uniform, in that all memory access is indirected over the IO chiplet[1]! This may hurt best-case access latency for NUMA-aware workloads? Just speculating.

It's not like non-local caches go away with uniform DRAM access latencies — unless L1/L2/L3 are also non-local to the core and indirected behind the IO chiplet. Which would be really surprising.

[1]: https://www.servethehome.com/amd-epyc-7002-series-rome-deliv...

Threadripper 1000/2000 (and corresponding Epyc) were true NUMA: each chiplet had IO for PCIe and memory onboard, and each die handled part of it. Thus you have "near" and "far" memory like a traditional NUMA.

On Zen2 everything goes through the IO die. In a sense this means everything is "far" now, but it is uniform, and performance seems to be very good despite this (perhaps due to the insane amount of cache meaning less need to hit memory as frequently).

In the definition sense? It's a single memory controller on the IO die & one memory domain. The chiplets don't have their own memory controllers.

By contrast, Threadripper 1 & 2 had multiple memory controllers: 2x dual-channel controllers. As such they were full-on real NUMA, just like a multi-processor system.

I was just looking for a source for the claim. Found my own pretty low-detail source; if you've got something better, I'd love to learn more.

If you ever get into Docker/Kubernetes, being able to spin up a local cluster for testing is a pretty big deal. Having that running in the background and not impacting your usual workload is huge.

I've been testing a Kubernetes cluster built on 32-core AMD servers and it's unreal the workloads you can throw at them. I'm used to 4 or 8-core Xeon chips and this is a whole different game.

That's true, but memory/IO is a much bigger issue in this scenario than processing power. While running a cluster in VMs, I'm frequently limited by IO (and if your VMs use more memory than is physically available, you'll quickly descend into swap hell) well before my CPUs register a significant uptick in usage.

And, mind you, my workstation is a 4-year-old Xeon with 64 gigs of memory and a reasonably fast (but not amazing) SATA SSD.

On a side note, I work more and more from a couple of i3 and i5 laptops and only use the workstation to do heavy lifting tasks, such as replicating these more exotic setups.

I really hope the 3000-series Threadripper is 8-channel like its server counterpart. 8 16GB DIMMs gets you 128GB, and a significant amount of memory bandwidth. You also have enough PCI-E lanes to either stripe or partition your I/O in such a way to work around any major bottlenecks.

It's definitely going to start at quad-channel + 64 PCI-E lanes like the existing sTR4 platform, as it's almost certainly going to have sTR4 socket-compatible SKUs. Rumors are there will also be an 8-channel + 128-lane one as well. Supposedly the quad-channel platform will be the one with overclocking support, the more "enthusiast" platform, and the 8-channel one will really be more of a workstation platform.


There will be a new socket for the 8-channel ones: sWRX8. Those are not unlocked and are essentially a baby version of Epyc.

My biggest hope is that you will be able to get 256 or 512GB of RAM with those. The biggest limitation of the previous one for me, in comparison to Xeons, was that you could only get 128GB of RAM.

Consider: NVMe makes SSD look like HDD.

More PCIe lanes mean more NVMe drives, as most servers don't use those lanes for GPUs.

>NVMe makes SSD look like HDD.

Why? It's not true. Program/data load times change by single-digit percentages between SATA and NVMe SSDs.

2GB+/s vs. ~600MB/s is a pretty big deal for some workloads.

For precisely one "let's copy a lot of data around" workload. In the real world you won't feel the difference even when editing 4K video.

How much of that is core count though? I've used minikube several times and it didn't seem to matter much how many cores it had (beyond 4)

For simple things it doesn't matter, but if you're bringing up a complex application that has a dozen server components and putting it under load, it does matter.

I'd love to have even a 1st gen ryzen to know what it's like to be able to compile massively without feeling it.

Right now my poor x201 overheats to death if I use rustc carelessly.

I can still put more cores to work, compiling stuff.

I have a 6-core quad-channel Intel and I've been waiting for this Threadripper to come out.

Maybe, but how often do people compile their software?

If you're editing it, a lot.

But will they actually be generally available for purchase? I was super excited to grab an AMD Ryzen 3900x when they came out mid-July and they've been consistently sold out since then. That's over 3 months with the only reliable way to get one being to buy one for almost twice the MSRP from 3rd party resellers at $800-900 vs MSRP of $500.

Demand in the US is really high. In my country (SEA), the 3900X is available in almost every big computer shop.

Rumor has it there's a 64 core chip coming ~ this year too! With 128 threads, of course.

The following article calls it a "one-two punch"... wow!


I'm ignorant about stock trading, I admit, but I don't understand why AMD stock isn't blowing up. It's recovered after a market-wide dip, but hasn't all this Intel-beating news raised the hype? So many other tech companies are severely overvalued and hyped daily. Why is AMD a steady 28-32?

With a p/e of 166, there’s a lot of optimism priced in.

Even more amazing when you compare it to Intel with a p/e of 11

Considering how undervalued they have been for the last decade, things have moved definitively in the right direction. The recent releases and developments are still fresh news, and I suspect the people/computers responsible for most of their volume operate on a longer timescale.

It took a while to get the jump from 10 to 30. I was actually unlucky, selling just before the jump because it took too long (I'd held for a year).

I see it going to its old peak soon, and the move will be just as swift as it was before.

It was below 3 in 2016.

This. I bought at $1.56 and sold at $29 for an unbelievably awesome trade. Will pick up more anytime it dips below $30.

TL;DR: industry momentum

I've read this elsewhere on HN; basically it boils down to the fact that having great tech doesn't mean everyone drops all their existing Intel tooling or Intel-optimized source code. If you care a lot about performance, then you'll care about those things. Everyone else just wants a reasonably priced processor, and most PC manufacturers have large contracts that get CPUs at a good-enough price from Intel, so they have little incentive to switch all their motherboards, factory setups, and drivers.

By what mechanism would the multicore performance rise 2x without increased single-core performance? I'm kind of confused. Eliminating intra-CPU concurrency bottlenecks, I guess?

Which benchmark or comparison are you referring to?

Pretty much the only numbers in TFA:

> The single-core score of 1,275 is pretty much the same as for the current flagship Threadripper 2990WX, ...

> But when it comes to multi-core, the Threadripper 3000's score of 23,015 absolutely destroys the Threadripper 2990WX's score of 13,400, ...

2990WX was a weird ultra-NUMA setup where half the dies had no direct memory access at all. Part of the gain will have been fixing that - particularly on Windows where the scheduler did not understand this configuration at all (Linux results have always been much better).

Another part is the power consumption and resulting higher clocks. The 2990WX was basically power limited, the new chips are on 7nm which will drastically reduce their power consumption and allow higher clocks inside the same power envelope.

Wouldn't the higher clocks impact single-core performance? The article claims that's about the same. I suppose the userbenchmark single-core perf test[1] (the article links to some userbenchmark.com results) could just be a bad test; I'm unfamiliar with it. (It could also be that 2990WX wasn't power/heat limited on a single core, and neither is 3xxx; but say, at 16 core 100% clock, 3xxx can go higher than 2990WX due to a lower power process?)

Anyway, it's still 32 cores. While eliminating the ultra weird NUMA effect and increasing clocks could explain a significant benefit, I'm still really struggling to intuit 72% higher performance. But it is Windows 10, so your remarks about the Windows scheduler may explain the gap.[2]

[1]: https://cpu.userbenchmark.com/Faq/What-is-single-core-intege...

[2]: https://www.userbenchmark.com/UserRun/19698768

Zen2 is moving to a somewhat deceptive advertising model for clocks.

AMD is advertising the absolute highest clock that any core on the die can hit under extremely light load for an instant. Sustained single-core clocks will be 100-200MHz less, and sustained all-core clocks will be significantly less. Most cores on a chip are not capable of sustaining the advertised clock rate even with generous voltage and even for instantaneous loads; only the "preferred core" on a chip can. There is a significant binning effect not just between chips, but between individual cores on a chip.

Intel and previous generations of AMD cores used to advertise the sustained single-core rate, which could be achieved on any of the cores. This is a changeup to how the clockrate has been advertised.

original: https://www.youtube.com/watch?v=DgSoZAdk_E8

followup after a patch: https://www.youtube.com/watch?v=3LesYlfhv3o

As such the clockrates may be significantly different from what you're intuiting based on the advertising.

I don't actually recall any Zen2 clock advertising in particular. My comment was purely in response to your earlier statement:

> Another part is the power consumption and resulting higher clocks. The 2990WX was basically power limited, the new chips are on 7nm which will drastically reduce their power consumption and allow higher clocks inside the same power envelope.

Indeed, it makes it difficult to know what you're actually buying when considering something like a 3600 vs. a 3600X.

All the cores need to synchronize memory with each other; I guess you can improve speed on that part.

It's not likely, but you could do a 32-core, 128-thread version.

My understanding is that the underlying CCXs and cores are identical across the Zen 2 product line, and all have exactly 2 threads per core. No?

But what processor is best for Dwarf Fortress?

Dwarf Fortress is IIRC mainly limited by RAM latency, so you really should be trying to get 3600MHz RAM with timings as tight as possible (maybe CL16?).

Dwarf Fortress's simulation is all about pointer-indirection and jumping around memory. The CPU doesn't really do much except wait for RAM most of the time. It takes ~50ns to talk to RAM, but the CPU is clocked at 4GHz (0.25 nanoseconds), giving you an idea of scale. The RAM tightening can bring your latency anywhere from 50ns to 200ns depending on how well you tune your RAM parameters, and depending on chips and stuff. (Servers usually have lots of slow LRDIMM RAM over multiple-sockets that can be 200ns latency or worse).

AMD takes their design out of the server-playbook, and seems to have ~100ns main DDR4 RAM Latency. So Dwarf Fortress probably will be faster on Intel i9-9900k (monolithic design with integrated memory controller and 50ns main memory latency).

Interesting! Is there a simple way to measure memory latency?

Create an array of 1 billion 32-bit integers holding the values 0 to 1 billion (i.e., array[i] = i). Knuth-shuffle the integers.

"Linked list" traverse the integers as follows:

    // array holds 1 billion indices, randomly shuffled
    uint32_t idx = 0;
    for (int i = 0; i < 200000000; i++) {
        idx = array[idx]; // each load depends on the previous one
    }
Pull out a stopwatch (or use Linux's "time" functionality). Divide the elapsed time by the 200,000,000 iterations. You've now measured average memory latency.

I suggest using Sattolo's algorithm instead; it generates a single cycle, so the traversal is guaranteed to visit every element.

Here is a relevant blog post https://lemire.me/blog/2018/11/13/memory-level-parallelism-i...

Awesome! Thanks for the write up and good info.

Probably the Mill CPU once the hardware actually exists.

Is the Mill CPU still happening? Haven't heard about it in a while...

There's activity on the forums, though rare; last posts from about a month ago. But still, Ivan is there dutifully responding to questions. I think interest has just died down a bit until they get more benchmarks (which, to be fair, the first ones just got released! https://millcomputing.com/topic/benchmarks/)

The article references a Geekbench score that's now taken offline.

Google Cache: http://webcache.googleusercontent.com/search?hl=de&ei=ewaGXa...

Results (also in the article): Geekbench v5.0.1 Tryout for Windows x86 (64-bit); 1275 single core, 23015 multi-core for 32C/64T @ 3.59GHz Base Frequency (and with 32GB DDR4, no info on 4 or 8 channel).

As more people move to city cores from rural areas... I think computers are becoming the next generations 'car', in terms of an object to soup up with new parts.

Interesting metaphor, though it has been this way since the 1980s. For instance, in the late '80s it was popular to upgrade your 386SX with an i387SX math co-processor, which is something like adding a cat-back exhaust system to your car. Later came increasing RAM, installing QEMM to boost available memory for DOS games, overclocking, and then add-on 3D video accelerator cards. There was even a physics add-on card at one point. Mostly these focused on performance gains. The shift towards LED bling and aesthetics is more recent, though.

Now if only multi-core programming could catch up to this hardware. Honestly, chips with more cores are starting to sound like cars with more wheels, or I guess more valves lately.

Doesn't matter the horsepower; the speed limit is the same until we can make software better.

This is a very outdated complaint, we are using all cores these days, just not on all systems. Backend software is concurrent & distributed, and although utilization is lower than we’d like, the extra cores are not going to waste.

However, if you put this in a machine you use for playing games or checking Facebook then you might think that it’s a waste, and you’d be right. There is plenty of software out there which is still single-threaded, but it’s not dominating our utilization.

Honestly, with the way Chrome and Firefox behave nowadays, even grandma looking at Facebook benefits from multiple cores.

Not really, any window/tab that is not in focus gets heavily throttled by the browser, grandma is probably not running any extensions and single websites are still single threaded for most parts (more and more are adding web workers for various parts though).

>not running any extensions and single websites are still single threaded for most parts

Such a shame that all that JS botnet is still single-threaded, isn't it?

> Doesn't matter the horsepower the speed limit is the same until we can make software better.

For developers, code compilation benefits from more cores. For hobby 3D artists, rendering benefits from more cores. For hobby video creators, editing & rendering benefit from more cores. For hobby IT usage, the extra PCI-E lanes and ECC RAM get you the capability of doing server-like virtualization without spending server money.

Threadripper is an HEDT platform, same as Skylake-X. This is not targeting mass market usage, so complaining about mass-market viability is rather irrelevant.

Depends on the problem domain. Some problems/algorithms, like genetic algorithms, are simple to implement on multicore systems with thread pools with significant gains in overall performance.

"or I guess more valves lately."

Extra valves can increase efficiency, which is probably why valve counts are increasing; can't say I've noticed, though.

Extra wheels can improve aerodynamics also.


I'll stop undermining your point now :)

Efficiency does not mean you're going faster, though. My point still stands.

Multi-core processors can let us do more; I get it, valid point. But we're not doing more. We're just cruising along at the same speed, I think.

Since the beginning I have always wondered when I came across complaints about there not being enough multi-core software. Am I the only one with more than 1 process running on my system?

At work our homegrown programming language is still single threaded only. Feels bad man.

> The single-core score of 1,275 is pretty much the same as for the current flagship Threadripper 2990WX, and is actually slightly lower than the 1,334 that the Intel i9-9900K scores.

In what world is this a valid comparison? No one shopping for a $500 8-core consumer CPU is looking at a $1,700 32-core HEDT chip. They're entirely different markets. If you can use 32 cores you'd be looking at Intel's X-series at a minimum, if not Xeons. This is silly at best, if not intentionally misleading.

I'd rather choose really big caches per core than really big number of cores.

Zen 2 is 16MB of L3 per CCX across the entire stack (hence why even the Ryzen 5 3600 has 32MB of L3: it has 2 CCXs). There's no reason to assume this won't be true on Threadripper 3000, so a 32-core Threadripper 3000 would have at least 128MB of L3.

L2 cache is what matters. Zen2 has just 512 KiB per core. My old Pentium-M had 2MB. I wish modern CPUs had at least this much.

Performance is what matters, not cache size. Your Pentium-M didn't even have L3, hence why its L2 is so big.

Why do y'all hate Intel so much?

Because they're the dominant player (and the history of getting to that dominance is very sketchy, including $billion fines in the EU for bribing OEMs to use their processors)

Why is AMD making these chips with 14nm transistors instead of using TSMC's 7nm stuff? AMD is only catching up to Intel because Intel's 10nm has been delayed so much, but if Intel can ever figure out their 10nm process they'll be in the lead again.

AMD use 14nm for the IO die and 7nm for the compute chiplets. 10nm won't help much - after all the security fixes that Intel have had to implement, AMD have the IPC advantage.

And further, "nm" these days are marketing numbers. It's totally possible that AMD's 7nm isn't directly comparable to a 7nm (or 10nm) Intel process.

Their biggest win over Intel was using 14nm IO bridge to connect 7nm CPU cores ("chiplets") in the Ryzen 3000 era. They use 14nm for IO where there's little benefit to the process shrink so they can save the lower-yielding 7nm process for where it counts: cores.

And I think you're missing the biggest bit, there - each set of 8 cores is its own die. That means that if they have a yield of 95% for that one die, they'll have a yield of 95% for a 64 core monster, too. Compare with Intel, where octupling the die area means their yields would go from 95% to 0.95^8 = 66%. (And if they had 50% yields to start, Intel would end up with 0.4% yields after octupling - AMD would still have 50% total)

This can be a bit confusing: some parts within the CPU package are 7nm dies and other parts are 14nm dies. The chiplets that contain the CPU cores are the former; the interconnect (the I/O die) is the latter.

What are you talking about? Zen 2(Ryzen 3000) uses 7 nm.

It's a composite chip (chiplets) with different parts made using different processes for various reasons.

As a fun fact, IBM's newest (and quite impressive) mainframe, the z15, is still using the same Global Foundries 14nm process the z14 did. The next generation (including POWER10) will probably be 7nm, but not come from GF.

It runs above 5 GHz.

>if Intel can ever figure out their 10nm process they'll be in the lead again.

What are you basing this claim on?

For all we know, Zen 2 targeted beating a hypothetical 10nm Intel CPU, which has not shown up.

As it has not shown up, it remains hypothetical. Wild what-ifs.

AFAIK Threadripper 2 is 12nm. I guess it's because they can deliver very high speeds with that process without cannibalizing the already overtaxed TSMC 7nm lines.

Threadripper users don't care that much about energy efficiency, and it's a small market, which gets cannibalized by the 3950X in the new generation.

It's 7nm, similar to Rome.

Threadripper 2 was 12nm, Threadripper 1 was 14nm. The new TR3 will be 7nm.
