AMD Threadripper 2990WX 32-Core and 2950X 16-Core Review (anandtech.com)
322 points by MikusR | 189 comments

The biggest benefit I've found to running a 1950X isn't something I would have expected, but which perhaps I should have. True, it's much faster than my old system for batch processing, but most of the time it's still idle. Even if it's running Chrome flat-out, as far as the 1950X is concerned, that's idle.

Because the two NUMA nodes are ~entirely independent, it's capable of running two independent processes at full speed. In practice, that means lower latencies and less jitter, and it's been noticeable. Folklore would have it that single-thread performance is the most important aspect of desktop performance, but that isn't what I've observed.

...it's also useful when I, e.g., decide to run a Factorio server on my desktop.

> Because the two NUMA nodes are ~entirely independent, it's capable of running two independent processes at full speed.

I don't understand. From my (admittedly little better than layperson's) knowledge, I'm guessing the cores of most multicore processors have to compete for memory access...? Is there a good search term I can use to help me understand what's going on here?

There are 2 dies in the 1950X, each with 2 memory channels. Thus, it's possible to run a process on one (8-core) die that maxes out the memory bandwidth to its two local DDR4 channels while the other die still has full-bandwidth access to its own DDR4 channels.

Threadripper is able to switch between NUMA (non-uniform memory access) mode and "regular" mode. In NUMA, the OS knows that 2 channels are attached to 1 die and 2 channels on the other, thus allowing lower latencies because the OS knows what RAM to allocate based on which core the process is running on.

As a bonus, if you are explicitly NUMA and the OS/code does a good job, there's little line contention or resource sharing (e.g., caches) between dies.

I found a significant performance benefit to keeping NUMA turned on when running Linux, for basically every workload.

For Windows, it is the other way around. I hope they'll improve their NUMA handling, but I'm not holding my breath.

The Linux kernel is clever about this. You can get some idea of what it does by looking at numactl, which lists the various scheduling modes -- though in practice the kernel does a great job without any user overrides, and actually using the command is likely to slow things down.

Which is not to say that it can't occasionally be helpful, if you're trying to optimize the speed of a single thread. At a minimum, you can choose between optimizing for bandwidth (interleaving data on all four memory channels) or latency (putting everything in the local node). Usually you want the latter.
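To make the pinning side concrete, here's a minimal, Linux-only sketch (my own illustration, not from the thread) of restricting a process to a CPU subset from Python. This is the userspace analogue of what `numactl --cpunodebind` does for a whole node; on a real Threadripper you'd use the CPU list of one die rather than a single CPU.

```python
import os

# All CPUs the scheduler currently allows this process to use.
available = sorted(os.sched_getaffinity(0))

# Pretend the first CPU is "node 0"; on a real Threadripper you'd
# take the full CPU set of one die (listed under /sys/devices/system/node).
target = {available[0]}
os.sched_setaffinity(0, target)  # all our threads now run on that CPU set

print(os.sched_getaffinity(0) == target)  # True
```

Memory placement (the `--membind` half) isn't in the Python stdlib; that part you'd still delegate to numactl or libnuma.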

Does it really work this way (with automatic memory and core pinning)? Both Windows and Linux can do that?

Linux does that. Windows...

Judging by the performance I'm (not) getting, Windows does a very poor job with NUMA.

Does Linux do that OOB or do I need to do some configuration?

I never configured anything.

NUMA means "(Explicitly) Non-Uniform Memory Access"; this means that some cores have easier (lower latency, higher bandwidth) access to some memory regions than others.

In practice this means that memory controllers are partitioned amongst groups of cores, with some slower and often otherwise busy interconnect between those groups.

The software implication is that if task X uses some bit of memory a lot, then that bit of memory better be node-local, i.e. easy to access for the core where task X is running.

Threadripper and Epyc are essentially multi-socket-in-a-package. There is an inter-processor link which is analogous to Intel QPI or DMI, it just runs between dies within a single socket instead of dies in separate sockets.

Threadripper and Epyc present themselves as 2 or 4 separate NUMA nodes depending on model. Spreading a single task across multiple NUMA nodes usually hurts performance significantly (often slower than just running it on a single node using fewer threads), but you can run 2/4 separate tasks at pretty much full speed.

The new WX processors are a little weird because two of the NUMA nodes have no direct access to RAM at all; they have to ask the other two dies to fetch it and pass it over.

>> Folklore would have it that single-thread performance is the most important aspect of desktop performance

There are a few places where it matters. We talk of "gaming" but that is too broad a term imho. Many AAA games do leverage AMD chips well; they know how to split loads across threads to maximize performance. But in recent years the hot games have come from indie developers, and pre-release titles from tiny developers aren't properly optimized. If you want to play Kerbal Space Program, single-thread performance is vital (KSP's physics is still single-threaded iirc). If you want to surf the web while running a Factorio server, AMD will work fine. But if you want to maximize the performance of that Factorio server, the faster single-thread performance of Intel chips would be better.

Minecraft is the king of that category. I don't know if it still does, but a couple of years ago Minecraft somehow still managed to use OpenGL immediate mode. Back then the larger Minecraft servers required ridiculous amounts of memory and dedicated a dozen cores just to Java's GC.

That's still pretty much the case today, though the underlying Java GL library the desktop client uses has been updated a bit, and the newer versions have migrated to higher versions of OpenGL, breaking a lot of mods pretty hard with the changes to the rendering engine.

I hear the (Windows/console only?) version is faster for all the reasons you'd expect, but I'll probably end up never playing it.

Windows version has way better performance, but it doesn't support mods so nobody cares about it.

Was the "Windows version" the one released 1-2 years ago that looks really pretty, with god rays etc?

I wondered if it would be a hit, since 8 bit graphics is part of the charm and concept of minecraft, but I didn't consider the mod ecosystem.

Yeah, the one based on the mobile version.

Minecraft has made some very, uh, interesting performance decisions.

For example, they decided to forcibly split everything into server thread(s) and client thread(s), even when you're playing single player. This made things more consistent, and sometimes helped performance, even though it broke a huge number of mods.

While making this clean break it would have been easy to put world generation into its own thread, or even let it use multiple threads. Minecraft's world generation is inherently very parallel, and it bogs down servers all the time. Responsiveness could be improved by a huge amount by cutting it out of the simulation loop.

They didn't.

Even when I'm playing Kerbal Space Program, there's a marked difference between just being able to play KSP and being able to play KSP while having reference materials open on another screen (the KSP wiki has a great delta-V map of the Kerbol system, for example). Long before my computer is old enough that games like KSP become unusably slow, it gets old enough that the only way I can run games like KSP is to close my browser and free up all of my resources.

KSP 1.1 finally added multithreading btw

Only for different spacecraft, so it only really matters for some planetary bases. A single spacecraft is still processed in a single thread.

When launching a rocket, that rocket's physics engine is still on a single thread. Given that the physics engine is the most CPU-intensive part of KSP, I'd say that KSP's support of multithreading is nominal at best.

Multithreading aside: KSP's approach to physics also doesn't help. Rather than making creative decisions about what should and shouldn't be calculated, they just hammer the physics thread with lots of calculations that have zero impact on the user experience. At a larger dev shop these would be optimized away.

KSP is now owned by Take Two Interactive (since May). Here's hoping they can get those dev resources!

I run 4 simultaneous factorio servers on my old 32gb laptop. How many servers is your 1950x capable of? How much ram do you have?

The 1950x is looking like a viable upgrade right now as it’s becoming cheaper.

I have 32 GB of ram, but you could easily fit 128 GB if needed. Above that, I think sourcing the chips becomes difficult.

If you aren't concerned with getting absolute maximum speed from the servers, then I'd think you could run 16 of 'em with only moderate degradation; perhaps 20-30% slower (per server) than a single server. Feel free to imagine the Clusterio applications.

We’re running 43 clusterio-like connected worlds here. Worlds allocated 8gb ram each. Less if we can get away with it by deleting unused chunks.

How do you deal with biters?

That's been the real blocker for me. I don't want to disable them entirely, but worlds need to keep running while there are no players in them. So...?

Use deliberate antagonism. Create an artillery cluster with flamethrowers and uranium ammo turrets around outside. Space these outside your main base defences. Make them high priority for replenishment. Self contained solar power and accumulators as well as grid power. Don’t use laser turrets except for very inside of cluster.

Use thick walls with mixture of solid walls on inside and every-other spacing on outer.

Main base exterior walls use a mixture of uranium turrets, flamethrowers and laser turrets.

Use laser turrets wrapped around large power pole within base. But no interior walls. Redundant connections using large power poles. Substations are for ordinary structures.

> Folklore would have it that single-thread performance is the most important aspect of desktop performance, but that isn't what I've observed.

It still is. It's just that single-thread performance has approached its practical limit (or at least parity with the competition), so the trade-off on offer is now double the core count versus ~20% more maximum single-thread performance.

I just upgraded from a 13" MacBook Pro with 2 cores to a 15" 2018 MacBook Pro with 6 cores (Intel, cough, cough), and the difference in performance is substantial and very noticeable. The folklore that adding cores doesn't translate into real-world performance feels outdated now that operating systems and applications are designed to utilize multiple cores.

I do wonder if there is a point of diminishing returns though. 24, 32 cores, is it really worth the extra money?

One instance of this might be "running a performance-intensive game fullscreen on one virtual desktop while having a browser open to read strategy guides on another virtual desktop". This is a common use case for me that also seems to be the first place where performance starts to diminish.

That's probably a lot to do with two instances of virtual desktops running, though.

Having a browser open uses approximately 0% CPU though.

Streaming would be a better example.

If you have "a browser" open, sure, if by "a browser" you mean a single browser tab on a not-resource-intensive page or with Javascript disabled. If by "a browser" you mean the dozens of tabs and half-dozen browser plugins the typical user is actually running, though...well, there's a reason Chrome comes with its own process monitor.

But "strategy guides" are in the former category, aren't they?

And for reference all my background tabs in chrome are using plenty of memory but <1% CPU total. It helps that the browser has aggressive throttling for background tabs.

Edit: it's even more aggressive than I thought https://arstechnica.com/information-technology/2017/03/chrom...

I dunno what to say, man. It's a real problem that I've run into.

I believe you had performance issues caused by chrome. What I'm wondering is whether all three of the following apply:

1. The problem was caused by Chrome actively spending CPU cycles and memory bandwidth, not just by having so much memory allocated that you run low, since NUMA itself can't change memory consumption.

Caveat: It's possible the NUMA setting would also cause Chrome's RAM use to be capped. If that causes an improvement, then you could do the same capping on a non-NUMA machine.

2. The foreground page in chrome was something like a strategy guide, not something more intense like video streaming. Are there pages that are both strategy guides and intensive to calculate?

3. This happened in the last year or so since version 57 came out.

I switched back from Chrome to Safari (for totally unrelated reasons) between the last time I had this problem and the current time. Maybe Chrome is better about throttling background tabs these days (and better than Safari). Also, this time, wikis and strategy pages seem to load fine (though noticeably slower), but it's a little too hard for me to order a pizza on Grubhub, especially if I'm context switching back and forth between the browser and the game. All I can actually say with confidence is that I have run into a performance bottleneck well before hitting the single-thread will-it-run-at-all barrier.

>Having a browser open uses approximately 0% CPU though.

Not on the modern web it doesn't. A lot of pages are continuously running background tasks and refreshing over time.

It seems performance under Linux is significantly better: https://www.phoronix.com/scan.php?page=article&item=amd-linu...

Phoronix now has Windows 10 vs Linux tests: https://www.phoronix.com/scan.php?page=article&item=2990wx-l...

Sure enough, many of the benchmarks feature dramatically better performance on Linux. Michael also writes:

"The Windows 10 vs. Linux tests were done out of opportunity with having that Windows installation around without giving it too much thought, but on this 2990WX launch-day it's been surprising to see some of the performance results from some of the Windows publications. Had I known how poorly Windows 10 works on current high core count NUMA environments under some workloads, I would have certainly ran more benchmarks. But that will come in another article then as well as possibly looking at the Windows Server 2016 vs. Linux performance on the 2990WX to see if Windows behaves better there for this NUMA box. So treat this as the introductory article and more Windows vs. Linux benchmarks will be on the way as time allows."

I wonder if this comes down to Linux being more adept at handling "exotic" NUMA configurations compared to (desktop) Windows. Even if the "server" editions of Windows could handle them Microsoft may have left that functionality out of the "desktop" kernels for product segmentation purposes.

There are now Windows Server benchmarks available as well (it's just as bad): https://www.phoronix.com/scan.php?page=article&item=windows-...

There is also a direct comparison with Windows: https://www.phoronix.com/scan.php?page=article&item=2990wx-l...

For those wondering like I was after reading this comment, both the anandtech and phoronix test setups are patched for Spectre.

Maybe Michael Larabel removed the plastic cover first.

It seems that they left the plastic cover for at least some of the benchmarks [1]. I can imagine that would limit performance since the CPU would be throttling itself down to keep cool.

[1] https://www.anandtech.com/comments/13124/the-amd-threadrippe...

You'd think they'd either pull the review or place a disclaimer at the start about this, but I agree with what you're thinking (throttling).

Ian re-ran all of the affected tests after discovering the mistake, but he also kept the data for later analysis. Those results aren't in the review yet.

Fair enough, mistakes happen but the comment didn't make it very clear which benchmarks were included.

"Me being an idiot and leaving the plastic cover on my cooler, but it completed a set of benchmarks. I pick through the data to see if it was as bad as I expected"

> but the comment didn't make it very clear which benchmarks were included

It made it perfectly clear. You were reading from a list of what wasn't in the review yet: "But here's what there is to look forward to:"

Two additional benchmarks I'd like to see from AnandTech:

1) Some VM-based tests for this kind of CPU: 32, 64, 128 VMs all running some kind of web/db/redis benchmark inside.

2) Some compilation testing: time a clean build of AOSP, BSD, or some very complex Linux app, use the max jobs setting for parallel compilation, and time how long it takes to finish (measuring overall CPU usage at the same time).

Phoronix had it compiling the Linux kernel in 32 seconds, compared to 37.5 for the 7980XE.

Wow. I wonder how fast it could do a full bootstrapped build of gcc...

Even more generally, I would like to see some benchmarks of the VM extensions or other processor features. I don't know of any benchmarks that help you understand whether the native AES instructions are any good or have improved between models. There's a lot of chatter about the memory architecture, but knowing that features like nested page tables work well would be really useful for me personally.

For these kinds of tests I think Servethehome.com and https://www.phoronix.com do a better job.

They do a compilation of Chrome in the "Office Tests" section. The 32Core takes about an hour per compile, but the 16Core is actually a bit faster (my guess would be that linking takes longer on the 32 core machine as it's very memory intensive).

FWIW, I believe the version they use relies on a version of MSVC where the linker is predominantly single-threaded, and linking the final binary ends up as a significant proportion of the time.

Yes, the issue is the LTCG build, which by default uses 4 threads and with Chromium the compiler hits some pathological issues with such a huge program and it ends up using mostly 1 core and swapping lots of memory.

Does anyone do multithreaded LTO yet?

GCC and clang do. As a matter of fact, MSVC does too, to some extent (CGTHREADS option, iirc).

It's funny to me how much range there is in these tests. Some of these are seemingly backwards with better CPUs ranking so much worse and some look completely random. But wow, the performance on these new TR chips is just insane. Never would've guessed that AMD would be this competitive a few years ago, especially not on the CPU front. Looking forward to the upcoming 7nm launch, can't wait to see what's in store.

The test results on Phoronix paint a rather different picture (i.e., the 2990WX being consistently and markedly faster than any other tested processor).

My initial gut feeling reading the results on AnandTech is that Windows' scheduler may not yet be able to exploit the processor's architecture effectively.

2990WX has a very complex NUMA architecture that costs it a lot of performance if the scheduler doesn't get things right. Linux has been running on and has been heavily optimized for even more complex NUMA systems.

I'm eager to see the AIDA64-FPU or any other scientific-calculation benchmark on Linux, as that is the area where the i9 seems to dominate the TR. Sadly, the few benchmarks I've found were on Win10.

Wow, I didn't even realize how big an impact it makes. It outperforms everything else by a large margin in all but one test, and gets something like 117% more performance in ebizzy than the 7980XE.

Yeah, I remember AMD's bulldozer architecture coming out much worse on Windows based benchmarks.

Indeed, as noted in the conclusion, there is very little middle ground: performance with this particular NUMA structure is bimodal, depending on memory access patterns and the power consumption of all the supporting hardware.

I suppose it makes the most sense to frame this as an iteration of the original Threadripper with all the design compromises that entails to provide backward compatibility with existing motherboards. I'm definitely more excited for the next generation HEDT chips that hopefully have a more balanced layout like Epyc.

AMD have certainly come out of the Bulldozer era swinging, and we all benefit from their aggressive pricing.

> After core counts, the next battle will be on the interconnect. Low power, scalable, and high performance: process node scaling will mean nothing if the interconnect becomes 90% of the total chip power.

I think this means that scaling processors to more cores (nodes) in a lightweight-NUMA scheme just isn't going to work.

My guess going forward is that core counts will stay at the current level for quite a while, and that both AMD and Intel will optimize the heck out of their interconnects, using most of the power-budget gains to bolster the actual cores.

I was hoping they would at least make a passing reference to Amdahl's Law.

As you optimize one part of the system so that it is blazingly fast or lower power or whatever, the share of the total taken up by the rest of the system grows proportionally.

(In other words, real performance wins happen gradually as all the parts receive their own little optimizations which add up.)
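The trade-off above is just Amdahl's law. A quick sketch of the arithmetic (the fractions below are made up purely to illustrate):

```python
def amdahl_speedup(parallel_fraction, n_cores):
    """Amdahl's law: overall speedup when a fraction p of the work
    parallelizes perfectly across n cores and the rest stays serial."""
    p = parallel_fraction
    return 1.0 / ((1.0 - p) + p / n_cores)

# Even with 95% of the work parallel, 32 cores give nowhere near 32x:
print(round(amdahl_speedup(0.95, 32), 1))   # 12.5
# Shrinking the serial part from 5% to 1% helps more than adding cores:
print(round(amdahl_speedup(0.99, 32), 1))   # 24.4
```

Which is exactly why those "little optimizations all over" add up: every serial piece you shave raises the ceiling for all the cores at once.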

Sounds a lot like how to play factorio.

It also means a lot of consumer software can now be optimised from low core counts up to a 64-core / 128-thread scenario. That is likely a decade's worth of work.

I was amazed at how much power the interconnect was taking on these parts. Even with all the other downsides I was convinced it was the right move, but those power requirements are giving me second (third?) thoughts.

The 7nm parts (incl. Ryzen on up) are rumored to use 16-core CCXes, so we should see 64-core EPYC and possibly Threadripper with no worse IC power use. Hopefully they'll make the IC better as well.

7nm EPYC being fabbed on TSMC does give some chance for a big monolithic die. TSMC has a long history making large dies and their 7nm process should be mature enough for decent yields at that point. I am speculating out of my depth, though.

The larger the die the worse the yields, and 7nm will have worse yields than previous generations. AMD will go even harder on MCMs because it makes 7nm cost-effective.

Any wins they make in interconnect efficiency could either be used to boost the power budget of existing cores or add even more cores.

The Chromium compile being faster on 16 cores than 32 cores is pretty weird, given how embarrassingly parallelizable that should be. Wonder if it's out of memory, or if the bottleneck is actually linking.

The Chrome compilation looks really strange indeed, especially with having the results from Phoronix on Linux which show that it's the fastest compiling the Linux kernel. I wonder what kind of Chrome build it's doing - does it use Clang-cl or Visual C++? Does it have LTO (LTCG for VC++) enabled? If it's VC++ with LTCG, for example, the entire code generation and linking is limited to 4 cores by default.

It's Chromium 56 with the default Chromium build-chain.

At that point, Chromium required Visual Studio 2015 Update 3 or later. AnandTech used VS Community 2015.3. IIRC, by default it doesn't have LTCG enabled.

By default, release builds are almost entirely statically linked (with a few shared libraries, platform dependent), which makes linking time pretty significant. As a result, it ends up more as a linking benchmark than a compilation benchmark.

Ian Cutress replied in the article comments that LTCG is indeed used. With LTCG those strange results make sense: it's spending a lot of time on just 4 threads by default (actually the majority of the time is on one thread in the Chromium case), and it hits some current limitations of the VC++ compiler regarding CPU/memory usage that make scaling worse for Chromium (but not for smaller programs or non-LTCG builds). Increasing the number of threads from the default of 4 is possible, but will not help here. The frontend (parsing) work is well parallelized by Ninja, which is probably why the Threadrippers still end up ahead of the faster single-core Intel CPUs.

There's a significant workload difference between compiling the Linux kernel (C code base) and Chromium (C++ code base). I'd very much like to see a Chromium build benchmark performed under Linux.

See my answer above: it's because the build was LTCG, and in the best case at most 4 cores were used during optimization/code generation/linking.

That's not what I'm talking about, though. Compiling large C++ code requires much more memory than compiling C, which, on a TR and its NUMA architecture, can make significant differences. And we do know that Linux tends to work better on NUMA than Windows, so comparing a C compile on Linux to a C++ compile on Windows, even if it didn't have the LTCG setup problem, would still be apples to oranges.

That Phoronix says the opposite makes me think Anandtech is right....

Personally, I'm more astounded someone outside Google was actually able to build a Google project.

Honest q: What home-use workstation would use 32 cores? (Excluding home labs or servers).

For compiling, having many cores is fantastic. Granted, on a workstation, compilation normally just involves a few files (the ones that have changed since the previous build and their dependencies), but when you have to do a full rebuild, it is fantastic to be able to do `make -j16` and watch it chug through 16 files simultaneously. Interestingly, the benchmark in this review shows that the 16-core 2950X compiles Chromium faster than the 32-core 2990WX, presumably this means something other than the thread count becomes a bottleneck after 16 threads or so.

"this review shows that the 16-core 2950X compiles Chromium faster than the 32-core 2990WX, presumably this means something other than the thread count becomes a bottleneck after 16 threads"

The article mentions that, due to the die packaging, only 16 of the cores have direct access to RAM. So for the 32-core version, half the cores are memory-starved and have to go through the 'connected' cores (also impacting these), while the 16-core version doesn't have that problem and can be at 100% for all process loads.

Might the memory access model (UMA vs NUMA) play a role here? AFAIK the 2990WX has a configurable mode (it can be set to work in either UMA or NUMA mode) whereas the 2950X only has one mode (can't recall which at the moment).

It's the opposite, the 2950X can be configured in (fake-)UMA ("distributed" mode in AMD's terms) or NUMA mode but the WX chips are NUMA only.

I would think that compilation is faster on the 32Core, but linking is much slower.

See my other answer in this thread: the main reason for the strange result is the LTCG build, not really the CPU, which scales quite nicely in the Linux tests from Phoronix.


One of the projects I compile at work can take an hour running on 4 threads; jack that up to 16 and you take it down to not much over 17-18 minutes. That's a whole heap of developer time you just got back that would have been wasted on compiler swords.

The other one is running VMs / a Docker swarm locally for development.

I make games and can't use the incredibuild server at home, so a workstation with even 16 cores would be amazing.

I think that most of us who are interested in high core counts are at least hobbyists. For example, I use Monte Carlo simulation to compute the pagerank vector for all of the biomedical literature on PubMed. 32 cores is either 32x faster than 1 core, or lets me improve the precision of my results. Sure, this isn't browsing the web, but it's also not a real research project or a business.
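For a flavor of why this workload scales so well: each random walk is independent, so 32 cores can each run their own batch of walks. Here's a toy single-process sketch of Monte Carlo PageRank (the graph is made up; it is not the PubMed data):

```python
import random
from collections import Counter

def mc_pagerank(graph, walks=20000, walk_len=20, damping=0.85, seed=1):
    """Estimate PageRank by counting visits of a random surfer.

    graph maps node -> list of out-neighbours. With probability
    `damping` the surfer follows an out-link, otherwise it teleports
    to a uniformly random node (also on dead ends).
    """
    rng = random.Random(seed)
    nodes = list(graph)
    visits = Counter()
    for _ in range(walks):
        node = rng.choice(nodes)
        for _ in range(walk_len):
            visits[node] += 1
            out = graph[node]
            if out and rng.random() < damping:
                node = rng.choice(out)
            else:
                node = rng.choice(nodes)
    total = sum(visits.values())
    return {n: visits[n] / total for n in nodes}

# Toy citation graph: papers a, b and d all cite c.
g = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = mc_pagerank(g)
print(max(ranks, key=ranks.get))  # c
```

Since the walks share nothing, splitting the walk count across a multiprocessing pool and summing the counters scales near-linearly, which is exactly the "32x faster or 32x more precision" choice.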

It takes roughly 20 minutes to render a SINGLE good frame using Cycles in Blender. Cycles is a ray tracer for 3D modeling.

If you are making a 30-second animation at 24 frames per second, that's 720 frames, or roughly 240 hours (10 days) of rendering. 30 seconds is roughly the length of a standard commercial.

If you have a computer that is 2x or 4x faster, that cuts the time down to 5 days or 2.5 days, which is dramatically different. It's mostly a CPU-intensive problem with relatively low RAM-bandwidth demands. (It's RAM-heavy, especially with HDR skymaps, so you need lots of RAM but not necessarily fast RAM.)
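The arithmetic above, spelled out:

```python
# Back-of-the-envelope render math: 20 min/frame, 30 s at 24 fps.
minutes_per_frame = 20
frames = 30 * 24                                 # 720 frames
total_hours = frames * minutes_per_frame / 60
print(frames, total_hours, total_hours / 24)     # 720 240.0 10.0
# A machine 4x faster:
print(total_hours / 4 / 24)                      # 2.5 (days)
```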

Or you buy a good graphics card and blow the CPU out of the water. It's true that for render engines like Arnold or Corona the CPU is the main thing, but for Cycles, get a GPU.

You might be surprised.


The Threadripper 1950x (16-core) is faster than the 1080 Ti in several tests. Fishy Cat for instance is faster on Threadripper, as well as the difficult "Barbershop Interior".

With all the updates to Zen2, higher clocks, and now 32-cores, I bet that the 2990wx will be incredible and give GPUs a run for their money.

Besides, you'll need a good CPU to handle physics (cloth, fluid, etc. etc.). Not everything can be done on the GPU yet.

CPUs also have the benefit that RAM is super cheap. You can get 64GB of DDR4, but it's basically impossible to get that amount of RAM on a GPU. This allows you to run multiple Blender instances to handle multiple frames quite easily. A portion of rendering is still single-thread bound, so an animation can be rendered slightly faster if you allocate one Blender instance per NUMA node.
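One Blender instance per NUMA node might look like the following sketch. The node count, scene file, and frame split are made up for illustration; the `numactl` and Blender flags used are the standard ones (`-b` headless, `-s`/`-e` frame range, `-a` render animation).

```python
# Build numactl-pinned render commands, one per NUMA node, splitting
# a 720-frame animation into contiguous chunks. Nothing is executed
# here; you'd hand each list to subprocess.Popen.
nodes = [0, 1]                       # e.g. the two dies of a 1950X
total_frames = 720
per_node = total_frames // len(nodes)

jobs = []
for i, node in enumerate(nodes):
    start, end = i * per_node + 1, (i + 1) * per_node
    jobs.append([
        "numactl", f"--cpunodebind={node}", f"--membind={node}",
        "blender", "-b", "scene.blend",
        "-s", str(start), "-e", str(end), "-a",
    ])

print(jobs[0][:3])  # ['numactl', '--cpunodebind=0', '--membind=0']
```

`--membind` keeps each instance's scene data in its own node's DDR4 channels, so neither render job crosses the die interconnect for memory.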

If you do have a GPU, you can still have CPU+GPU rendering by simply running Blender twice, once with GPU rendering and a 2nd time for CPU Rendering. With the proper settings, you'll generate .png files for each animation independently, which allows for nearly perfect scaling.

Every x399 board I've seen supports quad-GPUs. So you can totally build a beast rig with 4x GPUs + 32 CPU Cores for the best rendering speed possible.

You make a very good point. I do 3D rendering from time to time and I don't actually play games anymore. Most of my work would benefit greatly from a powerful CPU, not so much the GPU. Interesting, I should reconsider.

Digital audio workstation workloads are massively multithreaded, with hundreds or thousands of DSP processes. Performance scales almost linearly with core count.

These recent-ish AMD core improvements alone have me considering rebuilding my VST collection for Windows and moving off Mac for production. I'm not looking forward to tracking down Windows VST versions of tiny apps, but I can't imagine spending 3k for another MacBook when I can get way more interesting performance in Windows these days. I'd love to have a desktop for very heavy synth and processing work, freeze those tracks, and then be able to take the project on the go with a similarly set-up laptop (set-up DAW, that is, not 32 cores).

I was in a similar position, really tied to all my tools on OS X, but didn't want to spend 3k on their desktop line.

Instead I built an awesome Hackintosh with 16gb ram, 8 cores, nvme drive and 1080 for like $1600 that runs High Sierra. It's definitely more work to set up initially but pretty low hassle afterwards. No regrets.

Yes, I'm at the same crossroads, but I want a laptop to take with me as well, and the new Mac lineup is just not for me. That's great info that Hackintosh machines are still kicking; I had mostly ruled that out after not hearing much about them over the past few years. Mind sharing your build? :D

My build:

  - Intel 7700k
  - Geforce 1080
  - Asus ROG Strix Z270E
  - Samsung Evo 960 NVMe

The process is much easier than it was years ago - especially if you can find a few people that got it working with the same motherboard.

1) Make a standard install USB

2) Run Clover Configurator on the USB with standard settings + tweaks based on your GPU and motherboard (you can find suggestions on /r/hackintosh and the TonyMac forums)

3) Install and boot

4) Tweak the Clover configuration on the EFI partition to fix any random remaining issues you find like USB or audio.

Not my experience. I've had stutters with Ableton Live on a 4 core machine with 2 cores still idling around. Ableton cannot multi-thread a single track (at least in Version 9), and if you're using a single-threaded VST that does not matter anyway.

If a single instance of a VST can use 100% of a thread, you're using a woefully underpowered processor or a ludicrously inefficient plugin. Many composers regularly work on projects with hundreds of tracks and thousands of plugin instances. Projects of that scale used to require multiple computers and a bunch of DSP accelerator cards, but they're now entirely feasible on one high-end workstation.

It's not hard to find VSTs which will easily max out an i7, especially if they run on Max/MSP. Also, stuff from u-he, like Diva. If you run several instances at high quality, you'll usually need to start freezing tracks.

> Ableton cannot multi-thread a single track (at least in Version 9)

This is still the case in 10, and one of the reasons I have been seriously looking at Bitwig (Linux support being the other).

Is the situation in Bitwig better?

When I tested them out with identical sessions, I was able to get higher track/VST counts without dropouts in Bitwig.

I'm literally sitting here waiting for glibc to compile, because I need a version with debug symbols, which the version from Arch Linux' repos lacks. Right before that, I compiled valgrind from the git head, because the current release (3.13) doesn't support glibc 2.28. I have compiled Chromium a couple of times for work.

It's very apparent that my 2-core 4-thread i5 5200U is a bit too weak; I'd love to be using a 16- or 32-core machine.

An 8-core Ryzen is a huge upgrade from an Intel U-series chip and doesn't cost nearly as much as 16- and 32-core machines.

Does video editing or After Effects work count as home use yet? Video editing, in terms of cutting and splicing clips, is not going to benefit much from this chip, but a lot of effects rendering will benefit, and basic video editing uses a lot of effects these days.

That’s all I’ve got.

Yes, for roughly 8 threads or so.

16-thread Threadripper is mostly idle in my video-editing tests. I mean, it's a great processor. But video editing isn't "heavy enough" for me to recommend a 16-core or bigger processor.

Ah, not surprised. I have the Intel 8-core.

The best upgrade I’ve made for my video editing workstation has been going to 4x GPUs for DaVinci Resolve.

Most plugins/filters/effects are single-threaded :-(

Yeah, don't get me wrong. They pretty much all are. But a few key ones aren't, depending on the host app, and aftermarket plugins have a lot of multi-threading support.

I wondered that as well. I guess my imagination for uses isn't particularly good, but all the uses I bought my 16-core workstation for (mostly research computing development) would really suffer with the memory performance of the 32-core chip.

I don't know enough about video games, but I would naively think that the memory latency would be a big deal there as well.

At any rate, I learned a lot from this article. Anandtech's reviews always seem well-written and well-researched.

> I don't know enough about video games, but I would naively think that the memory latency would be a big deal there as well.

It really depends on what else you're doing. If you're just playing a game then Disk and GPU tend to be the biggest bottlenecks in video gaming. Even a reasonably fast modern CPU is sufficient for most games.

One niche application is music composition. When you're writing a score for full orchestra, you need lots of RAM and lots of cores for accurate playback.

I'd use it for home labs or servers.

Home servers? In what situation would you ever need 32 cores for home usage? I'm genuinely curious.

VMs, lots of them.

At the moment I'm running about 15 VMs on about 8 cores in a dedicated box somewhere, and it's definitely noticeable. I would love to host some core services at home and have 32 cores to play with for some more headroom.

Why not use containers instead of VMs? You can run about 10x more Docker instances than VMs on the same hardware.

Because the containers may not all run the same operating system? Networking in containers is also a bit different.

There are also reasons for having some more isolation between guest OSes.

On my ESXi box at home I have:

* A VM that hosts my NAS shares. This does nothing other than host the NAS shares, as I want to be sure no silly experiment of mine interferes with that.

* A general-purpose VM, where I do run some containers out of (UniFi controller, Plex, etc)

* A VM running Windows Server for my Domain Controller

* A secondary, isolated vSwitch with no uplink to the rest of the network. This is my mini malware-testing lab.

* A VM running pfSense that I'll sometimes use to allow selective access from the isolated vSwitch out to the internet, but not to the rest of the network.

Can't do all that with containers.

I have many use-cases where containers are simply unsuitable.

I'm using FreeBSD, but these apply just as well to Linux. I wanted to run ZoneMinder, which is not available for FreeBSD, so I simply spun up a CentOS VM and installed it.

On the flip side, I wanted to run Home Assistant, Node-RED, and some related utility programs. All of these are happy to run on FreeBSD, so they can live happily in a Jail (FreeBSD's equivalent to a container).

Some people virtualize their router by dedicating a NIC to the appropriate VM. I don't know if this would even be possible in a container.

I run Proxmox on my 16-thread Ryzen and would love more cores.

I currently run four Linux VMs for my Kubernetes cluster and a 4-core macOS VM with passthrough for my GTX 1080 Ti. I have 64 GB of memory, so the only thing stopping me from running my Windows 10 and Arch desktop VMs at the same time is more cores.

Because contrary to the hype, containers aren't the right solution to everything.

While you are correct that they are not a one-size-fits-all solution, would you care to elaborate on the specifics of this instance?

Because not everything I want to run is best suited (or even available) to Linux.

Not everything runs great in containers. My internal firewall is pfSense, which is BSD-based and doesn't run on a Linux kernel.

At least 3 VMs need patched kernels, or more recent kernels and more regular kernel updates than the host provides.

Additionally, VMs provide a bit more isolation than a simple container (at least unless you use unprivileged containers).

I do have containers too, about 20 of them, half of them unpriv'd, all of them LXC. Docker is not suitable for my use case at all and frankly I don't think you should suggest someone should switch to Docker without knowing their use cases.

If you want to run multiple different OS's (or even different distributions of the same OS) containers don't work.

There is nothing preventing you from mixing a couple of VMs and having containers on top of some of them.

Build servers, video conversion/streaming, and hosting game servers are use cases that would certainly benefit from this in a hobbyist/home environment.


And just to expand on that, I'd like to (for instance) run multiple remote desktops (including a photo editing station), probably a decent plex/emby VM, probably something to do transcoding etc etc. Not to mention dev VMs etc.

You can do a lot with one thing and administer that one thing without having lots of individual boxes doing stuff, and for me it'd be way faster and a single cost, so it'd work out as a big improvement.

In a 'money is no object, all the time in the world' scenario it would probably be better to have something dedicated to each task, but that's not very flexible, on top of the other drawbacks (cost in money/time).

I would use it to mine bitcoin and heat my house in winter. I already have a slogan:

"Don't be a patsy who pays to heat your place; be paid for it. Order our heating device for just $2999!"

I think more CPU performance is attractive if you edit photos or movies - something that is typical for a home computer. The reason AMD delivers CPU performance through a high core count is that it's power efficient; otherwise, a single core would be easier to program.

Rendering is something I could see people doing at home if that's their hobby.

Then the tests in this article are hardly convincing.

How is the $1800 2990WX outperforming the $1980 i9-7980XE in every rendering test they performed not convincing?

There are clearly workloads where the 32 core TR chip does not perform well (probably due to the memory configuration) but it seems pretty good at rendering.

The 2990wx blew everything out of the water in rendering. It's like 37% faster than the 7980xe in the Blender benchmark.

Faster with ~2x the cores and twice the power consumption. Hardly a win in my book - it means each core is way weaker and more power-hungry. Intel will easily match that sometime soon without sweating.

Check your numbers. In that Blender benchmark, AMD is actually 58% faster (152/96), has 78% more cores (32/18), and its TDP is 52% higher (250/165).¹ https://www.anandtech.com/show/13124/the-amd-threadripper-29...

This means AMD manages to execute more work per watt (more energy efficient), and each AMD core uses less power than Intel.

¹ AnandTech wrongly lists the TDP as 140W. It's in fact 165W: https://ark.intel.com/products/126699/Intel-Core-i9-7980XE-E...
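A quick sanity check of those ratios - a back-of-the-envelope sketch using the figures quoted above, and assuming TDP is a fair proxy for actual power draw (which it often isn't):

```python
# Blender figures quoted above: relative scores and rated TDPs.
amd_score, intel_score = 152, 96   # 2990WX vs i9-7980XE (higher is better)
amd_tdp, intel_tdp = 250, 165      # watts

speedup = amd_score / intel_score          # ~1.58, i.e. 58% faster
tdp_ratio = amd_tdp / intel_tdp            # ~1.52, i.e. 52% higher TDP
perf_per_watt_ratio = speedup / tdp_ratio  # > 1 means AMD does more work per watt

print(f"speedup {speedup:.2f}x, TDP ratio {tdp_ratio:.2f}x, "
      f"perf/W advantage {perf_per_watt_ratio:.2f}x")
```

Since the speedup ratio exceeds the TDP ratio, the perf-per-watt advantage falls (slightly) to AMD under these assumptions.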

In a Corona 1.3 benchmark, the 2990WX was 28% faster than the i9-7980XE in that workload.


Its power consumption was 19% higher than the i9-7980XE.


Tom's Hardware saw a stock 2990WX at a lower power consumption than a stock i9-7980XE during a Prime95 "torture loop". Overclocked, the AMD part was higher than the Intel one, but only slightly.


Where have you seen that it has double the power consumption? Under what workloads?

Personally, I don't care about "weaker cores". If a system has 2048 cores clocked at 7 MHz and it is 20% faster at my workload than a single-core CPU at 700 MHz, it is faster.

The fact that the "weaker cored" system is cheaper than the "burly muscly" single core system is a bonus.

Power consumption doesn't even matter that much either. It's the equivalent of a single 60W light bulb (or several of those new-fangled LED bulbs). Big whoop.

If I were Pixar and was running one of these 24/7 then, well, it's theoretically possible that the extra you pay in electricity would make up for the lower capital price. But for a hobbyist running this at most 10 hours a week I really doubt that that's a consideration.
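To put numbers on that hobbyist case - a rough sketch, where the $0.12/kWh rate and 10 hours/week are illustrative assumptions, and treating the full 250 W TDP as *extra* draw overstates the real difference:

```python
# Rough annual electricity cost of ~250 W of extra draw for hobbyist usage.
# Assumptions (not from the article): $0.12/kWh, 10 hours of load per week.
extra_watts = 250
hours_per_year = 10 * 52
rate_per_kwh = 0.12

kwh_per_year = extra_watts * hours_per_year / 1000   # 130 kWh
cost_per_year = kwh_per_year * rate_per_kwh          # ~$15.60/year

print(f"{kwh_per_year:.0f} kWh/year -> ${cost_per_year:.2f}/year")
```

Tens of dollars a year - nowhere near enough to offset a capital price difference of hundreds of dollars.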

But more importantly, The Tech Report looked at task energy for the Threadripper in rendering tasks and found that it took less energy to finish a render than its competitors. Its power draw was higher, but the time was shorter to an even greater extent.


So if you're serious enough about rendering to spend thousands on a good rig for it, there really isn't any reason not to use this chip.

If you check Phoronix, they found it doesn't actually take much more power than the 7980XE.

2x the core count hardly ever offers anywhere near a 100% performance gain, even across Intel's lineup [1]. Very few workloads are that parallelizable, and almost everything (from the program to the OS to the CPU itself) introduces some form of overhead when running in parallel.

[1] https://www.cpubenchmark.net/compare/Intel-Core-i9-7960X-vs-...
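That scaling limit is the classic Amdahl's law effect. A toy calculation (the 95% parallel fraction is an illustrative assumption, not a measured figure):

```python
# Amdahl's law: overall speedup on n cores when a fraction p of the
# work parallelizes perfectly and the remainder stays serial.
def amdahl_speedup(p, n):
    return 1 / ((1 - p) + p / n)

# Even at 95% parallel, doubling 16 -> 32 cores gains only ~37%, not 100%.
for n in (16, 32):
    print(f"{n} cores: {amdahl_speedup(0.95, n):.2f}x")
```

And this is the optimistic model: it ignores the interconnect and memory contention overheads mentioned above, which grow with core count.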

How much power would an Intel based system need to get that 40% performance boost?

What metric are you considering? The rendering tests show the 2990WX to be both faster and cheaper than the i9.

Slightly faster yes. Cheaper? For now. Price is elastic. The i9 could be priced at any price point because there was no competition until now. Do you think Intel will keep it overpriced for long?

AMD has a manufacturing cost advantage - CPUs are built from dies of 4-core CCXes that can be binned separately.

Intel on the other hand needs to manufacture a monolithic CPU that not only is fault free in enough cores, but performs well. That's harder and yields are way lower.

An 80% yield on a 4-core block is a ~16.8% yield on a 32-core block - and that's before binning.
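That arithmetic checks out; as a sketch (the 80% per-block yield is the commenter's illustrative number, not a real fab figure):

```python
# A monolithic 32-core die needs all eight 4-core blocks defect-free
# at once, so the per-block yield compounds multiplicatively.
block_yield = 0.80
blocks_per_chip = 32 // 4  # 8 blocks

monolithic_yield = block_yield ** blocks_per_chip
print(f"monolithic 32-core yield: {monolithic_yield:.1%}")  # ~16.8%
```

With separate small dies, each block is tested and binned on its own, so nearly 80% of blocks remain usable instead of ~17% of whole chips.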

If AMD's method scales so much easier, why didn't Intel just... do that? Honest question.

Intel could do that. But AMD came out with this method first.

AMD has only been doing this "infinity fabric" thing for a year. Intel was caught with their pants down. It seems like Intel is researching chiplet technology and trying to recreate AMD's success here.

It takes several years to create chips. So Intel realistically won't be able to copy the strategy until 2020 or later. But you better bet that Intel is going to be investing heavily into chiplet technology, now that AMD demonstrated how successful it can be.

AMD introduced HyperTransport in 2001.

HyperTransport and Intel QuickPath aren't chiplet technologies.

AMD "upgraded" HyperTransport to Infinity Fabric, which IIRC uses a bit less power (taking advantage of the shorter, more efficient die-to-die interposer).

Intel has UPI (upgrade over Intel QuickPath), but it hasn't been "shrunk" to chiplet level yet. Intel has EMIB as a physical technology to connect chiplets together... but Intel still needs to create dies and a lower-power protocol for interposer (or maybe EMIB-based) communications.

So Intel has a lot of the technology ready to create a chiplet (like AMD's Zeppelin dies). But Intel wasn't gunning for chiplets as hard as AMD was. Still, Intel demonstrated their chiplet prowess with the Xeon+FPGA over EMIB. So Intel definitely "can" do the chiplet thing, they just are a little bit behind AMD for now.

Intel has done that in the past, actually: their first "dual core" chip (2005) was two chips in a package.


There was a major difference though. Intel's chips communicated over the front-side-bus (not a great solution considering how FSB was already far inferior to HyperTransport).

Sure, that's why the startup I was one of the founders of in that era built a HyperTransport-attached InfiniBand adapter. Intel wasn't very competitive in the supercomputing space back then.

Because it's not free. Communication between cores in different CCXes (and memory access - there is a single memory controller per die) has slightly higher latency than within a single CCX (or a monolithic CPU, though there Intel's advantage decreases with core count due to a different interconnect).

Also because they didn't have to innovate - no competition since early Opterons.

I'm thinking of buying one to run multiple instances of Selenium with headless Firefox for crawling purposes. Pair it with 128GB of RAM and I could easily run 40-50 instances simultaneously.

At the very least I guess it would be useful for running multiple VMs

Any (well written) data analysis code could benefit from the increased core count and high number of memory channels.

Any kind of process that's batchable too.

With 32 cores, I can run 32 small simulations (e.g. CFD, FEA, etc.) in parallel for optimization problems, or run one medium-sized simulation - and anything in between.

My ideal workstation would actually use ~128 cores, but that isn't practical for home use yet. A board with four 2990WXs would be heaven.
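A minimal sketch of that embarrassingly parallel pattern, with `simulate` as a stand-in for a real CFD/FEA solver call:

```python
from multiprocessing import Pool, cpu_count

def simulate(param):
    # Placeholder workload: crude numerical integration of x**param
    # over [0, 1); a real solver invocation would go here instead.
    n = 10_000
    return sum((i / n) ** param for i in range(n)) / n

if __name__ == "__main__":
    params = [p / 10 for p in range(1, 33)]  # 32 independent cases
    with Pool(processes=cpu_count()) as pool:
        results = pool.map(simulate, params)
    print(f"{len(results)} simulations finished")
```

Because each case is independent, a 32-core part can in principle run the whole sweep in roughly the time of the slowest single case, memory bandwidth permitting.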

Video editing or 3D rendering.

You are right; uncovering the potential of a system like that is almost impossible for home use. With most software unable to use even a couple of cores efficiently, 32 is certainly something for the future.

Personally I'm looking forward to something based on 2200GE for home use.

It is interesting that the review is not yet finished. The author writes in the comments section:

Hey everyone, sorry for leaving a few pages blank right now. Jet lag hit me hard over the weekend from Flash Memory Summit. Will be filling in the blanks and the analysis throughout today.

I am disappointed by that, as I was looking forward to reading the test setup and power draw sections. I have a 2990WX on order, and I'm dithering over which motherboard to get (I'd prefer an older one, which better matches the features I want - e.g., clear support for ECC and no bling), but there is some concern that older motherboards will be too close to the edge in terms of the power draw of the 2990WX.

Shouldn't they not publish until the review is done? This is really unprofessional - even bloggers have the concept of "only publish finished drafts" or use the scheduled publishing feature that all blogging platforms have nowadays.

In the PC hardware space there is a big push to have something out by the embargo since that's when 80% of the traffic is. It's unfortunate, but I'd rather have an incomplete article than an inaccurate or shallow one. I'd rather have a review posted too soon than no more Anandtech.

Presumably the embargo lifted and they were afraid of getting scooped. I agree it's unprofessional but hey probably 90% of their readers are hobbyists.

Cores without memory channels being useless: unsurprising. Interconnect power, though - that was a big surprise. Would we be able to get some kind of comparison between this on-package interconnect approach and multi-socket power consumption? Two things seem apparent: multi-socket would have fewer thermal issues, but also lower inter-socket performance.

For raw performance, though, I would guess we will see some rather extreme cooling becoming more mainstream in the future in the workstation space.

This is the perfect chip for continuous, low-memory number crunching. For everything else... not so much. I mean, this chip consumes 74W when idle, of which almost 90% is spent on the interconnect. That's insane. Most important bit in the review for me:

"After core counts, the next battle will be on the interconnect. Low power, scalable, and high performance: process node scaling will mean nothing if the interconnect becomes 90% of the total chip power."

If that number crunching is highly vectorizable, AVX-512 on Skylake-X will likely still be faster. E.g., an Intel 7900X is over twice as fast at multiplying big matrices (multithreaded) as an AMD 1950X. Of course, a similarly priced GPU would stomp both whenever CPU-GPU latency isn't much of an issue.

Meaning "highly vectorizable number crunching that you can't practically run on a GPU for some reason" is probably fairly niche. In LINPACK, even the 1950x beat the 7980XE on Phoronix: https://www.phoronix.com/scan.php?page=article&item=amd-linu... and I'd have thought LINPACK would really reward AVX...

Maybe it isn't optimized to take advantage of it? I love wasting time playing with number-crunching microbenchmarks, and when you do that, AVX-512 seems so unassailably impressive that it's surprising to see this not borne out in larger benchmarks. (Also worth pointing out: you need to tell gcc -mprefer-vector-width=512 on top of -march=skylake-avx512, otherwise it'll prefer 256-bit vectors.)

Anyway, those benchmarks definitely support compiling and number crunching. Pretty ideal for stuff like monte carlo.

Most people used TR for day-to-day desktop use. This really pushes it into the workstation segment. While efficiency is important, results are even more so.

That's what it consumes when one core is active. Idle should be much lower, though you're right that to the extent you're doing anything on most desktops it's going to be just a few threads.

Why do you say low-memory? I got the opposite feeling. I've been drooling over EPYC and TR2 ever since they released the specs for my memory bandwidth limited projects.

The NUMA configuration on TR2 results in two 8-core dies without direct memory access, so you still "only" have four memory channels, and half the cores have to hop over the Infinity Fabric to reach any of them.
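On Linux you would usually keep a job node-local with `numactl --cpunodebind`/`--membind`, but the idea can be sketched in code. This is a hedged illustration: the contiguous core numbering is an assumption, so check `lscpu` or `numactl --hardware` for the real topology.

```python
import os

def node_cores(node, cores_per_node=8):
    """Logical core IDs for a NUMA node, assuming contiguous numbering."""
    start = node * cores_per_node
    return set(range(start, start + cores_per_node))

# Pin this process to node 0's cores so the kernel allocates its memory
# from the local channels (sched_setaffinity is Linux-only).
if hasattr(os, "sched_setaffinity"):
    wanted = node_cores(0) & os.sched_getaffinity(0)
    if wanted:
        os.sched_setaffinity(0, wanted)
```

Pinned this way, a memory-hungry process avoids the extra Infinity Fabric hop described above - provided it lands on one of the two dies that actually has memory channels attached.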

I wonder how many of these tests (if any) are limited by bad multithreading patterns (see: https://www.arangodb.com/2015/02/comparing-atomic-mutex-rwlo...).

I would love AMD commercials using Judas Priest's "The Ripper" [1] as music for selling those beasts! :-D

[1] https://www.youtube.com/watch?v=lriWlHZAy8A

As a long-time Gentoo user, switching to even a Ryzen 7 1700 was a big difference: @world recompiles in ~8 hrs instead of 24+ hrs on 4C/8T CPUs.

Is there any board that allows 256GB ECC RAM for Threadripper 2? So far for TR I've seen only 128GB; for more EPYC was necessary.

How many cores must a CPU have until it's competitive with GPUs? Those beasts could pretty well open the hell-gates to fully interactive raytracing soon, or not so much? Of course I am not expecting THIS processor to be competitive, but things might get interesting if this trend of piling on more CPU cores keeps up.

Not a chance. By the time you get there, GPUs would do it (much) faster. Not to mention that GPUs are likely going to get specialized cores for raytracing (see RTX).

Will there be motherboards to fit two of these chips like in my current dual 6-core Xeon? 64 cores / 128 thread in a single workstation would be insane, and fit in very nicely with my lab :)

Dual socket setups are limited to the EPYC line of processors where 64 core/128 thread setups are already possible.


This looks like a nice part for a server
