
AMD Threadripper 2990WX 32-Core and 2950X 16-Core Review - MikusR
https://www.anandtech.com/show/13124/the-amd-threadripper-2990wx-and-2950x-review
======
Filligree
The biggest benefit I've found to running a 1950X isn't something I would have
expected, but which perhaps I should have. True, it's much faster than my old
system for batch processing, but most of the time it's still idle. Even if
it's running Chrome flat-out, as far as the 1950X is concerned, that's idle.

Because the two NUMA nodes are ~entirely independent, it's capable of running
two independent processes at _full_ speed. In practice, that means lower
latencies and less jitter, and it's been noticeable. Folklore would have it
that single-thread performance is the most important aspect of desktop
performance, but that isn't what I've observed.

...it's also useful when I, e.g., decide to run a Factorio server on my
desktop.

~~~
mrkgnao
> Because the two NUMA nodes are ~entirely independent, it's capable of
> running two independent processes at full speed.

I don't understand. From my (admittedly little better than layperson's)
knowledge, I'm guessing the cores of most multicore processors have to compete
for memory access...? Is there a good search term I can use to help me
understand what's going on here?

~~~
Osiris
There are 2 dies in the 1950X, and each one has 2 memory channels. Thus, it's
possible to run a process on one (8-core) die that maxes out the memory
bandwidth to its two local DDR4 channels while the other die still has
full-bandwidth access to its own DDR4 channels.

Threadripper is able to switch between NUMA (non-uniform memory access) mode
and "regular" mode. In NUMA mode, the OS knows that two channels are attached
to one die and two to the other, allowing lower latencies because the OS knows
which RAM to allocate based on which core the process is running on.

~~~
gascan
As a bonus, if you run in explicit NUMA mode and the OS/code does a good job,
there's little cache-line contention or resource sharing (e.g., of caches)
between the dies.

~~~
Filligree
I found a significant performance benefit to keeping NUMA turned on when
running Linux, for basically every workload.

For Windows, it is the other way around. I hope they'll improve their NUMA
handling, but I'm not holding my breath.

The Linux kernel is clever about this. You can get some idea of what it does
by looking at numactl, which lists the various scheduling modes -- though in
practice the kernel does a great job without any user overrides, and actually
_using_ the command is likely to slow things down.

Which is not to say that it can't occasionally be helpful, if you're trying to
optimize the speed of a single thread. At a minimum, you can choose between
optimizing for bandwidth (interleaving data on all four memory channels) or
latency (putting everything in the local node). Usually you want the latter.
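For instance, a minimal sketch of what numactl exposes (the job binaries in
the comments below are hypothetical placeholders, not anything from the
review):

```shell
# Inspect topology and policy if numactl is installed (no-op otherwise).
if command -v numactl >/dev/null 2>&1; then
    numactl --hardware    # node count, per-node memory, inter-node distances
    numactl --show        # the NUMA policy the current shell would run under
fi
# The two policies discussed above (binary names are hypothetical):
#   numactl --interleave=all ./bandwidth_bound_job          # stripe pages across all channels
#   numactl --cpunodebind=0 --membind=0 ./latency_bound_job # keep CPU and memory on one node
```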

------
jjuhl
It seems performance under Linux is significantly better:
https://www.phoronix.com/scan.php?page=article&item=amd-linux-2990wx&num=1

~~~
MrRadar
I wonder if this comes down to Linux being more adept at handling "exotic"
NUMA configurations compared to (desktop) Windows. Even if the "server"
editions of Windows could handle them Microsoft may have left that
functionality out of the "desktop" kernels for product segmentation purposes.

~~~
jjuhl
There are now Windows Server benchmarks available as well (it's just as bad):
https://www.phoronix.com/scan.php?page=article&item=windows-server-2990wx&num=1

------
xvf22
It seems that they left the plastic cover on the cooler for at least some of
the benchmarks [1]. I can imagine that would limit performance, since the CPU
would be throttling itself down to keep cool.

[1] https://www.anandtech.com/comments/13124/the-amd-threadripper-2990wx-and-2950x-review/611648

~~~
vesrah
You'd think they'd either pull the review or place a disclaimer at the start
about this, but I agree with what you're thinking (throttling).

~~~
wtallis
Ian re-ran all of the affected tests after discovering the mistake, but he
also kept the data for later analysis. Those results aren't in the review yet.

~~~
xvf22
Fair enough, mistakes happen, but the comment didn't make it very clear which
benchmarks were included.

"Me being an idiot and leaving the plastic cover on my cooler, but it
completed a set of benchmarks. I pick through the data to see if it was as bad
as I expected"

~~~
wtallis
> but the comment didn't make it very clear which benchmarks were included

It made it perfectly clear. You were reading from a list of what wasn't in the
review yet: "But here's what there is to look forward to:"

------
srcmap
Two additional benchmarks I'd like to see from AnandTech:

1) Some VM-based tests for this kind of CPU: 32, 64, or 128 VMs all running
some kind of web/db/redis benchmark inside.

2) Some compilation testing: time a clean build of AOSP, BSD, or some very
complex Linux app, using the max jobs setting for parallel compilation, and
time how long it takes to finish. (Measure overall CPU usage at the same
time.)

~~~
CarVac
Phoronix had it compiling the Linux kernel in 32 seconds, compared to 37.5
seconds for the 7980XE.

~~~
sp332
Link: https://www.phoronix.com/scan.php?page=article&item=amd-linux-2990wx&num=6

------
nv-vn
It's funny to me how much range there is in these tests. Some of them are
seemingly backwards, with better CPUs ranking much worse, and some look
completely random. But wow, the performance on these new TR chips is just
insane. A few years ago I never would've guessed that AMD would be this
competitive, especially not on the CPU front. Looking forward to the upcoming
7nm launch; can't wait to see what's in store.

~~~
bluescarni
The test results on Phoronix paint a rather different picture (i.e., the
2990WX being consistently and _markedly_ faster than any other tested
processor).

My initial gut feeling reading the results on AnandTech is that Windows'
scheduler may not yet be able to exploit the processor's architecture
effectively.

~~~
Tuna-Fish
The 2990WX has a very complex NUMA architecture that costs it a lot of
performance if the scheduler doesn't get things right. Linux has long run on,
and has been heavily optimized for, even more complex NUMA systems.

~~~
131012
I'm eager to see the AIDA64 FPU benchmark, or any other scientific-computation
benchmark, on Linux, as that is the area where the i9 seems to dominate the
TR. Sadly, the few benchmarks I've found were on Windows 10.

------
blattimwind
> After core counts, the next battle will be on the interconnect. Low power,
> scalable, and high performance: process node scaling will mean nothing if
> the interconnect becomes 90% of the total chip power.

I think this means that scaling processors to more cores (nodes) in a
lightweight-NUMA scheme just isn't going to work.

My guess going forward is that core counts will stay at the current level for
quite a while, and that both AMD and Intel will optimize the heck out of their
interconnects, using most of their power-budget gains to bolster the actual
cores.

~~~
sounds
I was hoping they would at least make a passing reference to Amdahl's Law.

As you optimize one part of the system so that it is blazingly fast, or lower
power, or whatever, the share of time (or power) taken up by the rest of the
system increases proportionally.

(In other words, real performance wins happen gradually, as all the parts
receive their own little optimizations that add up.)
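For instance (using an assumed, purely illustrative 95% parallel fraction),
Amdahl's Law caps the overall speedup no matter how many cores you add:

```shell
# Amdahl's Law: speedup(n) = 1 / ((1 - p) + p / n) for parallel fraction p.
# With p = 0.95, 32 cores give only ~12.5x, and 64 cores only ~15.4x.
awk 'BEGIN {
  p = 0.95;
  for (n = 1; n <= 64; n *= 2)
    printf "cores=%2d  speedup=%5.2fx\n", n, 1 / ((1 - p) + p / n);
}'
```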

~~~
aurailious
Sounds a lot like how to play factorio.

------
gok
The Chromium compile being faster on 16 cores than on 32 cores is pretty
weird, given how embarrassingly parallel that should be. I wonder if it's
running out of memory, or if the bottleneck is actually linking.

~~~
compilerdev
The Chromium compile looks really strange indeed, especially given the
Phoronix results on Linux showing it's the fastest at compiling the Linux
kernel. I wonder what kind of Chromium build it's doing: does it use clang-cl
or Visual C++? Does it have LTO (LTCG for VC++) enabled? If it's VC++ with
LTCG, for example, the entire code generation and linking stage is limited to
4 threads by default.

~~~
gsnedders
It's Chromium 56 with the default Chromium build-chain.

At that point, Chromium required Visual Studio 2015 Update 3 or later.
AnandTech uses VS Community 2015.3. IIRC, it doesn't have LTCG enabled by
default.

By default, release builds are almost entirely statically linked (with a few
shared libraries, platform dependent), which makes linking time pretty
significant. As a result, it ends up more as a linking benchmark than a
compilation benchmark.

~~~
compilerdev
Ian Cutress replied in the article comments that LTCG is indeed used. With
LTCG, those strange results make sense: it's spending a lot of time on just 4
threads by default. In fact, in the Chromium case the majority of the time is
spent on a single thread; it hits some current limitations of the VC++
compiler regarding CPU/memory usage that make scaling worse for Chromium (but
not for smaller programs, or for non-LTCG builds). Increasing the number of
threads from the default of 4 is possible, but will not help here. The
frontend (parsing) work is well parallelized by Ninja, which is probably why
the Threadrippers still end up ahead of the faster-single-core Intel CPUs.

------
karangoeluw
Honest q: What home-use workstation would use 32 cores? (Excluding home labs
or servers).

~~~
MrUnderhill
For compiling, having many cores is fantastic. Granted, on a workstation,
compilation normally involves just a few files (the ones that have changed
since the previous build, plus anything that depends on them), but when you
have to do a full rebuild, it is fantastic to be able to run `make -j16` and
watch it chug through 16 files simultaneously. Interestingly, the benchmark in
this review shows that the 16-core 2950X compiles Chromium faster than the
32-core 2990WX; presumably this means something other than the thread count
becomes a bottleneck after 16 threads or so.

~~~
Hasknewbie
"this review shows that the 16-core 2950X compiles Chromium faster than the
32-core 2990WX, presumably this means something other than the thread count
becomes a bottleneck after 16 threads"

The article mentions that, due to the die packaging, only 16 of the cores have
direct access to RAM. So on the 32-core part, half the cores are memory-
starved and have to go through the 'connected' cores (impacting those as
well), while the 16-core part doesn't have that problem and can keep all of
its cores fully fed.

~~~
znpy
Might the memory access model (UMA vs NUMA) play a role here? AFAIK the
2990WX has a configurable model (it can be set to work in either UMA or NUMA
mode), whereas the 2950X only has one mode (I can't recall which one at the
moment).

~~~
snuxoll
It's the opposite: the 2950X can be configured in (fake-)UMA ("distributed"
mode in AMD's terms) or NUMA mode, but the WX chips are NUMA-only.

------
drewg123
It is interesting that the review is not yet finished. The author writes in
the comments section:

 _Hey everyone, sorry for leaving a few pages blank right now. Jet lag hit me
hard over the weekend from Flash Memory Summit. Will be filling in the blanks
and the analysis throughout today._

I'm disappointed by that, as I was looking forward to reading the test setup
and power draw sections: I have a 2990WX on order, and I'm dithering over
which motherboard to get. I'd prefer an older one, which better matches the
features I want (e.g., clear support for ECC and no bling), but there is some
concern that older motherboards will be too close to the edge in terms of the
power draw of the 2990WX.

~~~
barkingcat
Shouldn't they have waited to publish until the review was done? This is
really unprofessional - even bloggers have the concept of "only publish
finished drafts", or use the scheduled-publishing feature that all blogging
platforms have nowadays.

~~~
mastax
In the PC hardware space there is a big push to have something out by the
embargo since that's when 80% of the traffic is. It's unfortunate, but I'd
rather have an incomplete article than an inaccurate or shallow one. I'd
rather have a review posted too soon than no more Anandtech.

------
Arbalest
Cores without memory channels being useless: unsurprising. Interconnect
power: that was a big surprise. Would we be able to get some kind of
comparison between this on-package interconnect approach and multi-socket
power consumption? Two things seem apparent: multi-socket would have fewer
thermal issues, but also lower inter-socket performance.

For raw performance, though, I would guess we will see some rather extreme
cooling becoming more mainstream in the workstation space in the future.

------
deng
This is the perfect chip for continuous, low-memory number crunching. For
everything else... not so much. I mean, this chip consumes 74W when idle, of
which almost 90% is spent on the interconnect. That's insane. The most
important bit in the review for me:

"After core counts, the next battle will be on the interconnect. Low power,
scalable, and high performance: process node scaling will mean nothing if the
interconnect becomes 90% of the total chip power."

~~~
Coding_Cat
Why do you say low-memory? I got the opposite impression. I've been drooling
over EPYC and TR2 for my memory-bandwidth-limited projects ever since they
released the specs.

~~~
snuxoll
The NUMA configuration on TR2 leaves two of the 8-core dies without direct
memory access, so you still "only" have four memory channels, and half the
cores have to hop over the Infinity Fabric to reach any of them.

------
gavanwoolery
I wonder how many of these tests (if any) are limited by bad multithreading
patterns (see:
https://www.arangodb.com/2015/02/comparing-atomic-mutex-rwlocks/).

------
faragon
I would love to see AMD commercials using Judas Priest's "The Ripper" [1] as
the music for selling these beasts! :-D

[1] https://www.youtube.com/watch?v=lriWlHZAy8A

------
ece
As a long-time Gentoo user, switching to even a Ryzen 7 1700 was a big
difference: @world recompiles in ~8 hours instead of 24+ hours on 4C/8T CPUs.

------
bitL
Is there any board that allows 256GB of ECC RAM for Threadripper 2? So far
for TR I've only seen 128GB; for more, EPYC was necessary.

------
arithma
How many cores must a CPU have before it's competitive with GPUs? Could these
beasts open the gates to fully interactive raytracing soon, or not so much?
Of course I'm not expecting THIS processor to be competitive, but with the
current interest in raytracing, things might get interesting if this trend of
piling on more CPU cores keeps up.

~~~
Baal
Not a chance. By the time you get there, GPUs will do it (much) faster. Not
to mention that GPUs are likely to get specialized cores for raytracing (see
RTX).

------
SubiculumCode
Will there be motherboards that fit two of these chips, like my current dual
6-core Xeon setup? 64 cores / 128 threads in a single workstation would be
insane, and would fit in very nicely with my lab :)

~~~
tempay
Dual-socket setups are limited to the EPYC line of processors, where
64-core/128-thread configurations are already possible.

------
qaq
This looks like a nice part for a server.

