
64 Core Threadripper 3990X CPU Review - smacktoward
https://www.anandtech.com/show/15483/amd-threadripper-3990x-review/3
======
axilmar
The article presents the situation as if the operating system's limit on the
number of threads were not a conscious choice by Microsoft but a natural law
that cannot be avoided.

Microsoft deliberately imposes these limitations in order to force people to
pay more for its software. It's a shame, really.

~~~
daemin
Supporting only 64 processors in a group on a 64-bit operating system seems
like a reasonable and sane technical solution. It means you can use a single
64-bit variable as a bitmask for various processor-related functions in a
process.

I would bet that many other bits of software also have this limitation,
because they too thought that using a 64-bit value for a processor mask would
be sufficient.

~~~
kijin
A 6-bit variable would be enough to hold 64 values.

~~~
Adrock
That doesn’t allow you to use it as a bit mask, which is what OP was saying.

~~~
phamilton
Another way to express this is that a 64-bit value can express every
combination of processors, so if you want to say something like "this process
should run on these 12 specific processors" you can do so in a single 64-bit
value.
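
To make that concrete, here is a small sketch (mine, not anything from the
article) of a set of processors packed into one 64-bit integer:

```python
def make_affinity_mask(cpus):
    """Build a 64-bit mask with one bit set per selected processor."""
    mask = 0
    for cpu in cpus:
        if not 0 <= cpu < 64:
            # This is exactly the limit under discussion: one 64-bit
            # value can only address processors 0-63 within a group.
            raise ValueError("a 64-bit mask can only address CPUs 0-63")
        mask |= 1 << cpu
    return mask

def mask_to_cpus(mask):
    """Recover the set of processors encoded in a mask."""
    return {i for i in range(64) if mask & (1 << i)}

# "Run this process on these 12 specific processors" as a single integer:
twelve_cpus = make_affinity_mask(range(12))
```

Any subset of 64 processors round-trips through one machine word this way,
which is why the representation is so tempting.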

------
logicchains
I'm curious, to anyone working on Windows kernel development, at what point
does a feature/scheduler improvement become so good that people decide "nah,
this is way too cool to put in Windows Home Edition, let's feature-gate it to
the enterprise version instead!"?

~~~
sbergot
I guess it depends on their segmentation strategy. If you have not read it
already, there is a great article by Joel Spolsky on this topic:
[https://www.joelonsoftware.com/2004/12/15/camels-and-rubber-...](https://www.joelonsoftware.com/2004/12/15/camels-and-rubber-duckies/)

In this case it seems pretty justified to include this feature in the
enterprise edition only.

~~~
CamperBob2
Great. How do I, a sole proprietor, buy this "enterprise edition?"

~~~
jsjohnst
Sign up for a Microsoft small business account. Via that, you get an MSDN (or
whatever they call it these days) subscription. Then you can download any
version of the OS and get a license key from the same webpage to activate it.

~~~
sixothree
Any chance you have a link to this offering? Not seeing MSDN being listed as
included with anything buyable.

~~~
jsjohnst
Apologies, it seems the program was discontinued in 2018. Look up details on
Microsoft’s BizSpark program for more info.

This program replaced it, but I have zero info besides knowing it replaced
BizSpark so might not be comparable:

[https://startups.microsoft.com/en-us/](https://startups.microsoft.com/en-us/)

------
tbenst
Wish they’d review on Linux. Windows does not seem like the target audience
given all the limitations.

~~~
rwmj
I've played with the AMD Daytona Rome server (two EPYC sockets, 2*64 = 128
cores, 256 threads) running RHEL, and it rocks. However, it's quite hard to
find workloads that keep all 256 threads busy at once. Most builds aren't
nearly parallel enough, and most programs can't find work for 256 threads. So
as a personal machine, 128 or 256 threads aren't really worth it unless money
is no object. Likely the best current use for these is as servers running
large numbers of virtual machines or containers.

~~~
glangdale
I am craving one of these for my superoptimizer. The level of task parallelism
I have is north of 100M independent jobs; my last run took a single-core
machine 20 days. It's pretty rare to have a workload like this but as more
machines ship with >16 cores, I think more developers will look at the order-
of-magnitude improvements of parallelizing their tasks where possible.
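
A workload like that maps naturally onto a process pool. A minimal sketch (my
own, with a placeholder job function, not the actual superoptimizer):

```python
from concurrent.futures import ProcessPoolExecutor
import os

def run_job(job_id):
    # Stand-in for one independent unit of work, e.g. one solver invocation.
    return job_id * job_id

def run_all(job_ids):
    # With a huge job count, a generous chunksize amortizes the
    # inter-process communication overhead per job.
    with ProcessPoolExecutor(max_workers=os.cpu_count()) as pool:
        return list(pool.map(run_job, job_ids, chunksize=1024))
```

Because the jobs are fully independent, this scales close to linearly with
core count, which is exactly where a 64-core part shines.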

~~~
0-_-0
Could you run it on the GPU then? Or does it have a lot of branches?

~~~
glangdale
It uses an SMT solver (can use Z3, Yices or Boolector) all of which are very
complex and branchy. So no GPGPU - maybe some specialist in SAT solving or SMT
could build that one day, but that person would not be me.

------
nxc18
Not sure who needs to hear this, but another _huge_ Windows 10 limitation: no
support for nested virtualization on AMD processors. This means Ryzen users
can’t benefit from a bunch of security improvements, and things like the new
Windows 10X emulator can’t work.

Kind of off topic, but it’s the kind of nasty surprise I wouldn’t want to get
after deciding to buy, so hope this helps someone.

[https://windowsserver.uservoice.com/forums/295047-general-fe...](https://windowsserver.uservoice.com/forums/295047-general-feedback/suggestions/31734808-nested-virtualization-for-amd-epyc-and-ryzen)

~~~
sebazzz
Is this a limitation of the processor or of Windows itself? Since
virtualization is not part of x86 proper but vendor-specific, is the AMD
implementation difficult to support?

~~~
nxc18
It appears to be a Windows limitation - I'm not an expert, but my
understanding is that AMD supports all the same extensions for it (under AMD
rather than Intel branding) and it appears the Windows team is (finally)
working on it. I understand that other hypervisors do support nested virt on
AMD.

To be fair to the Windows team, AMD in the data center / pro desktop space
wasn't really viable for a very long time, so it's understandable that it
wasn't prioritized.

------
numlock86
The last benchmark is interesting:

> 1080p60 HEVC at 3500 bitrate with "Fast" preset - 319 fps

OK, why were these parameters chosen? What's the application? I recommend
everyone look at 1080p60 video footage encoded with H.265 using the "fast"
encoder preset at 3500 bitrate. Calling it terrible would be a compliment.
Unless you encode really slow, visually easy motion, which raises the question
of why you would need 60 fps in the first place. Even at the "medium" preset
with 1080p60 you should, regardless of application, be at least in the 5000+
range with your bitrate. And even that comes with a lot of trade-offs, because
that's just where live streaming starts.

~~~
Sesse__
I believe most of these benchmarks were set at a time when you simply couldn't
run x265 on “slow” on any reasonable CPU if you ever wanted it to complete.
But yes, I'd really like CPU benchmarkers to move to higher-quality presets
for video encoding, because they do tend to have different kinds of
performance curves.

Fun fact: There's no point in running x265 on the fastest presets unless you
absolutely need to have HEVC; x264 on slow is faster _and_ gives better
quality per bit. See the second graph on
[https://blogs.gnome.org/rbultje/2015/09/28/vp9-encodingdecod...](https://blogs.gnome.org/rbultje/2015/09/28/vp9-encodingdecoding-performance-vs-hevch-264/) (a few years old, the situation is likely to look
similar but not identical).

~~~
ksec
>(a few years old, the situation is likely to look similar but not identical).

x265 has made massive improvements over the years; in 2015, x265 wasn't even
considered good despite all of its hype. Another way to think about it is how
well x264 managed to squeeze every last bit of detail possible.

~~~
Sesse__
IIRC, I redid this graph in early 2019 (using Tears of Steel), and it looked
pretty similar.

------
govg
Unsure how to tag people like dang, but is it possible to change the link or
the title? The link is for page 3 of the review, which is a broader discussion
about multi-threading on Windows.

------
rbanffy
64 threads ought to be enough for anyone.

If I am investing USD 4000 in a CPU, I'd probably go for the EPYC part for 500
more and twice the memory bandwidth. It'd be interesting to run these
benchmarks under perf to see how many L3 cache misses happen and how much they
cost in cycles.

~~~
tiernano
The EPYC also allows you to use more RAM... it seems the Threadripper tops out
at 256GB due to the memory type
([https://www.youtube.com/watch?v=1LaKH5etJoE](https://www.youtube.com/watch?v=1LaKH5etJoE)),
but EPYC would allow 2TB...

~~~
Tepix
There were rumors about TRX80 and WRX80 chipsets that raised the 256GB RAM
limit to 2TB for Threadripper CPUs. Alas, they never went past the rumors
stage.

~~~
close04
They don't exist [0] and the chipset wouldn't really influence memory support
since you have the IMC in the CPU.

[0] [https://www.anandtech.com/show/15359/trx80-and-wrx80-dont-ex...](https://www.anandtech.com/show/15359/trx80-and-wrx80-dont-exist-neither-does-the-intel-lga1159-socket)

------
pella
[https://news.ycombinator.com/item?id=22266386](https://news.ycombinator.com/item?id=22266386)

------
tasubotadas
Reading this makes me mad that Python still hasn't sorted its business out
with GIL.

~~~
glangdale
Don't know why the downvotes; this limitation of Python is really hard to
fathom. IIRC they had a project to remove it and it went into the weeds
somehow (I think it made performance worse?)

~~~
bildung
Because e.g. nodejs has a GIL, too, and apparently no one thinks this is a
problem.

For web applications one usually has a software chain like web server <-> wsgi
server <-> dozens of python instances.

Standalone processes just implement threading, which also is fairly easy (as
far as threading itself can be easy).

Scientific libraries like scipy can use parallel processes automatically in
the background (using things like BLAS), as long as the data is modelled
correctly.
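
To illustrate the threading half of that (my sketch, with a simulated blocking
call): the GIL is released while a thread blocks on I/O, so plain threads
overlap fine for I/O-bound work even in a single interpreter:

```python
import queue
import threading
import time

def fetch(url, out):
    # Stand-in for a blocking network call; the GIL is dropped while this
    # thread sleeps, so many such threads make progress concurrently.
    time.sleep(0.01)
    out.put((url, "ok"))

def fetch_all(urls):
    out = queue.Queue()  # thread-safe channel for results
    threads = [threading.Thread(target=fetch, args=(u, out)) for u in urls]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return dict(out.get() for _ in urls)
```

The same pattern falls apart for CPU-bound work, which is where the
multiprocessing / external-library routes above come in.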

~~~
tasubotadas
I'm just going to repost what I've written about this before:

The current state of threading and parallel processing in Python is a joke.
While they are still clinging to the GIL and single-core performance, the rest
of the world is moving to 32 core (consumer) CPUs.

Python's performance, in general, is crappy[1] and is beaten even by PHP
these days. All the people who suggest relying on multiprocessing probably
haven't done anything that's CPU- and memory-intensive, because if you have
code that operates on a "world state", each new process will have to copy that
from the parent. If the state takes ~10GB, each process will multiply that.

Others keep suggesting Cython. Well, guess what? If I am required to use
another programming language to use threads, I might as well go with
Go/Rust/Java instead and save the trouble of dabbling with two languages.

So where does that leave (pure-)Python? It can only be used in I/O bound
applications where the performance of the VM itself doesn't matter. So it's
basically only used by web/desktop applications that CRUD the databases.

It's really amazing that the machine learning community has managed to hack
around that with C-based libraries like SciPy and NumPy. However, my
suggestion would be to drop the GIL and copy whatever model has been working
for Go/Java/C#. If you can't drop the GIL because some esoteric features
depend on it, then drop them as well.

[1] [https://benchmarksgame-team.pages.debian.net/benchmarksgame/...](https://benchmarksgame-team.pages.debian.net/benchmarksgame/fastest/python.html)

~~~
kortex
Every single project which has tried to drop the GIL has failed in some way.
It's not some "esoteric features", it's fundamentally a hard problem that
implicates the entirety of the python object model, python C api, scoping,
imports, and GC.

I think multi-interpreting is the way to go, but that still would require a
framework for ensuring safe memory access.

Speaking of Go, I always thought it would be neat to write a Python
implementation in Go, leveraging Go's GC and implementing the 'go'
keyword/function for easy parallelism. But you still have the problem of
scoping and memory safety. Or a similar idea with Rust. Something tells me
that isn't a trivial undertaking, especially if you want all the libraries,
which is 75% of the point of Python.

------
jgaa
The easy solution is to run Linux.

~~~
trasz
Or FreeBSD. There have been some huge improvements to scalability of network,
filesystems, and memory management recently.

~~~
fullstop
Any details on that? We're spinning up a 48-core EPYC soon on 12.1. The last
"beefy" server we did was a 32-core Intel on 11.x, and I'll be happy if these
changes are in 12.1.

~~~
drewg123
You want 13-CURRENT. A lot of the NUMA work that has been done is pretty
invasive, and not suitable for backporting because it changes KBIs.

~~~
trasz
Definitely. Some of the network changes (epoch(9), kind of like RCU) went into
12, IIRC, but you mostly want 13-CURRENT.

------
ageofwant
Honestly, Windows is just a waste on this architecture as it stands. Run Linux
as the base OS and give 8 cores to a Windows VM should you have use for it.

------
thomasahle
288MB of cache. I wonder if we'll ever be given some control over how and what
computations are cached. It seems like a lot of memory to leave to simple
heuristics.

~~~
szatkus
It's 32MB of L3 per die and 512KB of L2 per core; they sum them for marketing
effect. Effectively, one core can access "only" 32.5MB.

~~~
szatkus
Erratum: since each die contains two CCXs, one core can access only 16.5MB.
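
The arithmetic behind those figures (my own check, using the numbers in these
comments):

```python
# 64 cores x 512KB L2, plus 8 dies x 32MB L3, gives the headline figure;
# a single core only reaches its own CCX's L3 slice plus its own L2.
cores, l2_per_core_mb = 64, 0.5
dies, l3_per_die_mb = 8, 32
ccx_per_die = 2  # each die's L3 is split between two CCXs

total_cache_mb = cores * l2_per_core_mb + dies * l3_per_die_mb        # 288.0
per_core_reachable_mb = l3_per_die_mb / ccx_per_die + l2_per_core_mb  # 16.5
```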

------
sgt
Now imagine a Beowulf cluster of these.

------
wewake
This article is more about Windows 10 limitations than the 3990X. Good read
though.

------
tareqak
What is the latest on just turning off SMT/Hyperthreading? Then you don't run
into the more-than-64-threads issue with this CPU. I remember there being a
reason to turn it off unrelated to performance, but I do not remember if there
was more than one reason [0].

[0] [https://marc.info/?l=openbsd-tech&m=153504937925732&w=2](https://marc.info/?l=openbsd-tech&m=153504937925732&w=2)

~~~
bluedino
Or save $1990 and just buy the 3970X

~~~
masklinn
The 3970X has half the threads because it has half the cores, not because SMT
is disabled.

You may want to avoid the issue by giving up 10% of your performance, less so
by giving up 50%.

~~~
gameswithgo
3970X also has much higher clock rates though! Might be a net win for some
workloads.

~~~
tareqak
Yes, but if you care about 64 threads without the possible side-channel issues
that SMT currently/theoretically has, then you are back to the 3990X.

For what it is worth, I have an AMD Ryzen 2700X Eight-Core Processor that I
got in 2018, and I keep SMT off. I do some light gaming with it, and I am
happy. I did not notice a big drop in performance, but I did not truly measure
the difference.

------
alg0rith
Meanwhile, AMD's Navi doesn't have stable drivers...

------
exabrial
The article is more about a bug in Windows and its scheduler than actual
benchmarks.

~~~
ajross
To be fair, scheduler misfeatures are a whole lot more interesting than
benchmarks that scale in expected ways.

