
Ryzen Threadripper Pro 3995WX Spotted - stambros
https://www.guru3d.com/news-story/spotted-ryzen-threadripper-pro-3995wx-processor-with-8-channel-ddr4,2.html
======
Uehreka
I feel like 64 cores is getting rather close to a tipping point: What
workloads are so massively parallel that they _can_ use 64 cores of x86 but
_can’t_ use the thousands of CUDA cores on a Quadro card? I’m sure right now
there are workloads that just need particular x86 instructions, but that feels
like a temporary problem.

Am I wrong about that being a temporary problem (that would feel frustrating)?
Are these cores just that much faster than CUDA cores? What else am I missing
here?

I often hear people talk about getting CPUs like this for deep learning
research, but all the deep learning work I’ve done goes straight to CUDA and
lands on my GPU.

~~~
ajross
> What workloads are so massively parallel that they can use 64 cores of x86
> but can’t use the thousands of CUDA cores on a Quadro card?

Building software, for one. C compilers and python interpreters don't run on a
GPU. Lots of stuff doesn't run on a GPU. In fact in practice the only things
that run on a GPU are the tiny handful of known subproblems that the industry
has collectively decided are "GPU problems".

Like it or not general purpose scalar software is, has always been, and will
always remain the standard mechanism by which computing hardware is applied to
new problems. Everything else is an optimization around the edges.

~~~
waltpad
_Disclaimer: I am not a HW designer, I could very well be wrong._

It is true that there are tasks that benefit from threading but still require a
CPU rather than a GPU. I wonder, however, whether these tasks really need full
SSE/AVX etc. Couldn't these extensions be removed from the CPU cores, with the
necessary work performed by the GPU instead?

It would be interesting to gather statistics on how much these extensions are
actually used in those scenarios. Imagine how much space and complexity could
be saved on a CPU die by making stripped-down versions. That space could in
turn be used for more cores!

I read a little about the Xeon Phi CPUs, which, iirc, are multicore CPUs with a
very small ISA, but I wonder why x86 makers aren't trying to go in that
direction: aren't there plenty of dedicated workloads which would happily run
on these (eg, web servers), or is this just a (too) simplistic view?

~~~
TinkersW
I think the opposite is where things need to go. Having a wide SIMD ALU
quickly accessible from your CPU core is very useful, especially as it shares
the same memory system and has a much more flexible programming model that
allows you to do everything in a single source file.

~~~
waltpad
The programming model is not very flexible at the lowest level: one has to
create all the software infrastructure to communicate with the GPU (which
boils down to sending commands and receiving responses). There are languages
(like Futhark, Julia, or even Python) which handle all that boilerplate
transparently.

The main problem is, afaik, that these languages don't give enough control
over where the code will run. At some point, one will want to describe all the
algorithms in a single language and somehow describe how the workload should
be distributed across all the processors, or at least that's what I've been
thinking about for a while. Once you have that level of control, the need for
a versatile CPU is less clear. Note that nowadays people seem happy with
hybrid solutions where the code is scattered across several languages (eg, one
for the main program and one for the shaders, or for the client-side UI), so
my position is maybe not very strong.

HW-wise, is it possible that integrated GPUs are the first steps toward an
architecture where CPU and GPU have better interconnections (ie, larger
communication bandwidth and smaller latency), to the point where SIMD becomes
moot? There is also the SWAR approach, where one doesn't rely on intrinsic
SIMD instructions but instead emulates them (though it's probably not very
realistic for floating-point computation).
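To make the SWAR idea concrete, here's a minimal sketch (the helper name and
constants are just for illustration): eight 8-bit lanes packed into one 64-bit
integer, added lane-wise using only ordinary integer arithmetic.

```python
# SWAR (SIMD Within A Register): eight parallel 8-bit additions inside one
# 64-bit integer, using only ordinary integer ops - no SIMD instructions.

H = 0x8080808080808080  # the high bit of each 8-bit lane
L = 0x7F7F7F7F7F7F7F7F  # the low 7 bits of each lane

def add_bytes(x: int, y: int) -> int:
    """Add eight packed bytes lane-wise; carries never cross lane borders."""
    low = (x & L) + (y & L)     # add low 7 bits; carry stays inside the lane
    return low ^ ((x ^ y) & H)  # fold in the top bits without carrying out

a = 0x01020304050607FF
b = 0x0101010101010101
print(hex(add_bytes(a, b)))  # 0x203040506070800 - the 0xFF lane wrapped to 0x00
```

Real compilers do the equivalent with hardware SIMD, but the trick shows why
lane-wise integer math is cheap to emulate and lane-wise floating point is not.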

Some other ideas:

\- Apple has this neural engine in their latest chips, which is basically
dedicated HW for neural networks

\- In the wild, people are getting more and more interested in building their
custom ASICs to cut software's middle-man cost: for them, the CPU solution is
not good enough

\- Intel recently introduced a new matrix ops extension in their CPUs: maybe
at some point they'll introduce full GPU capabilities directly baked in the
CPU? I am a little worried about the resulting ISA.

Anyway, I am not an HW engineer, nor a very good software one. I only have a
limited view of the difficulties in writing good, CPU- or GPU-efficient code.
My first post was prompted by remembering the first "large scale" multicore
CPUs 15 years ago (specifically the UltraSPARC T1), which weren't SIMD-heavy.
The direction naturally shifted as progress was made on SIMD to try to compete
with GPUs, when it seems to me that originally CPUs and GPUs were
complementary.

I tend to support modular solutions, but I don't know how costly that would be
in terms of efficiency at the HW level.

------
fomine3
> The big news is that the Asia website reported the 3995WX is an offering
> 8-channel DDR4 memory interface, up to 2 TB of it

This indicates the 3995WX supports not only 8 channels but also RDIMM/LRDIMM,
which isn't supported on the current Threadripper. It should need a new socket
rather than the current TRX40 (TRX80 was rumored a year ago). Threadripper
gets closer to EPYC.

I expect the PCIe lane limitation will remain to keep the lines differentiated
(though it's a lot of lanes even when limited).

~~~
tinco
That would be insane. I hope motherboard manufacturers take notice and release
an IPMI board for it. We currently run a cluster that has some 3970X's in it,
and it's super awkward that we couldn't get remote management for them. ASRock
has announced availability of one starting August, but it's a little too late
for us.

~~~
MaKey
What about the X399D8A-2T from ASRock Rack? It's been available since October
2019.

~~~
tinco
Hm shit, now that you mention it, I didn't even consider X399. The only
downside seems to be the lack of support for 3200 MHz DDR4. In our case that
board specifically doesn't have room for the 4th GPU, so it's not an option
for us.

We've got a weird use case: we're using (black box) software that's sensitive
to clock speed, scaling only to about 8 threads before it starts levelling off
aggressively (to the point where a 3970X benchmarks as faster than a 3990X).
So what we do is run 4 instances, each with its own GPU and 8 cores assigned
to it. I'd be surprised if this makes sense anywhere else (as EPYC has all
sorts of other advantages in HPC).
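A rough sketch of that setup, assuming Linux (os.sched_setaffinity) and GPU
selection via CUDA_VISIBLE_DEVICES; "./blackbox-solver" is a hypothetical
stand-in for the actual black-box binary:

```python
# Launch N instances, each pinned to its own block of 8 cores and its own GPU.
import os
import subprocess

CORES_PER_INSTANCE = 8

def core_set(idx: int, per_instance: int = CORES_PER_INSTANCE) -> set:
    """Cores [idx*per_instance, (idx+1)*per_instance) for instance idx."""
    return set(range(idx * per_instance, (idx + 1) * per_instance))

def launch_instance(idx: int) -> subprocess.Popen:
    # One GPU per instance, selected via the environment.
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(idx))
    proc = subprocess.Popen(["./blackbox-solver"], env=env)  # hypothetical binary
    os.sched_setaffinity(proc.pid, core_set(idx))  # pin to its 8 cores (Linux)
    return proc

# procs = [launch_instance(i) for i in range(4)]  # 4 instances x 8 cores
```

The same pinning can of course be done from the shell with taskset; the point
is just keeping each instance inside the core count where the software still
scales.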

------
soygul
A lot of people question the usefulness of 64+ cores on a workstation CPU. It
is true that audio/visual tools like Adobe After Effects stop benefiting from
extra cores at around 32 cores. [1]

But if you want to simulate production workloads on your machine, say with
Apache Spark or Flink, this is prime territory. Couple that CPU with 256+ GB
of memory and you can start exploring the bottlenecks in your batch algorithms
using as many cores as you want.

[1] [https://www.pugetsystems.com/labs/articles/After-Effects-CPU...](https://www.pugetsystems.com/labs/articles/After-Effects-CPU-performance-AMD-Threadripper-3990X-64-Core-1658/)

~~~
cameron_b
I think we will be contending with this sort of barrier for a while in desktop
software. 32 cores was massively parallel 10 years ago; 15 years ago it was
nuts to think that we'd have that in a get-it-from-Newegg offering.

Massively parallel operations take a different focus on your MPI/bus/queue
architecture than we've ever needed for desktop software. BUT I think what
we've already seen is that in general purpose computing there will always be
someone ready to consume resources as they are available.

Both feet don't move at the same time, one takes a step and then the other.

------
Roritharr
I wonder how much the silicon quality differs within a single Threadripper
chip.

It would be really interesting to see whether it's feasible to overclock the
best NUMA node to a degree that gives this a couple of high single-thread
performance cores comparable to the 3900X or 3950X while retaining overall
stability.

~~~
kllrnohj
> Overclock the best numa node to a degree

Do you mean the best CCX? Or the best chiplet? There aren't any NUMA nodes
here, and the cache & memory architecture is already the same as a 3900X.

Per-core & per-CCX overclocking does already exist on Ryzen though; it should
work the same on Threadripper.

~~~
Roritharr
Interesting, is it already feasible to pin tasks to such OCed cores?

It would be interesting to turn the 3990/3995 into a does-it-all chip with
best in class single-core performance on certain cores and just lots of a tad
slower cores in general.

~~~
kllrnohj
That's essentially what it's supposed to do out of the box. The chip already
communicates to the OS what its "best" and "good" cores are for single- and
low-thread count workloads: [https://www.anandtech.com/show/15137/amd-clarifies-best-core...](https://www.anandtech.com/show/15137/amd-clarifies-best-cores-vs-preferred-cores)

These are the cores that can turbo the highest.

You can attempt to manually OC higher, but that by & large doesn't work
without extreme cooling. Instead it's typically about achieving higher all-
core frequencies than the built-in all-core turbo, usually via higher voltages
& power limits (and of course much better-than-stock cooling).

Indeed if you look at single-thread cinebench numbers you'll find the 3960x &
3970x right in the middle of the 3700x, 3800x, and 3900x pack:
[https://www.guru3d.com/articles_pages/amd_ryzen_threadripper...](https://www.guru3d.com/articles_pages/amd_ryzen_threadripper_3960x_review,7.html)

They all hit around that same 4.5 GHz single-core turbo mark without any
overclocking. And Zen 2 by & large doesn't really go much beyond 4.5 GHz, so
there's not all that much manual overclocking you can do without going to sub-
ambient cooling.

The only real downside to 3rd-gen Threadripper is the price. It's already a
pretty killer jack-of-all-trades CPU otherwise. It's a very competent gaming
CPU without any tweaks at all, right out of the box. It doesn't have the cons
of the 1st & 2nd gen Threadrippers, which were actually NUMA and therefore
came with huge gaming downsides.

------
Keyframe
Does that mean more than 256GB of RAM without having to go for an EPYC?

~~~
bkolobara
The article mentions support for up to 2 TB of RAM.

~~~
Keyframe
True, and 3990x supports more than 256, but not in practice (motherboards)

~~~
heelix
The (bit of) article mentioned 2 TB, but the thing that was interesting was
the 8-channel memory. My current TR4 boards do quad channel, so 2 banks of 4
slots. This sounds like a single set of 8... so I wonder if this is a
potential BIOS update on the sTR4 or if there is a new chipset en route?

~~~
AdrianB1
The traces are wired to CPU pins; no BIOS update can make 8 channels out of
CPU pins for only 4 channels. You need a different socket with double the
number of memory pins to double the number of channels.

Your 4 channels with 2 slots each have each channel daisy-chained to 2 slots,
but they are still just 4 channels. With 8 channels and 2 slots per channel
you can have 16 slots; with 32 GB unbuffered (regular) DIMMs you are limited
to 512 GB of RAM, so for 2 TB you need registered or load-reduced RAM.
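The capacity ceiling falls out of simple arithmetic (DIMM sizes as assumed
above):

```python
# Slot and capacity arithmetic for an 8-channel, 2-slots-per-channel layout.
channels = 8
slots_per_channel = 2
slots = channels * slots_per_channel  # 16 DIMM slots in total

udimm_gb = 32    # largest common unbuffered (regular) DIMM
lrdimm_gb = 128  # registered / load-reduced DIMM

print(slots * udimm_gb)   # 512 GB ceiling with UDIMMs
print(slots * lrdimm_gb)  # 2048 GB = 2 TB with (L)RDIMMs
```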

~~~
heelix
Ah... that answers my real question then. sTR4 will likely be a short lived
series.

------
hellofunk
It’s unclear to me what eight channel memory actually means, I’m not used to
seeing that mentioned in casual architecture overviews. What’s the benefit of
that? Is that something that a programmer must explicitly take advantage of,
or is that just an automatic part of the CPU?

~~~
kllrnohj
Think RAID 0 but for RAM done by the CPU.

Nearly every consumer CPU from the last 20 years is dual channel, meaning "2
drive RAID 0". If you've seen recommendations to get paired DDR memory
sticks, this is why: you only get this RAID-like benefit with multiple RAM
sticks (just like RAID 0 of a single drive doesn't do anything). It's also why
some consumer products have unexpectedly bad performance for the CPU specs -
cheaper laptops may skimp here and run a single stick of memory instead of 2,
which means half the memory bandwidth.

8 channel then means the CPU can do this across up to 8 sticks, for 8x the
bandwidth instead of the more common 2x.
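A back-of-the-envelope sketch of how that scales, assuming DDR4-3200 and the
standard 64-bit (8-byte) bus per channel:

```python
# Peak memory bandwidth scales linearly with channel count ("RAID 0 for RAM").
transfers_per_s = 3200e6  # DDR4-3200: 3200 MT/s
bytes_per_transfer = 8    # 64-bit bus per channel

per_channel = transfers_per_s * bytes_per_transfer  # 25.6 GB/s

print(per_channel / 1e9)      # 25.6  - one stick / one channel
print(2 * per_channel / 1e9)  # 51.2  - common dual-channel desktop
print(8 * per_channel / 1e9)  # 204.8 - 8-channel Threadripper Pro
```

These are theoretical peaks; real workloads see less, but the ratio between
configurations holds.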

~~~
Hello71
also notably this doesn't help with latency, which is more often the sticking
point.

~~~
gameswithgo
with 64 cores and 4 channels, throughput may often be a sticking point.

that is like having 32 cores on a normal 2-channel desktop; throughput would
often be a limiter there.

------
quyleanh
I can't wait to see how powerful the Ryzen 4 desktop chips will be.

------
bitL
So TRX80/WRX80 were real after all...

------
MuffinFlavored
64 cores? how much would that cost? around $1k or more? is this for the
consumer line or server line of products?

~~~
forrestthewoods
The 64-core 3990X is MSRP $4000. I’d expect this newfangled 3995WX to be that
much or more.

~~~
mcraiha
CPU and motherboard prices aren't really relevant since RAM is so expensive.
And the only real reason for getting this instead of a 3990X is that you need
e.g. 1 terabyte of RAM: [https://www.amazon.com/Tech-12x128GB-2933MHz-PC4-23400-288-P...](https://www.amazon.com/Tech-12x128GB-2933MHz-PC4-23400-288-Pin/dp/B07ZZMSWJX/ref=pd_lpo_147_t_0/135-5978031-6451344)

~~~
simcop2387
There might be some other cases for less RAM. If the rumor that it supports
RDIMM or LRDIMM modules is true, then it could enable much easier ECC support,
since there are not a lot of ECC UDIMM options out there.

~~~
amarshall
ECC UDIMMs are mostly limited in clock speed, but many can be comfortably
overclocked. I presently have four 2666 MHz ECC UDIMMs overclocked to 3200 MHz
in my TRX40 workstation (M391A2K43BB1-CTD, if anyone is curious).

~~~
pdimitar
Do the RAM sticks require separate cooling?

~~~
amarshall
No. For the most part heat spreaders on RAM are a gimmick (unless you’re
pushing really high OC, which this is not).

------
runawaybottle
Guru3d seriously hasn't updated their website design in 20 years? It looks the
same as it did in the 2000s.

~~~
derision
if it ain't broke..

