
AMD Zen 2 Microarchitecture Analysis: Ryzen 3000 and EPYC Rome - pella
https://www.anandtech.com/print/14525/amd-zen-2-microarchitecture-analysis-ryzen-3000-and-epyc-rome
======
walterbell
_> ... for the issues to which AMD is vulnerable, it has implemented a full
hardware-based security platform for them. The change here comes for the
Speculative Store Bypass, known as Spectre v4, which AMD now has additional
hardware to work in conjunction with the OS or virtual memory managers such as
hypervisors in order to control. AMD doesn’t expect any performance change
from these updates._

Which Intel CPU generation will have hardware fixes for these Spectre
variants?

~~~
baybal2
I guess it will score AMD a lot of datacentre business. More than a few big
"cloud" providers still run completely bare with regards to microcode patches,
and some very likely do so _intentionally_.

~~~
cm2187
Is it still the case that an AMD core is less powerful than an Intel core for
the same frequency? I understand that AMD is making it up with more cores but
in a cloud you get charged per core. Can a cloud substitute an Intel for an
AMD cpu?

~~~
imtringued
That hasn't been true since the release of Ryzen in 2017. The reason AMD is
lagging behind in single core performance is that Intel CPUs can be clocked
higher, often to 5GHz, whereas AMD usually only boosts to somewhere around
4.4GHz. Gamers care about a 12% difference. Servers usually don't even go
beyond 3GHz.

~~~
lagadu
IPC varies according to the specific task, but Ryzen 1xxx and 2xxx have always
had IPC on average comparable to Broadwell CPUs (excluding AVX workloads). So
Intel has had a slight lead there from Skylake onwards.

According to what we're seeing, the situation seems to be reversed with the
3xxx series, where AMD seems to have a small but significant lead; we'll have
to wait for independent benchmarks.

~~~
aphextim
Regardless who comes out ahead on the benchmarks, competition is always a good
thing!

~~~
penagwin
It kinda matters because AMD hasn't been able to outperform Intel at single
threaded tasks for like 5+ years.

They aren't competition if they fall off the map. So a slight edge would be
great because it at least puts them back in the game.

------
psnosignaluk
$499 for 12C/24T on the 3900X? $399 for 8C/16T on the 3800X? Those prices make
this a very tempting lineup on top of the promised performance being touted
by AMD. Looking forward to seeing the reviews and news on X570 boards.
Hopefully the manufacturers will plug in decent features to match the massive
price hikes they're threatening on AM4 boards with the new chipset.

~~~
CoolGuySteve
For a single GPU machine, I’m not sure what the use case is for X570 over a
much cheaper B450 board. Most games currently aren’t GPU-bus bandwidth limited
(or rather the GPU itself is the bottleneck) so I suspect PCIe 4.0 won’t
impact benchmarks much.

~~~
kllrnohj
On all of these the GPU is directly connected to the CPU and there is no
chipset in the way.

The benefits of X570 over B450 therefore have nothing to do with GPU
performance but instead would be either overclocking capability or, more
significantly, I/O to everything else.

B450 only provides 6 PCIe 2.0 lanes and two USB 3.1 Gen 2 ports. That's not a lot of
expansion capability, especially with nvme drives. Want 10gbe? Or a second
nvme drive? Good luck.

X570 gets to leverage double the bandwidth to the CPU in addition to being
more capable internally. So you'll see more boards with more M.2 nvme slots as
a result, for example. And thunderbolt 3 support. Check out some of the x570
boards shown off - the amount of connectivity they have is awesome. That's why
you'd get x570 over b450.

~~~
paulmd
10 GbE or M.2 NVMe performance is already significantly degraded by being on a
PCH in the first place. More hops, higher latency, much lower IOPS. Don't do
it if you can avoid it.

The thing is that most things aren't (currently) bottlenecked by PCIe 3.0. A
2080 Ti shows about 3% performance degradation by running in 3.0x8 mode. 4
lanes of PCIe 3.0 is 4 GB/s (32 Gb/s) which is plenty for 10 Gb/s
networking... or even 40 Gb/s networking like Infiniband QDR (which runs at 32
Gb/s real speed after encoding overhead). So you can reasonably run graphics,
10 GbE, and one NVMe device off your 3.0x16 PEG lanes.

And AMD also provides an extra 3.0x4 for NVMe devices, so you can run
graphics, 10 GbE, _and_ NVMe RAID without touching the PCH at all.

The real use-case that I see is SuperCarrier-style motherboards that have
PEX/PLX switches and shitloads of x16 slots multiplexed into a few fast
physical lanes, like a 7-slot board or something. Or NVMe RAID/JBOD cards that
put 4 NVMe drives onto a single slot. But right now there are no PEX/PLX
switch chips that run at PCIe 4.0 speeds anyway, so you can't do that.
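
For reference, a quick back-of-the-envelope check of those figures (standard
line rates and encodings; nothing here is specific to AMD's implementation):

```cpp
#include <cstdio>

int main() {
    // PCIe 3.0: 8 GT/s per lane with 128b/130b encoding.
    const double pcie3_lane_GBps = 8.0 * (128.0 / 130.0) / 8.0; // ~0.985 GB/s per lane
    const double pcie3_x4_GBps   = 4 * pcie3_lane_GBps;         // ~3.94 GB/s
    const double pcie3_x4_Gbps   = pcie3_x4_GBps * 8;           // ~31.5 Gb/s of payload

    // InfiniBand QDR: 4 lanes at 10 Gb/s with 8b/10b encoding.
    const double ib_qdr_Gbps = 4 * 10.0 * (8.0 / 10.0);         // 32 Gb/s of payload

    std::printf("PCIe 3.0 x4: %.2f GB/s (%.1f Gb/s)\n", pcie3_x4_GBps, pcie3_x4_Gbps);
    std::printf("IB QDR payload: %.0f Gb/s, 10GbE payload: 10 Gb/s\n", ib_qdr_Gbps);
}
```

The x4 payload comes out at roughly 31.5 Gb/s, which is where the "4 GB/s (32
Gb/s)" rule of thumb above comes from.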

~~~
kllrnohj
> So you can reasonably run graphics, 10 GbE, and one NVMe device off your
> 3.0x16 PEG lanes

Sure, but you won't find any board with a setup like that. You can also
reasonably split the x4 NVMe lanes into 2x x2, but again you won't find such a
setup.

You'll find no shortage of boards with everything wired up to the PCH, though,
and it's "good enough" even if it isn't ideal. The extra bandwidth will
certainly not be unwanted. Especially when you're also sharing that bandwidth
with USB and sata connections.

> The real use-case that I see is SuperCarrier-style motherboards that have
> PEX/PLX switches and shitloads of x16 slots multiplexed into a few fast
> physical lanes, like a 7-slot board or something.

I think those use cases would instead just use threadripper or epyc. Epyc in
particular with its borderline stupid 128 lanes off of the CPU.

------
microcolonel
It was fun seeing their materials refer to their branch predictor as a TAGE
predictor. I remember hearing about the original paper for that when I was too
young and inexperienced to focus long enough to understand it; then I saw a
TAGE predictor show up in Chris Celio's BOOM repository, and I read it
through.

If what AMD says is true, and the new (for them) TAGE predictor in their
industry-leading microarchitecture really does have 30% fewer branch
mispredictions than the last one, it feels very cool that one can read and
somewhat understand the operation of a similar predictor in the leisure hours
of a few days.

Also those caches are huge, wow.
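
For anyone curious what the basic mechanism looks like, here is a toy
TAGE-style sketch (my own simplification, not AMD's or BOOM's implementation):
a bimodal base table plus tagged tables indexed by hashes of geometrically
longer slices of global history, with the longest-history tag match providing
the prediction.

```cpp
#include <array>
#include <cstdint>
#include <cstdio>

struct Entry { uint16_t tag = 0; int8_t ctr = 0; };          // saturating counter in [-2, 1]

constexpr int kTables  = 4;
constexpr int kIdxBits = 10;                                  // 1024 entries per table
constexpr int kHist[kTables] = {4, 8, 16, 32};                // geometric history lengths

std::array<std::array<Entry, 1 << kIdxBits>, kTables> tables{};
std::array<int8_t, 4096> base{};                              // bimodal fallback table
uint64_t ghist = 0;                                           // global taken/not-taken history

uint32_t fold(uint64_t h, int len) {                          // fold 'len' history bits into kIdxBits
    h &= (len < 64) ? ((1ull << len) - 1) : ~0ull;
    uint32_t f = 0;
    for (int i = 0; i < len; i += kIdxBits) f ^= uint32_t(h >> i);
    return f & ((1u << kIdxBits) - 1);
}
uint32_t idx(uint64_t pc, int t) { return (uint32_t(pc) ^ fold(ghist, kHist[t])) & ((1u << kIdxBits) - 1); }
uint16_t tag(uint64_t pc, int t) { return uint16_t((pc >> 4) ^ (fold(ghist, kHist[t]) * 3)) & 0x3FF; }
int8_t   sat(int v)              { return int8_t(v < -2 ? -2 : (v > 1 ? 1 : v)); }

bool predict(uint64_t pc) {
    for (int t = kTables - 1; t >= 0; --t) {                  // longest history first
        const Entry& e = tables[t][idx(pc, t)];
        if (e.tag == tag(pc, t)) return e.ctr >= 0;           // tag hit: this table provides
    }
    return base[pc % base.size()] >= 0;                       // no hit: bimodal prediction
}

void update(uint64_t pc, bool taken) {
    int provider = -1;                                        // longest-history table with a tag hit
    for (int t = kTables - 1; t >= 0; --t)
        if (tables[t][idx(pc, t)].tag == tag(pc, t)) { provider = t; break; }

    bool pred = (provider >= 0) ? tables[provider][idx(pc, provider)].ctr >= 0
                                : base[pc % base.size()] >= 0;

    if (provider >= 0) {                                      // train whichever entry provided
        Entry& e = tables[provider][idx(pc, provider)];
        e.ctr = sat(e.ctr + (taken ? 1 : -1));
    }
    base[pc % base.size()] = sat(base[pc % base.size()] + (taken ? 1 : -1));

    if (pred != taken && provider < kTables - 1) {            // mispredicted: allocate a longer-history entry
        int t = provider + 1;
        Entry& e = tables[t][idx(pc, t)];
        e.tag = tag(pc, t);
        e.ctr = taken ? 0 : -1;
    }
    ghist = (ghist << 1) | (taken ? 1u : 0u);                 // shift the outcome into history
}

int main() {                                                  // demo: a branch taken 7 of every 8 times
    int correct = 0;
    const int n = 100000;
    for (int i = 0; i < n; ++i) {
        bool taken = (i % 8) != 7;
        correct += (predict(0x400123) == taken);
        update(0x400123, taken);
    }
    std::printf("accuracy: %.1f%%\n", 100.0 * correct / n);
}
```

The real thing layers on provider/alternate selection, usefulness counters,
and allocation throttling; the sketch only keeps the tagged-geometric-history
core.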

~~~
gpderetta
Interesting; Intel is rumored to use TAGE but it was never confirmed
officially. AMD claimed to use a perceptron based predictor in the past.

~~~
emn13
They still will, just for L1 only; TAGE is for the lower levels.

------
shereadsthenews
The BTB was the real problem with Naples and the reason it sucked on non-
trivial, branchy, pointer-chasing workloads like MySQL. With the improved
branch prediction resources I’ll be interested in head-to-head of this rig vs.
a Kaby Lake (or later).

~~~
jnordwick
Also AMD's often smaller cache sizes, which still seem to be a problem if you
compare to Ice Lake and above.

~~~
opencl
What? Zen/Zen+ had larger L1 icache than Intel, same size L1 dcache and L2 and
L3. Zen 2 actually decreases the L1 icache to 32KB but in exchange increases
L1 associativity, micro-op cache size, BTB size, and L3 size.

~~~
shereadsthenews
Zen’s gigantic cache wasn’t very effective on account of the way it is
arranged in itty bitty little shards.

~~~
zrm
8MB is itty bitty?

There are a few narrow workloads where having a huge unified cache is an
advantage, but it generally isn't. If you have many independent processes or
VMs it can actually be worse, because when you have one thrashing the caches
it would ruin performance across the whole processor rather than being
isolated to a subset.

Meanwhile most working sets either fit into 8MB or don't fit into 64MB. When
you have a 4MB working set it makes no difference and when you have a 500GB
one it's the difference between a >99% miss rate and a marginally better but
still >99% miss rate.

Where it really matters is when you have a working set which is ~16MB and then
the whole thing fits in one case but not the other. But that's not actually
that common, and even in that case it's no help if you're running multiple
independent processes because then they each only get their proportionate
share of the cache anyway.

So the difference is really limited to a narrow class of applications with a
very specific working set size _and_ little cache contention between separate
threads/processes.
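
A minimal way to see the effect being described is a pointer-chasing sweep:
average access latency stays low while the working set fits in cache and jumps
once it spills past L3. Sizes and iteration counts below are illustrative, and
pinning the thread to one CCX gives cleaner numbers.

```cpp
#include <chrono>
#include <cstdio>
#include <numeric>
#include <random>
#include <utility>
#include <vector>

// Walk a random single-cycle permutation of 'bytes' worth of indices and
// return the average nanoseconds per dependent load.
double chase_ns(size_t bytes) {
    size_t n = bytes / sizeof(size_t);
    std::vector<size_t> next(n);
    std::iota(next.begin(), next.end(), size_t{0});

    std::mt19937_64 rng{42};
    for (size_t k = n - 1; k > 0; --k) {                       // Sattolo's shuffle: one big cycle
        std::uniform_int_distribution<size_t> d(0, k - 1);
        std::swap(next[k], next[d(rng)]);
    }

    const size_t steps = 10000000;
    volatile size_t sink = 0;
    size_t i = 0;
    auto t0 = std::chrono::steady_clock::now();
    for (size_t s = 0; s < steps; ++s) i = next[i];            // each load depends on the previous one
    auto t1 = std::chrono::steady_clock::now();
    sink = i;                                                  // keep the chain from being optimized away
    (void)sink;
    return std::chrono::duration<double, std::nano>(t1 - t0).count() / steps;
}

int main() {
    for (size_t mb : {1, 4, 8, 16, 32, 64, 128})
        std::printf("%4zu MB working set: %6.1f ns/access\n", mb, chase_ns(mb << 20));
}
```

On a 2700X (2x 8MB L3), the interesting region in such a sweep is roughly the
8-16MB range, where per-CCX vs. unified L3 would show up; below and above that
the curve looks much the same either way.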

~~~
jnordwick
The L3 cache is still unified across all sockets. I'm unsure what the previous
comment was talking about, but how does AMD's differ from Intel's in that it
prevents one bad process from blowing the cache?

And most people don't run a bunch of VMs. Single thread performance still
dominates, and latency cannot be improved by adding CPUs.

~~~
zrm
> I'm unsure what the previous comment was talking about, but how does AMD's
> differ from Intel's in that it prevents one bad process from blowing the cache?

Ryzen/Epyc has cores organized into groups called a CCX, up to four cores with
up to 8MB of L3 cache for the original Ryzen/Epyc. So Ryzen 5 2500X has one
CCX, Ryzen 7 2700X has two, Threadripper 1950X has four, Epyc 7601 has eight.

Suppose you have a 1950X and a thread with a 500MB+ working set size which is
continuously thrashing the caches because all its data won't fit. You have a
total of 32MB L3 cache but each CCX really has its own 8MB. That's not as good
for that one thread (it can't have the whole 32MB), but it's much better for
all the threads on the other CCXs that aren't having that one thread
constantly evict their data to make room for its own which will never all fit
anyway.

This can matter even for lightly-threaded workloads. You take that thread on a
2700X or 1950X and it runs on one CCX while any other processes can run
unmolested on another CCX, even if there are only one or two others.

In particular, that misbehaving thread is often some inefficient javascript
running in a browser tab in the background while you're doing something else.
And that rarely gets benchmarked but is common in practice.

> And most people don't run a bunch of VMs.

That is precisely what many of the people who buy Epyc will do with it, and
it's the configuration with the highest number of partitions. The desktop
quad cores with a single CCX have their entire L3 available to any thread.

> Single thread performance still dominates

If your workloads are all single-threaded then why buy a 16+ thread processor?
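
As a concrete how-to, that isolation can also be forced by hand with CPU
affinity. A minimal Linux sketch; the core numbers below are an assumption,
since which logical CPUs share an L3 varies by model (check
/sys/devices/system/cpu/cpu*/cache/index3/shared_cpu_list first):

```cpp
#include <pthread.h>
#include <sched.h>
#include <cstdio>

// Pin the calling thread to a set of logical CPUs (intended to be one CCX).
static int pin_to_cpus(const int* cpus, int count) {
    cpu_set_t set;
    CPU_ZERO(&set);
    for (int i = 0; i < count; ++i) CPU_SET(cpus[i], &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

int main() {
    // Hypothetical layout: logical CPUs 0-3 and their SMT siblings 8-11 share one L3.
    const int ccx0[] = {0, 1, 2, 3, 8, 9, 10, 11};
    if (pin_to_cpus(ccx0, 8) != 0) {
        std::perror("pthread_setaffinity_np");
        return 1;
    }
    std::printf("cache-thrashing work confined to CCX0; the other CCX keeps its L3\n");
    // ... run the noisy workload here ...
}
```

Build with `g++ -pthread`; the same idea applies to whole processes via
taskset or cgroup cpusets.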

~~~
jnordwick
I didn't know the L3 wasn't shared across complexes. From what I understand,
it is 4 cores per CCX and 2MB per core, so up to 8MB per complex.

While that might prevent one bad process from evicting things, it seems like
it might almost lead to substandard cache utilization, especially on servers
that might just want to run one related thing well.

Also, sharing between L3s would seem to be a huge issue, but I wasn't able to
find info on how that is handled (multiple copies?). But this would seem to
help cloud systems isolate cache writes.

I work on mostly HPC and latency sensitive things where I try to run a bunch
of single threads with as little communication as possible, but they still
need to share data (e.g., our logging goes to shm, our network ingress and
egress hit a shared queue, etc.).

I would probably buy it as a desktop, but not for the servers. Also no AVX-512,
where besides the wider instructions the real gain seems to be the improved
instruction set that comes with it.

~~~
zrm
> While that might prevent one bad process from evicting things, it seems like
> it might almost lead to substandard cache utilization, especially on servers
> that might just want to run one related thing well.

Right, that's the trade off. Note that it's the same one both Intel and AMD
make with the L2, and also what happens between sockets in multi-socket
systems. And separation reduces the cache latency a bit because it costs a
couple of cycles to unify the cache. But it's not as good when you have
multiple threads fighting over the same data.

I should also correct what I said earlier about the Ryzen 5 2500X having one
CCX, I had assumed that it did based on core count and cache size but it looks
like it has two with half the cores and cache disabled. Which is of course
good ( _can_ isolate that crap javascript thread) and bad (single thread can
only use 4MB of L3, might be worth getting the 2600X instead).

> I would probably buy it as a desktop, but not for the servers. Also no
> AVX-512, where besides the wider instructions the real gain seems to be the
> improved instruction set that comes with it.

If you're buying multiple servers the thing to do is to buy one of each first
and actually test it for yourself. We can argue all day about cache
hierarchies and instruction sets, and that stuff can be important when you're
optimizing the code, but it's a complex calculation. If you have the workload
where a unified cache is better, but so is having more cores, which factor
dominates? How does a 2S Xeon compare with a 1S Epyc with the same total
number of cores? What if you populate the second socket for both? How much
power does each system use in practice on your actual workload? How does that
impact the clock speed they can sustain? What happens with and without SMT in
each case?

When it comes down to it there is no substitute for empirical testing.

------
BuckRogers
I bought the Ryzen 1700, 1800X and 2700X. Looking to upgrade to one of these,
but will wait for benchmarks, AMD is really on a roll. The 3950X is extremely
tempting especially considering I can drop it right in my existing AM4
motherboard. Allowing backwards compatibility where possible was the best move
they made. Without it, I would've bought 1 CPU from them instead of 3 (and soon
4, and then 5 with Zen3).

~~~
simplyinfinity
Zen 3 would most likely be on a new socket, AFAIK. Zen 2+ (4xxx series) would
most likely be the last CPU for AM4, as AMD have stated they will keep sockets
for 4 years. But that's OK, since Intel changes it about every year or so. I
myself will be upgrading from a 1700X to either the new 8 core... but the 16
core looks so sweet...

~~~
BuckRogers
AMD's slides showed Zen3 as the next release (Ryzen 4000 series), as 7nm+/Zen3
in 2020. Zen2+ isn't a thing, unless AMD is simply calling it Zen3. I'm not
sure the names matter, as Zen+ was better than I was expecting with the XFR2
changes, so I bought one. From what I can tell, they've tied AM4 support to
DDR4 support. I'm expecting Zen3 next year to be the last AM4 CPU.

Doesn't really matter to me, with the value they've been delivering since the
original Ryzen launch, I see no reason to not buy them all. People appreciate
a good discount on a desirable CPU on Craigslist when it's time to upgrade.
It's just an easy swap, especially if you use an IC Graphite thermal pad
instead of thermal paste.

------
netrikare
Does anyone know if AMD is working on supporting transactional memory in their
cpus?

[https://en.wikipedia.org/wiki/Advanced_Synchronization_Facil...](https://en.wikipedia.org/wiki/Advanced_Synchronization_Facility)

------
craz8
It looks like Microsoft is adding code to Windows to schedule threads in a way
that works with this configuration of cores and CCXs

Has Linux added (or is it adding) similar code?

Are we calling this mini-NUMA or something else?

~~~
microcolonel
It's NUMA, just like usual. The difference is the exact topology, and the real
difference in latency (which changes the cost function of accessing from a
different node).

~~~
floatboth
The CCXes aren't NUMA, they all have the same access to memory, no system
would show multiple NUMA domains on a desktop Ryzen.

Only Threadripper and EPYC are NUMA.

~~~
TazeTSchnitzel
Zen 2 drops the NUMA because of the single I/O die with memory controller
right?

~~~
floatboth
I can't find confirmation, but that would make sense, single-socket EPYC and
TR with Zen2 should be UMA

------
choudanu4
_AMD’s primary advertised improvement here is the use of a TAGE predictor,
although it is only used for non-L1 fetches. This might not sound too
impressive: AMD is still using a hashed perceptron prefetch engine for L1
fetches, which is going to be as many fetches as possible, but the TAGE L2
branch predictor uses additional tagging to enable longer branch histories for
better prediction pathways. This becomes more important for the L2 prefetches
and beyond, with the hashed perceptron preferred for short prefetches in the
L1 based on power._

I found this paragraph confusing, is it talking about data prefetchers (Which
would make sense b/c of the mention of short prefetches) or branch predictors?
(Which would make sense b/c of the mention of TAGE and Perceptron)

~~~
derefr
A little of both. My understanding of the above paragraph is that the L1
predictor is trying to predict which _code-containing_ cache lines need to
stay loaded in L1, and which can be released to L2, by determining which
branches _from_ L1 cache-lines _to_ L1 cache-lines are likely to be taken in
the near future. Since L1 cache lines are so small, the types of jumps that
can even be analyzed successfully have very short jump distances—i.e. either
jumps within the same code cache-line, or to its immediate neighbours. The L1
predictor doesn’t bother to guess the behaviour of jumps that would move the
code-pointer more than one full cache-line in distance.

Or, to put that another way, this reads to me like the probabilistic
equivalent of a compiler doing dead code elimination on unconnected basic
blocks. The L1 predictor is marking L1 cache lines as “dead” (i.e. LRU) when
no recently-visited L1 cache line branch-predicts into them.

~~~
BeeOnRope
I was also confused by this, but my reading is this is entirely about _branch
prediction_ nothing about caching. In that context L1 and L2 simply refer to
"first" and "second" level branch prediction strategies, and are not related
to the L1 and L2 cache (in the same way that L1 and L2 BTB and L1 and L2 TLB
are not related to L1 and L2 cache).

The way this works is that there is a fast predictor (L1) that can make a
prediction every cycle, or at worst every two cycles, which initially steers
the front end. At the same time, the slow (L2) predictor is also working on a
prediction, but it takes longer: it is either throughput limited (e.g., one
prediction every 4 cycles) or has a long latency (e.g., takes 4 cycles from the
last update to make a new one). If the slow predictor ends up disagreeing with
the fast one, the front end is "re-steered", i.e., repointed to the new path
predicted by the slow predictor.

This happens only in a few cycles so it is much better than a branch
misprediction: the new instructions haven't started executing yet, so it is
possible the bubble is entirely hidden, especially if IPC isn't close to the
max (as it usually is not).

Just a guess though - performance counter events indicate that Intel may use a
similar fast/slow mechanism.
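
A toy cycle-accounting model of that scheme (my reading of the comment, not a
documented AMD mechanism; all rates and penalties below are made-up
illustrative numbers) shows why a short re-steer is so much cheaper than
letting every second-level correction become a full misprediction:

```cpp
#include <cstdio>

int main() {
    // Illustrative numbers only: rates and penalties are assumptions, not measurements.
    const long branches   = 1000000;
    const long resteered  = 70000;   // fast (L1) predictor wrong, slow (L2) predictor catches it early
    const long mispredict = 30000;   // even the slow predictor is wrong; resolved at execute
    const int  kResteer   = 3;       // bubble when the front end is repointed (assumed)
    const int  kFlush     = 16;      // full pipeline flush on a real misprediction (assumed)

    long two_level = resteered * kResteer + mispredict * kFlush;
    long fast_only = (resteered + mispredict) * kFlush;        // every L1 miss would pay the full flush

    std::printf("penalty cycles, two-level predictor: %ld\n", two_level);
    std::printf("penalty cycles, fast-only predictor: %ld\n", fast_only);
    std::printf("cycles saved per branch: %.3f\n", double(fast_only - two_level) / branches);
}
```

And as the comment notes, the re-steer bubble can often be hidden entirely when
the machine isn't running at maximum IPC, making the effective cost even lower.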

------
Ragib_Zaman
I guess I was being ridiculously optimistic hoping for the 16 core chip to be
around the $600 mark :(

~~~
reitzensteinm
They're not just adding more cores because the process has improved, allowing
for more on the same silicon.

The complexity of the chip is higher than the previous models, with three dies
under the hood instead of one. The high end chips are closer to Threadripper
than they are to the models they're replacing.

I think $750 is still a ridiculously good price, and Intel's feet are being
held to the fire.

~~~
dis-sys
> I think $750 is still a ridiculously good price

Threadripper 1950x comes with the same core count, more memory channels, more
PCI-E lanes and more memory. You can grab one for $499 from amazon.

~~~
lhoff
But you have to pay around $150 more for the motherboard, and a Threadripper-
compatible cooler is also quite expensive due to the huge size of the CPU.

So you're not going to save more than a few bucks, but you get a slower and
outdated CPU.

~~~
consp
Most high end AM4 Motherboards have sufficient clearance to allow a TR cooler
with an adapter plate on the AM4-MB so buying one for a later upgrade might be
possible.

Note: I have a TR cooler running on my AM4 board (custom loop though so not
completely comparable) and there is more than sufficient space to place it.

~~~
sangnoir
You can't use an AM4 motherboard[1] with the 1950X - you have to use an
X399-chipset/TR4 motherboard, which cost more than AM4 boards (and likely have
adequate room for TR coolers)

~~~
consp
This was a response to the idea of using a Ryzen as an alternative to a 1950X
and solving possible thermal issues if they were to occur. I never mentioned
using a TR on an AM4.

~~~
sangnoir
It appears you may have misunderstood the comment you were replying to
upthread - the original debate was if buying a 1950X at $499 would be
cheaper/better than a $750 Ryzen. @lhoff pointed out that even when the 1950X
is cheaper, you'd still need to buy relatively expensive coolers and mobo (for
TR), meaning you won't be saving (much) on older tech. Thermal issues weren't
the subject (except as an explanation on why TR4 coolers are expensive).

In turn, I misunderstood your reply to @lhoff, because in that context, I read
it as a rebuttal of the idea that TR parts are expensive, by suggesting an
AM4 mobo + TR4 cooler as substitutes on a 1950X system.

------
wahern
Anybody expect 7nm processes to result in longevity issues? As far as I
understand (and IME) the first components to fail are capacitors. Might that
begin to change?

Apropos the article, I'm trying to convince myself to build an EPYC 3201
server now rather than waiting for the Zen 2 version, for which I presume I'd
have to wait until October or November at the earliest.

~~~
baybal2
I think electromigration will still kill the chip earlier than individual
device failures.

Intel is said to have switched to cobalt wiring in its latest node, and seems
to be paying dearly for that. TSMC and the others seem to have gone the
conventional road and continued to perfect the salicide process for smaller
nodes without any issues.

~~~
dfrage
Officially, Intel says 10nm has lithography problems. They did try a more
aggressive node than TSMC's first "7nm", entirely using 193nm UV, and were the
only company to attempt Self-Aligned Quadruple Patterning (SAQP) for the top
metal layers.

------
sfink
That was a great article. I wonder if fixing the reliability of their
performance counters to work with rr ( [https://rr-project.org/](https://rr-
project.org/) ) is anywhere on their radar.

Sadly, until that happens, AMD CPUs are dead to me. For a C++ (or C or Rust)
developer, rr is just too much of a productivity boost to give up.

------
caycep
I wonder if I'm misinterpreting TDP.

For the 105W TDP chip vs., say, the 65W one: if a lighter task isn't saturating
the cores, the power/heat generation would be similar, and the bigger chip
doesn't really ramp up the heat/wattage unless heavier loads are thrown at it?

~~~
dsr_
For sibling chips like this, yes. Two cores being run at 90% utilization each
will draw about the same amount of power regardless of whether you bought the
6-core version or the 8-core version.

Similarly, 4 cores running on the 12 or 16 core chips should eat about the
same amount of power as each other.

------
sorenjan
How do new instructions, register renaming, etc. work with different
compilers? Say I'm using Visual Studio to compile C++, will it take advantage
of the new processor features by default? What about if the binary runs on a
different CPU, will the compiler include feature checks and multiple code
versions?

~~~
horyzen
Not sure about VC++, but in gcc you can use -march=native to let the compiler
compile the code with all instruction sets available on your CPU, I think
there is a VC++ equivalent.

As for an already compiled binary, depending on how it was compiled it may or
may not work on a different CPU. Also, the compiler doesn't add the runtime
checks for you.
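
To make that last point concrete: with a plain -march=native build the binary
simply assumes the instructions are there, so portable binaries usually do the
check themselves. A minimal sketch of manual runtime dispatch using GCC/Clang
builtins (MSVC would use __cpuid instead; sum_avx2/sum_scalar are made-up
names):

```cpp
#include <cstdio>

// Built for AVX2; the compiler is free to auto-vectorize this version with AVX2.
__attribute__((target("avx2")))
double sum_avx2(const double* p, int n) {
    double s = 0;
    for (int i = 0; i < n; ++i) s += p[i];
    return s;
}

// Baseline version, built for the default target.
double sum_scalar(const double* p, int n) {
    double s = 0;
    for (int i = 0; i < n; ++i) s += p[i];
    return s;
}

// Pick once at first call, based on a runtime CPUID check.
double sum(const double* p, int n) {
    static const bool has_avx2 = __builtin_cpu_supports("avx2");
    return has_avx2 ? sum_avx2(p, n) : sum_scalar(p, n);
}

int main() {
    double data[] = {1, 2, 3, 4};
    std::printf("%f\n", sum(data, 4));
}
```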

~~~
jjuhl
GCC can do Function Multi Versioning
([https://gcc.gnu.org/wiki/FunctionMultiVersioning](https://gcc.gnu.org/wiki/FunctionMultiVersioning)
, [https://lwn.net/Articles/691932/](https://lwn.net/Articles/691932/)) where
it will generate code for multiple CPUs and select the best at run time.
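
For comparison with doing the dispatch by hand, the FMV route from those links
lets GCC emit the versions and the resolver for you (a minimal sketch; dot is
a made-up example function):

```cpp
#include <cstdio>

// GCC emits one clone per listed target plus an ifunc resolver that picks the
// best clone for the running CPU at load time; callers just call dot().
__attribute__((target_clones("avx2", "sse4.2", "default")))
double dot(const double* a, const double* b, int n) {
    double s = 0;
    for (int i = 0; i < n; ++i) s += a[i] * b[i];
    return s;
}

int main() {
    double a[] = {1, 2, 3, 4}, b[] = {4, 3, 2, 1};
    std::printf("%f\n", dot(a, b, 4));              // the right clone is selected automatically
}
```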

------
keyle
I find these articles incredible, but then I remember I work and live on a
Mac. And I can't wait to pay 5K for the content of one of those articles
published 2 years ago.

~~~
thirdsun
Same here. I'd love to build a Ryzen box, and actually I'll do so for gaming,
but my primary computer has to run macOS and unfortunately that requirement
comes with a significant price tag, in addition to outdated hardware.

~~~
heyoni
Can’t you do hackintoshes with Ryzen?

~~~
unicornfinder
Not without a lot of messing around. With Intel you can pretty much just
straight install macOS + Clover + FakeSMC and you're golden. With Ryzen you
have to start messing around with custom kernels, which means updates will
break things and also means a lot of programs won't work.

Last I checked a lot of progress had been made, but you're unable to run any
32-bit applications and some software such as the Adobe Creative Suite simply
won't run.

~~~
mrgreenfur
I too thought it was hard, but recently found this that supposedly makes it
way easier: [https://kb.amd-osx.com/guides/HS/](https://kb.amd-
osx.com/guides/HS/)

I'm considering a new ryzen hackintosh build in july!

~~~
tracker1
Going to give it one, single try... then I'm on to Linux as my primary. Will
have a Windows VM for some work, and may keep a mac VM as well. Most of the
stuff I do works fine in Linux, and it really looks like Manjaro and Pop_OS!
have made a _lot_ of progress beyond the general dev stuff I work on (mostly
via Docker/Linux anyway).

------
nottorp
Now... when will AMD fix power consumption on their video cards as well?

~~~
nottorp
Funny about those downvotes.

AMD can deliver 8C/16T in 65 W but their GPUs need 50%+ more power than
nvidia's for the same performance (up to 100% more at the 1080 lower end).
You're saying I'm not right and they don't have a problem?

~~~
dave7
At this same event they also announced their 1st gen RDNA GPUs, called Navi.
Also releasing to retail on July 7th. They supposedly go quite some way to
reducing the gulf in GPU power usage/performance.

~~~
officeplant
Going by their E3 presentation of the 5700 and 5700 XT last night they haven't
done enough to curb power use, but only testing will tell once we have these
cards in hand.

~~~
tracker1
Agreed... I'm a bit torn on this, since they've done a lot for Linux support,
may go 5700XT or Radeon VII to pair with 3950X

