
A Look at the AMD Zen 2 Core - matt_d
https://fuse.wikichip.org/news/2458/a-look-at-the-amd-zen-2-core/
======
rrss
This isn't directly related to Zen 2 (sorry), but it's something I've been
wondering about:

How do processors that split ops into uops implement precise interrupts? I
sort of understand how the ROB is used to implement precise interrupts even
with pipelining and OOO, but I don't quite see how processors map uops back to
the original instruction sequence.

~~~
_chris_
First, most uops can't throw exceptions, so fusing a shift and an add
instruction together doesn't require any complex tracking here.

If a uop throws an exception (let's say a fused add+ld), each uop can carry a
tag that lets you trace back to a PC (instruction address), say that of the
start of the sequence, so you know what "instruction" to report to the
Privileged Architecture as having excepted. For many reasons, you need to
store a list of the PCs of in-flight instructions somewhere anyway (albeit
heavily compressed), so having a small ID tag to help reconstruct a given
uop's PC isn't too onerous.

Ideally, when multiple instructions map to a single uop, at most one of them
can throw an exception. The hard case is something like a load-pair uop,
since each load can throw an exception. Some machines, if a fault is
encountered, will refetch the pair and re-execute the loads as
independent/unfused uops. Other designs just pay the pain of tracking _which_
of the pair excepted and do some simple arithmetic off of that.
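The bookkeeping described above can be sketched in a few lines. Everything here is a toy model of my own invention (real ROBs compress the in-flight PC list far more aggressively), but it shows how a small per-uop tag is enough to map an excepting uop back to its parent instruction's PC:

```python
# Toy model: mapping an excepting uop back to its parent instruction's PC.
# All names and structures are illustrative, not any real microarchitecture.

class ROB:
    def __init__(self):
        self.pc_table = []  # (compressed, in a real design) in-flight PCs
        self.entries = []   # (uop, pc_tag) pairs in program order

    def dispatch(self, pc, uops):
        """One architectural instruction at `pc`, cracked into several uops."""
        tag = len(self.pc_table)   # small ID instead of a full 64-bit PC
        self.pc_table.append(pc)
        for u in uops:
            self.entries.append((u, tag))

    def report_exception(self, entry_index):
        """Return the PC of the instruction whose uop excepted."""
        _, tag = self.entries[entry_index]
        return self.pc_table[tag]

rob = ROB()
rob.dispatch(0x1000, ["ld t0, [addr]", "add r1, t0"])  # fused add+ld
rob.dispatch(0x1004, ["shl r2, 3"])

# Suppose entry 0 (the load uop) faults: we recover PC 0x1000.
print(hex(rob.report_exception(0)))  # -> 0x1000
```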

~~~
fulafel
Instructions such as shift and add that have memory operands can throw
exceptions on x86/amd64. (This was one of the motivations of RISC: separating
loads/stores from ALU ops made exception handling cleaner.)

Heh, random note: I just looked up shift instructions on x86, and there are 6
different ones, which is not very RISC. But today there are well over 1000
instructions, so a few shift variants are peanuts.

~~~
fulafel
Correcting myself: the issue was uops, not instructions. But the situation
turns out to be similar anyway: at least Intel nowadays keeps the
memory-addressing part in the uop (in most cases) and doesn't split
instructions into load/store uops + ALU uops.

~~~
monocasa
I thought there were two different uop ISAs on big x86 cores these days, with
two different purposes. One is pretty close to the original instructions,
just decoded and fixed-width (on AMD at least, this is what's in the uCode
ROM). Those are then cracked into another ISA that the ROB knows about,
because instructions will cross functional-unit boundaries.

So for, say, 'rol mem_addr, shift', the inner ISA would crack that into
something like:

        ld  reg_temp0, mem_addr
        rol reg_temp0, shift
        st  reg_temp0, mem_addr

This is all hearsay though; I could have certainly misheard/misremembered.

------
garkin
Zen 2 is very good at number crunching and synthetics. But it has a problem:
terrible memory latency, around 70 ns with DDR4-3600 CL16
([https://www.userbenchmark.com/UserRun/18168279](https://www.userbenchmark.com/UserRun/18168279)).
That translates into not-so-good gaming frame times. Its 64 MB L3 cache
([https://en.wikichip.org/wiki/amd/ryzen_9/3900x](https://en.wikichip.org/wiki/amd/ryzen_9/3900x))
helps only partially.

Few games will suffer greatly from it, but there are several titles with RAM
bottlenecks, like PUBG and Far Cry.

Anyway, AMD has a much better price/performance offer than Intel. For general
purpose use, Intel is totally annihilated, but for games they are still more
than competitive.

~~~
demilicious
As far as I know, these processors are not yet released. What's the confidence
level that this user benchmark will be indicative of real life expected
performance?

It seems implausible that this user benchmark is a good indicator. The Zen 1
architecture exhibited nothing of the sort[0] -- it would be an order of
magnitude performance regression.

I expect we'll start to see more accurate tests once the processors are
actually released into the wild.

[0] [https://www.tomshardware.com/reviews/amd-ryzen-7-2700x-review,5571-3.html](https://www.tomshardware.com/reviews/amd-ryzen-7-2700x-review,5571-3.html)

~~~
jchw
Well, to play devil's advocate: just because AMD improved their memory
latency does not mean it is state-of-the-art. Indeed, looking at some random
user benchmarks of Zen 1, the latency there looks similar to these Zen 2
numbers. (See my reply to the parent for a couple of links, but I think you
could also just find random Ryzen 2700X benches and see the same.)

Of course, the effect of ~20 ns of extra latency on system memory accesses
may not be easy to observe in practice, especially if throughput is
excellent. But we'll see, I guess; 'gaming' benchmarks should be a good test.
(Meanwhile, I'll be happy with Zen 2 if it just provides a large speedup in
compile times.)

~~~
vbezhenar
But Zen 2 has a HUGE L3 cache. I think that cache would compensate for the
latency.

~~~
neilmovva
Actually, the L3 cache is also sharded across chiplets, so there's a small
(~8 MB) local slice of L3 that is fast, while remote slices have to go over
AMD's inter-die connection fabric and incur a serious latency penalty. On
first-gen Epyc/Threadripper, non-local L3 hits were almost as slow as DRAM,
at ~100 ns (!).

~~~
sitkack
Does that local vs remote L3 show up in the NUMA information?

------
ademup
"Zen employs a dynamic predictor known as a hashed perceptron." When will
Hollywood tap into this wealth of cool vocabulary?

Is this a gimmicky marketing term, or is it logical/descriptive/rational?

~~~
superpermutat0r
"Hashed" probably means that features are mapped to perceptron weights via a
hash function. This serves as implicit regularization and can be much more
efficient (no need for a predetermined feature vector, sparse
representations, etc.); it's called the hashing trick in ML. A perceptron is
a single-layer NN. A dynamic predictor is a branch predictor that adapts to
the running program, rather than relying on some predetermined state (like a
formula that was shown to be good enough for a collection of programs).
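As a rough sketch of how such a predictor operates: the table size, hash, and training threshold below are made-up toy values, loosely in the style of published perceptron predictors, and certainly not AMD's actual design.

```python
# Toy hashed perceptron branch predictor (illustrative only; real designs
# differ in table sizes, hashing, history length, and training thresholds).

HISTORY_LEN = 8        # bits of global branch history
NUM_PERCEPTRONS = 64   # perceptron table size
THRESHOLD = 16         # keep training until |sum| exceeds this

weights = [[0] * (HISTORY_LEN + 1) for _ in range(NUM_PERCEPTRONS)]
history = [1] * HISTORY_LEN  # +1 = taken, -1 = not taken

def predict(pc):
    idx = hash(pc) % NUM_PERCEPTRONS  # many PCs can alias to one entry
    w = weights[idx]
    # Bias weight plus dot product of weights with the history bits.
    s = w[0] + sum(wi * hi for wi, hi in zip(w[1:], history))
    return s >= 0, s, idx

def train(pc, taken):
    pred, s, idx = predict(pc)
    outcome = 1 if taken else -1
    # Update on a misprediction, or while confidence is still low.
    if pred != taken or abs(s) <= THRESHOLD:
        w = weights[idx]
        w[0] += outcome
        for i in range(HISTORY_LEN):
            w[i + 1] += outcome * history[i]
    history.pop(0)
    history.append(outcome)

# A branch that is always taken quickly trains toward "predict taken".
for _ in range(20):
    train(0x4004f0, True)
print(predict(0x4004f0)[0])  # -> True
```

The appeal in hardware is that prediction is just a small signed dot product plus a sign check, which maps onto adders rather than anything exotic.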

~~~
jnordwick
In this case, I'm pretty sure it means that multiple instruction addresses
are hashed to the same perceptron (or maybe that's effectively what you said
in ML-speak).

Since you don't have a branch predictor for every address, you share the same
prediction data across multiple addresses (this is one often-overlooked cost
of heavily branchy code).
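To make the aliasing concrete (toy table size and hash function of my own choosing, not any real predictor): with a small table, distinct branch PCs can hash to the same entry and end up sharing weights.

```python
# Toy demonstration of branch-PC aliasing in a small hashed predictor table.
TABLE_SIZE = 16

def table_index(pc):
    # A trivial hash; real predictors mix PC bits (and sometimes history).
    return pc % TABLE_SIZE

pc_a = 0x400510
pc_b = 0x400610  # low bits collide mod 16, so the two branches alias

print(table_index(pc_a) == table_index(pc_b))  # -> True: they share weights
```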

------
mtgx
I wish AMD enabled Secure Encrypted Memory by default on all devices, and
also brought Secure Encrypted Virtualization (or something similar) to
consumer Ryzen chips.

Even Microsoft has made VM sandboxing a consumer feature. It's time for AMD
to join the bandwagon and help along the way, rather than keeping
virtualization and other security features exclusive to server customers.

~~~
transpute
SEV is claimed to be present on Ryzen Embedded:
[https://en.wikichip.org/wiki/amd/ryzen_embedded/v1605b#Features](https://en.wikichip.org/wiki/amd/ryzen_embedded/v1605b#Features)

Available on UDOO Bolt:
[https://www.kickstarter.com/projects/udoo/udoo-bolt-raising-the-maker-world-to-the-next-leve](https://www.kickstarter.com/projects/udoo/udoo-bolt-raising-the-maker-world-to-the-next-leve)

At least AMD made DRTM (SKINIT) available on consumer CPUs, where Intel still
keeps DRTM (TXT) limited to business vPro chipsets. ECC is available on some
CPU/motherboard/bios combinations for Ryzen, although ECC and low-power for
Ryzen APUs is segmented to "Pro" models only available to OEMs.

It's still nearly impossible to buy devices with Ryzen Embedded, despite
several system announcements from ASRock, but SEV is available today on
Supermicro Epyc 3000 mITX motherboards.

------
gen3
This is a really good article. I am finally at the point in my college
education where I can understand what they are talking about!

Are there any articles like this about Apple or Intel chips? I like hearing
about actual implementations, not just educational processors. (I am looking
through the other articles on this site.)

~~~
mtone
[https://www.anandtech.com](https://www.anandtech.com) (Ian Cutress et al.)
has good regular "deep dive" pieces. David Kanter at
[https://www.realworldtech.com](https://www.realworldtech.com) is worth a
read, although his site is not as active.

------
bogomipz
The post states:

>"Perceptrons are the simplest form of machine learning and lend themselves to
somewhat easier hardware implementations compared to some of the other machine
learning algorithms."

Can someone explain what it is about perceptrons that makes them easier to
implement in hardware?

~~~
Der_Einzige
Wait, how is k-NN harder to implement than a perceptron... even in hardware?
What about linear regression? What about decision trees? What about naive
Bayes?

~~~
jefft255
Performing a nearest-neighbor search is much more expensive than a simple
multiply-add. I don't know about the others; my guess is that a perceptron is
much easier to implement in hardware/low-level operations.
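To illustrate the asymmetry (purely illustrative code, not how either would actually be built in silicon): perceptron inference is a single dot product over a small weight vector, while k-NN must compute a distance to every stored example and then select the nearest.

```python
# Illustrative cost comparison: perceptron inference vs. k-NN lookup.

def perceptron_predict(weights, features):
    # One multiply-add per weight: O(len(weights)) work, trivially pipelined.
    return sum(w * f for w, f in zip(weights, features)) >= 0

def knn_predict(stored, labels, features, k=3):
    # A distance to *every* stored point, then a selection: O(N * d) work
    # plus a sort -- much harder to do in a handful of gate delays.
    dists = [(sum((a - b) ** 2 for a, b in zip(p, features)), y)
             for p, y in zip(stored, labels)]
    dists.sort()
    votes = [y for _, y in dists[:k]]
    return max(set(votes), key=votes.count)

w = [2, -1, 3]
print(perceptron_predict(w, [1, 0, 1]))   # -> True (2 + 3 >= 0)

stored = [[0, 0], [1, 1], [5, 5], [6, 5]]
labels = [0, 0, 1, 1]
print(knn_predict(stored, labels, [5, 4]))  # -> 1
```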

------
m3kw9
Man, just tell me how fast it runs on practical apps.

~~~
topspin
The verdict is in. Aside from games, Ryzen 3000 is a big win: "practical"
applications run substantially faster on these CPUs. Intel still has a small
absolute lead in a bunch of traditional gaming benchmarks, but workloads like
compression, rendering, compiling, and transcoding are nearly all faster than
on anything Intel offers.

Expect large price cuts from Intel soon.

~~~
dispat0r
And "except games" means that if you are CPU-limited at 1080p or lower with
old game engines, you are losing maybe 5% performance.

