
A Deep Dive into AMD’s Rome Epyc Architecture - lamchob
https://www.nextplatform.com/2019/08/15/a-deep-dive-into-amds-rome-epyc-architecture/
======
mjw1007
Up until around 2012, realworldtech.com and anandtech.com used to publish
rather more detailed descriptions of the microarchitecture inside each core.

Is anyone publishing things like that these days? I mean pages like these:

[https://www.realworldtech.com/haswell-cpu/4/](https://www.realworldtech.com/haswell-cpu/4/)
[https://www.anandtech.com/show/6355/intels-haswell-architect...](https://www.anandtech.com/show/6355/intels-haswell-architecture/8)

(I noticed that Agner Fog's chapter on Ryzen is conspicuously missing a
"Literature" section.)

~~~
ksec
Anandtech still does that, just no longer written by Anand himself (he works
at Apple now). So the writing isn't as good, even though the technical
details are still there.

One of the problems is that the market for this kind of review is very much a
niche. And just like all forms of free media, if there aren't enough page
views they stop doing it.

I have always thought some of these media outlets would consolidate. I mean,
I only ever read Anandtech, Servethehome and some Ars, and that is about it.
I have RSS news feeds from a few other sources such as Tom's Hardware and
Engadget, but if Anandtech covers the same topic I always go there first.

Not only has that not happened, most of these websites manage to stay afloat
catering to different markets. But I have no idea how the market segmentation
works. I can tell that a site like Wccftech is sort of a 100% rumours site
with very little if any technical knowledge in the writing. And yet it
gathers a huge audience.

Meanwhile, others like Tom's Hardware seem to have retained enough of their
news readers to become sustainable.

~~~
close04
Unfortunately while AT still has some great deep-dives for mobile SoCs (top
marks to Andrei Frumusanu), the x86 articles have become a bit shallow. And if
that wasn't bad enough, they also suggest some bias.

They tend to bang the drum when it comes to Intel but in AMD reviews you'll
get things like _" Due to bad luck and timing issues we have not been able to
test the latest Intel and AMD servers CPU in our most demanding workloads"_.
It's a lot like reviewing a Ferrari but due to bad luck you could only test it
in city traffic.

2 years ago they forgot to cover the Threadripper launch for 2 weeks while the
front page was flooded with dozens of uninteresting half page articles about
Intel motherboards being launched around the same time. I love a good tech
article regardless of which brand they're talking about but bias will always
kill the experience for me. YMMV I guess.

~~~
toast0
From watching these sites for years, I think you can see whatever bias you
want to see. Some sites/authors do have clear bias, but a lot of it is just
time pressure.

Often, review parts are shipped to sites with a review embargo until a certain
date -- if you don't ship your review on that date, you lose out. If the
shipment is late because of the vendor or the shipping service, or the
reviewer is sick or out of town, or the shipped firmware isn't great and
interim firmware makes a big difference, the choices are:

a) take the time to do a full review, but publish late

b) do a cursory review, apologize, and publish on time

c) do b, but follow up with a full review as time permits

If C happens more with AMD than Intel, it could be bias, it could be bad luck,
or it could be Intel has been delivering more finished things to reviewers.

~~~
close04
AT shouldn't get worse treatment than any other review site. But if all the
others can post detailed benchmarks or cover an event, and only AT has
consistent issues and bad luck, at some point a pattern emerges.

I get that I can also be biased. But bias should be like noise, taking all of
the articles together should average it out. In AT's case it's more like the
signal rather than the noise. What really capped it off for me was not
covering a public event that every other website covered, like the 2017
Threadripper launch. The signal was that they are even willing to ignore one
of the most interesting launches in years to post articles about trivial
motherboard announcements. I would never mind if Intel launched some awesome
new CPU.

Then the confirmation came the following year, coincidentally also during a
Threadripper event, when they wrote multiple articles touting Intel's new 5GHz
28-core CPU. They missed the fact that it was a massive overclock cooled by
an (admittedly hidden) 1HP chiller, and their experience raised no red flags
where even the comments did. Worse, when the bubble burst, unlike every
other publication AT's response was an anemic piece excusing Intel, with the
literal conclusion that _" the 28-core announcement was not ideally
communicated"_.

I understand Intel's shenanigans to try to steal some of the attention that TR
is getting. But as a journalist being played like that should trigger a more
visible reaction. Consistently painting them in a good light just raises
suspicions for me. And while I still read their articles I no longer take them
or the conclusion at face value unless another big site confirms it.

------
mmrezaie
There must be simulation tools for these kinds of architectures, to find the
best combination of sizes and components while keeping it practical. Does
anyone know of something like that? A tool to min-max these choices and
estimate whether the design can be built with the resources they have.

~~~
ajross
Tools like that are a core part of the design process. You write that software
yourself, along with choosing the parametrization of the design; it's not an
off-the-shelf thing. But yes, that's how it works.

It's also important to note that decisions like this are hugely workload-
specific. There's no single best processor for all applications. In extreme
examples: almost every transistor on a vector SIMD unit is wasted when trying
to optimize for a client Javascript benchmark; streaming symmetric encryption
gets no benefit from L3 cache (which is like half the chip these days!);
etc...

~~~
zazagura
Maybe the time has come for application-specific CPU variations?

One optimized for node.js tasks, one for databases, ...

~~~
tempguy9999
> one for databases

As a DB guy: there's no 'one task' for DBs. The only thing I can think of
that is nearly characteristic of the DBs I've worked on is that they're
IO-bound.

That's possibly true of most things except floating point and graphics.

~~~
snaky
While that's true, you can achieve impressive results by offloading
particular parts of particular DBMS code to specialized CPUs:
[https://www.ibm.com/support/knowledgecenter/en/SSEPEK_11.0.0...](https://www.ibm.com/support/knowledgecenter/en/SSEPEK_11.0.0/perf/src/tpc/db2z_ibmziip.html)

------
MayeulC
> “We like features that improve both power and performance,” Clark
> elaborated. “Being on the right path more often is important because the
> worst use of power is executing instructions that you are just going to
> throw away. We are not throwing work away after we figure out dynamically
> that we were wrong to do it. This definitely burns more power on the front
> end, but it pays dividends on the back end.”

All the documentation I've seen is quite light on the branch prediction
improvements. Going by the slides, they improved its accuracy by a third; I'd
be curious to know how. Side note: if your superscalar core is big enough
(yeah, those registers use power), couldn't you just get rid of branch
prediction at no performance cost (doing something else while waiting for the
data)?

My only grudge against Zen (as a consumer) is that the AM4 socket is intended
for both APUs and CPUs. While this is a good thing, I have a couple utterly
useless video outputs on my motherboard. I would have liked AMD to include
some display driver circuitry on every chip. Maybe in the I/O die, if they use
such a thing in all of their designs going forward? I mean, I would be quite
content with using software rendering when I need to drive a screen, or even
spare a bit of memory bandwidth and CPU cycles to drive an extra display from
my desktop's graphics card.

~~~
piadodjanho
> Every documentation I've seen is quite light on the branch prediction
> improvements.

One of the pictures in the article says the new architecture uses the
TAGE branch predictor. This is likely based on the work of André Seznec.
There are many articles on the implementation (but they can be difficult to
understand if you are not already familiar with his work).

I've implemented the bare-bones predictor in a computer architecture course;
you can see an abridged version of my presentation slides here [1]. Note this
only describes the bare-bones predictor; in more recent work André Seznec
added a loop predictor and a statistical corrector to increase the accuracy.

There is also some work using TAGE with perceptrons in the statistical
corrector.

[1] [https://docs.google.com/presentation/d/1aUrwD-ENYPB7pMrCoYmE...](https://docs.google.com/presentation/d/1aUrwD-ENYPB7pMrCoYmEcLamuwyk_WfrggBT_3vOZBs/edit?usp=sharing)
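For a flavor of the mechanism, here is a toy sketch of a bare-bones
TAGE-style predictor. This is purely illustrative: the table sizes, hash
functions and allocation policy below are my own simplifications, not Zen 2's
or Seznec's actual design. The core idea is a base bimodal table plus a set
of tagged tables indexed by hashes of the PC and geometrically increasing
global-history lengths, where the longest-history table with a matching tag
provides the prediction.

```python
# Toy TAGE-style branch predictor (illustrative sketch, not a real design).

HIST_LENS = [4, 8, 16, 32]          # geometric series of history lengths
TABLE_BITS = 10                     # 1024 entries per tagged table

class TagePredictor:
    def __init__(self):
        self.base = {}                              # pc -> 2-bit counter
        self.tables = [dict() for _ in HIST_LENS]   # idx -> (tag, counter)
        self.history = 0                            # global branch history

    def _index_and_tag(self, pc, hlen):
        # Hash PC with the last `hlen` bits of global history.
        h = self.history & ((1 << hlen) - 1)
        idx = (pc ^ h) & ((1 << TABLE_BITS) - 1)
        tag = (pc ^ (h >> 2)) & 0xFF
        return idx, tag

    def predict(self, pc):
        pred = self.base.get(pc, 1) >= 2            # weak not-taken default
        provider = -1
        for i, hlen in enumerate(HIST_LENS):
            idx, tag = self._index_and_tag(pc, hlen)
            entry = self.tables[i].get(idx)
            if entry and entry[0] == tag:
                pred = entry[1] >= 2                # longest match wins
                provider = i
        return pred, provider

    def update(self, pc, taken):
        pred, provider = self.predict(pc)
        if provider >= 0:                           # train the provider
            idx, tag = self._index_and_tag(pc, HIST_LENS[provider])
            ctr = self.tables[provider][idx][1]
            ctr = min(3, ctr + 1) if taken else max(0, ctr - 1)
            self.tables[provider][idx] = (tag, ctr)
        else:                                       # train the base table
            ctr = self.base.get(pc, 1)
            self.base[pc] = min(3, ctr + 1) if taken else max(0, ctr - 1)
        if pred != taken and provider < len(HIST_LENS) - 1:
            # On a mispredict, allocate an entry with a longer history.
            idx, tag = self._index_and_tag(pc, HIST_LENS[provider + 1])
            self.tables[provider + 1][idx] = (tag, 2 if taken else 1)
        self.history = ((self.history << 1) | int(taken)) & ((1 << 64) - 1)
```

The allocate-on-mispredict step is the interesting part: branches that the
short histories keep getting wrong migrate into tables with longer histories,
which is what lets TAGE capture very long correlation patterns cheaply.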

~~~
MayeulC
Thank you, I hadn't realized those branch predictors were actually documented;
I thought they were referring to internal names.

It is nice to see research being applied to new mainstream chips relatively
quickly. To complement your slides, there is a short overview here [1]
(this is actually the first search result).

~~~
MayeulC
I forgot to put the link. _One of_ the first search results.

[1]: [https://comparch.net/2013/06/30/why-tage-is-the-best/](https://comparch.net/2013/06/30/why-tage-is-the-best/)

------
shaklee3
This didn't really seem like a deep dive compared to the Anandtech article. I
was hoping for some memory bandwidth benchmarks, since this should be the
first chip that has 8 channels without caveats (looking at you, POWER9). It's
also not clear if it's 16 channels in a 2S configuration, but I suspect not.

Edit: the picture from AMD in this review makes me think it can hit 16 memory
channels with the two socket version. Does anyone know if this is true?

~~~
wtallis
> the picture from AMD in this review makes me think it can hit 16 memory
> channels with the two socket version. Does anyone know if this is true?

Yes, if the motherboard provides all the necessary slots. Inter-socket
communication is achieved by re-purposing CPU pins otherwise used for PCIe,
not pins used for DRAM. Each CPU keeps its own full 8 DRAM channels.

------
thinkersilver
The poster is holding a line of bash to the standard of code, illustrating
that readability should be the goal and showing a way of bringing bash
commands up to a standard of readability for something like a PR. Readability
is really there to show _intent_.

I would say, though, that if you are bringing this up to the code standards of
today, then it should really be wrapped in some kind of unit test
([https://github.com/sstephenson/bats](https://github.com/sstephenson/bats))
for it to pass the PR. That would make the code a bit more maintainable, and
it can be integrated as a stage in your CI/CD pipeline.

If we do that, then the intent is clarified by the input and the expected
output of the test. The code would at least be maintainable, and the
readability problem becomes less of an issue when it comes to technical debt.

I've done this plenty of times with my teams and it's certainly helped.

~~~
insulanus
Are you replying to this thread?
[https://news.ycombinator.com/item?id=20724679](https://news.ycombinator.com/item?id=20724679)

~~~
thinkersilver
Yes, I was. I've posted the comment to the correct story now. I don't know
how that happened.

------
ramshanker
My gut feeling is that Intel also lays out / develops the IO block and cores
separately. It's just that they are all put on a single piece of silicon.

~~~
chx
But separate silicon is what gives AMD an almost insurmountable cost
advantage. They can bin each chiplet separately, their yields are much higher
because each die is smaller, and the cherry on top is the different, cheaper
process for the I/O die.
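The yield part of that advantage is easy to illustrate with a
back-of-the-envelope Poisson defect model. The defect density and die areas
below are assumptions for illustration, not AMD's or TSMC's real numbers:

```python
import math

# Assumed, purely illustrative figures -- not real process data.
DEFECTS_PER_MM2 = 0.002

def poisson_yield(area_mm2, d=DEFECTS_PER_MM2):
    """Fraction of dies with zero defects: yield = exp(-D * A)."""
    return math.exp(-d * area_mm2)

monolithic = poisson_yield(600)   # hypothetical monolithic 64-core die
chiplet = poisson_yield(75)       # one small 8-core chiplet

# The small die wins big: roughly 30% vs 86% defect-free dies with these
# assumed numbers, and a defective chiplet costs you 75 mm^2, not 600.
```

Multiply that through by dies per wafer and the gap in cost per good core
grows further, before even counting the cheaper process used for the I/O die.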

