
Instructions per cycle: AMD Zen 2 versus Intel - another
https://lemire.me/blog/2019/12/05/instructions-per-cycle-amd-versus-intel/
======
NohatCoder
In case anyone is not aware: this is a very small sample of microbenchmarks.
When benchmarking very simple tasks like these, performance tends to vary
wildly between architectures.

For instance, instructions are assigned to one of a handful of ports when
executed. Certain instructions may only be assigned to certain ports, and
which ports an instruction may use differs between architectures. If an
inner loop uses only a few different instructions, one architecture may be
unlucky in that most of those instructions need the same ports, and so it
executes fewer instructions overall.

For real benchmarking, use lots of different, complicated jobs. It is not
perfect, but it is the best way we have of comparing different processors
head to head.
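
A toy model of that port effect (the port names and assignments below are
hypothetical, and the model ignores dependency chains, memory, and the front
end entirely): each port executes at most one uop per cycle, so the busiest
port puts a floor on cycles per loop iteration, and the same instruction mix
can hit very different floors on two designs.

```python
from collections import Counter

def port_pressure_floor(instrs, port_map):
    """Lower bound on cycles per loop iteration from port pressure alone.

    Optimistically spread each instruction evenly over the ports that can
    take it; the most-loaded port gives a (loose) throughput floor.
    """
    load = Counter()
    for ins in instrs:
        ports = port_map[ins]
        for p in ports:
            load[p] += 1.0 / len(ports)
    return max(load.values())

# The same three-rotate loop on two hypothetical designs:
loop = ["rol", "rol", "rol"]
arch_a = {"rol": ["p0"]}              # rotates issue on one port only
arch_b = {"rol": ["p0", "p1", "p5"]}  # rotates issue on three ports
# arch_a needs at least 3 cycles/iteration, arch_b only 1: a 3x
# microbenchmark gap from port assignment alone.
```

This is only a sketch of the scheduling argument, not how any real scheduler
works, but it shows why a loop using few distinct instructions can swing
wildly between architectures.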

~~~
mrb
Indeed. Back in 1999 the AMD K7 was a full _3 times faster_ than Intel on
microbenchmarks measuring the performance of ROR/ROL instructions, because the
throughput per clock of these rotate instructions was exactly 3 times higher
than on Intel. Obviously this did not mean that AMD was 3 times faster than
Intel.

Picking 1 or 2 random microbenchmarks like the blog post author did is not
useful for characterizing overall performance across all real-world
workloads. If he had picked different ones, they might have shown AMD twice
as fast as Intel.

~~~
BeeOnRope
Examples like that still exist: AMD popcnt throughput is 4x Intel's, for
example (4/cycle vs 1).
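
For context, popcnt just counts the set bits in a word; a throughput
microbenchmark issues many independent counts so the per-cycle issue limit,
not latency, dominates. A Python sketch of the operation being measured:

```python
def popcount(x):
    """Count set bits, what a single popcnt instruction computes."""
    n = 0
    while x:
        x &= x - 1  # clear the lowest set bit
        n += 1
    return n
```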

------
yifanlu
Assuming both Intel and AMD implement performance monitors the same way
(i.e. the same notion of instructions executed, which may be hard to measure
with speculative execution), the comparison is still flawed: it doesn't
matter if Intel can do more instructions per cycle if AMD can produce more
cycles in the same span of wall time.

> However, it is not clear whether these reports are genuinely based on
> measures of instruction per cycle. Rather it appears that they are measures
> of the amount of work done per unit of time normalized by processor
> frequency.

That’s precisely why nobody really uses IPC as a way to compare processors.
“How much work is done per unit of time” is a much better measurement, and I
guess for historical reasons people conflate it with IPC.

But real textbook IPC is useless for comparison.

~~~
endorphone
"the comparison is still flawed because it doesn’t matter if Intel can do more
instruction per cycle if AMD can produce more cycles in a span of wall time."

The reason Intel held the "per core" superiority crown for years is that it
had better IPC due to design efficiency. Both manufacturers are pushing
against the same frequency ceiling, so if you went AMD you had to
significantly increase the core count to catch up, and you could never match
the still-important single-thread performance.

We know from large scale, comprehensive benchmarks that AMD has massively
picked up the pace and is neck and neck with Intel. At the same processor
speed it matches the best Intel processors.

But yeah, this article is just _terrible_. Not just tiny, minuscule,
extremely myopic benchmarks, but then a gross over-reach with the
conclusions. And in the way that ignorance begets ignorance, the fact that
it's trending on a couple of social news sites means that Google is now
surfacing it as canonical information when it's just junk, an extremely lazy
analysis.

~~~
Tempest1981
He ran a few basic tests, and showed the results. Where was the "gross over-
reach"? The article ends with a "your mileage may vary" disclaimer.

~~~
endorphone
"So AMD runs at 2/3 the IPC of an old Intel processor. That is quite poor!"

That is most certainly an overreach. An extraordinary overreach. Worse, it
absurdly uses an AVX2 codebase, optimized for Westmere, as the baseline for
"IPC" testing? The premise itself borders on gross negligence.

IPC as a generalized concept is measured over a broad, general-purpose mix
of instructions, not an absurdly narrow test.

Saying "Intel is faster at AVX512" is going to surprise exactly no one, and
also happens to be irrelevant for the overwhelming majority of users and uses.

The microbenchmarking thing has gone on for years, and at this point anyone
who has paid any attention is rightly cautious when stomping their feet and
making declarations, because usually they're just pouring noise into the mix.
Lazily running a couple of tiny tests is not the rigour to avoid deserved
criticism.

~~~
BeeOnRope
I'm not sure if you were implying it or just using it as an example of
another type of unhelpful claim, but this test does not involve AVX-512.

I agree using Westmere isn't necessarily the best approach, but there is no
difference in this case with either -march=native or -march=znver1.

The loop is small and simple, with only 9 instructions, and compiles more or
less the same regardless of the march setting (I observed some basically
no-op changes, such as a mov and blsr swapping places). Here's the assembly
(for the second test, with the bigger IPC gap):

        top:
        tzcnt  r8,rcx
        add    r8d,edx
        mov    DWORD PTR [rdi+rax*4],r8d
        mov    eax,DWORD PTR [rsi]
        inc    eax
        blsr   rcx,rcx
        mov    DWORD PTR [rsi],eax
        jne    .top
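
For readers who don't speak x86: the hot part of that loop walks the set
bits of a mask, with tzcnt finding the index of the lowest set bit and blsr
clearing it, while the mov/inc pair maintains a counter in memory. A rough
Python model of that bit-iteration logic:

```python
def set_bit_indices(mask, base=0):
    """Model of the tzcnt/blsr loop: emit base + index of each set bit,
    lowest first, until the mask is exhausted (the jne falls through)."""
    out = []
    while mask:
        tz = (mask & -mask).bit_length() - 1  # tzcnt: index of lowest set bit
        out.append(base + tz)                 # add r8d,edx; store the index
        mask &= mask - 1                      # blsr: clear lowest set bit
    return out
```

The branch at the bottom is taken once per set bit and mispredicts easily on
random masks, which is part of why this loop is an interesting IPC test.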

~~~
endorphone
"I'm not sure if you were implying it or just using it as example of another
type of unhelpful claim, but this test does not involve AVX-512."

Even worse! Is this a defense? Because it's remarkably unhelpful as one.

The blog post was clearly a cry for attention for some project -- let's just
use some clickbait IPC claims to gain it -- and it continually alluded to a
whole project -- an extremely niche project that still wouldn't have any
relevance. But instead it's a meaningless, completely misrepresentative
micro-loop.

~~~
BeeOnRope
My read is different than yours.

I think Daniel uses those examples because they are actual examples from
projects that he is or has been working on, ones he's familiar with and
actually cares about, and because they're at least a notch more realistic
than something totally synthetic.

It seems like a very roundabout way to seek attention for simdjson (the
project I assume you are talking about), and I don't believe that's the
purpose. I see no problem in linking the project.

Picking two random benchmarks and trying to extract any kind of more general
IPC claim is not on solid ground, but I'm pretty sure Daniel would say he's
not doing that: he's only sharing these two specific results. That's a style
that recurs across several entries in that blog, however, so if it bothers
you (as it has me on occasion) you might want to look elsewhere.

------
jonstewart
It’s depressing how many comments here are quick to dismiss the
benchmarking/article. Yes, yes, memory bandwidth, I/O, and cache hierarchies
are all important, but Daniel Lemire is one of the top people in the world
when it comes to optimizing algorithms for modern CPUs. Do you like search
engines? Lemire has made them significantly faster. He is often able to take
code/algorithms that already seem fast, and make them much faster. He’s
recently branched out beyond search engine core algorithms into some aspects
of string processing (base64, UTF-8 validation, JSON parsing).

In this blog post, he’s paying attention to IPC because he’s typically working
with inner loops where the data’s being delivered from RAM to L1 as
efficiently as possible.

~~~
BeeOnRope
I have plenty of respect for Daniel (and you can even find me below in this
discussion defending some aspects of this test), but I too find some fault
with this article.

The main problem I have is that the claim in dispute seems to be that Zen 2
has comparable (perhaps slightly higher) IPC to Skylake, and then Daniel picks
out two benchmarks and shows that Skylake has higher IPC than Zen 2... proving
what exactly?

Contradicting people who said that Zen 2 had higher IPC on _every_
benchmark? Yes, those people were wrong, but it's easy to prove a point if
you pick an argument almost no one was making in the first place.

In the same (second) benchmark he selected the "basic_decoder"
sub-benchmark, but there is also another sub-benchmark, "bogus", which tests
empty-function calling time, and in that case I measure the reverse
scenario: Intel at IPC 2.25 and AMD at 3.43. So should we now say that Intel
IPC is "quite poor"?

~~~
jonstewart
Ha, I’m not referencing _your_ comments here, and I am curious about how you
couldn’t reproduce his results; he’s quick to publish and seems happy to
correct, so we’ll see. I’m referencing more the other comments here saying
the benchmarks aren't realistic because good benchmarks need a mix of tasks,
like memory and I/O—this ain’t a Phoronix post, folks.

I started reading through your blog last night. I’m slowly trying to learn
how to go from being a programmer who doesn’t write slow code to one who
writes fast code, so I'm absorbing a lot about vectorization, ILP, etc.

~~~
BeeOnRope
He's corrected the results (possibly even before I wrote the post you
responded to this AM): they originally showed Intel at 2.8 IPC in the second
table, they now show 2.1.

I measured 2.0, but I guess Daniel is using docker with a slightly different
compiler version, so I think the gap is sufficiently small that we can
declare it "close enough". I also measured quite different numbers for SKL
(2.0) vs SKX (1.7), which is quite odd given the non-memory-intensive
behavior of the test: in that scenario, I'd expect SKL and SKX to perform
identically.

------
reitzensteinm
The second example is just a benchmark of tzcnt, added in BMI1. It's a very
specific and very bizarre benchmark to run when you could just look up the
reciprocal throughput (unfortunately Zen 2 has not yet been added).

[https://www.agner.org/optimize/instruction_tables.pdf](https://www.agner.org/optimize/instruction_tables.pdf)

Edit: This is wrong as BeeOnRope points out below.

The first is SIMD heavy, so Zen 2 mostly closing the gap with Intel in one of
the areas where Zen 1 was very weak is a good thing.

~~~
BeeOnRope
Zen 2 is on uops.info: it's 2L0.5T on Zen, 3L1T on Intel, so a slight
theoretical edge for AMD (2 vs 1 uops, though).

That said, I don't agree that it's a tzcnt benchmark: there are about 9
instructions, only one of which is tzcnt. I'm not sure why Zen 2 is worse
here.

~~~
reitzensteinm
You're right, I messed that up (though I'll leave it for posterity). I went
into it with a bias, thinking BMI was slow on Zen, since PDEP is 18 cycles
vs 1 on Skylake, much to my disappointment back in the day.

After reviewing the example again, there's no obvious reason why Zen 2 is
slower, although it's likely a rare edge case. Too bad there's nothing decent
like VTune on AMD platforms.

I remember one session where my choice of temporary register significantly
impacted throughput while implementing an unrolled int[] hash fn on my Kaby
Lake processor. I never figured out exactly why, but sharp edges do exist even
on Intel chips.

~~~
BeeOnRope
This benchmark heavily stresses branch misprediction recovery, so that could
be worse on Zen.

Also, I could not reproduce Daniel's results: I got an IPC of 1.77 (SKX) or
2.00 (SKL) compared to Daniel's reported 2.80 (SKL, I think), so Intel is
still better, but by a smaller margin. Waiting for clarification on that
one.

------
eyegor
I think the only real way to compare IPC is to actually talk to the
architects. Trying to write microbenchmarks is a fool's errand when you
aren't aware of how the CPU processes the instructions you give it. Are you
actually stressing the FPU, or is the CPU branch-predicting and
speculatively executing the workload (common for micro loops)? If it is, is
that what you meant to test? Are you trying to compare like for like (in
which case you have to write assembly), or are you trying to write
performance benchmarks (in which case the only meaningful metric is CPU
time)?

This is an interesting idea, but I'm not sure how you could derive meaning
from comparing two vastly different architectures at such a high level.

------
alecmg
Useless, strictly academic interest.

There is more to processor design than execution ports. Not every task can
be SIMD-optimized to the extent of approaching theoretical IPC limits; most
will be bottlenecked by memory access or even IO.

I prefer the "fake" but real-world IPC: same clocks, same real-world task,
measure time to finish.

~~~
Erwin
I think this was more of a response to the linked benchmark at guru3d which
said:

> Instructions per cycle (IPC)

> For many people, this is the holy grail of CPU measurements in terms of how
> fast an architecture per core really is.

Based on his work with simdjson, professor Lemire seems to be quite aware of
microbenchmarks being problematic. But general articles out here and on HN are
proclaiming Intel is doomed and can never recover, due to mitigations/lack of
cores/lack of chiplets. Those concerns have yet to be reflected in the stock
price.

~~~
pjc50
Intel are behind. They have a pretty big cash buffer and a solid sales
channel, as well as being pretty entrenched in OEMs. So they are a very long
way from being doomed, even if it takes them a long time to turn the ship
around (like 00's Microsoft).

------
zippie
IPC microbenchmarks do not properly reflect the complex workloads running on
post Zen2 microarchitecture. Zen2 upends microarchitecture schematics enough
to warrant a different metric.

IPC MB’s, in my experience, tend to benchmark best case scenarios and that is
probably the exception rather than the rule for application workloads in
modern MA’s. Case in point, microbenchmarks showed significant improvements in
IPC for Zen2 in lieu of Skylake yet for the application workload (CPU data
bound), Skylake held up neck and neck.

The more appropriate benchmarking metric for post-Zen2 processors is CPI [0].

[0] [https://john.e-wilkes.com/papers/2013-EuroSys-CPI2.pdf](https://john.e-wilkes.com/papers/2013-EuroSys-CPI2.pdf)

~~~
mmrezaie
But isn't CPI just the inverse of IPC? And does CPI just make the IPC score
bounded between 0 and 1?
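
For what it's worth, the two are just reciprocals as arithmetic, and neither
is confined to 0..1: a superscalar core sustaining more than one instruction
per cycle has CPI below 1. A trivial sketch (the 2.1 figure is one of the
Intel IPC numbers discussed elsewhere in this thread):

```python
def cpi(ipc):
    """Cycles per instruction is the reciprocal of instructions per cycle."""
    return 1.0 / ipc

# Intel at ~2.1 IPC is ~0.48 CPI; CPI < 1 whenever IPC > 1,
# so nothing is being normalized into a 0..1 range.
```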

------
chucklenorris
Heh, I'm curious if he used the mitigations for all the side channel flaws for
the intel processors.

~~~
BeeOnRope
The mitigations don't affect CPU bound benchmarks [1] which don't call into
the kernel or use specific user-space mitigations, so it won't matter here.

[1] There are some rare exceptions, such as
[https://travisdowns.github.io/blog/2019/03/19/random-writes-and-microcode-oh-my.html](https://travisdowns.github.io/blog/2019/03/19/random-writes-and-microcode-oh-my.html),
but it is unlikely to matter here.

~~~
amluto
That may have been true, but it is rather dramatically false with the new JCC
erratum workaround.

It’s also false if you’re using a hypervisor that mitigates the iTLB multihit
issue.

~~~
BeeOnRope
Good point, I forgot about that one, although here there is only a single
hot loop with one jump, so there is a high chance the crossing doesn't
occur, and even if it does, the IPC is low enough that the legacy decoder
probably does OK (although it adds a cycle or two to misprediction recovery,
which matters here).

So it's something worth checking.

No hypervisor involved.

------
_ph_
While it is only part of the performance equation, analyzing IPC can be
quite interesting for understanding the design of a processor and how its
performance might be achieved.

One thing bothers me about the presented comparison: it runs very few
benchmarks, all generated with the same compiler. For a thorough IPC
analysis, shouldn't the tests be programmed in assembly to exclude any
influence from the compiler choice? A wider range of algorithms should
probably also be checked, as IPC on modern processors depends less on how
many cycles a certain instruction takes (you should be able to find that in
the manuals) than on how well multiple components of the processor can be
utilized at the same time, which depends heavily on the actual program being
run.

------
tempguy9999
I'm rather surprised at the claim that "but it might easily execute 7 billion
instructions per second on a single core". I'd even question it except the
author's an expert.
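
The arithmetic behind the claim is plain: instructions per second = IPC
times clock frequency. The figures below are illustrative, not from the
article: a core boosting near 4.4 GHz only needs to sustain an IPC of about
1.6 to retire 7 billion instructions per second.

```python
def instructions_per_second(ipc, freq_hz):
    """Retired instructions per second for a core sustaining the given
    IPC at the given clock frequency."""
    return ipc * freq_hz

# Hypothetical 4.4 GHz core sustaining IPC ~1.6:
rate = instructions_per_second(1.6, 4.4e9)  # ~7e9 instructions/s
```

Whether the IPC side of that product survives cache misses is, of course,
exactly the question raised above.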

If you can keep it fed, then OK, but one cache miss to main memory, either
instruction or data, will allow the instruction buffers to completely empty
and stay empty for quite a long time. I don't think you can control
placement to reasonably assure cache hits for anything but the most trivial
code. Am I missing something?

Also if you could keep a consistent throughput like this I wonder if thermal
throttling might have to kick in. I mean you're doing a lot of work...

~~~
touisteur
I can't find it again, but in a recent article I read that it is useful to
have an idea of the upper-bound abilities of an architecture+algorithm, so
that you 'know' what you're aiming for, even if it might not be attainable
in practice without huge human effort or decades of superoptimizer
effort... Yes, if your algorithm reaches for cold data, you'll get hit. Can
you get around that? Do you really need to hit the cache when you're
computing the seven-billionth decimal of pi or factoring numbers? This work
is quite interesting, if only for compilers or superoptimizers.

------
Const-me
I wonder how reliable these Linux syscalls are.

Found this
[http://manpages.ubuntu.com/manpages/trusty/man2/perf_event_o...](http://manpages.ubuntu.com/manpages/trusty/man2/perf_event_open.2.html)
and that man page doesn't instill much confidence in the reliability of
these counters. The comment for CPU_CYCLES says "Be wary of what happens
during CPU frequency scaling", the comment for INSTRUCTIONS says "these can
be affected by various issues, most notably hardware interrupt counts",
BRANCH_INSTRUCTIONS says "Prior to Linux 2.6.34, this used the wrong event
on AMD processors", and so on.

If I wanted to measure what OP was measuring, I would disable frequency
scaling (probably doable on overclocker-targeted motherboards; a search also
finds some utilities that claim to do it, both Windows and Linux ones),
measure time, then divide by frequency.
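
That fallback is the same identity rearranged: with the clock pinned,
cycles ≈ elapsed time times frequency, and IPC = instructions / cycles. A
sketch of the bookkeeping (it assumes you know the retired instruction count
some other way, and is only valid if the frequency really is fixed):

```python
def estimate_ipc(n_instructions, elapsed_s, freq_hz):
    """Estimate IPC from wall time, valid only if the core is pinned
    at freq_hz for the whole measurement."""
    cycles = elapsed_s * freq_hz
    return n_instructions / cycles
```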

~~~
amluto
CPU_CYCLES counts cycles. This means that the _time_ per cycle varies with
frequency. If you're trying to see how many cycles something that fits in L1
takes, CPU_CYCLES is the right thing to measure.

~~~
reitzensteinm
Parent is pointing to documentation suggesting that it's measuring time and
dividing it by frequency, and perhaps not perfectly in the case of dynamic
scaling. They seem aware of what CPU_CYCLES is _supposed_ to do.

~~~
amluto
The documentation is not the best. CPU_CYCLES is genuinely counting cycles.

perf is all about reading actual hardware counters. It's awesome for this.
There is essentially nothing made up about perf's output, except to the extent
that the hardware itself reports inexact output. (For example, perf annotate
may attribute events to an instruction near the instruction in question on
older hardware, because older hardware has a small amount of skew when
sampling.)

------
nabla9
In more comprehensive single thread benchmarks (single thread POV Ray) Intel
can still beat Zen 2 architecture sometimes. This test seems to indicate the
reason why.

------
qxnqd
ITT: AMD apologists.

Sorry guys, but Intel is still the king of single-core performance. But
that's not a problem, because I'm sure by 2050 most desktop applications and
games will correctly make use of many cores, and then AMD will reign.

~~~
tempguy9999
Worth responding to a blatant troll to point out that it's not about
performance but performance per price for 99% of uses.

~~~
eyegor
Or performance per watt in server land, which is a metric Zen 2 dominates.
Very few applications truly care about maxing performance at all costs.

~~~
ncmncm
But, indeed, some do. They will provide as much power and as much cooling as
they need to get that performance.

