
Effects of the x86 ISA on the Front End: Where have all the cycles gone? (2001) [pdf] - nkurz
https://www.eecs.umich.edu/techreports/cse/01/CSE-TR-440-01.pdf
======
userbinator
The PDF's creation date is 2001, so this is quite an old paper and the
details may not have much relevance anymore, since x86 CPU development has
progressed very quickly; I'd guess that AMD and Intel have long since made
use of the techniques mentioned here.

It's still worth reading though - the uop decoder mechanism of today's
microarchitectures remains very similar to the one introduced with the P6,
just wider and deeper.

------
wsxcde
The paper's main problem is that the authors have to second-guess all of
Intel's and AMD's design decisions. And it looks like they got some important
things wrong, which makes their conclusions suspect.

For instance, they claim an ADD mem, reg instruction would be translated into
three uops. This is definitely not the case; x86 chips can do a
load+arith_op or a store+arith_op in the same uop. The reason they need to do
this is that the old x86 (in the pre-AMD64 days) had only 8 registers, so
spills were very common and loads and stores were executed far more
frequently than in RISC code. This meant that getting maximum performance
out of the memory subsystem was critical for x86. I remember an AMD chief
architect telling me that the hardest part of building a performant x86 core
is getting the memory subsystem right.
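The three-uops-vs-fewer point can be made concrete with a toy model. This is purely illustrative: the uop names and the fusion rule below are simplified stand-ins, not a description of any real decoder.

```python
# Toy model of decoding the read-modify-write instruction "add [mem], reg".
# Illustrative only: real decoders are far more complicated than this.

def decode_add_mem_reg(fuse_load_op: bool) -> list[str]:
    """Return a hypothetical uop sequence for: add [mem], reg."""
    if fuse_load_op:
        # Many x86 cores issue the load and the ALU op as a single uop,
        # leaving only a separate store uop.
        return ["load+add", "store"]
    # Naive cracking into RISC-like primitives, as the paper assumes:
    # a separate load, ALU op, and store.
    return ["load", "add", "store"]

print(decode_add_mem_reg(fuse_load_op=False))  # ['load', 'add', 'store']
print(decode_add_mem_reg(fuse_load_op=True))   # ['load+add', 'store']
```

With fusion, the same instruction costs two uops instead of three, which is why the paper's assumed decomposition inflates its uops/instruction estimate.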

Anyway, back to the paper, I'm not terribly impressed. The uops/instruction
number looks bogus; I remember those numbers being very close to 1. The
cycles per uop of 0.37 seems wildly optimistic. I remember numbers closer to 1.

My only caveat is that my experience working on x86 is from about 3-4 years
ago. Maybe things were very different in 2000, but I doubt it.

~~~
userbinator
_I remember an AMD chief architect telling me that the hardest part of
building a performant x86 core is getting the memory subsystem right._

Definitely; Linus also makes this point in his well-known post on x86
([http://yarchive.net/comp/linux/x86.html](http://yarchive.net/comp/linux/x86.html)):

 _The low register count isn't an issue when you code in any high-level
language, and it has actually forced x86 implementors to do a hell of a lot
better job than the competition when it comes to memory loads and stores -
which helps in general. While the RISC people were off trying to optimize
their compilers to generate loops that used all 32 registers efficiently, the
x86 implementors instead made the chip run fast on varied loads and used tons
of register renaming hardware (and looking at _memory_ renaming too)._

More recently, I've heard that the effectiveness of caching, out-of-order
execution, and speculation mean that on modern x86 the most frequently
accessed memory, which would be the local variables near the stack pointer, is
almost as fast as the registers - comparisons have been made between the x86's
8 (or 16 for x86-64) registers and the 256 bytes around the stack pointer to
the 6502's "one really fast register, and 256 others that are nearly as fast."

 _The uops/instruction number looks bogus. I remember those numbers being
very close to 1._

It depends greatly on the exact instruction mix being executed; same for the
cycles/uop.
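A back-of-the-envelope illustration of how much the mix matters. All the mix fractions and per-instruction uop counts below are made-up numbers for the sake of the arithmetic, not measurements of any real core or workload:

```python
# Weighted-average uops/instruction for two hypothetical instruction mixes.
# Every number here is invented for illustration.

def uops_per_instr(mix: dict[str, float], uops: dict[str, int]) -> float:
    """Average uops per instruction, weighting each class by its frequency."""
    return sum(frac * uops[kind] for kind, frac in mix.items())

# Hypothetical uop costs per instruction class.
uops = {"reg_alu": 1, "load_op": 2, "rmw": 3, "complex": 4}

# An ALU-heavy mix vs. a memory-heavy, spill-laden mix.
alu_heavy = {"reg_alu": 0.80, "load_op": 0.15, "rmw": 0.04, "complex": 0.01}
mem_heavy = {"reg_alu": 0.40, "load_op": 0.35, "rmw": 0.20, "complex": 0.05}

print(round(uops_per_instr(alu_heavy, uops), 2))  # 1.26
print(round(uops_per_instr(mem_heavy, uops), 2))  # 1.9
```

Shifting the mix toward memory-touching instructions moves the average well away from 1, so a benchmark suite's composition alone can explain a sizable disagreement in uops/instruction figures.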

~~~
sbierwagen

      Definitely; Linux also makes this point in his well-known post on x86
    

Linus, perhaps?

~~~
userbinator
Yes I do mean him. Fixed.

------
jws
I am unclear when this was published. It does not reference an Intel
microarchitecture past P6 (Pentium Pro) and contains no citations later than
2000.

TLDR Abstract: _We look at 8 aspects of the front end that can contribute to
performance loss; then, based on this information, we introduce an improvement
that yields a 17% speedup in overall execution time._

~~~
_delirium
It's from 2001. See:
[http://www.eecs.umich.edu/eecs/research/techreports/cse_tr/d...](http://www.eecs.umich.edu/eecs/research/techreports/cse_tr/database/reports.cgi?01)

~~~
dang
Thank you. I searched earlier and that information is surprisingly hard to
come by.

~~~
_delirium
Yeah, it's weird that the document doesn't "self-identify" its publication
information anywhere. The only reason I found the metadata quickly is that I
glanced at the URL, realized it was a tech report, and so Googled for Umich
CSE's tech report archive.

