
How NOP Nearly Became a Non-NOP on AMD64 (2007) - mtviewdave
http://www.pagetable.com/?p=6
======
userbinator
I find myself nodding in assent at the last comment... I also believe that the
amount of code out there which would benefit from having an extra 24 32-bit
registers is far more than that which would benefit from having 16 64/32-bit
ones instead.

~~~
mgraczyk
Not likely. Modern out-of-order microarchitectures like Bulldozer and
Broadwell have far more physical registers than the ISA specifies. Haswell has
168 physical integer registers, for example:
[http://www.realworldtech.com/haswell-cpu/3/](http://www.realworldtech.com/haswell-cpu/3/)
The lack of registers in the ISA moves the burden of data dependency checking
from compiler to core, but it doesn't increase the number of stalls.

~~~
devit
Well, it means that if code cannot fit everything into the ISA registers, it
has to spill them to the stack, which AFAIK is not renamed to the physical
registers on current x86, so you still need to pay an extra penalty to access
the stack data in the L1 cache.

~~~
Tuna-Fish
But stack accesses are generally perfectly predicted, meaning that your loads
from stack get executed and the data loaded into those extra registers long
before your code needs to use those values.

~~~
tux3
The prefetcher does a great job, but it's not remotely enough to make stack
access penalties disappear.

It's trivial to write microbenchmarks where spilling hot read/write variables
to the stack destroys performance, much harder to find special cases where the
difference isn't noticeable.

~~~
stephencanon
The hot region of the stack is, for practical purposes, always resident in D$,
so there's no _prefetching_ to be done. I think that you're really talking
about out-of-order and speculative loads. You're absolutely correct that
spill/fill can be catastrophic for performance, however.

------
Asbostos
It seems more a matter of wording than anything in the CPU.

"An assembler would translate the mnemonic “nop” into “xchg ax, ax” (opcode
0x90)"

But 0x90 also means nop, so it's not really translating anything. And now it
still means nop in AMD64.

~~~
creshal
> But 0x90 also means nop

Sometimes, as the article explains further down. That's where the whole
discussion comes from.

• In 32-bit mode, it doesn't matter. Some (dumb) CPUs treated it as xchg
ax,ax; pipelined CPUs optimized it away.

• In 64-bit mode, it matters: is "xchg eax, eax" a valid way to clear the
upper 32 bits of rax? Or will it always be optimized away as a legacy NOP?

AMD chose the latter. They could also have introduced a new opcode for NOP
instead (there are already multiple nop instructions, like nopl/nopw, so it
wouldn't have been too far off). As this only affects 64-bit mode, backwards
compatibility didn't really matter, so either would have been possible.
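
To make the 64-bit question concrete, here is a toy Python model (not real hardware behavior, and the function names are made up for illustration). It assumes only the AMD64 rule that a 32-bit register write zero-extends into the full 64-bit register, and compares a literal "xchg eax, eax" with 0x90 treated as a true NOP:

```python
MASK32 = 0xFFFFFFFF

def xchg_eax_eax_literal(rax):
    """Hypothetical reading: 0x90 really performs a 32-bit exchange.

    The exchange writes eax back as a 32-bit result, and 32-bit writes
    zero-extend in 64-bit mode, so the upper half of rax is cleared.
    """
    eax = rax & MASK32
    return eax  # zero-extended to 64 bits

def nop_0x90(rax):
    """The reading AMD64 settled on: 0x90 changes no architectural state."""
    return rax

rax = 0x1122334455667788
print(hex(xchg_eax_eax_literal(rax)))
print(hex(nop_0x90(rax)))
```

Under the first reading the instruction has a visible side effect, so it could no longer be used as padding.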

~~~
Asbostos
Opcode 0x90 only means xchg eax,eax on paper. If no documentation ever called
it that then it would be a non-issue. It would always have been nop and still
be nop. Somebody could also have called it xchg ebx,ebx as well and it would
have been just the same in 32-bit mode.

~~~
creshal
> If no documentation ever called it that then it would be a non-issue.

XCHG EAX,target is defined as opcode (0x90 + offset of target register), with
EAX having offset 0.

So, it was xchg eax,eax originally, and documented as such, before it was
turned into NOP because it happened to be safe for it.

It's still documented as "alias for the XCHG (E)AX, (E)AX instruction" in
Intel's instruction set reference, and pre-486 embedded x86s still treat it as
xchg.
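
As a rough sketch of that encoding rule (the helper name is invented here, and the table is just the standard 32-bit register numbering):

```python
# Register numbers used by the one-byte "XCHG EAX, reg" encoding.
REG_NUMBER = {
    "eax": 0, "ecx": 1, "edx": 2, "ebx": 3,
    "esp": 4, "ebp": 5, "esi": 6, "edi": 7,
}

def xchg_eax_opcode(target):
    """Return the one-byte opcode for XCHG EAX, <target> (hypothetical helper)."""
    return 0x90 + REG_NUMBER[target]

for reg in REG_NUMBER:
    print(f"xchg eax, {reg}: {xchg_eax_opcode(reg):#04x}")
```

The eax row comes out as 0x90, which is the byte everybody calls NOP.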

~~~
Asbostos
Sure. That "If" kind of makes it moot. That's all I was trying to point out -
that it's a documentation thing rather than something in the design of the
chip and how it works.

Edit: What do you mean "still treat it as xchg"? On those chips, isn't there
no distinction between xchg and what we might retrospectively call "nop"?
Perhaps this is something I'm missing.

~~~
creshal
> Edit: What do you mean "still treat it as xchg"? On those chips, isn't there
> no distinction between xchg and what we might retrospectively call "nop"?
> Perhaps this is something I'm missing.

XCHG EAX, EAX in its dumbest interpretation loads EAX into a temporary
register, replaces it with the contents of EAX and restores the temporary data
to… EAX. So, it _is_ an operation that does nothing, but it does nothing in an
elaborate way. You can skip it instead of executing it, but only if your other
code doesn't depend on 0x90 taking exactly three clock cycles.

The "treat 0x90 as NOP and skip it instead of wasting three cycles"
optimization was only done with the 486 and up, and wasn't retroactively
applied to the 386 embedded versions. Doing so would have messed up their
timings and would have needed a small design change, neither of which
interested that customer base.
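
A back-of-the-envelope illustration of the timing problem, in Python. The three-cycle figure is the one quoted above; the 25 MHz clock, the 100-NOP delay loop and the one-cycle cost on an optimizing core are assumptions picked purely for the arithmetic:

```python
def delay_cycles(nop_count, cycles_per_nop):
    """Total cycles spent in a run of 0x90 bytes used as a delay."""
    return nop_count * cycles_per_nop

def delay_us(nop_count, cycles_per_nop, clock_hz):
    """Same delay in microseconds at the given clock frequency."""
    return delay_cycles(nop_count, cycles_per_nop) * 1_000_000 / clock_hz

CLOCK_HZ = 25_000_000  # hypothetical embedded 386 clock
NOPS = 100             # hypothetical delay loop length

print(delay_us(NOPS, 3, CLOCK_HZ))  # 0x90 executed as a three-cycle xchg
print(delay_us(NOPS, 1, CLOCK_HZ))  # 0x90 skipped in one cycle (assumed)
```

Code that was tuned against the first number gets a delay a third as long under the second.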

(The 386 and derivatives were still produced for embedded use for a long, long
time, past 2001, and thus after the introduction of AMD64. When its
instruction set was drafted, new 386-based embedded devices were still
being designed.)

