
Multi-byte NOP opcode made official - based2
https://software.intel.com/en-us/forums/watercooler-catchall/topic/307174
======
yuhong
I wrote about long nops in
[http://www.agner.org/optimize/blog/read.php?i=25#82](http://www.agner.org/optimize/blog/read.php?i=25#82)

I had a private email thread with H. Peter Anvin (formerly of Transmeta) about
this.

I think there was an Intel patent about them.

------
foota
Needs a 2006

------
based2
[http://stackoverflow.com/questions/4798356/amd64-nopw-assemb...](http://stackoverflow.com/questions/4798356/amd64-nopw-assembly-instruction)

[http://patents.com/us-9330011.html](http://patents.com/us-9330011.html)
Microprocessor with integrated NOP slide detector, May 3, 2016 - VIA
Technologies, Inc

------
WhitneyLand
I used multiple NOP all the time when learning to code 6502 on the VIC-20.
With no assembler and typing in hex valued opcodes by hand, it was useful to
leave gaps during development to avoid retyping everything too often.

6502 actually had pretty good performance at the time on a per cycle basis.

~~~
drbw
I'm somewhat surprised that I can remember that the NOP opcode was EA. Amazing
what sticks in your mind for almost forty years...

~~~
phkahler
>> I'm somewhat surprised that I can remember that the NOP opcode was EA.
Amazing what sticks in your mind for almost forty years...

poke 19215,25. It was used in MS BASIC on the Interact computer (1978/9) to
enable peek and poke. Oddly the poke instruction would execute and THEN error,
but poking this one value disabled the error. There was also a set of 3 pokes
that disabled address checking which was used to prevent people from reading
the system ROM and the basic interpreter itself.

I believe I still have quite a collection of the "Interaction" newsletter that
was published in the Detroit area about this machine, which is where the above
info came from. Yeah, the stuff we remember...

------
13of40
I know I'm being stupid somehow, but why isn't 0x90 0x90 0x90 0x90 0x90 0x90
0x90 0x90 0x90 a good enough 9-byte NOP? It's not like a modern CPU is going
to spend a cycle individually fetching and pondering over every 0x90...

~~~
phire
Modern Intel CPUs can decode 4 instructions per cycle. 9 single-byte NOPs in a
row would therefore take 3 cycles (though the last cycle will overlap with
up to 3 more instructions).

A multi-byte NOP counts as one instruction, taking just one cycle to decode,
and leaving room for up to 3 more instructions to decode that cycle.
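
A minimal sketch of that arithmetic, assuming the simplified model above (a fixed decode width of 4 and nothing else limiting the front end). The byte sequences are the NOP encodings Intel recommends for lengths 1 to 9; the names `LONG_NOPS`, `nop_padding` and `decode_cycles` are made up for illustration.

```python
import math

# Intel-recommended NOP encodings, one instruction each, indexed by length.
LONG_NOPS = {
    1: bytes.fromhex("90"),
    2: bytes.fromhex("6690"),
    3: bytes.fromhex("0f1f00"),
    4: bytes.fromhex("0f1f4000"),
    5: bytes.fromhex("0f1f440000"),
    6: bytes.fromhex("660f1f440000"),
    7: bytes.fromhex("0f1f8000000000"),
    8: bytes.fromhex("0f1f840000000000"),
    9: bytes.fromhex("660f1f840000000000"),
}

DECODE_WIDTH = 4  # assumption taken from the comment: 4 instructions per cycle


def nop_padding(length):
    """Cover `length` bytes with as few NOP instructions as possible."""
    chunks = []
    while length > 0:
        size = min(length, 9)          # longest encoding in the table
        chunks.append(LONG_NOPS[size])
        length -= size
    return chunks


def decode_cycles(instruction_count, width=DECODE_WIDTH):
    """Cycles needed to decode the instructions under the simplified model."""
    return math.ceil(instruction_count / width)


single = [LONG_NOPS[1]] * 9   # nine separate 0x90 instructions
multi = nop_padding(9)        # one 9-byte instruction

print(len(single), "instructions,", decode_cycles(len(single)), "cycles")
print(len(multi), "instructions,", decode_cycles(len(multi)), "cycles")
```

Real decoders have other limits (fetch width, length-changing prefixes, the uop cache), so treat the cycle counts as an illustration of the argument rather than a measurement.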

~~~
13of40
Upvoted - thank you for the answer. However, I feel I can't get by without
empirically testing it in the morning...

------
robert_tweed
Intriguing, but the original posts are from 2006 with a bump from 2008 and
definitive answers are never given, just "this may require an NDA to discuss."

If you care about optimising NOPs, you're probably writing a compiler, so I am
curious if these instructions have found their way into any mainstream
compiler such as GCC or Clang. Does this post explain some odd compiler
behaviour?

Has newer information been published making this anything other than abuse of
an undocumented quirk? I.e., liable to blow up on new processors, as it does
on the Pentium MMX, according to the second post.

I mean, it's _interesting_ but I'm not sure why it's here.

~~~
raverbashing
Yes, major compilers use the multi-byte NOP: GCC, Clang, MS compilers, etc.

------
userbinator
The 0F 18 opcodes have been known for a long time, and they are not "true NOPs"
but PREFETCH instructions:

[http://x86.renejeschke.de/html/file_module_x86_id_252.html](http://x86.renejeschke.de/html/file_module_x86_id_252.html)

~~~
pbsd
0F 1A and 0F 1B have recently become memory bound verification instructions,
as part of MPX. So really, only 0F 1F is likely to be long-term safe to use as
a multibyte NOP.

Likewise, after Pentium 4 REP NOP (F3 90) became PAUSE, the spinlock waiting
hint instruction, so one needs to be careful about prefixes with the old NOP
as well.

~~~
xorblurb
PAUSE only has a performance impact, though, and that was part of its design
(so you need only one binary for your spinlock, and it executes correctly even
on processors that don't know anything about PAUSE).
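
The prefix sensitivity described above can be sketched with a toy classifier. It only looks at the three encodings mentioned in this subthread and is nowhere near a real x86 decoder; `classify` is a hypothetical helper.

```python
def classify(code):
    """Return (mnemonic, length) for a few NOP-family encodings at the start of `code`."""
    if code.startswith(b"\xf3\x90"):
        # REP prefix + 90: PAUSE on CPUs that know it, a plain NOP on older ones.
        return ("pause", 2)
    if code.startswith(b"\x66\x90"):
        # Operand-size prefix + 90: still a NOP, just two bytes long.
        return ("nop", 2)
    if code.startswith(b"\x90"):
        return ("nop", 1)
    return ("unknown", 0)


for raw in (b"\x90", b"\x66\x90", b"\xf3\x90"):
    print(raw.hex(" "), classify(raw))
```

The point is only that the same 0x90 byte means different things depending on the prefix in front of it, so a padding generator can't pick prefixes arbitrarily.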

------
cvs268
I remember gaining a fair bit of bounty recommending multi-byte NOPs for
optimisation (use instead of multiple single-byte NOPs).

[http://stackoverflow.com/a/18279617/319204](http://stackoverflow.com/a/18279617/319204)

As part of researching this, I discovered an empirically verified list of NOPs
(all the way from 1 byte to 10 bytes each).

[https://android.googlesource.com/toolchain/binutils/+/f22651...](https://android.googlesource.com/toolchain/binutils/+/f226517827d64cc8f9dccb0952731601ac13ef2a/binutils-2.23/bfd/cpu-i386.c#51)

~~~
pwdisswordfish
> pipeline always stalls on a conditional far jump.

That doesn't make sense. Segment:offset jumps are always unconditional.

~~~
adrianratnapala
Maybe it's because there is a possibility the page table has done something
wicked to the destination.

~~~
pwdisswordfish
Have you read my comment?

------
revelation
The thread seems to say nothing about them being faster. I guess if they were
faster they would be quite useful for trampolines.

Not sure you should ever use a multi-byte instruction for alignment, but then
you shouldn't use NOP for alignment in general. That's what 0xCC is for.

~~~
gliptic
Since 0xCC is an interrupt, it's only useful for padding that isn't executed.
If you want to align loops, you need NOPs. Single-instruction, multi-byte NOPs
are certainly preferred by modern compilers.
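
A rough sketch of the distinction, assuming 16-byte alignment (the boundary is a choice for illustration, not something stated in this thread). `padding_needed` and `pad` are hypothetical helpers, and single-byte 0x90 is used for the executed case only to keep the example short; a real assembler would emit longer NOPs there.

```python
def padding_needed(offset, alignment=16):
    """Bytes required to move `offset` up to the next multiple of `alignment`."""
    return -offset % alignment


def pad(code, alignment=16, executed=True):
    """Append padding so the next instruction starts on an aligned address.

    executed=True  -> NOP bytes, safe if control falls through the padding.
    executed=False -> 0xCC (INT3), traps if the padding is ever reached.
    """
    fill = b"\x90" if executed else b"\xcc"
    return code + fill * padding_needed(len(code), alignment)


body = b"\xc3" * 13                        # 13 bytes of stand-in code
print(pad(body, executed=True).hex(" "))   # padding ahead of a loop head
print(pad(body, executed=False).hex(" "))  # padding between functions
```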

~~~
revelation
Interesting, I didn't know loop alignment was a thing. I guess it makes sense
since code is just data, too.

~~~
scaramanga
Absolutely. Furthermore, if the loop is small enough it can execute from uop
cache and the instruction fetcher/decoder can be powered down for a turbo
boost.

------
MarkSweep
Raymond Chen has a nice article about NOPs that had to be used on the 386 in
Windows 95:

[https://blogs.msdn.microsoft.com/oldnewthing/20110112-00/?p=...](https://blogs.msdn.microsoft.com/oldnewthing/20110112-00/?p=11773)

~~~
dingo_bat
> For example, there was one bug that manifested itself in incorrect
> instruction decoding if a conditional branch instruction had just the right
> sequence of taken/not-taken history, and the branch instruction was followed
> immediately by a selector load, and one of the first two instructions at the
> destination of the branch was itself a jump, call, or return.

My brain hurts imagining how they figured out and verified that this bug
exists.

~~~
jakub_h
Judging from the Matt Dillon case, a lot of "nah, the bug must be in my code"
must have been involved.

~~~
dingo_bat
What is the Matt Dillon case?

~~~
jakub_h
The guy who was looking half a year for a bug in his code and then half a year
for a bug in GCC (or something like that) only to find out that it was a bug
in an AMD CPU.

