
My, what strange NOPs you have! - barrkel
http://blogs.msdn.com/b/oldnewthing/archive/2011/01/12/10114521.aspx
======
daeken
I can't help but think about the x64 NOP story:
<http://www.pagetable.com/?p=6>

------
cosmicray
Early developers used to put NOPs into code, so they would have a place to
insert hex patches later on. At a minimum they could plug in a branch
instruction, bounce down to the end of the code segment, put a longer sequence
of code there, then branch back. Old school stuff.
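A rough sketch of that patch dance (hypothetical bytes, not from any real program): replace two of the reserved NOPs (0x90) with a short jump (opcode 0xEB plus a signed 8-bit offset) out to free space at the end of the segment.

```python
# Hypothetical illustration of hex-patching through a NOP pad.
NOP, JMP_SHORT = 0x90, 0xEB

code = bytearray([0x40, NOP, NOP, 0x48])  # inc ax; nop; nop; dec ax
patch_area = len(code)                    # pretend free space starts here

# rel8 is measured from the end of the 2-byte jump instruction
jump_site = 1
rel = patch_area - (jump_site + 2)
code[jump_site] = JMP_SHORT
code[jump_site + 1] = rel & 0xFF
print(code.hex())  # -> '40eb0148'
```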

~~~
1amzave
Modern compilers may actually do this as well, though for slightly different
reasons -- GCC (at the right optimization level) likes to put branch targets
and functions at aligned addresses, so you often end up with little pads of
NOP instructions scattered throughout your binaries (run 'objdump -d' on
something built with 'gcc -O2' to see firsthand, if you're interested).
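The size of those pads is just arithmetic; a tiny sketch of what the assembler computes for, say, 16-byte alignment (the address here is made up):

```python
# How many NOP bytes are needed to push the next function start
# to the next align-byte boundary.
def pad_to_align(addr, align=16):
    return (-addr) % align

# 0x4005f3 needs 13 pad bytes to reach 0x400600
print(pad_to_align(0x4005F3))  # -> 13
```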

I actually have a mostly-written LD_PRELOAD library lying around that exploits
this for purposes more like the ones you describe though -- hot-patching code
in memory at program load to dispatch system calls via little dynamically-
generated trampolines so you can insert calls to arbitrary tracing functions.
Perhaps I'll polish it up a bit and toss it on github...
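As a loose analogy in Python (not the actual LD_PRELOAD mechanics, just the shape of the idea): rebind a dispatch point so every call runs through a tracing trampoline before reaching the original code.

```python
import functools

# Wrap a function in a tracing trampoline, analogous to hot-patching
# a call site to bounce through dynamically generated code.
def trace(fn):
    @functools.wraps(fn)
    def trampoline(*args):
        print(f"trace: {fn.__name__}{args}")
        return fn(*args)
    return trampoline

def write(fd, data):       # stand-in for a syscall wrapper
    return len(data)

write = trace(write)       # "hot-patch" the dispatch point
print(write(1, b"hi"))     # traces, then returns 2
```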

~~~
calloc
Something along these lines:
<http://timetobleed.com/rewrite-your-ruby-vm-at-runtime-to-hot-patch-useful-features/>

------
steveklabnik
Some of these MSDN blogs almost feel like a non-Apple folklore.org.


I like it.

~~~
ot
I discovered Raymond Chen's blog some weeks ago and started reading all the
posts I could; it is fascinating and very well written.

All the problems that Windows developers had to face for compatibility almost
make me sympathize with them. Some context is given in Joel Spolsky's "How
Microsoft Lost the API War"
(<http://www.joelonsoftware.com/articles/APIWar.html>), where he talks about
"The Raymond Chen Camp".

------
msarnoff
In addition to a regular NOP, the Motorola 6809 (and possibly other
6800-series processors) has a BRN instruction: branch never. It's the opposite
of BRA (branch always): it takes an 8-bit offset and ignores it, effectively
making it a two-byte NOP.

~~~
nitrogen
On PIC microcontrollers, it's common to use a "goto $+1" instruction as a two-
cycle NOP to save an instruction word (every word counts when you only have
256 words of ROM and 32 bytes of RAM). It just branches to the next
instruction.
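A sketch of the trade-off in the spirit of that delay-codegen page (instruction names per PIC convention; the emitter itself is hypothetical): "goto $+1" burns 2 cycles in one program word, "nop" burns 1 cycle in one word, so a minimal-size delay packs as many gotos as possible.

```python
# Emit a minimal-word delay sequence for a PIC-like core:
# "goto $+1" = 2 cycles / 1 word, "nop" = 1 cycle / 1 word.
def delay_sequence(cycles):
    seq = ["goto $+1"] * (cycles // 2)
    if cycles % 2:
        seq.append("nop")
    return seq

print(delay_sequence(5))  # -> ['goto $+1', 'goto $+1', 'nop']
```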

See <http://www.piclist.com/techref/piclist/codegen/delay.htm>

------
gfodor
Honestly I found this part the most interesting:

> Late in the product cycle (after Final Beta), upper management reversed
> their earlier decision and decide not to support the B1 chip after all.

After all that work on compilers, processes, tools, etc., it was all scrapped
anyway. I wonder if there is a lesson to be learned there. (The usual startup
mantras don't really seem to apply!)

~~~
rlpb
If it was the right thing to do, then it was the right thing to do despite any
previous work. See: <http://en.wikipedia.org/wiki/Sunk_cost>

~~~
paulgerhardt
Sure, decisions were made on the battlefield that, at the time and considering
the circumstances, were optimal (for local maxima). But this kind of glib
remark doesn't really prevent anyone else from making the same mistakes.

I don't know what the correct approach should have been, but it seems that
changes to support other architectures should have been made independently
from other fixes and tagged such that they could have been backed out rather
than contaminating your source and pushing it to gold. (Remember this is
before the days of Windows Update!)

~~~
rlpb
> This kind of glib remark doesn't really prevent anyone else from making the
> same mistakes.

Where's the mistake? Wouldn't backing out the workarounds introduce new
potential for error due to other changes made after the workarounds were put
in? Why do the workarounds need to be backed out? They don't break anything.
If it ain't broke...

My point is that it wasn't really a mistake; it was a change in circumstance.

------
alex_c
Fascinating read.

Does anyone have any examples or explanations for how these CPU bugs are
caused? I probably have enough knowledge of CPU architecture from school to
follow an explanation, but not enough to make my own guess.

~~~
1amzave
Circuit-level (layout) problems I'd guess make up a decent portion. I heard
(from a guy who worked on it) about a bug in a prototype version of a
processor from a major company that didn't cause any correctness problems but
was a major performance problem: one of its four cache ways simply didn't have
its power rails connected.

Some CPU verification code I wrote on an internship a couple years ago
discovered a few bugs in a certain fairly widely-used processor, though I'm
pretty sure they were all logic-level problems (i.e. RTL bugs, not circuit
level ones)...

- The L1 D-cache tracked clean/dirty status at half-cache-line granularity,
and if you did a store (with just the right timing) to one half of a cache
line you had just explicitly cleaned with a cache-clean instruction, the dirty
bit wouldn't get set on that half line, so as soon as the cache got flushed
the data written by the store was lost.

- The prefetcher would shut down sometimes as a power saving technique, but
if you laid out the right sequence of cache operations and branches in the
last 32 bytes of a 4KB page, sometimes concurrent TLB misses would cause it to
not get re-enabled, meaning the processor would lock up, stop fetching
instructions and just sit there dead in the water until an interrupt came in
(assuming interrupts were enabled).

There were a couple more, but I thought those were the more interesting ones.
Granted, these weren't bugs that were likely to be encountered in normal usage
for various reasons (in addition to being _extremely_ difficult to reproduce
sometimes -- i.e. on one in particular you could run the exact same sequence
of instructions from system power-on and sometimes it happened, sometimes it
didn't), but bugs nonetheless.

------
rwmj
The article also explains why NOP is 0x90 on the 8086. On the Z80, NOP is 0x00
which I always thought was more logical (since you can zero out memory or
code).
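A quick sketch of the encoding the article describes (register names per the 8086 manual): opcodes 0x90-0x97 encode XCHG AX, r16 with the register number in the low three bits, and register 0 is AX itself, so 0x90 exchanges AX with AX -- no visible effect, hence NOP.

```python
# 8086 one-byte XCHG row: 0x90 + reg encodes XCHG AX, r16.
REGS16 = ["AX", "CX", "DX", "BX", "SP", "BP", "SI", "DI"]

def decode_90_row(opcode):
    assert 0x90 <= opcode <= 0x97
    return f"XCHG AX, {REGS16[opcode & 0b111]}"

print(decode_90_row(0x90))  # -> XCHG AX, AX  (the NOP)
print(decode_90_row(0x93))  # -> XCHG AX, BX
```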

~~~
nitrogen
One could make a case against using 0x00 as NOP to prevent a rogue program
from NOPping its way across zero-filled pages and into other parts of memory.

~~~
RodgerTheGreat
If that was the goal, wouldn't you want to reserve 0x00 as a HALT or some sort
of interrupt-firing instruction?

~~~
nitrogen
If 0x00 isn't a valid opcode, as soon as one is hit, the CPU will fire an
invalid instruction interrupt (if it supports invalid instruction interrupts),
allowing the operating system (if any) to do something about it.

However, since the original comment was about the Z80 which according to
<http://www.z80.info/decoding.htm> treats invalid instructions as a NOP, the
point is moot.

By the way, z80.info is the type of site that made me fall in love with the
web -- it's full of information compiled by people doing it not for AdSense
impressions, but just for the love of sharing knowledge. I miss the days when
all of my Google searches would take me to these sites instead of the
contentless content farms of today.

------
wallflower
Fascinating read even though I never had the balls to program ASM.

One of the commenters said that he would buy a book of this story and more and
someone replied that there is in fact a book from him already:

"The Old New Thing: Practical Development Throughout the Evolution of
Windows" by Raymond Chen

<http://www.informit.com/store/product.aspx?isbn=0321440307>

------
JoeAltmaier
There were 8 NOPs on the x86: XCHG AX,AX was the 1-byte 0x90, and XCHG BX,BX
etc. were the 2-byte 0x87 0xXX, where the second byte selected the
general-purpose registers.

Very strange for a frequency-encoded instruction set to let such short opcode
sequences go to waste.
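A sketch of those eight encodings (ModRM layout per the 8086 manual, helper name my own): each register exchanged with itself is a NOP, with AX getting the special 1-byte form and the rest taking 0x87 plus a ModRM byte with mod=11 and reg equal to rm.

```python
# Encode XCHG r16, r16 with the same register on both sides.
def xchg_self(reg):  # reg: register number 0..7, 0 = AX, 3 = BX
    if reg == 0:
        return bytes([0x90])               # XCHG AX, AX -> 1-byte NOP
    modrm = 0b11000000 | (reg << 3) | reg  # mod=11, reg == rm
    return bytes([0x87, modrm])

print(xchg_self(0).hex())  # -> 90
print(xchg_self(3).hex())  # -> 87db  (XCHG BX, BX)
```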

~~~
ars
Remember that the bits that make up the instructions are not actually numbers
- they are switches. I.e. each bit in the number turns on a section of the
CPU, then that section of the CPU checks the next bit and turns on the next
part of the CPU, following it down till you get to the proper component that
carries out the instruction.

So by having a pattern to your instructions you can optimize the layout of the
CPU. XCHG AX,AX does exactly that, with no visible effect, but the action is
carried out.

That was then.

These days CPUs decode machine language into microcode, so you can pick any
number for the opcode. The microcode is still switches based on the bits, but
you don't see that.

~~~
jbri
In fact, once the 486 hit, 0x90 was an explicit NOP (1 cycle instead of the 3
that an XCHG usually takes, won't stall the pipeline even if other stuff
touches AX, etc.).

Similarly, on x64 0x90 is a true no-op, while stuff like XCHG EBX, EBX clears
the upper 32 bits of the register.

