
Why would useless MOV instructions speed up a tight loop in x86_64 assembly? - nkurz
https://github.com/tangentstorm/coinops/tree/junkops
======
rayiner
This is Core 2, so it still has the P6's microarchitectural limitation that
it can only read two (or three?) values from the register file each cycle.
But it's a 4-way processor, so it can potentially need up to 8 operands. If
the other operands are on the bypass networks, it's fine. If not, the CPU
stalls. My guess would be that the MOV has the effect of keeping an operand on
the bypass network long enough to avoid a stall. It's a total guess, but it
would explain why the effect is so sensitive to instruction ordering.

~~~
microarchitect
Sorry, this explanation is almost surely incorrect. How long something is
available on the bypass network is determined by how long the instruction that
produces the value takes to "exit" the pipeline. I can't imagine any scenario
where a consumer instruction causes a producer instruction (i.e., an
instruction "ahead" of it) to stall. Note this would be a dangerous design
point because of the risk of deadlocks.

What's the source for your claim that the Core uarch's register file is
underdesigned in comparison to the dispatch width? I'd be extremely surprised
if this were the case. Last time I looked at the data, about 50-70% of the
reads go to the register file _not_ the bypass network.

~~~
rayiner
The P6 has only two read ports in its permanent register file for operand
values:
[http://www.cs.tau.ac.il/~afek/p6tx050111.pdf](http://www.cs.tau.ac.il/~afek/p6tx050111.pdf)
(p. 36). P-M upped it to three, and Sandy Bridge removed the limitation
completely.

Intel's optimization manual describes the stall:
[http://www.intel.com/content/dam/doc/manual/64-ia-32-archite...](http://www.intel.com/content/dam/doc/manual/64-ia-32-architectures-optimization-manual.pdf) (3.5.2.1, "ROB Read Port Stalls").

The optimization manual mentions examples of the stall occurring when e.g.
often-used constants are stored in registers, or when a load is hoisted "too
high" and the value "goes cold" before its consumers use it.

Agner Fog's microarchitecture manual has a discussion starting on pp. 69 and 84:
[http://www.agner.org/optimize/microarchitecture.pdf](http://www.agner.org/optimize/microarchitecture.pdf).
Note his use of an unnecessary MOV to "refresh" a register to avoid the stall.

I only glanced at the code quickly, but the comment about how he got rid of a
load by holding a value in a register made me think the load was keeping the
value from "going cold." Of course, I didn't profile it so I'm probably
completely wrong...

~~~
microarchitect
Thanks for those very interesting links. But I still don't think that is what
is going on here.

In designs which rename using the ROB, the register file holds values produced
by instructions which are completed and retired, the ROB holds values from
instructions that are completed but _not_ retired, and the bypass network
supplies values from instructions currently completing.

What Agner is doing in his example with the seemingly useless instruction is
transferring a value from the register file to the ROB so that
instructions which try to read logical register ECX will now source it from
the ROB instead of the register file. But when I look at the code in the stack
overflow question, _nothing actually reads_ from s1. So these are even "more
useless" instructions than Agner's example.
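That three-tier view (bypass network → ROB → register file) can be sketched as
a toy classifier; the one-cycle bypass threshold below is illustrative, not
Core 2's actual pipeline timing:

```python
# Toy model: where a consumer sources an operand in a P6-style
# ROB-renaming design. Thresholds are illustrative only.

def operand_source(cycles_since_writeback, cycles_to_retire):
    """Classify the source of a value produced cycles_since_writeback
    cycles ago by an instruction that retires in cycles_to_retire cycles."""
    if cycles_since_writeback <= 1:   # result still on the forwarding paths
        return "bypass network"
    if cycles_to_retire > 0:          # completed but not yet retired
        return "ROB"
    return "register file"            # retired: the port-limited register file

# A freshly produced value is bypassed:
print(operand_source(0, 5))   # bypass network
# A completed-but-unretired value comes from the ROB:
print(operand_source(4, 2))   # ROB
# Once the producer retires, readers hit the register file:
print(operand_source(20, 0))  # register file
```

Agner's "useless" MOV works by re-writing the logical register, so subsequent
readers source it from the ROB again instead of the register file.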

Some people have already mentioned instruction alignment issues, so that is
one likely explanation. There are a whole bunch of other possible issues
involving the scheduler and dispatch restrictions. For example, I've seen
processors where there were two pipelines with slightly different instruction
schedulers. So adding a useless instruction like this might push your
bottleneck instruction into a pipe with a scheduler that is slightly better
for your code. Sometimes bypassing across different pipes is more expensive
than within the same pipe, so again the useless instruction might push some
instructions into pipes that have more of their sources. It could be any one of a
number of reasons and it's going to be very hard to tell from the outside
without knowing the details of the microarchitecture.

~~~
rayiner
For some reason I thought I read in the original question that he'd replaced
the MOV with an equivalent string of NOOPs, but now that I read the example
again I clearly just made that up in my head... In that case, I agree that
it's probably an instruction alignment issue, specifically the MOV pushing
some group of instructions to align better into the 16-byte fetch/decode
window. It'd be interesting if someone could run the code on Sandy Bridge+ and
see if the useless MOV still helps. The decoded u-op cache should take a lot
of the instruction alignment issues off the table.

------
conductor
Sorry for the off topic discussion, but I would like to give some attention to
FreePascal. In my opinion, it is a great piece of software, and is so
underrated. If you find C/C++ too error-prone or hard to learn, Object Pascal
is a very good alternative. FreePascal is a multi-platform, Delphi-compatible
Object Pascal compiler that generates well-optimized native code for
multiple architectures (including ARM) and has plenty of libraries. If you
haven't already, please give Lazarus[1] a try; it's a nice RAD IDE (very
similar to Borland Delphi 7) shipped with the FreePascal compiler.

[1] - [http://lazarus.freepascal.org](http://lazarus.freepascal.org)

~~~
sliverstorm
Does this have _anything whatsoever_ to do with the posted topic? There must
be some connection, and I'm twisting my brain trying to figure out what it is.

~~~
nkurz
The source file in question is in Pascal, and the author suggested downloading
Free Pascal to test it.

[https://github.com/tangentstorm/coinops/blob/junkops/sha256_...](https://github.com/tangentstorm/coinops/blob/junkops/sha256_mjw.pas)

~~~
sliverstorm
Oh, whoops. I saw the two ".py" extensions and figured it was a python
project.

------
rogerbinns
I highly recommend the Stanford EE380 video "Things CPU Architects Need To
Think About". Even though it is from 2004, much of the material is still
relevant. Bob Colwell notes that they had similar unexpected throughput
slowdowns when implementing the P6 and Netburst processor cores, and
discovered that adding random delays cleared them out. The causes of
throughput hiccups could sometimes be traced back hundreds or thousands of
instructions, but were extremely difficult to address. They also found that
manufacturing assembly lines had similar problems that were likewise solved by
adding delay.

[http://www.stanford.edu/class/ee380/ay0304.html](http://www.stanford.edu/class/ee380/ay0304.html)
(18 Feb 2004, site appears to be having issues at the moment)

~~~
acqq
It's such a good talk. There's even the point where he almost says "we're
sorry" regarding the Pentium 4, at a time when a lot of people still believed
Intel's marketing that it was "the best." I understood him, as the application
I worked on at the time was significantly faster on the Pentium M than on the
Pentium 4.

~~~
rogerbinns
I also love the bit about the floating point divide non-optimisation, and how
it turned into a success.

------
incision
Interesting, relevant paper mentioned in response to the linked SO question:

 _MAO - an Extensible Micro-Architectural Optimizer_ -
[http://static.googleusercontent.com/external_content/untrust...](http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/pubs/archive/37077.pdf)

~~~
kilowatt
There are figures on page 6 that answer the OP's question directly. And
there's also this money quote: "Building an accurate model for modern
processors is virtually impossible."

~~~
rayiner
A good way to test that would be to insert NOPs of the right length in place
of the MOV, to see whether it's an alignment or predictor-line-sharing issue.
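One practical detail for that experiment: x86 has recommended multi-byte NOP
encodings (the 0F 1F family, listed in Intel's optimization manual), so a MOV
of any length up to 9 bytes can be swapped for a single NOP of identical size,
keeping all later instruction addresses unchanged. A small helper sketch:

```python
# Recommended x86 multi-byte NOP encodings (from Intel's optimization
# manual), keyed by instruction length in bytes. Replacing the MOV with
# a NOP of the same length leaves every later instruction address, and
# hence alignment and predictor indexing, unchanged.
NOPS = {
    1: bytes([0x90]),
    2: bytes([0x66, 0x90]),
    3: bytes([0x0F, 0x1F, 0x00]),
    4: bytes([0x0F, 0x1F, 0x40, 0x00]),
    5: bytes([0x0F, 0x1F, 0x44, 0x00, 0x00]),
    6: bytes([0x66, 0x0F, 0x1F, 0x44, 0x00, 0x00]),
    7: bytes([0x0F, 0x1F, 0x80, 0x00, 0x00, 0x00, 0x00]),
    8: bytes([0x0F, 0x1F, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00]),
    9: bytes([0x66, 0x0F, 0x1F, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00]),
}

def nop_padding(n):
    """Return n bytes of NOP padding, greedily using the longest
    encodings so the fewest instructions are executed."""
    out = b""
    while n > 0:
        k = min(n, 9)
        out += NOPS[k]
        n -= k
    return out
```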

------
pbsd
The code keeps using high registers for no discernible reason, instead of
sticking with EAX--EDI; this means that each instruction is 1 byte longer than
it could be.

This has consequences for the instruction fetching and decoding circuitry,
where (in the Core 2) you can only read 16 instruction bytes per cycle (or 6
instructions). It is possible that the extraneous MOV instructions are just
resulting in a better instruction alignment.
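The alignment effect is easy to see with back-of-the-envelope arithmetic: with
16-byte aligned fetch windows, the number of windows a loop body straddles
depends on its start address. A sketch (the instruction lengths are made up
for illustration):

```python
def fetch_windows(start, insn_lengths, window=16):
    """Count how many aligned fetch windows a straight-line block of
    code touches, given its start address and the byte length of each
    instruction in it."""
    end = start + sum(insn_lengths)
    first = start // window
    last = (end - 1) // window
    return last - first + 1

# A hypothetical 24-byte loop body:
body = [3, 4, 2, 5, 3, 4, 3]
print(fetch_windows(0, body))   # 2 -- starts on a 16-byte boundary
print(fetch_windows(14, body))  # 3 -- starts 2 bytes before a boundary
```

Shifting the loop head by a few bytes (e.g. with an extra MOV earlier in the
function) can change how many fetch cycles the front end needs per iteration.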

~~~
nkurz
_The code keeps using high registers for no discernible reason_

The function[1] inside of which the assembly is located declares a lot of
variables as well. I don't know how well Free Pascal does register allocation,
but perhaps this avoids clobbering the registers it prefers. Alignment seems
like a likely candidate, but isn't likely to explain why the exact ordering of
the extra ops makes a difference.

It would help to see the assembly for the whole function.

[1]
[https://github.com/tangentstorm/coinops/blob/junkops/sha256_...](https://github.com/tangentstorm/coinops/blob/junkops/sha256_mjw.pas)

------
mistercow
I'm not sure that I'm buying that a 0.7% time difference is anything other
than noise after only 25 runs.

------
raymondh
Adding a MOV shifts subsequent instructions and can relieve deleterious
aliasing in the branch prediction buffer.

Here's a full explanation (with references and links to the diagnostic tools):
[http://stackoverflow.com/a/17906589/1001643](http://stackoverflow.com/a/17906589/1001643)
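The aliasing mechanism can be sketched with a toy direct-mapped predictor
table; the index function and addresses below are hypothetical, and real
indexing schemes vary by microarchitecture:

```python
def bpb_index(addr, index_bits=12):
    """Toy direct-mapped branch-prediction-buffer index: just the low
    bits of the branch address (real schemes are more involved)."""
    return addr & ((1 << index_bits) - 1)

# Two branches whose addresses differ by a multiple of the table size
# collide in the same entry (addresses are made up):
a, b = 0x401A2C, 0x402A2C
print(bpb_index(a) == bpb_index(b))      # True: they alias

# A 3-byte MOV inserted between them shifts only the later branch:
print(bpb_index(a) == bpb_index(b + 3))  # False: aliasing relieved
```

Because the MOV moves every subsequent branch, whether it helps or hurts
depends on exactly which branches end up sharing entries afterwards, which
would explain the sensitivity to where the junk instruction is placed.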

------
DannyBee
The answer can be one of a billion reasons: dependency chain breaking,
register renaming deficiencies, alias breaking, etc.

First step needs to be the hardware performance counters. Usually it will tell
you something interesting, but leave you scratching your head.

Then you find a friend at Intel, who laughs, and explains some wildly esoteric
detail of the Core 2 processor that causes this.

This is why scheduling and register allocation models in compilers don't
bother with ILP based modeling for x86 processors. Even if you modeled the
_externally published_ architecture perfectly, including port restrictions,
decode lengths, branch stalls, cycle counts, register requirements, etc, the
processor just does whatever it wants internally anyway, because they don't
really give you all the details.

------
rwallace
In my experience the timing jitter on a general-purpose PC - the typical
difference between runs of the same code on the same data - is in the ballpark
of a couple of percent (and no, back-to-back runs aren't fully independent of
each other so you can't average it out over a couple of dozen runs).

The claimed timing difference was less than one percent so while I certainly
can't say it wasn't real, I'm not seeing any evidence that it was.
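As a sanity check, the noise-floor argument is just a coefficient-of-variation
comparison; the run times below are made-up numbers with roughly the jitter
described:

```python
import statistics

def relative_spread(times):
    """Coefficient of variation: sample stdev as a fraction of the mean."""
    return statistics.stdev(times) / statistics.mean(times)

# Illustrative run times (seconds) with a couple percent of jitter:
runs = [1.000, 1.018, 0.991, 1.007, 1.023, 0.996, 1.012, 1.004]
print(f"run-to-run spread: {relative_spread(runs):.1%}")
# A 0.7% difference in means sits below a ~1-2% noise floor, so
# resolving it would take many more (and more independent) samples.
```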

------
Havoc
From the stackoverflow page:

>randomly inserting nop instructions in programs can easily increase
performance by 5% or more, and no, compilers cannot easily exploit this.

Could someone please help me understand why this isn't exploitable? Surely one
could just tack a module onto the end of the compiler that bruteforces this?
Or something like a genetic algo to trial & error it.
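A brute-force harness along those lines is simple to sketch; the hard part is
that the payoff is machine- and binary-specific, so it has to be rediscovered
by measurement every time. The `measure` callback here is hypothetical -- in
real use it would assemble, run, and time the padded binary:

```python
def search_nop_padding(measure, max_pad=15, trials=3):
    """Brute-force search over NOP-padding amounts inserted before the
    hot loop, keeping the fastest. `measure(pad)` is a caller-supplied
    (hypothetical) function that builds the program with `pad` bytes
    of NOPs and returns one timed run; we keep the best of `trials`."""
    best_pad, best_time = 0, float("inf")
    for pad in range(max_pad + 1):
        t = min(measure(pad) for _ in range(trials))
        if t < best_time:
            best_pad, best_time = pad, t
    return best_pad, best_time

# Stand-in cost model for demonstration: pretend padding of 11 bytes
# aligns the loop head and runs 5% faster.
fake = lambda pad: 1.00 - (0.05 if pad == 11 else 0.0)
print(search_nop_padding(fake))  # (11, 0.95)
```

Compilers avoid this not because it is hard to automate but because the result
only holds for one microarchitecture (and is hostage to timing noise), so it
does not survive being baked into a shipped binary.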

------
ryen
Shouldn't something like this be measured using #CPU ops per CPU-time instead
of just 'seconds'?

