Why would useless MOV instructions speed up a tight loop in x86_64 assembly? (github.com/tangentstorm)
120 points by nkurz on July 27, 2013 | 41 comments



This is Core 2, so it still has the P6's microarchitectural limitation that it can only read two (or three?) values from the register file each cycle. But it's a 4-way processor, so it can potentially need up to 8 operands. If the other operands are on the bypass network, it's fine. If not, then the CPU stalls. My guess would be that the MOV has the effect of keeping an operand on the bypass network long enough to avoid a stall. Totally guessing here, but that might explain why it's so sensitive to instruction ordering.
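For instance (register choices here are arbitrary, just to illustrate the counting), a group of four independent two-source ALU ops issued in the same cycle could need up to eight source values at once:

    add  eax, ebx
    add  ecx, edx
    add  esi, edi
    add  r10d, r11d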


Sorry, this explanation is almost surely incorrect. How long something is available on the bypass network is determined by how long the instruction that produces the value takes to "exit" the pipeline. I can't imagine any scenario where a consumer instruction causes a producer instruction (i.e., an instruction "ahead" of it) to stall. Note this would be a dangerous design point because of the risk of deadlocks.

What's the source for your claim that the Core uarch's register file is underdesigned in comparison to the dispatch width? I'd be extremely surprised if this were the case. Last time I looked at the data, about 50-70% of the reads go to the register file not the bypass network.


The P6 has only two read ports in its permanent register file for operand values: http://www.cs.tau.ac.il/~afek/p6tx050111.pdf (p. 36). P-M upped it to three, and Sandy Bridge removed the limitation completely.

Intel's optimization manual describes the stall: http://www.intel.com/content/dam/doc/manual/64-ia-32-archite... (3.5.2.1, "ROB Read Port Stalls.").

The optimization manual mentions examples of the stall occurring when e.g. often-used constants are stored in registers, or when a load is hoisted "too high" and the value "goes cold" before its consumers use it.

Agner Fog has a discussion starting on pp. 69 and 84 of his microarchitecture manual: http://www.agner.org/optimize/microarchitecture.pdf. Note his use of an unnecessary MOV to "refresh" a register to avoid the stall.
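For anyone who doesn't want to dig through the PDF, the refresh trick looks roughly like this; this is my own illustrative fragment in Intel syntax, not Agner's exact example, and the register choices are arbitrary:

    ; ECX holds a constant that is read every iteration; on Core 2 those reads
    ; compete for the permanent register file's two or three read ports.
    .loop:
        mov   eax, [rbx+rdx]
        add   eax, ecx        ; read of the long-lived ("cold") value in ECX
        mov   [r8+rdx], eax
        mov   ecx, ecx        ; "useless" refresh: rewriting ECX lets later reads of it
                              ; be sourced from the ROB/bypass instead of the register file
        add   rdx, 4
        cmp   rdx, r9
        jb    .loop

The copy does no useful work; the point is only that the write makes the value "recent" again.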

I only glanced at the code quickly, but the comment about how he got rid of a load by holding a value in a register made me think the load was keeping the value from "going cold." Of course, I didn't profile it so I'm probably completely wrong...


Thanks for those very interesting links. But I still don't think that is what is going on here.

In designs which rename using the ROB, the register file holds values produced by instructions which are completed and retired, the ROB holds values from instructions that are completed but not retired, and the bypass network supplies values from instructions currently completing.

What Agner is doing in his example with the seemingly useless instruction is transferring a value from the register file to the ROB, so that instructions which try to read logical register ECX will now source it from the ROB instead of the register file. But when I look at the code in the Stack Overflow question, nothing actually reads from s1. So these are even "more useless" instructions than Agner's example.

Some people have already mentioned instruction alignment issues, so that is one likely explanation. There are a whole bunch of other possible issues involving the scheduler and dispatch restrictions. For example, I've seen processors where there were two pipelines with slightly different instruction schedulers. So adding a useless instruction like this might push your bottleneck instruction into a pipe with a scheduler that is slightly better for your code. Sometimes bypassing across different pipes is more expensive than within the same pipe, so again the useless instruction might push some instructions into pipes that have more of their sources. It could be any one of a number of reasons, and it's going to be very hard to tell from the outside without knowing the details of the microarchitecture.


For some reason I thought I read in the original question that he'd replaced the MOV with an equivalent string of NOPs, but now that I read the example again I clearly just made that up in my head... In that case, I agree that it's probably an instruction alignment issue, specifically the MOV pushing some group of instructions to align better into the 16-byte fetch/decode window. It'd be interesting if someone could run the code on Sandy Bridge+ and see if the useless MOV still helps. The decoded µop cache should take a lot of the instruction alignment issues off the table.
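If alignment is the culprit, a cheaper way to confirm it than hunting for magic MOVs would be to force the loop top onto a 16-byte boundary explicitly. A minimal sketch in NASM syntax (dummy loop body, not the actual SHA-256 code):

    %use smartalign         ; make ALIGN pad with NOPs instead of zero bytes
            align 16        ; so .top starts a fresh 16-byte fetch window
    .top:   add   eax, [rsi]
            add   rsi, 4
            dec   ecx
            jnz   .top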


Sorry, what's a bypass network in this context?


The outputs and inputs of execution units are connected by buses that allow results to be communicated directly to dependent instructions without first being written to the register file.
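For instance, in a fragment like this:

    add  eax, ebx      ; result comes out of the ALU at the end of this cycle
    sub  ecx, eax      ; a dependent instruction can pick EAX up off the bypass bus
                       ; the next cycle, before the value is ever written back to the register file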


I see... well, it makes sense to build something like that.

I still don't understand why a pointless mov instruction is superior to a nop, though, or why a stall differs from a nop (assuming it does).


I don't remember the exact way it works on Core 2, but generally a NOP is thrown out very early in the pipeline. So it won't usually have an effect on the execution order of other instructions. But a MOV has source operands (here, a register and an address) and produces an output value. It'll sit in the issue queues until its inputs are available, and in doing so it might cause some other instructions to execute in a different order. If that pushes two other dependent instructions closer together, that might avoid a stall.

As I said, I'm just guessing. But Core 2 has several architectural quirks that can make it sensitive to things like this. Most are cases where a useless instruction would not affect things (e.g., reading a 32-bit register after writing the lower 16 bits of it), but this one is such that a useless instruction could trigger or avoid it.

Could also be something more mundane, like the load kicking off the prefetcher for a value that's later needed. As I said, I'm throwing out ideas. Someone needs to test.
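To make the NOP-versus-MOV distinction concrete (illustrative instructions, not necessarily the exact junk op from the repo):

    nop                   ; no inputs or outputs; as noted above, it's dropped early and
                          ; never competes for scheduler or execution resources
    mov  [rsp-8], r8d     ; reads R8D (plus RSP for the address) and occupies a store slot,
                          ; so it sits in the issue queue and can perturb the order in which
                          ; the surrounding instructions execute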


The last time I got my hands dirty at this level was almost 20 years ago, optimizing a software renderer's texture-mapping code for the original Pentium, so I don't know much about the details these days. But it sounds like the MOV would touch the data item, which might keep it from getting aged out of the cache-like thing in question (ETA: the "bypass network"), so it would still be available when it was actually needed a few instructions later.

A stall can last several clock cycles, potentially quite a bit longer than a NOP.

In the code I worked on, rearranging the order of MOV instructions made the inner loop execute twice as fast, just by avoiding stalls. I think at one point I added a NOP to avoid a stall, but I don't remember if it survived into the final version of the code.
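For the curious, a typical original-Pentium case was a pairing failure between the U and V pipes. Roughly, from memory, so treat this as a sketch:

    ; dependent back-to-back instructions cannot issue as a U/V pair:
    mov  eax, [esi]
    add  eax, ebx        ; waits for EAX, so these two cost two clocks

    ; interleaving two independent chains lets adjacent instructions pair:
    mov  eax, [esi]
    mov  ecx, [edi]      ; independent, pairs with the load above
    add  eax, ebx
    add  ecx, edx        ; pairs with the add above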


Could it have something to do with stalling while register renaming?

r8d/r9d are used "a lot", so it may have something to do with dependencies between steps, especially from the end of the loop to the beginning (if I understood correctly).


Sorry for the off-topic discussion, but I would like to give some attention to FreePascal. In my opinion, it is a great piece of software, and it is so underrated. If you find C/C++ too error-prone or hard to learn, Object Pascal is a very good alternative. FreePascal is a multi-platform, Delphi-compatible Object Pascal compiler, can generate pretty optimized native code for multiple architectures (including ARM), and has plenty of libraries. If you haven't already, please give Lazarus[1] a try; it's a nice RAD IDE (very similar to Borland Delphi 7) shipped with the FreePascal compiler.

[1] - http://lazarus.freepascal.org


I love Pascal, but once you move beyond the Turbo Pascal features of Free Pascal and start trying to use the Lazarus IDE and the Delphi look-alike features, the going gets really rough. The documentation is terrible. There's a lot there, but it's all fragmented and half-baked.

They need to freeze development and go look at the documentation for python, perl, racket or a whole host of other languages.


Does this have anything whatsoever to do with the posted topic? There must be some connection, and I'm twisting my brain trying to figure out what it is.


The source file in question is in Pascal, and the author suggested downloading Free Pascal to test it.

https://github.com/tangentstorm/coinops/blob/junkops/sha256_...


Oh, whoops. I saw the two ".py" extensions and figured it was a python project.


Sort of, from the Github: "coinops is a repo for experimenting with optimization of freepascal code via x86-64 assembly language, using an iterated sha256 program (the hashing algorithm used in bitcoin) as a subject.

This junkops branch contains an experiment that shows how adding useless junk operations (moving the contents of an arbitrary register into ram) can increase the speed of the executed code."


If I may also use this opportunity to advertise a Pascal-like language, as it is also severely underrated. The language's name is Nimrod [1]; it compiles to C so it is very portable, uses whitespace to delimit blocks, is statically typed with generics, and has great support for metaprogramming. I am developing an IDE for it called Aporia [2], written in Nimrod; it definitely doesn't have as many features as Lazarus, but it works well as a simple editor. Other Nimrod projects can be found in the nimrod-code organisation on GitHub[3].

[1] - http://nimrod-code.org

[2] - https://github.com/nimrod-code/aporia

[3] - https://github.com/nimrod-code/


Sorry to be so frank, but why are all of your posts advertisements for Nimrod in non-Nimrod threads? I hope it's just infatuation or excitement (or both), rather than some other more nefarious underlying reason. The language looks like it has possible merit, but the message comes off as spammy.


I must apologise. I didn't want my advertisements to come off as spammy. I am genuinely excited about the language and I dislike that it gets so little attention. A lot of people who use it say things like "Why is this language so underrated?", "You should be advertising it more!", etc., which is what I am trying to do. Perhaps I shouldn't be doing it this way, but other people advertise languages like Rust and Go this way too.


No worries. I do have some suggestions though, from my own perspective as an always-learning developer. I took interest in Go because of the website, the tour, and the hundred million blog posts that explain why Go is good. This may be code beauty, a neat trick, or even benchmarks. But never would I try a language based only on a comment like "you should check out Go." Why? Because the people who do that are, for the most part, rather ignorant fanboys... the kind who try to bang nails with a screwdriver, if you catch my drift.

My time, to me, is valuable... I do a lot of skimming and am easily disinterested. Keep in mind many developers are like this. What I need to become interested is a concise, well-laid-out page telling me why Nimrod is better than what I have. I don't want to pore through docs or src, I want you to show me. Show me some snippets, some comparisons, some benchmarks, etc. And maybe not even all at the same time. Plant the seed and keep us interested. I didn't try out Go, or Rust, or even Python after the first mention, but after a progression of reads that told me it was right for me. Submit them here, to reddit, to your blog, etc.

Also, don't narrow your focus to one language. It's not a competition... see what you like about other languages, what you don't, what works well, etc. With this, you learn to pick the right tool for the job. Further, you can borrow the good parts for Nimrod :).


Thanks for the awesome advice! Indeed, I have been planning to do that for a long time now, but writing a good blog post takes time and in most cases it is easier to write comments on threads such as these.


Does it require a runtime? I mean, can I develop an OS with it without shipping the runtime? What language do you know where that is possible?


You can develop an OS with it: https://github.com/dom96/nimkernel


hmm, but that seems to depend on the C runtime. Nevertheless, it's on my bookmarks and I'll play around with it.


I highly recommend the Stanford EE380 video "Things CPU Architects Need To Think About". Even though it is from 2004, much of the material is still relevant. Bob Colwell notes that they had similar unexpected throughput slowdowns when implementing the P6 and NetBurst processor cores, and discovered that adding random delays cleared them out. The causes of the throughput hiccups could be traced back, sometimes hundreds or thousands of instructions, but were extremely difficult to address. They also found that manufacturing assembly lines had similar problems that were likewise addressed by adding delay.

http://www.stanford.edu/class/ee380/ay0304.html (18 Feb 2004, site appears to be having issues at the moment)


It's such a good talk. There's even the point where he almost says "we're sorry" regarding the Pentium 4, at a time when a lot of people still believed Intel's marketing that it was "the best." I understood him, as the application I worked on at the time was significantly faster on the Pentium M than on the Pentium 4.


I also love the bit about the floating point divide non-optimisation, and how it turned into a success.


Thanks for the link! Looks interesting.

Here's an alternative link that actually works: http://lang.stanford.edu/courses/ee380/2003-2004/040218-ee38...


Interesting, relevant paper mentioned in response to the linked SO question:

MAO - an Extensible Micro-Architectural Optimizer - http://static.googleusercontent.com/external_content/untrust...


There are figures on page 6 that answer the OP's question directly. And there's also this money quote: "Building an accurate model for modern processors is virtually impossible."


A good way to test that would be to insert NOPs of the right length in place of the MOV, to see if it's an alignment or predictor-line-sharing issue.
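For example, assuming the junk op is a 5-byte store such as mov [rsp-8], r8d (the exact instruction in the repo may differ):

    ; junk op: REX + opcode + ModRM + SIB + disp8 = 5 bytes
    mov  [rsp-8], r8d

    ; same-length replacement that does no architectural work:
    db   0x0f, 0x1f, 0x44, 0x00, 0x00   ; Intel's recommended 5-byte NOP,
                                        ; i.e. nop dword [rax+rax*1+0]

If the speedup survives the substitution, it's about code placement (fetch windows, predictor indexing); if it disappears, the MOV's execution itself matters.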


Very interesting indeed. In particular: "we find that most performance results stay flat, in particular for the Nop Killer pass (NOPKILL), which leads to the conclusion that most alignment directives are not helping"


The code keeps using high registers for no discernible reason, instead of sticking with EAX-EDI; this means that each instruction is 1 byte longer than it could be (a REX prefix is needed to address r8-r15).

This has consequences for the instruction fetch and decode circuitry, where (in the Core 2) you can only read 16 instruction bytes (or 6 instructions) per cycle. It is possible that the extraneous MOV instructions just result in a better instruction alignment.
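For example:

    add  ecx, eax        ; 01 C1     = 2 bytes
    add  r9d, r8d        ; 45 01 C1  = 3 bytes: an extra REX prefix byte is needed to reach r8d-r15d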


> The code keeps using high registers for no discernible reason

The function[1] inside of which the assembly is located declares a lot of variables as well. I don't know how well Free Pascal does register allocation, but perhaps this avoids clobbering the registers it prefers. Alignment seems like a likely candidate, but isn't likely to explain why the exact ordering of the extra ops makes a difference.

Would help to see the assembly for the whole function.

[1] https://github.com/tangentstorm/coinops/blob/junkops/sha256_...


I'm not sure that I'm buying that a 0.7% time difference is anything other than noise after only 25 runs.


Adding a MOV shifts subsequent instructions and can relieve deleterious aliasing in the branch prediction buffer.

Here's a full explanation (with references and links to the diagnostic tools): http://stackoverflow.com/a/17906589/1001643


The answer could be any one of a billion reasons: dependency chain breaking, register renaming deficiencies, alias breaking, etc.

The first step needs to be the hardware performance counters. Usually they will tell you something interesting, but leave you scratching your head.

Then you find a friend at Intel, who laughs and explains some wildly esoteric detail of the Core 2 processor that causes this.

This is why scheduling and register allocation models in compilers don't bother with ILP-based modeling for x86 processors. Even if you modeled the externally published architecture perfectly, including port restrictions, decode lengths, branch stalls, cycle counts, register requirements, etc., the processor just does whatever it wants internally anyway, because they don't really give you all the details.


In my experience the timing jitter on a general-purpose PC - the typical difference between runs of the same code on the same data - is in the ballpark of a couple of percent (and no, back-to-back runs aren't fully independent of each other so you can't average it out over a couple of dozen runs).

The claimed timing difference was less than one percent so while I certainly can't say it wasn't real, I'm not seeing any evidence that it was.


From the stackoverflow page:

> randomly inserting nop instructions in programs can easily increase performance by 5% or more, and no, compilers cannot easily exploit this.

Could someone please help me understand why this isn't exploitable? Surely one could just tack a module onto the end of the compiler that brute-forces this? Or use something like a genetic algorithm to trial-and-error it.


Shouldn't something like this be measured in CPU ops per unit of CPU time, instead of just 'seconds'?





