This is Core 2, so it still has the P6's micro architectural limitation that it ...

microarchitect · on July 28, 2013

Sorry, this explanation is almost surely incorrect. How long something is available on the bypass network is determined by how long the instruction that produces the value takes to "exit" the pipeline. I can't imagine any scenario where a consumer instruction causes a producer instruction (i.e., an instruction "ahead" of it) to stall. Note this would be a dangerous design point because of the risk of deadlocks.

What's the source for your claim that the Core uarch's register file is underdesigned in comparison to the dispatch width? I'd be extremely surprised if this were the case. Last time I looked at the data, about 50-70% of the reads go to the register file not the bypass network.

rayiner · on July 28, 2013

The P6 has only two read ports in its permanent register file for operand values: http://www.cs.tau.ac.il/~afek/p6tx050111.pdf (p. 36). P-M upped it to three, and Sandy Bridge removed the limitation completely.

Intel's optimization manual describes the stall: http://www.intel.com/content/dam/doc/manual/64-ia-32-archite... (section 3.5.2.1, "ROB Read Port Stalls").

The optimization manual mentions examples of the stall occurring when e.g. often-used constants are stored in registers, or when a load is hoisted "too high" and the value "goes cold" before its consumers use it.

Agner Fog's microarchitecture manual has discussions starting on pp. 69 and 84: http://www.agner.org/optimize/microarchitecture.pdf. Note his use of an unnecessary MOV to "refresh" a register to avoid the stall.
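To make the read-port idea concrete, here is a toy model, not a cycle-accurate one. It assumes a register is "cold" once its value lives only in the permanent register file and "warm" while a recent write is still held in the ROB. The function names, the example registers, and the port count passed in are all illustrative assumptions, and the model ignores values forwarded between instructions in the same group.

```python
# Toy model of a register-file read-port limit (illustrative only).
# Assumption: reads of "warm" registers come from the ROB and do not use
# a register-file read port; reads of "cold" registers do.

def cold_reads(group, warm):
    """Count distinct cold registers read by a group of instructions.

    group: list of (dest, sources) tuples considered in the same cycle.
    warm:  set of register names whose latest value is still in the ROB.
    """
    cold = set()
    for _dest, sources in group:
        for reg in sources:
            if reg not in warm:
                cold.add(reg)
    return len(cold)


def stalls(group, warm, read_ports):
    """True if the group needs more register-file reads than there are ports."""
    return cold_reads(group, warm) > read_ports


# Hypothetical group of three instructions reading four long-lived registers.
group = [
    ("eax", ["ebx", "ecx"]),
    ("edx", ["esi"]),
    ("ebp", ["edi", "ecx"]),
]

print(stalls(group, warm=set(), read_ports=3))     # all four sources are cold
print(stalls(group, warm={"ecx"}, read_ports=3))   # ecx was recently rewritten
```

In this model, marking ecx as warm plays the role of the "refresh" MOV: it removes one register-file read from the group, which is enough to get under the assumed port limit.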

I only glanced at the code quickly, but the comment about how he got rid of a load by holding a value in a register made me think the load was keeping the value from "going cold." Of course, I didn't profile it so I'm probably completely wrong...

microarchitect · on July 28, 2013

Thanks for those very interesting links. But I still don't think that is what is going on here.

In designs which rename using the ROB, the register file holds values produced by instructions which are completed and retired, the ROB holds values from instructions that are completed but not retired, and the bypass network supplies values from instructions currently completing.
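As a rough sketch of that three-way split, the lookup can be written as a function of the producer's state. The state names and the function are made up for illustration and do not describe any particular core.

```python
from enum import Enum


class ProducerState(Enum):
    """Hypothetical lifecycle states of the instruction producing a value."""
    COMPLETING = 1  # result is being produced right now
    COMPLETED = 2   # result is held in the ROB, instruction not yet retired
    RETIRED = 3     # result has been committed to the register file


def operand_source(state):
    """Return where a consumer would read the operand from, per the split above."""
    if state is ProducerState.COMPLETING:
        return "bypass network"
    if state is ProducerState.COMPLETED:
        return "ROB"
    return "register file"


for state in ProducerState:
    print(state.name, "->", operand_source(state))
```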

What Agner is doing in his example with the seemingly useless instruction is transferring a value from the register file to the ROB so that instructions which try to read logical register ECX will now source it from the ROB instead of the register file. But when I look at the code in the Stack Overflow question, nothing actually reads from s1. So these are even "more useless" instructions than Agner's example.

Some people have already mentioned instruction alignment issues, so that is one likely explanation. There are a whole bunch of other possible issues involving the scheduler and dispatch restrictions. For example, I've seen processors where there were two pipelines with slightly different instruction schedulers. So adding a useless instruction like this might push your bottleneck instruction into a pipe with a scheduler that is slightly better for your code. Sometimes bypassing across different pipes is more expensive than within the same pipe, so again the useless instruction might push some instructions into pipes that have more of their sources. It could be any one of a number of reasons, and it's going to be very hard to tell from the outside without knowing the details of the microarchitecture.

rayiner · on July 28, 2013

For some reason I thought I read in the original question that he'd replaced the MOV with an equivalent string of NOPs, but now that I read the example again I clearly just made that up in my head... In that case, I agree that it's probably an instruction alignment issue, specifically the MOV pushing some group of instructions to align better into the 16-byte fetch/decode window. It'd be interesting if someone could run the code on Sandy Bridge+ and see if the useless MOV still helps. The decoded u-op cache should take a lot of the instruction alignment issues off the table.
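A quick way to see how a few extra bytes can change alignment is to count how many 16-byte blocks a run of instructions touches. This is only arithmetic on addresses. The instruction lengths and start addresses below are invented for illustration and are not taken from the code in the question.

```python
FETCH_BLOCK = 16  # bytes, the fetch/decode window size mentioned above


def fetch_blocks(start, lengths):
    """Return how many 16-byte blocks a run of instructions touches.

    start:   byte address of the first instruction.
    lengths: encoded length in bytes of each instruction (non-empty list).
    """
    end = start + sum(lengths)            # one past the last byte
    first = start // FETCH_BLOCK          # block holding the first byte
    last = (end - 1) // FETCH_BLOCK       # block holding the last byte
    return last - first + 1


# Hypothetical 14-byte loop body.
body = [3, 3, 4, 2, 2]

print(fetch_blocks(0x0D, body))      # body starts 3 bytes before a boundary
print(fetch_blocks(0x0D + 3, body))  # same body after a 3-byte instruction is inserted before it
```

In this made-up layout the body straddles two blocks at address 0x0D, but fits in one block once a 3-byte instruction in front of it shifts the start to 0x10.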

Filligree · on July 27, 2013

Sorry, what's a bypass network in this context?

rayiner · on July 27, 2013

The outputs and inputs of execution units are connected by busses that allow results to be communicated directly to dependent instructions without being written to the register file.

Filligree · on July 27, 2013

I see... well, it makes sense to build something like that.

I still don't understand why a pointless mov instruction is superior to a nop, though, or why a stall differs from a nop (assuming it does).

rayiner · on July 28, 2013

I don't remember the exact way it works on Core 2, but generally a NOP is thrown out very early in the pipeline. So it won't usually have an effect on the execution order of other instructions. But a MOV has source operands (address), and produces an output value. It'll sit in the issue queues until its inputs are available, and in doing so it might cause some other instructions to execute in a different order. If that pushes two other dependent instructions closer together, that might avoid a stall.

As I said, I'm just guessing. But Core 2 has several architectural quirks that can make it sensitive to things like this. Most are things where a useless instruction would not affect things (e.g. reading a 32-bit register after writing the lower 16 bits of it), but this one is such that a useless instruction could trigger or avoid it.

Could also be something more mundane, like the load kicking off the prefetcher for a value that's later needed. As I said, I'm throwing out ideas. Someone needs to test.

jleader · on July 27, 2013

The last time I got my hands dirty at this level was almost 20 years ago, optimizing a software renderer's texture mapping code for the original Pentium, so I don't know much about the details these days, but it sounds like the MOV would touch the data item, which might keep it from getting aged out of the cache-like thing in question (ETA: "bypass network"), so it would still be available when it was actually needed a few instructions later.

A stall can last several clock cycles, potentially quite a bit longer than a NOP.

In the code I worked on, rearranging the order of MOV instructions made the inner loop execute twice as fast, just by avoiding stalls. I think at one point I added a NOP to avoid a stall, but I don't remember if it survived into the final version of the code.

raverbashing · on July 28, 2013

Could it have something to do with stalling while register renaming?

r8d/r9d are used "a lot", so it may have something to do with dependencies between steps, especially from the end of the loop to the beginning (if I understood correctly).