

Replacing 32-bit loop variable with 64-bit introduces performance deviations - thisisnotmyname
http://stackoverflow.com/questions/25078285/replacing-32bit-loop-count-variable-with-64bit-introduces-crazy-performance-devi

======
mgraczyk
To elaborate on the justification for the answer:

    
    
        So Intel probably shoved popcnt into the same category to keep the processor design simple
    

In the processor design I work on, we do register dependency checks by
partitioning all instructions into a set of "timing classes" and checking the
dispatch delay needed between dependent register producers and consumers
across all possible timing class pairs. The delays vary depending on available
forwarding networks, resource conflicts, etc. Oftentimes we group
instructions into suboptimal timing classes to simplify other parts of the
design or just to make the dispatch logic simpler.

Intel's x86 core is waaaaay more complicated than the core I work on and has
far more instructions, so it's probably safe to say that they make these
suboptimal classifications often. I strongly suspect that the false dependency
was intentional and not a "hardware bug" as some of the StackOverflow comments
seem to suggest.

~~~
userbinator
I wouldn't classify it as intentional nor a "bug"; probably it's more of an
oversight, as it's mentioned in the article that AMD's CPUs don't have this
issue. Intel should definitely be made aware of this.

 _We can only speculate, but it's likely that Intel has the same handling for
a lot of two-operand instructions. Common instructions like add, sub take two
operands both of which are inputs. So Intel probably shoved popcnt into the
same category to keep the processor design simple._

On the other hand, MOV doesn't read both operands either.

~~~
caf
Reg-Reg MOV doesn't use an ALU, though.

It would be interesting to see if the Intel C Compiler knows about this false
dependency.

~~~
nkurz
'icpc' (the Intel C++ compiler) has equal performance for both of the test
cases, and it did choose to use different registers for each call. But it's
not clear if that's by design or by chance. In some ways, that's the boring
part. The interesting part (to me) is that both tests are much faster than
either version with g++.

Here's icpc 14.0.3 vs g++ 4.8.1 on a Sandy Bridge E5-1620 @ 3.60GHz and a
Haswell i7-4770 CPU @ 3.40GHz.

    
    
      nate@sandybridge:~/tmp$ g++ -O3 -march=native -std=c++11 popcnt-dependency.cpp -o popcnt-dependency
      nate@sandybridge:~/tmp$ popcnt-dependency 1
      unsigned	41959360000	0.608615 sec 	17.2289 GB/s
      uint64_t	41959360000	0.82312 sec 	12.739 GB/s
      nate@sandybridge:~/tmp$ icpc -O3 -march=native -std=c++11 popcnt-dependency.cpp -o popcnt-dependency
      nate@sandybridge:~/tmp$ popcnt-dependency 1
      unsigned	41959360000	0.182781 sec 	57.3679 GB/s
      uint64_t	41959360000	0.182638 sec 	57.4128 GB/s
    
      nate@haswell:~/tmp$ g++ -O3 -march=native -std=c++11 popcnt-dependency.cpp -o popcnt-dependency
      nate@haswell:~/tmp$ popcnt-dependency 1
      unsigned	41959360000	0.401225 sec 	26.1343 GB/s
      uint64_t	41959360000	0.75841 sec 	13.826 GB/s
      nate@haswell:~/tmp$ icpc -O3 -march=native -std=c++11  popcnt-dependency.cpp -o popcnt-dependency
      nate@haswell:~/tmp$ popcnt-dependency 1
      unsigned	41959360000	0.0843861 sec 	124.259 GB/s
      uint64_t	41959360000	0.0842836 sec 	124.41 GB/s
    

That would be incredible if true! But I think it's a bug, since the inner loop
looks far too short and doesn't seem to be repeating the popcnt's. I'm not
sure yet if it's a problem with the compiler or if the test case is abusing
something undefined.

~~~
nkurz
OK, it looks like 'icpc' has decided that it would be fastest to invert the
two loops: popcnt() once, then repeat the addition 10000 times. I'm neither a
language lawyer nor a friend of C++, so I'll refrain from trying to decide
whether this is a legal optimization. But a liberal sprinkling of 'volatile'
makes it do what was obviously intended. After this, the speeds are more
comparable, although 'icpc' retains a small (but much more plausible) lead:

    
    
      nate@sandybridge:~/tmp$ popcnt-dependency 1
      unsigned	41959360000	0.517827 sec 	20.2495 GB/s
      uint64_t	41959360000	0.518041 sec 	20.2412 GB/s
    
    
      nate@haswell:~/tmp$ popcnt-dependency 1
      unsigned	41959360000	0.351273 sec 	29.8507 GB/s
      uint64_t	41959360000	0.352914 sec 	29.712 GB/s
    

The other test I did was checking what Intel's IACA (a wonderful optimization
tool that you really should be using if you are not already) thought about the
g++ loop. It did _not_ notice the false dependency, and said the loops should
take the same amount of time. Does this suggest that the Intel compiler is just
getting lucky, or that Intel doesn't have great internal communication between
teams?
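
The loop interchange described above (popcnt once, then repeat the addition) can be blocked without marking the data 'volatile'. The following is a minimal sketch, not the code nkurz actually ran: it assumes GCC or Clang, and the names `compiler_barrier` and `bench_popcnt` are hypothetical. It swaps the 'volatile' approach for an empty asm statement with a "memory" clobber, which emits no instructions but stops the optimizer from assuming the buffer is unchanged between repeats.

```cpp
#include <cstddef>
#include <cstdint>

// Compiler-only barrier (GCC/Clang syntax, an assumption of this sketch).
// It emits no machine code, but the "memory" clobber tells the optimizer
// that memory may have changed, so the inner loop cannot be computed once
// and its result reused for every repeat.
static inline void compiler_barrier() {
    __asm__ __volatile__("" ::: "memory");
}

// Hypothetical benchmark kernel: count the set bits of the whole buffer,
// `repeats` times over.
std::uint64_t bench_popcnt(const std::uint64_t* buf, std::size_t n,
                           unsigned repeats) {
    std::uint64_t total = 0;
    for (unsigned r = 0; r < repeats; ++r) {
        for (std::size_t i = 0; i < n; ++i) {
            total += static_cast<std::uint64_t>(__builtin_popcountll(buf[i]));
        }
        compiler_barrier();  // forces the buffer to be re-read next repeat
    }
    return total;
}
```

Checking the generated assembly is still the only way to be sure the inner loop survived; the barrier only removes one reason for the compiler to drop it.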

~~~
floody-berry
That's the problem with microbenchmarks, ensuring they're measuring what you
think they're measuring.

------
tofof
TLDR: Headline (and indeed bulk of article) is phantom symptom. True cause is
register allocator behavior.

Specifically, allocator's handling of an instruction with a false dependency
on register that's written to, coupled with multiple compilers being unaware
of the false dependency.
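
One way to sidestep the false dependency by hand is to zero the destination register immediately before the popcnt, since `xor reg, reg` is a recognized dependency-breaking idiom on Intel cores. The sketch below assumes GCC or Clang on x86-64 built with POPCNT enabled (for example `-mpopcnt` or `-march=native`); otherwise it falls back to the compiler builtin. `popcnt_nodep` and `count_bits` are hypothetical names, and whether this actually helps on a given CPU has to be measured.

```cpp
#include <cstddef>
#include <cstdint>

// Population count whose destination register is zeroed first, so the
// popcnt does not wait on whatever instruction last wrote that register.
// Hypothetical helper; uses the builtin when the inline-asm path is not
// available.
static inline std::uint64_t popcnt_nodep(std::uint64_t x) {
#if defined(__x86_64__) && defined(__POPCNT__)
    std::uint64_t r;
    __asm__("xorl %k0, %k0\n\t"  // zeroing idiom: breaks the dependency on r
            "popcntq %1, %0"     // r = number of set bits in x
            : "=&r"(r)           // early-clobber: r must not alias x
            : "r"(x)
            : "cc");
    return r;
#else
    return static_cast<std::uint64_t>(__builtin_popcountll(x));
#endif
}

// Four independent accumulators, similar in spirit to an unrolled
// popcnt loop; the tail loop handles lengths not divisible by four.
std::uint64_t count_bits(const std::uint64_t* buf, std::size_t n) {
    std::uint64_t c0 = 0, c1 = 0, c2 = 0, c3 = 0;
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        c0 += popcnt_nodep(buf[i]);
        c1 += popcnt_nodep(buf[i + 1]);
        c2 += popcnt_nodep(buf[i + 2]);
        c3 += popcnt_nodep(buf[i + 3]);
    }
    for (; i < n; ++i) {
        c0 += popcnt_nodep(buf[i]);
    }
    return c0 + c1 + c2 + c3;
}
```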

~~~
PythonicAlpha
Maybe one should add that (as far as I understood) it is a problem of the
processor's handling of one specific (and rare) instruction. It assumes
register dependencies that do not exist. It was shown that AMD does not have
this behavior. And it shows that today's processors are enormously complex
beasts.

The problem with the compilers was that they were not aware of this behavior
and thus generated suboptimal code for this situation ... but compiler
builders are also mere humans.

~~~
shmerl
Did they file a bug for gcc and clang?

------
jbondeson
This is why micro-benchmarking is Russian roulette.

When you distill a loop until you're finding the exact bottleneck in the
system (pipelining, branch prediction, etc) you need to be very very careful
you're measuring what you think you are. Otherwise you'll end up in this
situation where you're benchmarking a compiler...

------
byuu
I suppose similarly related to this, when I was keeping track of
synchronization between two cooperative simulation threads running at
different frequencies, I had a 64-bit signed integer: chip A would add
chip_B_frequency * chip_A_cycles_executed; and chip B would subtract
chip_A_frequency * chip_B_cycles_executed. If the value was >=0, chip A was
ahead and would switch to B; and if the value was <0, chip B was ahead and
would switch to A.

I ended up getting a noticeable speed boost just by using sync +=
(uint32_t)clocks * (uint64_t)frequency; ... just a simple 32-bit x 64-bit
multiply was quite a bit faster than a 64-bit x 64-bit multiply. (One had to
be 64-bit to prevent the multiplication from overflowing, as one value was in
the MHz range and the other could be up to ~2000 or so.)

I've observed this on both AMD and Intel amd64 CPUs. Not sure how that'd hold
up on other CPUs. As always though, profile your code first, and only consider
these types of tricks in hot code areas.
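
The scheme described above can be sketched roughly as follows. This is an illustration under assumptions, not byuu's actual code: `SyncCounter`, `step_a`, `step_b`, and `a_is_ahead` are hypothetical names. The point is only that the cycle count is kept as a 32-bit value and widened at the multiply, so the product is computed in 64 bits and cannot overflow for MHz-range frequencies.

```cpp
#include <cstdint>

// Hypothetical sync counter for two cooperative threads (chip A, chip B).
// Positive or zero: chip A is ahead. Negative: chip B is ahead.
struct SyncCounter {
    std::int64_t value = 0;

    // Chip A executed `clocks` cycles; scale by chip B's frequency.
    // The uint32_t operand is widened to 64 bits by the multiply.
    void step_a(std::uint32_t clocks, std::uint64_t freq_b) {
        value += static_cast<std::int64_t>(clocks * freq_b);
    }

    // Chip B executed `clocks` cycles; scale by chip A's frequency.
    void step_b(std::uint32_t clocks, std::uint64_t freq_a) {
        value -= static_cast<std::int64_t>(clocks * freq_a);
    }

    bool a_is_ahead() const { return value >= 0; }
};
```

Whether the narrower operand is actually faster depends on the compiler and CPU, so this only pays off if profiling shows the multiply in a hot path.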

------
userbinator
It should be noted that using 64-bit operands, even in 64-bit mode, incurs an
extra penalty of 1 byte per instruction, for the REX prefix. The same applies
to using the extended registers (the uncreatively-named "r8" through "r15".)
This is very much not noticeable for microbenchmarks, where all the code of a
loop fits in the cache, but for bigger ones, the effects of icache misses can
become quite significant. A smaller instruction sequence that is slower than a
larger one when microbenchmarked can become much faster once that code is
benchmarked as part of a whole system.
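
To make the one-byte cost concrete, here is a small sketch of the REX prefix layout (`0100WRXB`) and the encodings of one `add` instruction in three register flavors. The helper `rex` and the array names are hypothetical; the byte values are worked out by hand from the standard `add r/m, r` encoding (opcode `01 /r`), so treat them as an illustration and confirm with a disassembler.

```cpp
#include <cstddef>
#include <cstdint>

// REX prefix in 64-bit mode: 0100WRXB.
//   W = 64-bit operand size, R = extends ModRM.reg,
//   X = extends SIB.index,   B = extends ModRM.rm / SIB.base.
constexpr std::uint8_t rex(bool w, bool r, bool x, bool b) {
    return static_cast<std::uint8_t>(0x40 | (w ? 8 : 0) | (r ? 4 : 0) |
                                     (x ? 2 : 0) | (b ? 1 : 0));
}

// add eax, ebx : no prefix needed.
constexpr std::uint8_t add_eax_ebx[] = {0x01, 0xD8};
// add rax, rbx : REX.W selects 64-bit operands.
constexpr std::uint8_t add_rax_rbx[] = {rex(true, false, false, false),
                                        0x01, 0xD8};
// add r8d, r9d : REX.R and REX.B reach the extended registers.
constexpr std::uint8_t add_r8d_r9d[] = {rex(false, true, false, true),
                                        0x01, 0xC8};

// Encoded length in bytes of one of the arrays above.
template <std::size_t N>
constexpr std::size_t encoded_size(const std::uint8_t (&)[N]) {
    return N;
}
```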

~~~
nitrogen
_(the uncreatively-named "r8" through "r15".)_

I'd much rather have numbered registers that can be used for anything than
named registers that have usage limitations.

~~~
colanderman
I suspect that aside was written tongue-in-cheek.

~~~
nitrogen
I'm sure, but in case anyone at a CPU design company gets any bright ideas and
decides to start naming everything... ;-)

~~~
poizan42
_cough_ MIPS _cough_

------
frozenport
How can you fix this in VS, where there is no way to finely target a CPU?

