

Novel Optimization Technique - latitude
http://swapped.tumblr.com/post/39656947625/novel-optimization-technique

======
mmastrac
Seeing the actual x86 or x64 code here would be insightful. It's hard to guess
(though not impossible, I suppose) why this might be happening given just the
code, but seeing the emitted assembly would clearly show what improves. It's
possible that adding that assignment changes the register allocation/stack
spill just enough to be slightly more optimal.

Is there any code before or after that has been omitted that might be
contributing to the 5% speedup?

~~~
latitude
No, no changes before or after or anywhere else. Just that one line.

I have written my share of hand-optimized assembly, but this is just an
optimizer peculiarity. I'm not really trying to understand why it happened,
and calling it a "technique" was a joke. That said, here's an assembly dump
of both versions, just in case :) -

Slower version - <https://gist.github.com/4454234>

Faster version - <https://gist.github.com/4454238>

------
latitude
I guess everyone has run into something like this at some point. Care to
share? What are the most bizarre optimization quirks you've seen?

------
sb
Hm, there are already some comments about possible optimizations kicking in
and requests for the assembly code (which is _always_ a good idea when looking
for changes like this). However, before doing any of that, I would want to
know how these results were measured: how many repetitions, which machine
configuration, what standard deviation, etc. It could very well be that this
behavior is due to some pre-caching in Windows or some other external factor.

~~~
latitude
This is part of a custom malloc(). The test involves running

    p = malloc(); free(p);

in a tight loop for 5 seconds, generating about 60,000,000 traversals of the
code in question. The slower version yields a loop count of around 67,... and
the faster version bumps it to over 70,... Not quite 5%, but thereabouts.

------
CountHackulus
My guess here is that either the Partial Redundancy Elimination or Loop
Invariant Code Motion (or some interaction between the two) in MSVC2010 needed
a helping hand, but I agree with mmastrac, it's hard to know exactly what's
going on without seeing the assembler.

------
jheriko
Yeah, it would be nice to see what code is actually generated that makes it
faster...

... also, have you considered pointer arithmetic? Your style of for loop is
nothing I would ever write to begin with. Perhaps this is why the compiler has
a hard time with it - it is quite unusual - and the extra loop counter seems
pointless and can be removed with a single pre-calculation.

~~~
latitude
> _... also, have you considered pointer arithmetic?_

Good point. Tried it, and it makes no difference. Looking at the assembly, it
appears that the difference in performance lies in the prolog code for the
loop rather than in the loop itself.

