

Software performance is counterintuitive - platz
http://lemire.me/blog/archives/2014/06/06/software-performance-is-counterintuitive/

======
robert_tweed
I don't see how this would surprise anyone who has given even the slightest
thought to speed optimisation, never mind an "expert". A loop containing one
cheap operation is dominated by the loop overhead, not the cheap operation, so
of course two cheap operations in one loop are more efficient than two
functions, each with its own loop overhead (assuming the loops can't be
unrolled, of course).
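
For concreteness, a minimal sketch of the two shapes being compared (the
array, sum, and min are made up for illustration; assumes n > 0):

    #include <stddef.h>
    #include <stdint.h>

    /* Two passes: the loop overhead (and the array traversal) is paid twice. */
    void two_passes(const int64_t *a, size_t n, int64_t *sum, int64_t *min) {
        *sum = 0;
        for (size_t i = 0; i < n; i++)
            *sum += a[i];
        *min = a[0];
        for (size_t i = 1; i < n; i++)
            if (a[i] < *min) *min = a[i];
    }

    /* One pass: both cheap operations share a single loop's overhead. */
    void one_pass(const int64_t *a, size_t n, int64_t *sum, int64_t *min) {
        *sum = 0;
        *min = a[0];
        for (size_t i = 0; i < n; i++) {
            *sum += a[i];
            if (a[i] < *min) *min = a[i];
        }
    }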

That is never the problem. The problem is that doing this creates tight
coupling between the two operations, making it marginally (or sometimes
significantly) more expensive when you only need one of them. The coupling
also makes your code harder to understand (in all but trivial cases). When the
operations are naturally inseparable, this approach makes sense. It also makes
sense if you are trying to optimise out a specific problem after profiling.

As a general design, it's usually an anti-pattern and one of the reasons
premature optimisation gets such a bad rap with Knuth et al.

Also, this case is pretty trivial. There are many cases where trying to couple
together multiple operations results in an explosion of edge cases with
unexpected side effects that end up making performance worse, unless you are
super careful. This is where "just pick a better algorithm" usually applies,
except that if you tightly couple two algorithms together, it's much harder to
swap one out.

~~~
gizmo686
With a sufficiently smart compiler, it is possible to get the best of both
worlds. If the compiler can prove that the two loops have the same number of
iterations, and that the two functions don't interfere with each other, then
it can optimize them into one loop. I believe this is one of the tricks
Haskell uses.
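
A hedged sketch of what that transformation looks like in C (the function
names are made up; `restrict` is one way to let the compiler prove the loops
don't interfere):

    /* Before: two loops over the same range. */
    void unfused(int n, const int *restrict a, int *restrict b, int *restrict c) {
        for (int i = 0; i < n; i++) b[i] = a[i] * 2;
        for (int i = 0; i < n; i++) c[i] = a[i] + 1;
    }

    /* After: the single loop a fusing compiler may produce, with the
       same semantics as the two loops above. */
    void fused(int n, const int *restrict a, int *restrict b, int *restrict c) {
        for (int i = 0; i < n; i++) {
            b[i] = a[i] * 2;
            c[i] = a[i] + 1;
        }
    }

In GHC the analogous trick is fusing composed list traversals (e.g. rewriting
map f . map g into a single pass) via rewrite rules.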

~~~
kevinnk
That's called loop fusion, and most optimizing compilers have it.

[http://en.wikipedia.org/wiki/Loop_fusion](http://en.wikipedia.org/wiki/Loop_fusion)

------
kyberias
Just because one particular superscalar processor can execute the
instructions in the inner loop of the author's example in parallel doesn't
make the general case incorrect: adding instructions typically makes a program
slower. I don't think the author is being fair here.
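
To see the distinction with a contrived example: the article's extra
instruction is off the critical path, so a spare execution port can absorb it;
put the same instruction on the loop's dependency chain and it adds latency to
every iteration, exactly as the naive model predicts.

    #include <stddef.h>
    #include <stdint.h>

    /* Independent: the popcount doesn't feed the sum chain, so it can
       issue alongside the addition (the article's situation). */
    uint64_t independent(const unsigned long *a, size_t n) {
        uint64_t sum = 0, pop = 0;
        for (size_t i = 0; i < n; i++) {
            sum += a[i];
            pop += (uint64_t)__builtin_popcountl(a[i]);
        }
        return sum + pop;
    }

    /* Dependent: each popcount needs the previous sum, serialising the
       loop; here the added instruction really does cost time. */
    uint64_t dependent(const unsigned long *a, size_t n) {
        uint64_t sum = 0;
        for (size_t i = 0; i < n; i++)
            sum += (uint64_t)__builtin_popcountl((unsigned long)sum + a[i]);
        return sum;
    }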

~~~
Arnor
Agreed. This post doesn't show that the mental model is naive or incorrect. It
simply shows that the _expense_ of an operation is not always _time_. We
generally optimize for _time_ and consequently miss the subtlety of _expense_.

Just because the additional instruction did not take more time (or cycles)
doesn't mean it was _free_. Say there are several instructions, any of which
can be subbed in for `__builtin_popcountl` and still fit into the same cycle.
Several used together may add up and cost an extra cycle on each round of the
loop.

So although the added instruction doesn't cost any additional time, it isn't
free.
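
A back-of-envelope illustration (the issue width and uop counts are made up,
but representative of a 4-wide core):

    loop body: load + add + compare/branch   = 3 uops -> 1 cycle/iteration
    + one independent popcnt                 = 4 uops -> still 1 cycle ("free")
    + one more independent ALU op            = 5 uops -> 2 cycles (now it costs)

The first extra instruction rides along in the spare issue slot; the second
overflows it and every iteration of the loop pays.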

------
jandrewrogers
There are many different ways in which adding instructions improves absolute
performance. Execution port saturation (the example in the article) is one.
Another one is where additional computation in-register is faster than simply
loading a register with a new value (i.e. computation is faster than access).
There is also instruction fusion, where a pair of assembly operations commonly
used together (e.g. a compare immediately followed by a conditional branch) is
fused into a single one-cycle operation inside the CPU.
Optimization on modern micro-architectures has a lot of complex and sometimes
counterintuitive rules but you can often squeeze substantial performance gains
out of them if you know what you are doing.
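
A hedged sketch of the "computation is faster than access" case (the
bit-reversal functions are illustrative, not from the article):

    #include <stdint.h>

    /* Lookup: one memory access per call; fast only while the table
       stays in cache. */
    static const uint8_t rev_table[256] = {0 /* remaining entries elided */};

    uint8_t reverse_lookup(uint8_t b) {
        return rev_table[b];
    }

    /* Compute: a handful of register-only shifts and masks, no memory
       traffic; often wins once the table would be missing cache. */
    uint8_t reverse_compute(uint8_t b) {
        b = (uint8_t)(((b & 0xF0) >> 4) | ((b & 0x0F) << 4));
        b = (uint8_t)(((b & 0xCC) >> 2) | ((b & 0x33) << 2));
        b = (uint8_t)(((b & 0xAA) >> 1) | ((b & 0x55) << 1));
        return b;
    }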

One caveat to be aware of in the article's example: hyper-threading works by
letting a second thread run on the CPU's execution ports that are unused by
the first thread. Consequently, if you design your code to do an excellent job
of saturating execution ports, hyper-threading will often be _slower_ than
using a single thread per core (you pay all of the additional cache pressure,
and there are few free execution ports left for the second thread to run on).
However, the speedup you can get on a single thread with excellent execution
port saturation will often be larger than the speedup you get from
hyper-threading code that does not saturate the execution ports, due to the
aforementioned cache overhead of running a second thread concurrently.

------
etep
So this would be limited by main memory bandwidth if implemented correctly,
but I doubt it is even limited by that right now (i.e. because it is
apparently single-threaded, and one core typically can't saturate main memory
bandwidth). The author should measure this and find the active limit
(instruction retire or memory bandwidth); perhaps then he will find some
insight.
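
Rough arithmetic, with made-up but plausible numbers (64-bit words, ~20 GB/s
sustained from main memory on one core):

    20e9 bytes/s / 8 bytes per word ~= 2.5e9 words/s (~0.4 ns per word)

    A 3 GHz core retiring one word per cycle processes 3e9 words/s, so a
    single-threaded pass sits near both limits; only a measurement tells
    you which one (retire rate or bandwidth) is actually binding.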

------
modulus1
There are a ton of variables feeding into the performance of your software.
I've spent a lot of time trying to improve cache locality, reduce branch
mispredictions, etc., with very little to show for it.

Asymptotic complexity is the best place to start. I thank academia for this
'naive' model.
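
A quick worked comparison shows why (the numbers are illustrative):

    n = 1e6
    O(n^2)     algorithm: ~1e12 basic operations
    O(n log n) algorithm: ~2e7  basic operations  (log2(1e6) ~= 20)

That 50,000x gap dwarfs the 2-4x you might claw back from cache or
branch-prediction tuning.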

