
Software optimization resources for C++, assembly, Windows, Linux, BSD, Mac OS - dgellow
https://www.agner.org/optimize/
======
H8crilA
In my experience, the greatest optimization gains come from introducing more
domain knowledge into the code: heuristics that cut the number of calls to some
expensive function, shortcuts that introduce good-enough approximations, and
pre-computing or memoizing values that are actually constant in the processed
data. Things like that.

And in programs that have never been optimized, it's usually very easy to fire
up the profiler and find some obviously terrible things, like reinitializing
in a tight loop some expensive object that could easily be a singleton (or at
least a thread-local object). I once spent a month on a decent piece of code
and brought the CPU cost down by literally 100x (nobody had optimized the code
before).
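
To make the "expensive object in a tight loop" case concrete, here is a minimal sketch. It assumes the expensive object is a `std::regex` (my choice for illustration, not necessarily what the code above used), and the function names are hypothetical.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Slow variant: the regex is compiled again for every line.
std::size_t count_numeric_lines_slow(const std::vector<std::string>& lines) {
    std::size_t n = 0;
    for (const auto& line : lines) {
        const std::regex re("[0-9]+");  // rebuilt on each iteration
        if (std::regex_search(line, re)) ++n;
    }
    return n;
}

// Hoisted variant: the regex is built once per thread and then reused.
std::size_t count_numeric_lines_fast(const std::vector<std::string>& lines) {
    thread_local const std::regex re("[0-9]+");  // constructed on first use
    std::size_t n = 0;
    for (const auto& line : lines) {
        if (std::regex_search(line, re)) ++n;
    }
    return n;
}
```

Both functions return the same count; how much faster the second one is depends on the pattern and the standard library, so measure before and after.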

Also, always create a reliable framework for measuring progress first (the
instructions-retired counters in Intel CPUs are very good and don't need many
repeated runs to bring the p-values down), and only then start optimizing.

Low-level, hardcore optimization rarely pays off (not never, but rarely).

~~~
ncmncm
My experience differs.

Domain-specific optimizations do often yield order-of-magnitude improvements,
but once those opportunities have been exploited, there is very often another
factor of two or ten remaining.

For example, a three-line change to quicksort, not changing the algorithm at
all, more than doubles its speed.

The key insight not taught is that the familiar order notation -- O(N), O(N^2)
-- does not dictate what N must count. Traditionally, it counts comparisons,
swaps, or other algorithmic events. Nowadays, these operations are practically
free, and what matters are cache misses and pipeline stalls.

Because each miss is so devastating, it is essential to keep plenty of work in
the pipeline. Interfaces should provide batched operations, because often it
takes no more time to, say, produce four hashes than one.
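
A rough sketch of what a batched interface could look like, assuming fixed-width 64-bit keys. The mixing constants are borrowed from the widely used SplitMix64 finalizer, and `mix64_x4` is a hypothetical name. The point is only that the four dependency chains are independent, so the CPU can overlap them; whether that actually wins is something to measure.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// One-at-a-time interface: each call is a serial chain of dependent steps.
std::uint64_t mix64(std::uint64_t x) {
    x ^= x >> 30;
    x *= 0xbf58476d1ce4e5b9ull;
    x ^= x >> 27;
    x *= 0x94d049bb133111ebull;
    x ^= x >> 31;
    return x;
}

// Batched interface: the same steps applied to four keys, lane by lane.
// No lane depends on another, so the work can proceed in parallel.
std::array<std::uint64_t, 4> mix64_x4(const std::array<std::uint64_t, 4>& keys) {
    std::array<std::uint64_t, 4> h = keys;
    for (std::size_t i = 0; i < 4; ++i) h[i] ^= h[i] >> 30;
    for (std::size_t i = 0; i < 4; ++i) h[i] *= 0xbf58476d1ce4e5b9ull;
    for (std::size_t i = 0; i < 4; ++i) h[i] ^= h[i] >> 27;
    for (std::size_t i = 0; i < 4; ++i) h[i] *= 0x94d049bb133111ebull;
    for (std::size_t i = 0; i < 4; ++i) h[i] ^= h[i] >> 31;
    return h;
}
```

The batched version returns exactly what four separate `mix64` calls would, so callers can switch between the two without changing results.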

Measurement is necessary, but since it is so hard to know whether a thing is
fast, reading out the hardware performance counters is essential. How many
pipeline stalls, how many cache misses, how many branch mispredictions?

Transforming "if (c) ++i;" to "i += c;" can make a huge difference, in an
inner loop, when c is unpredictable. Similarly, "if (c) a -= b;" may become "a
-= (-c & b);". Compilers usually will not make these transformations because
they guess that c is predictable. When it isn't, you get a pipeline stall
every other iteration, with 10-15 cycles down the drain each time, enough for
~40-60 useful instructions.
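
Spelled out as compilable code, the two transformations look roughly like this. The function names are made up for illustration, and the mask trick assumes c is exactly 0 or 1, which a C++ `bool` guarantees.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Branchy: "if (c) ++i;" with c = (x < limit).
std::size_t count_below_branchy(const std::vector<int>& v, int limit) {
    std::size_t n = 0;
    for (int x : v) {
        if (x < limit) ++n;
    }
    return n;
}

// Branchless: "i += c;" adds the comparison result (0 or 1) directly.
std::size_t count_below_branchless(const std::vector<int>& v, int limit) {
    std::size_t n = 0;
    for (int x : v) {
        n += static_cast<std::size_t>(x < limit);
    }
    return n;
}

// "if (c) a -= b;" as "a -= (-c & b);".
// In unsigned arithmetic, 0 - c is all-zero bits for c == 0
// and all-one bits for c == 1, so the mask keeps or drops b.
std::uint32_t sub_if(std::uint32_t a, std::uint32_t b, bool c) {
    const std::uint32_t mask = 0u - static_cast<std::uint32_t>(c);
    return a - (mask & b);
}
```

A compiler may already emit branch-free code for the first function at some optimization levels, so check the generated assembly and the misprediction counters rather than assuming either way.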

~~~
a1369209993
> [...] what N must count. Traditionally, it counts comparisons, swaps, or
> other algorithmic events.

Er, _N_ is counting the _input_ elements (eg number of nodes in a graph, array
length, etc), _O()_ is what counts algorithmic or microarchitectural events.

> Similarly, "if (c) a -= b;" may become "a -= (-c & b);".

Reminds me of speeding up an ECC loop by almost two orders of magnitude with
(basically):

    
    
      - if(i&(1<<k)) p ^= parity(x[i])<<k;
      + p32[k] ^= x[i] & -(ii&1); ii>>=1;

~~~
einpoklum
GP misspoke; s/he meant "O(N) of what?" rather than "N of what?"

------
dang
Small threads from 2016 and 2015:

[https://news.ycombinator.com/item?id=13258254](https://news.ycombinator.com/item?id=13258254)

[https://news.ycombinator.com/item?id=10187043](https://news.ycombinator.com/item?id=10187043)

[https://news.ycombinator.com/item?id=8874206](https://news.ycombinator.com/item?id=8874206)

------
ncmncm
Agner is an absolute treasure. I hang on his every word.

One essential reference not cited is
[https://danluu.com/branch-prediction/](https://danluu.com/branch-prediction/).

~~~
saagarjha
A treasure he is, but he writes a lot… I can never seem to use his stuff
except as a reference when I need to know the IPC of something.

------
renox
I recently tried to optimise a loop that looks for a value in a small array of
short integers (8 elements). It's not so easy, because there are so many
choices: a tight loop (a priori not the optimal choice, but it uses only one
instruction cache line), an unrolled loop, cmovs, SIMD?
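
For comparison, two of those choices might be sketched like this: the plain early-exit loop and a branch-free scan that a compiler may lower to conditional moves. The names are hypothetical and there is no claim here about which is faster; that depends on the data and the CPU.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Tight loop with an early exit: one branch per element.
int find8_loop(const std::array<std::int16_t, 8>& a, std::int16_t key) {
    for (std::size_t i = 0; i < 8; ++i) {
        if (a[i] == key) return static_cast<int>(i);
    }
    return -1;
}

// Branch-free scan: always looks at all 8 elements, walking backwards
// so that the lowest matching index is the one left in result.
int find8_select(const std::array<std::int16_t, 8>& a, std::int16_t key) {
    int result = -1;
    for (int i = 7; i >= 0; --i) {
        result = (a[static_cast<std::size_t>(i)] == key) ? i : result;
    }
    return result;
}
```

Both return the index of the first match, or -1 if the key is absent.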

------
loeg
The code is mostly or entirely GPL, if that is something you might be
interested in.

~~~
SloopJon
The code that looks most interesting is the vector class library, which is
licensed under the Apache License 2.0.

~~~
loeg
The string.h libc functions ("Subroutine library") and Object file converter
also look interesting to me. Those are GPL.

