
Accelerating intersections with SIMD instructions - jessaustin
http://lemire.me/blog/archives/2015/03/25/accelerating-intersections-with-simd-instructions/
======
hammadtime
For interested readers who want follow-up reading: learn about loop
unrolling. It's commonly used with SIMD vectorization to get rid of loop
overhead. From there you can also start doing cache-based optimizations such
as cache blocking.
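
To make that concrete, here is a rough sketch (mine, not from the article) of
a summation where the SSE loop is unrolled 2x with two independent
accumulators, hiding instruction latency as well as loop overhead. Assumes
x86 with SSE2; compile with something like "cc -O2 -msse2".

    #include <emmintrin.h>  /* SSE2 intrinsics (pulls in SSE as well) */
    #include <stddef.h>
    #include <stdio.h>

    static float sum_unrolled(const float *a, size_t n) {
        __m128 acc0 = _mm_setzero_ps();
        __m128 acc1 = _mm_setzero_ps();
        size_t i = 0;
        /* Unrolled 2x: 8 floats per iteration, two independent adds. */
        for (; i + 8 <= n; i += 8) {
            acc0 = _mm_add_ps(acc0, _mm_loadu_ps(a + i));
            acc1 = _mm_add_ps(acc1, _mm_loadu_ps(a + i + 4));
        }
        acc0 = _mm_add_ps(acc0, acc1);
        /* Reduce the four lanes to one float. */
        float lanes[4];
        _mm_storeu_ps(lanes, acc0);
        float sum = lanes[0] + lanes[1] + lanes[2] + lanes[3];
        /* Scalar tail for the 0..7 leftover elements. */
        for (; i < n; i++)
            sum += a[i];
        return sum;
    }

    int main(void) {
        float a[100];
        for (int i = 0; i < 100; i++) a[i] = 1.0f;
        printf("%f\n", sum_unrolled(a, 100)); /* prints 100.000000 */
        return 0;
    }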

~~~
nkurz
Actually, I'd probably go the other way on that: loop unrolling is much less
useful than it used to be. As long as your loops are predictable, I'd suggest
not worrying about them. Correctly predicted branches are very close to free,
and fitting the entire loop in the decoded µop cache is more important. There
are still cases where unrolling helps, but there are also cases where the
instruction mix works out so that there is zero overhead. Not just low, but
zero. So until you've measured, keep your loops short and simple.

Instead, I'd suggest that if you want to write really high-performance code,
you need to be thinking in assembly regardless of the language you are writing
the actual code in. Look at the code that's actually executing, either by
disassembling the compiled object or profiling with something like 'perf' on
Linux.[1] Then, you need to be aware of how the processor handles that
assembly, in terms of latencies and execution ports. Agner Fog's manuals in
combination with those from Intel and AMD are incredibly valuable.
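
For example, the workflow might look like this (a rough sketch; the file and
binary names here are made up):

    /* hot.c -- a toy kernel standing in for whatever loop you care about.
     * To see what actually executes:
     *
     *   cc -O2 -S hot.c                        emit the assembly (hot.s)
     *   cc -O2 -c hot.c && objdump -d hot.o    disassemble the object
     *   perf record ./bench && perf annotate   per-instruction profile
     *
     * then check each instruction's latency and port usage against Agner
     * Fog's instruction tables. */
    #include <stddef.h>

    float dot(const float *a, const float *b, size_t n) {
        float s = 0.0f;
        for (size_t i = 0; i < n; i++)
            s += a[i] * b[i];
        return s;
    }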

~~~
rdc12
Wouldn't there be cases where loop unrolling is what enables SIMD
instructions to be used? Is that the exception to your "rule"?
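
Something like this is what I have in mind (a toy sketch, assuming SSE):
turning the scalar loop into a 4-wide SSE loop is, in effect, a 4x unroll.

    #include <emmintrin.h>
    #include <stddef.h>

    /* Scalar: one float add per iteration. */
    void add_scalar(float *c, const float *a, const float *b, size_t n) {
        for (size_t i = 0; i < n; i++)
            c[i] = a[i] + b[i];
    }

    /* The same loop "unrolled" 4x so that each iteration becomes a single
     * 4-wide SSE add; the scalar tail handles the n % 4 leftovers. */
    void add_simd(float *c, const float *a, const float *b, size_t n) {
        size_t i = 0;
        for (; i + 4 <= n; i += 4)
            _mm_storeu_ps(c + i,
                _mm_add_ps(_mm_loadu_ps(a + i), _mm_loadu_ps(b + i)));
        for (; i < n; i++)
            c[i] = a[i] + b[i];
    }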

