

Loop Unrolling - bpatelcse
https://www.cs.umd.edu/class/fall2001/cmsc411/proj01/proja/loop.html

======
barrkel
I had an amusing time optimizing a byte-scanning loop in Java a few months
back.

The initial loop was a lot slower than a naive byte-by-byte, non-unrolled
version in C. I implemented an approximate version of Duff's device, and did
the unrolling by hand, improving it by a good 50%. But it was still much
slower than C.
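(The original code isn't shown; the following is a hypothetical sketch of what a hand-unrolled byte scan along those lines might look like. The class and method names are made up: the main loop handles four bytes per iteration, with a remainder loop for the tail.)

```java
public class UnrolledScan {
    // Hand-unrolled scan: four comparisons per iteration, then a
    // byte-by-byte remainder loop for the last (length % 4) bytes.
    public static int indexOf(byte[] a, byte target) {
        int i = 0;
        int limit = a.length - 3;
        for (; i < limit; i += 4) {
            if (a[i] == target) return i;
            if (a[i + 1] == target) return i + 1;
            if (a[i + 2] == target) return i + 2;
            if (a[i + 3] == target) return i + 3;
        }
        for (; i < a.length; i++) {   // remainder loop
            if (a[i] == target) return i;
        }
        return -1;
    }
}
```

Note that each `a[i + k]` access here still carries its own JVM bounds check unless the JIT can prove it redundant, which is part of why manual unrolling can disappoint in Java.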

On a whim, I went back to a naive loop, but simplified the core. This was
strictly speaking algorithmically more costly, but it hit a JVM optimization
code path specifically tailored for searches through byte arrays. Suddenly my
code was about 40% faster than naive C (still run through gcc -O3)! A peek at
the disassembly showed the JVM had done all the relevant loop unrolling
itself.

A key part, IIRC, was something to do with the loop bounds check; it was
simple enough to be proven to always be less than the array length.
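(Again a hypothetical sketch, not the original code: this is the shape of loop HotSpot handles well. Because the loop bound is `a.length` itself, the JIT can prove every `a[i]` is in bounds, drop the per-access check, and unroll or vectorize the loop on its own.)

```java
public class SimpleScan {
    // The trip count is bounded by a.length, so the JIT can hoist the
    // bounds check out of the loop entirely and optimize the body freely.
    public static int indexOf(byte[] a, byte target) {
        for (int i = 0; i < a.length; i++) {
            if (a[i] == target) return i;
        }
        return -1;
    }
}
```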

But it did feel a bit like stumbling around looking for a magic incantation.
Tools like manual loop unrolling, while they help, still incur relatively
heavy costs in array bounds checking that you can't easily escape.

If I ever have to write a loop like that in Java again, I'll start with the
simplest loop, measure it in a micro-benchmark, and iteratively modify it
toward the final implementation without losing the sweet spot of JVM
optimization. Working in that direction is much easier than the reverse.
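(A minimal sketch of that workflow, with made-up names. JMH is the rigorous tool for Java micro-benchmarks; this `System.nanoTime` harness only illustrates the warm-up-then-measure idea, which matters because the loop must be JIT-compiled before timing it tells you anything.)

```java
public class ScanBench {
    static int indexOf(byte[] a, byte target) {
        for (int i = 0; i < a.length; i++) {
            if (a[i] == target) return i;
        }
        return -1;
    }

    public static void main(String[] args) {
        byte[] data = new byte[1 << 16];
        data[data.length - 1] = 1;   // match at the very end: full scan

        // Warm up so the JIT compiles (and, with luck, unrolls) the loop.
        int sink = 0;
        for (int i = 0; i < 2_000; i++) sink += indexOf(data, (byte) 1);

        long t0 = System.nanoTime();
        for (int i = 0; i < 1_000; i++) sink += indexOf(data, (byte) 1);
        long t1 = System.nanoTime();

        // Print the sink so the JIT can't dead-code-eliminate the scans.
        System.out.println("ns/scan: " + (t1 - t0) / 1_000 + " (sink " + sink + ")");
    }
}
```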

~~~
vardump
Can you show your byte scanning code in both Java and C?

------
vardump
Loop unrolling is mostly unnecessary on modern pipelined out-of-order
execution CPUs. It can even slow execution down, because the larger code
footprint puts heavier pressure on the L1 instruction cache.

~~~
gnufx
It still seems a good first guess for Fortran code, at least; e.g. it was
worth ~10% when compiling the reference BLAS and the double-precision LINPACK
benchmark with gfortran on Sandy Bridge.

