I wonder what programmers at the time would have thought of CPUs eventually getting so good at handling loops that unrolling went from an optimisation you performed by hand, to one that compilers might sometimes do for you, to an anti-optimisation:
by eliminating all instances of Duff's Device from the XFree86 4.0 server, the server shrunk in size by _half_ _a_ _megabyte_ (!!!), and was faster to boot
It seems from Duff's own comments that the idea of a compiler doing it better in some circumstances wasn't completely outlandish:
> Don't try to be smarter than an over-clever C compiler that recognizes loops that implement block move or block clear and compiles them into machine idioms.
Some of that can be attributed to the fact that modern CPUs have multiple integer and floating-point execution units, so the loop-counter bookkeeping is often totally inconsequential.
"A second poster tried the test, but botched the implementation, proving only that with diligence it is possible to make anything run slowly."
This made me laugh. "with diligence it is possible to make anything run slowly" should be one of the truths carved into an obelisk outside every CompSci department, or at least printed on a t-shirt.
This is a very useful technique for writing event-driven parsers. I remember being surprised that the case labels don't all have to be at the same scoping level, but it's true.
http://www.agner.org/optimize/blog/read.php?i=142
> It is so important to economize the use of the micro-op cache that I would give the advice never to unroll loops.
http://lkml.iu.edu/hypermail/linux/kernel/0008.2/0171.html