When numerous instances of Duff's device were removed from the XFree86 Server in version 4.0, there was a notable improvement in performance. Therefore, when considering using this code, it may be worth running a few benchmarks to verify that it actually is the fastest code on the target architecture, at the target optimization level, with the target compiler.
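For anyone who wants to run that comparison, here is a minimal sketch of the two candidates. It assumes a memory-to-memory copy of `short`s (Duff's original wrote to a single memory-mapped register, so `to` was not incremented there), and the function names `copy_plain` and `copy_duff` are made up for illustration. Timing is left to whatever harness you trust on your target.

```c
#include <stddef.h>

/* Baseline: a plain loop the compiler is free to unroll or vectorize. */
void copy_plain(short *to, const short *from, size_t count)
{
    while (count-- > 0)
        *to++ = *from++;
}

/* Duff's device, memory-to-memory variant (an assumption; the original
 * targeted one output register). The switch jumps into the middle of
 * the unrolled do-while to handle the count % 8 leftover elements. */
void copy_duff(short *to, const short *from, size_t count)
{
    size_t n;

    if (count == 0)            /* the device itself assumes count > 0 */
        return;

    n = (count + 7) / 8;       /* number of passes through the loop */
    switch (count % 8) {
    case 0: do { *to++ = *from++;
    case 7:      *to++ = *from++;
    case 6:      *to++ = *from++;
    case 5:      *to++ = *from++;
    case 4:      *to++ = *from++;
    case 3:      *to++ = *from++;
    case 2:      *to++ = *from++;
    case 1:      *to++ = *from++;
            } while (--n > 0);
    }
}
```

Both functions should produce identical output for every length, so the only open question is which one your compiler turns into faster code at your optimization level.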
Optimization without testing is worse than pointless. It's premature, and we know that's evil.
Great point: optimization not only costs you in readability and system fragility, it can also go stale over time, and invisibly so.
If you're working at the application level, testing and segregating optimizations like this seems important. Systems-level people would operate differently though ...
"This code forms some sort of argument in that debate, but I'm not sure whether it's for or against."
hack^2 - it's so brilliant and so ugly, even its author doesn't know which it is more.
By the by, we've used this "feature" a fair bit in C and C++ for protothreads in small embedded systems: see http://blog.brush.co.nz/2008/07/protothreads/ and http://www.sics.se/~adam/pt/
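To show the trick for anyone who hasn't seen it, here is a minimal sketch of a resumable function built on the same "case labels inside nested statements" behaviour. The `TASK_*` macros and `struct task` are hypothetical names for this example, not the API of the protothreads library linked above. Note that any state that must survive a yield has to live in the struct, since ordinary locals are lost on return.

```c
/* Hypothetical resumable-task state: where to resume, plus the
 * variables that must survive across yields. */
struct task {
    int line;   /* 0 means "start from the top" */
    int i;      /* loop counter, kept here because locals do not persist */
};

/* Open a switch on the saved line number. */
#define TASK_BEGIN(t)    switch ((t)->line) { case 0:

/* Record the current line, return a value, and plant a case label so
 * the next call jumps straight back here. One yield per source line. */
#define TASK_YIELD(t, v) \
    do { (t)->line = __LINE__; return (v); case __LINE__:; } while (0)

/* Close the switch, reset the task, and signal completion with -1. */
#define TASK_END(t)      } (t)->line = 0; return -1

/* Yields 1, 2, 3 on successive calls, then -1 when finished. */
int count_to_three(struct task *t)
{
    TASK_BEGIN(t);
    for (t->i = 1; t->i <= 3; t->i++)
        TASK_YIELD(t, t->i);
    TASK_END(t);
}
```

Each call re-enters the function and the switch jumps into the body of the `for` loop, exactly the kind of jump Duff's device relies on. The cost is that you cannot use a `switch` of your own across a yield point.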
Also, with the optimizations on modern chipsets it might not be immediately obvious which implementation will actually run faster.
Including the shell, still in use today: http://doc.cat-v.org/plan_9/4th_edition/papers/rc
best shell ever
And my favourite web browser: Mothra
While you argue over CSS / tables - try writing for a browser that has neither!