

How Duff's Device Works - pprov
http://tenaciousc.com/?p=753

======
okmjuhb
The history behind this actually goes the other way if I remember correctly;
it was unclear for a time whether or not this sort of thing was allowed. The
standard was modified so that Duff's device would be legal code after it was
being used.

~~~
ssp
Tom Duff himself writes about it here:

<http://doc.cat-v.org/bell_labs/duffs_device>

In particular,

    
    
        The device is legal dpANS C. I cannot quote chapter and verse, but
        Larry Rosler, who was chairman of the language subcommittee (I think),
        has assured me that X3J11 considered it carefully and decided that
        it was legal. Somewhere I have a note from dmr certifying that all the
        compilers that he believes in accept it. Of course, the device is also
        legal C++, since Bjarne uses it in his book.

------
Jach
You can use this to implement coroutines in C when you're stuck without
Python/Scheme/al.

<http://www.chiark.greenend.org.uk/~sgtatham/coroutines.html>

------
yread
Really interesting! But why wouldn't you copy the data in a simple while loop?
What do you need the switch statement for?

~~~
acqq
Kids these days. I guess the OP didn't really understand everything or didn't
bother to explain, especially my favorite bit about "computed goto" and the
fact that if you code your compiler to emit assembly instructions with labels
it's quite "natural" to implement it.

Duff's device is an example of the "loop unwinding." Say you have n divisible
by 8, then you can loop n / 8 times and have eight assignments in the loop
stated explicitly. Now instead executing n assignments and n looping
instructions (there can be more instructions needed for one loop and the cost
of jump can also be visible) you can have less, n / 8 looping instructions and
still n assignments.

When such thing should be done? 1) If you measured your loop and it was too
slow with a one assignment and one loop pass for every n. 2) you know that
your compiler can't do the loop unwinding automatically -- modern compilers
can do such things even if you write a normal loop if the compiler estimates
that this can really help and when you give to it a specific direction to
produce such code (e.g. aggressively optimize for speed).

Only now comes the reason for using Duff's device: how to code the case where
n is not divisible by 8? Well you can have a loop for the remaining number R
of elements, but then you have R assignments and R looping instructions. More
clever thing is to have the sequence of the assignments in the code stream,
one after another and to _jump_ directly inside to execute exactly R
assignments.

Turns out, C was made in seventies, but Fortran was made in fifties and
already had the construct which allowed exactly that, the direct jump. So
authors of C created something smarter -- a switch statement was implemented
so that it compiles to the series of if ... goto or even to the "computed
goto" sequence, not needing to do more comparisons if the values for jump
conditions are one after another.

So now we're able to have one sequence for a direct jump (say 8 assignments in
a switch) and one sequence for unwound loop (loop around 8 assignments). But
why not trying combining them? That was Duff's insight, and the Duff's device
is a result of it. The benefit: instead having 16 assignments in the generated
code, you'll have just 8 which are going to be used both for R assignments and
for unwound loop.

Just that last step (reducing the number of the assignments in the resulting
code from 16 to 8) is something that you probably won't measure as a direct
speed benefit, you'll only have a bit smaller resulting code. So I guess it
was more a curiosity than something that must have been done exactly so even
as it was written. Still it nicely demonstrates how people who know what is
produced by the compiler are able to think, on one side, and how C can be used
for "assembly level" optimizations, on another.

~~~
anamax
> When such thing should be done? 1) If you measured your loop and it was too
> slow with a one assignment and one loop pass for every n. 2) you know that
> your compiler can't do the loop unwinding automatically -- modern compilers
> can do such things even if you write a normal loop if the compiler estimates
> that this can really help and when you give to it a specific direction to
> produce such code (e.g. aggressively optimize for speed).

Note that many compilers generate crappy code for duff's device. The switch
statement, specifically the intermediate destinations, keeps them from
generating good code for the unrolled loop body.

The compiled code can be so bad that a "not unrolled" loop to get to alignment
followed by an unrolled aligned loop is significantly faster.

And, because this stuff is never simple, the fastest version may be range
dependent, as in one version is faster for up to 100 elements while another is
faster for more than 100.

------
trotsky
well worth the read

------
konad
After a spell at Bell-Labs working on Plan9 (his Mothra web browser is my
favourite HTML2 renderer), Duff is now at Pixar.

~~~
jonhendry
"Back at Pixar" might be more accurate. Though I think when he was there the
first time it was still part of Lucasfilm. That's where he invented the
Device. He's been at Pixar since 1996.

