
Tom Duff on Duff's Device (1988) - mmphosis
http://www.lysator.liu.se/c/duffs-device.html
======
cageface
I worked in the same building as Tom Duff when I was at Pixar and once told
him how chuffed I was to be working with the inventor of Duff's Device
himself.

He told me he hoped that in the end he'd be remembered for all the other work
he did and not his "device".

------
abecedarius
> (Actually, I have another revolting way to use switches to implement
> interrupt driven state machines but it's too horrid to go into.)

I guess this means like
<http://www.chiark.greenend.org.uk/~sgtatham/coroutines.html> ?
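
For anyone who doesn't want to click through, here is a minimal sketch of the switch-based resumable function that page describes. Whether this is what Duff was hinting at is a guess; he never says. The name `next_value` and the state numbering are made up for illustration.

```c
/* Hypothetical example: yields 0, 1, 2 on successive calls, then -1. */
int next_value(void)
{
    static int state = 0;  /* which case label to resume at */
    static int i;          /* static, so it survives between calls */

    switch (state) {
    case 0:
        for (i = 0; i < 3; i++) {
            state = 1;     /* the next call resumes at case 1 */
            return i;
    case 1:;               /* re-entry point, in the middle of the loop */
        }
    }
    state = 2;             /* no matching case: the switch is skipped from now on */
    return -1;
}
```

As in Duff's device, the _case_ label sits inside a loop body, so a later call jumps straight back into the middle of the _for_.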

------
jejones3141
BTW, note that Duff's device leads to an irreducible flow graph. In one
compiler I know of, that led to really bad code (it gave up on putting
variables in registers); in others, I think that the changes to make the flow
graph reducible undo the jump into the middle of the loop.

------
quarterto
I remember vividly the moment of zen I achieved when I first understood Duff's
device.

~~~
revjx
I look forward to that moment. I can remember trying to look up a simpleton's
explanation back when I first heard of it, but I gave up and figured I'd
understand one day when I got a bit better at programming.

~~~
drv
First imagine how this routine would look if it could assume that _count_ was
a multiple of 8:

    
    
        register n=count/8;
        do{
            *to = *from++;
            *to = *from++;
            *to = *from++;
            *to = *from++;
            *to = *from++;
            *to = *from++;
            *to = *from++;
            *to = *from++;
        }while(--n>0);
    

This is almost the same as Duff's code without the switch statement. It's a
simple unrolled loop that copies 8 words per iteration.

Next, consider how to handle a _count_ that is not a multiple of 8. You could
simply add a loop that copies a single word at a time, but to avoid the
overhead of a loop, you could instead use a _switch_ statement, making use of
C's fallthrough behavior when omitting the _break_ after each _case_ :

    
    
        switch (count%8) {
        case 7: *to = *from++;
        case 6: *to = *from++;
        case 5: *to = *from++;
        case 4: *to = *from++;
        case 3: *to = *from++;
        case 2: *to = *from++;
        case 1: *to = *from++;
        }
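
Put together, the two pieces so far might look like the sketch below. The function name `copy_two_step` is made up, and it increments _to_ (a plain memory-to-memory copy) so the result can be checked, which the original does not do. It also uses _while_ rather than _do{}while_ so that a _count_ below 8 runs zero full passes.

```c
/* Hypothetical helper: remainder switch followed by an 8-way unrolled loop. */
void copy_two_step(short *to, const short *from, int count)
{
    int n = count / 8;          /* number of full 8-word passes */

    switch (count % 8) {        /* left-over words first, via fallthrough */
    case 7: *to++ = *from++;
    case 6: *to++ = *from++;
    case 5: *to++ = *from++;
    case 4: *to++ = *from++;
    case 3: *to++ = *from++;
    case 2: *to++ = *from++;
    case 1: *to++ = *from++;
    }

    while (n-- > 0) {           /* while, not do-while: n may be 0 */
        *to++ = *from++;
        *to++ = *from++;
        *to++ = *from++;
        *to++ = *from++;
        *to++ = *from++;
        *to++ = *from++;
        *to++ = *from++;
        *to++ = *from++;
    }
}
```

Note that the copy statement appears fifteen times here. Duff's device gets that down to eight.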
    

Now take a look at the actual Duff's device implementation line by line:

    
    
      register n=(count+7)/8;
    

_n_ is the number of iterations of the copy-8-words unrolled loop. The +7
causes the division by 8 to round up in order to count the extra partial loop
iteration required when _count_ is not a multiple of 8.

    
    
      switch(count%8){
    

_count % 8_ is the number of left-over words that need to be handled with a
partial pass through the loop (rather than a full 8-word pass). The switch will
jump to one of the following case labels in order to skip to a part of the
unrolled copy loop such that exactly the number of left-over words will be
copied.
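
To check the arithmetic of those two lines, here is a small sketch that adds up what each pass copies. The helper names are made up, and it assumes _count_ is greater than 0, as the original code does.

```c
#include <assert.h>

/* Hypothetical helper: number of passes through the unrolled loop, rounded up. */
int loop_passes(int count)
{
    return (count + 7) / 8;
}

/* Hypothetical helper: words copied on the first, possibly partial, pass. */
int first_pass_words(int count)
{
    int r = count % 8;
    return r == 0 ? 8 : r;      /* case 0 enters at the top: a full pass */
}

/* One (possibly partial) first pass plus full 8-word passes after it. */
int total_words(int count)
{
    assert(count > 0);
    return first_pass_words(count) + (loop_passes(count) - 1) * 8;
}
```

For _count_ = 20 this gives 3 passes: 4 words on the first, then 8 and 8, for 20 in total.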

    
    
      case 0:	do{	*to = *from++;
    

This is the first case of the switch, but also the beginning of the unrolled
copy loop. If _count_ happened to be a multiple of 8, then _count % 8_ would
be 0, and the unrolled copy loop would begin executing as normal, identical to
the simplified code above that only handles multiples of 8. Note that C's
switch statement cases are essentially just labels; execution will fall
through to the next case label unless a _break_ is encountered, so after first
entering the _do{}while_ loop, the _case_ statements are essentially ignored.

    
    
      case 7:		*to = *from++;
    

This line is where the real magic begins. If _count % 8_ was not 0, part of
the first iteration of the unrolled loop will be skipped _by jumping into the
middle of the loop_ using the _case_ label matching the number of extra words
required. Then further iterations of the loop will continue as normal, copying
the full 8 words per iteration.

The following case statements all serve the same purpose: to set up a label
for a partial copy loop of the given number of words during the first
iteration.

    
    
      }while(--n>0);
    

This is the end of the unrolled copy loop. _n_ , as calculated above, is the
number of times to run the unrolled loop; remember that it included an extra
iteration (due to rounding up) to account for the partial first iteration if
_count % 8_ was not 0. When execution reaches this _while_ , it will jump back
up to the _do_ (which happens to be in the middle of a _switch_ , but that
doesn't matter anymore): the _case_ labels are ignored and the unrolled loop
will now execute as normal, copying the full 8 words each iteration.

The magic is in reusing the same unrolled copy loop code to also do the
initial partial copy.
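
For reference, here is the whole thing assembled into a function that compiles with a modern C compiler. This is a sketch, not Duff's exact code: the name `duff_copy` is made up, it increments _to_ so the result can be checked in ordinary memory, and it adds a guard for _count_ == 0, which the device itself would mishandle by copying 8 words.

```c
#include <assert.h>

/* Hypothetical memory-to-memory variant of Duff's device. */
void duff_copy(short *to, const short *from, int count)
{
    assert(count >= 0);
    if (count == 0)
        return;                 /* without this, case 0 would copy 8 words */

    int n = (count + 7) / 8;    /* passes through the loop, rounded up */

    switch (count % 8) {        /* jump into the middle of the first pass */
    case 0: do { *to++ = *from++;
    case 7:      *to++ = *from++;
    case 6:      *to++ = *from++;
    case 5:      *to++ = *from++;
    case 4:      *to++ = *from++;
    case 3:      *to++ = *from++;
    case 2:      *to++ = *from++;
    case 1:      *to++ = *from++;
            } while (--n > 0);
    }
}
```

Calling `duff_copy(dst, src, 20)` enters at _case 4_ , copies 4 words, then makes two full passes of 8.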

~~~
revjx
Thank you for taking the time to type this out - I'll study it and hopefully
will be able to make sense of it! I appreciate the effort you put in.

Edit: aaaand that makes sense. I see now why he thought it was a bit of a
hack, but it's still very nifty. Thanks again!

------
redthrowaway
I'm curious, is this still useful on modern architecture? I would have thought
branch prediction would largely negate any performance benefit.

~~~
kd0amg
A superscalar machine should still be able to benefit from loop unrolling by
issuing the instructions for multiple iterations in parallel (though this is
more hardware-dependent than I like to think about when coding). The rollback
machinery in branch speculation hardware can only handle a limited number of
branches, so a loop with a very small body running on a machine with a very
deep pipeline could spend a lot of time waiting on confirmations of correct
branch predictions.

~~~
Tuna-Fish
> by issuing the instructions for multiple iterations in parallel

Unless you actually do something in the loop body, all modern superscalar HW
would be bottlenecked on the writes, making multiple iterations impossible.
For example, SNB has 5 main execution ports -- two agus, 3 alus, and it can do
either of 2 reads or 1 read + 1 write in a single cycle. This means that if
you are storing something every cycle, for each cycle you can do 3
instructions without slowing down the loop. Also, if the jump instruction is
immediately preceded by the alu instruction that generates the flags for it,
they get fused into a single instruction, meaning you don't need to count
the jmp as an instruction.

> The rollback machinery in branch speculation hardware can only handle a
> limited number of branches

This used to be more of a problem before modern PRF cpus -- with PRF, recovery
from branch miss is much cheaper (it's essentially the copy of 17 8-bit
pointers), so Sandy Bridge and later can typically handle more in-flight
branches than their pipeline length, so confirmations of branch predictions
shouldn't ever stall you unless you use division to get the flags or
something.

------
DannoHung
Are memory mapped registers even implemented in any architectures these days?

~~~
allenbrunson
I assume you're referring to the use of the 'register' keyword? well, that's
not "memory mapped registers!" That's a suggestion to the compiler that it
should store a variable in a cpu register, if possible. that means that you
can't take the address of that variable, and so on.

Last I heard, 'register' is pretty much obsolete these days, because there is
no way you can do better at guessing which variables are candidates to be
stored in registers than the compiler can.

~~~
CJefferson
No, he is referring to the fact that the original Duff's device contains:

    
    
        *to = *from++
    

Note that there isn't a '++' on 'to'. Many people think that Duff's device is
for making a fast memcpy (which you would get with *to++ = *from++), but in
fact it was for copying data to a MMR.

Nowadays, most of the loop would be optimised away by the compiler, unless you
marked 'to' as volatile. Assuming you mark 'to' as volatile, this kind of code
is still used to write to MMRs, although they are much less common than they used
to be. The last time I wrote code anything like this was on the Nintendo
Gameboy Advance. I don't know if the DS / 3DS still use similar code.
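
A minimal sketch of what the volatile version looks like, written as a plain loop rather than the unrolled device. The function name `write_port` is made up, and the "register" in the usage below is just an ordinary variable standing in for a real hardware address.

```c
#include <stdint.h>

/* Hypothetical helper: pushes `count` words through one output register.
   `volatile` marks every store as observable, so the compiler may not
   merge or drop any of them. */
void write_port(volatile uint16_t *port, const uint16_t *from, int count)
{
    while (count-- > 0)
        *port = *from++;        /* `port` is deliberately not incremented */
}
```

With something like `volatile uint16_t fake_port;` as the stand-in, `write_port(&fake_port, data, 3)` performs three separate stores and leaves the last word of `data` in `fake_port`.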

