Hacker News new | past | comments | ask | show | jobs | submit login
Duff's device (wikipedia.org)
145 points by simonpure 43 days ago | hide | past | favorite | 59 comments



Don't miss out on the brilliant link at the bottom - Adam Dunkels' Protothreads - Lightweight, Stackless Threads in C also uses nested switch/case statements [1].

I learned about the theory with it, and used his library to implement cooperative multitasking on a bare metal embedded AVR C project. I wrote a task manager and dispatcher based on Protothreads, and that effort was what later made it simple for me to understand the JS async/await paradigm that would appear under my radar a couple years later.

EDIT to add that I think the concepts might still apply, since these days I've seen things like stackless async coroutines discussed for Rust... didn't study those in detail yet, but I wouldn't be surprised if the basic concepts are the same.

[1]: Current website: http://dunkels.com/adam/pt/


Wikipedia entry on protothreads: https://en.wikipedia.org/wiki/Coroutine#Implementations_for_...

The source code is interesting. LC_SET is particularly neat:

  /* WARNING! lc implementation using switch() does not work if an
     LC_SET() is done within another switch() statement! */
  
  /** \hideinitializer */
  typedef unsigned short lc_t;
  
  #define LC_INIT(s) s = 0;
  
  #define LC_RESUME(s) switch(s) { case 0:
  
  #define LC_SET(s) s = __LINE__; case __LINE__:
  
  #define LC_END(s) }


One cool thing you can do is to manage the numbers in the case manually, eg.

  #define LC_SET(s,num) s = num; case num:
This will let you serialize the state s, modify the program, recompile and reload s, and have it resume at the desired point.


You could take this one step further by using the __COUNTER__ macro to generate unique numbers in the preprocessor.

Alas, this isn't a part of the standard (iirc), so you need to depend on compiler-specific features.


What I'm thinking of is modifying it like this

    LC_SET(s,0)
    A
  + LC_SET(s,2)
  + C
    LC_SET(s,1)
    B
When the number for each resume point is explicit instead of automatic, you can edit the source code while controling where any existing suspended coroutines will resume at.


We use Protothreads in our commercial embedded product to handle a complicated communication protocol to a time-of-flight sensor chip over I2C. It works quite well, but since every function in the call stack needs to pass the protothreads variable it becomes rather complicated.


Why not an RTOS at that point? Is FreeRTOS too heavy in your application?


I worked with a recipient of this email for many years, and part of our work involved writing C. It was a real pleasure reading his C, he was a master of the form.

There is something I miss in modern languages...they do not permit the abuse that C permits. The preprocessor, pointer arithmetic, various semantics around types and overflows...one of course sees some horrendous monstrosities in the Linux kernel source...

C as a tool is only an approximation of the underlying machine, as such the tool can be bent to deliver semantics that are wholly surprising. I can think of no mainstream language that has that property. It's a shame.


> I can think of no mainstream language that has that property.

Duff's Device works in D:

    void send(short* to, short* from, size_t count)
    {
        auto n = (count + 7) / 8;
        final switch (count % 8) {
        case 0: do { *to = *from++; goto case;
        case 7:      *to = *from++; goto case;
        case 6:      *to = *from++; goto case;
        case 5:      *to = *from++; goto case;
        case 4:      *to = *from++; goto case;
        case 3:      *to = *from++; goto case;
        case 2:      *to = *from++; goto case;
        case 1:      *to = *from++;
                } while (--n > 0);
        }
    }
The `goto case;` is D's way of saying "fall through to the next case".


Ah that's interesting, didn't know D had goto case statements. Btw, what are some projects you've used D for? What are other surprising lesser known features of it to you?


The biggest project is implementing the D compiler itself in 100% D. Don't have much time for other projects.

Nested functions are a little gem. I stole the idea from Pascal. I've since discovered that they can nicely clean up spaghetti code that C engenders, without losing efficiency.

Nested functions would fit into C seamlessly, I sometimes wonder why the C people don't adopt it. I can only assume they just don't realize how sweet they are, which is understandable because you just have to use them a bit to see it.


How do D’s nested procedures differ from the Obj-C block extension?

The original pascal based Mac toolbox APIs encouraged nested procedures in modal contexts - but Nested procedures couldn’t escape their outer frame.


I have no idea what Obj-C block extensions are.

D's nested functions (like Pascal's) have two frame pointers - a dynamic frame pointer (like C has), pointing to the caller's stack frame, and a static frame pointer pointing to the enclosing function's stack frame.

This allows the nested function to access the enclosing function's variables.

It's a bit tricky to set up, but works a treat.


GNU C has had nested functions for quite a long time. I would use them a lot if they were standardized. https://gcc.gnu.org/onlinedocs/gcc/Nested-Functions.html


D's differs in that:

1. you cannot "goto" out of the nested function

2. you can declare them "static", which means they don't have a static link and cannot access the enclosing function's variables

3. if garbage collection is enabled, you can pass a pointer to the function out of the enclosing function's scope and still safely access its variables

4. they work in `extern (C++)` functions, too


I don't know if you are aware, but you are replying to the creator of D, so it's unlikely there will be many surprising features for him :)


There could be emergent behavior in the distinct possibility that D becomes self-aware. I've had to reduce D's power at least once to thwart that.


There was a self aware version of lisp. You could feed it the program specification and it would write a 100% conforming program itself and run it. Unfortunately first time they fed the spec for a real project it outputted the words 'dear god no' and deleted itself.

And so here we are.



Every time this article is reposted I can't help but laugh at the line:

"C's default fall-through in case statements has long been one of its most controversial features; Duff himself said that 'This code forms some sort of argument in that debate, but I'm not sure whether it's for or against.'"


Read the source to Putty sometime.

The main program is one big, big function with a big, big loop and switch in it. Each place where it does something that would (otherwise) block, it sets a state variable to match a case label right after, does a non-blocking version, and breaks. (This is done with a macro.)

After the break, it stashes the state, and if anything is ready, switches to that state and re-enters the switch. Otherwise, it sleeps waiting for work that can proceed.

As a colleague remarked, "I love it, and hate myself for loving it."


Something like it works for if...else...if...else too. To my horror, this was more readable and maintainable than any alternative I could imagine.

    switch(override){
    default:
        printf("unexpected...\n");
        // fallthrough
    case 0:
        if(z_bits&4 || N_flag){
    case 100:
            handle_100();
        }else if(x_bits & 0x80){
    case 200:
            handle_200();
        }else{
    case 500:
            handle_500();
        }
    }
It's cut down a bit for Hacker News. The original had about a dozen lines of code where each handle_XXX function now sits.


While not a "pure" Duff's Device, Golang apparently does something similar with STOSD for an unrolled memcpy that it can call into at different points. I think it's really cute.

https://twitter.com/mayahustle/status/1260678725257003008


I used one of these for the implementation of Siphash in the Linux kernel, if you're interested in a real world example: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/lin...


Maybe I'm missing something, but that just looks like a regular C switch to me. Yeah, it has fall-through, but all C switches has fall-through unless you explicitly break. The point of Duff's device is that it interleaves two different structures: a switch and a loop.

It's unusual because in structured languages you expect the lexical structure to be entirely hierarchical (you wouldn't have an if block that somehow overlapped with a for-loop, for instance). Duff's device breaks that assumption because it uses one control structure (the switch) to jump into the middle of a second control structure (the loop), and lexically, the switch is both inside the loop and outside of it.

This is why lots of people when they first see it don't think it is even syntactically valid. A regular switch with fall-through doesn't violate the "hierarchical lexical structure" assumption in that way.


This is, oddly, one of the few things I think would actually be clearer in assembly language than it is in C.


You can actually get rid of the confusing nesting:

    switch(n%4)
    {
    loop:
    case 0: stuff();
    case 3: stuff();
    case 2: stuff();
    case 1: stuff();
    if(--n) goto loop;
    }


This is a fun language hack, but be aware that this is probably not good for performance in 2020. Nearly anything larger than a coffee machine microcontroller these days probably has instruction level parallelism and also operates more efficiently on larger data words. Obviously it depends on what you're trying to do but using this for any kind of iterative memory op is probably slow. It's also such an oddball construction that the compiler is not likely to be able to optimize it.


Yeah, I think dma is more common now, memory mapped registers aren't as common anymore.


I have actually used this in production for cooperative multitasking framework for Verifone VX510.

I needed this to run multiple stuff at the same time (for example printing while also doing network -- both serial devices on VX510).

Could have done this with setjmp/longjmp but it would not be as fun:)


> Its discovery is credited to Tom Duff in November 1983, when Duff was working for Lucasfilm and used it to speed up a real-time animation program.

It's mind-boggling to me that something like that can be "discovered". It's certainly "discovered" because it relies on a particular setup that relies on specific low-level features of C. Perhaps I would have denoted it as "invented" but the more I read about it, the more I realize that it probably doesn't qualify as something that was "invented" for the same reasons.


Didn't the whole meta-programming thing with C++ templates start because someone noticed unusual output in an error message.


Would a modern C compiler even permit that?

The article does say " At the time of the device's invention this was the first edition of The C Programming Language which requires only that the body of the switch be a syntactically valid (compound) statement..." which implies it isn't valid now.

Any correct parser based on any sane grammar should instantly choke on that.


Yes, it's correct modern C. I'm not sure what that paragraph is supposed ot mean. AFAICT there is no difference between old and modern C in how the cases can appear in the body of the switch. Here is the first edition of The C Programming Language on switch:

> switch (expression) statement

>The usual arithmetic conversion is performed on the expression, but the result must be int. The statement is typically compound. Any statement within the statement may be labeled with one or more case prefixes as follows

> case constant_expression :

https://archive.org/details/TheCProgrammingLanguageFirstEdit...


I can’t at all support this with a source, but I had THOUGHT the register keyword needed to make this fast was largely up to how the compiler decides to use it. That register can’t be required to be register vs ram, but it’s encouraged to be IF the compiler can do it.

So I thought the downside was you can declare “register” but it might not obey but then definitely can not use the & to get the memory address by using they keyword. Which on portable code register isn’t recommended.

So I suppose it’s not entirely fair to say all modern or all past support this. Right?


You're tangling a bunch of stuff together in a confusing mess. Let's take one thing at a time.

1. You're right that you can't use the & operator on a variable that was declared using the register keyword.

2. However, if you look very carefully at the example code, it actually follows this rule. In fact, it doesn't use the & operator at all. There's a lot of pointers all over the place, but no usage of the & operator.

3. You're also right that the register keyword is just a hint that the compiler is free to ignore.

4. Expanding on the previous point: modern compilers usually will ignore it, but in the olden days, compilers were not so good at register allocation, so this keyword was needed if you wanted speeds competitive with assembly. Since this is code from the 80s, no surprise that it uses this keyword.

5. Now that we've got all that out of the way: the register keyword is a complete distraction here, and has nothing at all to do Duff's device. Since modern compilers ignore it anyway, you can take the original example and just delete all uses of the word "register". The point of Duff's device is how to do manual loop unrolling without then having a special "pre-loop" or "post-loop" which takes care of "leftovers" because your array size was not divisible by 8, or 4, or whatever unroll factor you happened to use.


It's a very fun historical article, but it's no longer worth it on modern CPUs, where it actually harms performance.


It's more a comment about the syntax of the C switch statement than about micro-optimization.


There's two kinds of C programmers: those familiar with Duff's device, and those who wouldn't expect Duff's device to even compile.


I'm familiar with it and I still don't think it should compile...


I'm disappointed that gcc's nested function extension doesn't let me use a switch to have a second entry point for the inner function.

I also can't seem to get there with a plain goto. I can't jump out either. Probably the computed goto extension will do the job, but that is cheating.


Using computed goto would (probably) make the code compile, but it wouldn't actually work. That's would be heading directly into undefined behavior territory (aka land of the nasal demons).

To give just one or two examples of why this is doomed: imagine you want to jump "inward". If the inner function takes any arguments, you haven't passed them! And if you jump inward, and the inner function then hits a return statement, do you return to the outer function, or to its caller?

And after you resolve all such semantic issues, you need to actually implement all this, without slowing down code which doesn't use this feature.

And all this complexity doesn't buy you much because you have two other options:

1. just call the function to jump inwards, and simply return to jump out.

2. Don't bother with nested functions, and just inline the code.


So it turns out I'm kind of wrong! My arguments for jumping "inward" still seem pretty valid, but they don't generalize to jumping "outward". And indeed, GCC actually allows this, but you need to use the right syntax:

https://gcc.gnu.org/onlinedocs/gcc/Nested-Functions.html


> There's two kinds of C programmers: those familiar with Duff's device, and those who wouldn't expect Duff's device to even compile.

There's also the C programmers that come from assembler for whom the construction expressed by Duff's device is very natural. They would have a hard time believing that somebody find it surprising. For those people, it is obvious that the code should compile and its meaning is clear, as can be mapped directly to the compiled code.


Sounds like a benchmark is in order!


What's not worth it? Loop unrolling? Something else?


Does duff's device create irreducible control flow? I couldn't tell you off the top of my head but seems likely and if it does that's a good reason to avoid as some modern compilers do not like irreducible control flow.


Yes, it creates a loop with multiple entry points, which is irreducible. (I think it's even one of several equivalent definitions of irreducibility.)


Putting the code into Godbolt the control flow

      switch (count % 8) {
turns into

          jmp     [QWORD PTR .L4[0+rdx*8]]
just like you'd do if you were doing the assembly by hand.


Yes that’s how gcc implements a switch. But what’s they got to do with irreducibility?


'irriducible'?


It comes from 'reduce' - I think it's spelt 'irreducible', but maybe you're American and I'm British.


Made me smile. No, I'm questioning the meaning here. I have a vague inkling it's a property of the dataflow graph of a program, I'm just struggling to remember what. Is it literally that repeated collapses of basic node types (if/while/sequence) will end up with a single node? Basically 'does it properly nest'? Or is it something else? (and I'm a brit too BTW)


Yes it means properly-nested, or really the same thing as 'structured' as in 'structured programming.' I'm sure there's some graph-theoretic definition but I don't know it off the top of my head.



modern cpus rarely do i/o that way anymore


When I was in college people had text files of tricks like this that they would send around. Over time you built up a library of these little hacks that could make a real difference especially in rendering or graphics applications.


Please don't use Duff's device. Look at how your favorite c std lib implements memcpy and think about that rather than this.


Duff's device does not do memcpy. It's writing all the elements of a string to the same location in memory, where there's a port.




Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact

Search: