

Golang Duff's devices - lucio
http://luciotato.svbtle.com/golangs-duffs-devices

======
pslam
This is probably a bad idea and I would bet (and probably win) that less
aggressive unrolling would be better in pretty much all cases _except a micro-
benchmark_.

This kind of trick worked well on CPUs in the 1980s (as jvoorhis points out in
another comment). It works wonder for micro-benchmarks. It is a net loss for
any CPU and workload where instruction cache miss rate can severely impact
performance. That is to say, pretty much everything made after 1990s and doing
more than just shuffling memory around.

The problems are: 1) this is a large fraction of I-Cache size, 2) unrolling
doesn't actually save much (or sometimes anything) on out-of-order multi-issue
cores.

First off: extremely aggressive unrolling results in increasing I-cache
thrashing. This is usually not apparent in micro-benchmarks, because the
entire working code set fits in the I-cache. It becomes apparent if the
unrolling is stupid enough to be larger than I-cache (or usually just close
enough to it), because then you're spending just as much (usually more) time
re-fetching instructions from memory as you are moving data around. That's why
picking the best result from a micro-benchmark is misleading. In real use
cases, you find that functions like this one tend to cause a "global slow
down", with no specific function accounting for it, because they're greatly
reducing the effectiveness of the I-cache.

Also, unrolling doesn't really do much these days. You can do better than REP
MOVSD and friends - go read Intel's documentation for their recommended
sequences (which has a nasty habit of changing every damn CPU generation).
Modern out-of-order, multi-issue cores can very easily saturate their
load/store units, while also executing all those pesky loop counters and
branches. In fact, that's the whole point of being out-of-order and multi-
issue: keeping units busy.

So, please can everybody stop using Duff's device, unless: 1) you're targeting
a CPU which has no I-cache (e.g tightly coupled ROM/RAMs), 2) you really do
have a working set where this fits well enough to be better, 3) you're trying
to cheat on benchmarks, or 4) you're trying to obfuscate your binaries against
reverse engineering.

~~~
eloff
Rep movsd is only recommended for large clears/copies by Intel. Indeed the
commit message states:

REP MOVSQ and REP STOSQ have a really high startup overhead. Use a Duff's
device to do the repetition instead.

------
dilmuff
This is just aggressive loop unrolling. Duff's device is specifically the
syntactic curiosity in C/C++ that allows implementing unrolls with a switch
and inner while statement.

~~~
sago
The authors of this code seem to be using "Duff's device" to mean the unroll,
plus the mechanism of jumping into the unrolled code. I agree, both are not
Duff-specific in assembly. And anyway, not as impressive without the
case...do. It still amazes me that Duff's version works.

~~~
eridius
Duff's version will _always_ work for as long as compilers remain compatible
with ANSI C. It's a clever trick that takes advantage of the precise
definition of a switch in the C standard (as opposed to taking advantage of
compiler-specific implementation details).

~~~
sago
It _still_ amazes me it works, not it amazes me it _still_ works. I just mean
that even knowing why and how it works it still looks like it is malformed.

------
jvoorhis
Is loop unrolling truly helpful in either case? CPUs have changed a lot since
the 1980s when Duff's was invented..

~~~
cremno
The commit message contains some numbers:

[https://github.com/golang/go/commit/6c7cbf086c34ebb88311ba12...](https://github.com/golang/go/commit/6c7cbf086c34ebb88311ba12d3a75adcbdce8ac8)

~~~
vinkelhake
I like the hundreds of repeated lines. I wonder when they are going to
rediscover macro assemblers.

~~~
prattmic
The Go build system uses the Plan 9 assembler, which is explicitly simplistic
to discourage the use of assembly except when there is a very compelling case
for it.

~~~
coldtea
Or because they couldn't be bothered, and the "discourage" part is just the
excuse...

------
jbandela1
This illustrates one of the issues with "high-level" languages: when you need
to do something that the language does not allow, you have to resort to
horrible hacks.

I thought the C version of Duff's device was bad, but this is atrocious. This
is non-portable copy-pasta on steroids.

This is one of the reasons why people still use C and C++. When you need to
tune something, you can write it using the same language and nicely
encapsulate it.

~~~
koffiezet
Except that this is C code (well at least some dialect of...) with ASM
embedded...

