In fact, the example in the message is not implementable as memcpy, nor is any computer likely to have an memcpy-like idiom that implements it
REP OUTSW.
Reading a lot of literature from this period in computing, it seems that the majority of "serious programmers" were focused mostly on mainframes/minicomputers, and there was a divide between those people and the ones who were focused almost exclusively on the PC/microcomputers. No one thought x86 would become as commonplace as it is today.
Be careful when you use this thing that you don't unwind the loop so much that you overflow your machine's instruction cache.
Saving cache has become more important with the growing gap between core and memory latency, and processors have gotten much better at predicting and executing "around" loops (especially for superscalar/OoO), so much that in my experience loop unrolling is very seldom beneficial to a great extent today. You may still see improvements if you microbenchmark, but the extra code has displaced something else from the cache and made the overall result slower.
It's always worth rereading this when it comes up just for the last line:
"Many people (even bwk?) have said that the worst feature of C is that switches don't break automatically before each case label. This code forms some sort of argument in that debate, but I'm not sure whether it's for or against."
Cool to look at the original work on this from 1988, and see just how little it's changed in the nearly 3 decades since.
Go for instance makes use of Duff's devices generated for each architecture in its core runtime. You can see the code to generate the devices here - https://golang.org/src/runtime/mkduff.go
Been there, done that :) It's a neat trick, but the issue for me is that you still need to pass a separate struct for state which breaks the illusion enough that you might as well just write an ordinary iterator.
See the NeWS PostScript interpreter code I posted in this thread.
NeWS has multiple PostScript threads, which must be able to block and re-start in the middle of parsing PostScript expressions or anything else that might block, so the entire interpreter has a switch statement for dispatching PostScript operators, and any of those operators which might block have "goto" labels so the interpreter can jump back into the middle of the operator when the threads gets switched back in.
The most complicated operator is the PostScript parser itself (the "file_execution" operator), which blocks in many places, and whose C source code was broken out into a .h file with the comment "be charitable and think of this as a procedure call".
This is actually quite similar to how C# and other languages implement co-routines under the hood. It's just written by hand in portable C.
Here's the comment explaining it:
/*
* The PostScript interpreter has a fairly simple basic structure:
*
* while (1) { 1. Pick up an object to interpret 2. Execute it }
*
* Part 1, picking up an object to interpret is handled by the statement that
* starts "switch (es->type)". es points to the top of the execution stack: a
* stack of sources of executable objects. The switch determines what kind of
* thing the object is being fetched from: an array, a string, ...
*
* Part 2, executing an object, is handled by the code that starts with the label
* "execute_top". It does some tests and switches on the type of the object
* being executed.
*
* All of this is complicated by the mechanism for blocking. If the process
* needs to block, say by needing to read a new buffer from a socket or by
* needing to wait for a message, then the interpreter needs to stop working
* on this process and leave information around that lets it restart. This is
* the "restart_state" variable: it is used as the index to a switch statement
* that is executed when the interpreter starts up, so it won't necessarily be
* intering the main loop of the interpreter at the top. There are cleaner
* ways of doing this, but they can't be written in C!
*/
NeWS had some spectacular abuses of C preprocessor macros with nested switch statement, loops and gotos.
Especially the conic splines code by Vaughan Pratt and James Gosling, which "flattens" Bezier curves into conic splines (unlike straight lines like most PostScript renderer implementations), which Vaughan Pratt's conix package can draw really fast.
SunDew - A Distributed and Extensible Window System
[...] Two pieces of work were done at SUN which provide other key components of the solution to the imaging problems. One is Vaughan Pratt's Conix [53], a package for quickly manipulating curve bounded regions, and the other is Craig Taylor's Pixscene [63], a package for performing graphics operations in overlapped layers of bitmaps. [...]
[...] Pixscene is based on a shape algebra package. The ability, provided by Conix, to do algebra very rapidly on curves should make non-rectangular windows perform well. [...]
/*-
Convert a conic arc to a curve
arctochain.c, Tue Jun 11 14:23:17 1985
"Elegance and truth are inversely related." -- Becker's Razor
Vaughan Pratt
Sun Microsystems
This code is a version of the conix package that has been
heavily massaged by James Gosling.
*/
/* enter conic machine at appropriate state */
rem0 = size.y;
if ((size.x > 0) == (side < 0))
goto EOD;
else
goto OED;
/* OE state machine */
ODE:trans1(across, OED);
DOE:trans2(across, ODE, >, OED);
OED:trans2(down, ODE, <, DOE);
/* EO state machine */
EDO:trans1(down, DEO);
DEO:trans2(across, EDO, >, EOD);
EOD:trans2(down, EDO, <, DEO);
final:;
/*- clearasil(buf0+1, bitlen); /* clean out 0011/1100 pimples */
assert(ypos <= curve->x + (curve->y1 - curve->y0 + 1));
The main loop of the NeWS PostScript interpreter itself is the realization of Duff's dire warning "Actually, I have another revolting way to use switches to implement interrupt driven state machines but it's too horrid to go into."
/*
* This is the PostScript interpreter. Its implementation breaks almost every
* rule of structured programming. This is a result of needing to support
* multiple PostScript processes and a paranoia about performance. There are a
* couple of #includes that you should think of as procedure calls. Life is
* tough in the trenches.
*
*/
/*
* The PostScript interpreter has a fairly simple basic structure:
*
* while (1) { 1. Pick up an object to interpret 2. Execute it }
*
* Part 1, picking up an object to interpret is handled by the statement that
* starts "switch (es->type)". es points to the top of the execution stack: a
* stack of sources of executable objects. The switch determines what kind of
* thing the object is being fetched from: an array, a string, ...
*
* Part 2, executing an object, is handled by the code that starts with the label
* "execute_top". It does some tests and switches on the type of the object
* being executed.
*
* All of this is complicated by the mechanism for blocking. If the process
* needs to block, say by needing to read a new buffer from a socket or by
* needing to wait for a message, then the interpreter needs to stop working
* on this process and leave information around that lets it restart. This is
* the "restart_state" variable: it is used as the index to a switch statement
* that is executed when the interpreter starts up, so it won't necessarily be
* intering the main loop of the interpreter at the top. There are cleaner
* ways of doing this, but they can't be written in C!
*/
[...]
/*
* Restart the interpreter where it left off last
*/
switch (ee->restart_state) {
case 1:
goto read_1;
[...]
case 24:
goto read_24;
}
/* The main loop of the interpreter */
while (1) {
struct object new;
[...]
/* Pick up an object to execute */
switch (es->type) {
[...]
case file_execution:{
register PSFILE *f = es->env.file.file;
/* be charitable and think of this as a procedure call */
#ifdef GPROF_HOOKS
/* gP_GetFile defined in parse_file.h */
#endif
#include "parse_file.h"
break;
}
/*-
* Parse a PostScript object from a PSFILE
*
* parse_file.h, Mon Dec 31 09:36:02 1984
*
* James Gosling,
* Sun Microsystems
*/
/*
* This is a version of getc that pauses if the buffer is empty and suspends
* the process
*/
REP OUTSW.
Reading a lot of literature from this period in computing, it seems that the majority of "serious programmers" were focused mostly on mainframes/minicomputers, and there was a divide between those people and the ones who were focused almost exclusively on the PC/microcomputers. No one thought x86 would become as commonplace as it is today.
Be careful when you use this thing that you don't unwind the loop so much that you overflow your machine's instruction cache.
Saving cache has become more important with the growing gap between core and memory latency, and processors have gotten much better at predicting and executing "around" loops (especially for superscalar/OoO), so much that in my experience loop unrolling is very seldom beneficial to a great extent today. You may still see improvements if you microbenchmark, but the extra code has displaced something else from the cache and made the overall result slower.