
Cooperative Threading - tjalfi
https://byuu.net/design/cooperative-threading
======
gpderetta
There is a section about pipeline stalls caused by switching stacks, but that
is misleading. Switching stacks per se is not a huge penalty: the CPU [1]
might need to add synchronization uops to reconcile the stack engine's view
of the stack pointer with the actual stack pointer, but the performance cost
is minimal and often effectively free.

A big penalty is due to (failed) branch prediction: a coroutine switch is
essentially an indirect branch, so the predictor might need help (for example,
using a different branch instruction address for each coroutine type). Also,
the call instruction used to enter the coroutine switch function is not paired
with a ret instruction, which messes with the specialized call/return predictor.
Changing the final jmp instruction in the switch function to push addr; ret
actually makes things worse; the best solution is to inline the switch function
in the caller via inline assembler (this also helps with the previous issue
and, if the compiler provides the functionality, allows saving only the
registers that are actually in use).

edit: last time I ran a synthetic benchmark of one of my own coroutine
implementations, the switch performance was constrained only by the number of
taken branches the CPU could issue (one every other cycle):
https://github.com/gpderetta/delimited/blob/master/benchmark.cc

[1] I'm talking about the typical x86 cpu.

~~~
byuu
You could well be right. In my own tests, the ESP swap is what tanked
performance. A more limited stackless test did not have nearly the same
overhead. And indeed, the largest speedup came from setting ESP as early as
possible, getting the return address into EAX, and finally jumping to it.
CALL+RET with a late ESP set behaved poorly, leading me to believe it was due
to the CPU stalling on the ESP change.

I did also try inlining this, but it was much less portable and actually
performed slower in my emulator. I won't claim that will be the case for
everyone, of course. YMMV.

~~~
avdicius
Wow. I've used my own stack-switch routine for several years now and didn't
realize it could be improved until I read this thread. Thank you very much.

This is my old naive version:

https://github.com/ademakov/MainMemory/blob/master/src/base/arch/x86-64/cstack-switch.S

This is what I have now:

https://github.com/ademakov/MainMemory/blob/cstack-switch-revamp/src/base/arch/x86-64/cstack.h

~~~
gpderetta
You are missing x/y/zmm regs in the clobber list.

~~~
avdicius
Yes, that's true. However, in my app this has never caused any problems.
Apparently, even when the compiler auto-vectorizes something, it does so only
with scratch regs or in leaf procedures. I don't have any FP or SIMD code of
my own.

------
ShroudedNight
> Basically it requires dropping down to the assembler-level for each
> supported architecture to implement the context switching, as modifying the
> CPU context registers and stack frame directly is not permitted by most sane
> programming languages.

I thought that was (essentially) the definition of longjmp? Thinking further,
it seems the initial setup of additional stacks would require at least taking
the covers off setjmp and interacting with its implementation details
directly.

[https://en.wikipedia.org/wiki/Setjmp.h](https://en.wikipedia.org/wiki/Setjmp.h)

~~~
jjjordan
There's an implementation using setjmp/longjmp in his library here [1]. It
uses sigaltstack to assign a newly allocated stack to the coroutine. Marc
Lehmann's libcoro [2] library does the same.

[1]
[https://github.com/byuu/higan/blob/master/libco/sjlj.c](https://github.com/byuu/higan/blob/master/libco/sjlj.c)

[2]
[http://software.schmorp.de/pkg/libcoro.html](http://software.schmorp.de/pkg/libcoro.html)

~~~
gpderetta
As a historical note, the sigaltstack trick was popularized, if not invented,
by the GNU Pth library.

------
lebuffon
The efficiency of cooperative tasking was already visible in the 1970s, when
machines were notoriously slow. In the example below, it was roughly 8x better.

https://www.forth.com/resources/forth-programming-language/

"a PDP-11 or Nova could be expected to support up to eight users, although the
performance in a system with eight active users was poor.

On this hardware, Moore’s Forth systems offered an integrated development
toolkit including interactive access to an assembler, editor and the high-
level Forth language, combined with a multitasking, multiuser operating
environment supporting 64 users without visible degradation..."

------
artemonster
I wonder whether writing the whole system in Verilog and synthesizing a
cycle-accurate simulator with Verilator would yield better results...

~~~
ginko
I work for a silicon IP company and the verilog simulation of our product is
about a factor of ten slower than our "cycle accurate" software model.

The FPGA version is about 40 times or so faster than the model and the actual
silicon is another 10 times or so faster than that.

~~~
bhouston
How cheap are FPGAs that could simulate a SNES or similar?

Any chance we standardize and commoditize fpgas so that it can be just another
device on a laptop or phone that anyone could use? Like bluetooth or a GPU or
gps?

This has to happen at some point but how far off from that are we now?

I assume there are different grades of FPGAs, which likely complicates
things; it's like desktop GPUs being way, way better than phone GPUs, yet
both run OpenGL/Vulkan and essentially the same code.

~~~
gpderetta
I know of FPGA-based implementations of whole Amigas [1], and I do not think
they are particularly expensive. It is very likely that SNES reimplementations
also exist already.

Edit: this [2] is a full system for 220 euros and in addition to Amiga it
emulates a bunch of 8 bit consoles.

[1]:
[https://en.wikipedia.org/wiki/Minimig](https://en.wikipedia.org/wiki/Minimig)

[2]: https://amigastore.eu/en/358-mist-midi-fpga-computer-with-midi-add-on.html

------
helltone
According to the article, being preempted by the kernel is too expensive
because of the userland<>kernel context switch. I wonder if it's possible to
implement the same preemptive mechanism purely in userland, for example with a
timer alarm and a signal handler using swapcontext or similar, to achieve
better performance?

~~~
cfallin
That would still involve entering and leaving kernel mode, because timer
alarms ultimately are triggered by a hardware interrupt (from e.g. the APIC
timer or, in really old x86 machines, the 8254 programmable interval timer,
aka IRQ 0). So IRQ -> switch to kernel timer IRQ handler (prologue saves all
usermode registers) -> examine internal data structures, find pending alarm
signal -> modify process state, pushing signal frame and setting RIP to signal
handler -> return to userspace. That's probably slower than just Timer IRQ ->
switch to kernel -> invoke scheduler -> decides to preempt, changes to new
thread -> restore context, return to user space (but I'd be curious if someone
actually measured).

Fundamentally, preemption has to originate from a hardware IRQ (or IPI from
another core), so really the only way would be to kernel-bypass by setting an
IRQ handler in userspace (ring 3). That's technically possible on x86 (IDT
entry can have a ring-3 code segment) but I don't think the kernel has a
mechanism for that...

~~~
nitrogen
How do user-mode network drivers work? Do they rely on polling, or do they
still have the kernel handle interrupts?

------
saagarjha
What’s with the mprotect here:
https://github.com/byuu/bsnes/blob/master/libco/x86.c ?

~~~
lilyball
Good question.
[https://github.com/byuu/bsnes/blob/bd8e94a7c7cbfdf7ac0b7f24c...](https://github.com/byuu/bsnes/blob/bd8e94a7c7cbfdf7ac0b7f24ce06fa219e7e5974/libco/settings.h#L3-L6)
suggests there are reasons why the section(text) approach might not work, and
[https://github.com/byuu/bsnes/blob/6b7e6e01bb025bf5cbeb92e65...](https://github.com/byuu/bsnes/blob/6b7e6e01bb025bf5cbeb92e65a492121070fb996/libco/libco.c#L1-L7)
explicitly says it doesn't work with clang, though I don't see why there's not
some way of solving this without mprotect that works everywhere.

~~~
gpderetta
But why not simply use a separate asm file instead of embedding the
preassembled binary as a constant?

~~~
byuu
Because for many years libco has supported _all_ C89 compilers. I didn't want
a dependency on GNU as on Windows, nor did I want to write an MSVC variant.
There is no technical reason it cannot be inlined today, especially now that
Clang supports a compatible asm syntax.

------
aidenn0
I do wonder why byuu wrote his own library rather than using one of the half-
dozen existing ones. Presumably there is something about this use-case that
makes a dedicated library useful.

~~~
byuu
There are probably hundreds of these libraries. Almost everyone chooses to
just write their own.

I wanted one because, in 2007, no project existed that was laser-focused on
maximizing performance (I switch threads tens of millions of times a second in
my emulators), was lightweight enough (I implement my own schedulers for my
emulators), and was portable enough (I am not aware of any CPU architecture
libco currently won't run on).

The library is extremely small, and being in control of it allows me to adapt
and support new targets directly as needed.

------
grawlinson
The cool thing about libco is that it's also used in LXD. Quite a novel use
for this library!

------
axilmar
Do we really need all this complexity? How about, if we have, say, N chips to
emulate, doing a simple loop and calling 'emulate' for each chip?

    
    
        for(EmulatedChip &chip : chips) {
            chip.emulate(time);
        }
    

Why would the above not be suitable for emulation?

~~~
joppy
I think this approach is the "state machine" approach given in the article.
The added complexity in moving to a coroutines/co-operative threading model is
balanced by the drastic code reduction, pointed out in the article by "Yes,
it's really that dramatic of a difference".

~~~
fwsgonzo
Yes, you inevitably end up implementing a state machine like that.

~~~
byuu
I consider it something of a trap in fact.

You start writing a new emulator for the first time, dispatching entire
instructions in one go, and so you don't even really need a state machine at
all.

Then you start trying to get the timing better with opcode-cycle granularity,
and in comes a separate state machine for every instruction.

Then you realize you need clock-cycle granularity to fix certain edge-case
games (e.g. emulating the effects that occur during bus accesses), and
suddenly the prospect of a state machine for every cycle of every instruction
becomes overwhelming.

You would then realistically be stuck with the choice to either stop improving
your accuracy or rewrite things cooperatively.

Since I'm focusing byuu.net articles toward aspiring emulator developers, I
thought it would be an important topic to cover.

------
amelius
> Unlike coroutines, each cooperative thread has its own stack frame.

This seems to be incorrect. See:

https://stackoverflow.com/questions/28977302/how-do-stackless-coroutines-differ-from-stackful-coroutines

> In contrast to a stackless coroutine a stackful coroutine can be suspended
> from within a nested stackframe.

It seems the author is trying to reinvent stackful coroutines, and calls them
"cooperative threads".

~~~
byuu
No one really agrees on the exact terminology. There are stackful and
stackless coroutines, continuations, green threads, fibers, cooperative
threads, etc. Generally, when people talk about coroutines without
qualification, they mean the stackless variety. I personally consider a
coroutine with a stack frame to be a cooperative thread, but understandably
not everyone will agree on naming. The nomenclature hopefully should not
affect the broader message of the article, though.

In any case, I am not so much trying as I already have. C++ does not have
stackful coroutines. I implemented them via my libco library and have been
using them for over a decade now.

