
Improving Ruby Fibers - ioquatix
https://www.codeotaku.com/journal/2018-06/improving-ruby-fibers/index
======
edsiper2
we are using libco[1] in our project[2]; it provides a good abstraction across
architectures and offers different backends using ucontext, setjmp, and native
register management for x86 and x86_64.

[1] [https://byuu.org/library/libco/](https://byuu.org/library/libco/)

[2] [https://fluentbit.io](https://fluentbit.io)

~~~
ioquatix
I like `libco`.

The API is a bit more heavyweight than what I submitted to Ruby. That's
because `libco` avoids an assembler by dynamically loading binary data into an
executable section at runtime. In my opinion that's an unnecessary performance
cost; the benefit is avoiding an assembler at compile time.

I think it's better to use code compiled natively by the normal toolchain;
it's a simpler design.

~~~
byuu
It's a very tough tradeoff. You're not the first to take aim at it, and that's
fair.

Visual C++, GCC/Clang, and Intel C++ have quite different syntax for declaring
inline assembler, and I did not wish to burden libco users with having to use
a specific compiler. Further, it can prove ... challenging to get GCC/Clang to
produce "naked" functions (without stack BP/SP adjustments), which is critical
for libco's co_switch routine; I recall having trouble with a less popular OS.
Using an external assembler like yasm or GNU as adds further dependencies on
specific tools.

It is a _very_ minor performance penalty to have the co_swap thunk, but it is
non-zero, so I respect your decision. But do note that libco supports many
architectures (without going pointlessly obscure). It'd be a shame to lose the
chance to cooperate on architecture support by duplicating our work over
something like this.

With or without this, coroutines will always be _slaughtered_ compared to a
simple stackless state machine function, but they really shine and (in my
opinion) are worth their overhead when you need to switch tasks in the middle
of nested function calls.

...

    
    
        # Put the first argument into the return value
        movq %rdi, %rax
        # We pop the return address and jump to it
        ret
    

So if you look at libco:
[https://gitlab.com/higan/higan/blob/master/libco/amd64.c#L26](https://gitlab.com/higan/higan/blob/master/libco/amd64.c#L26)

You'll notice that I get the return address into RAX as soon as possible.
Believe it or not, this makes a real difference in performance. It allows the
CPU to start fetching instructions after the JMP/RET even sooner than if you
have it at the bottom of the function as you do.

The choice between push/pop and mov [rsp] for handling the non-volatile
registers isn't really important. I found the latter slightly more performant
on Athlon 64 CPUs, and a wash on Intel CPUs.

Preserving the signal mask isn't so critical, but preserving the SSE registers
probably is, and is only necessary on Windows (the Win64 ABI makes xmm6-xmm15
callee-saved). Again, your decision, but I would preserve them in your case;
it can lead to really nasty surprises if you don't.

~~~
gpderetta
> You'll notice that I get the return address into RAX as soon as possible.
> Believe it or not, this makes a real difference in performance. It allows
> the CPU to start fetching instructions after the JMP/RET even sooner than if
> you have it at the bottom of the function as you do.

The "execution" of the jump happens well before the `pop rax` is executed.

The real difference is using a jump instead of a ret. The latter will always
be mispredicted, as the CPU's return-address predictor (the stack engine) will
always get it wrong, while the indirect-branch predictor used by jmp has a
chance to get it right.

~~~
ioquatix
Do you both mind elaborating a bit on this? I have to admit CPU instruction
scheduling is not really my area... I'd love to hear more.

~~~
byuu
It's been about ten years since I made that code change, so I'm a bit fuzzy on
the details, but in my mind I really don't see how "pop rax; jmp rax" is going
to be faster than "ret". I really feel like it was important to get the return
address into rax as soon as possible.

That, plus the ability to save/restore the xmm registers for the Win64 ABI
(there is no push/pop for xmm registers), is why I used mov/movaps instead. I
do recall it testing faster on Athlons as well, but that was so long ago as to
be irrelevant now, most likely.

Nostalgic! :3

In any case, I'd recommend benchmarking both under an application that tries
to do as many cothread switches as possible per second. Theory's one thing,
seeing the results in practice is always another.

~~~
gpderetta
> I really don't see how "pop rax; jmp rax" is going to be faster than "ret".

As I mentioned earlier, this is due to jump prediction. jmp rax (or any
register, really) uses the indirect-jump predictor, while ret uses a special
dedicated predictor that uses a stack (the Return Stack Buffer, or RSB),
populated by call instructions, to predict the return address. In the
coroutine case, the ret does not jump to the address of the last call, so it
will be mispredicted many times. The general indirect predictor, while not
guaranteed to get it right, at least has a chance to predict the target,
especially when you have a small set of fibers calling each other in a
deterministic sequence. In the general case, with dozens of fibers calling
each other from random locations, there is no chance of successful prediction
either way.

In principle a CPU could have a meta-predictor that chooses between the
general indirect predictor and the RSB, but this does not appear to be the
case for current CPUs (they only switch to the indirect predictor when the
return stack buffer underflows).

Fibers are generally problematic for the return predictor: even if the
pop/jmp sequence is used, the call into the coroutine-switch function will
permanently damage the RSB, so some subsequent ret will eventually be
mispredicted. For example:

    
    
    Fiber 1:
        call foo
            call coro_switch
                jump fiber 2
    Fiber 2:
        call coro_switch
            jump fiber 1
    Fiber 1:
        ret  // return from foo, always mispredicted
    

In my coroutine library I have been experimenting with writing coro_switch in
inline assembler, so there is no call to it (this only works if your target
language supports inline assembler). Then if a fiber switches to another fiber
(from any function nesting depth) which later switches back from the same
nesting level, the predictor is not damaged. I think this is worth doing to
optimize for coroutines used as generators.

BTW, I measured these effects multiple times.

Regarding populating the register well ahead of the jump: it is an interesting
idea, but I do not think it matters much in practice. First of all, jumps and
calls are 'executed' (predicted) at the fetch stage of the pipeline, about 10
clock cycles earlier than the stage that would execute the pop; because of OoO
execution it might even be much earlier. Also, performing a jump never has a
dependency on anything, as it is always executed under speculation, so whether
'pop rax' has completed has no impact. The register is only used much later to
confirm or roll back the speculation.

~~~
byuu
I gave this a try:

[https://hastebin.com/raw/ipodigizey](https://hastebin.com/raw/ipodigizey)

There's no difference between an inline assembler call function and a fully
unrolled, inlined thread swap that avoids any calls at all. There's only an
infinitesimal difference between these two versions and the libco method of
using a byte array and a function thunk to invoke it in a compiler-independent
way.

Further testing in bsnes (where tens of millions of thread switches occur per
second) reveals absolutely no observable difference in performance.

It's a fun idea to toy around with this, but in the end any of these methods
are going to be almost identical. You might as well pick whichever you like
and go with that.

~~~
gpderetta
One (but not both) of the coroutines needs to do a return after invoking coro
swap to see the advantage of avoiding the call; otherwise the additional call
instruction is pretty much zero cost.

Inline assembler does have another advantage, though: you can use extended-asm
clobbers instead of saving registers manually. At the very least the compiler
can do a better job of scheduling instructions around the switch (which is
admittedly marginal on an OoO CPU), but in some cases (like this specific
artificial test) it can omit the saves completely. You should be able to see a
coroutine jump every other cycle this way (i.e. the theoretical limit, as most
CPUs can only perform a taken jump that often).

At the end of the day it depends on what you are aiming for: if a task does a
reasonable amount of work between coroutine jumps (i.e. 'just' lightweight
threads), the canonical implementation is perfectly fine. My end goal is to
make coroutines usable as generators, so probably even inline assembler is not
enough there, as ideally you would like the compiler to optimize, inline and
vectorize across coroutine calls.

------
crb002
IMHO Ruby was better off in 1.8 with green threads: one Ruby VM per core,
using MPI for message passing.

~~~
ioquatix
There are two proposals on the table to return to such a design:

[https://bugs.ruby-lang.org/issues/13618](https://bugs.ruby-lang.org/issues/13618)

[https://bugs.ruby-lang.org/issues/14736](https://bugs.ruby-lang.org/issues/14736)

I'm not sure green threads are a great design for primarily concurrent IO:
they have all the cognitive overhead of threads, which is simply not necessary
for primarily cooperative workloads. Unless you have several network cards, IO
is going to be serialised at the hardware level anyway.

