
Fibers are the right solution to improve Ruby performance - justinhj
https://www.codeotaku.com/journal/2018-11/fibers-are-the-right-solution/index
======
coder543
> fibers can context switch billions of times per second on a modern
> processor.

On a 3GHz (3 billion hertz) processor, you expect to be able to context switch
_billions_ of times per second?

I would probably accept millions without question, even though that might be
pushing it for a GIL-ed runtime like Ruby has. But, unless your definition of
"context switch" counts every blocked fiber that's passed over for a context
switch as being implicitly context switched to and away from in the act of
ignoring it, I find this hard to believe.

It takes more than one clock cycle to pick the next task to resume and then
actually resume it; at 3 GHz, a billion switches per second would leave only
about three cycles for each one.

~~~
erdewit
It must be off by 3 or 4 orders of magnitude. Python with asyncio does about
120_000 switches per second with a trivial fibre/coroutine here:

    
    
      import time
      import asyncio
    
      REPS = 1_000_000
    
      async def coro():
          for i in range(REPS):
              await asyncio.sleep(0)
    
      t0 = time.time()
      asyncio.run(coro())
      dt = time.time() - t0
      print(REPS / dt, 'reps/s')
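
For comparison, a roughly equivalent Ruby measurement (a sketch of mine, not from the article): raw `Fiber#resume`/`Fiber.yield` round-trips, with no scheduler involved.

```ruby
# Measure raw fiber context switches: each #resume enters the fiber,
# each Fiber.yield hands control back, so one iteration is two switches.
REPS = 1_000_000

fiber = Fiber.new do
  loop { Fiber.yield }
end

t0 = Process.clock_gettime(Process::CLOCK_MONOTONIC)
REPS.times { fiber.resume }
dt = Process.clock_gettime(Process::CLOCK_MONOTONIC) - t0
puts "#{(REPS / dt).round} resume/yield round-trips per second"
```

On typical hardware this lands somewhere in the millions per second, consistent with the "millions, not billions" intuition above.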

~~~
ovi256
Golang seems to do around 350k. There's a chance I'm missing some tricks, but
the code is so short it's not probable I'm missing that much.

See:
[https://repl.it/repls/GrimyChiefGame](https://repl.it/repls/GrimyChiefGame)

    
    
      package main
    
      import (
      	"fmt"
      	"time"
      )
    
      func coro(reps int) {
      	for i := 0; i < reps; i++ {
      		go time.Sleep(0 * time.Nanosecond)
      	}
      }
    
      func main() {
      	REPS := 5000000
      	start := time.Now()
      	coro(REPS)
      	dt := time.Since(start)
      	fmt.Printf("The call took %v to run.\n", dt)
      	fmt.Printf("REPS / duration %v\n", (REPS*1e9)/int(dt))
      }

~~~
coder543
I don't think you're measuring context switching.

If I remember correctly, Go's scheduler has a global queue and a local queue
per worker thread, so when you spawn a goroutine it probably has to acquire a
write lock on the global queue.

Allocating a brand new goroutine stack and doing some other setup tasks has a
nontrivial overhead that has nothing to do with context switching, regardless
of global locks.

To properly benchmark this, I think I would start with just measuring single
task switching by measuring how long it takes main to call
[https://golang.org/pkg/runtime/#Gosched](https://golang.org/pkg/runtime/#Gosched)
in a loop a million times. This would measure how quickly Go can yield a
thread to the scheduler and have it be resumed, although this includes the
overhead of calling a function.

Then I would launch a goroutine per core doing this yield loop and see how
many switches per second they did in total, and then launch several per core,
just to ensure that the number hasn't changed much from the goroutine per core
measurement.

Since Go's scheduler is not bound to a single core, it should scale pretty
well with core count.

I might run this benchmark myself in a while, if I find time.

~~~
coder543
I wrote my own quick benchmark:
[https://gist.github.com/coder543/8c1b9cdffdf09c19ef61322bd26...](https://gist.github.com/coder543/8c1b9cdffdf09c19ef61322bd26d2e44)

The results:

    
    
        1 switcher:    14_289_797.08 yields/sec
        2 switchers:    5_866_478.94 yields/sec
        3 switchers:    4_832_941.33 yields/sec
        4 switchers:    4_604_051.57 yields/sec
        5 switchers:    4_268_906.99 yields/sec
        6 switchers:    3_982_688.58 yields/sec
        7 switchers:    3_799_103.41 yields/sec
        8 switchers:    3_673_094.58 yields/sec
        9 switchers:    3_513_868.07 yields/sec
        10 switchers:   3_351_813.00 yields/sec
        11 switchers:   3_325_754.64 yields/sec
        12 switchers:   3_150_383.56 yields/sec
        13 switchers:   3_037_539.31 yields/sec
        14 switchers:   2_435_807.77 yields/sec
        15 switchers:   2_326_201.72 yields/sec
        16 switchers:   2_275_610.57 yields/sec
        64 switchers:   2_366_303.83 yields/sec
        256 switchers:  2_400_782.51 yields/sec
        512 switchers:  2_408_757.26 yields/sec
        1024 switchers: 2_418_661.29 yields/sec
        4096 switchers: 2_460_257.29 yields/sec
    
    

Underscores and alignment added for legibility.

It looks like the context switching speed when you have a single Goroutine
just completely outperforms any of the benchmark numbers that have been posted
here for Python or Ruby, as would be expected, and it still outperforms the
others even when running 256 yielding tasks for every logical core.

The cost of switching increased more with the number of goroutines than I
would have expected, but it seems to become pretty constant once you pass the
number of cores on the machine. Also keep in mind that this benchmark is
completely unrealistic. No one is writing busy loops that just yield as
quickly as possible outside of microbenchmarks.

This benchmark was run on an AMD 2700X, so, 8 physical cores and 16 logical
cores.

~~~
ioquatix
I wrote an addendum [https://www.codeotaku.com/journal/2018-11/fibers-are-the-
rig...](https://www.codeotaku.com/journal/2018-11/fibers-are-the-right-
solution/context-switching)

With C++/assembly, you can context switch about 100 million times per second
per CPU core in a tight loop.

~~~
coder543
The one additional comment I have is that this addendum doesn't involve a
reactor/scheduler in the benchmark, so it excludes the process of selecting
the coroutine to switch into, which is a _significant_ task. The Go benchmark
I posted above is running within a scheduler.

But, I appreciate the addendum.

~~~
ioquatix
So, that's a good point, and yes, the scheduler will have an impact, probably
several orders of magnitude in comparison.

That being said, a good scheduler is basically just a loop, like:

[https://github.com/kurocha/async/blob/bee8e8b95d23c6c0cfb319...](https://github.com/kurocha/async/blob/bee8e8b95d23c6c0cfb31992491e4e9e1527db15/source/Async/Reactor.cpp#L63-L68)

So, once it's decided what work to do, it's just a matter of resuming all the
fibers in order.

Additionally, since fibers know what work to do next in some cases, the
overhead can be very small. You sometimes don't need to yield back to the
scheduler, but can resume another task directly.
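
In Ruby this direct fiber-to-fiber hand-off is what `Fiber#transfer` provides. A minimal sketch (mine, not from the linked code) of two fibers passing control to each other without ever re-entering a scheduler:

```ruby
# Symmetric coroutines via Fiber#transfer (core in Ruby >= 3.0;
# older versions need `require "fiber"`). Control jumps directly
# between fibers; no central scheduler is involved.
log  = []
main = Fiber.current

f2 = nil
f1 = Fiber.new do
  log << :f1
  f2.transfer        # jump straight into f2
  log << :f1_again   # continues here when f2 transfers back
  main.transfer      # hand control back to the main fiber
end

f2 = Fiber.new do
  log << :f2
  f1.transfer        # jump straight back into f1
end

f1.transfer          # kick things off
log << :main
p log  # => [:f1, :f2, :f1_again, :main]
```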

------
vemv
I'm a bit hesitant about this article's proposition, although I'm open to
being convinced.

In the end there's a GIL. For pure CPU-bound workloads, your only true source
of parallelism will be putting more processes into the mix, be it via forking
or simply spawning more processes from scratch.

But inside a given process, it seems to me that no matter what you do
(fibers/threads), only 1 CPU can do CPU-bound work at a time.

If I were to design a high-performance Ruby solution, I'd ditch forking,
threading and fibers. I'd focus instead on creating low-memory-footprint,
fast-startup processes (read: no Rails!) that one can comfortably spawn in
parallel, be it in a single server, or in a cluster (think k8s). Max Ruby
process count: 1 per core.
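
A minimal sketch of that process-per-core layout (my illustration, not the parent's code; `Etc.nprocessors` reports the logical core count, and `fork` is unavailable on Windows):

```ruby
require "etc"

# One OS process per logical core: forked processes don't share a GIL,
# so CPU-bound work runs with true parallelism.
pids = Etc.nprocessors.times.map do |i|
  fork do
    # worker i's CPU-bound work would go here
  end
end
pids.each { |pid| Process.wait(pid) }
puts "ran #{pids.size} workers"
```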

~~~
repsilat
Most Ruby web app workloads aren't CPU-bound. And when they are, you're
using a language that is 100 to 1000 times slower than C.

That is, for a CPU-bound task, you might be able to replace your thousand node
k8s cluster with one machine running code written in a faster language (before
even getting into communication and load balancing and HA and all that crap.)

~~~
timdorr
But abandoning everything you've written in one language for the sake of a
single hot path isn't really a good idea either.

It's probably a more sensible idea to use a native extension to optimize just
that hot path. For instance, the Rust bindings are _really_ good:
[https://github.com/tildeio/helix](https://github.com/tildeio/helix)

~~~
repsilat
Sometimes. If your profile is "lumpy" enough that can be true, but often the
performance problems of slow languages manifest as annoyingly flat death-by-a-
thousand-cuts profiles. (Especially after the low hanging fruit has been
picked.)

Grinding another 5x out of a flat profile can be hard work requiring more
creativity and producing less readable code than a (fairly mechanical)
translation. Luckily these tasks also tend to crop up when design has
stabilised and team sizes are increasing, meaning it can often coincide with
the benefits of static typing becoming greater.

It's not a panacea, and it's obviously both risky and costly. Shrugs.

(I've also seen the other case, FWIW. Wrote some C extension code called from
a Rails app, and it was the right call at the time and we got great mileage
out of it.)

------
karmakaze
Title is actually "Fibers are the Right Solution" and the article ends with
"Fibers are the best solution for composable, scalable, non-blocking clients
and servers."

The title as posted is most definitely not true. I have worked on high-volume
Ruby applications where using async I/O combined with Ruby's poor raw
execution speed resulted in excess garbage to be collected. Ruby
performance should mean the performance of executing Ruby code. Perhaps the
title could have been 'Ruby webapp performance'.

~~~
ioquatix
> I have worked on high-volume Ruby applications where using async I/O
> combined with Ruby's poor raw execution speed resulted in excess garbage to
> be collected.

What async I/O approach did you use and where did the garbage come from? I/O
buffers?

~~~
karmakaze
It was an API endpoint making a number of DB and other network requests. The
garbage was the accumulation of temporary and response objects that were held
for the long duration of processing, up to half of which could be in the Ruby
code, not waiting on I/O. If the code ran faster there would be more time for
GC and for handling the incoming request volume.

------
bhuga
> We show that fibers require minimal changes to existing application code and
> are thus a good approach for retrofitting existing systems.

I disagree that 'minimal changes to existing code' is a good goal for a Ruby
parallelism solution. The large Ruby codebases I have dealt with have gigantic
hacks to get around the GIL: complex fork management systems, subprocess
management, crazy load-this-but-not-that systems for prefork memory management
to optimize copy on write, and probably more. Parallel tasks in Ruby are a
nightmare to work with.

Changing existing code _should_ be a goal of any Ruby parallelism solution. If
we can't get rid of this kind of cruft, what are we even doing?

I still love Ruby, but I want go-style channels, not apologies for the GIL.

~~~
olivierlacan
That’s the plan with Guilds:
[http://www.atdot.net/~ko1/activities/2018_rubyconf2018.pdf](http://www.atdot.net/~ko1/activities/2018_rubyconf2018.pdf)

These are the slides for Koichi Sasada’s RubyConf 2018 (last week) talk
updating the community on his progress in the design and implementation of
Guilds.

Surprising that the author isn’t aware of this planned CRuby feature announced
in 2016. I wrote about it then: [https://olivierlacan.com/posts/concurrency-
in-ruby-3-with-gu...](https://olivierlacan.com/posts/concurrency-in-
ruby-3-with-guilds/)

Take this with a grain of salt of course because the design has clearly
changed since then and I need to upgrade my post or do a follow-up.

~~~
ioquatix
Hi, I'm the author, I'm also a Ruby core committer, and yes, I'm aware of
Guilds. They are a model for parallelism and don't really affect concurrency at
all. At best, they provide a 3rd option on top of processes and threads.

~~~
olivierlacan
That makes sense. Sorry for the assumption.

Guilds felt like an omission in your post, since they (should) address one of
the points you make about the usability/ergonomics of existing Ruby APIs for
managing non-sequential execution.

But it’s definitely a bit early to tell what Guilds will actually look like as
a final product.

~~~
ioquatix
Guilds are just a fancy name for threads or processes at this point. What they
ultimately end up being is yet to be seen :)

------
devxpy
Fibers are basically just async tasks with multicore support, right?

~~~
coder543
Fibers are basically just async tasks where each task looks and feels like a
normal, synchronous thread to the programmer.

It doesn't really imply anything about how parallel (or not) the fibers
execute.
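
A tiny illustration (mine, not coder543's) of that "looks synchronous" property in Ruby: the fiber body reads top-to-bottom, and each suspension point is just an expression that later returns a value.

```ruby
fiber = Fiber.new do
  a = Fiber.yield(1)   # suspend; when resumed, this expression
                       # evaluates to the value passed to #resume
  Fiber.yield(a + 1)
end

r1 = fiber.resume      # runs until the first yield
r2 = fiber.resume(41)  # resumes with 41, so a = 41
p [r1, r2]  # => [1, 42]
```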

~~~
devxpy
So it is preemptively switching then?

> It doesn't really imply anything about how parallel (or not) the fibers
> execute

What's all the fuss about parallelism in the article about then?

~~~
coder543
> So it is preemptively switching then?

Any time you call a blocking function that the system provides, it should
immediately yield the fiber, which makes it _look_ preemptive. If you write a
loop that just spins forever, that could block the whole system, potentially,
making the abstraction leaky. In a language like Ruby, they could definitely
add some true preemption, but I don't know if that's what they plan to do.

From the article: "Fibers, on the other hand, are semantically similarly to
threads (excepting parallelism), but with less overhead." So, the author
definitely isn't implying parallelism of the fibers.

> What's all the fuss about parallelism in the article about then?

The author was talking about different methods for handling more than one
request at a given time, which include forking and threads. With Ruby's GIL,
threads are a lot less attractive than they could be. A good fiber
implementation can handle tons of network requests concurrently and very
efficiently even on a single core, which is the case being discussed here.

At the end, the author discusses a hybrid approach of forking and fibers,
where each processor core would have a fork of the Ruby program running, and
each fork would have its own fiber pool, running many tasks concurrently.

In languages that don't have a GIL, forking is _rarely_ a tool that I reach
for. It really hurts your database pooling and causes all sorts of other small
problems, but it's a common trade-off when using Ruby, Python, and Node.

~~~
pooppaint
> Any time you call a blocking function that the system provides, it should
> immediately yield the fiber, which makes it look preemptive.

In the old days we would call this _cooperative_ to contrast it from
preemptive. This is the essence of cooperative, yielding at explicit points be
they IO request, timers, or waiting on a message queue. Preemptive used to
mean a certain thing and this is not it at all.

~~~
coder543
Cooperative multitasking typically implies (to me, at least) that the
programmer is required to explicitly / manually yield their task, which is
annoying, error-prone, and isn't required here. The system's blocking
functions will handle that behind the scenes.

Fibers _are_ cooperative here, but not from the programmer's point of view,
and that's an important distinction to make. If you write the same code for a
cooperative system as you would for a preemptive system, is there really any
difference to the programmer? It _looks_ preemptive. If anything, properly
implemented cooperative systems are more efficient. Most of the time when
people ask the question that is asked higher in the thread, I believe they're
worried that they will be responsible for remembering to yield control.

I'm pretty sure I did a decent job in my previous comment of explaining that
the system only _looks_ preemptive, and that it is possible to block it with
some uncooperative code, so I'm not sure what point you're trying to make.

~~~
Xixi
Not the person you are responding to.

It's a matter of point of view, but to me cooperative/preemptive is a property
of the underlying scheduler, not of what the programmer is _usually_ exposed
to. As you correctly pointed out, it is possible to block the scheduler with
uncooperative code. It's not even hard: it takes just one heavy CPU-bound
computation. I write these kinds of computations every day: if you sell me a
system as preemptive and it's not, I will get angry...
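
A small Ruby sketch (mine) of exactly that failure mode: under a naive round-robin loop, a fiber with no yield points runs all of its work in a single turn while the cooperative one waits.

```ruby
order = []

polite = Fiber.new do
  2.times { |i| order << [:polite, i]; Fiber.yield }
end

greedy = Fiber.new do
  2.times { |i| order << [:greedy, i] }  # CPU-bound, never yields
end

# A naive round-robin "scheduler": resume each runnable fiber in turn.
queue = [polite, greedy]
until queue.empty?
  f = queue.shift
  f.resume
  queue << f if f.alive?
end

p order  # => [[:polite, 0], [:greedy, 0], [:greedy, 1], [:polite, 1]]
```

The scheduler only regains control at `Fiber.yield`, so `greedy` monopolizes the thread for its entire loop.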

------
magwa101
I've coded multiprocess, multithread, callback and context switching on
multiple OSes and languages. Golang nails it, a bed of green threads scheduled
onto CPU threads with isolation. The foundation is correct.

~~~
ioquatix
It's the perfect approach, except users must deal with mutual exclusion, which
is tricky for most developers.

------
zeveb
Seems to me that one interesting approach would be to automatically rewrite a
program in continuation-passing style, and then use callbacks (the lowest-
overhead approach) where the callback is the continuation.

It _ought_ to compile pretty cleanly, since a continuation can end up being
compiled into an address.

~~~
ioquatix
I've thought about this too, and it might be the route C++ takes. That being
said, to resume you need to wind through all your function calls executing
parts of the state machine, so it might actually be slower.

That being said, I'm sure there are different ways to implement it and I look
forward to proposed faster/efficient implementations.

------
adamnemecek
Cooperative switching hasn’t been this cool since 1985! /s

~~~
mikekchar
I've been trying hard not to be snarky :-) It's pretty amazing how old things
become new as a combination of 1) people not knowing about it and 2) it being
given a new name. The first one is a bit sad and speaks volumes about our
collective lack of education as an industry. I gladly include myself in that
indictment, by the way. I only console myself with the fact that the things I
don't know (but should) could probably more than fill my (mostly useless) 4
year CS degree. Of course I graduated a _long_ time ago... I'm a bit worried
that as people get into this industry in less traditional ways we are
gradually losing sight of the fact that there really _is_ something important
to learn.

------
kljdlwjfkewe
I'm not trying to be snarky but writing a performant server or web server in
userland Ruby doesn't seem to be good engineering. We already have nginx &
apache. If you want to write performant servers you probably want something
like Go, Rust, C++, C, D, JVM, BEAM.

------
srean
I am glad that fibers and coroutines are going mainstream the way they are.
With sincere and profuse apologies to Ruby fans (lest they think this is an
attempt at hijacking the discussion) let me share this:
[https://felix.readthedocs.io/en/latest/fibres.html](https://felix.readthedocs.io/en/latest/fibres.html)
I feel some HN'ers would be curious, as there aren't many languages that offer
fibers as a first-class citizen.

BTW I am in no way involved in the development of Felix. It's not a new
language; it's more than 15 years old and has had fibers from the start.

------
mamcx
BTW, is there an easy explanation of how to implement this for a language?

I asked in
[https://www.reddit.com/r/ProgrammingLanguages/comments/9xsya...](https://www.reddit.com/r/ProgrammingLanguages/comments/9xsyad/how_provide_in_an_interpreter_iteratosgenerators/)
and got a nice answer, but it's also of the style "just read this code and get
it".

------
ridiculous_fish
How does context switching occur wrt the native stack? Does ruby avoid using
the native stack, or does this save and restore it?

------
Tharkun
Are these similar in implementation to the upcoming Java Fibers? Those
basically yield automatically whenever they perform a blocking operation, so
another fiber can be executed.

~~~
aardvark179
Ruby fibers yield control to the fiber that spawned them or transfer control
to another specific fiber explicitly. That's close to the Continuation model
which Loom's fibers are built on top of, but Loom's fibers hide that process
of yielding from the user, instead concentrating on providing a more general
scheduling API and yielding at certain points inside the core library.

There are quite a lot of other small differences as well since Ruby fibers are
always run on the same thread which originally spawned them, whereas Loom's
continuations and fibers can be resumed on a different thread.

It is quite possible to implement Ruby fibers in terms of Loom's
continuations, and we've done a prototype of this in TruffleRuby (I think
Charles Nutter has done one in JRuby as well), and it certainly allows us to
support very large numbers of active fibers.
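
The same-thread restriction is easy to demonstrate in CRuby (a sketch of mine, not from the parent): resuming a fiber from a different thread raises `FiberError`.

```ruby
fiber = Fiber.new { Fiber.yield }
fiber.resume  # first resumed on the main thread

err = nil
Thread.new do
  begin
    fiber.resume  # CRuby rejects this: "fiber called across threads"
  rescue FiberError => e
    err = e
  end
end.join

p err.class  # => FiberError
```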

------
exabrial
Types are the right solution to fix Ruby performance. Crystal is showing how a
language can be simple and performant!

~~~
Maledictus
Ruby has types. Not everyone enjoys working with statically typed languages.

~~~
gxigzigxigxi
It doesn’t really matter what you _enjoy_ if we are talking about what is
needed to make a given language faster. That being said, Node.js on V8 is a lot
faster than Ruby and dynamically typed, so I do agree with your conclusion
that static typing is not necessarily the answer.

~~~
k33n
Ruby has always put developer experience before performance. However, both are
improving with each release. Give it time.

~~~
cnasc
Getting a bit off topic here, but you sound knowledgeable about Ruby. Ruby is
a language I've always admired from afar, but never spent much time trying
out. I've read the poignant guide, which was amusing but didn't teach me too
much. What's the SICP of Ruby? As in, a high quality book that will teach me
the ins and outs of the language.

~~~
canhascodez
Ruby is a heavily idiomatic language. It's strongly advised to use rubocop or
a similar style guide -- and to temper that with good judgment, at least when
it recommends avoiding `!!`.

In addition to the Pickaxe book, I recommend _Metaprogramming Ruby_ by Paolo
Perrotta, and potentially the sequel to that book (which I have yet to read).

Ruby lends itself to a very fluid style. One of the things that you may find
less common is explicit variable assignment within a method: most of the time
your assignment will be in method signatures and block declarations. The
following code is illustrative of technique, but not recommended practice:

    
    
      puts(Enumerator.new do |y|
        loop { y.yield Array.new(rand(1..18)) { [*'a'..'z'].sample }.join }
      end.lazy.reject(&IO.readlines('/usr/share/dict/words', chomp: true).method(:include?)).first(10).join(' '))
    

This generates "non-words", which are guaranteed not to exist in the (Unix)
system dictionary, without using explicit assignment. First it creates a
(lazy) generator object which yields random "words" of varying length. In
Ruby, if you are calling a method with only one argument in an inner loop, you
can avoid writing the loop explicitly, which is nice here because it also
avoids the performance hit of reading the dictionary file repeatedly. The
`method` method lets us pass an object-method pair, to be evaluated later, and
the '&' there is a sort of cast-to-block, and you'll see that used in other
contexts. So, at that point, we have a lazily-evaluated collection of lazily-
filtered strings, and we can take an arbitrary number of these out and print
them.
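
The `&`-plus-`Method` idiom is easier to see on a small input (my example): `&` calls `#to_proc` on the object, and a bound `Method` converts to a proc that invokes itself on each element.

```ruby
words = %w[cat dog emu]
dict  = %w[cat dog]

member = dict.method(:include?)  # a bound Method object
p words.reject(&member)          # => ["emu"]

# equivalent explicit-block form:
p words.reject { |w| dict.include?(w) }
```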

The nice thing about Ruby is that you can probably express what you want in
one statement. This does come at the cost of a fair amount of idiom. Some of
it is functionally important, some of it is convenient (like using the `*`
operator to destructure ranges), and some is pure window dressing, but
enforced by convention just the same. The Pickaxe book is better than anything
else that I am aware of for describing Ruby idioms. I'm not sure how well it
has aged. It's probably recommended to do a lot of pair programming and code
review. At times I have mentored others on the website Exercism, and I would
recommend that or a similar site.

~~~
dragonwriter
> The Pickaxe book is better than anything else that I am aware of for
> describing Ruby idioms. I'm not sure how well it has aged.

The last version covered Ruby 1.9/2.0; the language has evolved since 2.0, so
it's certainly not complete and current.

~~~
canhascodez
I had decided against a somewhat stronger statement; I didn't want to bash
Pickaxe unnecessarily, particularly as I don't have a lot of better
suggestions. I'm fairly inclined to write a book myself to address the
situation, but not any time soon.

------
mfourcade
beware before using fibers because it won't access Thread.current[:vars].

So if you plan on going with fibers on an existing app, double check first if
Thread.current[:vars] is a must-have for you & your dependencies (ex: the I18n
gem).

~~~
spariev
Are you sure about that? If you search `Ruby Thread Locals are also Fiber-
Local` there's a blog post from 2012 about that, and the code sample
works fine for me on Ruby 2.5.

~~~
mfourcade
I'm sorry, my previous comment was not clear. I should have written beware
before using fibers because it won't access Thread.current[:vars] _as you
might expect_.

Here is an example:

    
    
      fiber = Fiber.new do # a Fiber gets its own, initially empty fiber-local storage
        puts "1. #{Thread.current[:test]}"
        Thread.current[:test] = false # set in the fiber's own storage, won't leak out
        puts "2. #{Thread.current[:test]}"
      end
    
      Thread.current[:test] = true
      fiber.resume
    
      puts "3. #{Thread.current[:test]}"
    
    

Output:

    
    
      1. 
      2. false
      3. true
    
    

So fibers come with their _own stack_, including their own fiber-local
storage, but that storage starts out empty rather than being inherited from
Thread.current :/ Also, writing Thread.current[] inside the Fiber won't apply
outside it.

I was confused, and not alone:
[https://github.com/rails/rails/pull/30361](https://github.com/rails/rails/pull/30361).

So my conclusion about Fiber is: always be careful with Fiber / Thread.current.

edit: style/explanation
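
Worth noting alongside this: since Ruby 2.0 there are also true thread-locals, `Thread#thread_variable_set`/`#thread_variable_get`, which are shared across all fibers of a thread, whereas `Thread#[]` is fiber-local despite its name. A small demonstration (mine):

```ruby
Thread.current[:fiber_local] = 1                      # fiber-local, despite the receiver
Thread.current.thread_variable_set(:thread_local, 2)  # genuinely thread-local

seen = nil
Fiber.new do
  seen = [Thread.current[:fiber_local],                       # not visible here
          Thread.current.thread_variable_get(:thread_local)]  # visible here
end.resume

p seen  # => [nil, 2]
```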

------
snek
tl;dr context switching gives good perf

------
evantahler
And Ruby becomes Node!

------
xutopia
I've been using Ruby and Rails for about 12 years now and I see all the
solutions people come up with as bandages over a wound. A normal Rails app
starts off with Node and Redis installed. If, like me, you hop over to the
Elixir world, it's refreshing to see the language you are using do all the
stuff multiple tools did, and do it faster.

