
Go's work-stealing scheduler - signa11
https://rakyll.org/scheduler/
======
geodel
The original design document for the scheduler [1]. Its author, Dmitry Vyukov,
is well known for his lock-free data structure implementations.

1.
[https://docs.google.com/document/d/1TTj4T2JO42uD5ID9e89oa0sL...](https://docs.google.com/document/d/1TTj4T2JO42uD5ID9e89oa0sLKhJYD0Y_kqxDv3I3XMw/edit)

------
twoodfin
One missing piece for me: How does the Go scheduler ensure it won't end up
fighting the underlying OS scheduler? Are thread-affinity APIs enough?

Does the scheduler have any awareness of NUMA or NUMA-like topologies (I'm
thinking here of hyperthreading) that could influence scheduling decisions?

~~~
jlouis
An interesting tidbit from the Erlang world: we have CPU affinity controls
(erl(1) options +sbt {db,tnnps}) that spread schedulers (Erlang's rough
equivalent of a P) over a NUMA-like topology and keep track of the relative
cost of handoff/stealing between CPUs. Other options control whether you want
load compaction or load balancing (+scl, +sub). Some workloads are more
efficient if you keep the work mostly on a few cores rather than spreading it
over all available cores in the system, because communication is faster that
way. If one core can process all the work and get back to an idle state
quickly, that usually beats spreading the load out, thanks to caching.

For some workloads and hardware configurations, the gain from doing this is
quite high, because you avoid schedulers jumping between cores and, in the
process, thrashing caches and TLBs.

OTOH, when running in a typical cloud environment, you are at the mercy of the
underlying hypervisor and thus your load profile is different. In that case,
affinity has little to no effect.

------
pron
Go's work-stealing scheduler is crude in comparison to Doug Lea's, which is a
work of artful mechanical sympathy:
[https://github.com/netroby/jdk9-dev/blob/master/jdk/src/java...](https://github.com/netroby/jdk9-dev/blob/master/jdk/src/java.base/share/classes/java/util/concurrent/ForkJoinPool.java#L173)

~~~
geodel
Here's a reasonable take on the Java Fork/Join framework's shortcomings [1].
One of them is excessive complexity, which of course is very popular in the
Java world, and yet the parent comment calls the Go version 'crude', so
apparently it's not good enough.

"The F/J framework classes -

1) have many, many levels of inheritance,

2) nested classes on top of nested classes,

3) instance variables used directly by other classes (known internally as
“representation-level coupling among classes”), code from the Hackers Delight
without comments about what it does,

4) homegrown deques and queues instead of standard Java™ Classes, and so much
more."

1.
[http://www.coopsoft.com/ar/CalamityArticle.html](http://www.coopsoft.com/ar/CalamityArticle.html)

~~~
scott_s
These are arguments against the implementation, not against the actual utility
of the framework. One can have an ugly internal implementation (which is
sometimes unavoidable because you're solving hard problems) but clean
interfaces. Regardless, Doug Lea anticipated these complaints in the comments
to his code:

    
    
         * Style notes
         * ===========
         *
         * Memory ordering relies mainly on VarHandles.  This can be
         * awkward and ugly, but also reflects the need to control
         * outcomes across the unusual cases that arise in very racy code
         * with very few invariants. All fields are read into locals
         * before use, and null-checked if they are references.  This is
         * usually done in a "C"-like style of listing declarations at the
         * heads of methods or blocks, and using inline assignments on
         * first encounter.  Nearly all explicit checks lead to
         * bypass/return, not exception throws, because they may
         * legitimately arise due to cancellation/revocation during
         * shutdown.
         *
         * There is a lot of representation-level coupling among classes
         * ForkJoinPool, ForkJoinWorkerThread, and ForkJoinTask.  The
         * fields of WorkQueue maintain data structures managed by
         * ForkJoinPool, so are directly accessed.  There is little point
         * trying to reduce this, since any associated future changes in
         * representations will need to be accompanied by algorithmic
         * changes anyway. Several methods intrinsically sprawl because
         * they must accumulate sets of consistent reads of fields held in
         * local variables.  There are also other coding oddities
         * (including several unnecessary-looking hoisted null checks)
         * that help some methods perform reasonably even when interpreted
         * (not compiled).
    

I find the "Calamity" article disingenuous. If you read more of the comments,
Lea clearly explains the source of many of his data structures and gives an
overview of how it all works.

~~~
cooper6
The article was written in 2010, just before the release of JDK 1.7. The
comments above are from the JDK 9 release in late 2017; Doug has had some time
to improve them. Perhaps you should look into the Calamity part 2 article to
see how well this "framework", with its internal structure problems, performs
in JDK 1.8 streams.

------
teacpde
> Each M should be assigned to a P. Ps may have no Ms if they are blocked or
> in a system call. At any time, there are at most GOMAXPROCS number of P. At
> any time, only one M can run per P. More Ms can be created by the scheduler
> if required.

So if every new goroutine I spawn does an I/O operation that takes a long time
to finish, the scheduler will essentially spawn the same number of OS threads,
because every other one is blocked and no 'spinning' thread is available? If
so, creating new OS threads seems like a lot of overhead compared to running a
single thread in an event loop.

~~~
Matthias247
That only happens for IO operations that really block at the syscall level,
i.e. where no async version is available. For the most important IO operations
(sockets), the asynchronous variants are used internally. That means if a
read/write cannot be performed immediately, the goroutine registers with the
netpoller component (a wrapper around epoll/kqueue/IOCP) for IO status
updates, then yields and waits until IO becomes possible again. The current
thread can execute another goroutine in the meantime. So no extra OS threads
are required for goroutines blocked on the network. Other blocking OS
operations (file IO, for example) do require extra OS threads.

------
snnn
TensorFlow has the same thing, via Eigen's work-stealing thread pool.

------
signa11
There is also this paper:
[http://supertech.csail.mit.edu/papers/steal.pdf](http://supertech.csail.mit.edu/papers/steal.pdf),
which describes something similar.

~~~
scott_s
For interested readers, there's a bunch of papers on the Cilk runtime's
work-stealing scheduler; the paper signa11 pointed to is one of many, and they
are classics in the runtime literature. Two other great papers are "Cilk: An
Efficient Multithreaded Runtime System" by Blumofe et al., PPoPP 1995,
[http://supertech.csail.mit.edu/papers/PPoPP95.pdf](http://supertech.csail.mit.edu/papers/PPoPP95.pdf)
and "The Implementation of the Cilk-5 Multithreaded Language", Frigo et al.,
PLDI 1998,
[http://supertech.csail.mit.edu/papers/cilk5.pdf](http://supertech.csail.mit.edu/papers/cilk5.pdf)

------
omginternets
>if not found, poll network.

Does this mean network IO is only performed when there are no idling
goroutines? That can't be right... what am I not understanding here?

~~~
chrisseaton
I don't know if the article is accurate or not, but I think you do have it
backwards from what the article says.

I think it says that the network is only polled when there are no runnable
(not idle) goroutines.

~~~
loppers92
The author of this article is on the Go team
([https://github.com/rakyll](https://github.com/rakyll)), so I'd guess the
article's accuracy is fine.

------
Matthias247
> Ps may have no Ms if they are blocked or in a system call.

Is it really this way, or the other way around? I would have guessed that a
thread blocked in a system call is represented by an M which is not currently
scheduled by the scheduler, i.e. that M isn't assigned to a P.

~~~
samuell
I'm not super sure what was meant here either, but I can add one point of
info, FWIW: Go creates extra OS threads for blocking system calls. In fact,
sometimes more than one thread per call (for taking care of stdin and stdout,
I think).

I have been able to create at least 4999 goroutines with blocking system
calls, though (two OS threads were created per goroutine/syscall), so it's not
a limit you will run into quickly.

~~~
stcredzero
_FWIW: Go is creating extra OS threads for blocking system calls._

One point for house Go!

------
euyyn
Interesting!

