
NUMA-aware scheduler for Go - signa11
https://docs.google.com/document/u/0/d/1d3iI2QWURgDIsSR6G2275vMeQ_X7w-qxM2Vp7iGwwuM/pub
======
billhathaway
There is a new small thread[0] on golang-dev about someone from Intel looking
into this. It would be great to see the go scheduler be more aware of NUMA
characteristics.

[0] [https://groups.google.com/d/msg/golang-dev/ARthO774J7s/7D9P00XhAQAJ](https://groups.google.com/d/msg/golang-dev/ARthO774J7s/7D9P00XhAQAJ)

~~~
stonogo
Intel needs _everything_ to be NUMA-aware. They're betting a lot of money on
Xeon Phi, and once the self-booting KNL machines are out, nobody will want to
deal with the PCIe cards any more.

~~~
Jweb_Guru
As far as I know, the Phi doesn't actually require NUMA-awareness at all (at
least, the older models didn't; see
[https://arxiv.org/pdf/1310.5842v2.pdf](https://arxiv.org/pdf/1310.5842v2.pdf)).
A Phi lives on a single socket with a coherent L2 cache, and remote L2
accesses are not much slower than main memory ones, nor does core distance
along the interconnect seem to affect access time. The new models with lots of
main memory are going to be used with six-DIMM slot DDR4 sockets (64 GB each
of DDR4, in addition to 16 GB MCDRAM to get even more absurd bandwidth for
pure FLOPS / benchmark / coprocessing workloads; see
[http://www.intel.com/content/www/us/en/processors/xeon/xeon-phi-processor-x200-product-family-datasheet.html](http://www.intel.com/content/www/us/en/processors/xeon/xeon-phi-processor-x200-product-family-datasheet.html)), in order to avoid having
to split the Phi up into multiple NUMA domains.

So, I have no idea why Intel would care at all about making stuff NUMA-aware
for the purpose of Phis. Cache-aware, sure, but that's pretty much required
for good performance on modern machines already. What they _would_ care about
is making everything vectorize properly, since Phis do horribly if you aren't
exploiting the VPU; hence, you'd think they'd be more interested in adding
badly missing SIMD support to Go than NUMA-aware scheduling.

(Please let me know if I'm wrong and there's a multi-socket Phi announced, but
I've been following it really carefully because I'm excited about the
possibilities of using the new KNLs for main-memory databases, and I have yet
to hear anything about that).

~~~
martinpw
There is no multi-socket Phi - I asked about it at an Intel booth at a
conference a while back and was told the delta between memory bandwidth and
inter-socket bandwidth would be so great that it would not be a useful
configuration.

I believe the talk of NUMA refers to the single socket behaving like a cluster
with up to 4 NUMA domains, but I can't find any good references right now.

~~~
Jweb_Guru
Ah, interesting; I hadn't read that anywhere. From the limited reading I just
did, it does seem like that's a configuration they offer, but from the scant
sources available I can't quite figure out to what extent it's actually
necessary to extract maximum performance out of the machine (compared to just
artificially pinning each core to disjoint memory). Either way, good
information--thanks!

------
scott_s
The first listed risk is why I shy away from solutions that depend on pinning
threads to logical processors:

 _Several processes can decide to schedule threads on the same NUMA node. If
each process has only one runnable goroutine, the NUMA node will be over-
subscribed, while other nodes will be idle. To partially alleviate the
problem, we can randomize node numbering within each process. Then the
starting NODE0 refers to different physical nodes across [Go] processes._

Basically, your particular runtime system is probably not going to be the only
thing running on a host. And even if it is, the kernel itself may choose to
run things on particular logical processors, and it may not take into account
what pinning you have done. For that reason, I find these approaches brittle.
If your users know exactly how they're going to deploy applications (not _you_
, since you're implementing a runtime system for user code), they can squeeze
out some more performance, but all it can take is one extra process running on
that host to mess it all up.

That's the difficulty with implementing runtime systems, and not applications:
your runtime system has to work for (usually) arbitrary user code on (usually)
arbitrary systems. If you're writing a single application, and you know
exactly how and where it will run, thread-pin away. But when implementing a
runtime system, you don't have that kind of luxury. You often have to leave
performance on the floor for a small number of cases so that you don't hose it
for most cases.
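
For reference, "thread-pin away" from application code looks roughly like this in Go: a minimal, Linux-only sketch using runtime.LockOSThread plus sched_setaffinity via golang.org/x/sys/unix (the CPU number is made up for illustration):

    // Pin the calling goroutine's OS thread to one logical CPU (Linux).
    package main

    import (
        "fmt"
        "runtime"

        "golang.org/x/sys/unix"
    )

    func pinToCPU(cpu int) error {
        // Keep this goroutine on its current OS thread, so the affinity
        // mask set below keeps applying to the code that runs after it.
        runtime.LockOSThread()

        var set unix.CPUSet
        set.Zero()
        set.Set(cpu)
        return unix.SchedSetaffinity(0, &set) // pid 0 = the calling thread
    }

    func main() {
        if err := pinToCPU(3); err != nil {
            fmt.Println("pinning failed:", err)
            return
        }
        fmt.Println("this goroutine's OS thread is now bound to CPU 3")
        // latency-sensitive work would go here
    }

Note this is the application opting in, knowing its own deployment; the objection above is to a runtime doing it unconditionally.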

In principle, I think this kind of scheduling should be handled by the
operating system itself. If the kernel does not have enough information to do
it properly, then we can identify what information it would need, and devise
an API to inform it. But the kernel is the only entity that always has global
knowledge of everything running, and controls all of the resources. I find
that a much more promising direction.

As some minor support, consider the recent paper "The Linux Scheduler: a
Decade of Wasted Cores",
[https://news.ycombinator.com/item?id=11501493](https://news.ycombinator.com/item?id=11501493).
My intuition is that runtime systems which perform thread pinning like this
will tend to make such problems _worse_, since they constrain the kernel
scheduler even more.

~~~
toast0
> Basically, your particular runtime system is probably not going to be the
> only thing running on a host.

I'm running Erlang, not Go, but basically the runtime is the only real thing
running on our systems[1], so it's good for the runtime to pin its OS threads
to specific logical processors. On the systems where this isn't the case (for
example, when using a separate TLS termination daemon), it's easy to unpin the
threads and let the OS manage where to run things.

[1] there's also monitoring, ntpd, sshd, getty, and network/disk processing in
the kernel

~~~
scott_s
I'm unfamiliar with Erlang's runtime, so please forgive some basic questions.

Is Erlang's runtime doing the thread pinning without any input from you? Or
are you, at the application level, explicitly telling the Erlang runtime how
to pin threads?

edit: Did some googling, looks like it's the latter:
[http://erlang.org/doc/man/erl.html#+sbt](http://erlang.org/doc/man/erl.html#+sbt).
There are a bunch of policy options where the user picks what behavior they
think will work best with their application on the current system. Key to my
point, though, is: _The runtime system will by default not bind schedulers to
logical processors._

Providing options where users opt in to such behavior is good. But the Go
proposal, as far as I read it, was unilaterally proposing that this is how the
runtime would work, always. That's not good, for the reasons I stated.

------
morecoffee
One thing I have never understood about the Go scheduler is how P's are
involved. The OS (assume Linux) works with threads, and it schedules threads,
not processors. How does Go pin a P to a processor, or in this case to a NUMA
node?

~~~
willvarfar
Go calls them 'processors', but in OS terms they are OS threads. You can
configure Go to have a different number of processors than you have physical
processors (GOMAXPROCS).

~~~
prattmic
This is not quite right. M's are OS threads. P's are processing units, on
which goroutines are scheduled. There are exactly GOMAXPROCS P's. P's are
scheduled to run on M's, but there may be more M's than GOMAXPROCS.

For instance, when a goroutine makes a blocking syscall, it will continue to
use its current M (which is blocked in the kernel), but will release its P,
allowing another goroutine to execute.

This means that GOMAXPROCS goroutines can execute in user space in parallel,
but more goroutines can be blocked in the kernel on different OS threads.

The Go runtime will create more M's as necessary to run all of the P's.

(Note that the Go runtime does try to avoid needing one M per goroutine. For
instance, goroutines blocked on a channel are descheduled entirely (they give
up their P and M), and are scheduled again only once they need to be woken.)
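
To make this concrete, here is a tiny sketch (Linux-only, and only a sketch: it uses raw syscall.Read precisely because that path bypasses the netpoller and genuinely blocks its M) showing the OS thread count growing past GOMAXPROCS while goroutines sit in blocking syscalls:

    // Linux-only sketch: raw syscall.Read bypasses the netpoller, so each
    // reader goroutine blocks its M (OS thread) in the kernel while giving
    // up its P, and the runtime creates extra M's beyond GOMAXPROCS.
    package main

    import (
        "fmt"
        "runtime"
        "runtime/pprof"
        "syscall"
        "time"
    )

    func main() {
        runtime.GOMAXPROCS(2) // 2 P's: at most 2 goroutines run Go code at once

        for i := 0; i < 8; i++ {
            var fds [2]int
            if err := syscall.Pipe(fds[:]); err != nil {
                panic(err)
            }
            go func(fd int) {
                buf := make([]byte, 1)
                _, _ = syscall.Read(fd, buf) // blocks in read(2); the P moves to another M
            }(fds[0])
        }

        time.Sleep(200 * time.Millisecond)
        fmt.Println("GOMAXPROCS:", runtime.GOMAXPROCS(0))
        fmt.Println("OS threads created:", pprof.Lookup("threadcreate").Count())
    }

The printed thread count ends up well above GOMAXPROCS, even though only two goroutines can ever be executing Go code at the same time.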

~~~
giovannibajo1
It's also very important to note that blocking network calls like send/recv
also release the P, because the scheduler knows what's happening and hands the
FDs over to the net poller, a single thread that waits on all of them through
epoll or a similar API. So you don't end up with one M per network socket, and
you get fully transparent async networking.
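
A minimal sketch of the pattern this makes cheap: one goroutine per connection, written in blocking style, while the parked goroutines sit on the poller rather than each holding an OS thread (the address and buffer size below are arbitrary):

    // Echo server: one goroutine per connection, written in blocking style.
    // c.Read looks blocking, but it parks the goroutine on the netpoller
    // (epoll on Linux); the M and P go run other goroutines in the meantime.
    package main

    import (
        "log"
        "net"
    )

    func handle(c net.Conn) {
        defer c.Close()
        buf := make([]byte, 4096)
        for {
            n, err := c.Read(buf)
            if err != nil {
                return
            }
            if _, err := c.Write(buf[:n]); err != nil {
                return
            }
        }
    }

    func main() {
        ln, err := net.Listen("tcp", "127.0.0.1:0")
        if err != nil {
            log.Fatal(err)
        }
        log.Println("echo server on", ln.Addr())
        for {
            c, err := ln.Accept()
            if err != nil {
                log.Fatal(err)
            }
            go handle(c) // thousands of these cost goroutines, not OS threads
        }
    }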

------
iends
This was proposed a few years ago, but it seems it never got any traction.

~~~
gribbly
Well, AFAIK the author of this suggestion, Dmitry Vyukov, is the main
architect of Go's runtime scheduling, so I doubt there is anything preventing
him from implementing this should he so wish.

