There is a new small thread[0] on golang-dev about someone from Intel looking into this. It would be great to see the go scheduler be more aware of NUMA characteristics.
Intel needs everything to be NUMA-aware. They're betting a lot of money on Xeon Phi, and once the self-booting KNL machines are out nobody will want to deal with the PCIe cards any more.
As far as I know, the Phi doesn't actually require NUMA-awareness at all (at least, the older models didn't; see https://arxiv.org/pdf/1310.5842v2.pdf). A Phi lives on a single socket with a coherent L2 cache, and remote L2 accesses are not much slower than main memory ones, nor does core distance along the interconnect seem to affect access time. The new models with lots of main memory are going to be used with six-DIMM slot DDR4 sockets (64 GB each of DDR4, in addition to 16 GB MCDRAM to get even more absurd bandwidth for pure FLOPS / benchmark / coprocessing workloads; see http://www.intel.com/content/www/us/en/processors/xeon/xeon-...), in order to avoid having to split the Phi up into multiple NUMA domains.
So, I have no idea why Intel would care at all about making stuff NUMA-aware for the purpose of Phis. Cache-aware, sure, but that's pretty much required for good performance on modern machines already. What they would care about is making everything vectorize properly, since Phis do horribly if you aren't exploiting the VPU; hence, you'd think they'd be more interested in adding badly missing SIMD support to Go than NUMA-aware scheduling.
(Please let me know if I'm wrong and there's a multi-socket Phi announced, but I've been following it really carefully because I'm excited about the possibilities of using the new KNLs for main-memory databases, and I have yet to hear anything about that).
There is no multi-socket Phi - I asked about it at an Intel booth at a conference a while back and was told the delta between memory bandwidth and inter-socket bandwidth would be so great that it would not be a useful configuration.
I believe the talk of NUMA refers to the single socket behaving like a cluster with up to 4 NUMA domains, but I can't find any good references right now.
Ah, interesting; I hadn't read that anywhere. From the limited reading I just did, it does seem like that's a configuration they offer, but from the scant sources available I can't quite figure out to what extent it's actually necessary to extract maximum performance out of the machine (compared to just artificially pinning each core to disjoint memory). Either way, good information--thanks!
It's not about multi-socket configuration; self-booting KNL machines can have two classes of memory. For brevity's sake you can think of them as "fast" and "huge." A regular malloc call gets you a piece of "huge", and there's a separate malloc function available to allocate "fast." This is the difference between the DDR4 and the MCDRAM you mentioned -- they're not accessed uniformly.
While Intel has done a ton of work to make sure you don't have to care about this, it's obviously in their best interest to have as much software as possible be able to care about this, especially because the KNL clock is so slow.
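For the curious, the "separate malloc" is (as far as I know) hbw_malloc from Intel's memkind library; from Go you'd have to reach it through cgo. A rough sketch, assuming libmemkind and its headers are installed on a KNL box:

    package main

    /*
    #cgo LDFLAGS: -lmemkind
    #include <hbwmalloc.h>
    */
    import "C"

    import "fmt"

    func main() {
        // Ask memkind for 1 GB of MCDRAM-backed ("fast") memory; ordinary Go
        // allocations and plain malloc keep landing in DDR4 ("huge"). If no
        // MCDRAM is available, behavior depends on the hbw policy (the default
        // "preferred" policy falls back to regular memory, I believe).
        p := C.hbw_malloc(C.size_t(1 << 30))
        if p == nil {
            fmt.Println("hbw_malloc failed")
            return
        }
        defer C.hbw_free(p)
        fmt.Println("got fast memory at", p)
    }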
Sure, but it's not clear to me that the best way to deal with the two memory classes would be with NUMA-aware scheduling, unless you happen to have an application with "fast" and "slow" application threads (which I suspect describes relatively few applications in practice; and even if it does, wouldn't you have to tell the scheduler about it explicitly?) Seems to me like it will usually be much more efficient to use the MCDRAM either as L3 (default configuration) or as an explicit scratchpad (which a scheduler wouldn't really be able to exploit, given that if the scheduler has data structures that don't fit in L2 it's already probably screwed).
That being said, I did some more reading this morning and the sub-NUMA clustering configuration on the new Phi does provide tile-to-directory-to-MCDRAM affinity (via pin domains), which would make sense for maximizing its performance as either L3 or scratchpad; AFAICT this is not the case for the remote DDR4, though. So whether it's worth caring probably depends very much on your workload; I think KNL is most interesting for workloads with working datasets that are much larger than 16GB, since otherwise you could just use a GPU (you can get more usable working memory with much better bandwidth with something like a DGX-1 thanks to NVLink, but unless I'm missing something not at a remotely competitive price point, and it's unclear to me whether it's sustainable for larger working sets since you can only transfer up to 80 GB/s from the CPU to the GPUs, which is lower than the 90 GB/s each Phi gets out of DDR4 on Triad [and a better comparison is probably the 115.2 GB/s theoretical peak for KNL anyway]).
In designing the second-generation Intel Xeon Phi chip, we created a massively multicore processor that is available in a self-boot socket. This eliminates the need to run an OS on a separate host and pass data across a PCIe* slot. (However, for those who prefer using the latest Intel Xeon Phi chip as a co-processor, a PCIe-card-version will be available shortly.)
The first listed risk is why I shy away from solutions that depend on pinning threads to logical processors:
Several processes can decide to schedule threads on the same NUMA node. If each process has only one runnable goroutine, the NUMA node will be over-subscribed, while other nodes will be idle. To partially alleviate the problem, we can randomize node numbering within each process. Then the starting NODE0 refers to different physical nodes across [Go] processes.
Basically, your particular runtime system is probably not going to be the only thing running on a host. And even if it is, the kernel itself may choose to run things on particular logical processors, and it may not take into account what pinning you have done. For that reason, I find these approaches brittle. If your users know exactly how they're going to deploy applications (not you, since you're implementing a runtime system for user code), they can squeeze out some more performance, but all it can take is one extra process running on that host to mess it all up.
That's the difficulty with implementing runtime systems, and not applications: your runtime system has to work for (usually) arbitrary user code on (usually) arbitrary systems. If you're writing a single application, and you know exactly how and where it will run, thread-pin away. But when implementing a runtime system, you don't have that kind of luxury. You often have to leave performance on the floor for a small number of cases so that you don't hose it for most cases.
In principle, I think this kind of scheduling should be handled by the operating system itself. If the kernel does not have enough information to do it properly, then we can identify what information it would need, and devise an API to inform it. But the kernel is the only entity that always has global knowledge of everything running, and controls all of the resources. I find that a much more promising direction.
As some minor support, consider the recent paper "The Linux Scheduler: a Decade of Wasted Cores", https://news.ycombinator.com/item?id=11501493. My intuition is that runtime systems which perform thread pinning like this will tend to make such problems worse, since it constrains the kernel scheduler even more.
> Basically, your particular runtime system is probably not going to be the only thing running on a host.
I'm running Erlang, not Go, but basically the runtime is the only real thing running on our systems[1], so it's good for the runtime to pin its os-threads to specific logical processors. On the systems where this isn't the case (for example, when using a separate TLS termination daemon), it's easy to unpin the threads and let the OS manage where to run things.
[1] there's also monitoring, ntpd, sshd, getty, and network/disk processing in the kernel
I'm unfamiliar with Erlang's runtime, so please forgive some basic questions.
Is Erlang's runtime doing the thread pinning without any input from you? Or are you, at the application level, explicitly telling the Erlang runtime how to pin threads?
edit: Did some googling, looks like it's the latter: http://erlang.org/doc/man/erl.html#+sbt. There are a bunch of policy options where the user picks what behavior they think will work best with their application, on the current system. Key to my point, though, is: The runtime system will by default not bind schedulers to logical processors.
Providing options where users opt-in to such behavior is good. But the Go proposal, as far as I read, was unilaterally proposing that is how the runtime would work, always. That's not good, for the reasons I stated.
Erlang/Go/Java VMs could probably benefit from some kind of "appliance mode" where they take over the whole machine and reconfigure the kernel for maximum performance, but I wouldn't want that mode to be the default.
One thing I have never understood about the Go scheduler is how P's are involved. The OS (assume Linux) works with threads, and schedules threads not processors. How does it pin the P to the processor, or in this case Node?
Go calls them 'processors', but in OS terms they are OS threads. You can configure Go to have some different number of processors than you have physical processors (GOMAXPROCS).
This is not quite right. M's are OS threads. P's are processing units, on which goroutines are scheduled. There are exactly GOMAXPROCS P's. P's are scheduled to run on M's, but there may be more M's than GOMAXPROCS.
For instance, when a goroutine makes a blocking syscall, it will continue to use its current M (which is blocked in the kernel), but will release its P, allowing another goroutine to execute.
This means that GOMAXPROCS goroutines can execute in user space in parallel, but more goroutines can be blocked in the kernel on different OS threads.
The Go runtime will create more M's as necessary to run all of the P's.
(Note that the Go runtime does try to avoid needing one M per goroutine. For instance, goroutines blocked on a channel are descheduled entirely (they give up their P and M), and are scheduled again only once they need to be woken.)
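You can actually see this from user code. A minimal sketch (exact numbers vary by runtime version and platform, but the thread count stays near GOMAXPROCS rather than anywhere near the goroutine count):

    package main

    import (
        "fmt"
        "runtime"
        "runtime/pprof"
        "time"
    )

    func main() {
        runtime.GOMAXPROCS(4) // at most 4 Ps, i.e. 4 goroutines running Go code at once

        block := make(chan struct{})
        for i := 0; i < 10000; i++ {
            go func() { <-block }() // each goroutine parks on the channel, giving up its M and P
        }
        time.Sleep(100 * time.Millisecond)

        fmt.Println("goroutines:", runtime.NumGoroutine())
        // The threadcreate profile counts OS threads the runtime has created;
        // it stays small even with 10,000 parked goroutines.
        fmt.Println("OS threads created:", pprof.Lookup("threadcreate").Count())
        close(block)
    }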
It's also very important to notice that network blocking calls like send/recv also release the P, because the scheduler knows what's happening and passes the FDs over to the net poller, which is a single thread that waits on all of them through epoll or a similar API. So you don't end up with one M per network socket, and you get fully transparent async networking.
> This is not quite right. M's are OS threads. P's are processing units, on which goroutines are scheduled. There are exactly GOMAXPROCS P's. P's are scheduled to run on M's, but there may be more M's than GOMAXPROCS.
What you say is almost true, but the terminology is not quite right. Goroutines (Gs) are multiplexed (scheduled) onto threads (Ms), not Ps. Goroutines acquire Ps, they are not scheduled on Ps. However, they are scheduled onto Ms through Ps.
P really stands for processor, and it's just an abstract resource.
> The Go runtime will create more M's as necessary to run all of the P's.
Ps are not runnable entities. The runtime will create Ms in order to run all the runnable Gs, of which there are at most as many as there are Ps (since runnable Gs acquire Ps). But there can be fewer, and then the runtime will use fewer threads, while the number of Ps is always exactly GOMAXPROCS.
> when a goroutine makes a blocking syscall, it will continue to use its current M (which is blocked in the kernel), but will release its P
Only for blocking syscalls issued outside the runtime. Non-blocking syscalls do not release the P, and neither do syscalls that the runtime itself (as opposed to the syscall package) has to make.
Of course, you don't need Ps at all to implement Go with user-space scheduling. Go only added them in Go 1.1. However, this design avoids a global scheduler lock, and uses less memory. Plus some things just fall out naturally from the design, e.g. GOMAXPROCS accounting comes for free simply by the fact that you have GOMAXPROCS Ps.
I think you are confused by the terminology, which doesn't mean what you think it does.
Yes, the kernel scheduler works with threads. But the purpose of a scheduler (both the kernel's and Go's) is to be invisible to the user. Since Go programs are user-level programs, the kernel scheduler is invisible to Go, and this is a very good thing.
The kernel provides an abstraction, independently-executed threads of execution. These threads are managed by the kernel scheduler to implement parallelism, guarantee fairness, and do many other things, but all this is not relevant to Go. What matters is that these threads are concurrent, independent, and carry their own state.
The Go runtime also provides the user with independently-executed threads of execution. For various reasons I won't get into, it does all this in userspace rather than transparently using kernel threads (note that gccgo on some platforms just uses kernel threads). So how does it do it?
Well, for now, forget about Ps. They are an optimization. Let's think about how you can do this without Ps; we'll add them later. And forget about parallelism too. We're making a toy, strictly GOMAXPROCS=1 implementation for now.
The runtime has to account for all the goroutines the user wants to run. These exist in the runtime as Gs. There are as many Gs as there are goroutines, though more than the user explicitly started, since the runtime creates some of its own.
Like all programs, Go programs start their life as single-threaded programs. So we have an arbitrary number of Gs that somehow have to all run on this single thread. This implementation of Go is cooperatively scheduled. That means that the Go code is never preempted; code must yield instead.
But where does it yield? The programmer surely doesn't call runtime.Gosched(), and yet goroutines seem to yield somehow. Well, the compiler inserts calls into the runtime in various places, e.g. channel send and channel receive will call into the runtime. In fact they will end up in the scheduler, which looks at the list of runnable Gs and, if there are any, saves the current context and reschedules another one on top of the running thread.
How does it reschedule? It changes the execution context, which for Go code means setting a program counter, a stack pointer, a g in the TLS slot (more on this later), and some other miscellaneous things.
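If it helps to picture it, here's a heavily simplified sketch of that per-goroutine state (hypothetical field set, loosely modeled on the g and gobuf structs in the runtime; the real ones have many more fields):

    // What the scheduler needs to save/restore to switch goroutines:
    type stack struct {
        lo, hi uintptr // stack bounds; the function prologue checks SP against these
    }

    type gobuf struct {
        sp uintptr // saved stack pointer
        pc uintptr // saved program counter
        g  uintptr // the G this context belongs to
    }

    type g struct {
        stack  stack  // current stack bounds
        sched  gobuf  // execution context to restore when this G runs again
        status uint32 // runnable, running, waiting, ...
    }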
So this works, but there are some problems. The program eventually ends up doing system calls like read and write, and these system calls can block for an unbounded period of time. The program has many runnable Gs. It would be good if somehow we could run all this code while some other part of the program is waiting for the kernel.
So we introduce threads now. We're still at GOMAXPROCS=1 level, but we start using kernel threads.
In this variant of the implementation, the runtime will check the list of runnable Gs before issuing a system call. If there are Gs to run, it will start a new thread. The current thread will do the system call, as before, but there will be a new thread running the scheduler code (because this is how we set it up when we started it), which will pick a runnable G and execute it.
In the original thread, we block in the system call. Once that completes, we save the result somewhere, then we exit, and the thread disappears.
This works but it's really wasteful. All that thread creation and destruction. It would be better if we reused threads, and only created new ones when needed. To do that, we need to do some accounting, and we need to manage these threads in some data structure. So we introduce Ms. M stands for machine -- a machine that will execute Go code. We now have both a list of Gs and a list of Ms. Now when we need a thread we first search the Ms; we might have one available already. Only if we don't will we create another M. When a system call resumes, the thread will park itself (parked means it's not runnable, and the kernel will not schedule it) and insert itself into the list of available Ms.
The relation between Gs and Ms is now n:m, but we only ever have one G running user code at a time, many Gs blocked in system calls, and some Ms doing nothing. We can do better. We want to run multiple Gs in parallel.
For that, we need to introduce a scheduler lock. Multiple Ms will now enter the scheduler at the same time, so we need a lock.
It works pretty much the same as before, just concurrent. And this concurrency will enable parallelism if the hardware has multiple cores or CPUs. Now the relation between Gs and Ms truly is m:n.
But there is a problem now. If we set GOMAXPROCS too high, performance is bad. We don't get the speed-up we expect. The problem is that all those threads now contend on the same scheduler lock. Lock contention is bad and prevents scalability.
So we introduce Ps, and split up the G:M relation into G:P:M. P stands for processor.
There is an n:1 relation between Gs and Ps. When a Go program starts, it creates exactly GOMAXPROCS Ps.
When Go code wants to run, it first has to acquire a P. You can think of this as "G needs to acquire processor time". When a new goroutine is created, it's placed in a per-P run queue.
Most of the scheduling is done through per-P run queues. There is still a global run queue, but the idea is that it is seldom used; in general, the per-P run queue is preferred, and this allows the use of a per-P lock instead of a global lock.
Ms use these Ps to get their workload. All Ms that run user code hold a P, and use its run queue and locks to schedule the work. There are more Ms, however: Ms that don't execute user code, for example while doing a system call (you're stuck in the kernel, so you're not running user code), hand off their P before issuing the system call. Since this P is now free, another M can grab it and schedule Gs from its run queue.
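Schematically, something like this (hypothetical, simplified types; the real ones live in the runtime package, and the per-P queue is actually a fixed-size lock-free ring with the global queue as overflow):

    type g struct{} // placeholder; see the earlier sketch

    type p struct {
        runq []*g // per-P queue of runnable Gs; mostly accessed without the global lock
        m    *m   // the M currently holding this P, if any
    }

    type m struct {
        curg *g // G currently executing on this OS thread
        p    *p // P held by this M; handed off before blocking syscalls
    }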
Now we have finally arrived at the real Go implementation used today. This design allows simple accounting of resources, and it is scalable and performant.
Hope this clears it up.
Oh, yes, and the g in the TLS slot? The current G is always stored in a register or a TLS slot. In the function preamble, this G is inspected. Originally this was done to check for stack overflow, but now it has a dual purpose: some other concurrent part of the scheduler measures the time spent by running Gs, and if one runs too long, it sets some state in the corresponding G, so that the next time that G makes a function call, the prologue detects this and jumps into the scheduler, descheduling the current goroutine and allowing others to run.
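That's also why, on the Go versions of this era (async preemption only arrived much later, in Go 1.14), a goroutine spinning in a loop with no function calls never hits one of those checks and can starve everything else:

    package main

    import (
        "fmt"
        "runtime"
        "time"
    )

    func main() {
        runtime.GOMAXPROCS(1)
        go func() {
            for {
                // tight loop, no function calls: no prologue checks, so no
                // opportunity for the scheduler to deschedule this goroutine
            }
        }()
        time.Sleep(10 * time.Millisecond)
        // With only cooperative, call-site preemption this line may never run,
        // because main can't get the single P back from the spinning goroutine.
        fmt.Println("done")
    }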
Would fusing many tightly coupled Gs (lots of sends and recvs to each other) into a larger G by transforming them into a state machine be a way to reduce scheduler overhead? Of course, you may not frequently know ahead of time which Gs are tightly coupled.
I don't know anything about Go internals, but presumably you'd use libnuma http://linux.die.net/man/3/numa to bind threads to a NUMA node. I guess libnuma uses info in /proc to find the topology, the mbind() system call to set memory locality, and the sched_setaffinity() system call to pin threads.
(Or I guess Go would reimplement libnuma themselves since they don't use C.)
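For the thread-pinning part specifically, user code can already do this on Linux without libnuma, via sched_setaffinity. A sketch, assuming golang.org/x/sys/unix (the runtime itself would presumably use raw syscalls, and memory placement via mbind/set_mempolicy would still need separate wrappers):

    package main

    import (
        "fmt"
        "runtime"

        "golang.org/x/sys/unix"
    )

    func main() {
        // Pin the current OS thread (M), not the goroutine, to CPU 0.
        runtime.LockOSThread() // keep this goroutine from migrating to another M
        defer runtime.UnlockOSThread()

        var set unix.CPUSet
        set.Zero()
        set.Set(0) // on a NUMA machine you'd pick all CPUs of one node
        if err := unix.SchedSetaffinity(0, &set); err != nil { // pid 0 = calling thread
            fmt.Println("sched_setaffinity:", err)
            return
        }
        fmt.Println("pinned; CPUs in mask:", set.Count())
    }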
All the NUMA stuff is highly OS-specific. Every port will need its own hooks into the NUMA-related system calls. If Go ever gets a NUMA-aware scheduler, Linux will get it first and I suspect most other platforms might not get it at all.
Well, AFAIK the author of this suggestion, Dmitry Vyukov, is the main architect of Go's runtime scheduling, so I doubt there is anything preventing him from implementing this should he so wish.
[0] https://groups.google.com/d/msg/golang-dev/ARthO774J7s/7D9P0...