One thing I have never understood about the Go scheduler is how P's are involved. The OS (assume Linux) works with threads and schedules threads, not processors. How does Go pin a P to a processor, or in this case a NUMA node?
Go calls them 'processors', but in OS terms they are OS threads. You can configure Go to have some different number of processors than you have physical processors (GOMAXPROCS).
This is not quite right. M's are OS threads. P's are processing units, on which goroutines are scheduled. There are exactly GOMAXPROCS P's. P's are scheduled to run on M's, but there may be more M's than GOMAXPROCS.
For instance, when a goroutine makes a blocking syscall, it will continue to use its current M (which is blocked in the kernel), but will release its P, allowing another goroutine to execute.
This means that GOMAXPROCS goroutines can execute in user space in parallel, but more goroutines can be blocked in the kernel on different OS threads.
The Go runtime will create more M's as necessary to run all of the P's.
(Note that the Go runtime does try to avoid needing one M per goroutine. For instance, goroutines blocked on a channel are descheduled entirely (they give up their P and M), and are scheduled again only once they need to be woken.)
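A small, Linux-specific sketch of that (using the raw syscall package on purpose, since reads through the os package would go to the netpoller instead of blocking a thread; assumes Go 1.14+ so the spinning goroutines get preempted):

    package main

    import (
        "fmt"
        "runtime"
        "syscall"
        "time"
    )

    func main() {
        // Four goroutines block in read(2) on pipes nobody writes to.
        // Each one ties up an M in the kernel; the runtime hands their
        // Ps off so other goroutines can keep running.
        for i := 0; i < 4; i++ {
            var fds [2]int
            if err := syscall.Pipe(fds[:]); err != nil {
                panic(err)
            }
            go func() {
                buf := make([]byte, 1)
                syscall.Read(fds[0], buf) // blocks this M in the kernel
            }()
        }

        // Meanwhile GOMAXPROCS CPU-bound goroutines execute in parallel,
        // one per P.
        for i := 0; i < runtime.GOMAXPROCS(0); i++ {
            go func(id int) {
                for {
                    _ = id * id // spin: holds a P
                }
            }(i)
        }

        time.Sleep(100 * time.Millisecond)
        fmt.Println("more Ms than GOMAXPROCS exist now: 4 blocked + the runners")
    }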
It's also very important to note that blocking network calls like send/recv release the P as well, because the scheduler knows what's happening and hands the FDs over to the netpoller: a single thread that waits on all of them through epoll or a similar API. So you don't end up with one M per network socket, and you get fully transparent async networking.
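This is exactly why the goroutine-per-connection style scales. A minimal sketch: the blocked Read parks only the G, while its M and P move on to other work:

    package main

    import (
        "io"
        "net"
    )

    func main() {
        ln, err := net.Listen("tcp", ":8080")
        if err != nil {
            panic(err)
        }
        for {
            conn, err := ln.Accept()
            if err != nil {
                continue
            }
            // One goroutine per connection. When a read would block, the
            // runtime registers the fd with the netpoller (epoll on Linux)
            // and parks just this G; no OS thread is held hostage.
            go func(c net.Conn) {
                defer c.Close()
                io.Copy(c, c) // echo until the peer closes
            }(conn)
        }
    }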
> This is not quite right. M's are OS threads. P's are processing units, on which goroutines are scheduled. There are exactly GOMAXPROCS P's. P's are scheduled to run on M's, but there may be more M's than GOMAXPROCS.
What you say is almost true, but the terminology is not quite right. Goroutines (Gs) are multiplexed (scheduled) onto threads (Ms), not Ps. Goroutines acquire Ps; they are not scheduled on Ps. However, they are scheduled onto Ms through Ps.
P really stands for processor, and it's just an abstract resource.
> The Go runtime will create more M's as necessary to run all of the P's.
Ps are not runnable entities. The runtime will create Ms in order to run all the runnable Gs, of which at most GOMAXPROCS can be running at once (since a running G holds a P). There can be fewer, in which case the runtime will use fewer threads, while the number of Ps is always exactly GOMAXPROCS.
> when a goroutine makes a blocking syscall, it will continue to use its current M (which is blocked in the kernel), but will release its P
Only for blocking syscalls issued outside the runtime. Non-blocking syscalls do not release the P, and neither do syscalls the runtime itself (as opposed to the syscall package) has to make.
Of course, you don't need Ps at all to implement Go with user-space scheduling. Go only added them in Go 1.1. However, this design avoids a global scheduler lock, and uses less memory. Plus some things just fall out naturally from the design, e.g. GOMAXPROCS accounting comes for free simply by the fact that you have GOMAXPROCS Ps.
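And the user-facing knob is correspondingly simple: runtime.GOMAXPROCS with an argument of 0 just reports the number of Ps, and any other value resizes the set:

    package main

    import (
        "fmt"
        "runtime"
    )

    func main() {
        // GOMAXPROCS(0) queries without changing anything: it reports
        // the current number of Ps.
        fmt.Println("Ps:", runtime.GOMAXPROCS(0))

        // Any positive value resizes the set of Ps and returns the
        // previous setting.
        prev := runtime.GOMAXPROCS(4)
        fmt.Println("was:", prev, "now:", runtime.GOMAXPROCS(0))
    }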
I think you are confused by the terminology, which doesn't mean what you think it does.
Yes, the kernel scheduler works with threads. But the purpose of a scheduler (both the kernel's and Go's) is to be invisible to the user. Since Go programs are user-level programs, the kernel scheduler is invisible to Go, and this is a very good thing.
The kernel provides an abstraction, independently-executed threads of execution. These threads are managed by the kernel scheduler to implement parallelism, guarantee fairness, and do many other things, but all this is not relevant to Go. What matters is that these threads are concurrent, independent, and carry their own state.
The Go runtime also provides the user with independently-executed threads of execution. For various reasons I won't get into, it does all this in userspace rather than transparently using kernel threads (note that gccgo on some platforms just uses kernel threads). So how does it do it?
Well, for now, forget about Ps. They are an optimization. Let's think about how you could do this without Ps; we'll add them later. And forget about parallelism too: we're making a toy, strictly GOMAXPROCS=1 implementation for now.
The runtime has to account for all the goroutines the user wants to run. These exist in the runtime as Gs. There is one G per goroutine, but more Gs than the user asked for, since the runtime creates some of its own.
Like all programs, Go programs start their life as single-threaded programs. So we have an arbitrary number of Gs that somehow all have to run on this single thread. This implementation of Go is cooperatively scheduled: Go code is never preempted; instead, code must yield.
But where does it yield? The programmer surely doesn't call runtime.Gosched(), and yet goroutines seem to yield somehow. Well, the compiler inserts calls into the runtime in various places; e.g. channel send and channel receive will call into the runtime. In fact they will end up in the scheduler, which looks at the list of runnable Gs, and if there are any, saves the current context and reschedules another G on top of the running thread.
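Both kinds of yield fit in a few lines. A sketch of the toy model (note that real Go since 1.14 also preempts asynchronously, so the explicit yield is rarely needed anymore):

    package main

    import (
        "fmt"
        "runtime"
    )

    func main() {
        runtime.GOMAXPROCS(1) // the toy model: one thread for user code

        done := make(chan struct{})
        go func() {
            fmt.Println("the other goroutine ran")
            close(done)
        }()

        // A channel receive is one of those compiler-inserted runtime
        // calls: it ends up in the scheduler, parks this G, and runs
        // the other one.
        <-done

        // In a loop that makes no calls at all, a purely cooperative
        // implementation needs an explicit yield:
        for i := 0; i < 3; i++ {
            runtime.Gosched() // voluntarily reenter the scheduler
        }
    }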
How does it reschedule? It changes the execution context, which for Go code means setting a program counter, a stack pointer, a g in the TLS slot (more on this later), and some other miscellaneous things.
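The saved context really is that small. A sketch loosely modeled on the runtime's gobuf struct (the field names and types here are approximations, and the actual switch is done in assembly, in gogo):

    package sched

    // gobuf-like context: everything needed to suspend and resume a G.
    type gobuf struct {
        sp uintptr // saved stack pointer
        pc uintptr // saved program counter
        g  uintptr // the G this context belongs to
    }

    // Rescheduling boils down to: save the running G's context, load the
    // next G's context (including the g register / TLS slot), and jump.
    // This can't be expressed in pure Go; the real runtime does it in
    // assembly.
    func swtch(old, next *gobuf) {
        // save sp/pc/g into old, restore them from next, jump to next.pc
    }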
So this works, but there are some problems. The program eventually ends up doing system calls like read and write, and these can block for an unbounded period of time. The program has many runnable Gs, and it would be good if we could somehow run them while some other part of the program is waiting for the kernel.
So we introduce threads now. We're still at GOMAXPROCS=1 level, but we start using kernel threads.
In this variant of the implementation, the runtime checks the list of runnable Gs before issuing a system call. If there are Gs to run, it starts a new thread. The current thread does the system call, as before, but the new thread runs the scheduler code (because that is how we set it up when we started it), which picks a runnable G and executes it.
In the original thread, we block in the system call. Once it completes, we save the result somewhere, then we exit, and the thread disappears.
This works, but it's really wasteful: all that thread creation and destruction. It would be better if we reused threads and only created new ones when needed. To do that, we need to do some accounting, and we need to manage these threads in some data structure. So we introduce Ms. M stands for machine: a machine that will execute Go code. We now have both a list of Gs and a list of Ms. When we need a thread, we first search the Ms; we might have one available already. Only if we don't will we create another M. When a system call completes, the thread parks itself (parked means it's not runnable, so the kernel will not schedule it) and inserts itself into the list of available Ms.
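In code, that accounting might look like this (all names hypothetical; the closest real counterparts are the runtime's newm/startm/stopm):

    package sched

    import "sync"

    type g struct{}

    // m is an OS thread that can run Go code; a parked M sleeps on its
    // wake channel until it is handed work.
    type m struct {
        wake chan *g
    }

    var (
        mu     sync.Mutex
        idleMs []*m // parked threads, ready for reuse
    )

    // needM returns a thread to run the scheduler on: reuse a parked M
    // if one exists, otherwise create a fresh OS thread (clone /
    // pthread_create in reality).
    func needM() *m {
        mu.Lock()
        defer mu.Unlock()
        if n := len(idleMs); n > 0 {
            mm := idleMs[n-1]
            idleMs = idleMs[:n-1]
            return mm
        }
        return &m{wake: make(chan *g)}
    }

    // park is what an M does when its system call has completed and
    // there is nothing to run: join the idle list and sleep. The kernel
    // will not schedule this thread again until work arrives.
    func park(mm *m) {
        mu.Lock()
        idleMs = append(idleMs, mm)
        mu.Unlock()
        <-mm.wake
    }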
The relation between Gs and Ms is now n:m, but we still only ever have one G running user code, plus many Gs blocked in system calls and some Ms doing nothing. We can do better: we want to run multiple Gs in parallel.
For that, we need to introduce a scheduler lock. Multiple Ms will now enter the scheduler at the same time, so we need a lock.
It works pretty much the same as before, just concurrently. And this concurrency enables parallelism if the hardware has multiple cores or CPUs. Now the relation between Gs and Ms truly is m:n.
But now there is a problem. If we set GOMAXPROCS too high, performance is bad: we don't get the speed-up we expect. The problem is that many threads now compete for the same scheduler lock. Lock contention is bad and prevents scalability.
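The bottleneck in miniature (a hypothetical sketch, not the historical code):

    package sched

    import "sync"

    type g struct{}

    // The pre-Go 1.1 design reduced to its essence: one global lock, one
    // global run queue. Every thread that needs work serializes on the
    // same mutex.
    var (
        schedLock  sync.Mutex
        globalRunq []*g
    )

    func findRunnable() *g {
        schedLock.Lock() // the bottleneck: all Ms contend here
        defer schedLock.Unlock()
        if len(globalRunq) == 0 {
            return nil
        }
        gp := globalRunq[0]
        globalRunq = globalRunq[1:]
        return gp
    }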
So we introduce Ps, and split up the G:M relation into G:P:M. P stands for processor.
There is an n:1 relation between Gs and Ps. When a Go program starts, it creates exactly GOMAXPROCS Ps.
When Go code wants to run, it first has to acquire a P. You can think of this as "G needs to acquire processor time". When a new goroutine is created, it's placed in a per-P run queue.
Most of the scheduling is done through per-P run queues. There is still a global run queue, but the idea is that it is seldom used; in general the per-P run queue is preferred, and this allows using a per-P lock instead of a global one.
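A sketch of that split (hypothetical code; the real per-P queue is a fixed-size ring accessed with atomics plus work stealing, not a mutex):

    package sched

    import "sync"

    type g struct{}

    // p is the abstract processor; exactly GOMAXPROCS of these exist.
    type p struct {
        mu   sync.Mutex // the real runtime uses a lock-free ring here
        runq []*g       // per-P run queue: the common, mostly uncontended path
    }

    var (
        globalMu   sync.Mutex
        globalRunq []*g // the fallback queue, rarely touched
    )

    // An M holding this P schedules from local state first.
    func (pp *p) runqget() *g {
        pp.mu.Lock()
        if n := len(pp.runq); n > 0 {
            gp := pp.runq[n-1]
            pp.runq = pp.runq[:n-1]
            pp.mu.Unlock()
            return gp
        }
        pp.mu.Unlock()

        // Local queue empty: only now touch the global queue (the real
        // scheduler would also try to steal from other Ps here).
        globalMu.Lock()
        defer globalMu.Unlock()
        if len(globalRunq) == 0 {
            return nil
        }
        gp := globalRunq[0]
        globalRunq = globalRunq[1:]
        return gp
    }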
Ms use these Ps to get their workload. All Ms that run user code hold a P and use its run queue and locks to schedule the work. There are more Ms, however: Ms that don't execute user code, for example because they are in a system call (you're stuck in the kernel, so you're not running user code), hand off their P before issuing the system call. Since this P is now free, another M can grab it and schedule Gs from its run queue.
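The handoff, sketched (the real runtime calls these entersyscall and exitsyscall; the bodies below are illustrative):

    package sched

    type p struct{}

    type m struct{ p *p }

    // Before blocking in the kernel: keep the M (it is about to be stuck
    // in the syscall) but give up the P so another M can run user code.
    func entersyscallSketch(mm *m) {
        handoffP(mm.p)
        mm.p = nil
    }

    // After the kernel returns: try to get a P back to run Go code again.
    func exitsyscallSketch(mm *m) {
        if pp := tryAcquireP(); pp != nil {
            mm.p = pp // fast path: resume running user code
            return
        }
        parkM(mm) // no P free: queue our G globally and park this thread
    }

    func handoffP(pp *p)  { /* put pp on the idle-P list; maybe wake an M */ }
    func tryAcquireP() *p { return nil /* pop from the idle-P list */ }
    func parkM(mm *m)     { /* put G on global runq, add mm to idle Ms, sleep */ }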
Now we have finally arrived at the real Go implementation used today. This design allows simple accounting of resources, and it is scalable and performant.
Hope this clears it up.
Oh yes, and the g in the TLS slot? The current G is always stored in a register or a TLS slot. In the function prologue, this G is inspected. Originally this was done to check for stack overflow, but now it has a dual purpose: another, concurrent part of the scheduler measures how long Gs have been running, and if one runs too long, it sets some state in the corresponding G, so that the next time that G makes a function call, the prologue detects this and jumps into the scheduler, descheduling the current goroutine and allowing others to run.
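The mechanism is cute enough to sketch: the "state" that gets set is a poisoned stack bound, so the regular overflow check doubles as a preemption check. The names below approximate the real runtime (stackPreempt, newstack), but the code is illustrative:

    package sched

    // Sketch of cooperative preemption (pre-1.14 style). The trick: the
    // monitor poisons the G's stack bound so the ordinary stack-overflow
    // check in every function prologue fails on purpose.
    const stackPreempt = ^uintptr(0) // an impossible stack bound

    type g struct {
        stackguard0 uintptr // compared against SP in every prologue
        preempt     bool
    }

    // A monitor thread (sysmon in the real runtime) flags Gs that have
    // been running too long.
    func preemptOne(gp *g) {
        gp.preempt = true
        gp.stackguard0 = stackPreempt // next prologue check will fail
    }

    // Called when the prologue check fails (morestack/newstack in the
    // real runtime). If it was a preemption request rather than a real
    // overflow, reschedule instead of growing the stack.
    func newstack(gp *g) {
        if gp.stackguard0 == stackPreempt && gp.preempt {
            gosched(gp) // deschedule gp; let other Gs run
            return
        }
        // ... otherwise actually grow the stack ...
    }

    func gosched(gp *g) { /* save context, enter the scheduler */ }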
Would fusing many tightly coupled Gs (lots of sends and recvs to each other) into a larger G, by transforming them into a state machine, be a way to reduce scheduler overhead? Of course, you may not often know ahead of time which Gs are tightly coupled.
I don't know anything about Go internals, but presumably you'd use libnuma http://linux.die.net/man/3/numa to bind threads to a NUMA node. I guess libnuma uses info in /proc to find the topology, the mbind() system call to set memory locality, and the sched_setaffinity() system call to pin threads.
(Or I guess Go would reimplement libnuma themselves since they don't use C.)
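For the thread-pinning part, a sketch from Go using golang.org/x/sys/unix (pinning affects only the calling thread, hence runtime.LockOSThread; and since the scheduler freely migrates goroutines between threads, this is exactly why real NUMA awareness would have to live inside the runtime):

    package main

    import (
        "fmt"
        "runtime"

        "golang.org/x/sys/unix"
    )

    func main() {
        // Wire this goroutine to its current OS thread first; otherwise
        // the runtime may move it to an unpinned thread.
        runtime.LockOSThread()
        defer runtime.UnlockOSThread()

        var set unix.CPUSet
        set.Zero()
        set.Set(0) // restrict this thread to CPU 0 (e.g. one node's CPU)

        // pid 0 means the calling thread (sched_setaffinity semantics).
        if err := unix.SchedSetaffinity(0, &set); err != nil {
            fmt.Println("setaffinity:", err)
            return
        }
        fmt.Println("pinned to CPU 0")
    }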
All the NUMA stuff is highly OS-specific. Every port will need its own hooks into the NUMA-related system calls. If Go ever gets a NUMA-aware scheduler, Linux will get it first and I suspect most other platforms might not get it at all.