Yes, the kernel scheduler works with threads. But the purpose of a scheduler (both the kernel's and Go's) is to be invisible to the user. Since Go programs are user-level programs, the kernel scheduler is invisible to Go, and this is a very good thing.
The kernel provides an abstraction, independently-executed threads of execution. These threads are managed by the kernel scheduler to implement parallelism, guarantee fairness, and do many other things, but all this is not relevant to Go. What matters is that these threads are concurrent, independent, and carry their own state.
The Go runtime also provides the user with independently-executed threads of execution. For various reasons I won't get into it does all this in userspace rather than transparently using kernel threads (note that gccgo on some platforms just uses kernel threads). So how does it do it?
Well, for once, forget about Ps. They are an optimization. Let's think how you can do this without Ps; we'll add them later. And forget about parallelism too. We're making a toy, strictly GOMAXPROCS=1 implementation for now.
The runtime has to account for all the goroutines the user has to run. These exist in the runtime as Gs. There are as many Gs as there are goroutines, but more than the number the user asked for, since the runtime creates its own.
Like all programs, Go programs start their life as single-threaded programs. So we have an arbitrary number of Gs that somehow have to all run on this single thread. This implementation of Go is cooperativelly-scheduled. That means that the Go code is never preempted, but code must yield.
But where does it yield? The programmer surely doesn't call runtime.Gosched(), and yet goroutuines seem to yield somehow. Well the compiler inserts call into the runtime in various places, e.g. channel send and channel receive will call into the runtime. In fact they will end up in the scheduler. which looks at the list of runnable Gs, and if there are any, saves the current context and it reschedules another on top of the running thread.
How does it reschedule? It changes the execution context, which for Go code means setting a program counter, a stack pointer, a g in the TLS slot (more on this later), and some other miscellaneous things.
So this works, but there are some problems. The program eventually ends up doing system calls like read and write, and these system calls can block for an unbounded period of time. The programs has many runnable Gs. It would be good if somehow we could run all this code while some other part of the program is waiting for the kernel.
So we introduce threads now. We're still at GOMAXPROCS=1 level, but we start using kernel threads.
In this variant of the implementation, the runtime will check the list of runnable Gs before issuing a system call. If there are Gs to run, it will start a new thread. The current thread will do the system calls, as before, but there will be a new thread that will run the scheduler code (because this is how we set it up when we started it) that will pick a runnable G, and execute it.
In the original thread, we block in the system call. Once that completes, we save the result somewhere, the we exit, and the thread disappears.
This works but it's really wastful. All that thread creation and destruction. It would be better if we reused threads, and only create new ones if needed. So to do that, we need to do some accouting, and we need to manage these threads in some data structure. So we introduce Ms. M stands for machine -- a machine that will execute Go code. We now have both a list of Gs, and a list of Ms. Now when we need a thread we first search the Ms, we might have one available already. Only if we don't we will create another M. When a system calls resumes, the thread will park itself (parked means it's not runnable, and the kernel will not schedule it) and insert itself in the list of available Ms.
The relation between all Gs and Ms is n:m, but we only have one runnable G, many Gs blocked in system calls, and some Ms that do nothing. We can do better. We want to run multiple Gs in parallel.
For that, we need to introduce a scheduler lock. Multiple Gs will now enter the scheduler at the same time, so we need a lock.
It works pretty much the same as before, just concurrent. And this concurrency will enable parallelism if the hardware has multiple core or CPUs. Now the relation between Gs and Ms trully is m:n.
But there is a problem now. If we set GOMAXPROCS too high, performance is bad. We don't get the speed-up we expect. The problem is that many goroutines now compete for the same scheduler lock. Lock contention is bad and prevents scalability.
So we introduce Ps, and split up the G:M relation into G:P:M. P stands for processor.
There is a n:1 relation between Gs and Ps. When a Go program starts, it creates exactly GOMAXPROCS Ps.
When Go code wants to run, it first has to acquire a P. You can think of this as "G needs to acquire processor time". When a new goroutine is created, it's placed in a per-P run queue.
Most of the scheduling is done through per-P run queues. There is still a global run queue, but the idea is that is seldom used, in general, the per-P run queue is prefered, and this allows to use a per-P lock instead of a global lock.
Ms uses these Ps to get their workload. All Ms that run user code have a P, and use its runqueue and locks to schedule the work. There are more Ms, however. Ms that don't execute user code, for example when doing a system call (then you are stuck in the kernel, so you don't run user code) hand off their P before issuing the system call. Since this P is now free, another M can grab it and schedule Gs from its run queue.
Now we finally discovered the real Go implementation used today. This design allows both simple accouting of resources, and it is scalable and performant.
Hope this clear it up.
Oh, yes, and the g in the TLS slot? The current G is always stored in a register or a TLS slot. In the function preamble, this G is inspected. Originally it was done to check for stack overflow, but now it has a dual purpose. Some other concurrent part of the scheduler measures time spent by running Gs, and if they run too long, it will set some state in the corresponding G, so that the next time that G will do a function call, the function call prolog will detect this and will jump into the scheduler, descheduling the current goroutine and allowing others to run.
Would fusing many tightly coupled Gs (lots of send's and recv's to each other) into a larger G by transformation into a state machine be a way to reduce scheduler overhead? Of course, you may not frequently know ahead-of-time which Gs are tightly coupled.