
The Go runtime scheduler's way of dealing with system calls
https://utcc.utoronto.ca/~cks/space/blog/programming/GoSchedulerAndSyscalls
======
protomyth
[https://marc.info/?l=openbsd-cvs&m=157500930922882&w=2](https://marc.info/?l=openbsd-cvs&m=157500930922882&w=2)

 _For dynamic binaries, we continue to permit the main program exec segment
because "go" (and potentially a few other applications) have embedded system
calls in the main program. Hopefully at least go gets fixed soon._

 _We declare the concept of embedded syscalls a bad idea for numerous reasons,
as we notice the ecosystem has many of static-syscall-in-base-binary which are
dynamically linked against libraries which in turn use libc, which contains
another set of syscall stubs. We've been concerned about adding even one
additional syscall entry point... but go's approach tends to double the entry-
point attack surface._

[https://marc.info/?l=openbsd-tech&m=157488907117170&w=2](https://marc.info/?l=openbsd-tech&m=157488907117170&w=2)

[edit for convenience of readers - read the above linked thread - I just
grabbed the go part]

 _Unfortunately our current go build model hasn't followed solaris/macos
approach yet of calling libc stubs, and uses the inappropriate "embed system
calls directly" method, so for now we'll need to authorize the main program
text as well. A comment in exec_elf.c explains this._

 _If go is adapted to call library-based system call stubs on OpenBSD as well,
this problem will go away. There may be other environments creating raw system
calls. I guess we'll need to find them as time goes by, and hope in time we
can repair those also._

[/edit]

~~~
Reelin
Related (AFAIU):
[https://news.ycombinator.com/item?id=21653119](https://news.ycombinator.com/item?id=21653119)

> We've been concerned about adding even one additional syscall entry point

I don't understand the need for such a severe "only libc syscalls ever"
approach.

What would be the security concern with allowing syscalls only from
preauthorized (ie msyscall(2)) regions, making initial region authorization
opt-in (instead of opt-out), allowing the program to call msyscall(2) itself,
and rejecting any statically linked (ie non-ASLR'd) regions for authorization?

~~~
masklinn
> I don't understand the need for such a severe "only libc syscalls ever"
> approach.

There's nothing severe about it. Most systems are exactly that: _systems_ of
which the kernel is only one part; syscalls are rarely if ever intended to be
called directly, willy-nilly.

The issue is that unlike Windows, Unices have never _enforced_ this.

~~~
amscanne
Sorry, that generalization doesn’t hold water.

It makes sense for systems where libc is tightly coupled and co-versioned
with the kernel, e.g. the BSDs, but Linux has always relied on third-party C
libraries, supported static binaries, etc.

You could argue that BSD made the mistake of intending to have a Windows-style
C library compat guarantee but not enforcing it, but that was not in scope for
Linux. The philosophy has always been syscall-level compat (and there are lots
of famous threads with Linus reinforcing this to others who would presume
that things should be “fixed in user space”).

So it’s hardly reasonable to generalize based on some BSD concerns; Linux is
WAI and represents the most common Unix-like system people use today by far.

There’s a pretty good argument that this level of compat, while the source of
some problems, has also made other things much easier: consider container
images that are bundled with their own system libraries. (You could certainly
invent schemes to inject these libraries, but dealing with link and library
level compatibility seems even more complex to deal with than system call-
level compatibility.)

~~~
pcwalton
Darwin/macOS has the same rules as Windows and the BSDs--syscalls are private
API--and it's extremely popular due to iOS. Linux is in fact the odd one out
here.

~~~
giovannibajo1
There is a difference though: libSystem on Darwin is a very thin wrapper over
the kernel syscalls; libc, on the contrary, is a library that was designed for
C, then standardized in POSIX, and has several layers of abstraction over
kernel syscalls, including many bad defaults that are universally recognized
as wrong today (e.g. file descriptors created via libc are all inheritable by
default).

~~~
cesarb
> (e.g. file descriptors created via libc are all inheritable by default)

Isn't that the _kernel_ default? Even if you use system calls directly, file
descriptors still inherit by default.

~~~
loeg
Yeah, libc's syscall wrappers just do what you tell them. If you don't pass
O_CLOEXEC to the kernel syscalls, you get the inherit behavior. Libc's syscall
wrappers don't change this in any way.

To the extent that Go's default for file descriptors today is !inherit (I'm
unfamiliar, but if so, it's a good choice), the Go runtime must already add
O_CLOEXEC to bare syscalls. There's no reason to believe it incapable of
adding the flag to libc syscalls instead.

~~~
giovannibajo1
You can't do that atomically with libc. There's a short window in which the
file descriptor will potentially be inherited, if another thread forks.

~~~
jlokier
That's incorrect.

You are thinking of the older way, where fcntl(fd, F_SETFD, FD_CLOEXEC) must
be used after open(), leaving a short window in which the file descriptor may
be inherited.

The newer way passes the O_CLOEXEC flag to open(), and there is no fcntl() call.
This is atomic with respect to inheritability: The kernel returns a non-
inheritable file descriptor to libc, and libc returns it to the application.

Other syscalls that return a file descriptor have similar flags, so they are
atomic too.

These flags and behaviours are exactly the same, whether done by calling
through libc as most programs do, or direct kernel syscalls bypassing libc, as
Go and a few other programs do.

------
jnwatson
This scheduler is probably the most salient feature of Go, but is only
indirectly described in the language specification.

Perhaps it is just me, but it seems all this user space rigamarole to map bits
of execution onto cores points to an overall architecture “smell”. This should
be performed and enabled by the OS.

You can see the seams between the OS and the go runtime tear a little whenever
a library acquires an ownership lock where the thread id is recorded. In Go,
computation moves freely between threads, so that lock doesn’t work (at least
without special instructions to the runtime to lock that goroutine to a
thread).

The whole POSIX threading model seems broken in this context.
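
For reference, Go's escape hatch for exactly this case is runtime.LockOSThread; a minimal sketch, where callThreadAffineAPI is a hypothetical placeholder for a C call that records the calling thread's id:

```go
package main

import (
	"fmt"
	"runtime"
)

// callThreadAffineAPI stands in for a hypothetical C-library call that
// stores the calling thread's id inside a lock.
func callThreadAffineAPI() {}

// runPinned executes a thread-affine sequence with the goroutine pinned
// to a single OS thread, then reports completion.
func runPinned() bool {
	done := make(chan bool)
	go func() {
		// Pin this goroutine so the scheduler cannot migrate it to
		// another OS thread in the middle of the sequence.
		runtime.LockOSThread()
		defer runtime.UnlockOSThread()

		callThreadAffineAPI() // "acquire": records this thread's id
		callThreadAffineAPI() // "release": sees the same thread id
		done <- true
	}()
	return <-done
}

func main() {
	fmt.Println("pinned sequence completed:", runPinned())
}
```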

~~~
gok
POSIX threading is not broken, the Go scheduler just does a bunch of goofy
things that aren't really supported. Moving stacks between threads breaks all
kinds of things. A more idiomatic approach would be for the compiler to emit
properly resumable functions, like most async/await implementations do.

~~~
mitchty
The go runtime moves stacks between threads?

Oof, that’s horrible. Any pointers to the logic behind it? I’m curious about
the rationale.

~~~
duelingjello
I think to keep Go code directly callable from C, they have to follow the
platform's C calling conventions which means the same stack layout. So for
cooperative concurrency on a single thread to work, each Goroutine needs its
very own stack. On Intel, that means saving stack pointers RSP and RBP (16
bytes) for each. Also, each will need memory allocated for its stack for the
stack pointers to point to... another 8-16 bytes (pointer and length).

~~~
echlebek
The gc compiler, used by the vast majority of Go developers, does not use the
C calling convention.

[https://golang.org/doc/faq#Do_Go_programs_link_with_Cpp_prog...](https://golang.org/doc/faq#Do_Go_programs_link_with_Cpp_programs)

------
derefr
I would love to see a compare-and-contrast between the Golang scheduler and
the Erlang scheduler, in the way they handle network-IO-heavy workloads. Maybe
throw in the JVM scheduler, too (though its JIT would likely complicate
things.)

------
tyingq
Makes me somewhat curious how go deals with a hung NFS mount ("hard mount"). I
suspect everything would stop, where a normal OS thread wouldn't hang if it
weren't interacting with NFS.

~~~
siebenmann
This should work fine. The goroutine making the system call that touches the
NFS mount will consume an OS thread (an 'M' in Go terminology), but it will
release its hold on other resources. Go uses as many OS threads as necessary
to cope with running user code and doing OS system calls and so on (and starts
new ones on demand).

If you had lots of goroutines do lots of things that stalled on hung NFS
mounts, you would build up a lot of OS threads (all sitting in system calls)
and might run into limits there. But that's inevitable in any synchronous
system call that can stall.
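
The behavior described above is easy to observe; a minimal Linux sketch that parks one OS thread in a blocking read(2) on a writer-less pipe while the rest of the program keeps running, even with GOMAXPROCS=1:

```go
package main

import (
	"fmt"
	"runtime"
	"syscall"
	"time"
)

// otherWorkProceeds starts a goroutine that blocks an OS thread (an M)
// in the kernel, then checks that the rest of the program still makes
// progress even though there is only a single P.
func otherWorkProceeds() bool {
	runtime.GOMAXPROCS(1) // one P: at most one goroutine runs Go code at a time

	// A pipe with no writer: read(2) on its read end blocks forever.
	var p [2]int
	if err := syscall.Pipe(p[:]); err != nil {
		panic(err)
	}
	go func() {
		buf := make([]byte, 1)
		syscall.Read(p[0], buf) // M blocks here; the runtime frees up its P
	}()

	// If the blocked M still held the only P, this goroutine would starve.
	time.Sleep(50 * time.Millisecond)
	return true
}

func main() {
	fmt.Println("main still running while the syscall blocks:", otherWorkProceeds())
}
```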

(I'm the author of the linked-to article.)

------
amluto
A side effect of this scheme is that a long sequence of slow-but-not-that-slow
syscalls becomes _extremely_ slow because the Go scheduler gets invoked each
time.

------
pythux
Nice read! I am not very familiar with this field of research but, could
runtimes of other languages (say, Node.js or Python) benefit from such
optimizations? What about libraries like libuv, I guess they must be fairly
fine-tuned already? Or is this something that is specific to Go and would be
hard in other contexts?

~~~
kevingadd
Go is one of the few languages that does syscalls itself (most runtimes avoid
this because it's extremely high-risk and low-payoff), so some of its
syscall-related techniques are not easily adapted to other runtimes.

~~~
pcwalton
Note that even Go only does syscalls itself on Linux. On macOS and Windows it
calls into libSystem and kernel32.dll respectively, as the syscall interface
is not stable on those platforms.

~~~
masklinn
> Note that even Go only does syscalls itself on Linux.

AFAIK Go does syscalls itself on every platform except Windows and macOS,
including all the BSDs. And even on macOS, despite direct syscalls never
having been officially supported, it took multiple breakages a few years back.

The first thread here mentions the issues this causes for OpenBSD.

------
duelingjello
1\. Does OS thread M get pinned to run only on a particular processor P? (It
seems like "yes" when default.)

2\. If M blocks in a syscall too long in the optimistic case:

2.a. is M unpinned from P but continues to block until the syscall returns?

2.b. is another thread from the pool used or new thread created, and pinned to
P so that P can be used for other work? (I think this depends on configuration
if there are fewer, same or more threads than processors.)

2.c. is there an upper limit on outstanding blocked syscall worker threads or
will it simply be the last task any extra created threads beyond the normal
limit would ever process?

~~~
siebenmann
An OS thread M can run on any available P. While there are some caches
associated with each P, Ps are fundamentally there to ensure that only so many
CPUs' worth of Go user code is ever running at once, so the important thing is
that an M that wants to run user Go code has _some_ P, not a particular P. Ms
claim and release Ps as they go in and out of running Go user code, but I
believe they don't release and then re-acquire a P as they switch between
goroutines.

(I believe the actual implementation treats Ms as a sort of secondary thing.
For instance, I think that the local list of runnable goroutines is attached
to the P, not to the M. At one level, the M is just a context for running
things on Ps.)

In the optimistic case when the system call blocks for too long, the M is
unpinned from the P it was using and continues to sit in the system call (the
Go runtime doesn't attempt to interrupt the system call itself). If there is
another runnable goroutine and there are no free Ms, the Go scheduler will
create another M to run the goroutine on the now-free P. I think that the
runtime directly allocates the free P to the newly created M rather than
letting the new M try to contend with other things for the P, but I'm not
sure.

I don't think there's any limit on the number of Ms (OS threads) that the Go
runtime will create, but I haven't checked the code carefully. Idle Ms are
reclaimed under some circumstances.
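
Some of this bookkeeping can be glimpsed from a running program; a rough sketch using the standard runtime package (ThreadCreateProfile with a nil slice just reports how many OS threads have been created, including any parked in blocking syscalls):

```go
package main

import (
	"fmt"
	"runtime"
)

// schedCounts reports the runtime's view of Ps, Ms, and Gs.
func schedCounts() (ps, ms, gs int) {
	ps = runtime.GOMAXPROCS(0)               // Ps: cap on concurrently running Go user code
	ms, _ = runtime.ThreadCreateProfile(nil) // Ms: OS threads created so far
	gs = runtime.NumGoroutine()              // Gs: live goroutines
	return
}

func main() {
	ps, ms, gs := schedCounts()
	fmt.Printf("Ps=%d Ms=%d Gs=%d\n", ps, ms, gs)
}
```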

(I'm the author of the linked-to article.)

------
jancsika
> For example, on modern systems the 'system call' to get the current time may
> not even enter the kernel (see vdso(7) on Linux).

Is there a way to check from the running process whether that is the case or
not?

~~~
wyldfire
Check whether it happened to do that for a given call or a portable way to
predict whether it will?

dump_vdso [1] will write the vdso to stdout, you can use binutils like objdump
or nm to list the symbols present.

[1]
[https://kernel.googlesource.com/pub/scm/linux/kernel/git/lut...](https://kernel.googlesource.com/pub/scm/linux/kernel/git/luto/misc-tests/+/5655bd41ffedc002af69e3a8d1b0a168c22f2549/dump-vdso.c)

~~~
monocasa
And if you wanted the same information at runtime: the base address of the
vDSO is passed in via the auxiliary vector (auxv), and passing that address
into libelf would get you everything.
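
A minimal Linux sketch of that approach, reading AT_SYSINFO_EHDR (key 33) out of /proc/self/auxv rather than the startup auxv, and assuming a 64-bit little-endian target (no libelf parsing here, just the base address):

```go
package main

import (
	"encoding/binary"
	"fmt"
	"os"
)

// atSysinfoEhdr is the Linux auxv key whose value is the address where
// the kernel mapped the vDSO into this process.
const atSysinfoEhdr = 33

// vdsoBase scans /proc/self/auxv for the vDSO base address.
func vdsoBase() (uint64, bool) {
	raw, err := os.ReadFile("/proc/self/auxv")
	if err != nil {
		return 0, false
	}
	// auxv is a flat array of (key, value) native-word pairs,
	// terminated by an AT_NULL (0) key.
	for i := 0; i+16 <= len(raw); i += 16 {
		key := binary.LittleEndian.Uint64(raw[i:])
		val := binary.LittleEndian.Uint64(raw[i+8:])
		if key == atSysinfoEhdr {
			return val, true
		}
	}
	return 0, false
}

func main() {
	if base, ok := vdsoBase(); ok {
		fmt.Printf("vDSO mapped at %#x\n", base)
	} else {
		fmt.Println("no vDSO entry found")
	}
}
```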

