How Stacks Are Handled in Go (cloudflare.com)
106 points by jgrahamc on Sept 15, 2014 | 31 comments

In Quasar[1] -- a library that adds true fibers (just like Go's goroutines) to the JVM -- we've decided to go with copying stacks as well (this is the current, and original, implementation). Of course, Java doesn't have pointers into the stack, so there's no need to fix any pointers when copying, and the implementation can be in pure Java.

[1]: https://github.com/puniverse/quasar

Interesting. Is it like Python's Gevent in that it patches otherwise blocking calls (say socket.recv) to return to the Quasar runtime so it can switch fibers?

It does use bytecode injection, but it doesn't automagically intercept calls. Quasar's users use fiber-compatible implementations of existing APIs (so the code is the same, but you have to explicitly call a fiber compatible method). Those implementations aren't built from the ground up, but thin wrappers around the original library. Usually, the library's asynchronous API is used under the covers to implement its synchronous API in a way that's fiber- rather than thread-blocking.

This may be kind of a dumb question, but does Go use the stdlib C stack or are they talking about their own stack?

I'm assuming that Go has a standard / vanilla / kernel aware C-style stack living somewhere right? Then the stack they are referring to is part of their run time environment...which is actually allocated on the heap somewhere.

This is all so meta. :)

> This may be kind of a dumb question, but does Go use the stdlib C stack or are they talking about their own stack?

Their own.

> I'm assuming that Go has a standard / vanilla / kernel aware C-style stack living somewhere right?

Not unless it has to call into C, which would require a C-style stack.

> Then the stack they are referring to is part of their run time environment...which is actually allocated on the heap somewhere.

Yes, as it is in C really. The difference is that the C stack is bounded and "fairly limited" (and highly variable, from 16k to a few MB[0]).

[0] http://lists.gnu.org/archive/html/bug-coreutils/2009-10/msg0...

> Yes, as it is in C really.

Not really. What people understand by "the C stack" is the stack that automatically comes when new threads are created. This stack is not on the heap; in fact it's at the top of the user address space and it has some special properties, like the operating system setting guard pages for you.

Pthreads allow you to set your own stack for a thread, but it is a feature seldom used; though it is used by Go when cgo is employed.

Go does not choose the thread stack when using pthreads. Pthreads does.

You're right, I was confused for a moment. Go doesn't do this. It is possible though, see pthread_attr_setstackaddr.

There is no such thing as a "stdlib" stack. There is a thread stack, but with the exception of the initial thread, all Go stacks (both for Go code and for the scheduler) are allocated from "the heap" (rather Go is also responsible for mapping heap-like memory). That is, on anything except Windows, Solaris and Plan 9, where the stack given by the operating system is used for the scheduler stack, but user goroutine stacks are always allocated from the heap.

Not true. System stacks are system allocated. See my reply above.

Thanks for your note, but I'm confused by your statement. Take Linux go/src/runtime/os_linux.c:/^runtime·newosproc. It passes the already-allocated stack to clone(2). runtime·newosproc is called from go/src/runtime/proc.c:/^newm, which uses runtime·allocm to allocate a M. go/src/runtime/proc.c:/^runtime·allocm allocates this memory using runtime·malg on everything except Solaris, Windows, Plan 9, or when using cgo. The comment even reads:

    // In case of cgo or Solaris, pthread_create will make us a stack.
    // Windows and Plan 9 will layout sched stack on OS stack.
For that class of systems, runtime·newosproc (or _cgo_sys_thread_start) indeed does use the system stacks, but that's what I said too.

Let's take a look at runtime·malg again. runtime·malg calls go/src/runtime/stack.c:^/runtime·stackalloc which uses runtime·sysAlloc to allocate memory. Moving back to Linux, go/src/runtime/mem_linux.c:/^runtime·sysAlloc just calls mmap(2).

So, the answer to "who allocates system stacks" seems to be "it depends", but I already qualified this in my original answer (omitting cgo; sorry about that). Now, is there anything wrong in my description?

It is true that when cgo and therefore pthreads is not involved, Go allocates all the stacks. But then the question of where 'system code' runs is not terribly relevant: it doesn't.

If you don't have any C in your executable or its libraries, then you don't need to have any C stack. For the purpose of system calls, the stack is just the memory pointed to by the stack pointer. It could be anywhere - heap, bss, a subrange of an existing stack allocation.

If you want to call into C code (it's rather unrealistic not to have any C in any of your library dependencies), then you need to make a C-style stack available, following the conventions of the system (e.g. growing downwards vs upwards, using a guard page for incremental allocation, etc.) along with enough reserve for typical C maximum call stack depth. It needs to be set up before the call to C, but it doesn't need to hang around when C code is not in the chain of current procedure activations.

The requirements on the memory pointed to by the stack pointer for it to be considered a stack are defined by the platform ABI, which in turn is heavily dependent on CPU architecture. But as long as you meet the rules, you're fine, with regard to the OS. Third party code may be more sniffy.

It's not that surprising. If you want full coroutines that can yield from inside function calls, then each coroutine is going to need to own a separate stack. Since you only have one C stack, the coroutine stacks need to live on the heap.

Some coroutine implementations copy the coroutine stacks back into the main C stack before resuming execution, but usually it's not worth the trouble to do that and you just run everything on the heap.

Just to add what others have said, one of the Go developers has added support to GCC for segmented stacks with the -fsplit-stack flag. And it appears clang also supports this.

The question remains how Go's new runtime deals with calling out to third-party libraries that don't obey their new stack protocol.

The same way the old runtime did: call out on the pthread-provided stack.

Ah, right, Go is M:N threaded, so the scheduler runs multiple goroutines on a shared POSIX stack, mostly unused until calling out to C code. Presumably, then, only Go code can yield.

So first let me say I assume there's a good answer to this. I'm asking to find out what it is, not as any sort of criticism.

Why not leave the segmented stacks, and simply refuse to shrink them? When a stack grows, then shrinks back down, you could just leave yourself a note at the very end of the stack about where the next stack segment is, so you won't be reallocating. Plus you could start trimming off the end of the stacks if you shrank some % of the way back down to zero.

It's not just the allocation; the cost of switching between stack segments is itself really expensive. Function calls are 2 ns on most architectures; any overhead can easily make that 5x or 10x slower.


I know it's the "wrong language" for you, but after years of working in Perl and occasionally profiling it, it was a shock to switch to Go and profile anything in "nanoseconds". To see even a "microsecond" appear in Perl code is often a bit of a pleasant surprise. To some extent I'm still not used to being concerned that adding 2ns to a code path might have noticeable performance impacts....

tl;dr The stack is grown in chunks. Previously, Go maintained a linked list of chunks, but repeated allocation/deallocation of chunks within a loop was bad for performance. They switched to a "realloc-type" of stack growth (allocate a bigger region and copy the old stack over).

Now my question: why didn't they simply keep a pool of available stack chunks to reuse instead of constantly allocating/freeing them?

> They switched to a "realloc-type" of stack growth (allocate a bigger region and copy the old stack over).

This literally hurts my brain. They are moving pointers in memory for no reason, because they cling to the notion that tightly packed stacks are needed, because they want to support 32-bit processors as first class, because they want to have billions of threads at once.

But every step of that thought process is wrong. You don't want to have billions or even millions of threads at once because then your latency is large and scheduling and order of operations are unpredictable and unstable. You don't want to support 32-bit processors as first-class because even new phones are 64-bit nowadays. You don't want tightly packed stacks because it's a complicated waste to move them and grow them and virtual memory does the same job better.

I just don't get it, there's just so much wasted time and talent wrapped up in this language. I mean they created an entire toolchain just for something that should be a few k of runtime, and all it does in the end is create inefficiencies and problems with interop with everything else.

There is (was) a pool. It's not the alloc/free that is the problem. It's all the bookkeeping that is pure overhead compared to a simple CALL or RET instruction.

Am I missing something or is it strange to speak about "a single CALL instruction"? It implies you'll have to execute some function, so it can't be really considered as a single instruction.

If there was a pool, then the article is at least misleading on that point because it seems to imply that in loops, you repeatedly allocate and free, which would not be necessary with a pool.

But yes, in the end it comes down to see whether pool/linked-list bookkeeping is better or worse to data copy + pointer bookkeeping.

A single CALL instruction on the x86 does a very limited amount: it pushes the address of the next instruction onto the stack, and then it sets the instruction pointer to the thing that was called. What happens next is up to the function and not part of the CALL instruction.

Allocate and free are perfectly reasonable terms to describe the use of a pool too.

This article is not technically accurate.

Most importantly, the distinction between virtual-memory stacks and copying stacks is incorrect: if the process uses more stack space than it has physical RAM, it is going to (try to) swap, whether virtual or copying.

I believe the process will crash.

I think Rust also moved away from segmented stacks? Seems like a pretty strong signal that segmented stacks aren't the way to go when everyone is moving away from them.


Moving the coroutine stack when it's too big is also how Lua does it, isn't it?

So, they're handled basically how most coroutine runtimes handle them...

Would be great to see this added to fibers in Node.js.

