I’m not a Go developer: is this because of its use of green threads?
There is also a philosophy among the Go maintainers that you shouldn't call into C unless you have to. Unfortunately you can't draw pixels efficiently (at all?) in pure Go, so...
Kinda. C's stacks are big (1~8MB by default depending on the system, IIRC), and while that's mostly virtual memory, Go still doesn't want to pay for a full C stack per goroutine. Plus, since it has a runtime, it can make different assumptions and grow the stack dynamically if necessary.
So rather than set up a C stack per goroutine, Go sets up its own stack (initially 8K, reduced to 2K in Go 1.4), and if it hits a stack overflow it copies the existing stack to a new, larger one (similar to hitting the capacity limit on a vector).
But C can't handle that: it expects enough stack to already be there, and it's got no idea where the stack ends or how to resize it (the underlying platform just faults the program on stack overflow). So you can't just jump to C code from Go code; you need an actual C stack for things to work, and that makes every C call from Go very expensive.
Rust used to do that as well, but decided to leave it behind as it went lower level and fast C interop was more important than builtin green threads.
Erlang does something similar to Go (by default a process has ~2.6K allocated, of which ~1.8K is for the process's heap and stack) but the FFI is more involved (and the base language slower) so you can't just go "I'll just import cgo and call that library" and then everybody dies.
The problem you're going to have is that if 10K goroutines all call PCRE you need 10K stacks, because all the calls are (potentially) concurrent.
What makes Go work is that the compiler calculates how much local memory a goroutine requires, so after a serialised bump of the stack pointer the routine cannot run out of stack. Serialising the bump between competing goroutines is extremely fast (no locks required). Deallocation is trickier. I think Go uses copy collection, i.e. it copies the stack when it runs out of address space on the stack, NOT because it's out of memory (the OS can always add to the end), but because the copying compacts the stack by not copying unused blocks. It's a stock standard garbage collection algorithm... used in a novel way.
The core of Go is very smart. Pity about the rest of the language.
There is no "machine stack", and yes, in the details it tries to set up and memoise C stacks, but it still needs to switch out the stack and copy a bunch of crap onto there, and that's expensive.
brb time to bench
Possibly gccgo pays less heavy a price?