So, preemptive green-threads? This seems like a counter-productive and highly complicated re-implementation of what the OS thread model does, but inside the Go runtime.

I would very much like for goroutines to remain cooperative. It is much simpler, naturally more performant, and quite easy to reason about. A local implementation of preemptive execution seems like a very high price to pay to only handle a few corner-cases.

That is not to say that preemptive scheduling does not have its place: it makes a lot of sense at the OS level, where independent processes are scheduled, but little sense in an internal green-thread implementation.



Why do you think cooperative goroutines are easier to reason about? In Node, you can say with a straight face that once you start executing a bit of code, you can tell that that code will execute without pre-emption until a fairly obvious point where it stops running. In Go, the runtime inserts pre-emption points basically whenever it feels like it; yes, there are some pragmatic rules about where they may be, but even if you know the rules you can't count on them without intensely studying whether or not certain functions got inlined, or things like that, and you can have no confidence about what other versions of Go (or other implementations) may do.

From what I can see, you could theoretically "reason" about your code if you compiled it with optimizer diagnostics on, studied the output carefully, and understood it intimately, but it would be a dangerously fragile line of reasoning, subject to change without notice.

And I say all this ignoring the fact that whatever you are reasoning about is almost certainly already a bug according to the spec and something the race detector will scream about. I would not feel confident running a program that has known race conditions in it, but that you feel like you've reasoned via intimate knowledge of the runtime that they can't actually happen. I'd much rather see you just write the code to be concurrently correct anyhow, and then I can do things like "upgrade to the next version of Go" or "cross-compile to another architecture that may do things that violate your fragile reasoning" without stark terror.
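To make that concrete, here's a minimal sketch (my own example, not anything from the proposal) of the kind of code such reasoning produces. It "works" only if you assume the busy loop is never preempted before observing the write; it is a data race per the spec, and "go run -race" will flag it:

    package main

    import (
        "fmt"
        "runtime"
    )

    func main() {
        runtime.GOMAXPROCS(1)
        done := false
        go func() { done = true }()
        // Unsynchronized busy-wait: on runtimes without asynchronous
        // preemption this loop contains no preemption point, so the
        // goroutine above may never be scheduled. Even where it is,
        // the compiler is free to hoist the load of done out of the
        // loop and never observe the write. sync/atomic or a channel
        // fixes it.
        for !done {
        }
        fmt.Println("finished")
    }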


Just to set things straight, I said:

> It is much simpler, naturally more performant, and quite easy to reason about

That is, easy to reason about, not easier to reason about. Technically, preemptive scheduling is harder to reason about, but the point of it all is that you get to pretend it doesn't exist. Thus, from a high-level perspective, preemptive is easier than cooperative.

However, cooperative scheduling is very easy to reason about, like in the case of ECMAScript. In ECMAScript, most APIs are async, but you can write "await" to yield the current task and wait for promise completion. In Go, things that cause blocking (I/O, synchronization primitives) are automatic yield points.
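As a rough sketch of that Go behavior (my example): with GOMAXPROCS(1), the blocking channel operations are themselves the scheduling points; no explicit yield is needed:

    package main

    import (
        "fmt"
        "runtime"
    )

    func main() {
        runtime.GOMAXPROCS(1)
        ch := make(chan int)
        go func() { ch <- 42 }() // the send is a scheduling point
        // The receive blocks, which hands the only processor to the
        // goroutine above; its send then unblocks the receive.
        fmt.Println(<-ch)
    }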


"However, cooperative scheduling is very easy to reason about, like in the case of ECMAScript."

And my point is that cooperative scheduling isn't Platonically easy to reason about. It is easy to reason about for specific, concrete reasons, reasons which do not apply to Go code. It makes no sense to ask for goroutines to stay cooperative on the grounds that cooperative is easy to reason about, because the way Go does it is not easy to "reason" about. Which is why you're 99.9999% of the time better off pretending Go is already fully pre-emptive and writing your code that way; that's close enough to 100% to just go ahead and do it, since the effort of detecting when you may be in that less-than-once-per-career case exceeds just writing the code to be preemptively-correct.

I suppose I'll concede that a very, very narrow reading of your text is technically correct, but in that case, that narrow technical reading consists of facts that make no sense to bring up in this context, so I'm not feeling all that guilty about my slightly more expansive reading.

(I also don't even particularly agree that cooperative threading is easy to reason about. In practice I have found that even if you know that your code isn't going to be pre-empted in the middle of a block, it is still very difficult to turn that knowledge into code that is correct in the face of high concurrency. I actually find it easier to write code with sensible concurrency constructs like channels or actors than to try to write correct code based on lots of assumptions about cooperative multithreading. Or, if you prefer, it is certainly academically easy (not easier, but easy) to reason about, but in practice I'd take channels, actors, Clojure/Haskell STM, or any of a bunch of other sensible constructs any day. By which I mean, I do; my skin's in the game here, I'm not just abstractly pontificating.)


You definitely shouldn't feel guilty about your "expansive reading", but I do prefer not to have claims I did not make assigned to me. :)

Your comment is rather empty without examples or elaboration. It also seems that all but one of your points are not actually related to cooperative vs. preemptive multitasking.

You claim that the reasons for cooperative multitasking being simple do not apply to Go code (from your previous post, I assume that you find them to apply to ECMAScript). However, you do not present any reasons for this. Would you mind elaborating?

Go is a little bit more magic than ECMAScript, but ECMAScript is for all intents and purposes a single-threaded language (Web Workers exist, but they cannot share resources, due to the memory-safety problems that would create, and instead rely on a simple message-passing mechanism). This makes it a very different beast from Go (read: simpler).

However, it is not very magic. You can basically just consider all I/O and synchronization primitives to be prefixed with "runtime.Gosched()", or the equivalent yielding function from other green thread implementations.
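As a sketch of that mental model (my example; "spin" is just a stand-in for CPU-bound work): pure computation contains no implicit yield point under the cooperative model, so you hand control over yourself:

    package main

    import (
        "fmt"
        "runtime"
    )

    // spin stands in for CPU-bound work with no I/O or synchronization.
    func spin(n int) {
        for i := 0; i < n; i++ {
            if i%1024 == 0 {
                // The only scheduling point in this loop is the one
                // we insert ourselves.
                runtime.Gosched()
            }
        }
    }

    func main() {
        runtime.GOMAXPROCS(1)
        done := make(chan struct{})
        go func() {
            fmt.Println("other goroutine ran")
            close(done)
        }()
        spin(1 << 20)
        <-done
    }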

Elaboration of your comment about the difficulty of turning cooperative code into "correct code in the face of high concurrency" would also be nice. Some types of parallel code can be quite hairy to guarantee the correctness of, but I don't see how this is related to cooperative vs. preemptive multitasking.

You mention channels and actors as if they were mutually exclusive with the Go model, which I find odd considering that Go caters specifically to the use of channels and actors. They are also completely detached from the choice of multitasking model, and exist just fine in both preemptive and cooperative environments.

I do quite like the channel/actor model, but that is entirely unrelated to the discussion. I would also find the implication that all other primitives are not "sensible" to be a very, very large claim. And a broken one at that. :)


The Go compiler already inserts yields in a bunch of places that aren't I/O or explicit synchronization points. And it's hard to predict their exact locations, because much of this happens after some optimizations have been made. It is very much already attempting to look preemptive, which is why the authors didn't use the term coroutine for the language feature -- it's an implementation detail.

This proposal is about fixing some of the edge cases where the current implementation doesn't do what the interface is supposed to. It outlines another, simpler approach which just has the compiler add more yields, but apparently the performance impact of that is too high. But it's just a question of perf & implementation complexity, not semantics.
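Roughly, that simpler approach would amount to the compiler rewriting loop back-edges along these lines (my illustration; "preemptRequested" is a hypothetical stand-in for the runtime's real per-goroutine preempt-flag check):

    package main

    import "runtime"

    // preemptRequested is hypothetical; the real check would be a
    // cheap flag (or stack-bound) test maintained by the runtime.
    func preemptRequested() bool { return true }

    func work(i int) {}

    func main() {
        // The compiler would turn "for i := 0; i < n; i++ { work(i) }"
        // into a loop with a check on every back-edge, roughly:
        for i := 0; i < 1000; i++ {
            if preemptRequested() {
                runtime.Gosched()
            }
            work(i)
        }
    }

That branch on every iteration is exactly the runtime overhead the proposal worries about.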


> So, preemptive green-threads? This seems like a counter-productive and highly complicated re-implementation of what the OS thread model does, but inside the Go runtime.

You can't easily spawn 10.000 OS threads.

> I would very much like for goroutines to remain cooperative. It is much simpler, naturally more performant, and quite easy to reason about.

Cooperative is less simple to reason about, as the proposal explains. Preemption fully delivers what the developer already expects of goroutines.


> You can't easily spawn 10.000 OS threads.

Of course you can, without any problem at all. It's just more expensive than a green thread (on Linux they're full kernel-scheduled tasks, each with its own stack), and unlike green threads, it affects how processing time is sliced. It is, however, much cheaper than people tend to believe.

> Cooperative is less simple to reason about, as the proposal explains.

Cooperative scheduling is very simple to reason about. In the simplest form, you do not give up control unless you use I/O, a synchronization primitive, or explicitly give up control (runtime.Gosched() in Go).

As a Go developer, or as a developer in any other language, this is exactly what I expect from a green-thread implementation. That's how M:N scheduling has always been implemented, and how green threads have been implemented in other languages.

Technically, preemptive is harder to reason about, but you save having to reason about it. However, if you expect preemption from goroutines, you have misunderstood the primitive.


> In the simplest form, you do not give up control unless you use I/O, a synchronization primitive, or explicitly give up control (runtime.Gosched() in Go).

This is not how Go works nowadays. There are other preemption points.


That's why I said "in the simplest form".


Yes, but it completely ruins the argument about "it's easy to reason about" ;-)


> You can't easily spawn 10.000 OS threads.

This is what I thought until last week, but it seems it's no big deal. 1M threads starts to be an issue.
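You can even try it from Go itself. A rough sketch (mine): pin each goroutine to its own OS thread with runtime.LockOSThread; note you first have to raise Go's own default cap of 10000 threads:

    package main

    import (
        "fmt"
        "runtime"
        "runtime/debug"
        "sync"
    )

    func main() {
        // Go's runtime refuses to create more than 10000 threads by default.
        debug.SetMaxThreads(20000)

        var ready, done sync.WaitGroup
        release := make(chan struct{})
        for i := 0; i < 10000; i++ {
            ready.Add(1)
            done.Add(1)
            go func() {
                defer done.Done()
                // Pin this goroutine to its own OS thread.
                runtime.LockOSThread()
                ready.Done()
                <-release // hold the thread until all 10000 exist
            }()
        }
        ready.Wait() // all 10000 threads are now alive at once
        close(release)
        done.Wait()
        fmt.Println("spawned and joined 10000 OS threads")
    }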


> You can't easily spawn 10.000 OS threads.

Of course you can. Go on, compile and run this program:

    #include <pthread.h>
    #include <stdlib.h>

    /* Each worker blocks on its (pre-locked) mutex, then exits once
       main unlocks it. */
    void *f(void *arg) {
        pthread_mutex_lock((pthread_mutex_t *)arg);
        pthread_mutex_unlock((pthread_mutex_t *)arg);
        return NULL;
    }

    int main() {
        pthread_t threads[10000];
        pthread_mutex_t mutexes[10000];
        /* Spawn 10000 threads, each parked on its own locked mutex. */
        for (unsigned i = 0; i < 10000; i++) {
            pthread_mutex_init(&mutexes[i], NULL);
            pthread_mutex_lock(&mutexes[i]);
            pthread_create(&threads[i], NULL, f, &mutexes[i]);
        }
        /* Release all of them at once, then wait for each to finish. */
        for (unsigned i = 0; i < 10000; i++)
            pthread_mutex_unlock(&mutexes[i]);
        for (unsigned i = 0; i < 10000; i++)
            pthread_join(threads[i], NULL);
        return 0;
    }
Executes in a second for me, even on macOS.


> You can't easily spawn 10.000 OS threads.

I just spawned 10.000 OS threads on my laptop, running a separate mRuby instance in each thread, to see if it'd cause any problems (I happened to have an app sitting around that spawns separate threads; note: these are not Ruby threads, as mRuby does not have built-in threading). None at all. So yes, you can. I didn't check how much memory it'd take, however, or attempt to measure overheads in any way, so I'm not saying there aren't potential issues with it.


Simply using 1,000+ OS threads is often very close to as efficient as using coroutines or an event loop: https://www.slideshare.net/e456/tyma-paulmultithreaded1


If your event loop is as slow as 1000+ coroutines or OS threads, you are doing it very very wrong.


AFAIK the main benefit to coroutines is getting to manually manage tiny stacks.

For example, in Go, the runtime can mark a goroutine's stack as "hasn't executed since the last scan", which makes GC with thousands of threads a lot faster.


[Citation needed]


https://github.com/golang/go/issues/12061

"Mark termination does not have to rescan stacks that haven’t executed since the last scan"

With pre-emptive scheduling of OS threads, you potentially dirty a ton more stacks, and have to scan them all for pointers to heap memory. I'm not sure how VMs using lots of OS threads deal with this.


> So yes, you can. I didn't check how much memory it'd take, however, or attempt to measure overheads in any way, so I'm not saying there aren't potential issues with it.

The "easily" part wasn't referring to feasibility, but to how effortless it is to do AND have performance and scalable code.


> This seems like a counter-productive and highly complicated re-implementation of what the OS thread model does, but inside the Go runtime.

But I think you still get advantages such as being able to manage stacks yourself, which is what allows for many more coroutines than native threads.


This is the part I don’t understand. You know you can specify the stack size explicitly in POSIX? I can set the stack size to the same 2 KB Go uses and have less overhead. The only difference is it won’t auto-expand like in Go.


> The only difference is it won’t auto-expand like in Go.

Well yeah that’s the difference I was talking about.


> The only difference is it won’t auto-expand like in Go.

Which makes all the difference.


Stacks in Linux grow dynamically, so if you want Go-like stacks you just set the max stack size == system memory.

See http://man7.org/linux/man-pages/man2/setrlimit.2.html


No, the stack only grows for the initial thread in a process. Subsequent threads don't have growable stacks. And this is true for virtually every operating system that matters, not just Linux.

Note that Go stacks can also shrink.


I don't think this is correct, do you have a source for this?

man PTHREAD_ATTR_SETSTACKSIZE(3) says:

> The stack size attribute determines the minimum size (in bytes) that will be allocated for threads created using the thread attributes object attr.

and:

> A thread's stack size is fixed at the time of thread creation. Only the main thread can dynamically grow its stack.

My understanding is that it is referring to virtual memory. The kernel would allocate a giant blob of address space, [stacksize + heapsize + some other stuff] large. My reading of the manpage above is that the main thread can change this allocation, while other threads are stuck with what they started with.

But why would the kernel actually realize the stack portion of the allocation? Surely if I create a child thread with a 1 GB stack, it will not realize those stack pages until I actually use them?


All the discussion was about virtual memory.

The initial thread in a process, on every operating system that is still relevant today, uses guard pages to grow its virtual memory allocation for the stack segment. Physical memory for the stack segment's pages may or may not be mapped into the process address space (it will usually be mapped). This stack can't shrink, and once the thread touches a page, it becomes committed memory and will always stay committed.

Other threads in a process do not use guard pages and use a fixed virtual memory allocation for their stacks. Their stacks are fixed in size and can't grow. Just as for the initial thread, once a page is touched it becomes committed memory and will always stay that way.

In Go, stacks start small and use a variable amount of virtual memory. Go stacks can shrink, freeing both address space and physical memory.


> Stacks in Linux grow dynamically

I think Linux can defer committing memory for the pages of the stack until they're used, but you need to reserve the entire virtual address range up front, don't you? Otherwise how can you dynamically allocate more stack address space to an arbitrary thread without relocating it? Or, if you do relocate, how do you update pointers into the stack?


That is already the case for the cooperative green threads that goroutines currently are.


But the cooperative green threads don't get the advantages of preemption, which is the entire point of the exercise.


> So, preemptive green-threads? This seems like a counter-productive and highly complicated re-implementation of what the OS thread model does, but inside the Go runtime.

It would indeed be a reimplementation of what the OS does, but note that such a reimplementation is the very reason why quite straightforward Erlang code can easily handle hundreds of thousands of active connections on modest hardware.


This is completely wrong. The BEAM VM is not pre-emptive. It is cooperative. The trick is that it is not possible to write blocking code in Erlang. There are no for loops; the only way to implement iteration is via recursion. The scheduler will switch to the next actor only after processing a message in its entirety. It will not interrupt an actor that is currently processing a message.

Preemption is good because it prevents bad processes from starving the rest of the system, but it's far from efficient.


The BEAM will totally interrupt a process that is currently processing a message. It does so based on a 'reduction' count, roughly equivalent to each function call. Since there is no looping construct, any iteration can be interrupted halfway through.

But this means that the VM is entirely allowed to interrupt you halfway through sorting a list obtained through a message. No qualms about it. You reach your function-call limit, you're phased out and put back in the queue.

The three ways I know about to hog a scheduler for yourself are:

- write C code that misbehaves and stalls the scheduler (there is now a dirty scheduler if you want the code to run there)

- hardcode a sequence of arithmetic operations (which are not subject to reduction counts) that can take a very long time to execute -- something that never happens naturally in code

- set your process to the max priority and do busy looping while the rest of the system is at a lower priority. Your process will now have a higher priority and prevent others from running

That's about what you can do. It requires a kind of explicit effort to do and is very easy to spot from afar.


"The scheduler will only switch to the next actor only after processing a message in it's entirety. It will not interrupt an actor that is currently processing a message."

Have they changed it? Last I knew, processes get a "reduction count", which is basically "the number of opcodes in the VM this process gets to run before it gets suspended". Technically, this is exactly like what Go does, except instead of sticking pre-emption points at function calls, it places them at every opcode, which is to say, quite often multiple times per line of code. Technically, you can still freeze the interpreter (or a thread of it) by calling into some C code that counts as one "reduction" but never returns. In practice, this causes about as much trouble for Erlang programmers as it does Go programmers, namely, "it's something you should know but can literally go your entire career without hitting". (Depends on what kind of code you're writing. Someone trying to use Erlang or Go for heavy numerical processing has a good chance of hitting it, but if you're writing network code, you'd have to reeeaallly work at it in either environment.)


Erlang's pre-emption points are specifically inside the implementations of the CALL and RET instructions, not on every opcode. It's just that every stack frame effectively takes O(1) runtime before hitting one or the other of those instructions, because of the lack of primitive loop instructions. (Mind you, there are O(N) instructions—there's cryptographic-hashing BIFs, for example—but they're internally reduction-budgeted, so the scheduler can yield from inside them.)

> Technically, you can still freeze the interpreter (or a thread of it) by calling into some C code that counts as one "reduction" but never returns.

Actually, you can't! (Or, well, at least as long as you've given a moment of thought while coding your NIF you can't.)

Avoiding this used to require explicit calls to the reduction budgeting functions in your native code. But as of recent ERTS releases, you just have to mark your NIF as "dirty" and it'll get offloaded to a separate scheduler-thread-pool that exists just to run dirty (i.e. long CPU-time) NIFs. There's no reason, today, to not just mark your NIF as dirty when you start writing it, and then take the dirty flag off only when you're ready to add the calls to the reduction-budgeting logic.

(If you flip it around in your head, having a dirty flag is essentially the same as having a "clean" flag—or, in other words, an "I know what I'm doing, I've guaranteed this has reduction-budgeting" flag. Which is no more and no less than something like Rust's unsafe{} blocks. To say you can freeze a runtime with unsafe{} code shouldn't be surprising ;) What's more surprising is that you can run native code within ERTS that's not unsafe{} in this sense! ...though, as every Erlang maintainer will tell you, "it's a last resort and you would probably have been fine with a port program, which provides much better isolation.")


<citation needed>


> naturally more performant

Is this true? The proposal seems to indicate that preemption is more performant? Or at least it says "no runtime overhead", which is presumably in contrast to the current implementation.



