"The advantage of doing it in the userspace code is that it can be done in a simpler, more specialized way."
No, the advantage of doing it in userspace code is that you can paper over the deficiencies of the layers beneath the one you are working in, given significant manual effort. Node.js introduced its concurrency problems when it selected Javascript as one of those layers; it doesn't get much credit in my mind for then solving them at great effort and with horrible damage done to the resulting program structures. The problems Node.js solves are not fundamental to programming; they are fundamental to Javascript.
Pick something like Erlang and the problem never exists in the first place. You don't have to paper over the deficiencies in the lower levels, because the levels below the code you're writing aren't deficient for concurrency in the first place.
I should point out that in general this is not necessarily a bad thing; alas, there's always some way your lower layers are deficient, and it's far worse when they make it impossible to paper over the problem. Still, you will never end up with simpler code. More specialized, oh my yes, but certainly not simpler. And the wisdom of picking a layer that is fundamentally deficient for your core target problem and then papering over it seems pretty limited to me.
"Pick something like Erlang and the problem never exists in the first place. You don't have to paper over the deficiencies in the lower levels, because the levels below the code you're writing aren't deficient for concurrency in the first place."
Now, be fair: Erlang papers over the deficiencies of the lower layers (namely, the kernel) with significant manual effort. It's just that someone else has already gone to this effort. BEAM is an event-based server behind the scenes, with lots of syntactic sugar to make it look multithreaded. The PLT Scheme webserver is another example of this.
I also didn't mean to suggest that the code itself is necessarily simpler. It can be, but (as the Node.js example proves) it is certainly not always. The scheduling algorithm, on the other hand, is typically much simpler; namely, "round-robin cooperative multitasking." No serious operating system since Windows for Workgroups has actually tried to use round-robin cooperative multitasking.
All the other "thread stuff" is typically simplified as well: smaller stacks (if any at all), simpler context-switching code, that sort of thing. The actual encoding of the business logic? That depends on the problem.
"Now, be fair: Erlang papers over the deficiencies of the lower layers (namely, the kernel) with significant manual effort."
Since the various Xpolls were added to the kernel, I would disagree. Even based on a select loop, the kernel has been happy to do many things asynchronously for a long time now; it's just a bit of a clunky API. What stopped it from being easy was that C had no good concurrency story except "lots of manual effort", so nobody could really use those facilities effectively from C. (The kernel itself has actually had the "lots of manual effort" applied to it.) The vast majority of concurrency failure has been layered in at the high-level-language point. There is a large class of "better C than C" languages (or their implementations, as the case may be) that simply shoot you before you can even think about fine-grained concurrency: CPython, Perl, Ruby, Javascript, etc. And there is another class of languages that permit it but don't really help much: C(++), Java, C#, and anything where you might seriously use a semaphore directly.
This is the core error that Node.js hype and its partisans make, mistaking the deficiencies of a set of languages for deficiencies in programming itself. It's not even in the OS; the OS doesn't mind threadlets/green threads/OS processes, it's all in the high-level languages.
The kernel may not have provided you a threadlet system out of the box, but the Erlang VM isn't particularly fighting the kernel either, it's building on it. In this context, that's not particularly what I mean by deficiency. Missing things can be added, I'm talking about things where the underlying layers actively fight you.
Also, yes, in some respects I'm still really responding more to the hype than directly to you. Saying that BEAM is an event-based server behind the scenes is basically the point I think needs to be made more clearly: you can be "asynchronous" and "event-based" without having to embed the asynchrony and eventedness visibly in every function and nearly every line of the code base.
"Since the various Xpolls were added to the kernel, I would disagree. Even based on a select loop the kernel has still been happy to do many things asynchronously for a long time now, it's just a bit of a clunky API."
I have the suspicion that we're in violent agreement. One way to look at it is that select/epoll/kqueue/etc are hooks that allow the programmer to work around the kernel's threading model by doing several tasks at once in the same thread.
"It's not even in the OS; the OS doesn't mind threadlets/green threads/OS processes, it's all in the high-level languages."
No, the OS does not mind these things. On the other hand, User-Mode Linux proved years ago that the OS doesn't mind you implementing a whole other OS on top of it. This is essentially what BEAM is: though it outsources the "interacting with hardware" stuff to the underlying OS, it has all the other fundamental parts of an OS built-in. One could argue that, given this, the OS does not actively fight you on anything.
In a perfect world, there would be no need to do this. The process architecture would suit Erlang's purposes all by itself, and BEAM could be a regular old VM that outsources scheduling to the OS. Unfortunately, this is not a perfect world, and real-world schedulers generally need to worry about a wide variety of things. Not only that, but things like permissions are typically strapped onto scheduler primitives for historical reasons. BEAM punts on many of those things, and thus trims a whole lot of overhead.
"Missing things can be added, I'm talking about things where the underlying layers actively fight you."
On that basis, I can argue that C is perfectly happy to do all those same things. BEAM is written in C, so anything BEAM can do is empirically possible in C.
When it comes down to it, the hardware doesn't know a thing about "events", "processes", "threads", "semaphores", or even "concurrency". The hardware performs calculations, and it does not care whether those calculations are specified in "The Operating System" or "A Userspace Program" or "A Library".
To the extent those can be papered over, they have been by most good runtimes; the remainder is evenly applied to all languages and runtimes under discussion because nobody can escape from them at all, no matter how awesome the top layers are.
This is what I was referencing when I offhandedly mentioned that some layers can make papering over them actually impossible: if the kernel cannot be convinced to do it with any series of syscalls, or worse yet the hardware itself cannot be convinced, you lose, game over. Another example: Haskell's very strong type system can do a fairly good job of making really damned sure you don't escape from the system, which is useful for making an STM that is actually usable, precisely because the smallest escape hatch tends to bring it down. But the downside is that a library or something can actually make it impossible to hack around a problem (with any reasonable degree of effort).
"Now, be fair: Erlang papers over the deficiencies of the lower layers (namely, the kernel) with significant manual effort. It's just that someone else has already gone to this effort. BEAM is an event-based server behind the scenes, with lots of syntactic sugar to make it look multithreaded. The PLT Scheme webserver is another example of this."
This really is not a valid critique. When you get down to it, I/O is going to be interrupt driven. Everything other than the interrupt handlers is just going to be "sugar."
Node.js is the worst of both worlds - all that "significant manual effort" is now pushed into the application logic, and all that it's doing for you is just thinly "papering over" the syscalls.
If you really want to see a well-done papering over, take a look at Gambit Scheme instead of BEAM. Marc Feeley really understands the issues with green-thread scheduling, continuations, and concurrency.
Erlang has a very simple runtime model: processes basically don't have a stack across invocations (message receives) or a dynamic environment, the only concurrency is in mailboxes, and despite claims of being "soft real-time" there's actually not a lot you can do to influence scheduling policy.
Gambit supports pre-emptive multitasking for green threads, with continuations, exceptions, dynamic-wind, and (optionally inheritable) dynamic environments.
Gambit provides mailboxes, but also locks and condition variables (with a much richer API than most implementations; usually CVs wake either one or all threads, while Gambit has functions for both).
This is what makes me want to work on operating systems, and in particular exokernel operating systems. There's no reason why OS threads should be expensive, other than hugepages.