I've also come full circle on N:M / green threads. Now I'm back to thinking they aren't worth the effort / pain. My experience comes from C / C++ land and not Go and Rust, but I think the same lessons apply. In my experience I ended up using a few different frameworks, from libcoro to Mordor to raw swapcontext().
I was initially sold on the N:M model as a means of having event-driven programming without the callback hell. You can write code that looks like plain old procedural code, but underneath there's magic that uses userspace task switching whenever something would block. Sounds great. The problem is that we end up solving complexity with more complexity. swapcontext() and family are fairly straightforward; the complexity comes from other, unintended places.
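For anyone who hasn't played with these, the primitive is roughly this. A minimal sketch of cooperative switching with ucontext (error handling omitted, the 64KB stack size is arbitrary):

    #include <ucontext.h>
    #include <stdio.h>
    #include <stdlib.h>

    static ucontext_t main_ctx, task_ctx;

    /* A "green thread": runs until it would block, then yields back. */
    static void task_fn(void) {
        printf("task: doing some work\n");
        swapcontext(&task_ctx, &main_ctx);   /* yield to the scheduler */
        printf("task: resumed, finishing\n");
        /* falling off the end returns to uc_link (main_ctx) */
    }

    int main(void) {
        getcontext(&task_ctx);
        task_ctx.uc_stack.ss_sp = malloc(64 * 1024);   /* per-task stack */
        task_ctx.uc_stack.ss_size = 64 * 1024;
        task_ctx.uc_link = &main_ctx;                  /* where to go when task_fn returns */
        makecontext(&task_ctx, task_fn, 0);

        swapcontext(&main_ctx, &task_ctx);   /* run the task until it yields */
        printf("main: task yielded, a real scheduler would poll epoll here\n");
        swapcontext(&main_ctx, &task_ctx);   /* resume it until it finishes */
        free(task_ctx.uc_stack.ss_sp);
        return 0;
    }

Everything else (the scheduler, the IO integration, stack management) is what you end up building around this, and that's where the complexity lives.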
All of a sudden you're forced to write a userspace scheduler, and guess what: it's really hard to write a scheduler that does a better job than Linux's scheduler, which has man-years of effort behind it. Now you want your scheduler to map N green threads onto M physical threads, so you have to worry about synchronization. Synchronization brings performance problems, and soon you're down a new lockless rabbit hole. Building a correct highly concurrent scheduler is no easy task.
A lot of 3rd party code doesn't work great with userspace threads. You end up with very subtle bugs in your code that are hard to track down. In many cases this is due to assumptions about TLS (but that isn't the only reason). To make it work you can no longer do work stealing between your native threads, and then you end up with load-imbalance and starvation problems.
The next thing you realize is that you're still spending lots of memory on creating stacks for your green threads. Then you realize pthreads let you create small stacks for your native threads, and then you realize that 8MB stacks don't matter much due to delayed allocation. I think that both Rust and Go have backtracked on segmented ("spaghetti") stacks, since they are a lot of work, require the compiler to generate extra code, and have some bad worst-case behavior (you can get into a loop of repeatedly growing and shrinking a stack due to a function call inside a loop).
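For reference, the pthread knob I mean is pthread_attr_setstacksize(), and the delayed-allocation point is just that the default 8MB is mostly untouched virtual address space until you actually use it. A rough sketch (the 256KB figure is arbitrary):

    #include <pthread.h>
    #include <stdio.h>

    static void *worker(void *arg) {
        (void)arg;
        /* Only the stack pages this thread actually touches get backed by RAM. */
        return NULL;
    }

    int main(void) {
        pthread_attr_t attr;
        pthread_t tid;

        pthread_attr_init(&attr);
        /* Ask for a small stack if you really want one; PTHREAD_STACK_MIN is the
           floor (16KB on typical Linux). With the default 8MB you mostly pay in
           address space, not resident memory, thanks to demand paging. */
        pthread_attr_setstacksize(&attr, 256 * 1024);

        pthread_create(&tid, &attr, worker, NULL);
        pthread_join(tid, NULL);
        pthread_attr_destroy(&attr);
        return 0;
    }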
The final nail in the coffin for me was disk IO. The fact is that non-network IO is generally blocking, and no OS has great non-blocking disk IO interfaces (Windows is the best of the bunch, but it's still not great). First, it's pretty low level and difficult to use: you have to do IO on block boundaries. Second, it bypasses the page cache (at least on Linux), which in most cases kills performance right there. And in many cases this non-blocking interface will end up blocking anyway (even on Windows) if the filesystem needs to do certain things under the covers (like extend the file or load metadata). Also, the way these operations are implemented requires a lot of syscalls, and thus context switches, which further negates any perceived performance benefit. The bottom line is that regular blocking IO (better yet, mmapped IO) outperforms what most people are capable of achieving with the non-blocking disk IO facilities.
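To make the block-boundary point concrete, here's roughly what O_DIRECT forces on you on Linux (a sketch; the file name is a placeholder and the exact alignment requirement varies by filesystem and kernel, 512 bytes or 4KB being typical):

    #define _GNU_SOURCE          /* for O_DIRECT */
    #include <fcntl.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <stdio.h>

    int main(void) {
        int fd = open("data.bin", O_RDONLY | O_DIRECT);
        if (fd < 0) { perror("open"); return 1; }

        /* Buffer, offset and length must all be aligned (here to 4096),
           otherwise the read fails with EINVAL. And this bypasses the
           page cache entirely, so caching is now your problem. */
        void *buf;
        if (posix_memalign(&buf, 4096, 4096) != 0) return 1;

        ssize_t n = pread(fd, buf, 4096, 0);
        if (n < 0) perror("pread");
        else printf("read %zd bytes\n", n);

        free(buf);
        close(fd);
        return 0;
    }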
This is clearly based on my own experiences, and it looks like the Rust folks had similar ones. So my hope is that anybody thinks long and hard before going down the N:M rabbit hole. I ended up studying my mistakes, and history is chock full of people abandoning the N:M model. You can read about the history of NPTL (the threading model in Linux 2.6+ / glibc) versus NGPT, which was the proposed N:M threading model. The 1:1 NPTL model was simpler and performed better. FreeBSD and Solaris also moved from their N:M threading models to 1:1 models.
I think the N:M model is going to keep rearing its head in academic papers about the performance of highly scalable systems, but in the real world its benefits will keep being elusive. The only counterpoint to this is Go, which seems to be making a run for it with goroutines.
I should have titled this comment "How I learned to stop worrying and love plain old threads."
The Erlang VM is a successful implementation of N:M. It addresses some but not all of the issues you mentioned.
* IIRC each new process consumes 284 bytes so it's not "spending lots of memory on creating stacks for your green threads"
* Synchronization is a non-issue as everything is message-passing.
* Disk I/O is blocking, but the VM has dedicated disk threads.
* Dealing with 3rd party code is a problem, which is why relatively few libraries exist for Erlang. However, there is some hope now that NIFs (native implemented functions, i.e. C code) can interact with the scheduler, i.e. report how much time has been used and return control if necessary.
You are right that "Building a correct highly concurrent scheduler is no easy task." The Erlang scheduler was not an easy task, and has received a ton of work from extremely talented engineers over the course of decades. Definitely worth a look to see how N:M can be made to work well.
C integration is the killer for N:M threading. C libraries do blocking IO, are not wrapped in a VM, are not compiled for segmented stacks, use thread-local storage, etc. You can work around these problems, which increases complexity. For close C integration, you are basically forced to adopt C's view of threading.
This is why the R17 release of Erlang might get "dirty schedulers" for this kind of work. Essentially it gives you background thread pools where such work can be carried out.
Currently, you can call C functions directly, but it is not meant for long-running tasks which can block and so on.
You can create threads with enif_thread_create and build a queue to handle long running tasks (or a pool of threads), and use enif_send to return the result. Not quite as hacky as it sounds, since afaik port drivers rely on message passing to return values to the VM as well. Maybe not ideal, but better than naively bumping reductions. I haven't played with enif_timeslice yet, so I'm not sure how it compares (it's newer than NIF threads).
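For the curious, the pattern looks roughly like this. This is only a sketch, not production NIF code (the module name, the function names and the 'done' message are made up, and a real implementation would keep the thread id around to join it, or use a pool):

    #include "erl_nif.h"

    struct job { ErlNifPid caller; };

    /* Runs on a plain OS thread, outside the Erlang schedulers. */
    static void *worker(void *arg) {
        struct job *job = arg;
        /* ... do the long-running / blocking work here ... */
        ErlNifEnv *env = enif_alloc_env();
        /* From a non-scheduler thread the first argument to enif_send is NULL. */
        enif_send(NULL, &job->caller, env, enif_make_atom(env, "done"));
        enif_free_env(env);
        enif_free(job);
        return NULL;
    }

    static ERL_NIF_TERM start_job(ErlNifEnv *env, int argc, const ERL_NIF_TERM argv[]) {
        (void)argc; (void)argv;
        struct job *job = enif_alloc(sizeof(*job));
        enif_self(env, &job->caller);          /* reply to the calling process */
        ErlNifTid tid;
        enif_thread_create("worker", &tid, worker, job, NULL);
        return enif_make_atom(env, "ok");
    }

    static ErlNifFunc funcs[] = { {"start_job", 0, start_job} };
    ERL_NIF_INIT(my_nif, funcs, NULL, NULL, NULL, NULL)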
A lot of the issues surrounding M:N threads are a result of poor operating system support. The central problem is that a syscall blocks a kernel thread even when there are more user level threads to run.
If operating systems supported something like scheduler activations (a 20-year-old technique), then this becomes less of a problem. The gist of it is that every time the kernel thread would block or gets scheduled, instead of returning to where it was executing, the kernel upcalls into a user-level scheduler, which can then choose which user threads to run. It's a shame that this technique isn't more common.
Because if I use M:N threading then logically one of my user threads blocked and a different one should run during my timeslice. The kernel however is unaware of how I use the kernel thread and will block, believing that I cannot proceed.
1:1 threading lacks this problem but operations such as creating threads or deleting them require syscalls and are therefore relatively expensive.
Here's my experience developing a JVM lightweight thread library[1]:
1. For the scheduler, we use the JDK's superb and battle-tested ForkJoinPool (developed by Doug Lea), which is an excellent work-stealing scheduler, and continues to improve with every release.
2. For synchronization, we've adapted java.util.concurrent's constructs (we use the same interfaces so no change to user code) to respect fibers, but users are expected to mostly use Go-like channels or Erlang like actors that are both included.
3. As for disk IO, Java does provide an asynchronous interface on all platforms, so integrating that wasn't a problem.
4. Integrating with existing libraries is easy if they provide an asynchronous (callback-based) API, which is easily turned into fiber-blocking calls. If not, then ForkJoinPool handles infrequent blocking of OS threads gracefully.
All in all, the experience has been very pleasant: callbacks are gone and performance/scalability is great. Things will get even better if Linux adopts Google's proposal for user-scheduled OS threads, so that all code will be completely oblivious to whether the threads are scheduled by the kernel or in user space.
Regarding performance, Linux does have a very good scheduler (unlike, say, OS X), but while there's little latency involved if the kernel directly wakes up a blocked thread (say, after a sleep or as a response to an IO interrupt), it still adds very significant latency when one thread wakes up another. This is very common in code that uses message passing (CSP/actors), and we've been able to reduce scheduling overhead by at least an order of magnitude over OS threads.
I would summarize this as follows: if your code only blocks on IO, or blocks infrequently on synchronization, then OS threads are quite good; but if you structure your program with CSP/actors, then user-space threads are the only sensible way to go for the time being.
I saw your Quasar library before, and while I haven't had the chance to use it, I'm excited to try it the next time I need to write event-driven code on the JVM.
As far as disk IO is concerned, the Java APIs are only as good as the underlying OS interfaces, and those are not that great.
I think you're spot on with the assertion that mostly network-bound workloads can benefit from N:M scheduling. Many of the apps that we build nowadays are exactly that.
That is really interesting. If it's not too much trouble to write out, could you explain what causes the latency difference between kernel wake-up and other thread wake-up?
Paul Turner explained this really well at this year's Linux Plumbers Conference. The whole talk is fantastic, but the explanation of what pron is describing in particular (and how it could be improved) starts around 8:39: https://www.youtube.com/watch?v=KXuZi9aeGTw#t=519
I honestly don't know :) I was simply reporting my results experimenting with this (I'll try to write a blog post about it some time in the near future), so I'll defer to those with a deeper knowledge of the Linux kernel.
I have read that the Linux scheduler exploits some heuristics if it can guess how soon a blocked thread will need to be woken up, so this might have something to do with that.
I'm not too familiar with the details but I think they mention that a thread can specify a callback that will be called if it blocks on IO, and the callback can specify another thread to switch to.
I'm working on my own userspace concurrency framework (early draft intro: http://team.s4y.us/) and having fun with this stuff.
> A lot of 3rd party code doesn't work great with userspace threads.
Agreed, especially since third-party code often calls standard library functions (`connect()`, `read()`, etc.) instead of the magic ones that are aware of your concurrency model. At OkCupid, we have our own database driver and RPC implementations that know how to use events for networking, and it sucks to have them in our codebase.
> The final nail in the coffin for me was disk IO.
Why not make it "somebody else's problem"? My thing uses libuv, which provides a common API and uses a thread pool or new async APIs under the hood. At OkC most filesystem access transparently goes through a separate process — it's just network access from the app's point of view.
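For the curious, the libuv side of that looks roughly like this (a sketch; the path is a placeholder, and the uv_fs_* calls run on libuv's worker thread pool with the callbacks firing back on the loop thread):

    #include <uv.h>
    #include <fcntl.h>
    #include <stdio.h>

    static uv_fs_t open_req, read_req;
    static char buffer[4096];
    static uv_buf_t iov;

    static void on_read(uv_fs_t *req) {
        if (req->result >= 0)
            printf("read %zd bytes without blocking the loop\n", (ssize_t)req->result);
        uv_fs_req_cleanup(req);
    }

    static void on_open(uv_fs_t *req) {
        if (req->result >= 0) {
            iov = uv_buf_init(buffer, sizeof(buffer));
            /* The blocking read() happens on libuv's thread pool; the
               callback is delivered back on the event loop thread. */
            uv_fs_read(req->loop, &read_req, (uv_file)req->result, &iov, 1, 0, on_read);
        }
        uv_fs_req_cleanup(req);
    }

    int main(void) {
        uv_loop_t *loop = uv_default_loop();
        uv_fs_open(loop, &open_req, "data.bin", O_RDONLY, 0, on_open);  /* placeholder path */
        return uv_run(loop, UV_RUN_DEFAULT);
    }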
I guess what I'm trying to say is that if you can solve the problem once, you can make a nice abstraction on top of it and never think about it again (unlike dealing w/ third-party code and scheduling).
Either way, I'd love your input on what I'm doing so far, you have way more experience.
Looking at your project, did you ever take a look at the Mordor C++ library (https://github.com/mozy/mordor)? It looks like it predates your project by quite a bit.
It looks like you have a fun project on your hands and a lot of learning ahead of you.
My word of caution about having different schedulers for network and disk IO is that the hand-off adds latency. Depending on what you're doing this may or may not be an issue in your application. If you're sensitive to it, a large number of small IO requests will expose it quickly, and then you're stuck trying to optimize that latency away (trading one problem for another).
If your edge application does filesystem calls via some network abstraction you might be in a better place. If you're doing a network filesystem I recommend you check out Ceph. It provides an in-kernel client (shameless plug: I contributed code there), which is nice because you can rely on the OS to cache things for you. It also provides a client library, and the underlying RADOS object store library offers non-blocking access. Check it out.
Don't count Go out yet. Its green-thread approach works really well, and it's moving to contiguous stacks next release. I've never run into any problems with this setup as long as I'm smart about control flow.
I can imagine that trying to replicate this myself in C would be an absolute nightmare, so I trust these more experienced folks, and they've not let me down yet...
It has certainly had kernel support for a long time, but there's no evidence that it is significantly better or worse. If Oracle uses it then it is probably good. Almost no one uses the Linux AIO support.
Oracle uses AIO on most operating systems. It implements its own block and buffer cache, so the kernel page cache just gets in the way. Since it does its own block caching, the block size and alignment limitations are not really a problem for it. KVM also uses AIO for similar reasons (the guest OS has its own page cache, so we want to avoid double caching).
AIO on Linux was basically designed for Oracle. Only a few classes of apps can really benefit from it. It also doesn't really guarantee non-blocking behavior, as many filesystem metadata operations still block (read the long threads on the Linux kernel mailing lists).
Unless you're a special class of app that can live with the AIO limitations, you have to stick to blocking IO. Otherwise you're stuck rewriting the page cache, and it's unlikely that you'll do a better job than the built-in OS one.
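For reference, the io_submit path looks roughly like this (a sketch built against libaio, with a placeholder file name; note that you effectively need O_DIRECT and aligned buffers, and that the submit/getevents pair is already two syscalls for a single read):

    /* Build with: gcc aio_sketch.c -laio */
    #define _GNU_SOURCE
    #include <libaio.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(void) {
        /* O_DIRECT is effectively required for the submission to be async,
           which means aligned buffers and no page cache. */
        int fd = open("data.bin", O_RDONLY | O_DIRECT);
        if (fd < 0) { perror("open"); return 1; }

        io_context_t ctx = 0;
        if (io_setup(32, &ctx) < 0) { perror("io_setup"); return 1; }

        void *buf;
        posix_memalign(&buf, 4096, 4096);   /* aligned buffer for O_DIRECT */

        struct iocb cb;
        struct iocb *cbs[1] = { &cb };
        io_prep_pread(&cb, fd, buf, 4096, 0);

        io_submit(ctx, 1, cbs);              /* syscall #1: submit the read */

        struct io_event ev;
        io_getevents(ctx, 1, 1, &ev, NULL);  /* syscall #2: wait for completion */
        printf("aio read returned %ld\n", (long)ev.res);

        io_destroy(ctx);
        free(buf);
        close(fd);
        return 0;
    }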
I think Solaris has similar limitations to Linux. Windows has a better API with its async overlapped file IO, and even there various NTFS metadata operations can block.
I went through the callback-hell problem myself and came to the conclusion that swapcontext is the only way to go. I am starting a large personal project that will be entirely based on the swapcontext model (but using boost::coroutine), so I would like to learn more from your experience before I invest in possibly the wrong model. Since you haven't explained your solution in detail, I have a question:
Did you use multiple N:M schedulers (each owning an isolated group of os-threads), or does your entire process use a single N:M threading library that owns all of the process's os-threads?
Here is what I am doing:
My project does a lot of disk and network IO, so being able to keep the CPU, disk, and network busy is my design goal. Having a single N:M scheduler that handles all threads makes it very difficult to understand/analyze/manage the system's performance, so I made the choice of having multiple N:M schedulers (communicating through message passing), each dealing with a separate resource of the system. For example, the os-threads of the N:M cpu-scheduler only do compute work, and when disk (or network) IO is necessary, the greenlet/task/whatever is queued into the disk-io-scheduler (or network-io-scheduler). When the os-threads of the disk-io-scheduler (or network-io-scheduler) finish the IO, they queue the greenlet/task back onto the cpu-scheduler.
This IMO keeps managing/analyzing/tuning the system's performance and complexity a little easier. For example:
> All of a sudden you're forced to write a userspace scheduler, and guess what: it's really hard to write a scheduler that does a better job than Linux's scheduler, which has man-years of effort behind it. Now you want your scheduler to map N green threads onto M physical threads, so you have to worry about synchronization. Synchronization brings performance problems, and soon you're down a new lockless rabbit hole. Building a correct highly concurrent scheduler is no easy task.
Agreed, writing a good scheduler that deals with mixed workloads is very challenging. In my model, I am hoping that a poor scheduler wouldn't be an issue because the os-threads of the cpu-scheduler only compete for cpu-bound tasks; similarly, the os-threads of the io-scheduler only compete for io-bound tasks. The real scheduling between IO and CPU operations is still left to the kernel (because the kernel decides which scheduler's os-thread to run). From the kernel's point of view there are threads that only do IO work and threads that only do CPU work. I am not yet sure whether kernel schedulers are optimized for such workloads or not.
> A lot of 3rd party code doesn't work great with userspace threads. You end up with very subtle bugs in your code that are hard to track down. In many cases this is due to assumptions about TLS (but that isn't the only reason). To make it work you can no longer do work stealing between your native threads, and then you end up with load-imbalance and starvation problems.
My design solution to this is to segregate 3rd party code onto a fixed set of threads that never participate in the N:M threading library. Would that solve the above issues?
> The final nail in the coffin for me was disk IO. The fact is that non-network IO is generally blocking, and no OS has great non-blocking disk IO interfaces (Windows is the best of the bunch, but it's still not great). First, it's pretty low level and difficult to use: you have to do IO on block boundaries. Second, it bypasses the page cache (at least on Linux), which in most cases kills performance right there. And in many cases this non-blocking interface will end up blocking anyway (even on Windows) if the filesystem needs to do certain things under the covers (like extend the file or load metadata). Also, the way these operations are implemented requires a lot of syscalls, and thus context switches, which further negates any perceived performance benefit. The bottom line is that regular blocking IO (better yet, mmapped IO) outperforms what most people are capable of achieving with the non-blocking disk IO facilities.
Yes, a lot of disk/filesystem operations do not have non-blocking equivalents. This is one of the main reasons for having a separate disk-io-scheduler (with its own thread pool) in my design. In my model any blocking-only operation is made asynchronous by queuing the operation onto a dedicated IO thread pool and having it inform me when the operation completes. This gives me flexibility in tuning the IO parallelism to my specific needs. For example, an SSD can handle more concurrent operations than an HDD, so I can create an ssd-io-scheduler with more threads than an hdd-io-scheduler.
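That hand-off (queue the blocking call onto a dedicated IO pool, get notified when it completes) boils down to something like the sketch below. All the names are made up, and the on_done callback is where the real design would put the greenlet back on the cpu-scheduler:

    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    /* One queued IO job: a blocking function plus a completion callback. */
    struct io_job {
        void (*blocking_fn)(void *arg);
        void (*on_done)(void *arg);
        void *arg;
        struct io_job *next;
    };

    static struct io_job *queue_head, *queue_tail;
    static pthread_mutex_t queue_lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t queue_cond = PTHREAD_COND_INITIALIZER;

    /* Called from the cpu-scheduler side. */
    static void io_scheduler_submit(struct io_job *job) {
        job->next = NULL;
        pthread_mutex_lock(&queue_lock);
        if (queue_tail) queue_tail->next = job; else queue_head = job;
        queue_tail = job;
        pthread_cond_signal(&queue_cond);
        pthread_mutex_unlock(&queue_lock);
    }

    /* Each thread of the disk-io-scheduler runs this loop; the pool size is
       where you'd tune for SSD vs HDD parallelism. */
    static void *io_worker(void *unused) {
        (void)unused;
        for (;;) {
            pthread_mutex_lock(&queue_lock);
            while (!queue_head)
                pthread_cond_wait(&queue_cond, &queue_lock);
            struct io_job *job = queue_head;
            queue_head = job->next;
            if (!queue_head) queue_tail = NULL;
            pthread_mutex_unlock(&queue_lock);

            job->blocking_fn(job->arg);   /* the actual blocking read/write/fsync */
            job->on_done(job->arg);       /* re-queue the task on the cpu-scheduler */
            free(job);
        }
        return NULL;
    }

    static void fake_read(void *arg) { (void)arg; sleep(1); }    /* stands in for a blocking pread() */
    static void notify(void *arg)    { (void)arg; printf("io done\n"); }

    int main(void) {
        pthread_t tids[4];                       /* say, a 4-thread "hdd pool" */
        for (int i = 0; i < 4; i++)
            pthread_create(&tids[i], NULL, io_worker, NULL);

        struct io_job *job = malloc(sizeof(*job));
        job->blocking_fn = fake_read;
        job->on_done = notify;
        job->arg = NULL;
        io_scheduler_submit(job);

        sleep(2);      /* crude: let the demo job finish before exiting */
        return 0;
    }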
It would be great if you could share your thoughts on my analysis. Thanks.
I haven't used boost::coroutine, but I have experience working with Mordor (https://github.com/mozy/mordor), with swapcontext directly, and with implementing my own swapcontext. Why implement your own swapcontext? Well, the default one on Linux (it actually lives in glibc) can be a bottleneck if you're going to be calling it frequently: it makes the sigprocmask() syscall, and if you can avoid that it's a substantial speedup. You can read more about this here: http://rethinkdb.com/blog/making-coroutines-fast/
Another thing that will give you a big performance boost is stack pooling (i.e. not calling mmap for every makecontext). More on that in the RethinkDB article I mentioned. If your scheduler targets multiple OS threads you should also be careful here to avoid synchronization slowdowns: use either some kind of lockless list or per-thread pools.
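A per-thread stack pool can be as simple as a free list in TLS, something like this sketch (sizes arbitrary; it assumes stacks are handed out and returned by the scheduler on its own OS thread):

    #include <sys/mman.h>
    #include <stddef.h>

    #define STACK_SIZE (64 * 1024)

    struct stack_node { struct stack_node *next; };

    /* One free list per OS worker thread, so handing stacks out needs no locking. */
    static __thread struct stack_node *free_stacks;

    void *stack_alloc(void) {
        if (free_stacks) {
            struct stack_node *s = free_stacks;
            free_stacks = s->next;
            return s;                      /* reuse: no mmap, no TLB churn */
        }
        void *p = mmap(NULL, STACK_SIZE, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        return p == MAP_FAILED ? NULL : p;
    }

    void stack_free(void *stack) {
        struct stack_node *s = stack;      /* push it back on this thread's list */
        s->next = free_stacks;
        free_stacks = s;
    }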
Now let's talk about 3rd party code. If you have a 3rd party library that internally uses TLS and you swap its context onto a different thread, it's bound to misbehave, and when it does it's usually subtle and hard to debug. So if you're using 3rd party libraries you either have to audit them (and make sure you didn't miss anything), disable context migration (and risk unbalanced workloads), or have a separate scheduler that only runs those tasks. Pick your poison.
It doesn't even have to be 3rd party code that misbehaves when green threads are migrated. I pulled my hair out for a couple of weeks trying to debug an issue with a call to accept(). It was returning -1 but errno was set to 0. What gives? Well, it turns out that on Linux with glibc, errno is a macro that calls a function to get the address of errno for the current thread. And that function is marked with gcc's __attribute__((pure)). What that means is that once the address of errno is calculated in the body of a function, the compiler is free to assume it will always be that address (it's a pure function without side effects). Here's the sequence:
1. accept() == -1
2. errno == EAGAIN
3. errno = 0
4. scheduler_yield()
5. accept() == -1
6. errno == 0 (although it should be something else)
This will happen on Linux with glibc if your scheduler_yield() call returns on a different OS thread than the one it was called on. So even your own innocent code that doesn't use TLS can break in interesting ways.
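In code the hazard looks something like this sketch (scheduler_yield() is a made-up name for whatever your yield primitive is); the practical workaround is to snapshot errno into a local right after each call and never touch the errno macro across a yield:

    #include <errno.h>
    #include <sys/socket.h>

    extern void scheduler_yield(void);   /* may resume us on a different OS thread */

    /* Nothing here uses TLS explicitly, yet it can break: the errno macro hides
       a per-thread address lookup that the compiler may cache across the yield. */
    void accept_one(int listen_fd) {
        int fd = accept(listen_fd, NULL, NULL);      /* 1. returns -1 */
        if (fd < 0 && errno == EAGAIN) {             /* 2. errno == EAGAIN */
            errno = 0;                               /* 3. reset */
            scheduler_yield();                       /* 4. resume on another OS thread */
            fd = accept(listen_fd, NULL, NULL);      /* 5. returns -1 again */
            if (fd < 0 && errno == 0) {              /* 6. reads the OLD thread's errno */
                /* Workaround: copy errno into a local immediately after each
                   call and only ever inspect the copy after a yield. */
            }
        }
    }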
If you have very small green threads and a naive stealing scheduler with a mutex, you can be sure you'll be spending significant time on synchronization. You can get fancier with non-blocking queues and atomic instructions to overcome this.
I did have multiple schedulers for both CPU-bound and IO-bound tasks. I would say that if you're doing disk IO and you're just forwarding the data (versus having to process it), you're better off with non-blocking sendfile() or non-blocking vmsplice() (plus mmap) in your event loop. If you're doing lots of disk IO on an SSD array that can push 2GB/s, you're going to need lots of IO threads, and the latency of the message passing between the two schedulers is going to add up. Again, this may or may not be a problem in your application.
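For the forwarding case, the sendfile() loop I mean is roughly this sketch (the socket is assumed O_NONBLOCK; note the file side can still block on a page-cache miss, which is the disk IO caveat again):

    #include <sys/sendfile.h>
    #include <sys/types.h>
    #include <errno.h>
    #include <stdbool.h>

    /* Forward up to `count` bytes from a file to a non-blocking socket.
       Returns true when done (or on a hard error), false if the socket is
       full and the event loop should wait for EPOLLOUT and call us again. */
    bool forward_file(int sock_fd, int file_fd, off_t *offset, size_t count) {
        while (count > 0) {
            ssize_t n = sendfile(sock_fd, file_fd, offset, count);
            if (n > 0) {
                count -= (size_t)n;          /* offset is advanced by the kernel */
            } else if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
                return false;                /* socket buffer full: yield to the event loop */
            } else {
                return true;                 /* EOF or hard error: stop here */
            }
        }
        return true;
    }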
Those are some of my own experiences, they may or may not apply to you but I hope it helps.