
The future of M:N threading - joshbaptiste
https://mail.mozilla.org/pipermail/rust-dev/2013-November/006550.html
======
kibwen
Background: The author of this proposal, Daniel Micay (a.k.a. strcat, a.k.a.
thestinger), is the most prolific volunteer contributor to the Rust compiler
[1], and is fairly obsessed with eliminating design hazards that could lead to
Rust not being competitive with C++ in terms of performance. For example, he
was the driving force behind Rust's recent widespread switch from internal
iterators to external iterators, and wrote most of the external iterator
libraries himself.

[1]
[https://github.com/mozilla/rust/graphs/contributors](https://github.com/mozilla/rust/graphs/contributors)

EDIT: To whoever removed the [Rust-dev] annotation from the title of this
submission, please put it back. It adds valuable context, and it's _in the
title of the page itself_ , so you're violating HN's naming policy by removing
it.

~~~
saosebastiao
But the M:N system is already written, and it seems as though it works
extremely well. In fact, it is one of the things I like most about the
language, and one of the reasons why I prefer it over C++ (as well as D and
Nimrod for that matter).

This decision probably puts it closer to C++, but I can't see how that is a
good thing.

~~~
pcwalton
Note that this is not a decision; it's not by a core team member.

------
arielweisberg
100k+ concurrent threads is a realistic starting point for M:N threading. As
long as native threads require a separate preallocated stack that never
shrinks, there will always be a mismatch compared to the capabilities you get
from thread-per-core, event-based code.

Making M:N threads a first-class POSIX feature is a good starting point, as
long as it also allows you to express the intended locality of groups of
threads. Locality isn't optional anymore. The slowdowns he reports also map to
the scale-up issues you see when you don't keep frequently accessed mutable
data local to an OS thread, shared-nothing style.
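
With 1:1 threads on Linux today, the closest you can get to expressing that
locality is explicit core affinity. A minimal sketch of the shared-nothing
pattern, assuming the GNU-specific pthread_setaffinity_np(3) (compile with
cc -pthread; the event loop itself is left as a comment):

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stdio.h>

    /* Pin the calling thread to one core so its frequently accessed
     * mutable state stays in that core's cache. */
    static int pin_to_core(int core) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core, &set);
        return pthread_setaffinity_np(pthread_self(), sizeof set, &set);
    }

    int main(void) {
        if (pin_to_core(0) != 0)
            fprintf(stderr, "failed to set affinity\n");
        /* ... run this thread's shared-nothing event loop here ... */
        return 0;
    }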

M:N threads are just an abstraction around a stack and registers that frees
you from having to bind up a bunch of related state into a series of objects
and maps. It's awesome, I want it, but nothing is quite there yet. At least
not if you want something like Java or C/C++ performance.

Work stealing across cores is really nice, but I am not sure it is always the
behavior you want. A programmer can tell when migrating a task across cores is
going to hurt more than waiting a little longer to get to it.

I also think any discussion of M:N threads or multi-threading is incomplete if
it doesn't include garbage collection as a means of passing data between
threads. Right now we have languages that are garbage collected and treat
native memory as a second-class citizen, and non-GC languages that make
multi-threading a headache.

GC is great for multi-threading, but not so great for hundreds of gigabytes of
variable-lifetime objects. Even if the GC can handle it, it still has an
unacceptable 2x space overhead.

I don't want much :-)

The youtube video doesn't load for me :-(

~~~
rdtsc
The size of OS threads with their stacks is what I am worried about as well.
It is interesting that the author notes:

> It drops further when asked to allocate 8MiB stacks like C is doing

Memory, just like CPU, is constrained and costs money, especially if
provisioned in the cloud. If I have 2GB of memory available, I can spawn about
250 threads with 8MiB stacks and keep them around without swapping.

EDIT: actually, scratch the above; _wmd pointed out that stack memory is just
virtual memory: the full size is reserved, but it is only brought into
physical memory as needed.

This changes how code is written. It is not just an internal detail! Now you
cannot think "one connection = one task"; instead it is about locks, queues,
and mapping control blocks to connections. That is the biggest loss.

> GC is great for multi-threading, but not so great for hundreds of gigabytes
> of variable lifetime objects. Even if the GC can handle it, GC still has
> unacceptable 2x space overhead

Good point on GC. The main issue I see in a large concurrent system (and
presumably Rust wants to be a good tool for concurrent systems) is
non-blocking GC. A slow GC might be acceptable, but if one errant task
allocates memory in a certain pattern and the GC has to lock all the tasks out
to run, that could be a serious issue.

Responsiveness (liveness) of a concurrent system is something that is often
forgotten; everyone wants to talk about how fast one can transpose a matrix
instead.

A concurrent GC is possible; Azul built one for Java, and it is very
interesting how it works. I enjoy reading their whitepapers:

[http://www.azulsystems.com/resources/whitepapers](http://www.azulsystems.com/resources/whitepapers)

~~~
_wmd
Just highlighting a small but common misconception (one that, on one project,
caused a PM to mail everyone at 2am about a HUGE nonexistent memory leak)...

Thread stacks are virtual memory like everything else, so in reality an "8MB"
stack means "8MB maximum size". Sure, stacks won't shrink once pages are
faulted in to back them, but in the average application, especially on 64-bit,
this should never be a problem.

If, for the lifetime of your thread, its stack only ever grew to 32KB, then
the OS will only have allocated 32KB of memory to back it.
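
A quick way to convince yourself of this on Linux (a sketch; compile with
cc -pthread, the thread count is arbitrary):

    #include <pthread.h>
    #include <stdio.h>
    #include <unistd.h>

    static void *worker(void *arg) {
        (void)arg;
        pause();   /* park forever; the stack never grows past a few KB */
        return NULL;
    }

    int main(void) {
        pthread_attr_t attr;
        pthread_attr_init(&attr);
        pthread_attr_setstacksize(&attr, 8 * 1024 * 1024); /* "8MB" stacks */

        for (int i = 0; i < 1000; i++) {
            pthread_t t;
            if (pthread_create(&t, &attr, worker, NULL) != 0) {
                perror("pthread_create");
                break;
            }
        }
        /* ~8GB of virtual address space is now reserved for stacks, but
         * resident memory stays tiny: compare VIRT vs RES in top(1). */
        printf("pid %d: compare VIRT vs RES in top\n", (int)getpid());
        pause();
        return 0;
    }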

~~~
arielweisberg
You won't hit the worst case, but the common case is that threads burst stack
usage during task execution. If you have 100k threads, your effective stack
usage will be the high-water mark.

It varies by workload, but for mine that is between 128 and 512 kilobytes per
thread.

If you churn through all your threads, no problem. If you are hosting 100k
persistent connections, I suspect it would detract from available memory.
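
Back-of-the-envelope, using those numbers:

    100,000 threads x 128KB..512KB high-water stack usage
      = roughly 12GB to 49GB of resident memory just for stacks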

~~~
yongjik
That's interesting. So if a thread briefly uses 1MB of stack and then
relinquishes most of it, the kernel still keeps that unused 1MB mapped in the
page table?

I expected the Linux kernel to be more intelligent, but maybe there's a reason
it behaves like that?

~~~
_wmd
Under pressure it will eventually page the memory back out, but generally you
never want to get your machine into a state where paging happens. Aside from
that, there is no mechanism that lets the kernel know the memory is no longer
used unless you explicitly tell it, e.g. via madvise(2).

Though I can't think of a sane way to call madvise(2) on a thread stack.
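
For reference, this is what madvise(2) looks like on an ordinary anonymous
mapping (a sketch; doing the same safely for a live thread stack is exactly
the part with no sane answer):

    #include <sys/mman.h>
    #include <string.h>
    #include <stdio.h>

    int main(void) {
        size_t len = 1 << 20;  /* 1MB */
        char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (buf == MAP_FAILED) { perror("mmap"); return 1; }

        memset(buf, 1, len);   /* fault every page in: RSS grows by 1MB */

        /* Tell the kernel the contents are disposable; the pages can be
         * reclaimed now and read back as zeroes on the next access. */
        if (madvise(buf, len, MADV_DONTNEED) != 0) perror("madvise");

        printf("first byte is now %d\n", buf[0]);  /* prints 0 */
        return 0;
    }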

------
brassybadger
Interesting comparison: Erlang implements an M:N scheduling model, where it
starts an OS thread for each CPU core (by default) and preemptively schedules
its own lightweight processes on top of these threads. Unlike OS threads,
these processes are very lightweight: when created, they only need a few
hundred bytes. Erlangers have been using the one-process-per-request model for
a very long time, and Erlang applications achieve massive concurrency this
way.

To me it seems possible to do M:N right, but you need more abstraction and a
different design. M:N seems to work well in Erlang-like cases (very
lightweight processes on top of kernel threads).

~~~
stormbrew
This is what Rust was doing, more or less, before they removed segmented
stacks.

------
notacoward
M:N threading has been a great idea in theory for at least twenty years, but
during that whole time actually _implementing_ it has always ended in tears.
Unexpected blocking on I/O or page faults always trips it up. That's why it
works great for HPC but not so much elsewhere. There's far more "bang for the
buck" in making 1:1 threading work better.

~~~
burntsushi
I'm under the impression that M:N threading has been implemented in Haskell,
Erlang and Go. I think it's been reasonably successful in all three.

~~~
pcwalton
From what I've heard, Go is considering moving to this 1:1 model as well, at
least on Linux. This is the reason for the Linux patches mentioned in the
thread.

~~~
fmstephe
Can you post a link to some discussion around this? A Google search didn't
turn up anything obvious. Thanks.

~~~
dbaupp
It's hinted at in the video linked in the email[1]:

> _Over the past year we've been looking at how a uniform, performant,
> synchronous programming model can be implemented across all of Google's core
> languages (C++, Java, Go, Python), while maintaining compatibility with
> existing code._

[https://www.youtube.com/watch?v=KXuZi9aeGTw](https://www.youtube.com/watch?v=KXuZi9aeGTw)

------
skybrian
If you're interested in this area you might like this talk from the Linux
plumbers' conference:

"Over the past year we've been looking at how a uniform, performant,
synchronous programming model can be implemented across all of Google's core
languages (C++, Java, Go, Python), while maintaining compatibility with
existing code."

[https://www.youtube.com/watch?v=KXuZi9aeGTw](https://www.youtube.com/watch?v=KXuZi9aeGTw)

------
saosebastiao
I'm not sure I understand. Are you saying that a 1:1 threading model without
preemption will be more performant than an M:N threading model without
preemption?

How would this affect the performance of applications that spin up thousands
of tasks?

~~~
limsup
If you restrict yourself to 64-bit systems, you can essentially go back to
using OS threads with blocking I/O, because the address space is so large. You
can fit thousands of 2MB call stacks into the process's address space and rely
on the OS and MMU to manage the memory.

The problem that created the whole non-blocking, async domain was the memory
needed for the threads' call stacks: on 32-bit systems, you don't have enough
address space to give each thread a 2MB call stack when you're handling tens
of thousands of connections.
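
To put rough numbers on that (illustrative figures; x86-64 currently gives a
process on the order of 128TB of user address space):

    50,000 threads x 2MB stacks = ~100GB of reserved address space
      -> a rounding error within the ~128TB of 64-bit user space
      -> impossible within the ~3GB user space of a 32-bit process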

~~~
rdtsc
2MB call stacks?

Mine are 10MB:

    $ ulimit -s
    10240

In the worst case I can only spawn about 200 such threads and keep them active
before the system starts swapping. Yes, the address space is large, but any
time those threads have to run, their stacks have to be brought into the
working set.

EDIT: never mind; _wmd pointed out that stack memory is still virtual, and it
is only brought into physical memory as needed.

~~~
MrBuddyCasino
Now I'm confused. This comment says that the stack memory is not allocated by
the OS unless actually needed:
[https://news.ycombinator.com/item?id=6728030](https://news.ycombinator.com/item?id=6728030)

Is this not true?

~~~
rdtsc
That is right: the full 8MB or 10MB (or whatever the maximum is) is not paged
in up front.

I am stupid. I'll correct the initial comment.

Thanks

------
MichaelGG
The note about Windows User-Mode Scheduling is really neat - it makes mmap
much more usable when you can switch out on a page fault. Is there such a
thing available for Linux?

~~~
onedognight
There is no user-mode scheduling in the Linux kernel to my knowledge, but the
video[1] mentioned in the post describes Linux patches that add support for
it.

[1]
[https://www.youtube.com/watch?v=KXuZi9aeGTw](https://www.youtube.com/watch?v=KXuZi9aeGTw)

~~~
MichaelGG
Shame. mmap is so handy, but blocking on page faults makes it annoying to work
with.

------
bch
Some jumping-off points for the interested on historical implementations from
other M:N projects (I remember this specifically from NetBSD's work in this
area):

[http://en.wikipedia.org/wiki/Scheduler_activations](http://en.wikipedia.org/wiki/Scheduler_activations)

[http://web.mit.edu/nathanw/www/usenix/freenix-sa/freenix-sa.html](http://web.mit.edu/nathanw/www/usenix/freenix-sa/freenix-sa.html)

------
steven2012
If I remember correctly, this was the threading model used by Solaris. I'm
sure there are many applications that make good use of M:N threading, but for
the server I worked on (~3000 concurrent connections and 50+ threads), it
actually hindered the multi-tasking performance of the server. I remember
investigating, figuring out this was the reason, and seeing performance
increase after switching back to 1:1.

------
jmount
I can't find the reference, but I remember a grand old rant arguing that
threads only progressed as far as they did because Solaris at the time had a
very costly process fork compared to other Unices and Windows.

~~~
dbaupp
What about shared memory?

~~~
jmount
True: forks have very different semantics than threads. I just thought the
threading architecture would be less complicated if the divisions of
responsibility were different.

------
halayli
Lthread
([http://github.com/halayli/lthread](http://github.com/halayli/lthread)) is
basically an M:N threading model.

You can run an lthread scheduler per core.

~~~
dbaupp
How is this relevant? The post is about the _problems_ with M:N threading
models, and why they aren't a good fit for Rust.

------
eeperson
If you have 1:1 threads in a language, can't you implement M:N threading as a
library? Looking at projects like Akka[1] makes me think this is the case. Is
there a reason this can't be done that I am missing? Why would you want M:N
threads baked into the runtime?

[1] [http://akka.io/](http://akka.io/)

~~~
arielweisberg
Those are usually leaky abstractions that don't allow you to do blocking
operations with the idiomatic APIs and libraries of the language. You have to
use the wrappers provided by the threading library, which are typically
incomplete and prevent you from using standard tools.

~~~
eeperson
Aren't blocking operations also a problem if M:N threads are baked into the
runtime? If all of the idiomatic APIs are using non-blocking IO, would there
still be a problem?

~~~
arielweisberg
When M:N threads are truly baked in (either at the runtime level or the OS
level), blocking an M thread doesn't block the underlying N thread.

An example would be Erlang, a runtime built from the ground up to support
lightweight threads, where reading or writing to a socket or using the
built-in APIs won't block the underlying OS thread. You can still sabotage the
runtime by writing your own native library and doing blocking operations from
there, but the pitfalls are more obvious in that scenario.

If the underlying concurrency and network/disk IO primitives are truly
non-blocking for lightweight threads, then everything built on top of them
will also be non-blocking. Non-blocking for the underlying OS thread, that is.

That is where coroutines and coroutine-like libraries are a leaky abstraction.
If you step into a blocking API in a third-party or standard library, you will
block the underlying OS thread. This means you can't integrate easily with
existing code or tools, and you are always vulnerable to blocking
accidentally.

~~~
eeperson
How do you run blocking code in a way that doesn't block an OS thread? Is
there some abstraction that allows this that I'm missing?

~~~
Qerub
You let the blocking call do a non-blocking call internally and then instruct
an IO-aware user-mode scheduler to reschedule the user-mode thread when the
non-blocking call has a result. This is what Haskell/GHC, Erlang, and Ruby 1.8
do. There are some great papers on GHC's IO manager. In general, it is always
possible to build a blocking interface on top of a non-blocking one.
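
A minimal single-OS-thread illustration of that idea in C, using ucontext(3)
for the user-mode threads and poll(2) as the IO-aware scheduler (a toy sketch
for Linux, not how GHC or Erlang are actually structured; green_read and the
scheduler loop are invented for illustration):

    #include <ucontext.h>
    #include <poll.h>
    #include <unistd.h>
    #include <fcntl.h>
    #include <errno.h>
    #include <stdio.h>

    static ucontext_t sched_ctx, task_ctx;
    static int wait_fd = -1;   /* fd the parked green thread waits on */
    static int pfd[2];

    /* "Blocking" read as the green thread sees it: internally it is a
     * non-blocking read plus a yield to the scheduler on EAGAIN. */
    static ssize_t green_read(int fd, void *buf, size_t len) {
        for (;;) {
            ssize_t n = read(fd, buf, len);      /* fd is O_NONBLOCK */
            if (n >= 0 || errno != EAGAIN) return n;
            wait_fd = fd;                        /* park: record interest */
            swapcontext(&task_ctx, &sched_ctx);  /* let others run */
        }
    }

    static void task(void) {
        char buf[16];
        ssize_t n = green_read(pfd[0], buf, sizeof buf - 1);
        buf[n > 0 ? n : 0] = '\0';
        printf("task read: %s\n", buf);
    }

    int main(void) {
        static char stack[64 * 1024];
        if (pipe(pfd) != 0) return 1;
        fcntl(pfd[0], F_SETFL, O_NONBLOCK);

        getcontext(&task_ctx);
        task_ctx.uc_stack.ss_sp = stack;
        task_ctx.uc_stack.ss_size = sizeof stack;
        task_ctx.uc_link = &sched_ctx;           /* return here when done */
        makecontext(&task_ctx, task, 0);

        swapcontext(&sched_ctx, &task_ctx);      /* run until it parks */
        while (wait_fd >= 0) {                   /* the "scheduler" */
            struct pollfd p = { .fd = wait_fd, .events = POLLIN };
            if (write(pfd[1], "hello", 5) < 0)   /* simulate data arriving */
                perror("write");
            poll(&p, 1, -1);                     /* IO-aware wakeup */
            wait_fd = -1;
            swapcontext(&sched_ctx, &task_ctx);  /* resume parked thread */
        }
        return 0;
    }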

~~~
eeperson
I guess I wasn't clear in my previous question. I know you can build a
blocking interface out of a non-blocking one. My question was how you go the
other way: how do you create a non-blocking interface out of a blocking one?
The parent post seemed to imply that Erlang was able to do this. As far as I
can tell, this can't actually be done. If you have a blocking call, you still
have to block an OS thread somewhere.

If this can't be done, then that brings me back to my previous point. It seems
like the important thing is that you have non-blocking IO, not that you have
an M:N threading model in the runtime. If you have non-blocking IO, then it
seems like you could easily implement an M:N threading model as a library
without making it awkward to interact with.

Is there some assumption I'm making that doesn't make sense?

~~~
Qerub
The Erlang runtime uses non-blocking IO calls to the kernel but exposes
blocking functions on top of this. Is such a function really blocking? It
depends on what level you're looking at: it's blocking at the Erlang level but
non-blocking at the kernel level. I think this is the source of the confusion.
You are of course correct in saying "If you have a blocking call you still
have to block an OS thread somewhere" if you're looking at the kernel level,
but not at the Erlang level (since it can translate blocking into
non-blocking).

> If you have non-blocking IO then it seems like you could easily implement an
> M:N threading model as a library without making it awkward to interact with.

Yes, but your language/system has to have some mechanism to manage control
flow. Examples: F# does what you ask for via asynchronous workflows, Ruby via
fibers/EM-Synchrony, and Java via a bytecode weaver like Kilim.

------
deweerdt
> Rust's tasks are often called _lightweight_ but at least on Linux the only
> optimization is the lack of preemption.

I'm surprised by this; I would have expected that voluntary preemption saves
you from having to save the FP registers as well. Is the absence of that
optimization a non-x86 thing?

