
On the Performance of User-Mode Threads and Coroutines - carimura
https://inside.java/2020/08/07/loomperformance/
======
scott_s
I was massively surprised a few years ago at how efficient Linux thread context
switching could be.

I designed and implemented a dynamic scheduler for a streaming dataflow
language a few years ago [1]. We wanted a runtime system which could have
hundreds of OS-level threads execute thousands of dataflow operators that
communicate in a dataflow manner. Threads should _not_ be statically assigned
portions of the dataflow graph so that we could elastically add or remove OS-
level threads based on observed performance.

We compared that scheduler to some other options, including just giving every
operator in the graph its own dedicated thread. One test application was a
simple 1,000 operator pipeline. We used two machines, one with 176 cores, the
other 184 cores. To my surprise, with the pipeline application, the dedicated
thread model beat my fancy scheduler in raw performance by up to a factor of
2. Keep in mind that that's 1,000 threads, all doing work, on machines with
only 176 and 184 cores.

Of course, you would not want to do this in practice, even though the raw
performance was so high: the machine was so massively oversubscribed during
such experiments that it could barely keep up with a simple interactive shell.

But, my intuition had been wrong: I had thought that surely having 10x the
number of threads as cores would mean the overall performance would crawl
because of context switching time. It did not. See section 5.1 of my paper
below for the experiment.

[1] Low-Synchronization, Mostly Lock-Free, Elastic Scheduling for Streaming
Runtimes, PLDI 2017,
https://www.scott-a-s.com/files/pldi2017_lf_elastic_scheduling.pdf

~~~
corysama
Similarly, Microsoft is advising that Fibers and User-Mode Scheduling are not
actually very useful anymore:
https://devblogs.microsoft.com/oldnewthing/20191011-00/?p=102989

Combining this with Linus's recent rant about how avoiding the kernel during
thread synchronization is counterproductive leads me to think that threads plus
a synchronized queue built with a mutex and a condition variable is the
simplest and most effective way to go wide.

~~~
pron
Windows fibers, and the fibers Gor Nishanov rants about as being bad _in C++_,
have very little to do with user-mode threads in managed runtimes like Java or
Go, despite sharing superficial similarities. In particular, a managed runtime
knows exactly how code uses the stack and understands its representation, and
Java does not have pointers into the stack. This means that stacks can be
moved and resized very, very cheaply. The constraints and capabilities that
matter most simply don't transfer from one language or environment to another.

------
pcwalton
This seems to be a response to Google's switchto work in the Linux kernel,
which is motivated by task-switching costs, as per this 2013 LPC presentation:
http://pdxplumbers.osuosl.org/2013/ocw//system/presentations/1653/original/LPC%20-%20User%20Threading.pdf

Note the parenthetical in the article: "And that is how user-mode threads
help: they increase L by orders of magnitude with potentially millions of
user-mode threads instead of the meager thousands the OS can support (but
don’t expect a 1000x increase in capacity; we’ve neglected computation costs
and are bound to hit bottlenecks in the auxiliary services.)"

This is something I'd like to see more of a focus on. For the generator use
case, I can easily see how the kernel thread spawn operation is the
bottleneck. But for the thread-per-connection server use case, I'm not sure
how expensive this cost is relative to all the other work that the thread
does. My suspicion is that Amdahl's Law is going to quickly rear its head
here. Take stack size for example: assuming the kernel stack is 10kB, if your
thread itself uses 10kB of stack, you've cut the theoretical memory advantage
of M:N down from the cited 1000x to a mere 2x…

~~~
pron
> This seems to be a response to Google's switchto work in the Linux kernel

Actually, it's a response to a discussion where somebody asked, "isn't it all
about context-switch cost?" But user-scheduled kernel threads are something we
had in mind when designing Loom, and we've made sure we can be compatible with
them. We've introduced the concept of a pluggable custom scheduler, and that
could be used for kernel threads just as it's used for virtual threads. The
user can choose a thread implementation -- all kernel, all user-mode, or part
kernel/part user-mode -- without changing any code. It's all an implementation
detail.
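
A minimal sketch of what choosing an implementation looks like, assuming a
builder-style API along the lines of Thread.ofVirtual()/Thread.ofPlatform()
(the exact Loom API names are still in flux, and the pluggable-scheduler hook
is elided here):

    import java.util.concurrent.Executors;

    public class ThreadChoiceDemo {
        public static void main(String[] args) throws Exception {
            // User-mode implementation behind the same java.lang.Thread API:
            Thread vt = Thread.ofVirtual().start(() -> System.out.println("virtual"));
            vt.join();

            // Kernel implementation, chosen through the same builder pattern:
            Thread kt = Thread.ofPlatform().start(() -> System.out.println("kernel"));
            kt.join();

            // Thread-per-task execution over virtual threads; task code is unchanged.
            try (var pool = Executors.newVirtualThreadPerTaskExecutor()) {
                pool.submit(() -> System.out.println("task on a virtual thread"));
            }
        }
    }

Either way, the Runnable and everything it calls stays the same; only the
builder call changes.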

> My suspicion is that Amdahl's Law is going to quickly rear its head here.

That depends. Amdahl's law is about accelerating one job by parallelising it,
while here we're more concerned with Little's law, which is about the rate of
independent requests you can process.
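
A worked example with illustrative numbers: Little's law says L = λW. If each
request spends W = 100 ms mostly waiting on downstream calls, then sustaining
λ = 10,000 requests/second requires L = 1,000 requests concurrently in flight.
Cap the thread count at a few thousand and you've capped λ, no matter how fast
each individual thread runs.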

> Take stack size for example: assuming the kernel stack is 10kB, if your
> thread itself uses 10kB of stack, you've cut the theoretical memory
> advantage of M:N down from the cited 1000x to a mere 2x…

Ah, except that's not so easy to do. It's very hard to have "tight" stacks
that are managed by the kernel for the reasons I mentioned here:
[https://news.ycombinator.com/item?id=24082951](https://news.ycombinator.com/item?id=24082951)

~~~
scott_s
> Amdahl's law is about accelerating one job by parallelising it,

Amdahl's law is about the limitations of improving the speed of something, but
not necessarily through parallelization. Its main point is that if you have
some overall task, and you then speed up some subtask within it, the
improvement to the overall time is limited to the contribution from the
subtask.

In the context of pcwalton's comment, I think they meant that thread creation
time may be tiny compared to the amount of time a newly created thread will
spend doing work in a server context. If that is the case, improving thread
creation time will have limited benefit to overall service time.

~~~
pron
> Amdahl's law is about the limitations of improving the speed of something,
> but not necessarily through parallelization.

Right, but the goal here is not to improve the speed of _something_, but
rather to handle as many different, mostly independent _somethings_ as
possible, without necessarily improving their speed (latency) at all.
Amdahl's law comes into effect in the delta between _mostly_ and _completely_.

> they meant that thread creation time may be tiny compared to the amount of
> time a newly created thread will do work in a server context. If that is the
> case, improving the thread creation time will have limited benefit to serve
> time.

Right, this is the same argument as for the context-switch overhead. Still, I
have to say that both virtual thread creation and context switching are much
better than for OS threads. But the point of the post was to show that in many
common use cases that is not where most of the benefit comes from; rather, it
comes from the number of threads you can have.

For example, if thread creation time was high but you could create millions of
them, you could create all of them up front and pool them and still get
most/all of the benefits. But if thread creation time was low but you could
only have a few thousand, then you'd still lose big because of Little's law.

~~~
scott_s
Agreed, strongly. I was just concerned the original comment may not have been
fully understood.

------
vii
The article explains that the primary benefit of user-mode green-thread fibers
is not switching speed, but that they are much cheaper than OS threads, so you
can have many more of them. The costs of OS threads are paid in memory usage,
and the operating system also has to do considerable bookkeeping for them.

However, Netty has offered strong support for callback-style IO on the JVM
for a long time, which effectively allows the same efficiencies. Of course, it
is also possible to do this without Netty. Therefore the real advantage of
Loom user-mode threads and co-routines is syntactic programming convenience.
That's the real innovation in Loom!
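
To make the convenience point concrete, here is a sketch contrasting the two
styles; fetchUser/fetchOrders are hypothetical stand-ins for real I/O, and
Thread.ofVirtual() is the builder name used in later JDKs:

    import java.util.concurrent.CompletableFuture;

    public class StyleContrast {
        // Callback style: composition and error handling via future combinators.
        static CompletableFuture<String> handleAsync() {
            return fetchUser()
                    .thenCompose(user -> fetchOrders(user)) // chain the next I/O step
                    .thenApply(orders -> "rendered: " + orders);
        }

        // Virtual-thread style: the same logic as ordinary blocking code.
        static String handleBlocking() {
            String user = fetchUserBlocking();
            String orders = fetchOrdersBlocking(user);
            return "rendered: " + orders;
        }

        // Stubs standing in for real network calls:
        static CompletableFuture<String> fetchUser() { return CompletableFuture.completedFuture("u1"); }
        static CompletableFuture<String> fetchOrders(String u) { return CompletableFuture.completedFuture(u + ":orders"); }
        static String fetchUserBlocking() { return "u1"; }
        static String fetchOrdersBlocking(String u) { return u + ":orders"; }

        public static void main(String[] args) throws Exception {
            System.out.println(handleAsync().join());
            Thread.ofVirtual().start(() -> System.out.println(handleBlocking())).join();
        }
    }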

~~~
free_rms
I thought the memory cost of threads was more of a JVM thing (default 1 MB
stack) than an OS thing (pages that aren't mapped/resident don't cost much).

What's the cost to the kernel besides a few structs?

~~~
pron
Once a page is committed, it cannot be uncommitted until the thread dies,
because the OS can't be sure how much of the stack is actually used. It cannot
even assume that only addresses above sp are used. Also, the granularity is
that of a page, which could be significantly larger than a whole stack of some
small, "shallow" thread, and we want lots of small threads.

~~~
marvy
How many parked "fibers" do you think a 32-bit JVM might be able to handle
once loom is ready for prime time? A thousand? A million? Somewhere in
between?

~~~
chris_overseas
I doubt 32bit is an issue for this and I'd assume well over a million, if
Kotlin's coroutines are anything to go by:
https://kotlinlang.org/docs/tutorials/coroutines/coroutines-basic-jvm.html#lets-run-a-lot-of-them

~~~
marvy
32-bit means you have at most 4 gigs of address space to play with; pron
implied that lower memory usage is the key savings of fibers vs threads, so I
assume a 32-bit JVM will hit a memory limit a lot sooner.
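
Back-of-the-envelope, with illustrative numbers: at the JVM's default 1 MB of
reserved stack per platform thread, a 4 GB address space caps you at under
~4,000 threads before anything else is mapped. Parked virtual threads whose
stacks are heap objects of, say, a few hundred bytes to a few KB each could
plausibly number in the hundreds of thousands even in 32 bits, heap permitting.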

------
jillesvangurp
The word performance is a bit misleading here. A key reason to use co-routines
is not necessarily maxing out CPUs but IO. Non-blocking IO and co-routines
allow handling many connections on very modest hardware. Blocking IO and
languages that don't handle concurrency very well (scripting languages like
ruby or python) deal with this by forking multiple processes, each of which
can typically only do one thing at a time. Java historically worked around
this by using threads.

Using processes doesn't scale nearly as well, and you typically run out of
memory before you run out of CPU this way. Using threads like Java does scales
a bit better, but there are only so many threads you can juggle. Co-routines
or green threads combined with non-blocking IO are much better, and this is
also becoming common on the JVM, where most modern frameworks support it. And
of course, if you use Kotlin, co-routines provide a really solid
implementation and programming model for this.
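
As a concrete sketch of the thread-per-connection model this enables
(Thread.ofVirtual() is the builder name from later JDKs; the same code runs on
OS threads but stops scaling at a few thousand connections):

    import java.io.IOException;
    import java.net.ServerSocket;
    import java.net.Socket;

    public class EchoServer {
        public static void main(String[] args) throws IOException {
            try (ServerSocket server = new ServerSocket(8080)) {
                while (true) {
                    Socket conn = server.accept();
                    // One cheap user-mode thread per connection; the blocking
                    // copy below parks the virtual thread, not an OS thread.
                    Thread.ofVirtual().start(() -> echo(conn));
                }
            }
        }

        static void echo(Socket conn) {
            try (conn) {
                conn.getInputStream().transferTo(conn.getOutputStream());
            } catch (IOException e) {
                // Connection dropped; nothing more to do.
            }
        }
    }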

This is also the reason for the popularity of go and node.js, which aren't
known for their particular ability to get the last bit of performance out of
CPUs but do scale rather nicely when it comes to handling non blocking IO
asynchronously.

~~~
game_the0ry
> Co-routines or green threads combined with non blocking IO is much better
> and is also becoming common on the JVM where pretty much most modern
> frameworks support this.

Appreciate the insightful comment.

Basically, the ideal utilization of compute resources (CPU and RAM) is
engineering for _both_ async I/O and maximizing CPU threads, correct?

I agree, mostly.

I can't say I am very confident in understanding / implementing concurrent
threading in any language. I am working on trying to get it, but sometimes I
don't know if it is worth the mental overhead or the added code complexity.

Single threaded async alone took me a while to grasp, and I am sure I still
have gaps in my knowledge. But mentally modeling a single threaded execution
is trivial.

I guess I try to maximize for my own understanding and efficiency rather than
maximizing for performance.

------
1f60c
This is the first time I've heard of the .java TLD, but sadly only Oracle and
its affiliates are allowed to register domains[0].

[0]: https://www.oracle.com/a/ocom/docs/registration-policy-java-2418402.pdf

------
MaxBarraclough
See also this thread from 2 days ago on _Sub-10 ms Latency in Java: Concurrent
GC with Green Threads_. There's some discussion of Project Loom there.

[https://news.ycombinator.com/item?id=24059335](https://news.ycombinator.com/item?id=24059335)

------
The_rationalist
I wonder if kernel-aware green threads will obsolete conventional green
threads such as in Loom.

https://www.phoronix.com/scan.php?page=news_item&px=Google-User-Thread-Futex-Swap

~~~
pron
User-scheduled kernel threads mostly address the context-switch question (and,
more generally, the choice of an appropriate scheduling algorithm), but they
don't help as much with the footprint, and so with the level of concurrency,
_L_, and _that's_ where the biggest win is. Doing that requires a deeper
knowledge of how the language makes use of the stack than the OS can have.
Still, they have their uses, too, and the expectation is that when they
arrive, Java will support them as a third implementation of threads.

~~~
The_rationalist
_Doing that requires a deeper knowledge of how the language makes use of the
stack than the OS can have._ I often hear that argument, but isn't the reverse
argument also true? The JVM has access to information and language semantics
useful for threading that the kernel doesn't have access to, but doesn't the
kernel have access to useful information for threading that the JVM's green
threads don't? The one piece of information that comes to my mind is awareness
of all threads from all programs, not only the ones from the JVM. So could
next-generation green threads get access to some of this information? Could
this enable greater performance? If so, how? Could Google's futex swap help
there?

 _they have their uses, too, and the expectation is that when they arrive,
Java will support them as a third implementation of threads._ Great to hear,
but I hope their use cases will be well defined and won't overlap too much
with Loom. _Wild idea_: could the JVM's Loom scheduler convert a green thread
into a Google-style kernel thread at runtime, and vice versa? That could help
performance, especially in pathological cases.

Finally, I _believe_ that, say, 1,000 green threads actually reduce to a pool
of kernel threads encapsulated by the JVM. If that belief is correct, could
the real kernel threads underlying the green threads benefit from being
futex-swap kernel threads?

~~~
pron
> Could this enable greater performance? If so, how?

I don't know. Any ideas?

> Great to hear that but I hope that their use cases will be well defined and
> not overlap too much with loom.

Supporting them would be a part of Loom.

> could the JVM loom scheduler at runtime convert a green thread into google
> thread & vice versa? That could help performance, especially on pathological
> cases?

Virtual threads' schedulers aren't in the VM. They're written in Java and are
pluggable. But the answer to your question is yes, it's possible -- and not
too hard, even -- but I don't know how much it would help.

~~~
The_rationalist
_Any ideas?_ I'm not an expert; the only idea I have in mind is simple. Say
you want to maximize CPU core usage. In that case you want to create
((N * 2) - Y) kernel threads underlying the user-mode threads, where N is the
number of CPU cores and Y is the number of existing kernel threads (maybe user
space can access that information through a syscall?). This information allows
an optimal number of kernel threads backing the green threads: optimal when
you only want to maximize computation (and not concurrency, as you explain in
your blog).
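
For what it's worth, a minimal sketch of the N part in Java; the Y term
(existing kernel threads system-wide) has no portable standard-library
accessor, so it is left as a placeholder:

    public class PoolSizing {
        public static void main(String[] args) {
            // N: logical CPUs visible to the JVM.
            int n = Runtime.getRuntime().availableProcessors();
            // Y: existing kernel threads; no portable API, placeholder only.
            int y = 0;
            int carriers = Math.max(1, (n * 2) - y);
            System.out.println("carrier threads backing user-mode threads: " + carriers);
        }
    }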

I would like to ask you two other things. Firstly, I have seen in the Loom
document that it will use restartable sequences. I find such creative use of
modern Linux technology an awesome thing to do, but you might need to delay
support for them:
https://www.phoronix.com/scan.php?page=news_item&px=Glibc-Restart-Restartable-RSEQ

Finally, I would really like you to attempt a comparison of Loom with current
Kotlin coroutines. I'm sure you have learnt a bit from them, as they honestly
are the most ergonomic, state-of-the-art coroutines I have ever seen,
especially regarding the following points: exceptions just work; beautiful
structured concurrency; you can trivially specify and adjust the "scheduler"
on the fly through both its context and its dispatcher; great cancellation
support; colorless (no pollution with a CompletableFuture); beautiful
integration with reactive streams, reinvented through Flow.

I don't expect Java Loom to be able to be as syntactically beautiful and
simple as Kotlin coroutines, but it would really be a shame for it not to be
as expressive. So I hope that you will borrow all of their concepts, as they
are well thought out, maximally expressive yet simple.

So a comparison of the surface-level features would be nice. But what
interests me the most would be a comparison of how Kotlin coroutines
technically work today, and in what ways Loom will outperform them (as you can
already have ~1 million coroutines). I expect better performance since you
have access to, e.g., io_uring.

I hope that Kotlin coroutines will be able to be transparently reimplemented
on top of Loom threads, and therefore it would be very nice for Loom's
architects to communicate with Kotlin's architects on interoperability and
reuse / synergies. Both communities bring great progress to the JVM, and it's
beautiful to contemplate.

------
ciconia
I think the Go scheduler makes a good case for multiplexing a large number of
user-mode threads (or goroutines) over a limited number of OS threads:
whenever a goroutine needs to make a blocking syscall (e.g. writing to a
socket), the work is offloaded to a "network poller" thread, and meanwhile the
Go scheduler can continue to service other runnable goroutines.

It seems to me that this allows Go programs to maximize the use of their OS-
thread time slices without being preempted on each syscall. Does anybody care
to comment on how this impacts throughput and latency?

------
nhoughto
In my mind, user-mode threads are much more about engineering productivity
than raw performance. The comparison is between wasteful blocking network
calls on OS threads, written in a simple synchronous way, and efficient
user-mode threads making network calls in a non-blocking way, also written
synchronously.

Today, to get non-blocking efficiencies you need to write async code, because
without Loom there are no out-of-the-box user-mode threads. Even if user-mode
threads aren't as efficient in raw numbers, they're still waaay ahead of the
predominant way threads are used in Java today (mostly with respect to network
calls).

------
kgoutham93
Are there any online resources that provide an overview of advanced
concurrency patterns across various programming languages?

I would like to understand the trade-offs between the different approaches.

------
carimura
If interested in aggregated Project Loom news:
[https://inside.java/tag/loom](https://inside.java/tag/loom)

------
barbarbar
This site seems to have a lot of interesting articles. I have been trying to
read about things like this on OpenJDK, but this site seems to be much easier
to navigate (unlike openjdk). It is also highly likely that I am an idiot.

