
Why you can have millions of goroutines but only thousands of Java threads - navigaid
https://rcoh.me/posts/why-you-can-have-a-million-go-routines-but-only-1000-java-threads/
======
ChuckMcM
Fwiw, the original implementation of threads in Java was 'green' threads (it
was project Green after all). These were implemented in SunOS (vs System V)
using setjmp/longjmp and some auto-yield logic in the JVM (basically when
you did certain things it would start that thing and yield your thread while
it completed). When I started I had the most System V experience so I ported
it over to Solaris to get to what was affectionately called the 'hello hello
hello world world world' stage. The choice to use the Solaris native threads
over green threads was motivated by the work that had gone on in the Solaris
kernel to do CPU scheduling efficiently. That gave Java multiprocessing for
'free'. Except it exposed a bunch of bugs in the mutex / synchronized code
when you actually got parallel execution :-).

The magic, as the author points out, is dynamic stack allocation. If you
disassociate a thread from a required stack size, then you are back to using
just the amount of memory that your object heap and reference stores are
using. GC gets to be trickier too, of course.

This should nominally be a non-issue on 64 bit address machines as you only
need to _map_ the pages that actually have data in them (minimum one page) but
that still puts a huge load on the page tables. This is especially true with
huge pages which are used to avoid tlb thrashing and to save on the number of
levels you have to go through to get back the physical address of the page you
are trying to access.

It's the kind of optimization problem that systems folks really like to dig
into and optimize around.
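
(For a rough feel of what dynamic stacks buy you, here's a minimal Go sketch;
the 100k goroutine count and the MemStats arithmetic are illustrative, not a
careful benchmark:)

    package main

    import (
        "fmt"
        "runtime"
        "sync"
    )

    func main() {
        var before, after runtime.MemStats
        runtime.ReadMemStats(&before)

        var wg sync.WaitGroup
        block := make(chan struct{})
        const n = 100000
        for i := 0; i < n; i++ {
            wg.Add(1)
            go func() {
                defer wg.Done()
                <-block // park; each goroutine keeps only its small initial stack
            }()
        }

        runtime.ReadMemStats(&after)
        fmt.Printf("~%d KB per goroutine\n", (after.Sys-before.Sys)/n/1024)

        close(block)
        wg.Wait()
    }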

~~~
_pmf_
Is it possible today to have a compliant Java implementation using green
threads? Seems like something that would appear in Azul first.

~~~
kitd
Project Loom [1] is intended to provide green threads, or 'fibers', at the JVM
level.

There have been other projects, like Quasar from ParallelUniverse [2], that
retrofit them by processing the compiled bytecode to enable saving/restoring
the IP & stack.

[1] http://cr.openjdk.java.net/~rpressler/loom/Loom-Proposal.html

[2] http://docs.paralleluniverse.co/quasar/

~~~
eternalban
Kilim [1] (afaik) was first in this space. (edit: just FYI)

[1]: http://www.malhar.net/sriram/kilim/kilim_ecoop08.pdf

------
cryptonector
It's all about fixed resource allocations. The default stack size for user-
land threads tends to be 1MB or 2MB, and there's also (smaller) kernel stacks.

In implementations of languages like Scheme or Haskell where no stack may even
be allocated, or where stacks might be allocated in chunks, the cost of a co-
routine can be very small -- as small as the cost of a closure. If you take
that to the limit and allocate every call frame on the heap, and if you make
heavy use of closures/continuations, then you end up with more GC pressure
because instead of freeing every frame on return you have to free many of them
via the GC.

In terms of ease of programming to deal with async events, the spectrum runs
from threads on the one hand to callback hell on the other. Callback hell can
be ameliorated by allowing lambdas and closures, but you still end up with
indentation and/or paren/brace hell. Co-routines are somewhere in the middle
of the spectrum, but closer to threads than to callback hell. Await is
something closer to callback hell with nice syntax to make things easier on
the programmer.

Ultimately it's all about how to represent state.

In threaded programming program state is largely implicit in the call stack
(including all the local variables). In callback hell program state is largely
explicit in the form of the context argument that gets passed to the callback
functions (or which they close over), with a little bit of state implicit in
the extant event registrations.

Threaded programming is easier because the programmer doesn't have to think
about how to compactly represent program state, but the cost is higher
resource consumption, especially memory. More memory consumption == more cache
pressure == more cache misses == slower performance.

Callback hell is much harder on the programmer because it forces the
programmer to be much more explicit about program state, but this also allows
the programmer to better compress program state, thus using fewer resources,
thus allowing the program to deal with many more clients at once and also be
faster than threaded programming.

Everything in computer science in the async I/O space in the last 30 years has
been about striking the right balance between minimizing program state on the
one hand and minimizing programmer pain on the other. Continuations,
delimited continuations, await, and so on -- all are about finding that sweet
spot.
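
(To make the state-representation contrast concrete, here's a minimal Go
sketch of the same length-prefixed read written both ways; the protocol and
all names here are invented for illustration:)

    package main

    import (
        "bytes"
        "encoding/binary"
        "fmt"
        "io"
    )

    // Implicit state: the stack remembers where we are between reads;
    // n and buf are just locals.
    func handleBlocking(r io.Reader) {
        var n uint32
        if err := binary.Read(r, binary.BigEndian, &n); err != nil {
            return
        }
        buf := make([]byte, n)
        if _, err := io.ReadFull(r, buf); err != nil {
            return
        }
        fmt.Printf("blocking: got %d bytes\n", len(buf))
    }

    // Explicit state: the same protocol as a hand-written state machine an
    // event loop would drive; every local becomes a struct field.
    type session struct {
        wantBody bool
        n        uint32
        buf      []byte
    }

    func (s *session) onReadable(r io.Reader) (done bool) {
        if !s.wantBody {
            if err := binary.Read(r, binary.BigEndian, &s.n); err != nil {
                return true
            }
            s.wantBody = true
            return false
        }
        chunk := make([]byte, 4096)
        m, err := r.Read(chunk)
        s.buf = append(s.buf, chunk[:m]...)
        if err != nil || uint32(len(s.buf)) >= s.n {
            fmt.Printf("callback: got %d bytes\n", len(s.buf))
            return true
        }
        return false
    }

    func main() {
        wire := append([]byte{0, 0, 0, 5}, []byte("hello")...)
        handleBlocking(bytes.NewReader(wire))

        s := &session{}
        r := bytes.NewReader(wire)
        for !s.onReadable(r) {
        }
    }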

~~~
TimJYoung
Very good summary. I would only add that it pays to mention that thread/fiber
stacks typically only _reserve_ 1-2MB of VM, and only a small amount is
committed initially (1 to 2 pages). With 32-bit address spaces, this is a
minor distinction because the size of the address space is what matters the
most, but with 64-bit address spaces, it can mean that you aren't putting
as much pressure on the VMM as one might think. If all of the threads in a
process only use the default two pages, and the system page size is 4K, then
10,000 threads are only going to result in a commit size of 81MB of VM for the
stacks. Of course, that's a lot of "ifs", and it completely ignores the
effects of that many threads on the per-process VM mappings/CPU caches.
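
(A minimal sketch of reserve-vs-commit, using Go's syscall package on Linux;
the 1MB reservation mirrors a default thread stack, and the sizes are
illustrative:)

    package main

    import (
        "fmt"
        "syscall"
    )

    func main() {
        const stackSize = 1 << 20 // reserve 1MB of address space, like a thread stack

        // Anonymous mapping with MAP_NORESERVE: address space only, no
        // physical pages committed yet.
        mem, err := syscall.Mmap(-1, 0, stackSize,
            syscall.PROT_READ|syscall.PROT_WRITE,
            syscall.MAP_PRIVATE|syscall.MAP_ANONYMOUS|syscall.MAP_NORESERVE)
        if err != nil {
            panic(err)
        }
        defer syscall.Munmap(mem)

        // Touching one byte commits just that one 4K page; the other 255
        // pages of the "stack" still cost nothing physical.
        mem[0] = 1
        fmt.Println("1MB reserved, ~4KB committed")
    }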

~~~
cryptonector
I keep getting comments about this, but it's not that simple. First of all,
unless you're overcommitting, those stacks consume more resources than one
might think at first blush. Second, the apps I am familiar with (on the Java
side) get pretty deep stacks (you can tell from the stack traces in
exceptions!). Third, if you want millions of stacks on a 64-bit address-space,
I'm thinking you'll want larger pages if much more than a handful of 4KB pages
per-stack get used (as otherwise the page tables get large), and then you're
consuming large amounts of RAM anyways. I may be wrong as to the third item,
but I suspect the actual depth of the stacks as-used makes a big difference.

The point is that thread-per-client is just never going to scale better than
C10K no matter what, and that the work we see in this space is all about
finding the right balance between the ease-of-programming of thread-per-client
and the efficiency of callback hell (typical C10K).

For any size computer (larger than tiny), C10K designs will wring out orders
of magnitude better scaling than typical thread-per-client apps no matter how
much you optimize the hardware for the latter.

Thread-per-client can only be in the same ballpark as C10K when
the actual stack space used is tiny and comparable to the amount of explicit
state kept in the C10K alternative, and even then, the additional kernel
resources consumed by those per-client threads will dwarf those needed for the
C10K case (thread per-CPU), thus adding cache pressure. In practice, thread-
per-client is never that efficient because the whole point of it is that it
makes it trivial to employ layered libraries of synchronous APIs.

Now, I'm not saying that the alternatives to thread-per-client are easy, but
we should do a lot better at building new libraries so that async I/O can get
easier.

------
BenoitP
I'm surprised Loom hasn't been mentioned yet in this thread.

Proposal: http://cr.openjdk.java.net/~rpressler/loom/Loom-Proposal.html

It's in an early phase, but there is already a prototype.

The commits on the mailing list mention recent zero-copy continuation
thawing, if I'm not mistaken.

http://mail.openjdk.java.net/pipermail/loom-dev/2018-October/thread.html

EDIT: I'm mistaken, it is about lazy stack walking (but I recall that
zero-copy continuation thawing might be attempted).

~~~
aardvark179
Heap-allocated stacks require teaching the garbage collectors new tricks,
since they need to parse the frames for references, whereas those can be
handled during the freeze/thaw operation if we copy the stack.

We think we can achieve the level of performance we want with stack copying.

~~~
cryptonector
Stack copying strikes me as oh so wrong...

Anyways, allocating frames or stacks on the heap doesn't seem like it should
require teaching the GC all that much. They would essentially be objects of a
class that the GC can traverse... just like a normal stack (from the current
frame through return closures), which it already has to know how to do. Their
location shouldn't be so important.

------
davidw
> In Apache, each request is handled by 1 OS thread, effectively capping
> Apache to thousands of concurrent connections

That's wrong. There are a variety of different Apache MPMs:
https://httpd.apache.org/docs/2.4/mpm.html

In the footnotes, it also points out that Erlang uses a similar system, which
is true, and worth a look as well.

------
throwaway83628
In typical Java fashion, there is a project, Quasar, that adds true green
thread support to the language. It's done by instrumenting bytecode rather
than via JVM support, but the benchmarks are still excellent. I'm very curious
whether the Java or Go implementation is faster. Java does have fast async
support, which is all you need to add green threads at the library level.

Also in Java fashion, the support isn't perfect, and requires some fiddling
with JVM parameters (and annotations, obviously). It's still really solid in
my experience, however, and a shame that it isn't more widely known.

A partner project, Comsat, provides fairly comprehensive library support for
databases and such. I've used it before with Vert.x and the performance was
insane. Possibly faster than Go.

As another aside, there's an active research project to add green threads to
Java's core as well: Project Loom.

~~~
majewsky
> I'm very curious whether the Java or Go implementation is faster.

Since it's Java, I expect it to be slightly faster than Go while using double
the amount of memory.

~~~
throwaway83628
This sounds very accurate

------
denormalfloat
Go may have more threads, but it is not better at concurrency. It lacks
several key things that make Java much easier to scale up, even if Go is
easier to get started with. On my wish list:

* Thread-local storage. A million goroutines mean nothing if they are all fighting over shared storage. Consider trying to implement Java's LongAdder class, which does intelligent sharding based on contention.

* ConcurrentHashMap. sync.Map is horrendously contentious for writes. It's not even possible to write one yourself, because Go hides the built-in hash function it generates for each type.

* Goroutine joining. This is between a rock and a hard place: if you don't wait until goroutines are done, you risk leaks, and if you do wait, you need to put WaitGroups all over the place (see the sketch after this list).

All those millions of goroutines don't help if you don't have the tools to
coordinate them.
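
(A minimal sketch of the joining boilerplate in question; the worker function
is hypothetical:)

    package main

    import (
        "fmt"
        "sync"
    )

    func worker(id int, results chan<- string) {
        results <- fmt.Sprintf("worker %d done", id)
    }

    func main() {
        results := make(chan string)
        var wg sync.WaitGroup

        for i := 0; i < 4; i++ {
            wg.Add(1)
            go func(id int) {
                defer wg.Done() // every spawn site needs this pairing by hand
                worker(id, results)
            }(i)
        }

        // A second goroutine just to close the channel once all workers finish.
        go func() {
            wg.Wait()
            close(results)
        }()

        for r := range results {
            fmt.Println(r)
        }
    }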

------
UseofWeapons1
The article is off by an order of magnitude in one of the more headline
grabbing comparisons. It states 2.5 million goroutines in a gigabyte of RAM,
when it should be 250k at 4KB per stack. Still an impressive difference, but
less so.

------
apoorvgarg
I think it is important not to throw around words like 'RAM' in such
discussions, and to use more unambiguous terminology instead: virtual address
space and resident set size (RSS).

On a 64-bit machine, with 1MB stack size (this is a claim on the virtual
address space, not the physical memory), you can have millions of threads too
(that you may hit a lower limit due to other OS level knobs is another
matter).

~~~
timClicks
But wouldn't the consequence of spawning millions of native threads be to
spill those virtual addresses into many, many pages and incur many more TLB
misses and page faults generally?

~~~
nostrademons
Only the portions of the stack that are actually touched trigger a TLB miss
and page fault. If you've got a million threads but each only touches the
first 4K of stack space, you end up only touching 4G of RAM even though you've
mapped 1T of it.

~~~
cryptonector
When you have terabytes of RAM you want large pages anyways.

~~~
gpderetta
As each stack will need to use [1] at least one page of physical memory, large
pages are not great for stacks as they will normally waste memory.

[1] for classic contiguous stacks of course. That doesn't apply for separately
allocated frames or segmented stacks.

------
sorokod
The article conflates Java and JVM. Kotlin ( on the JVM) has coroutines and
can easily run 100k at a time.

~~~
freshhawk
And Java has Quasar. Hasn't everyone on the JVM had various coroutine/green-
thread options for a long time now?

~~~
weberc2
I'm curious whether there are any advantages to having all of your libraries
use the same concurrency mechanism. I've been dealing with a lot of "sync vs
async" (aka "What color is your function?") hell lately; I wonder if it would
be similar to deal with some libraries that use native threads while others
use one coroutine library and still others use a different one.

------
omazurov
_> Supporting truly massive concurrency requires another optimization: Only
schedule a thread when you know it can do useful work!_

The problem is that the scheduler doesn't know when a goroutine can do useful
work. It assumes that if a new value arrives on a channel there is a good
chance the goroutine will do something useful, but that's not the case in
general. My favorite example is Game of Life (a cellular automaton)
implemented as a network of goroutines communicating their state changes
through channels: you have to collect 8 updates from predecessors to update
your state, meaning you'll be scheduled to run 8 times before you finally make
real progress. Not good for scalability.
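
(A minimal Go sketch of one such cell goroutine; the channel wiring is
illustrative, and a real grid would connect eight actual neighbors:)

    package main

    import "fmt"

    // A Life cell as a goroutine: it needs all eight neighbor updates before
    // it can make real progress, but each arriving value is a potential wakeup.
    func cell(alive bool, neighbors <-chan bool, state chan<- bool) {
        for {
            live := 0
            for i := 0; i < 8; i++ {
                if <-neighbors { // wake, read one value, likely park again
                    live++
                }
            }
            alive = live == 3 || (alive && live == 2)
            state <- alive
        }
    }

    func main() {
        neighbors := make(chan bool)
        state := make(chan bool)
        go cell(false, neighbors, state)

        // One generation's worth of neighbor states: exactly three alive.
        for _, a := range []bool{true, true, true, false, false, false, false, false} {
            neighbors <- a
        }
        fmt.Println("cell alive:", <-state) // true: birth with exactly 3 live neighbors
    }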

~~~
benwilson-512
In Erlang (and presumably Go) green threads end up in a waiting queue when
waiting on further work to do, and as such don't need to be polled or consume
any CPU at all while idle. It's only when they receive a message indicating
more work can be done that they move into the work queue.

~~~
chrisseaton
I think the person you're replying to was saying that 'useful work' does not
include waking up, reading one value, and then immediately sleeping again to
wait for another response. That's a big, expensive context switch for nothing
very useful, and it could have waited until all inputs were ready and there
was significant computation that could be run.

~~~
Ndymium
In most cases, BEAM's (the Erlang VM's) context switches are not big and
expensive, though, which is something you get with user space threads.

------
choxi
> CPUs can only execute about one logical thread per core truly concurrently.1
> An inherent side effect: If you have more threads than cores, threads must
> be paused to allow other threads to do work, then later resumed when it’s
> their turn again.

When would millions of goroutines be useful if you can only run ~4 threads in
parallel?

~~~
icebraining
If you're a server with a million clients holding an open connection, you can
have a goroutine for each, holding the connection state while waiting for them
to communicate again.
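
(A minimal sketch of that pattern using the standard net package; the echo
logic stands in for real protocol handling:)

    package main

    import (
        "bufio"
        "log"
        "net"
    )

    func main() {
        ln, err := net.Listen("tcp", ":8080")
        if err != nil {
            log.Fatal(err)
        }
        for {
            conn, err := ln.Accept()
            if err != nil {
                continue
            }
            // One cheap goroutine per client; it parks (costing no CPU)
            // whenever the client goes quiet.
            go func(c net.Conn) {
                defer c.Close()
                r := bufio.NewReader(c)
                for {
                    line, err := r.ReadString('\n')
                    if err != nil {
                        return
                    }
                    c.Write([]byte(line)) // echo; connection state lives on the stack
                }
            }(conn)
        }
    }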

~~~
yen223
This is especially important if you have long-lived connections, e.g.
websockets.

------
platz
Joe Damato claims the reason threads were historically slower than userspace
threading is an accident involving MAP_32BIT in glibc:
> So I think I've identified the culprit and I believe the culprit is XFree86.

> it took seven years after this change went in for someone to figure out all
> the different parts that were damaged and fix them later, right? So threads
> were broken for a really long time.

https://www.deconstructconf.com/2017/joe-damato-all-programmers-must-learn-c-and-assembly

~~~
asveikau
But people were saying that long before the events of that story, which start
in 2001.

And people were saying it on platforms other than Linux.

------
steenreem
What's the use case for many virtual threads? I thought the main case for
using threads was performance, where you need them to fully utilize the
machine, even though you'd rather not be using threads at all.

Another use-case for threads that I see is when you have work-queues that have
different priorities, so you put each queue into its own thread with a
matching priority. For example, 1 thread for responding to UI events, and 1
lower priority thread for all background work.

Anyways, none of these use cases requires many threads. What use case am I
missing that makes goroutines useful?

I guess doing something in a thread is a fail-safe way to make sure that 'some
time' is spent processing it, so, for example, for the UI tasks it would be
easy to just dump all handling of UI events into virtual threads.

~~~
e_proxus
Coming from the Erlang world (very similar to Go as described in the blog
post) they're not used for raw speed (in the sense of utilizing more cores).
The Erlang VM (or the Go runtime) uses them internally in that way, but for
the programmer the benefit is somewhat different.

Processes or co-routines are often used more as concurrent objects. That is,
they're used to separate data and concerns into separate, isolated memory
areas. However, they're also sequential programs in their own right, so the
system becomes much easier to reason about. If you have a few OS threads per
core, you are responsible for distributing work among those threads, whereas
if you have many co-routines per "job" you can let the runtime decide and
schedule intelligently.

A concrete example is a web server. Implemented in Erlang or Go you would
typically spawn a co-routine for each incoming connection. There you would get
isolation and a simple implementation (parsing the request, performing the
task, returning the response and exiting). A crash or bug handling a specific
request would not affect the other clients.

Other great use cases are messaging and communication apps (examples like
RabbitMQ or Discord come to mind), concurrent databases, etc.

------
brown9-2
_For use cases where a high degree of concurrency is required, it’s simply the
only option._

You can get high concurrency in Java by avoiding the 1 thread per
request/connection model and using NIO with easy-to-use libraries like Netty.

~~~
virmundi
To do so you’d step out of the JEE spec. Spring and others are trying to
bridge this. Legacy DB drivers are a problem. For example, last I checked the
PG driver was still thread-focused for concurrency and transactions.

~~~
cryptonector
So write a new driver. Yes, I know, that's a lot of work.

The thread-per-client sin derives in large part from having synchronous APIs
first.

Every API, every project, needs an async-first approach.

Even then, many many all-synchronous-all-the-time thread-per-client apps will
get written, but at least it will be possible to write thread-per-CPU,
async-I/O apps where those apps need to scale.

~~~
gpderetta
While I would love for every API to have an async counterpart, there is a
non-trivial cost in terms of complexity, latency and performance to implement
an async API.

~~~
cryptonector
Cost in developer time, yes. Cost in latency and performance? No! That's the
whole point of async I/O: to improve latency, performance, and scalability.

It's much harder to rewrite software to use async I/O than it is to write it
to use async I/O in the first place. It's also much harder to write software
using async I/O than sync I/O. Therefore it's all about economics and
forecasting. If you can forecast needing to scale soon enough, then start with
async I/O, otherwise increase revenue and profits first then rewrite. For many
companies developer time is a larger item on their budgets than the cost of
extra servers and power and all that, which is why we have so much thread-per-
client software -- it's a natural outcome.

~~~
gpderetta
Async operations might scale better, but the cost and latency per single
operation is higher than a synchronous one.

~~~
cryptonector
Do you have a source for this?

------
romed
This article is wrong and reflects what appears to be a widely-held belief
(practically canon here on HN) that a thread has a fixed-size stack. This is
not the case on every operating system! Linux, which I've heard is popular,
initially commits one page for a thread stack and continues to add pages until
that thread's stack limit is hit. So this statement from which this article
draws its conclusion is incorrect:

"each OS thread has its own fixed-size stack."

100% wrong.

~~~
chmaynard
The article makes a number of authoritative claims without providing much in
the way of references to other more scholarly work. One might characterize it
as an opinion piece masquerading as serious analysis. Nevertheless, I found it
thought-provoking and I'm learning a lot by reading the comments here.

~~~
hazeii
One error I spotted is that at 4KB each, 1GB only gets you 250,000 goroutines
(i.e. the article is wrong by an order of magnitude). Easy mistake to make,
but bad in a comparison article.

FTA: "With 4KB per stack, you can put 2.5 million goroutines in a gigabyte of
RAM – a huge improvement over Java’s 1MB per thread"

------
Agrosis
Comparing Java threads and goroutines doesn't really work because Java's
thread model maps threads 1:1 to OS threads, whereas goroutines are simply
continuations that can consume any OS thread. Java threads couple both the
concerns of scheduling and task definition. Java Runnables address only the
concern of task definition and are closer in nature to what goroutines are,
but they still don't possess the ability to suspend and yield their execution.

You can implement Go's thread model (more generally known as green threads,
coroutines, fibers, lightweight threads) on the JVM. This is precisely what
Project Loom intends to do and what Scala effect systems already do. You
construct a program in a type called IO that only describes a computation that
can be suspended and resumed, unlike Java Runnables. These IO values are
cooperatively scheduled to run on a fixed thread pool. Because these IO values
just represent continuations, you can have millions of them cooperatively
executing at the same time. And there is no context-switch penalty for
swapping the continuation you are executing for another.

~~~
gpderetta
The 1:1 mapping is just an implementation detail of current JVMs. Originally,
green threads (with an M:N scheduler) were used instead. IIRC they weren't
even guaranteed to be preemptively scheduled; I don't know if this has changed
since.

Technically, OS threads are also simply continuations.

------
cryptos
You can easily have millions of coroutines on the JVM with Kotlin coroutines:
https://kotlinlang.org/docs/reference/coroutines/coroutines-guide.html

------
ptx
The article claims that aside from memory consumption the main advantage of
the Go approach is that the scheduler is integrated with the channels concept
and thus knows which goroutines are blocked and don't need to be scheduled
until some data shows up in their channel.

But don't OS threads work like this as well, by being integrated with file
descriptors among other things? If a thread is blocked on a read, the OS knows
this and won't keep scheduling that thread - right?

If so, the article's argument about time wasted on scheduling doesn't make
sense.

~~~
imtringued
I'm so tired of reading this over and over again.

You're basically asking if generalized preemptive multitasking that has to
solve every problem equally well is really that much slower than specialized
cooperative multitasking that is tailored to a specific language?

Of fucking course user-level scheduling is going to be superior. Just think
about it. Languages like Erlang and Pony dispatch to the scheduler after
finishing the execution of every single function. Do you really think that
switching to a new thread on every function call is going to be faster than
not doing that? Consider that the way you're supposed to do iteration is via
recursion, which means you will have one function call on every iteration and
therefore invoke the scheduler on almost every iteration. The user-level
scheduler can instantly switch to the next pending actor/goroutine/whatever as
if it were just a regular function call. Meanwhile, your regular threads have
to switch to the kernel first, which then has to load the scheduling tree from
main memory (because the caches are filled with user-level data), make a
complex scheduling decision, and finally switch back and reload whatever data
was just flushed out of the cache.

So yes, the article's argument about time wasted on scheduling makes a whole
lot of sense, because not only is the scheduling of preemptive threads itself
more expensive, it also incurs extra costs through context switches and
emptying the cache.

~~~
ptx
> You're basically asking ...

No.

> So yes, the article's argument about time wasted on scheduling makes a
> whole lot of sense because ... context switches ...

Right. I understand that doing multi-tasking closer to the application code
can be more efficient by avoiding involving the kernel and context switching.

What I was wondering about was this specific argument that the article was
making:

 _Suppose for a minute that in your new system, switching between new threads
takes only 100 nanoseconds. Even if all you did was context switch, you could
only run about a million threads if you wanted to schedule each thread ten
times per second. More importantly, you’d be maxing out your CPU to do so.
Supporting truly massive concurrency requires another optimization: Only
schedule a thread when you know it can do useful work! If you’re running that
many threads, only a handful can be doing useful work anyway. Go
facilitates this by integrating channels and the scheduler. If a goroutine is
waiting on an empty channel, the scheduler can see that and it won’t run the
Goroutine._

Doesn't the OS also perform this optimization, i.e., only scheduling threads
that can do useful work? If you have thousands of threads that are all
sleeping or blocking on read, it was my understanding that the OS will not
schedule them – contrary to what the article is saying above. Am I wrong about
this?

------
pcwalton
This post is wrong. OS threads do not have a fixed-size stack in the 1:1
model. After all, if they did, then Go couldn't do stack growth, because
goroutines are implemented as a layer on top of native threads. There are
language runtimes, such as SML/NJ, that _heap allocate_ all stack frames, and
the kernel doesn't complain.

Additionally, as others have pointed out, Java (and, for that matter, all
native code on Linux) used to use the Go model and switched to 1:1.

~~~
emfree
Goroutine stacks are in fact allocated on the heap. All the details are in
here:
https://github.com/golang/go/blob/master/src/runtime/stack.go

~~~
pcwalton
By heap allocating stack frames I mean that each time you call a function,
SML/NJ allocates space for that individual frame on the heap. It then uses the
garbage collector to clean up frames after the function call has returned.

~~~
twic
Smalltalk-80 did this too. Later implementations try to avoid it, but
sometimes still do:

http://www.wirfs-brock.com/allen/things/smalltalk-things/efficient-implementation-smalltalk-block-returns

------
lostmsu
A goroutine's direct counterpart in Java is Runnable, not Thread.

------
exabrial
OK, there's a technical gaffe somewhere, in my understanding. I regularly have
hundreds of threads running in JVMs with between 128MB and 384MB of heap. I'm
guessing the stack isn't kept on the heap?

------
dekhn
I'm not really a Java user, but I vaguely recall that Java started with green
threads to deal with this issue, long before you could just run an absurd
number of threads in the OS without issues.

------
amelius
Would it be possible/efficient to write an OS in Go, and use its scheduler to
schedule threads living in the implemented OS? What if you ran Go in that OS?

~~~
Yetanfou
https://news.ycombinator.com/item?id=18171447

"The paper contributes Biscuit, a kernel written in Go that implements enough
of POSIX (virtual memory, mmap, TCP/IP sockets, a logging file system, poll,
etc.) to execute significant applications. Biscuit makes liberal use of Go's
HLL features (closures, channels, maps, interfaces, garbage collected heap
allocation), which subjectively made programming easier. The most challenging
puzzle was handling the possibility of running out of kernel heap memory;
Biscuit benefited from the analyzability of Go source to address this
challenge."

------
cjmartinkoski
Kotlin coroutines can do it on the JVM. Goroutines and JVM threads are
different things.

------
mbfg
Who creates 10k threads, really? The hypothesis seems like a straw man.

------
RickJWagner
This is why I like Hacker News. Thanks go to the author, good stuff!

------
exabrial
Has anyone considered using virtualization instructions to assist with
managing page tables in the JVM?

------
the_other_guy
Creating millions of coroutines or "goroutines" isn't really interesting by
itself. I believe the real questions for me are:

1\. Is Go's scheduler better than the Linux scheduler if you have a thousand
concurrent goroutines or threads?

2\. Is creating goroutines really that much faster than creating native
threads?

3\. Are goroutines relevant for long-lived and computationally intensive tasks
or just for short-lived I/O tasks?

4\. What is the performance of using channels between goroutines compared to
native threads?

tbh, I have read several respected articles that criticize Go's goroutines in
terms of performance, and I am not really sure that Go's main virtue, which
imho is simple concurrency, is performant at all.

~~~
weberc2
These are good questions and I'd be interested in more authoritative answers,
but I'll hazard guesses.

1\. Go's scheduler is better at scheduling Go. There are well defined points
at which a context switch can occur, and the scheduler is in userland so you
never really have expensive context switches.

2\. Not sure. I would guess so, if only because it saves a couple context
switches.

3\. Goroutines are absolutely relevant for both. The stdlib webserver is a
high-performance, production-grade implementation and it forks a goroutine for
each request, and applications frequently fork more goroutines within a
request handler.

4\. I assume you're comparing Go channels to pipes? In which case the answer
is probably always "faster", but how much faster depends on the size of the
thing you're transferring, since IPC requires serialization and
deserialization while in Go the overhead is just some locking and a pointer
copy (see the sketch below).

5\. Go has dynamically-sized stacks. While you can change stack size on Linux
(to use very small stacks and thus spin up more threads), I don't think you
can change the size of an individual stack. Plus with Go, you don't have to
configure stack size at all; it works out of the box on every platform.

For me, that Go's goroutines are performant is just gravy. I like that they're
a platform agnostic concurrency abstraction. I don't have to deal with sync vs
async ("What color is your function"; which incidentally cost me the last 1.5
workdays tracking down a performance bug in Python) or posix threads vs
Windows threads.
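
(A minimal sketch of the pointer-copy point from item 4; the message type is
hypothetical:)

    package main

    import "fmt"

    type msg struct {
        payload []byte
    }

    func main() {
        ch := make(chan *msg, 1)

        // Sending transfers only a pointer under the channel's lock;
        // nothing is serialized, unlike writing the payload into a pipe.
        ch <- &msg{payload: []byte("hello")}
        m := <-ch
        fmt.Println(string(m.payload))
    }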

~~~
ridiculous_fish
How does Go handle non-async syscalls these days? If you spawn a thousand
goroutines which all call stat(), does this still spawn a thousand kernel
threads?

This took some of the bloom off the rose for me. For each goroutine I had to
predict ahead-of-time whether it would require a kernel thread, and if so send
it through a rate limiter. Effectively it was sync-vs-async but hidden.

~~~
denormalfloat
Yes, it spawns a kernel thread if all the threads are currently blocked. I ran
into this issue running Go on a FUSE mounted file system. If the network hung,
Go would spawn thousands of threads and OOM. Go can't _not_ spawn them, as
that would risk deadlock.

~~~
toast0
(disclaimer: I know very little about Go)

An expected way to manage this would be to have a limit on the number of
threads running blocking syscalls. It's up to you how to do it (a thread pool
for syscalls, any thread can run a syscall but acquires a semaphore first,
etc.); see the sketch below.

I don't think there's a danger of deadlocks here -- your blocking syscalls
shouldn't be dependent on other goroutines. Eventually the calls will succeed
or time out (one hopes) and the next call will commence.

In my experience, you can usually find a safe number of parallel syscalls to
run -- and it's often not very many.
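
(A minimal Go sketch of that rate-limiting idea, using a buffered channel as a
counting semaphore; the limit of 64 and the paths are arbitrary:)

    package main

    import (
        "fmt"
        "os"
        "sync"
    )

    func main() {
        sem := make(chan struct{}, 64) // at most 64 goroutines inside stat() at once
        var wg sync.WaitGroup

        paths := []string{"/etc/hosts", "/etc/passwd", "/tmp"}
        for _, p := range paths {
            wg.Add(1)
            go func(path string) {
                defer wg.Done()
                sem <- struct{}{}        // acquire: blocks if 64 calls are in flight
                defer func() { <-sem }() // release
                if fi, err := os.Stat(path); err == nil {
                    fmt.Println(path, fi.Size())
                }
            }(p)
        }
        wg.Wait()
    }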

~~~
ridiculous_fish
Imagine 1000 pipes, each of which has a goroutine calling write() and another
calling read(). If you schedule all of the writers and none of the readers (or
vice versa) you'll get a deadlock.

~~~
leaveyou
AFAIK when a goroutine "calls write" (on a channel, socket, stream, mutex,
etc.) and the write "is blocked", it yields execution and the scheduler can
activate something else, which can be a reader goroutine (after all the
writers are blocked, for example). So there's no deadlock as long as you have
at least one reader for any writer.

~~~
gpderetta
That requires the underlying syscall to support an async or non-blocking mode
though. Disk I/O or stat doesn't on Linux, for example. The usual alternative
is some sort of background pool for blocking tasks (which adds non-trivial
overhead) or, where supported by the OS, scheduler activations.

~~~
leaveyou
You are right; see "Synchronous System Calls" in
https://www.ardanlabs.com/blog/2018/08/scheduling-in-go-part2.html

------
edoo
I imagine there are actual use cases for it, but what are programs doing that
they spin up thousands of threads? I've never seen a program like that.
Whatever happened to worker thread pools?

~~~
tschellenbach
Well, say you're building a microservice that needs to handle a high number of
concurrent requests. A server running Go will handle ~100x the traffic of a
Python or Ruby app. Your mileage will vary depending on the use case, but the
difference is huge.

~~~
weberc2
Found out yesterday that even with async in Python someone can sneak in some
sync code and totally ruin your day. All you know is your performance is
plummeting all across your system and CPU is spiky as hell. Took me most of
the day to realize the issue. From now on I’ll start every performance
investigation with ‘grep requests’. Problems like this simply don’t exist in
Go.

------
kolderman
Good luck with hundreds of JVM threads.

~~~
marsdepinski
I've easily run thousands with -Xss at 128kB. It wasn't a great implementation
and we reduced it to a nice thread pool of 400, but it worked quite well.

Threads are quite useful in many situations and have great performance
characteristics. It comes down to people knowing the language they are using
and using it correctly.

------
lovelle
It makes no sense to compare goroutines with OS threads; they are totally
different things, like comparing apples with oranges. Of course both of them
can be used to solve the same problem, but they are totally different. Like
others have mentioned, it is well known that a thread-per-connection solution
does not scale well; a much better approach is having a fixed-size pool of
threads and using an asynchronous, non-blocking, event-driven model. Actually,
at our company we did some benchmarks measuring Go (1.9, with well-known HTTP
servers) against Java 8 with Netty, and the latter always won.

~~~
anothergoogler
What does "won" mean here? Better throughput? Lower latency? Less variance in
latency? Better efficiency? re: the last one, I'd be shocked if Netty could
match Go stdlib http in memory usage even if you benchmarked higher
throughput. I'd expect the Go server to have lower variance too due to its GC
design.

~~~
lovelle
Yes, that sentence is quite vague and doesn't give proper info about the
topic; you'll just have to believe me or not, as I'm not writing a blog post
about this, sorry.

It means higher throughput with lower jitter, which in our case was what we
were looking for at the time. In Go's defense, we were using
PooledByteBufAllocator to recycle output streams, but memory usage was not a
concern in the benchmark so long as GC didn't affect throughput. But I also
believe that Go's http stdlib memory usage is better than Netty's; sorry for
not being more helpful with this topic.

~~~
anothergoogler
Thanks for going into more detail; I'm interested that you got lower jitter.
My experience with high-performance JVM networking stacks is that they achieve
incredible throughput but are quite memory-hungry (with a hungry baseline too)
and that the 99th percentile can be not so great due to GC. I'll have to give
Netty another look. I'd be interested in the RSS of each process after warmup.

~~~
lovelle
In my experience, tuning a JVM application to gain better throughput is far
from a trivial job; the JVM ecosystem requires a much more experienced
developer on the platform to reach certain levels, compared to Go's
simplicity, which makes really high-performance solutions easier. For
instance, deciding on an appropriate concurrent algorithm and measuring it,
e.g. using AtomicInteger vs a per-thread counter, is not easy (especially in
more complicated scenarios).

We have high-throughput critical microservices in both Java 8 and Go, with
excellent 99th percentiles, and in general memory is not a concern (as long as
we fine-tune the GC and don't have any memory leaks). Generally, for the
really critical and portable solutions we choose Java over Go (unit testing
and library versatility are big players in this discussion).

