1, because the goroutine context switch is insanely cheap. There's no preemption: the scheduler runs at function calls, and a context switch is very cheap - just swapping a few registers, I believe. This means a single thread can blow through an enormous number of goroutines in no time. I know kernel-mode switches are fast, and we optimize them all the time, but doing nothing in place of doing something will always be inordinately cheaper. (There's a rough sketch of measuring this after point 2.)
2, because the Go scheduler has awareness of goroutine state that an OS scheduler does not, so it can make more intelligent decisions about which goroutines to wake up and when.
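To get a rough feel for the hand-off cost, here's a minimal ping-pong sketch of my own (not from the benchmark); the number it prints is machine-dependent and only informally comparable to pthread switch costs.

    // Rough, informal measurement of goroutine switch cost via a
    // channel ping-pong. Not a rigorous benchmark; numbers vary by machine.
    package main

    import (
        "fmt"
        "time"
    )

    func main() {
        const rounds = 1_000_000
        ping := make(chan struct{})
        pong := make(chan struct{})

        go func() {
            for range ping {
                pong <- struct{}{}
            }
        }()

        start := time.Now()
        for i := 0; i < rounds; i++ {
            ping <- struct{}{} // hand off to the other goroutine
            <-pong             // and get handed back
        }
        elapsed := time.Since(start)
        close(ping)

        // Each round is two hops, so divide by 2*rounds.
        fmt.Printf("~%v per goroutine hand-off\n", elapsed/(2*rounds))
    }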
You can have pretty much as many goroutines as you want. Hundreds of thousands of OS threads, on the other hand, land you firmly in kernel-settings-tweaking territory. The thread max on my desktop is just 127009 - that wouldn't fly for a huge machine running many Go apps, which is exactly the kind of situation I was in (using Kubernetes, to be exact).
Completely true, but in a realistic small-workload situation with a 0.5ms response time, the pthread context switch is already only 2usec, or 0.4% of the total. Goroutines could be infinitely faster and still not meaningfully improve overall performance.
This is why thread-per-request servers like jlhttp are right up there with fasthttp etc. in terms of total throughput.
Something notable is that jlhttp and fasthttp are both using worker pools. fasthttp uses worker pools of goroutines, and jlhttp uses traditional thread pooling.
Thread pooling is an effective way to improve webserver performance, and it generally works well; in these synthetic benchmarks you can't even really see a downside. In reality, a lot of these benchmarks only look this good because every request completes very quickly. I think if you added a random sleep() to the handlers, many of these servers would just die outright, because they can't handle that much concurrency and end up blocked waiting for free workers (see the sketch below). You might think that's unrealistic, but consider that many people have Go servers making RPCs all over the place, where a genuinely large share of the time is just spent waiting on other RPCs. It's a real thing!
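As an illustration of that failure mode (my own sketch, not how fasthttp or jlhttp are actually built): with a fixed-size pool, once every worker is parked in a slow downstream call, new requests simply queue up behind them.

    // A minimal fixed-size worker pool. If every worker is stuck in a
    // slow "RPC" (simulated with a sleep), new requests back up waiting
    // for a free worker.
    package main

    import (
        "fmt"
        "sync"
        "time"
    )

    func main() {
        const workers = 4
        jobs := make(chan int) // unbuffered: senders block until a worker is free

        var wg sync.WaitGroup
        for w := 0; w < workers; w++ {
            wg.Add(1)
            go func(id int) {
                defer wg.Done()
                for job := range jobs {
                    // Pretend each request spends most of its time
                    // waiting on a downstream RPC.
                    time.Sleep(100 * time.Millisecond)
                    fmt.Printf("worker %d finished job %d\n", id, job)
                }
            }(w)
        }

        start := time.Now()
        for i := 0; i < 20; i++ {
            jobs <- i // blocks once all 4 workers are busy sleeping
        }
        close(jobs)
        wg.Wait()
        // 20 jobs / 4 workers * 100ms is about 500ms, versus roughly 100ms
        // if every request simply got its own goroutine.
        fmt.Println("total:", time.Since(start))
    }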
And if the world were just responding to HTTP requests, something like goroutines would obviously be overkill. But one of my favorite uses of goroutines was implementing a messaging server where each consumer, queue, and exchange was its own goroutine. I was inspired by RabbitMQ for the design but unfortunately couldn't use it for this use case. Luckily goroutines worked really great here and I was able to scale the thing up hugely (roughly the shape sketched below). To me this is where they really shine: they're super flexible. They work pretty well for short-lived HTTP requests, but they're also great for entirely different and more complicated use cases.
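The shape I mean is something like this - a sketch only, with hypothetical names, not the original server's code: each queue is a goroutine that owns its state and talks over channels, and each consumer is its own goroutine.

    // Goroutine-per-entity sketch: a queue goroutine owns its messages,
    // consumers are separate goroutines, and everything communicates
    // over channels. Illustrative only.
    package main

    import (
        "fmt"
        "sync"
    )

    // queue owns its message flow; nobody else touches it directly.
    // A real broker would buffer, route, and ack here.
    func queue(in <-chan string, out chan<- string) {
        for msg := range in {
            out <- msg
        }
        close(out)
    }

    // consumer just drains its subscription.
    func consumer(id int, sub <-chan string, wg *sync.WaitGroup) {
        defer wg.Done()
        for msg := range sub {
            fmt.Printf("consumer %d got %q\n", id, msg)
        }
    }

    func main() {
        in := make(chan string)
        out := make(chan string)

        go queue(in, out)

        var wg sync.WaitGroup
        wg.Add(1)
        go consumer(1, out, &wg)

        for _, m := range []string{"hello", "world"} {
            in <- m
        }
        close(in)
        wg.Wait()
    }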
Looking back at the benchmark, one of the more interesting approaches here is go-prefork[1], which works by spawning n/2 executables with 2 threads each. I can only imagine the optimal number of threads was complicated to determine and maybe even has something to do with hyperthreading. The advantage, of course, is the reduced amount of shared state, which means less contention, and that does show up in the benchmark. In this setup it looks weird because there's no load balancer or anything in front (it could be something as simple as a few iptables rules). In practice it's much like running separate instances on the same box, which is also reasonable, and something I've done myself when scheduling servers (rough sketch of the technique below). Oddly, I don't think they tried the same thing with fasthttp.
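I haven't dug into go-prefork's source, but a common way to get this shape is SO_REUSEPORT: every child binds the same port and the kernel spreads connections across them, which would also explain why nothing sits in front. A rough sketch under that assumption (Linux, golang.org/x/sys/unix; this is my guess at the general technique, not go-prefork's actual code):

    // Prefork sketch: the parent re-execs itself n/2 times, and each
    // child opens its own SO_REUSEPORT listener on the same port.
    package main

    import (
        "context"
        "flag"
        "net"
        "net/http"
        "os"
        "os/exec"
        "runtime"
        "syscall"

        "golang.org/x/sys/unix"
    )

    var child = flag.Bool("child", false, "run as a worker process")

    // listenReusePort opens a TCP listener with SO_REUSEPORT set, so
    // several processes can bind the same port and let the kernel
    // distribute incoming connections between them.
    func listenReusePort(addr string) (net.Listener, error) {
        lc := net.ListenConfig{
            Control: func(network, address string, c syscall.RawConn) error {
                var serr error
                if err := c.Control(func(fd uintptr) {
                    serr = unix.SetsockoptInt(int(fd), unix.SOL_SOCKET, unix.SO_REUSEPORT, 1)
                }); err != nil {
                    return err
                }
                return serr
            },
        }
        return lc.Listen(context.Background(), "tcp", addr)
    }

    func main() {
        flag.Parse()

        if !*child {
            // Parent: re-exec ourselves n/2 times, then just sit there.
            // Real code would wait on the children and restart them.
            for i := 0; i < runtime.NumCPU()/2; i++ {
                cmd := exec.Command(os.Args[0], "-child")
                cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
                if err := cmd.Start(); err != nil {
                    panic(err)
                }
            }
            select {}
        }

        // Child: cap the scheduler at 2 CPUs (roughly the "2 threads each"
        // setup) and serve on the shared port.
        runtime.GOMAXPROCS(2)
        ln, err := listenReusePort(":8080")
        if err != nil {
            panic(err)
        }
        panic(http.Serve(ln, http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            w.Write([]byte("hello\n"))
        })))
    }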
I think what the benchmark also shows is how clever you don't have to be in Go to get good performance. go-postgres is routinely in the middle of the pack, and it's literally just using the standard library and goroutines in the most basic fashion - it's effectively unoptimized. And in reality, with more complex servers, the overhead is often low enough that it isn't worth your time to optimize much further.
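That "most basic fashion" is roughly this shape - a sketch, not go-postgres's actual code (the DSN, table, and driver choice are placeholders): net/http gives you a goroutine per connection, database/sql gives you a connection pool, and the handler just queries and writes.

    // The "no cleverness" baseline: standard library only.
    package main

    import (
        "database/sql"
        "fmt"
        "log"
        "net/http"

        _ "github.com/lib/pq" // hypothetical choice of Postgres driver
    )

    func main() {
        db, err := sql.Open("postgres", "postgres://user:pass@localhost/db?sslmode=disable")
        if err != nil {
            log.Fatal(err)
        }

        http.HandleFunc("/count", func(w http.ResponseWriter, r *http.Request) {
            var n int
            if err := db.QueryRowContext(r.Context(), "SELECT count(*) FROM items").Scan(&n); err != nil {
                http.Error(w, err.Error(), http.StatusInternalServerError)
                return
            }
            fmt.Fprintln(w, n)
        })

        log.Fatal(http.ListenAndServe(":8080", nil))
    }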