
Cost of a thread in C++ under Linux - eaguyhn
https://lemire.me/blog/2020/01/30/cost-of-a-thread-in-c-under-linux/
======
drmeister
Threads are very expensive if you start throwing C++ exceptions within them in
parallel: the overall time to join the threads increases with each thread you
add. There is a mutex in the unwinding code, and as the threads grab the mutex
they invalidate each other's cache line. I wrote a demo to illustrate the
problem: [https://github.com/clasp-developers/ctak](https://github.com/clasp-developers/ctak)

MacOS doesn't have this problem, but Linux and FreeBSD do.
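
A minimal sketch of the effect the demo measures (my own code, not taken from
the ctak repo): each thread throws and catches in a tight loop, so every
iteration goes through the unwinder, and per the comment above the total wall
time on Linux should climb as threads are added rather than hold steady.

    // Hedged sketch, not from the ctak repo: each thread throws/catches in a
    // loop; on glibc/libgcc the unwinder serializes threads on a global lock,
    // so adding threads increases total wall time instead of keeping it flat.
    #include <chrono>
    #include <cstdio>
    #include <cstdlib>
    #include <thread>
    #include <vector>

    static void thrower(int iterations) {
        for (int i = 0; i < iterations; ++i) {
            try {
                throw 42;  // every throw takes a trip through the unwinder
            } catch (int) {
                // swallow and repeat
            }
        }
    }

    int main(int argc, char** argv) {
        const int nthreads = argc > 1 ? std::atoi(argv[1]) : 4;
        const auto start = std::chrono::steady_clock::now();
        std::vector<std::thread> threads;
        for (int t = 0; t < nthreads; ++t)
            threads.emplace_back(thrower, 100000);
        for (auto& th : threads)
            th.join();
        const auto us = std::chrono::duration_cast<std::chrono::microseconds>(
                            std::chrono::steady_clock::now() - start).count();
        std::printf("%d threads: %lld us\n", nthreads, (long long)us);
    }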

~~~
gumby
There’s an easy optimization to avoid inspecting every frame when unwinding,
one which C++ could not implement (for policy reasons) though a platform
could: add to the frame setup a pointer to the next frame that needs
unwinding. This is like move elision.

If my caller has destructors to run or a catch clause, this pointer is null
and inspection proceeds as normal. If it does not, it stores the value from
_its_ frame there. Then if I throw an exception I jump to the next frame that
needs inspection; if I don’t, then any throw further down the call stack won’t
even look at me.

The C++ standard can’t call for this because of the “zero cost if you don’t
use it” rule. But a Linux ABI could. MacOS takes advantage of this kind of
freedom.
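
A toy model of the skip-pointer idea, with one simplification: instead of the
null-pointer convention described above, each frame here directly stores the
nearest frame that needs inspection. All names are invented for illustration;
real unwind data lives in compiler-generated tables, not structs like this.

    // Toy model of the skip-pointer scheme; Frame and every name here are
    // invented for illustration. Each frame records the nearest frame that
    // actually needs inspection, so a throw hops straight between those
    // frames and never looks at the ones in between.
    #include <cstdio>

    struct Frame {
        const char* name;
        bool needs_inspection;   // has destructors to run or a catch clause
        Frame* next_to_inspect;  // nearest enclosing frame needing inspection
    };

    // "Frame setup": if the caller needs inspection, point at it; otherwise
    // copy the caller's own pointer, which skips the caller entirely.
    Frame make_frame(const char* name, bool needs_inspection, Frame* caller) {
        Frame* next = nullptr;
        if (caller)
            next = caller->needs_inspection ? caller : caller->next_to_inspect;
        return Frame{name, needs_inspection, next};
    }

    int main() {
        Frame a = make_frame("a (catch)", true, nullptr);
        Frame b = make_frame("b", false, &a);
        Frame c = make_frame("c (dtors)", true, &b);
        Frame d = make_frame("d", false, &c);
        Frame e = make_frame("e", false, &d);

        // A throw in e visits only c and a; b, d, and e itself are skipped.
        for (Frame* f = e.next_to_inspect; f; f = f->next_to_inspect)
            std::printf("inspect %s\n", f->name);
    }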

~~~
ajross
To be fair: very few C++ applications are limited by exception performance;
it's a feature that's very much out of favor at the moment. So penalizing
everyone else (even though most new code doesn't use exceptions, it's not at
all uncommon to find projects with exception generation enabled for the
benefit of a library or two) to make parallel exceptions faster actually seems
like a bad trade to me in the broad sense.

Apple does indeed have more freedom, and it may be that specific MacOS
components need this in ways that the general community doesn't seem to. But
I'd want to see numbers from a bunch of real world environments before
declaring this a uniformly good optimization.

~~~
gumby
> To be fair: very few C++ applications are limited by exception performance,
> it's a feature that's very much out of favor at the moment.

It is distressing that Sutter's survey showed that half the respondents had to
disable exceptions for part or all of their code. I've often heard the
argument "well, Google's coding standard prohibits exceptions", which is
bizarre, as Google's standard says "exceptions are great, but we have some
legacy code that can't use them, so we're stuck".

The biggest argument seems to be that they are expensive, which is crazy
because there's no cost if you don't raise one, and if you do, you're already
in trouble and generally have plenty of time to deal with it. (This is
different from, say, Lisp signalling, which not only permits continuing (!)
but is in theory supposed to be common; probably a mistake in retrospect.) But
exceptions allow you to make the uncommon stuff uncommon (as opposed to error
codes, which must be sprayed like shrapnel through your code).

There are two legitimate arguments against exceptions. One is when you are
constrained in space (e.g. embedded systems) and/or time (hard realtime
systems that need predictable timing, even if it is slower). The other is a
philosophical argument that they embody a second, parallel flow of control.
Since C++'s exception system is an error system only, and since destructors
are run automatically, it's hard for me to find this second argument
convincing.

~~~
pjmlp
Sometimes I miss C++'s flexibility from the managed languages that I usually
use; then I remember that the community is now driven by the
performance-at-all-costs crowd, without exceptions, RTTI, or the STL, and let
that thought go.

That is not the C++ I enjoy using. The language I got to love via Turbo
Vision, OWL, VCL, MFC, and Qt is not what drives the language nowadays.

~~~
gumby
I wouldn’t characterize that group as “the community”. True, there are a lot
of such people, mostly clustered in the game industry, where superstition is
rife.

Take a look at C++ (or C++20!) as if it were a brand new language you’d never
seen before, forgetting that its name includes “C”. _That_ language is pretty
clean, expressive, and straightforward, IMHO. I like programming in it.

I’m not claiming it’s unicorns farting rainbows, but it’s definitely pretty
good.

~~~
pjmlp
If the community wasn't busy discussing those issues, and constexpr of all
things, we would already have reflection, along with a concurrency and
networking story that isn't put to shame by what Java 5 already had, let alone
by modern managed languages.

Yeah, if everyone plays ball, it might come five years from now, assuming
C++23 gets done on time, plus compiler-support stabilization.

Right now SG14 seems to drive some of those decisions, at least seen from the
outside.

~~~
gumby
Those are important issues, and people who care about them work on them and
come to committee meetings. There is less consensus on the concurrency and
networking side, which I also find frustrating, but as I’m not pushing those
balls forward I can’t complain. I do think at least that the direction they’re
moving in is a fruitful one.

The standard can move quickly: consider formatted output, which lingered
unchanged with a broken model but was rapidly reformed when someone with a
good model _and_ an implementation was encouraged to come forward. Admittedly
a smaller topic than concurrency or networking!

~~~
pjmlp
Right now, the way I see it, I'd rather help the managed languages I work on
reach the point where binding to C++ is a last resort, used only when nothing
else helps.

The other language communities manage to drive language progress over the
Internet, something ISO apparently has yet to come to grips with.

~~~
gumby
C++ does this as well, for example with Boost, where several things that
entered the standard got their start. And the format example I gave. It’s the
ISO blessing that is complex, but that also acts as a forcing function to make
new features as orthogonal as possible. Sure, it’s not to everybody’s taste,
but you don’t need to follow ISO if you don’t wish to.

------
boulos
I find Eli Bendersky’s writeup [1] more useful, as it actually goes closer to
the details. For readers less familiar, it also makes clearer what the time
spent will depend on (how much state there is to copy). Eli’s post is actually
a sub-post of his “cost of context switching” post [2], which is more often
applicable (and helps answer all the questions below about thread pools).

[1] [https://eli.thegreenplace.net/2018/launching-linux-
threads-a...](https://eli.thegreenplace.net/2018/launching-linux-threads-and-
processes-with-clone/)

[2] [https://eli.thegreenplace.net/2018/measuring-context-
switchi...](https://eli.thegreenplace.net/2018/measuring-context-switching-
and-memory-overheads-for-linux-threads/)

~~~
amiga_500
Thanks, both of these were excellent.

------
bluetomcat
For CPU-bound tasks, it is best to pre-create a number of threads roughly
corresponding to the number of logical execution cores. Every thread is then a
worker with a main loop, rather than something spawned on demand. Pin each
thread's affinity to a specific core and you are as close as possible to the
“perfect” arrangement, with minimized context switches and core-local cache
data in place most of the time.
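
A minimal sketch of that arrangement, assuming Linux/glibc
(pthread_setaffinity_np on a std::thread's native_handle); the empty lambda
stands in for a real worker loop pulling from a per-core queue.

    // Sketch of a pinned worker-per-core pool; Linux/glibc assumed for
    // pthread_setaffinity_np. The lambda body is a stand-in for a real
    // worker main loop that pops tasks from a per-core queue.
    #include <pthread.h>
    #include <sched.h>
    #include <thread>
    #include <vector>

    int main() {
        const unsigned n = std::thread::hardware_concurrency();
        std::vector<std::thread> workers;
        for (unsigned core = 0; core < n; ++core) {
            workers.emplace_back([] {
                // ... worker main loop: pop task, run it, repeat ...
            });
            cpu_set_t set;
            CPU_ZERO(&set);
            CPU_SET(core, &set);  // pin this worker to one logical core
            pthread_setaffinity_np(workers.back().native_handle(),
                                   sizeof(set), &set);
        }
        for (auto& w : workers) w.join();
    }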

~~~
hinoki
One thing to worry about is that you’re effectively taking over the job of the
OS scheduler. This can be a good thing since you know more about your workload
than the generic heuristics the scheduler uses, but it also means that you
might need to reimplement some things.

Like only scheduling work on logical cores that share a physical core after
all physical cores have a busy logical core (i.e. fill up the even cores
first).

~~~
jandrewrogers
Taking over the job of the OS scheduler is explicitly the reason for doing it,
there are some classes of macro-optimization that have this as a prerequisite.
It is done for the same reasons that high-performance database kernels replace
the I/O scheduler too.

To your point, it is a double-edged sword. Writing your own scheduler requires
a much higher degree of sophistication than using the one in the OS. It is a
skill that takes a long time to develop and requires a lot of first-principles
thinking; there is loads of subtlety, and you can't just copy something you
found on a blog. It also isn't just about being able to predict the behavior
of your workload better than the OS: you can also adapt your workload to the
scheduler state, since it is exposed to your application, the latter being a
greatly overlooked capability.

Once you know how to design software this way, it not only generates large
increases in throughput but also enables many elegant solutions to difficult
software design problems that simply aren't possible any other way. While the
learning curve is steep, once you are accustomed to writing software this way
it becomes pretty mechanical.

~~~
amiga_500
Any reading recommendations?

------
shin_lao
Great reminder.

Even if you pre-create a thread (thread pool), when the task is small enough
(less than 1,000 cycles), it is less expensive to do it in place (for example,
with fibers), because of the cost of context switching.
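
A sketch of that rule of thumb; the helper name and the dispatch to std::async
(standing in for a real pool or fiber scheduler) are my own, and tasks are
assumed to return void.

    // Hypothetical helper illustrating the threshold above: tasks estimated
    // under ~1,000 cycles run in the calling thread; larger ones are handed
    // off (std::async stands in here for a real pool or fiber scheduler).
    #include <cstddef>
    #include <future>
    #include <type_traits>
    #include <utility>

    template <typename F>
    std::future<void> submit(F&& task, std::size_t estimated_cycles) {
        static_assert(std::is_void_v<std::invoke_result_t<F>>,
                      "sketch assumes void tasks");
        constexpr std::size_t kInlineThreshold = 1000;  // cycles, per above
        if (estimated_cycles < kInlineThreshold) {
            std::promise<void> done;
            task();  // cheaper than any cross-thread handoff
            done.set_value();
            return done.get_future();
        }
        return std::async(std::launch::async, std::forward<F>(task));
    }

    int main() {
        auto f1 = submit([] { /* ~200-cycle task: runs inline */ }, 200);
        auto f2 = submit([] { /* big task: runs on another thread */ }, 50000);
        f1.wait();
        f2.wait();
    }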

~~~
iforgotpassword
Agreed. A few years ago I noticed that a C program we used in production
spawned a new thread for each incoming connection. Since the vast majority of
these just served two small requests (think two HTTP GETs), I tried adding a
very simple thread pool that would keep up to four idle threads around. To
make a thread wait for work I used an eventfd (Linux). I tried a linked list
and an array for the idle threads. I tried protecting the get/return code with
a mutex and a spinlock, and then made it lock-free with C11's atomics. Two
days later I still couldn't get this to be faster than just spawning a new
thread every time, so I gave up on the experiment.

It seems at least the Linux folks have optimized the crap out of clone() over
the years.
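
The core of the eventfd parking described above might look roughly like this
(Linux only, error handling omitted, names mine):

    // Rough sketch of eventfd-based thread parking (Linux only; error
    // handling omitted). The worker blocks in read() until the dispatcher
    // write()s; EFD_SEMAPHORE makes each read consume exactly one wakeup.
    #include <sys/eventfd.h>
    #include <unistd.h>
    #include <cstdint>
    #include <cstdio>
    #include <thread>

    int main() {
        const int efd = eventfd(0, EFD_SEMAPHORE);

        std::thread worker([efd] {
            uint64_t n;
            read(efd, &n, sizeof(n));  // park here until work arrives
            std::printf("worker woke up\n");
        });

        const uint64_t one = 1;
        write(efd, &one, sizeof(one));  // hand the worker one unit of work
        worker.join();
        close(efd);
    }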

~~~
frankzinger
Thread spawning should only become a problem for very high concurrent client
counts (i.e., large N). How many concurrent clients did this program have?

When I ran benchmarks to compare a thread-per-client model to a single-
threaded, event-based one, the single-threaded throughput was around 2 to 3
times higher for as few as 1000 clients.

~~~
iforgotpassword
Not many, slightly above 800 on average, IIRC. It wasn't even a bottleneck or
anything; I was just curious how much impact it would make. That's also why I
didn't bother trying to change it to event-based with a single thread, or one
thread per CPU core: that would've been way more work than two afternoons, so
it just wasn't justified.

------
hrgiger
Pinning with taskset, my numbers improve:

    $ taskset --cpu-list 8 ./costofthread     avg: ~11000
    $ taskset --cpu-list 8,11 ./costofthread  avg: ~33000
    $ ./costofthread                          avg: ~60000

------
saagarjha
Is a std::thread a thin wrapper around pthreads on Linux?

~~~
foo101
A related question if anyone knows good answers here.

What programming languages' de-facto thread implementations are not wrappers
around pthreads? I think Go has its own thread implementation? Or am I
mistaken?

~~~
signa11
Erlang has its own process/thread implementation with, IIRC, 64 bytes per
process.

~~~
toast0
The docs [1] say:

> A newly spawned Erlang process uses 309 words of memory in the non-SMP
> emulator without HiPE support. (SMP support and HiPE support both add to
> this size.)

And a word is the native register size, 4 or 8 bytes these days, so 309 words
is roughly 1.2 to 2.5 KB: fairly small, but not 64 bytes small.

[1]
[http://erlang.org/doc/efficiency_guide/processes.html](http://erlang.org/doc/efficiency_guide/processes.html)

------
known
On any architecture, you may need to reduce the amount of stack space
allocated for each thread to avoid running out of virtual memory.
[http://www.kegel.com/c10k.html#limits.threads](http://www.kegel.com/c10k.html#limits.threads)

~~~
CJefferson
Is this even possible on a 64-bit architecture? The default stack size is, I
think, 2 MB, and I have previously allocated terabytes of VM space without
issues.

~~~
wbkang
No, this is more of a 32-bit issue.

------
isatty
Why is there such a big difference in timing between Skylake and Rome?
Something compiler-specific? The number of steps required to create a thread
should be identical.

I’d also be interested to see the same benchmark using pthread_create
directly.
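
A minimal version of that direct pthread_create measurement might look like
this (iteration count arbitrary):

    // Minimal direct pthread_create/pthread_join timing, analogous to the
    // article's std::thread benchmark; the iteration count is arbitrary.
    #include <pthread.h>
    #include <chrono>
    #include <cstdio>

    static void* noop(void*) { return nullptr; }

    int main() {
        constexpr int kIterations = 1000;
        const auto start = std::chrono::steady_clock::now();
        for (int i = 0; i < kIterations; ++i) {
            pthread_t t;
            pthread_create(&t, nullptr, noop, nullptr);
            pthread_join(t, nullptr);
        }
        const auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(
                            std::chrono::steady_clock::now() - start).count();
        std::printf("avg %lld ns per create+join\n",
                    (long long)(ns / kIterations));
    }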

~~~
thedance
Could be as basic as clock speed differences.

------
maayank
Why the relatively high cost of threads on ARM? If anything, I'd imagine it is
more geared towards "massively parallel" scenarios (i.e. dozens of cores).

------
Koshkin
Intel’s excellent TBB library is the answer to all your worries about threads
in C++. (IMHO it should be made part of the standard library.)

~~~
ncmncm
All your worries, if throughput is all you worry about, and not latency. Or if
you have interaction between threads. Or if you might need to run on other
archs.

An equivalent to TBB or GCD will be in the C++23 standard library, but you can
often do better with coroutines in C++20.

TBB and GCD still need to synchronize sometimes, and they randomize workload
assignment, which is bad for cache locality (i.e. bad). If you can arrange
static assignment and avoid the need to synchronize, you can do better,
sometimes much better.

~~~
Koshkin
> _other archs_

See, for instance, [https://www.theimpossiblecode.com/blog/intel-tbb-on-
raspberr...](https://www.theimpossiblecode.com/blog/intel-tbb-on-raspberry-
pi/).

~~~
ncmncm
Interesting, thank you.

------
signa11
IMHO, if the _cost_ of thread creation is where your bottleneck is, then more
likely than not you are doing things wrong.

~~~
Ensorceled
This is just another way of saying what the article just said.

------
brainscdf
My personal best practice is to always create a thread pool at program startup
and distribute tasks across it. I follow the same practice in all other
languages too. Is this sound, or can it lead to problems in some corner cases?

~~~
hinoki
There are lots of details that might cause problems:

* Do your tasks block? How many threads do you need to make sure you can use all your CPUs? (See the sizing sketch below.)

* Do your tasks access different sets of memory? Would keeping similar tasks on the same CPUs reduce cache misses?

* Do your tasks have different priorities? You might need a pool for each priority.

For a UI program that isn’t doing anything really intensive or real-time,
having a common thread pool makes a lot of sense. It can reduce resource use
(stacks add up once you get to many 10s or 100s of threads...) and improve
latency (a work queue with many threads will get more CPU than another with
the same amount of work but fewer threads).
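
On the first bullet, one common sizing rule of thumb (often attributed to
Brian Goetz, not from this thread) scales the pool by how long tasks block
relative to how long they compute:

    // Rule-of-thumb pool sizing: threads ~= cores * (1 + wait/compute).
    // The wait/compute ratio below is an assumed example value.
    #include <cstdio>
    #include <thread>

    int main() {
        const unsigned cores = std::thread::hardware_concurrency();
        const double wait_over_compute = 4.0;  // assume tasks block 4x longer
                                               // than they compute
        const unsigned pool_size =
            static_cast<unsigned>(cores * (1.0 + wait_over_compute));
        std::printf("suggested pool size: %u threads\n", pool_size);
    }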

~~~
londons_explore
Case in point:

I used nodejs for a project, and assumed that "it's all javascript on one
thread" would leave threading issues behind.

My application curiously stopped responding whenever I had 5 or more users.
Connected users could continue to do anything, but new users couldn't connect,
and existing users' sessions would hang when executing any code that wrote to a
logfile, making debugging even harder. Using the nodejs debugger, the
internals of write(...., cb) were just never calling the done callback.

After hours of head scratching I found that most IO from nodejs is _not_
asynchronous and callback-based as the docs suggest, but is in fact blocking
IO done from worker threads. My process was using pipes to communicate with
other processes; those pipes were doing blocking writes, and when a pipe
blocked, its worker thread was blocked too.

There are 4 worker threads by default, so whenever 5 users were using the
system, all worker threads were tied up and it would fail. It would have been
nice for nodejs to at least have printed to the console "All worker threads
busy for >1000ms. See nodejs.com/troubleshooting/blockingfileio.htm" or
something.

~~~
dirtydroog
As far as I'm aware, node.js is a wrapper over libuv, which is a truly
asynchronous socket IO library. It fakes async file IO with a thread pool
because on Linux file IO isn't async at all.

