
A Concurrency Cost Hierarchy
https://travisdowns.github.io/blog/2020/07/06/concurrency-costs.html
======
terrelln
I've implemented counters exactly like this. I actually migrated away from the
tls_counter to something like the cas_multi_counter.

The problem with the tls_counter is memory usage that scales with the number
of threads. In this example you need 16 bytes per thread because only one
counter is supported; in our case I believe we ended up with 48 bytes per
thread. So with 10,000 threads you can end up with >= 160 KB of data for every
counter, and 10,000 threads isn't that far-fetched if most are sleeping
(though not ideal). If you then have 10,000 counters, you're spending >= 1.6
GB of RAM on counters, where the cas_multi_counter would only be using 40.96
MB. read() also scales with the number of threads, so it can get pretty slow,
but that is secondary.

One optimization you can use to make the cas_multi_counter more memory
efficient is to share the backing memory. You have 4096 bytes, but are only
using the first 8 bytes of each 64-byte stripe, so you can multiplex 8
counters into the same 4096-byte storage: the first counter gets offset 0
within each line, the second offset 8, and so on. This only adds one
indirection in the hot operator++() path, and doesn't add any extra false
sharing as long as the 4096-byte buffer is 64-byte aligned. Construction and
destruction are more expensive, but not terribly so, and that is probably cold
anyway. Now you're at 512 bytes per counter, and the size is independent of
the cache-line size. If you can count the number of active CPUs, you can
dynamically size the buffer to have only as many cache lines as there are
CPUs, further saving memory.
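
To make the layout concrete, here is a rough sketch of the multiplexing idea,
under a few assumptions of mine: a 64-byte cache line, Linux's sched_getcpu()
for picking a stripe, and made-up names (StripePool, MuxCounter) rather than
anything from the article or our implementation.

    // Sketch: several counters share one 4096-byte, cache-line-aligned pool.
    // Linux-specific: sched_getcpu() may require _GNU_SOURCE.
    #include <atomic>
    #include <cstddef>
    #include <cstdint>
    #include <sched.h>

    constexpr size_t kCacheLine    = 64;
    constexpr size_t kPoolBytes    = 4096;
    constexpr size_t kStripes      = kPoolBytes / kCacheLine;                    // 64 lines
    constexpr size_t kSlotsPerLine = kCacheLine / sizeof(std::atomic<uint64_t>); // 8 counters

    struct alignas(kCacheLine) StripePool {
        std::atomic<uint64_t> slots[kStripes][kSlotsPerLine]{};
    };

    class MuxCounter {
        StripePool* pool_;  // shared backing storage: the one extra indirection
        size_t      slot_;  // this counter's 8-byte offset within each cache line
    public:
        MuxCounter(StripePool* pool, size_t slot) : pool_(pool), slot_(slot) {}

        void operator++() {
            // Pick a stripe by current CPU so increments from different cores
            // land on different cache lines (no extra false sharing).
            size_t stripe = static_cast<unsigned>(sched_getcpu()) % kStripes;
            std::atomic<uint64_t>& c = pool_->slots[stripe][slot_];
            // CAS loop in the spirit of cas_multi_counter; a plain fetch_add
            // would also be correct for a pure counter.
            uint64_t old = c.load(std::memory_order_relaxed);
            while (!c.compare_exchange_weak(old, old + 1,
                                            std::memory_order_relaxed)) {}
        }

        uint64_t read() const {
            uint64_t sum = 0;
            for (size_t s = 0; s < kStripes; s++)
                sum += pool_->slots[s][slot_].load(std::memory_order_relaxed);
            return sum;
        }
    };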

~~~
hinkley
I'm fairly sure that the spacing between the counters is there so that writes
on separate cores don't fight over cache-line coherency. I'm not sure how you
conclude that putting multiple counters into the same line avoids false
sharing; that looks like a textbook case of false sharing to me.

If memory is an issue, is your suggestion going to behave any better than
simply collapsing each counter down to 8 bytes per thread? Perhaps a bit, if
there's a wide disparity between counter frequencies and the independent
counters tend not to run simultaneously. Otherwise you now have contention
between mostly unrelated tasks, which are now interacting less subtly.

~~~
reitzensteinm
I believe parent means you can pack multiple unrelated counters intended for
the same core. The cache line is still uncontended, you're just storing more
in it.

If you're building these things up from scratch this should be obvious, but if
you're pulling in a library it's easy to not realize how wasteful it is being.

~~~
terrelln
Yeah that’s right. All the counters use the same indexing scheme, so all
accesses to the same cache line should be on the same core.

------
scott_s
The author correctly identifies the problems with sched_yield() [1], but then
uses it anyway. In my experience, using nanosleep() with an exponential
backoff and a threshold performs well and results in the kind of behavior
people want when they use sched_yield(). I think the difference is enough to
affect the results, so I’m surprised the author didn’t experiment with it.

[1] For an additional reference:
https://news.ycombinator.com/item?id=10926654
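
For anyone unfamiliar with the technique, a minimal sketch of what
"nanosleep() with an exponential backoff and a threshold" might look like is
below; the constants and the helper name (backoff_wait) are my own
illustrative choices, not code from the article or the linked thread.

    // Spin for a few attempts, then sleep with exponentially growing,
    // capped durations instead of calling sched_yield().
    #include <time.h>  // nanosleep, timespec

    inline void backoff_wait(unsigned attempt) {
        constexpr unsigned kSpinAttempts = 10;       // threshold before sleeping
        constexpr long     kMinSleepNs   = 1000;     // 1 us
        constexpr long     kMaxSleepNs   = 1000000;  // cap at 1 ms

        if (attempt < kSpinAttempts) {
            return;  // busy-wait briefly; a _mm_pause() here is common on x86
        }
        unsigned shift = attempt - kSpinAttempts;
        long ns = (shift < 10) ? (kMinSleepNs << shift) : kMaxSleepNs;
        if (ns > kMaxSleepNs) ns = kMaxSleepNs;
        timespec ts{0, ns};
        nanosleep(&ts, nullptr);  // actually blocks, unlike sched_yield()
    }

    // Typical use in a retry loop:
    //   for (unsigned attempt = 0; !try_acquire(); ++attempt)
    //       backoff_wait(attempt);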

~~~
BeeOnRope
Yes, because the article isn't about the best approaches for concurrent
counters, but rather the good, bad and downright ugly ways you can do
concurrency, and the associated costs. The counter is used as an illustrative
device across all the levels, so of course the really slow levels are going to
involve something you don't actually want to do!

~~~
scott_s
I'm not concerned with the counter itself, either. I'm concerned with the
implementation of the synchronization at that level. I'm afraid the solution
at that level is a strawman. There is a better solution for the "system call"
level.

I also don't think it's useful to think of this hierarchy as "good, bad and
ugly." Sometimes the problem you're solving will require being at one of these
levels, and you can't perform the optimizations or re-architecture required to
drop down a level. When that is the case, it's useful to know the best options
at that level.

~~~
BeeOnRope
As the article shows, there are basically the reasonable levels 0, 1, and 2,
which are all valid approaches depending on engineering tradeoffs. Levels 3,
4 and 5 are mistakes you want to avoid: so of course you aren't going to agree
with the implementation choices in those levels, because they are mistakes!
Those types of mistakes occur in practice along these lines (and in other
ways), so I think it is useful to show them.

Note that the "system call" level doesn't mean "the best level you can
implement with a system call", although admittedly perhaps the naming was
confusing. Rather, it means "an approach which implies that a system call is
made for every interaction". That is, it is almost always a mistake, since you
can generally avoid it.

The levels are named after the dominant factor in their cost.

~~~
scott_s
Are you the author?

As I said, sometimes you _need_ to be at those levels for reasons outside of
your control. When you are at those levels, it's useful to know the best thing
to do at that level. In particular, sched_yield() is a known anti-pattern when
using system calls, and there are known better techniques. It's better enough
that I think it could have a significant impact on the performance results.

~~~
BeeOnRope
Yes, I am the author so I'm especially interested in feedback, thus my desire
to get to the bottom of this. If it's a misunderstanding I can fix by
clarifying something in the article, I'm open to it.

The goal of this post is to define and give a rough measurement for various
concurrency performance levels, and as a pedagogical technique I pick one
example (cross-thread counter) and stretch it across all the levels.
Specifically, the goal is not to create a "good" counter at all levels, but to
pick an example which can _plausibly_ hit all the levels, even though that's
going to require some bad designs at the upper levels.

Not all levels represent good engineering: some are simply dominated. For
example, I'd argue that 4 and 5 are pretty much always dominated. For a
specific requirement, even more levels will be dominated. E.g., for this job,
the simplest counter, which uses an atomic add to increment, is a good
baseline: any slower implementation than this is dominated. You'll never
_need_ to be at a higher level for this task.
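
(For reference, the atomic-add baseline here is just a single shared atomic; a
minimal sketch, not the exact code from the article:)

    #include <atomic>
    #include <cstdint>

    struct atomic_add_counter {
        std::atomic<uint64_t> value{0};
        void operator++()     { value.fetch_add(1, std::memory_order_relaxed); }
        uint64_t read() const { return value.load(std::memory_order_relaxed); }
    };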

So yes, the higher levels "do something dumb" by definition, because if you
are making a system call to increment a counter every time, you are doing it
wrong. That sched_yield is an anti-pattern [1] is beside the point, because
all of the higher levels are deep into anti-pattern territory. That's the only
way I could see to make one example work across all the levels (and having a
single example was important to me).

If I understand your concern, you would like (or believe the intent is) that
the various levels represent a "best possible" implementation within the
constraints of the level? I don't think it is possible; as noted above, it's
always going to be deeply artificial once you start forcing an atomic counter
to make a system call. For example, I could use a SysV semaphore as my
concurrency mechanism: this also always makes a system call. It is not
violating any "no sched_yield" rule, but I think it is equally artificial.

---

[1] Actually, I don't know that sched_yield is _always_ an anti-pattern, but
let's not even go there: for the purposes of this discussion my point stands
even if we assume it is always an anti-pattern (you hint at some of the places
where it might be useful with your nanosleep comment).

~~~
scott_s
I feel like this discussion is slipping out from under me, as I thought we
agreed that the counter itself is a contrived example to illustrate
synchronization mechanisms. But your recent response seems predicated on the
fact that the problem you're solving is just incrementing a counter.

I looked at it as just some work that needs to be done. But in the general
case, not all work can be reduced to a single, or even a few atomic
instructions. And sometimes your synchronization is needed not because you
have work to do, but because you are waiting for another thread to complete
their work so you can proceed. In those cases, you sometimes need system calls
and/or condition variables to correctly synchronize with another thread. And
when you find yourself in such a situation, sched_yield() is an anti-pattern.

And your footnote confuses me: I am not proposing using nanosleep() _and_
sched_yield(). I am proposing it _instead of_. What I linked to in my parent
comment explains, basically, that sched_yield() does not mean what you think
it means, so don't use it.

~~~
BeeOnRope
I agree the conversation seems to be going backwards and it seems like more
work than expected to get to a common understanding (with no guarantee of
success), so let's just drop it.

------
mrich
Great writeup to use as an introduction and reference.

The striping to have a memory location per core is great for performance but
can have a noticeable memory overhead.

It would be interesting to have Intel TSX in the mix. It can be a benefit to
some workloads and does not require many adjustments to your code.

~~~
BeeOnRope
If I get a CPU with Intel TSX enabled, I'll definitely try to add TSX to the
mix either as an update to this article or as a new one!

------
DylanDmitri
Really enjoyed this article, thank you for sharing

------
dgski
Great article! Will definitely keep this hierarchy in mind when writing
concurrent code.

------
doonesbury
Wow. Helluva good write-up. Def appreciated. I learned a lot from the last two
examples.

------
barumi
Great article. Loved it!

