
Lockless Algorithms for Mere Mortals - chmaynard
https://lwn.net/SubscriberLink/827180/a1c1305686bfea67/
======
alextheparrot
It seems the answer is “Not available right now, let’s work on an API”? I
think that matches my perspective on this, in that it isn’t worth a “mortal”
developer’s time right now. Rarely have I hit a scenario where a lock was the
primary source of inefficiency when crossing thread boundaries.

The Go approach of “Do not communicate by sharing memory” holds well, because
CPU and memory do not always compose as one would expect (As the author of the
post notes). Passing messages simplifies a lot of things because you learn
not to rely on a specific thread or ordering, or to program defensively
against that possibility.
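For illustration (not from the thread), here is that "share by communicating" style sketched in POSIX C: the pipe plays the role of a Go channel, and only the worker thread ever touches the running sum, so no lock is needed.

```c
/* Message passing instead of shared memory, sketched with POSIX
 * primitives. The pipe stands in for a Go channel; only the worker
 * thread touches `sum`, so there is nothing to lock. */
#include <pthread.h>
#include <unistd.h>

static int fds[2];

static void *worker(void *arg) {
    (void)arg;
    long sum = 0;
    int v;
    /* drain messages until the write end is closed */
    while (read(fds[0], &v, sizeof v) == (ssize_t)sizeof v)
        sum += v;
    return (void *)sum;
}

long run_pipeline(void) {
    pthread_t t;
    pipe(fds);
    pthread_create(&t, NULL, worker, NULL);
    for (int i = 1; i <= 10; i++)   /* "send" 1..10 */
        write(fds[1], &i, sizeof i);
    close(fds[1]);                  /* "close the channel" */
    void *ret;
    pthread_join(t, &ret);
    return (long)ret;
}
```

Closing the write end is what lets the worker terminate cleanly, the same way closing a Go channel ends a `range` loop.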

~~~
jiggawatts
My experience is that I've often seen locks being an _unseen_ source of
performance woes in enterprise systems. So when you say that "rarely have I
hit a scenario", I'd say there's an even chance that you have; it's just
that you didn't know it.

Here's the acid test: imagine you were given a two-socket AMD EPYC server
with 128 cores to run your software on. No virtualisation, local NVMe SSD for
storage, 400 Gbps Ethernet for networking. No hardware bottlenecks of any
type! Could your software utilise all 256 hardware threads of this computer?
If not, why not?

Maybe you got lucky and you really haven't had issues with locks, but
certainly other developers have.

The most common issue I see is with implicit locks in things like logging
frameworks. E.g.: Sending output to a text file can be a bottleneck for larger
servers like in the example above. Similarly, console output in multi-threaded
software also requires locks, and also often has contention issues.

Even innocent-looking code that simply uses dynamic memory allocation ("new",
"malloc", etc...) can be bottlenecked by cross-thread contention on the heap
data structures and the associated locks. Some modern allocators have thread-
local pools, but this doesn't always work. E.g.: often large allocations go to
a shared pool. Code that over-allocates large buffers and rapidly frees them
can hit this issue all too easily.
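A sketch of that over-allocation pattern (sizes and counts made up for illustration): each thread repeatedly takes a large buffer from the heap and frees it straight away. Large requests often bypass thread-local caches and go to a shared arena, so with enough threads this loop serialises on the allocator.

```c
/* Allocation anti-pattern: large, short-lived buffers that likely
 * come from a shared pool rather than a thread-local cache. */
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

#define NTHREADS 4
#define ROUNDS   1000

static void *churn(void *arg) {
    (void)arg;
    long done = 0;
    for (int i = 0; i < ROUNDS; i++) {
        char *buf = malloc(1 << 20);  /* 1 MiB: often shared-pool territory */
        memset(buf, 0, 4096);         /* touch it so the work isn't elided */
        free(buf);                    /* returned to the heap immediately */
        done++;
    }
    return (void *)done;
}

long run_churn(void) {
    pthread_t t[NTHREADS];
    long total = 0;
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, churn, NULL);
    for (int i = 0; i < NTHREADS; i++) {
        void *r;
        pthread_join(t[i], &r);
        total += (long)r;
    }
    return total;
}
```

Reusing one buffer per thread, or switching to an allocator with per-thread arenas (jemalloc, tcmalloc), is the usual way out.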

Web application servers will often hit the wall on something like a shared
cache or the session-state store. Unless 100% of the shared data uses
efficient lock-free algorithms, given enough threads some mutex somewhere
will eventually become the limit.

There was a study done recently that showed that no modern database engine can
scale past 64 cores properly, let alone 128. Not Oracle, not SAP, not SQL
Server.

At the rate TSMC is advancing with chip technology, they'll hit 300 million
transistors per square millimeter in just a couple years. I fully expect AMD
to release 128-core CPUs once they're on that process. A quad-socket server
with those will have 512 cores or 1,024 threads. Completely lock-free
algorithms will be the _only_ way to scale up to even a single server of
that size!

PS: I remember reading the content on
[http://www.1024cores.net/](http://www.1024cores.net/) a few years back and
thinking to myself that this guy has the _right idea_, but he's thinking too
far ahead. Now... not so much. Now I think the author of that site is a
visionary that more people should have paid attention to.

~~~
scottlamb
> I fully expect AMD to release 128-core CPUs once they're on that process. A
> quad-socket server with those will have 512 cores or 1,024 threads.
> Completely lock-free algorithms will be the only way to scale up to just one
> server at that scale!

How often will we need that scale in a single address space?

It's certainly desirable for hypervisors/kernels, database servers, perhaps L4
load balancers and reverse HTTP proxies. Those aren't nothing, but far more
people work on web application servers than on all of them put together.

Web application servers are often written to be stateless (with the possible
exception of caches) so they can scale to multiple machines. That's important
for high-reliability sites even if they aren't large enough to fully saturate
a single huge machine like that. As long as you need to load-balance between
machines, it's not a big problem to also run multiple instances per machine.
If the application scales well to 32 cores, run 16 of them per 512-core
machine. Seems a lot easier than going to extraordinary efforts to make one
address space scale...

~~~
jlokier
Even if you have 1024 separate processes not sharing much of anything, there
are still locks in the kernel running them.

For example, a pair of threads (inside one of those 1024 processes)
synchronising with each other will often go through the kernel to do so. In
Linux this uses the futex syscall; Windows etc. have similar mechanisms. If,
to do that, the kernel takes a lock shared with other processes, even if just for a
moment, even if it's hashed on address and memory space, even if it's a
spinlock and there's little contention, that lock causes memory traffic
between multiple cores and separate processes.
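To make the kernel round trip concrete, here is a minimal sketch of the standard futex-backed lock pattern (Linux-only; the three-state design follows Drepper's "Futexes Are Tricky" — this is an illustration, not the actual glibc code):

```c
/* Minimal futex-backed mutex sketch. The fast path is pure user-space
 * atomics; only contention goes through the kernel via futex(2). */
#include <linux/futex.h>
#include <pthread.h>
#include <stdatomic.h>
#include <sys/syscall.h>
#include <unistd.h>

static long futex(atomic_int *addr, int op, int val) {
    return syscall(SYS_futex, addr, op, val, NULL, NULL, 0);
}

/* 0 = unlocked, 1 = locked, 2 = locked with possible waiters */
static void ftx_lock(atomic_int *f) {
    int c = 0;
    if (!atomic_compare_exchange_strong(f, &c, 1)) {
        if (c != 2)
            c = atomic_exchange(f, 2);
        while (c != 0) {
            futex(f, FUTEX_WAIT, 2);   /* sleep in the kernel */
            c = atomic_exchange(f, 2);
        }
    }
}

static void ftx_unlock(atomic_int *f) {
    if (atomic_exchange(f, 0) == 2)
        futex(f, FUTEX_WAKE, 1);       /* wake one waiter */
}

static atomic_int ftx;
static long counter;

static void *hammer(void *arg) {
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        ftx_lock(&ftx);
        counter++;                     /* plain variable, guarded by the lock */
        ftx_unlock(&ftx);
    }
    return NULL;
}

long run_demo(void) {
    pthread_t t[4];
    for (int i = 0; i < 4; i++)
        pthread_create(&t[i], NULL, hammer, NULL);
    for (int i = 0; i < 4; i++)
        pthread_join(t[i], NULL);
    return counter;
}
```

Every FUTEX_WAIT/FUTEX_WAKE pair is a trip through kernel data structures shared across processes, which is exactly the cross-core memory traffic described above.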

Same for processes that are reading the same files as other processes, or (for
example) running in the same directory when doing path lookups. There's a lot
of work done in Linux to keep this scalable (RCU), but it's easy to hit
scaling barriers that nobody has tested or designed for yet. Once 1024 core
CPUs are common, of course the kernel will be optimised for that.

~~~
scottlamb
Yes, I included the kernel in my list of things that are desirable to scale
well for that reason.

That said, in some cases I don't think it's strictly necessary for even the
kernel to scale well as long as you have a hypervisor that does. It's not
unusual to deploy software in VMs on a cluster. Having more, smaller VMs per
machine is a way to handle poor kernel scalability, just as I suggested for
the web application server. VMs are higher-overhead than multiple containers
on a single kernel, so this wouldn't be my first choice, but many people use
VMs anyway.

------
eximius
It would seem to be a failing of the C language that ultimately prevents us
from coming up with reusable recipes for this. Macros or generics of
sufficient power in a language with an advanced type system should be able to
accomplish this.

I wonder, if language is the primary barrier here, whether there is a way to
integrate another language for specific small pieces like this.

~~~
anaphor
C actually had better support (than C++) for double-checked locking (which is
a lockless algorithm) up until relatively recently
[https://preshing.com/20130930/double-checked-locking-is-fixed-in-cpp11/](https://preshing.com/20130930/double-checked-locking-is-fixed-in-cpp11/)

~~~
lalaland1125
Huh? Both C and C++ only added support for atomics and double-checked locking
in the same year: 2011 (C11 for C and C++11 for C++). How did C support
double-checked locking before C++ when the standards were released
contemporaneously?
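Either way, since C11 the pattern can be written portably with `<stdatomic.h>`. A minimal sketch (names and the value stored are illustrative, not from the thread):

```c
/* Double-checked locking with C11 atomics: the fast path is a single
 * acquire load; the mutex is taken only while the singleton is unset. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdlib.h>

static _Atomic(int *) instance;
static pthread_mutex_t init_lock = PTHREAD_MUTEX_INITIALIZER;

int *get_instance(void) {
    int *p = atomic_load_explicit(&instance, memory_order_acquire);
    if (p == NULL) {                   /* first check, no lock */
        pthread_mutex_lock(&init_lock);
        p = atomic_load_explicit(&instance, memory_order_relaxed);
        if (p == NULL) {               /* second check, under the lock */
            p = malloc(sizeof *p);
            *p = 42;
            /* release ensures *p is visible before the pointer is */
            atomic_store_explicit(&instance, p, memory_order_release);
        }
        pthread_mutex_unlock(&init_lock);
    }
    return p;
}
```

The acquire/release pair is what the pre-2011 versions lacked a standard way to express, which is the point of the Preshing article linked above.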

