Lockfree Algorithms (2010)

chrsig · on Sept 22, 2022

Fedor Pikus has given a number of great talks on the subject at cppcon over the years

some links, in no particular order:

CppCon 2017: Fedor Pikus “C++ atomics, from basic to advanced. What do they really do?

https://www.youtube.com/watch?v=ZQFzMfHIxng

CppCon 2016: Fedor Pikus “The speed of concurrency (is lock-free faster?)"

https://www.youtube.com/watch?v=9hJkWwHDDxs

CppCon 2015: Fedor Pikus PART 1 “Live Lock-Free or Deadlock (Practical Lock-free Programming)"

https://www.youtube.com/watch?v=lVBvHbJsg5Y

CppCon 2015: Fedor Pikus PART 2 “Live Lock-Free or Deadlock (Practical Lock-free Programming)”

https://www.youtube.com/watch?v=1obZeHnAwz4

CppCon 2017: Fedor Pikus “Read, Copy, Update, then what? RCU for non-kernel programmers”

https://www.youtube.com/watch?v=rxQ5K9lo034

Tony Van Eerd has also given some talks on lock free queues, I'll leave it as an exercise to the reader to search youtube :)

shultays · on Sept 22, 2022

Isn't the very first example error prone? The one under Wait-freedom

https://www.1024cores.net/home/lock-free-algorithms/introduc...

A thread can easily decrease the reference count to zero and delete the object while another thread is trying to add reference

gpderetta · on Sept 22, 2022

For another thread to be able to add a reference, it must already have a reference to that object, so the reference count can't be less than two. Two threads can't be both operating on the same pointer object.

Reference counted pointers themselves are usually not atomic, only the reference count updates are. To atomically access the reference counted pointer itself, you need either a different algorithm (using hazard pointers for example), or some form of mutual exclusion.

pdpi · on Sept 22, 2022

> Reference counted pointers themselves are usually not atomic, only the reference count updates are.

Fun way to start an argument where everybody is wrong: ask a group of C++ programmers whether std::shared_ptr is thread-safe.

maxbond · on Sept 22, 2022

Related talk from cppcon: https://youtu.be/lkgszkPnV8g

TillE · on Sept 22, 2022

shared_ptr's thread safety is simple: it's perfectly thread-safe as long as each thread has its own copy. It of course does not provide any synchronization for the object it's pointing to.

kccqzy · on Sept 22, 2022

Almost exactly one month ago I have a junior developer share the same shared_ptr across threads, where some threads could be writing to it by calling reset(). They forget that accessing the shared_ptr itself needs to be protected.

Rust prevents this problem completely though. You are either going to deal with Arc<Mutex<T>> or Mutex<Arc<T>>, depending on the situation (in this case the latter), and forgetting the Mutex will simply cause compilation failure.

shultays · on Sept 22, 2022

Aha, makes sense. Thanks for the explanation

pkhuong · on Sept 22, 2022

Reference counting algorithms usually assume that code increments through a strong reference.

snakey · on Sept 22, 2022

Thanks for this! I've been diving into the basics of lock-free datastructures recently. Some posts that I've been working through which I've found useful:

- https://moodycamel.com/blog/2013/a-fast-lock-free-queue-for-...

- https://drdobbs.com/parallel/lock-free-queues/208801974?pgno...

int08h · on Sept 22, 2022

Dmitry Vyukov has forgotten more about non-blocking synchronization (lock- and wait-free algorithms) that most developers presently remember.

kasey_junk · on Sept 22, 2022

He’s invented more of them than most people know…

bsaul · on Sept 22, 2022

Side question : could anyone recommend exercises for mid-experienced developer that really has troubles spotting potential race conditions, and generally struggle with concurrency ? I’ve realized i’m spending a lot of time reviewing his PR, and i’d like to help him improve faster on that topic.

neilparikh · on Sept 22, 2022

I enjoyed http://deadlockempire.github.io/

avgcorrection · on Sept 22, 2022

Based on the dragon this Deadlock Empire must be Wales.

maldev · on Sept 22, 2022

It's actually really easy. Traditional threading has a very obvious design pattern, where if someone isn't doing this you can be sure there's an issue.

TL;DR, if you don't have a "getter/setter" you're doing it wrong and it's a sign of bad code.

Firstly, make sure that there's only ONE place to write. Write being changing the variable in anyway(Obviously make sure you have a lock or other primitive here). If there's multiple places, that's a big smell for a race condition, and a massive turd for a shitty design pattern.

Secondly, just like the write, all reads should be done from a single place.

And then for the advance developer, you also may need to make sure that the cache is synchronized. Pretty much all libraries will do this for you, but if you ever need to dive into internals that alot of "guides" or papers DON'T cover, you need Fences or whatever your architecture provides. I have this page bookmarked for when I need a quick refresher https://stackoverflow.com/questions/27627969/why-is-or-isnt-...

gpderetta · on Sept 22, 2022

Sorry but it is not easy at all. Getter and setters do not help and can give a false sense of security.

And caches do not need to be explicitly synchronized. On all common platforms the hardware will do it for you without user intervention. Fences have a different purpose.

maldev · on Sept 22, 2022

Honestly just to prove how wrong you are and what you're saying is blatant misinformation, i'll correct you, since I don't think you're talking in good faith, or have any idea what you're saying.

Regarding fences, In the intel system programming guide, section 8.1.3

" Handling Self- and Cross-Modifying Code The act of a processor writing data into a currently executing code segment with the intent of executing that data as code is called self-modifying code. IA-32 processors exhibit model-specific behavior when executing selfmodified code, depending upon how far ahead of the current execution pointer the code has been modified. As processor microarchitectures become more complex and start to speculatively execute code ahead of the retirement point (as in P6 and more recent processor families), the rules regarding which code should execute, pre- or post-modification, become blurred. To write self-modifying code and ensure that it is compliant with current and future versions of the IA-32 architectures, use one of the following coding options: (* OPTION 1 ) Store modified code (as data) into code segment; Jump to new code or an intermediate location; Execute new code; Vol. 3A 8-5 MULTIPLE-PROCESSOR MANAGEMENT ( OPTION 2 ) Store modified code (as data) into code segment; Execute a serializing instruction; ( For example, CPUID instruction ) Execute new code;"

It further states,

"The act of one processor writing data into the currently executing code segment of a second processor with the intent of having the second processor execute that data as code is called cross-modifying code. As with selfmodifying code, IA-32 processors exhibit model-specific behavior when executing cross-modifying code, depending upon how far ahead of the executing processors current execution pointer the code has been modified. To write cross-modifying code and ensure that it is compliant with current and future versions of the IA-32 architecture, the following processor synchronization algorithm must be implemented: ( Action of Modifying Processor ) Memory_Flag := 0; ( Set Memory_Flag to value other than 1 ) Store modified code (as data) into code segment; Memory_Flag := 1; ( Action of Executing Processor ) WHILE (Memory_Flag ≠ 1) Wait for code to update; ELIHW; Execute serializing instruction; ( For example, CPUID instruction *) Begin executing modified code;"

So please tell me how the intel developer manual is wrong, how i'm wrong, and how they all server a "Different purpose" than what is given. I can link more instances from the Intel Developer manual if you want. I assume you've read it if you're trying to talk about the subject.

gpderetta · on Sept 22, 2022

Fences are not mentioned anywhere in your quoted text. Note that serializing instructions are not fences (although there are overlaps: some fences are also serializing instructions).

In any case on x86 the instruction cache is coherent with the data cache all the time, the serializing instruction is needed to flush the instruction pipeline (not the cache).

I'll concede that there are some non-x86 architectures where the instruction cache is not necessarily coherent with the data cache, but it is not terribly relevant as the topic here is concurrency, not self modifying code.

maldev · on Sept 22, 2022

Fences are meant for serializing. Although the intel documentation says CPUID, you use a fence there. This should be obvious, as if you know what a fence is, you know they're entire purpose is serialization. I'll put what the manual says on them, but it's not a hard concept, the intel manualy literally says sfence = SERAILIZES STORES. Lfence = SERIALIZES LOADS. Mfence = SERIALIZES LOADS AND STORES. You clearly have ZERO idea what you're talking about if you don't even know what these simple instructions mean. So to answer your question in the quoted text, where it says to use a serializing instruction, it means to use the appropriate fence! Not only this, but your entire premise of a properly written getter/setter being "Wrong" is entirely wrong, as the entire reason for a race condition is having improper getters and setters!

Here are the following fences referenced in that article, and any good asm programmer will tell you that fences are used for this purpose(NOT CPUID)

LFENCE: (Intel manual Vol. 2A 3-585) *Description* Serializes load operations Performs a serializing operation on all load-from-memory instructions that were issued prior the LFENCE instruction.

MFENCE: (Intel manual 4-22 Vol. 2B) *Description* Serializes load and store operations. Performs a serializing operation on all load-from-memory and store-to-memory instructions that were issued prior

SFENCE:(4-620 Vol. 2B) Description: *Serializes* store operations

Orders processor execution relative to all memory stores prior to the SFENCE instruction. The processor ensures that every store prior to SFENCE is globally visible before any store after SFENCE becomes globally visible. The SFENCE instruction is ordered with respect to memory stores, other SFENCE instructions, MFENCE instructions, and any serializing instructions (such as the CPUID instruction). the MFENCE instruction.

I'm not sure what you think fences are, or what they're for. But they're clearly for serializing. And were talking about concurrency, that article is from the intel manual on multiprocessor programming, which is concurrency! Do you not even know basic terms? Do you not even read replies?

gpderetta · on Sept 22, 2022

Please do not resort to personal attacks, we are just having a technical discussion.

In intel parlance 'serializing instruction' is a Word of Power; Please see Vol 3A, under serializing instructions. CPUID is listed there.

In the same section, LFENCE, MFENCE, SFENCE are listed as 'memory ordering instructions' and the documentation explicitly says that:

  The following instructions are memory-ordering instructions, not serializing instructions. These drain the data memory subsystem. They do not serialize the instruction execution stream

Handling self modifying code requires serializing the instruction stream, so the memory fences are neither necessary nor sufficient[1].

Going back to the original argument, the memory fence instructions (and the various LOCK prefixed operations) do indeed serialize memory operations to guarantee ordering, which I don't dispute. But that's nothing to do with synchronizing caches like you claimed.

[1] IIRC MFENCE it is actually also a serializing instruction in practice, but Intel has not yet committed to document this behavior as architectural.

sethammons · on Sept 22, 2022

> Honestly just to prove how wrong you are and what you're saying is blatant misinformation, i'll correct you, since I don't think you're talking in good faith, or have any idea what you're saying.

Are you this rude and condescending in real life to people? A different phrasing: "you are incorrect; here are some sources." This is neutral and still conveys your knowledge. You could even level it up a bit with, "I believe you are incorrect and here is why."

maldev · on Sept 22, 2022

Yes, when someone acts like a complete idiot and acts with such bad faith, I always respond the same. The person wasn't here for intellectual discussion, he led with a downvote of my original post and posted that "UMMMM AKTUALLY" reply, while being wrong. He's a troll and trolls shouldn't be treated nicely.

People who do that, as well as leading with the downvote, "AKTUALLY UR WRONG", hamper a sites intellectual discussion and drives away contributers and people who want to have an actual interesting conversation, and should be treated with no tolerance. My responses treat him like the child he is, one who is closed minded and doesn't want a discussion.

Feel free to look at my profile, and I had a longstanding profile before I forgot the login to when I changed jobs. First time i've ever encountered a troll on this site, and first time i've ever had a dismissive tone with someone. He's unique, he's a troll, and he doesn't want a discussion, he wants to troll.

bob1029 · on Sept 22, 2022

I would also recommend this article to anyone who struggles with making many threads do useful amounts of work:

https://lmax-exchange.github.io/disruptor/disruptor.html

anonymousDan · on Sept 22, 2022

In a similar vein, this is also an interesting approach to avoiding synchronisation overhead that works for many data structures: https://www.cs.bgu.ac.il/~hendlerd/papers/flat-combining.pdf

bob1029 · on Sept 22, 2022

This sounds very similar indeed. The central thesis of the paper seems to be that the grain of locking is the concern, which is what Disruptor directly touches on as well. In a Disruptor implementation, you are typically processing batches of things, and locks can be acquired along lines of batches.

gpderetta · on Sept 22, 2022

That's very interesting, thanks.

dang · on Sept 22, 2022

reitzensteinm · on Sept 22, 2022

I was about to post here that with the upcoming Zen 96 core chip we're now 40% of the way there to the name of the site.

Then I clicked on your link and ten years ago I was making basically the same point about Haswell EP.

See you all in ten more years.

moonchild · on Sept 22, 2022

Azul boxes had 768 cores...

reitzensteinm · on Sept 22, 2022

Yeah, I didn't realize that until somebody corrected me on the linked thread. But it turned out to be an evolutionary dead end.

memco · on Sept 22, 2022

Tangentially related, but I’m interested in doing file system traversal for finding some paths in Python: using a simple scandir to process takes minutes on the data I’m working with: shelling out to find is faster but still takes minutes. I know ‘fd’ exists and does parallel traversal, but I’d like to learn more about how to implement something like that: what structures do file systems use and how do you know what limitations should be considered when working with files?

pgt · on Sept 22, 2022

Tigerbeetle uses LMAX: https://tigerbeetle.com/

zackmorris · on Sept 22, 2022

This is great! I learned networking back in the late 90s reading Beej's Guide to Network Programming:

https://beej.us/guide/bgnet/

Which is great and everything, but there was an overlooked alternative then called OpenTransport (OT) from Apple (warning, PDF):

https://developer.apple.com/library/archive/documentation/ma...

Now, OT was frankly a terrible way to do networking. I learned it because this was in the Mac OS 9 days before protected memory and multitasking, so if you wrote games, you often had to deal with interrupts. OT got lost in the weeds with its complete lack of understanding of UNIX sockets (everything is a file). So I went several years thinking in terms of packets instead of streams. Once I saw the enormous benefits of synchronous blocking streams, I threw OT away and never looked back, as did Apple.

But OT had fantastic atomic primitives and FIFO/LIFO lock-free queues (see pages 657, 641, 645 of the PDF). That stuff was so vastly better than anything I had seen in Windows that it made me sad inside to see people still using mutexes. The use of locking primitives generally sets one up for a number of failure modes like poor performance and deadlocks. Unfortunately, many of those bad habits show up today in asynchronous web programming and even patterns as fundamental as locking database transactions.

It's too hard to explain where I'm coming from in a forum post. But there really is this whole other way of doing things that's been all but dead for over two decades. And it ties in with my frustration around the lack of a true multicore CPU with perhaps 1024 cores (I like the article's URL!).

Basically we could be using self-parallelizing code where instead of putting special "concurrent" keywords or intrinsics around loops etc, compilers could analyze loops for side effects and unroll them internally into a sequence of operations that get sent to all cores simultaneously. That would be analogous to how shaders work (and potentially MATLAB, Julia, maybe Clojure), but on the CPU itself so we wouldn't need GPUs anymore. We would just download a library that provides an OpenGL interface or whatever, but runs in ordinary userland code without having to worry about manually passing data around in vertex buffers.

I'd really like to see this done in perhaps a small microcontroller for under $100, that would run circles around the most expensive processors for certain workloads. I was also thinking that maybe someone could make 1 core that provides a roughly 90s level of performance like a MIPS or PowerPC 601, or even DEC Alpha, running at perhaps 1 GHz with no cache, for under $1. Then a grid of those cores could be laid out on a board with self-organized mesh networking using content-addressable memory and copy-on-write to automagically ensure data locality. Then we'd have a scalable architecture and wouldn't have to rely on video card manufacturers. Moore's law allows for just exactly what I'm saying here, it's just that nobody's done it, because we're used to centralized control of the supply chain, and we all spend all of our time making rent, because the open source funding model is broken, etc etc etc. I've ranted about this endlessly in previous comments.

keepquestioning · on Sept 22, 2022

Lockfree algorithms are a misnomer. The lock is an atomic instruction.

Thiez · on Sept 22, 2022

A "lock" in programming generally refers to a mutex, semaphore, or similar mechanism implemented mostly in software, e.g. see here https://en.wikipedia.org/wiki/Lock_(computer_science) . The distinction between such locks and atomic instructions is useful, since atomic instructions have various nice properties that locks (or "other locks", if you want to count atomic instructions as locks) do not, such as having no chance of deadlocks, guarantee of (eventual) progress, no chance of interruption, and no chance of blocking other threads forever when a thread executing an atomic instruction gets unlucky and is never scheduled again.

Naturally lockfree algorithms built with atomic instructions may not have (all) of these nice properties.

What better name would you suggest for lockfree algorithms?

duped · on Sept 22, 2022

If you're being pedantic, atomic instructions are a misnomer. But "Lock free" is a property of parallel algorithms where if there's N > 1 processors evaluating an algorithm, there is no state where the algorithm cannot make progress ("locked" meaning "all N processors are waiting")

Whether you want to debate if a physical system can realize a lock-free algorithm is a different debate. Hell, half the "well known" lock-free algorithms assume memory allocation/reclamation isn't blocking.

SighMagi · on Sept 22, 2022

That was my take away from "lock-free" talks, as well as needing a deep understanding of the machine's memory model and instruction set.

sylware · on Sept 22, 2022

"lockfree"? You mean "heavy lockfree" I guess since atomic depends on implicit hardware locking for proper operations.

And those algos are so much critical and dangerous, a mathematical proof must be provided and peer-reviewed.

For example, as far as I can remember, reader-writer ring buffer with atomic read/write pointers. You have such ring buffers in USB xHCI, and AHCI and NVMe.

CyberDildonics · on Sept 22, 2022

Lockfree means one thread is always making progress. If two threads contend over an atomic operation, one succeeds and one fails.

If you spend some time using atomics they might not seem so scary anymore.

sylware · on Sept 22, 2022

both threads can make progress at the same time.

CyberDildonics · on Sept 22, 2022

Lockfree means at least one thread is always making progress and atomics allow that trivially. If an atomic operation fails it means that a different thread's atomic operation succeeded. I didn't make up this definition.

https://en.wikipedia.org/wiki/Non-blocking_algorithm#Lock-fr...