A collection of lock-free data structures written in standard C++11 (github.com/dnedic)
158 points by dnedic on May 10, 2023 | 81 comments



From the FAQ:

> The biggest reason you would want to use a lockfree data structure in such a scenario would be performance. Locking has a non-negligible runtime cost on hosted systems as every lock requires a syscall.

This is misleading. While a lock does have a runtime cost, in some cases that cost is less than all the forced CPU cache synchronization that lock-free code needs to do. With a lock you only have to sync once, after all the operations are done. You need to carefully measure this to see which is more performant for your application.


This is true in principle and it is good to call it out, but in practice I've never seen a mutex-based data structure beat an equivalent lock-free data structure, even at low contention, unless the latter is extremely contrived.

A mutex transaction generally requires 2 fences, one on lock and one on unlock. The one on unlock would not be strictly necessary in principle (on x86 archs the implicit acquire-release semantics would be enough) but you generally do a CAS anyway to atomically check whether there are any waiters that need a wake-up, which implies a fence.

Good lock-free data structures OTOH require just one CAS (or other fenced RMW) on the shared state.
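To make that concrete, here is a minimal sketch (not code from the library being discussed) of a Treiber-stack push in C++11: the entire operation is one fenced RMW (a CAS) on the shared head, versus the lock()+unlock() pair of RMWs a mutex-based push would pay.

    #include <atomic>

    struct Node {
        int value;
        Node* next;
    };

    class TreiberStack {
        std::atomic<Node*> head_{nullptr};
    public:
        void push(Node* n) {
            n->next = head_.load(std::memory_order_relaxed);
            // A single CAS on the shared state; it only retries under contention.
            while (!head_.compare_exchange_weak(n->next, n,
                                                std::memory_order_release,
                                                std::memory_order_relaxed)) {
            }
        }
    };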

Besides, at large scale, no matter how small your critical section is, it will be preempted every once in a while, and when you care about tail latency that is visible. Lock-free data structures have more predictable latency characteristics (even better if wait-free).


I appreciate your polite tone here. To expand on this at the risk of sounding a bit rude: nobody should listen to anyone who speaks about performance in terms of reasoning about a system instead of profiling it.

Computers are shockingly complex. I can't tell you how many times I've reasoned about a system, ran the profiler, and discovered I was completely wrong.

When I was working on an interpreter for a Lisp, I implemented my first cut of scopes (all the variables within a scope and their values) as a naive unsorted list of key/value pairs, thinking I'd optimize later. When I came back to optimize, I reimplemented this as a hashmap, but when I ran my test programs, to my horror, they were all 10x slower. I plugged in a hashmap library used in lots of production systems and got a significant 2x performance gain, which was still slower than looping over an unsorted list of key/value pairs. The fact is, most scopes have <10 variables, and at that size, looping over a list is faster than the constant time of a hashmap. I can reason about why this is, but that's just fitting my reasons to the facts ex-post-facto. Reasoning didn't lead me to the correct answer, observation did.
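For illustration only (this is not the interpreter's actual code), the two lookup strategies being compared look roughly like this; for scopes with fewer than ~10 variables the linear scan tends to win despite its O(n) complexity:

    #include <string>
    #include <unordered_map>
    #include <utility>
    #include <vector>

    // Naive scope: a flat list of key/value pairs, scanned linearly.
    // Cache-friendly, no hashing, no pointer chasing.
    const int* lookup_list(const std::vector<std::pair<std::string, int>>& scope,
                           const std::string& name) {
        for (const auto& kv : scope)
            if (kv.first == name) return &kv.second;
        return nullptr;
    }

    // Hashmap scope: O(1) on paper, but hashing and indirection dominate
    // when there are only a handful of entries.
    const int* lookup_map(const std::unordered_map<std::string, int>& scope,
                          const std::string& name) {
        auto it = scope.find(name);
        return it != scope.end() ? &it->second : nullptr;
    }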

Returning to parallel data structures, the fact is, I don't know why lock-free structures are faster than mutex-based structures, I just know that they are in every situation where I've profiled them.

Reasoning isn't completely useless--reasoning is how you intuit what you should be profiling. But if you're just reasoning about how two alternatives will perform and not profiling them in real-life production systems you're wasting everyone's time.


I don't think that's quite correct because there are many optimizations which are impossible (or nearly impossible) to do after a system is implemented. Daniel Lemire wrote an excellent post on this exact subject https://lemire.me/blog/2023/04/27/hotspot-performance-engine....

In terms of programming languages, I think python is an excellent example of a language which has many features that have ended up making it extremely difficult to optimize even compared to other dynamic languages like LISPs. Even if you don't have to worry about backwards compatibility, there are design decisions that can limit performance which end up necessitating a rewrite of the entire system to actually change.


> I don't think that's quite correct because there are many optimizations which are impossible (or nearly impossible) to do after a system is implemented.

Who said you have to profile after a system is implemented? Certainly I didn't: if anything, I prefer to profile during prototyping, although few companies outside the largest ones budget for any real prototyping these days, it seems. Usually I settle for timing things and profiling as early as possible so that you can catch any performance issues before any calcifying structure is built around the non-performant code.

Yes, I did profile after the fact in the Lisp story, but my point in that story was that my reasoning led me to the wrong conclusions, not that I did everything perfectly (on the contrary, it's a story about learning from my mistakes!).

I agree that Daniel Lemire's post is excellent, but nothing in that post leads me to believe he'd disagree with anything I've said.


Optimizing Lisp lexical environments with hashes is going to be a fool's errand. The way you optimize lexical environments is by compiling them, so that there is no searching at all.

The value in the interpreter is that it provides an alternative implementation of the semantics. This becomes particularly valuable in areas that happen to be under-documented, where the compiler and interpreter are found to disagree.

It can be easier to get the semantics right in the interpreter. A simple implementation of environments (and whatnot) reduces the likelihood of bugs and leaves the code readable. Interpreted behavior of special forms can usually serve as the reference implementation for compilation.

If an interpreter is slow, that just means that the build steps for bootstrapping the compiler using the interpreter take longer.


Assuming low or no contention, it is easy to imagine a scenario where a mutex vastly outperforms it: if you need to push 1000 things into the queue, it's still just two fences for the mutex, but it's now 1000 CASes.

Moreover: the point with mutexes is that your data structure can be optimized assuming no thread-safety. There are lots of, like, hyper-optimized hash table variants (with all sorts of SIMD nonsense and stuff) that are just not possible to do lock-free. The very "lock-freedomness" of the data structure slows it down enough that in low contention scenarios mutexes would clearly outperform them without being particularly contrived.


This is why you would use the Ring Buffer or Bipartite Buffer to place 1000 elements at a time, not the queue. Check the documentation for more info.


If you are going to do batch operations, your data structure should be optimized to support them, so you're back to one CAS. The same would apply to the locked scenario, where you probably don't want to copy 1000 elements in the critical section.
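For an SPSC structure the batch case is even cheaper than one CAS: the producer owns the write index, so it can copy the whole batch and then publish it with a single release store. A rough sketch of the shape of the idea (not the library's actual API):

    #include <atomic>
    #include <cstddef>

    template <typename T, size_t N>
    class SpscRing {
        static_assert((N & (N - 1)) == 0, "N must be a power of two");
        T buf_[N];
        std::atomic<size_t> write_{0};
        std::atomic<size_t> read_{0};
    public:
        // Producer side: write the whole batch, then publish it at once.
        bool push_batch(const T* items, size_t count) {
            size_t w = write_.load(std::memory_order_relaxed);
            size_t r = read_.load(std::memory_order_acquire);
            if (N - (w - r) < count) return false;        // not enough free space
            for (size_t i = 0; i < count; ++i)
                buf_[(w + i) & (N - 1)] = items[i];
            // One release store makes the whole batch visible to the consumer.
            write_.store(w + count, std::memory_order_release);
            return true;
        }
    };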

About the sufficiently smart optimizations, sure, everything is easy to imagine, but in my experience this has never happened, and I'd be curious to hear practical examples if you have any.


Here's one I had: I was trying to build a Bloom filter in parallel. Each thread had large-ish batches of hashes it wanted to insert into the filter. Naively, you'd just have each thread iterate through the batches and do __sync_fetch_and_or for each of the hashes (this was a register-blocked Bloom filter, so we only needed to perform one 8-byte OR operation per hash).

What ended up being MUCH faster was to partition the filter, and to have a lock per partition. Each thread would attempt to grab a random lock, perform its inserts into that partition, then release the lock and try to grab another random lock that it hasn't grabbed yet. Granted, these locks were just atomic booleans, not std::mutex or anything like that. But I think this illustrates that partitioning+locking can be better for throughput. If you want predictable latency for single inserts, then I'd imagine the __sync_fetch_and_or strategy would work better. Which maybe brings up a broader point that this whole discussion relies a lot on exactly what "faster" means to you.
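Roughly what that looked like (an illustrative sketch, not the original code; the partition selection and hashing details here are made up):

    #include <atomic>
    #include <cstdint>
    #include <vector>

    struct Partition {
        std::atomic<bool> busy{false};     // a cheap "lock": just an atomic boolean
        std::vector<uint64_t> words;       // register-blocked Bloom filter words
    };

    // One attempt to insert a batch of hashes into one partition.
    // Returns false if another thread currently holds this partition.
    bool try_insert_batch(Partition& p, const std::vector<uint64_t>& hashes) {
        if (p.busy.exchange(true, std::memory_order_acquire)) return false;
        for (uint64_t h : hashes) {
            uint64_t& w = p.words[(h >> 16) % p.words.size()];
            // Plain ORs inside the critical section, no atomic RMW per hash.
            w |= (1ull << (h & 63)) | (1ull << ((h >> 6) & 63));
        }
        p.busy.store(false, std::memory_order_release);
        return true;
    }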


This seems to me like a parallel accumulation problem: why not have each thread accumulate a filter on a subset of the data (so no locking involved), and then reduce the results (which is just an OR of all the local accumulations)?


Parallel reductions are more heavy-weight synchronizations than locks. Say we have 64 partitions, then we need to perform 6 levels of tree reduction, or avoid parallelism completely and perform the reduction on a single thread. Either way it was slower.

The locking strategy very rarely had any reduction in parallelism due to the randomized lock-taking.

There were also other reasons, such as not wanting to replicate the filter per-thread.


^ This can definitely be the case in a multi-producer/multi-consumer (MPMC) scenario if CAS is involved with loops. Great care has to be taken when writing MPMC data structures without locks that are more performant than lock equivalents; they are far more complex. I think it should be called out that it seems most (if not all) of the data structures provided are single producer/consumer, which generally have much simpler designs (and limitations of use) than MPMC.


I suppose I should also say this applies to MPSC and SPMC. Basically anything other than SPSC.


Are there any good benchmarks which demonstrate the performance characteristics you’re talking about? Or case studies where an application moved from mutexes to lock free data structures, and compared the resulting performance?


https://youtu.be/_qaKkHuHYE0 (CppCon) A senior software engineer at Google tried to optimize tcmalloc by replacing a mutex with a lockless MPMC queue. After many bugs and tears, the result was not statistically significant in production systems.


A fully lock-free allocator would be a huge result by itself: you would be able to allocate from a signal handler, have truly lock-free algorithms that do not need custom allocators, or avoid deadlock-prone code in tricky parts like dlclose...

But I guess this was only one part and malloc wouldn't be fully lockfree.

In any case, the lockfree malloc designs I have seen use NxN queues to shuffle buffers around, but I guess that would be unsuitable for a generic malloc.


This is impossible to do in a useful way for publication. You can do case studies, but changes in factors that seem minor can make a massive difference in benchmarks. As such you need to find real world data, used in a real world scenario, for your application: then benchmark it. Even then you have a benchmark useful for your application only, and not worth publishing.


I disagree. I've gotten a lot from reading publications about people profiling their applications and posting about the results with descriptions of their application design and load. Yes, obviously you can't read that and draw conclusions about how the same data structure or algorithm will perform in your application, but it helps you build an intuition for what is likely to work in applications with different characteristics.

Here's an example: https://queue.acm.org/detail.cfm?id=1814327

If you read that for the first time and don't learn something, I'd be quite surprised.


Thanks for the link! I read that about a decade ago, and it's still a great read. That article makes it seem ridiculous to implement software any other way.

And yet, yesterday I watched a talk about how ridiculous it is to use mmap to implement a database:

https://db.cs.cmu.edu/mmap-cidr2022/

And I ended up side tracked watching some talks from CMU about database page buffers and such.

It's weird how people with different backgrounds (OS / kernel people and database people) come to very similar problems in computing with very different perspectives, and they end up implementing very different systems as a result. And in each case, each community thinks the other way is basically wrong.

My understanding is that Linus Torvalds thinks O_DIRECT is a ridiculous flag that database people probably don't want. And from a database perspective, it's crazy how difficult the Linux kernel makes it to write high performance filesystem code that never corrupts data if the system crashes. fsync is a misshapen sledgehammer, and disk write barriers or IO completion events are totally missing from Linux.


Pikus has a number of C++ parallelism performance talks that discuss this issue.

In particular, the "cost of synchronization" is roughly the price of an L3 memory read/write, or ~50 cycles. In contrast, a read/write to L1 cache has something like 4 cycles of latency and can be done multiple times per clock tick, and you can go faster still (i.e. register space).

So an unsynchronized "counter++" will be done ~4-billion times a second.

But a "counter++; sync()" will be roughly 50x slower. That is to say, the "sync()" is the expensive part to think about.
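In C++ terms, the two cases being compared are roughly these (an illustrative sketch; the numbers are ballpark, measure on your own hardware):

    #include <atomic>

    // "counter++": a plain increment that stays in a register / L1,
    // usable only if no other thread touches the counter concurrently.
    void hot_loop_plain(long n, long& counter) {
        for (long i = 0; i < n; ++i)
            ++counter;
    }

    // "counter++; sync()": a sequentially consistent atomic RMW per
    // iteration; this is the ~50x-more-expensive synchronizing operation.
    void hot_loop_synced(long n, std::atomic<long>& counter) {
        for (long i = 0; i < n; ++i)
            counter.fetch_add(1);
    }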

------------

A lock is two sync() statements. One sync() when you lock, and a 2nd sync() when you unlock. Really, half-sync() since one is an acquire half-sync and the other is a release half-sync.

If your lock-free data structure uses more than two sync() statements, you're _probably_ slower than a dumb lock-based method (!!!). This is because the vast majority of lock()/unlocks() are uncontested in practice.

So the TL;DR is, use locks because they're so much easier to think about. When you need more speed, measure first, because it turns out that locks are really damn fast in practice and kind of hard to beat.

That being said, there are other reasons to go lock-free. On some occasions, you will be forced to use lock-free code because blocking cannot be tolerated. (Ex: lock-free interrupts. It's fine if the interrupt takes a bit longer than expected during contention.) So in this case, even if lock-free is slower, the guarantee of forward progress is what you're going for, rather than performance.

So the study of lock-free data structures is still really useful. But don't always reach for them for performance reasons, because these data structures may very well be slower than std-library + lock() in many cases.


A sync, assuming it is your typical memory barrier, is not bound by the L3 latency. You pay (to a first approximation) the L3 cost when you touch a contended cache line, whether you are doing a plain write or a full atomic CAS.

Separately, fences and atomic RMWs are slower than plain reads/writes, but that's because of the (partially) serialising effects they have on the CPU pipeline, and very little to do with L3 (or any memory) latency.

Case in point: A CAS on intel is 20ish cycles, the L3 latency is 30-40 cycles or more. On the other hand you can have multiple L3 misses outstanding, but CAS hardly pipelines.


Perhaps my brain is conflating multiple things.

So here's what I know. L1 / L2 caches are "inside the core". To talk to other cores, your data must leave L1/L2 cache and talk to a network. It is on _THIS_ network that the L3 cache exists.

It's not really "L3 cache", it's just the memory network that implements the MOESI protocol (or whatever proprietary variant: MOESIF or whatever) that sits between L2 cache, L3 cache, and core-to-core communications. So I don't want to make it sound like "you're waiting on L3 cache", because you're right. It's not really related to L3 cache.

So I don't really know how that network works (I know the high-level details of MOESI, but I also know that's the textbook explanation and the real world is way more complex), aside from "It'd be roughly on the same order-of-magnitude cost" as the L3 cache.

-------

What's really going on, is that Core#1 is talking to all the other cores and basically performing "git pull" and "git push" to the data asynchronously. Core#1 prefers to perform "git checkin" and "git checkout" (aka: talk to L1/L2 cache), because that's faster.

But when Core#1 does a full sync (git push/git pull), it is required to talk on the network that leads to other cores and/or L3 cache.

Those "git push / git pull" messages are MOESI (F and others), meaning Modified/Owned/Exclusive/Shared/Invalid (and Forwarding, and other proprietary states). Which lines up with version control way better than most people expect.


Indeed, L3 being shared and also often working as the MOESI directory works out to the inter-thread latency being the same order of magnitude as the L3 latency.

My point is that sync has nothing to do with caches. Caches are coherent all the time and do not need barriers. In particular, I don't think the git pull/push analogy maps well to MOESI, as it is an optimistic protocol and only requires transferring opportunistically, on demand, what is actually needed by a remote core, not conservatively everything that has changed like in git.

The explicit sync model is more representative of non coherent caches, which are not really common as they are hard to use.

Memory barriers are, for typical CPUs, purely an internal matter of the core and synchronize internal queues with L1. In a simple x86 model, where the only visible source of reordering is StoreLoad, a memory barrier simply stalls the pipeline until all preceding stores in program order are flushed out of the write buffer into L1.

In practice things are more complex these days, and a fence doesn't fully stall the pipeline; it potentially only needs to visibly prevent loads from completing.

Other more relaxed CPUs also need to synchronise load queues, but still for the most part fences are a core local matter.

Some architectures have indeed remote fences (even x86 is apparently getting some in the near future) but these are more exotic and, AFAIK, do not usually map to default c++11 atomics.
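The classic store-buffer litmus test illustrates the point (a sketch, not from any of the linked code): without the fences, both loads can observe 0, because each store may still be sitting in its core's write buffer when the other core's load executes.

    #include <atomic>

    std::atomic<int> x{0}, y{0};
    int r1, r2;

    void thread_a() {
        x.store(1, std::memory_order_relaxed);
        std::atomic_thread_fence(std::memory_order_seq_cst);  // drain the store buffer
        r1 = y.load(std::memory_order_relaxed);
    }

    void thread_b() {
        y.store(1, std::memory_order_relaxed);
        std::atomic_thread_fence(std::memory_order_seq_cst);
        r2 = x.load(std::memory_order_relaxed);
    }
    // With the fences, at least one of r1/r2 must observe 1.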


> The explicit sync model is more representative of non coherent caches, which are not really common as they are hard to use.

Not that I'm a professional GPU programmer. But I'm pretty certain that GPU caches are non-coherent.

But yeah, cache-coherence is just assumed on modern CPUs. Your clarification on store-queues and load-queues is helpful (even if the caches are coherent, the store-queue and load-queue can still introduce an invalid reordering. So it sounds like your point is that the various sync() instructions are more about these queues?)


> Not that I'm a professional GPU programmer. But I'm pretty certain that GPU caches are non-coherent.

Yes, I was specifically referring to general purpose CPUs; I'm quite unfamiliar with GPUs, but I don't think anybody has ever accused them of being easy to program. Also I understand that GPUs (and CPU-GPU links) are an area where remote atomics already exist.

> So it sounds like your point is that the various sync() instructions are more about these queues?

For the most part, yes: specifically, fences enforce ordering on any operation that can execute out of order (even on in-order CPUs memory ops can be reordered), but only up to the coherence layer (i.e. L1). Ordering from the coherence layer on is enforced by the coherence protocol. You could of course have a model where fences are needed for global coherence, but it would be too slow (having to flush the whole cache), too hard to use (as you would need to specify which lines need to be sync'd), or both.

You could see something like the store buffer as a non-coherent cache (as reads can be fulfilled from it), with fences restoring the coherence, but I don't think it is a terribly useful model.


Moreover, is "every lock requires a syscall" accurate? Probably depends on target platform and standard library, but my impression was that at low contention it doesn't really require syscalls, at least on Linux and glibc's pthread.


Definitely not. I don't think there are many standard libraries that use syscalls for every lock. It is near universal to attempt a lock in user space (maybe even spin a few times) and only call the kernel if you need to wait. So low-contention locks should very rarely make a syscall.


Critical sections in Windows certainly behave this way with a configurable spin count.


That's my understanding too. In Linux mutexes are implemented by futexes (Fast User-Space mutexes). If there is no contention they are guaranteed to not perform a syscall https://en.wikipedia.org/wiki/Futex


Maybe this has changed, but last time I looked at futexes there was no syscall for locking (assuming no contention), but unlocking always made a syscall. This was many years ago so it could be different now.


The code isn't the easiest to read, but in glibc it seems that the syscall is only performed if waiters are detected in userspace during an unlock operation:

https://github.com/lattera/glibc/blob/master/nptl/pthread_mu...


Although it's not the code C++ will be using, the Rust implementation is a bit easier to follow:

https://doc.rust-lang.org/src/std/sys/unix/locks/futex_mutex...

Unlock is just: self.futex.swap(0, Release) -- if the value we get back is 2, then we know at least one thread is asleep waiting on this futex, so we need a system call to wake it, but in the uncontended case we're done immediately.
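For reference, a hedged C++ sketch of the same scheme (essentially the mutex from Drepper's "Futexes Are Tricky", Linux-only; state 0 = unlocked, 1 = locked, 2 = locked with possible waiters). The uncontended lock and unlock paths are pure user-space atomics; the kernel is entered only to sleep or to wake a sleeper:

    #include <atomic>
    #include <linux/futex.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    static long futex(std::atomic<int>* addr, int op, int val) {
        return syscall(SYS_futex, addr, op, val, nullptr, nullptr, 0);
    }

    struct FutexMutex {
        std::atomic<int> state{0};

        void lock() {
            int c = 0;
            if (state.compare_exchange_strong(c, 1, std::memory_order_acquire))
                return;                                   // uncontended: no syscall
            if (c != 2)
                c = state.exchange(2, std::memory_order_acquire);
            while (c != 0) {
                futex(&state, FUTEX_WAIT_PRIVATE, 2);     // sleep while state == 2
                c = state.exchange(2, std::memory_order_acquire);
            }
        }

        void unlock() {
            // The swap(0) the Rust code does: only wake if there were waiters.
            if (state.exchange(0, std::memory_order_release) == 2)
                futex(&state, FUTEX_WAKE_PRIVATE, 1);
        }
    };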


Indeed, you only need to FUTEX_WAKE if you know there are waiters (or if you lost track of the number of waiters).


What can cause the mutex to lose track of the number of waiters?


Generally you have a small number of bits to count the waiters, because the mutex state has to be a word you can CAS and so you have either 32 or 64 bits to pack all the state you need. If your counter saturates you lose track of the waiters, and you have to fallback somehow.


Makes sense, thanks for the explanation!


You can literally just do an atomic swap on some memory location. Three lines of assembly, like.

    mov al, 1                   ; value to swap in (1 = locked)
  acquire_lock:
    ; attempt to acquire the lock
    xchg byte [ADDR_LOCK], al   ; atomically swap the lock byte with al
    test al, al                 ; test if the original lock value was 0 (unlocked)
    jnz acquire_lock            ; if it was not, loop until we can acquire the lock
The downside is you want a backoff to sleep the thread so it doesn't spin in a busy loop. But the actual lock code is simple. You can easily have this be your function "AcquireLock()" and then do

while(!AcquireLock()) { //pause thread execution. }

And I think this is where they get the claim that a syscall is needed, since pausing the thread normally requires a call into the scheduler.
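A C++11 rendering of the same idea (names follow the comment above; std::this_thread::yield stands in for whatever "pause thread execution" mechanism you'd actually use):

    #include <atomic>
    #include <thread>

    std::atomic_flag lock_flag = ATOMIC_FLAG_INIT;

    // The xchg above: one attempt, returns true if we got the lock.
    bool AcquireLock() {
        return !lock_flag.test_and_set(std::memory_order_acquire);
    }

    void ReleaseLock() {
        lock_flag.clear(std::memory_order_release);
    }

    void Lock() {
        while (!AcquireLock()) {
            // "pause thread execution": typically a syscall that asks the
            // scheduler to run someone else for a while.
            std::this_thread::yield();
        }
    }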


For me the biggest benefit of lock free programming is avoiding deadlocks.

Locks are not composable. Unless you are aware of what locks every function in your call tree is using, you can easily end up with a deadlock.


This is why RCU, aka Read-Copy-Update, exists. RCU avoids the expensive synchronization part by delaying the release of the older version of the data structure until a point at which all CPUs are guaranteed to see the new version of the data. The patents have now expired, so it's worth investigating for people that need to write high performance multithreaded code that hits certain shared data structures really hard.
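Not real RCU, but a hedged sketch of the same read-copy-update shape using C++11's atomic shared_ptr free functions (which may themselves use an internal lock on some implementations): readers take a snapshot without blocking, a writer copies, modifies, and publishes, and the old version is reclaimed only when the last reader drops its reference. It assumes a single writer, or writers serialized by some other means:

    #include <map>
    #include <memory>
    #include <string>

    using Table = std::map<std::string, int>;
    std::shared_ptr<const Table> g_table = std::make_shared<Table>();

    int reader_lookup(const std::string& key) {
        // Snapshot the current version; it stays valid even if a writer
        // publishes a new version while we are still reading it.
        std::shared_ptr<const Table> snap = std::atomic_load(&g_table);
        auto it = snap->find(key);
        return it != snap->end() ? it->second : -1;
    }

    void writer_update(const std::string& key, int value) {
        // Copy the current version, modify the copy, publish it atomically.
        auto cur = std::atomic_load(&g_table);
        auto next = std::make_shared<Table>(*cur);
        (*next)[key] = value;
        std::atomic_store(&g_table, std::shared_ptr<const Table>(next));
    }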


The linked talk from Herb Sutter says as much, but it is a good suggestion to make that visible upfront, thanks.

Additionally, cacheline alignment of indexes is something that's there to mitigate the false sharing phenomenon and reduce some of the cache synchronization cost.


Both lock-free and mutex-based approaches have their applications. The general rule of thumb: for 2-4 threads in the same NUMA node, lock-free is faster. Need more cores? Use a proper heavy mutex.


I assume the author of the library works under a hard real-time constraint. Under such circumstances (an example would be low latency audio) you cannot tolerate the latency impact of a sporadic syscall.


Perhaps, but that is almost the opposite of what they said: in hard real time you can tolerate a longer average latency in return for a shorter maximum latency. That matches what I would expect from a lock-free data structure. But that doesn't match the (dubious) claim that locks are usually slower.


You make good points, I will try to rephrase that part of the README to be more accurate and explain benefits not mentioned yet like determinism.


You often can tolerate the latency. The problem of locks is they are potentially unbounded and you can miss a deadline.

It's not about performance so much as determinism.


Lock-free doesn't solve this. One thread will always make progress, but you have no way to ensure it is your thread, so you can still miss a deadline with lock-free. When the data is under a lot of contention across many cores this is an issue (most of us don't have hundreds of cores, so we don't see this).

Generally lock-free is better for these situations: odds are that when you hold a lock, at least some CPU cycles are spent on something that isn't directly modifying the data that needs the lock - with lock-free, the other CPU can touch the data during those cycles. (Note that when using a lock there is a trade-off: often it is better to hold the lock for longer than needed instead of dropping it, doing a couple of operations, and then locking again.)


If you must make progress locally (not just globally) the guarantee you need is wait freedom which is even stronger than lock freedom.

(All wait free algorithms are lock free because necessarily if every thread is guaranteed to eventually make progress then overall progress is definitely made)


Should be benchmarked against ->

https://github.com/Deaod/spsc_queue

If proven faster OK.. If not.. Well.. back to the drawing board.

I gave it a try -> https://github.com/andersc/fastqueue

Deaod is the kingpin.


https://max0x7ba.github.io/atomic_queue/html/benchmarks.html for an existing set of benchmarks where this could be added


I think I lean towards per-thread sharding instead of mutex-based or lock-free data structures, except for lockfree ringbuffers.

You can get embarrassingly parallel performance if you split your data by thread and aggregate periodically.

If you need a consistent view of your entire set of data, that is a slow path with sharding.

In my experiments with multithreaded software I simulate a bank where many bank accounts are randomly withdrawn from and deposited to. https://github.com/samsquire/multiversion-concurrency-contro...

  ShardedBank.java
  ShardedBank2.java
  ShardedBankNonRandom.java
  ShardedHashMap.java
  ShardedNonRandomBank2.java
  ShardedTotalOrder.java
  ShardedTotalRandomOrder.java
I get 700 million requests per second over 12 threads due to the sharding of money over accounts. Here I prioritise throughput of transactions per second over balance checks (a consistent view of the data).
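The general shape of that sharding, in illustrative C++ rather than the linked Java code: each thread owns one shard and updates it with no cross-thread synchronization, and a consistent total is only computed on the slow path by walking all shards.

    #include <atomic>
    #include <cstdint>
    #include <vector>

    struct Shard {
        std::atomic<int64_t> balance{0};
        // Pad to a cache line so neighbouring shards don't false-share.
        char pad[64 - sizeof(std::atomic<int64_t>)];
    };

    class ShardedCounter {
        std::vector<Shard> shards_;
    public:
        explicit ShardedCounter(size_t num_threads) : shards_(num_threads) {}

        // Fast path: each thread only ever touches its own shard.
        void add(size_t thread_id, int64_t delta) {
            shards_[thread_id].balance.fetch_add(delta, std::memory_order_relaxed);
        }

        // Slow path: aggregate across shards for a (loosely) consistent view.
        int64_t total() const {
            int64_t sum = 0;
            for (const auto& s : shards_)
                sum += s.balance.load(std::memory_order_relaxed);
            return sum;
        }
    };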

I am also experimenting with left-right concurrency control, which trades memory usage for performance: it basically keeps two copies of the data, one that is currently being read and written to, and another inactive copy. You switch the data structures around periodically to "make changes visible" to that thread.


This is a great way to structure your code if it’s possible to do so, but this isn’t always the case unfortunately :(


- A lot of code won't work for types with no default constructors, but that is at least a compile error

- Using memcpy[0] for arbitrary types is just wrong, see [1]

[0] https://github.com/DNedic/lockfree/blob/main/lockfree/inc/bi...

[1] https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2023/p11...


This is already noted in the Queue README: only the Queue constructs the type, the other 2 data structures are meant for PODs only.

I will take a look at adding support for constructing in-place for the other 2 data structures, but at the moment, they are just for PODs.


An issue in C++ is that it only supports atomic changes to the builtin types. For example, you can only CAS a 64-bit value if your largest integer/pointer type is 64-bits.

Good lock free algorithms use double-width instructions like cmpxchg16b which compare 64-bits but swap 128-bits. You can then use the compared 64-bits as a kind of version number to prevent the a-b-a problem.

Using only the built-in atomics is working with a hand tied behind your back. With the wider version, it's trivial to write multi-producer multi-consumer stacks with no limits to the number of objects stored. It's also pretty easy (if you copy the published algorithm) to do the same with queues.
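A hedged sketch of the tagged-pointer / version-counter idiom being described: the pointer and a monotonically increasing tag are CASed together as one 16-byte unit, so a recycled node can't be mistaken for the old one (the ABA problem). Whether this compiles down to cmpxchg16b rather than a library lock depends on the compiler and flags, as discussed below; node reclamation is out of scope here (the tag protects against ABA, not use-after-free).

    #include <atomic>
    #include <cstdint>

    struct Node { Node* next; };

    struct alignas(16) Head {
        Node*    ptr;
        uint64_t tag;   // bumped on every successful pop, so reuse is detectable
    };

    class Stack {
        std::atomic<Head> head_{Head{nullptr, 0}};
    public:
        // Assumes nodes are pooled and never returned to the OS while in use.
        Node* pop() {
            Head old = head_.load(std::memory_order_acquire);
            while (old.ptr != nullptr) {
                Head desired{old.ptr->next, old.tag + 1};
                if (head_.compare_exchange_weak(old, desired,
                                                std::memory_order_acquire))
                    return old.ptr;
                // On failure, old was refreshed with the current head; retry.
            }
            return nullptr;
        }
    };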


Actually C++ only requires TriviallyCopyable for std::atomic. The issue with 2CAS is that intel until very recently only provided cmpxchg16b[1] but no 128-bit atomic loads and stores: SSE 128 bit memory operations were not guaranteed to be atomic (and in fact were observed not to be on some AMDs).

So a 128-bit std::atomic on intel was not only suboptimal, as the compiler had to use 2CAS for loads and stores as well, but actually wrong, as an atomic load from read-only memory would fault. So at some point the ABI was changed to use the spinlock pool. Not sure if it has changed since.

If you do it "by hand", when you only need a 2cas, a 128 bit load that is not atomic is fine as any tearing will be detected by the CAS and 'fixed', but it is hard for the compiler to optimize generic code.

[1] which actually does full 128bit compare and swap, you are probably confusing it with the Itanium variant.


128-bit aligned loads and stores are guaranteed to be atomic on all intel and amd cpus that support avx. And if your cpu doesn't support avx, it probably doesn't have enough cores that the performance of concurrent data structures matters that much.


> Not sure if it has changed since.

It hasn't, Clang/GCC emit a cmpxchg16b only if you opt-in with `-mcx16`, which changes the ABI.


The lack of DWCAS as a primitive in general is really annoying, C++ aside. RISC-V's -A extension has no form of it, either; you only get XLEN-sized AMO + LL/SC (no 2*XLEN LL/SC either!)

It's one of those features that when you want it, you really really really want it, and the substitutions are all pretty bad in comparison.


> Good lock free algorithms use double-width instructions like cmpxchg16b which compare 64-bits but swap 128-bits

The instructions should compare 128 bits and swap 128 bits.

I don't know why 'good' algorithms would use these if they don't need to, because 128 bit operations are slower.

Not only that, 128 bit compare and swap doesn't work if it is not 128 bit aligned while 64 bit compare and swap will work even if they aren't 64 bit aligned.


On x86, any CAS on a misaligned address that crosses a cache line boundary can fault in the best case (if the mis-feature is disabled by the os) or cost thousands of clock cycles on all cores. So it "works" only for small values of "works".


That's over a cache line boundary, but 128-bit CAS doesn't even work when it's merely unaligned, so you can't do things like swap two pointers, then move down 64 bits and swap two more pointers.


True, that would help immensely in creating MPMC data structures, but as these are SPSC there is no problem. Also to clarify, this is only for the indexes, the data members can be anything.

Using these intrinsics or inline assembly would break portability or create situations where platforms have different feature levels, which is not something I intend to do. I want the library to be compatible with everything from a tiny MCU to x86.


I've had good luck assuming double-word CAS, portability-wise. Old ARMs have 32 bit pointers, so 64 bit CAS is pretty good. The main problem is that some algorithms go from a bit under 64 bits for a nonce to a bit under 32, which starts to get into "this could hit in practice" territory.


Maybe I don't understand the concept, but aren't lock-free structures "just" delegating the locking mechanism to the CPU via atomic operations? (Although even if that's the case I can understand the speedup.) (And if so, why aren't all those lockfree structures the default, instead of using mutexes?)


Atomic operations can't be meaningfully said to perform any locking.

In any case lock-free is defined in terms of progress guarantees.


I've only looked at the queue implementation, but both push and pop contain obvious race conditions; I would highly suggest adding tests that actually use the data structures from multiple threads.


Could you elaborate on the alleged race conditions? Any advice on reliably testing the race conditions? The problem with adding those is the fact that they will give lots of false negatives and if you rely on them you have a problem.


You could use TLA+ to model the data structure operations and check the invariant. Checking the invariant with assert is also useful in my limited experience with concurrency.

https://lamport.azurewebsites.net/tla/tla.html


Thanks a lot, will check this out!


Looking at the Push operation defined in queue_impl.hpp, if multiple threads perform concurrent pushes, they might end up writing their element to the same slot in _data since the current position _w is not incremented atomically


This is a multi-producer scenario; the README clearly states that these data structures are only single producer single consumer multi-thread/interrupt safe.

I will also add disclaimers to the javadocs comments on methods just to reduce confusion.


Whoops, my bad


Just add a lock ;)


During my entire career as a systems programmer, I have only seen 1 situation where lock-free is actually justified; 99% of the time it is completely unnecessary and even harmful.


Every data structure is lock-free. Locks are required when you have multiple writers. The article states they are useful only for certain circumstances: single consumer, single producer scenarios. So yeah, within these assumptions you can make something work.


Note that "lock-free" is a technical term that has a very specific, precise meaning (implying that the user of the structure does not need to use locking explicitly).


Even in single producer single consumer scenarios you need locks for multithreaded/interrupt use if you're not properly using atomics and fences.


You may need a lock if you have operations that are not atomic, even with a single writer, as a reader could find an inconsistency.


This seems unnecessarily pedantic. Lock-free conventionally implies concurrent, otherwise it's meaningless.


Lock-free doesn't mean "doesn't have locks". It means "doesn't need locks to be used concurrently".





