Hacker News new | past | comments | ask | show | jobs | submit login
Notes on Lock Free Programming (loonytek.com)
92 points by ingve on Mar 17, 2017 | hide | past | favorite | 52 comments

I don't like usage of the term "lock free" to mean "locks that are less costly".

There's a difference between (A) locking (waiting, really) on access to a critical section (where you spinlock, yield your thread, etc.) and (B) locking the processor to safely execute a synchronization primitive (mutexes/semaphores).

CAS is "lock free" only in the sense that it doesn't require the processor to stop the world in order to flip the mutex boolean. It's still a mutex, and it still gates access to a critical section, and you still need some kind of strategy to deal with waiting for the critical section to become available (e.g., spinlocking, signaling the OS to sleep thread execution).

  An example of operation using coarse locks would be a method in Java with “synchronized” keyword.
  If a thread T is executing a synchronized method on a particular object,
  no other concurrent thread can invoke any other synchronized method on the same object.
That's just putting one giant mutex around all access to the object. Having finer granularity != "lockless".

> CAS is "lock free" only in the sense that it doesn't require the processor to stop the world in order to flip the mutex boolean

No, CAS is lock free in the sense that a process or thread that freezes for whatever reason while doing a CAS operation cannot cause the entire system to stop making progress (provided your hardware is working as intended).

That's the only thing that matters to say if something is 'lock free' or not, and CAS meets that definition.

If you want some other property that's fine, but that's not what anyone understands by 'lock free'.

Right, I think you looked past what I said, and why I put "lock free" in quotes:

> I don't like usage of the term "lock free" to mean "locks that are less costly".

There's a difference between being formally "lock free" meaning "the system is guaranteed to not globally deadlock", and this informal use of "lock free" meaning "CAS is more efficient because it doesn't block execution of other threads in obtaining a mutex"

This article's intro:

  Data structures (and their corresponding methods) implemented with coarse grained locks
  are hard to scale in highly parallel environments and workloads where the structure is required
  to be accessed by several concurrent threads simultaneously (parallelism).
is talking about performance, not termination guarantees, and that muddles the meaning of "lock free".

> CAS is "lock free" only in the sense that it doesn't require the processor to stop the world in order to flip the mutex boolean. It's still a mutex, and it still gates access to a critical section, and you still need some kind of strategy to deal with waiting for the critical section to become available (e.g., spinlocking, signaling the OS to sleep thread execution).

CAS can/"must" be used to implement normal mutex locks, but this definitely isn't it's only use. The only way you could view CAS itself as locking something is at a low cache-coherency level, where it requires the processor to lock a cache line (https://en.wikipedia.org/wiki/MESI_protocol ); however, this is at a level that isn't semantically observable: it is atomic with respect to things like OS thread scheduling, unlike mutexes. In some sense, this is the critical reason that CAS and lock-free programming is interesting, as it avoids problems like priority inversion, and gives the guarantee of at least one thread always making progress.

> That's just putting one giant mutex around all access to the object. Having finer granularity != "lockless".

Is this a comment about the article, or just a general statement? Because the article doesn't propose finer granularity locks as a lockless solution.

The linked list discussed in the article forgoes mutexes, but in exchange, operations aren't guaranteed to succeed:

  Because we are no longer taking explicit locks,
  there is no guarantee that insert operation will succeed.
  Therefore, CAS based algorithms are usually implemented
  using a loop (aka CAS loop) to retry the operation when
  CAS fails.
So you're still getting spinlocks (or context switches, etc.) assuming these inserts have to eventually succeed. Implicitly, the CAS does "lock" in the sense of forcing other accesses into retry loops. You're still dealing with the problems inherent to your waiting strategy. This data structure doesn't prevent problems like starvation. In the pathological case, you'll get stuck forever in a spinlock trying to insert.

> In some sense, this is the critical reason that CAS and lock-free programming is interesting

Right. The article's preface criticizes implementations of mutexes in C++ and Java as unscalable, which is certainly a problem, but categorically a different concern from that of designing lock-free algorithms or data structures.

One thread possibly being starved is very different to all threads being starved, as can happen with blocking algorithms (the classic example being the thread holding a lock is descheduled: no-one needing it can make any progress). Atomic instructions like CAS do not inherently have this problem, and can be used in ways to avoid it (such as how the article describes it), but the same is not true of mutexes.

I personally find it makes sense think of this is more of a transactional thing (one thread invalidates the transactions of others), rather than a mutex-esque critical section.

In any case, the article (accidentally or otherwise) does actually use the correct technical term for code where some threads may be starved while others make progress (lock-free). If it was trying to discuss code that avoid this, it would be "Notes on Wait Free Programming", and would likely be significantly longer.

  In this post, I will summarize my understanding of a non-blocking (lock free) design of linked list data structure...
The author conflates blocking with lock-free. Starvation is a form of blocking, non-blocking implies no starvation.

It isn't a form of blocking, and it doesn't imply that, in the conventional/widespread definitions of these terms: "Lock-freedom allows individual threads to starve but guarantees system-wide throughput" https://en.wikipedia.org/wiki/Non-blocking_algorithm#Lock-fr... . The subclassification of non-blocking that avoids starvation is wait-free ("Wait-freedom is the strongest non-blocking guarantee of progress, combining guaranteed system-wide throughput with starvation-freedom", from the Wikipedia article above).

In that case, I guess we're just drawing from different glossaries. I've always heard "blocked" used to mean "can't execute because of something other threads are doing" regardless of whether that's starvation or getting slept by a mutex.

I don't think that's a definition that makes sense - consider Thread.sleep(Integer.MAX_VALUE). I think we would all agree that the thread is blocked, but not on the progress of other threads.

> So you're still getting spinlocks (or context switches, etc.) assuming these inserts have to eventually succeed.

While CAS can be used to implement spinlocks (and we agree this certainly isn't lock free), I would not label all uses of CAS with loops as spinlocks. If it's impossible for another thread "A" to fail (or otherwise suspend) holding a mutex in a 'locked' or other intermediate state, such that no other thread can continue their desired execution while that thread "A" is suspended - coinciding with e.g. the possible dangers of deadlocks - it's not a 'lock' as I'd use the term.

Stretch the term "locking" much farther and then you can start getting into philosophical debates about just how "lock free" is an atomic increment? CPU cache coherency protocols between cores amount to something rather similar to locking when you get right down to it... at this point, you abandon "lock free" as a useful term (and should instead use terms like "share nothing", "single threaded", or "buggy" where you previously used the term "lock free".)

> You're still dealing with the problems inherent to your waiting strategy. This data structure doesn't prevent problems like starvation. In the pathological case, you'll get stuck forever in a spinlock trying to insert.

Potentially, yes. "Lock free" is not contention free. It is a different strategy for managing contention. It is not "cheaper locking" - throwing a mutex and locks at the problem can be cheaper than using "lock free" algorithms. Serialized access to a resource can be faster than two cores false sharing, constantly invalidating each other's cachelines.

FWIW, this seems to line up with the definitions on https://en.wikipedia.org/wiki/Non-blocking_algorithm , which suggests the term "wait free" to describe options that cannot deadlock or livelock (e.g. all threads have a guaranteed upper bound of operations, within which they'll make progress, regardless of the failure or suspension of other threads accessing the resource.)

> This data structure doesn't prevent problems like starvation

But that's fine! Lock-free permits starvation. Not every data structure can solve every problem. If you can't have starvation then lock-free isn't enough. In which case, this article isn't for you.

You realize you cherry-picked a snippet of my post, and then made the exact point I just made, which is:

There aren't deadlocks, but there is still waiting and potential starvation, all of which can be informally called "locks" and so we need to be careful about qualifying "locking".

This isn't a criticism of lock-free methodology, but a consideration of the ways in which "lock-free" could be interpreted.

> all of which can be informally called "locks"

> the ways in which "lock-free" could be interpreted

This is only a problem if you use a definition of the 'lock' in 'lock-free' that isn't the same as everyone else's.

I guess I must be misunderstanding what you're trying to say if that isn't it. I'll stop criticising.

Other commenters have provided great responses. But to add my voice to the mix: CAS is definitely not a mutex. A mutex, by definition of the term, enforces mutual exclusion. CAS cannot do that because it is a synchronization primitive. You can use CAS to implement a mutex. Or you can use CAS to implement synchronization which does not require mutual exclusion. Your synchronization primitive does not determine if your approach is lock-based or not. It's what you do with that synchronization primitive that determines if your approach is lock-based or not.

(Here, I use "synchronization primitive" to mean the processor instruction you use for synchronization. A construct such as a mutex or a spinlock is something you would build with a synchronization primitive.)

I've always understood lock-free to mean that every worker is guaranteed to make progress in a (finite) bounded amount of time.

I like the terminology that Dmitry Vyukov uses[0], where "wait-free" means that any individual thread is guaranteed to make progress, and "lock-free" means that some thread is guaranteed to make progress. There's also the weaker guarantee "obstruction-free" where livelock is a possibility.

[0] http://www.1024cores.net/home/lock-free-algorithms/introduct...

(It's worth noting that those are the standard terms for each of those concepts: https://en.wikipedia.org/wiki/Non-blocking_algorithm )

Couldn't there be an even stronger requirement like "obstruction-free"? I could imagine code that is "obstruction free" but which severely slows down execution for itself or other threads.

These are theoretical concepts. If there's a bound on the slowdown, it's lock-free. If there's no bound, it's obstruction free.

These theoretical concepts might or might not be that relevant for real-world applications. For example, in real-time audio (the main domain of interest to me for lock-free techniques), what you really care about is that the audio rendering thread will produce one buffer worth of sound before the hardware finishes outputting the previous buffer. This sounds like a match for "wait-free", but it depends on the details. If the non-audio thread can cause a 100x slowdown (but no more) that would technically be wait-free, but you'd miss your buffer, and so it's not acceptable. Another system might use mutexes, but be carefully engineered so that the locks are never held for more than 100µs (say, using locks designed to handle priority inversion), and thus reliably meet the needs even if it's not technically anything like lock-free.

But, as with anything, these concepts are useful tools, and I _love_ lock-free algorithms for audio.

Obstruction-free is the weakest requirement, so your use of "even stronger" confuses me.

In any case, in practice wait-free has this property a bit: getting the guarantee of all threads making progress in finite time generally requires in them executing slower individually, on average, e.g. one common strategy is for thread A help thread B finish its work, because thread A needs that result.

Executing a little slower would be fine. Executing two orders of magnitude slower can be almost as useless as being obstructed.

I've only ever heard that with "lock" being qualified as a deadlock, not a synchronization primitive.

Non-Blocking works a bit better as a phrase, in my opinion.

Oh god, "non-blocking" is even worse, no thanks to Javascripters. Lock-free data structures can wait, and can block if they deal with failed operations by spinlocking.

The muddled concepts are: waiting - what you do until you eventual acquire a contested resource (spinlock) mutexes - objects used to enforce critical sections stopping the world - a processor stopping the execution of all threads in order to ensure proper order of thread-critical instructions (is there a formal name for this?) CAS - an atomic primitive often used to avoid the overhead of stopping the world to enter critical sections, or to avoid deadlock by implementing algorithms that do not have critical sections critical section - a routine that must be executed by only one thread at a time blocking - being unable to proceed with a given routine on account of another thread non-blocking / async - threading in general, or, instead of spinlocking, yielding execution to another thread deadlocking - a systematic failure wherein there is a cyclical dependency of threads on resources, and no thread can progress

All of these can be informally referred to by "locking", and lock-free (formal definition) algorithms avoid deadlocking by avoiding having critical sections or locks on resources at all. They're generally more performant, but that's coincidental.

If you have a critical section, and you implement a mutex using CAS, you've probably sped up your application because you no longer have to pause other threads. That's "lock free" in the sense that "stopping the world" is "locking" other threads, but not "lock free" in the sense of avoiding deadlock. If a thread enters a critical section with a CAS mutex and then waits forever, you can get deadlock.

You keep using terms in a non standard way. Please don't cinfuse matters.

A lock free data structure that deal with failed operations with a spin lock is not lock free.

A mutex is not lock free full stop.

Perhaps an example will help clarify the difference between mutex locking and CAS (lock-free) operations. Check out Sled[0]. Sled is an implementation of a 'ctrie'[1] data structure in go. Sled is a lock free key value store. Because it is lock-free any number of threads can read and write to it simultaneously while never interfering with eachother. The application remains parallel at all times.

Locks cause parallel applications to become single threaded for the scope of the lock. This means that most of the time you should just put things that require locks into their own thread and then communicate with that thread from others.

"Do not communicate by sharing memory; instead, share memory by communicating." - ancient go proverb

However, that is still potentially a locking operation because channels must block until both threads are in sync (with some optional caching, to help reduce the effect).

With truly lock free data structures, all threads can share data but do not need to synchronize access. This leads to much, much higher performance.

[0]: https://github.com/Avalanche-io/sled

[1]: https://github.com/Avalanche-io/sled/blob/master/ctries_pape...

Note that the linked paper predates the C++ / C11 memory model, which I would consider essential for anyone implementing lock-free algorithms in C or C++ today. It also long predates an understanding of the ABA problem[0].

[0] http://www.stroustrup.com/isorc2010.pdf

Eh? ABA problem was known since a long time ago. Google scholar[1] shows a paper titled "Correction of a Memory Management Method for Lock-Free Data Structures" in 1995, which in turn cites "ABA problem" mentioned in "System/370 Principles of Operation. IBM Corporation, 1983" (!!).

[1] https://scholar.google.com/scholar?q=aba+problem

I stand corrected, thanks! I had the feeling that the ABA problem affected a lot of the earlier lock-free algorithms, but didn't know the understanding of it went that far back.

The C/C++ memory model is also inspired by the Java memory model which dates back to the mid 90s and was updated back in 2004.

Absolutely. The Java ecosystem deserves a huge amount of credit for putting lock-free programming on a theoretically sound basis, and it's informative how long it took to get it right.

But none of that had a serious effect on C and C++ practice until C++ started formalizing the memory model. Even today, we still have to deal with a lot of legacy code that uses pre-memory model primitives such as __sync_synchronize, __atomic_cmpxchg, InterlockedCompareExchange, etc. I'd like to see all that stuff go sooner rather than later. In my opinion, the original article is not helping, as it shows C code that does reads and writes through pointers, expecting them to behave as atomic operations.

The good thing is that those legacy operations can be now understood in term of the memory model.

I think that's only half-useful, at best. If all your access to atomics happened to be mediated by, say, __atomic_cmpxchg, then modeling that as (say) atomic_compare_exchange_strong would give you confidence in the implementation. The problem is that pre-memory model code is also likely to check the current state of the atomic through a raw pointer load instead of anything like atomic_load (which generally has no counterpart in pre-memory model synchronization libraries). So, analyzing that code through the lens of C11 tells you little more than that it's undefined behavior.

Like most undefined behavior, it can work in practice. Probably. Most of the time. If you stick to using older compilers.

I don't quite understand it. When you replace lock with CAS (Compare and Swap), aren't you just replacing one synchronization primitive with another? Is lock not implemented in CPU as a primitive instruction and therefore less performant? What overhead does locking have compared to CAS?

This post doesn't get into the why very much, which is what you're asking about. Yes, locks will eventually be implemented with a primitive instruction. But it's not so much the cost of the mechanism but what those mechanisms allow you to do.

Generally, when you protect a data structure with a lock, that means that the thread which has the lock has mutually exclusive access to it. Other threads which want access must wait. When you have many threads which want access, you will tend to have serialized that part of your program. That will tend to hurt performance, if you gain performance from parallelism.

Lock-free approaches tend to allow multiple threads to access and modify a structure without having to get mutually exclusive access. Yes, if two threads are trying to touch the exact same part of the structure, they will interfere: one will succeed and move on, and the other will fail and have to retry. That's not much of a win over locks (generally). But, if it's common for the threads to touch different places of your structure, then they can do so without interfering with each other. That tends to allow better performance through parallelism.

I'm using a lot of weasel words ("tends to", "generally") because there are tradeoffs. If you have a shared structure which is accessed by multiple threads very rarely, then it may actually be better to use locks. Lock-free data structures tend to be more complicated (and hence, take a bit longer in the unconteded case) than simply protecting a data structure with a lock. So if you're almost always going to have uncontended access, it may actually yield better performance to use a lock.

But, locks can also be problematic when they're held too long or indefinitely. What if a thread dies while holding a lock? Or does I/O? Or is rescheduled? Other threads which want the lock must wait. Lock-free approaches tend to be free of such issues.

I see what you're saying, but it seems to me that your description can be misunderstood to mean that lock-free programming is simply more fine-grained locking, which it is very far from.

You can use mutexes to do fine-grained locking and still get into a dead-lock. Your algorithm can use only CASs and still not be deadlock-free.

The general idea of lock free is to holistically design a data structure that guarantees that when you I invoke its operations in any concurrent scenario, the overall system will make progress and won't get stuck.

Agreed. But in practice, for what I have worked on, the scalability benefits matter more - and fine grain locks don't let you scale as much as lock-free techniques.

A spin-lock would be more like a CAS loop (and the CAS IS a primitive CPU instruction), where you retry until the CAS succeeds.

The point of lock-free algos, though, is that you could usually do something different from just retrying. There is this concept/general approach in lock-free where if a CAS fails at a certain place, you could invoke helpers that would help some other part of the algorithm make progress. E.g. you may be inserting an element in some structure and your CAS fails at some level (usually because another thread is doing a delete). In that case, you would help the delete operation proceed by invoking some idempotent 'post_delete_cleanup' operation on the same node.

The idea of lock-free isn't that one particular thread won't block, but rather that the whole system would be making progress (i.e. no deadlock). It doesn't necessarily mean there will be no locks either.

I've also had some trouble reading(and probably really understanding) the term "lock free algorithms", since things like CAS are always mentioned and a CAS instruction locks the memory bus like any memory access.

Isn't it more like "mutex minimization" since you are trying to limit the critical section to a single a single atomic instruction(as opposed to more than one)?

>CAS instruction locks the memory bus like any memory access.

FYI, this hasn't been the case in at least 20 years, on any of the major architectures. Cache coherency guarantees a single writer for any cache line aligned chunk of memory without any sort of bus lock.

Sure, but even with MESI a CAS translates into an exclusive lock on the cache line by a single core no?

I probably could have articulated my original question better, sorry.

I guess you could call holding a cacheline in the exclusive state the same as holding a lock. The difference, I think, is that there's an hard upper bound on response time on serving any remote request for sharing.

The point is that overhead of locking is too much when the contention on the shared structure is fairly less. The cost of locking/unlocking/blocking/waiting or any other work done as part of enforcing synchronization is something that can be optimized away if the contention on the data structure is less.

By contention, I imply -- contention or conflicts on the same portions/pieces of structure.

If the data structure is subject to high concurrent access but in fairly disjoint regions of the data structure, we should try to implement fine grained locking scheme to increase the degree of parallelism and thus the scalability of data structure.

This is where CAS comes into picture. The article doesn't compare CAS to another locking primitive. Instead, CAS is just used as a mechanism to implement fine grained locking scheme where the methods operating on the data structure don't take global or coarse locks.

If the contention on the entire data structure is unreasonably high for some reason, then CAS based approaches or any flavors of it don't buy us much.

Besides other benefits mentioned, a spinlock is a user-space construct; you don't incur the overhead switching to kernel mode at the loss of the benefits and guarantees those primitives provide. (I'm simplifying some.)

I think spinlock is used when the critical section is relatively smaller where the threads are likely to perform small amount of work after acquiring locks. Thus locks are not expected to be held for longer duration. The other threads contending for the lock prefer to just spin and poll (and of course burn CPU) with the hope that lock will soon be relinquished by the owner thread. This way spinning threads avoid the overhead of context switching as well.

Goes without saying, spinlocks are not useful on uniprocessor machines or single core machines.

On Win32, this has been abstracted into the critical section API, which lives entirely in user-space and is a lightweight mutex. Many of the functionality implemented in the article are already part of the API (spin-aquire, etc.).

Though from what I've read, on Linux, pthreads implements its mutexes in user-space (futex) as well and are fairly cheap to use.

Yes, the difference is that spin locks never enter the kernel, while a good mutex only does on contention. In practice most mutexes use an hibrid approach.

Pet peeve: The example linked list implementation suffers from not using double-pointers (or pointers-to-pointers).

For example, when the "special case" of adding an element at the head or tail are discussed - those special cases disappear entirely when the search procedure simply returns a pointer to the next field of the preceding element (or a pointer to the root pointer of the linked list, if the new element should be inserted at the head).

This post directly compliments the original one http://concurrencyfreaks.blogspot.com/2017/02/bitnext-lock-f...

Are there any lock-free in-memory spatial databases?

Last time I checked lock free meant no locks.

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact