There's a difference between (A) locking (waiting, really) on access to a critical section (where you spinlock, yield your thread, etc.) and (B) locking the processor to safely execute a synchronization primitive (mutexes/semaphores).
CAS is "lock free" only in the sense that it doesn't require the processor to stop the world in order to flip the mutex boolean. It's still a mutex, and it still gates access to a critical section, and you still need some kind of strategy to deal with waiting for the critical section to become available (e.g., spinlocking, signaling the OS to sleep thread execution).
An example of an operation using coarse locks would be a Java method declared with the "synchronized" keyword.
If a thread T is executing a synchronized method on a particular object,
no other concurrent thread can invoke any other synchronized method on the same object.
No, CAS is lock free in the sense that a process or thread that freezes for whatever reason while doing a CAS operation cannot cause the entire system to stop making progress (provided your hardware is working as intended).
That's the only thing that matters to say if something is 'lock free' or not, and CAS meets that definition.
If you want some other property that's fine, but that's not what anyone understands by 'lock free'.
> I don't like usage of the term "lock free" to mean "locks that are less costly".
There's a difference between being formally "lock free", meaning "the system is guaranteed to not globally deadlock", and this informal use of "lock free", meaning "CAS is more efficient because it doesn't block execution of other threads while obtaining a mutex".
This article's intro:
Data structures (and their corresponding methods) implemented with coarse grained locks
are hard to scale in highly parallel environments and workloads where the structure is required
to be accessed by several concurrent threads simultaneously (parallelism).
CAS can/"must" be used to implement normal mutex locks, but this definitely isn't it's only use. The only way you could view CAS itself as locking something is at a low cache-coherency level, where it requires the processor to lock a cache line (https://en.wikipedia.org/wiki/MESI_protocol ); however, this is at a level that isn't semantically observable: it is atomic with respect to things like OS thread scheduling, unlike mutexes. In some sense, this is the critical reason that CAS and lock-free programming is interesting, as it avoids problems like priority inversion, and gives the guarantee of at least one thread always making progress.
> That's just putting one giant mutex around all access to the object. Having finer granularity != "lockless".
Is this a comment about the article, or just a general statement? Because the article doesn't propose finer granularity locks as a lockless solution.
Because we are no longer taking explicit locks,
there is no guarantee that insert operation will succeed.
Therefore, CAS based algorithms are usually implemented
using a loop (aka CAS loop) to retry the operation when it fails.
> In some sense, this is the critical reason that CAS and lock-free programming is interesting
Right. The article's preface criticizes implementations of mutexes in C++ and Java as unscalable, which is certainly a problem, but categorically a different concern from that of designing lock-free algorithms or data structures.
I personally find it makes more sense to think of this as a transactional thing (one thread invalidates the transactions of others), rather than a mutex-esque critical section.
In any case, the article (accidentally or otherwise) does actually use the correct technical term for code where some threads may be starved while others make progress (lock-free). If it was trying to discuss code that avoids this, it would be "Notes on Wait Free Programming", and would likely be significantly longer.
In this post, I will summarize my understanding of a non-blocking (lock free) design of linked list data structure...
While CAS can be used to implement spinlocks (and we agree those certainly aren't lock free), I would not label every use of CAS in a loop a spinlock. If it's impossible for a thread "A" to fail (or otherwise be suspended) while holding the structure in a 'locked' or other intermediate state - that is, no suspension of "A" can stop other threads from continuing their desired execution, which is where dangers like deadlock come from - then it's not a 'lock' as I'd use the term.
Stretch the term "locking" much farther and you can start getting into philosophical debates about just how "lock free" an atomic increment really is. CPU cache coherency protocols between cores amount to something rather similar to locking when you get right down to it... at that point, you've abandoned "lock free" as a useful term (and should instead use terms like "share nothing", "single threaded", or "buggy" where you previously used "lock free".)
> You're still dealing with the problems inherent to your waiting strategy. This data structure doesn't prevent problems like starvation. In the pathological case, you'll get stuck forever in a spinlock trying to insert.
Potentially, yes. "Lock free" is not contention free. It is a different strategy for managing contention. It is not "cheaper locking" - throwing a mutex and locks at the problem can be cheaper than using "lock free" algorithms. Serialized access to a resource can be faster than two cores false sharing, constantly invalidating each other's cachelines.
FWIW, this seems to line up with the definitions on https://en.wikipedia.org/wiki/Non-blocking_algorithm , which suggests the term "wait free" to describe options that cannot deadlock or livelock (e.g. all threads have a guaranteed upper bound of operations, within which they'll make progress, regardless of the failure or suspension of other threads accessing the resource.)
But that's fine! Lock-free permits starvation. Not every data structure can solve every problem. If you can't have starvation then lock-free isn't enough. In which case, this article isn't for you.
There aren't deadlocks, but there is still waiting and potential starvation, all of which can be informally called "locks" and so we need to be careful about qualifying "locking".
This isn't a criticism of lock-free methodology, but a consideration of the ways in which "lock-free" could be interpreted.
> the ways in which "lock-free" could be interpreted
This is only a problem if you use a definition of 'lock' in 'lock-free' that isn't the same as everyone else's.
I guess I must be misunderstanding what you're trying to say if that isn't it. I'll stop criticising.
(Here, I use "synchronization primitive" to mean the processor instruction you use for synchronization. A construct such as a mutex or a spinlock is something you would build with a synchronization primitive.)
These theoretical concepts might or might not be that relevant for real-world applications. For example, in real-time audio (the main domain of interest to me for lock-free techniques), what you really care about is that the audio rendering thread will produce one buffer worth of sound before the hardware finishes outputting the previous buffer. This sounds like a match for "wait-free", but it depends on the details. If the non-audio thread can cause a 100x slowdown (but no more) that would technically be wait-free, but you'd miss your buffer, and so it's not acceptable. Another system might use mutexes, but be carefully engineered so that the locks are never held for more than 100µs (say, using locks designed to handle priority inversion), and thus reliably meet the needs even if it's not technically anything like lock-free.
But, as with anything, these concepts are useful tools, and I _love_ lock-free algorithms for audio.
In any case, in practice wait-free has this property a bit: getting the guarantee that all threads make progress in finite time generally requires them to execute more slowly individually, on average. E.g., one common strategy is for thread A to help thread B finish its work, because thread A needs that result.
The muddled concepts are:
waiting - what you do until you eventually acquire a contested resource (spinlock)
mutexes - objects used to enforce critical sections
stopping the world - a processor stopping the execution of all threads in order to ensure proper order of thread-critical instructions (is there a formal name for this?)
CAS - an atomic primitive often used to avoid the overhead of stopping the world to enter critical sections, or to avoid deadlock by implementing algorithms that do not have critical sections
critical section - a routine that must be executed by only one thread at a time
blocking - being unable to proceed with a given routine on account of another thread
non-blocking / async - threading in general, or, instead of spinlocking, yielding execution to another thread
deadlocking - a systematic failure wherein there is a cyclical dependency of threads on resources, and no thread can progress
All of these can be informally referred to by "locking", and lock-free (formal definition) algorithms avoid deadlocking by avoiding having critical sections or locks on resources at all. They're generally more performant, but that's coincidental.
If you have a critical section, and you implement a mutex using CAS, you've probably sped up your application because you no longer have to pause other threads. That's "lock free" in the sense that "stopping the world" is "locking" other threads, but not "lock free" in the sense of avoiding deadlock. If a thread enters a critical section with a CAS mutex and then waits forever, you can get deadlock.
A lock free data structure that deals with failed operations with a spin lock is not lock free.
A mutex is not lock free full stop.
Locks cause parallel applications to become single threaded for the scope of the lock. This means that most of the time you should just put things that require locks into their own thread and then communicate with that thread from others.
"Do not communicate by sharing memory; instead, share memory by communicating." - ancient go proverb
However, that is still potentially a locking operation, because channels must block until both threads are in sync (with some optional buffering, to help reduce the effect).
With truly lock free data structures, all threads can share data but do not need to synchronize access. This leads to much, much higher performance.
But none of that had a serious effect on C and C++ practice until C++ started formalizing the memory model. Even today, we still have to deal with a lot of legacy code that uses pre-memory model primitives such as __sync_synchronize, __atomic_cmpxchg, InterlockedCompareExchange, etc. I'd like to see all that stuff go sooner rather than later. In my opinion, the original article is not helping, as it shows C code that does reads and writes through pointers, expecting them to behave as atomic operations.
Like most undefined behavior, it can work in practice. Probably. Most of the time. If you stick to using older compilers.
Generally, when you protect a data structure with a lock, that means that the thread which has the lock has mutually exclusive access to it. Other threads which want access must wait. When you have many threads which want access, you will tend to have serialized that part of your program. That will tend to hurt performance, if you gain performance from parallelism.
Lock-free approaches tend to allow multiple threads to access and modify a structure without having to get mutually exclusive access. Yes, if two threads are trying to touch the exact same part of the structure, they will interfere: one will succeed and move on, and the other will fail and have to retry. That's not much of a win over locks (generally). But, if it's common for the threads to touch different places of your structure, then they can do so without interfering with each other. That tends to allow better performance through parallelism.
I'm using a lot of weasel words ("tends to", "generally") because there are tradeoffs. If you have a shared structure which is accessed by multiple threads very rarely, then it may actually be better to use locks. Lock-free data structures tend to be more complicated (and hence, take a bit longer in the uncontended case) than simply protecting a data structure with a lock. So if you're almost always going to have uncontended access, it may actually yield better performance to use a lock.
But, locks can also be problematic when they're held too long or indefinitely. What if a thread dies while holding a lock? Or does I/O? Or is rescheduled? Other threads which want the lock must wait. Lock-free approaches tend to be free of such issues.
You can use mutexes to do fine-grained locking and still get into a dead-lock. Your algorithm can use only CASs and still not be deadlock-free.
The general idea of lock free is to holistically design a data structure that guarantees that when you invoke its operations in any concurrent scenario, the overall system will make progress and won't get stuck.
The point of lock-free algos, though, is that you could usually do something different from just retrying. There is this concept/general approach in lock-free where if a CAS fails at a certain place, you could invoke helpers that would help some other part of the algorithm make progress. E.g. you may be inserting an element in some structure and your CAS fails at some level (usually because another thread is doing a delete). In that case, you would help the delete operation proceed by invoking some idempotent 'post_delete_cleanup' operation on the same node.
The idea of lock-free isn't that one particular thread won't block, but rather that the whole system would be making progress (i.e. no deadlock). It doesn't necessarily mean there will be no locks either.
Isn't it more like "mutex minimization", since you are trying to limit the critical section to a single atomic instruction (as opposed to more than one)?
FYI, this hasn't been the case in at least 20 years, on any of the major architectures. Cache coherency guarantees a single writer for any cache line aligned chunk of memory without any sort of bus lock.
I probably could have articulated my original question better, sorry.
By contention, I imply -- contention or conflicts on the same portions/pieces of structure.
If the data structure is subject to high concurrent access but in fairly disjoint regions of the data structure, we should try to implement a fine-grained locking scheme to increase the degree of parallelism and thus the scalability of the data structure.
This is where CAS comes into the picture. The article doesn't compare CAS to another locking primitive. Instead, CAS is just used as a mechanism to implement a fine-grained scheme where the methods operating on the data structure don't take global or coarse locks.
If the contention on the entire data structure is unreasonably high for some reason, then CAS based approaches or any flavors of it don't buy us much.
Goes without saying, spinlocks are not useful on uniprocessor or single-core machines.
Though from what I've read, on Linux, pthreads implements its mutexes in user space (via futexes) as well, and they are fairly cheap to use.
For example, when the "special case" of adding an element at the head or tail are discussed - those special cases disappear entirely when the search procedure simply returns a pointer to the next field of the preceding element (or a pointer to the root pointer of the linked list, if the new element should be inserted at the head).