A Wait-Free Stack

kumagi · on July 17, 2016

I already wrote wait-free stack https://gist.github.com/kumagi/d259274270fdc1385f81 It is much difficult than lock-free stack. https://gist.github.com/kumagi/b9a4715b1ce0dd511922

And published as book(in Japanese sorry) http://longgate.co.jp/books/grimoire-vol3.html

amaks · on July 17, 2016

You call malloc in your implementation -- you realize that memory allocation takes lock at some point, right? To make it truly lock or wait free you need to implement a corresponding memory allocator as well.

arunmoezhi · on July 18, 2016

Most of the concurrent lock-free search trees published in literature do not even give a garbage collection strategy(assuming the implementation language has no automatic garbage collection). But they still claim lock-free or wait-free. They make assumption that memory can be reclaimed using recent techniques provided in the literature. I'm not sure sure if there is a lock-free Malloc. But that is not the problem the algorithm is trying to solve.

And almost all these algorithms use Compare-And-Exchange(CAS) instruction which internally uses locks.

_ij0r · on July 17, 2016

With that reasoning, you should write your own kernel too -- you realize that the kernel scheduler will take locks at some point, right?

I think that malloc is a sufficiently abstract operation here that its implementation shouldn't constitute whether the algorithm as a whole is lock-free or not.

haberman · on July 18, 2016

Lock-free algorithms don't usually depend on an OS being present. They could just as easily run in an environment that has no scheduler. But if the algorithm calls malloc, that is a hard dependency.

If an algorithm depends on malloc, it needs to prove that lock-free/wait-free malloc() exists before calling itself lock/wait-free.

amaks · on July 17, 2016

Not really. Lots of applications where lock free algorithms are justified usually implement their own memory allocators.

cjensen · on July 18, 2016

Malloc can be lock free if you add a thread which wakes up periodically and ensures there is free memory available.

Also, any algorithm can be made mostly malloc free if you keep a freelist instead of freeing memory. Malloc will then only be called enough times to fill the max utilization.

kumagi · on July 19, 2016

In short, glibc malloc is wait-free. In long, You may not know that terminology of "lock-free" does not mean that it does not mean never using locks.

kumagi · on July 17, 2016

This book is published in 2013. Earlier than this arxiv paper.

lrem · on July 17, 2016

It's sad to be in "the other cultural circle", isn't it? Imagine that in western Europe a number of theorems by eastern European mathematicians is known by the name of the western professor who would propagate it...

Also: I feel there should be a comma in the sentence above, but can't figure out where to put it :/

ktta · on July 17, 2016

I understand what you're trying to convey and agree. I thought you might not have noticed the email address of the authors. Two of the authors' email addresses are of IIT Delhi, one of the universities managed by the Government of India.

So these authors are in "the other cultural circle" too. One in which there's more than 10 times competition in the educational system for funding, compared to not only western and European counterparts but the Japanese one too, which the parent comment I think is/was a part of. Not just funding for cool projects. There is a very dire need for good universities but higher education in a good university is very difficult to have.

numbsafari · on July 18, 2016

I think you want to use commas to set off "in Western Europe". I think it's referred to as an "aside".

jsprogrammer · on July 17, 2016

After the 3, in place of the period.

zerr · on July 17, 2016

Do you suggest that they've used your findings without referencing you? You can make an inquiry I believe. Other than that, if their method is completely different or they've arrived to the same solution independently - there is nothing wrong. It is not about the competition who was first, isn't it?

egwor · on July 17, 2016

I think that in this day and age we ought to have a way to share knowledge so that we aren't repeating the same (completed) research unknowingly.

In maths it is thought to be beneficial to come up with a diferent proof of the same theorem though, so maybe in computing the similar idea that coming up with another implementation/demonstration/proof has value too

xchip · on July 17, 2016

you rock! So what do the threads do when they are idle? Don't you put them to sleep?

duiker101 · on July 17, 2016

So uhm the top 2 link of HN in the last 5 hours have been about this and while they are highly voted there is very little discussion going. Can someone give some context as to why this is attracting so much (surely deserved) attention? How will this affect us? Is it likely to have a deep effect on general computing performances? In what ways will this be applied? I'm sorry if this are silly questions but I find interesting the lack of discussion going on.

badestrand · on July 17, 2016

I remember the topic of lock free data structures from my time working with C++ a few years ago.

C++ is a lot about performance. To do things quicker you create threads which execute code in parallel physically. Since they mess up when working on the same data structure (e.g. a stack) at the same time locks were introduced. So when pushing a new element onto a stack you acquire the lock, perform the push and free the lock. Unfortunately, of that whole operation acquiring/freeing the lock takes 99% of the time. Thus lock-free data structures were developed which magically did not need locking and thus could operate blazingly fast. I understand the article now introduces a lock-free stack.

As I understand these data structures have no effects on multi-system environments and can only be applied to a single system. This is because they rely on atomic CPU operations that are not available (natively) on distributed systems.

How does that affect us? Not much I guess. I have never seen those data structures widely used. In my understanding almost no application benefits noticably(!) from the speed up. Maybe games, but I don't think so. Nonetheless it is always nice to see basic building blocks improved.

banachtarski · on July 17, 2016

Lock free != wait free

> As I understand these data structures have no effects on multi-system environments and can only be applied to a single system

I have no idea where to even begin with this statement. By this logic, no optimizations should be made to operations in a single threaded or single process environment. Distributing computation helps you achieve scale and HURTS latency except for extremely long running tasks.

And yes in games these sorts of things REALLY matter. Things are measured in tens to hundreds of microseconds and optimized accordingly. Not just games, but have you noticed how slow software has gotten of late? Deep learning, rendering, neural networks, AI, etc. Heck, even the browser that I'm typing this in is slow as shit and likely using far more memory than it is supposed to. What about compilers? Those things use trees, stacks, queues, etc and are getting slower all the time. It's the attitude that programmers don't need to care about this stuff that will permanently separate, in my opinion, engineers at the top of the curve from the rest.

hueving · on July 18, 2016

>that will permanently separate, in my opinion, engineers at the top of the curve from the rest.

Precisely. The engineers at the top will understand cost benefit analysis and be okay with sub-optimal solutions while the rest sit and rant about how slow something is.

badestrand · on July 18, 2016

Sorry, but that optimization elitarism does not get anyone anywhere. Most software written today is quite high level, be it the >2 million apps or the >1 billion websites. In my household neither the smoke detector, toaster or washing mashine needs faster executed code.

I understand where you are coming from but for optimizations such low level considerations should be among the last because they matter so little. Most of the time an additional caching structure or the like solves any performance issues.

banachtarski · on July 20, 2016

I mention in my reply specific applications where performance really does matter. I don't understand why issuing the strawman of pointing to things that I also trivially acknowledge are not performance sensitive really helps any.

amaks · on July 17, 2016

> Thus lock-free data structures were developed which magically did not need locking and thus could operate blazingly fast.

This is wrong assumption. With lock-free data structures you don't skip waiting for your right to access data, you just avoid using lock. It may or may not be faster depending on conditions, and whole lot harder to get it right:

http://blog.memsql.com/common-pitfalls-in-writing-lock-free-...

zzzcpan · on July 17, 2016

> almost no application benefits noticably(!) from the speed up

A lot of applications could benefit from a significant speed up. But atomic operations cannot help with that, they are too slow, they still have to do all that nasty synchronization underneath.

bluejekyll · on July 17, 2016

Atomic operations on NUMA architectures are much more efficient than lock/unlock on the same memory.

It's more expensive than no synchronization, but it's the cheapest (that I know of) method of syncing state across threads and multiple CPU architectures.

The only thing cheaper is immutable structures, but those obviously limit your ability to change state across threads.

loeg · on July 17, 2016

You can do some checks as non-atomic instructions too, with careful use of memory barriers.

ktta · on July 17, 2016

I'm not the most informed person to explain this, so anyone else please feel free to correct me.

This paper is dealing with the computing process to be used (proposed) in highly parallel and distributed computing. Specifically how it manages and updates output data.

(warning: this is a not so great example of lock-free, wait free) If you have experience with linux, you can only apt-get or yum install only once at a time, because there is a 'lock' which prevents you from installing multiple packages to prevent collisions in the system data which they modify.

In a similar vein, in parallel or distributed computing (read: many threads) computation, the threads required access to the same files in executing the algorithm.

So the ''stack" is the memory that is in use during a computation. When there is a lock on a particular sector, any programs cannot access it and have to wait for the thread currently modifying it before it makes changes, so that they don't corrupt it (mess up the data). This really is disadvantageous since you are not having the max utilization of parallel threads and threads are kept waiting. So a lock-free stack means that any threads don't have to wait for the lock to be removed before they can modify anything.

So how will this affect us? make parallel computing more efficient than it already is.

See this for more info- https://en.wikipedia.org/wiki/Non-blocking_algorithm (It's called non-blocking since it is not 'blocking any other threads'. And I suggest you read this comment before you jump in the link- See this: https://news.ycombinator.com/item?id=12109439)

zamalek · on July 18, 2016

Disclaimer: I haven't implemented this yet, this is based on skimming the paper and what I know about lock-free structures.

Your run-of-the-mill lock-free code basically operates under the premise of "keep trying until something sticks." For example, with only CompareAndSwap you can implement lock-free bit-field operations as:

    do {
      oldValue = Read(variable);
      newValue = oldValue | 0x01;
    } while ((CAS(variable, oldValue, newValue) & 0x01) == 0);

You can see that we have to keep looping until the new value (which includes the flag) "sticks." The race is allowed to occur between the Read() and the CAS(), where CAS() is used to determine if a race occurred. CAS will check the current value of the address and update it only if it matches (as an atomic operation on the CPU). The worst case here is O(∞) - if another thread always wins, this loop will never exit. The run-of-the-mill lock-free stack code looks quite similar to this (you're basically updating the head until it sticks). This means that in certain circumstances, lock-free code can actually be slower. I have run into this myself[1]: that code uses locks because I couldn't figure out a faster lock-free implementation (that code is 30x faster than my best lock-free effort).

There are two operations on a stack: Push() and Pop(). Classically, these both compete to write the head (write-contention). The advantage with this algorithm is that only Push() competes to write the head so you're decreasing the probability of proving O(∞) with each Push(). Additionally, Pop() seems to be O(N) (where N = number of items in the stack) where-as classically that's also O(∞).

TL;DR Unless I'm misunderstanding wait-free (having never heard of that technical term before), it means eliminating a theoretically unbounded loop.

If you'd like to learn more I consider Jeff Preshing[2] as the authority figure on lock-free data structures. His articles are clear and could explain how these structures work to "ostensibly anyone." CPPCon also has at least one talk of his.

[1]: https://github.com/jcdickinson/AzXmpp/blob/master/src/AzXmpp... [2]: http://preshing.com/

the8472 · on July 17, 2016

one is a stack, the other a queue, they are different things.

bnjmn · on July 17, 2016

Make sure you check out Appendix A at the end of the paper (Asymptotic Worst-Case Time Complexity), in case you imagined the name "stack" implies constant-time push/pop performance. This data structure is only a "stack" in the sense that it provides last-in-first-out access.

chrisseaton · on July 17, 2016

Is it possible for a contended concurrent stack accessed by an arbitrary number of threads to have guaranteed constant-time operations? I wouldn't think that was a realistic expectation.

aidenn0 · on July 18, 2016

A lock-based stack can be O(M) in number of threads accessing it. Judging by the abstract (haven't read the paper) this is O(N) in the size of the stack.

EvgeniyZh · on July 18, 2016

"in terms of the number of concurrent threads in the system (N), the actual size of the stack(S) and the parameter W ... as soon as W consecutive nodes get marked ... the worst case time complexity of the pop operation is O(NWS)"

zilchers · on July 17, 2016

I'm not sure I see what's novel about this - maybe it's a verbiage thing around "wait free," but if they're atomically updating the top pointer and linked list, there will be lock contention on writes, and similarly when marking an item popped, on reads. I suppose the contention is bounded by the number of readers or writers, but I wouldn't consider that wait free (again, that could just be a verbiage thing). But, more to the point, this is just a slight twist on how Kafka works (stack vs log / queue, but same with the pointer holding place and a cleanup operation), I don't really see it as particularly novel...perhaps I'm missing something?

kasey_junk · on July 17, 2016

A "wait free" data structure has a specific technical definition so yes, in some ways the novelty in this is that it meets that verbiage.

Lock free stacks have been around since at least the 80s but I haven't seen a general wait free stack before (though I'm no expert).

In neither this paper or the classic lock free stacks do you have lock contention on writes as you are using CAS operations.

I'm having a little trouble parsing the algo in this paper, but what they seem to have added that is novel is their cleanup function will always complete in a finite number of steps, thus making the whole thing both wait free and bounded, which is pretty neat.

In any case, with wait free structures generally implementation details matter a lot and stacks are pretty notoriously hard to handle because of the contention around the top node, as opposed to something like an append only log like Kafka uses, so that particular comparison is not fair.

jemfinch · on July 17, 2016

> In neither this paper or the classic lock free stacks do you have lock contention on writes as you are using CAS operations.

Given that real-world locks are most often implemented using CAS, is this distinction a valuable one? In either case you have contention on a memory location between multiple threads trying to modify that memory location.

scott_s · on July 17, 2016

Yes, it is a real distinction. In lock-free algorithms, progress is guaranteed. If they are not wait-free, then they tend to update some values, then try to commit those values with CAS. If the CAS fails, they try again. Wait-free algorithms do the first part, but on the failure, they don't try again, they go off and do something different. All threads can make progress, even if some thread is suspended or dies; no thread has exclusive access to modify the state.

In lock-based algorithms, that is not the case; the thread with the lock can be suspended, killed, or go off on a wild goose chase, and the progress of the algorithm cannot continue. Only that thread may modify the state, and the rest must wait.

gpderetta · on July 18, 2016

You did not imply otherwise and your description is great, but I wanted to clarify that an algorithm that retries a CAS can still be lock-free (just not wait free) if the failed CAS implies some other thread made progress.

scott_s · on July 18, 2016

Yes, absolutely.

jemfinch · on July 18, 2016

> the thread with the lock can be suspended, killed, or go off on a wild goose chase, and the progress of the algorithm cannot continue.

That's what made it click. Thanks!

kasey_junk · on July 17, 2016

The answer is...it depends (on lots of things).

In the case "real world performance" it has to do with the implementation of the lock and the architecture, but the CAS intrinsics on the architecture are almost certainly going to be the most performant way to do memory sync, so either the lock elides to those or it doesn't but if you are doing the CAS yourself you know what you are using.

For the purposes of these algorithm discussions though it can matter a lot because of the semantics of the lock. To prove something is lock free you have to prove that a thread will complete in a finite number of steps and lots of lock semantics don't allow for that.

For wait free all of the threads must complete and I don't know of any locking semantics that can allow for this property (but I'm definitely not an expert and usually poorly muddle through lock/algo papers).

I think its important to recognize that lock-free/wait-free is more important for real time systems than it is for "performance" (in terms of latency of request). Real time systems are often practically slower than a non-real time version of the same thing, but they have a more consistent worst case. This is where lock-free/wait-free are especially useful.

gosu · on July 17, 2016

How useful is a lock-free (not wait-free) approach in a real-time context? The point is that any thread contending on lock-free data is liable to starve, so the worst-case behavior is still quite bad, no? The only benefit I see is freedom from priority inversion, and that's a fixable problem anyway. Are there other properties of the system that you somehow use to get real-time guarantees?

On the other hand, if you combine interrupt-disabling with a fair lock, then your lock wait time can be bounded in every thread. This is probably only an option inside the kernel, though.

In userspace, and with more threads than CPUs, the lock-free approach probably will save you a lot of context-switching time. It'll probably also be at least as fine-grained as the most fine-grained locking scheme you can come up with, probably with less overhead.

So I'd say that the lock-free approach is an optimistic strategy, and I don't see the benefit except for performance. I say this as someone who really likes lockfree programming.

kasey_junk · on July 17, 2016

> The only benefit I see is freedom from priority inversion, and that's a fixable problem anyway. Are there other properties of the system that you somehow use to get real-time guarantees?

Priority inversion was what I was thinking of. When you say its a fixable problem, the common way to fix that is to have a real time scheduler deal with it. Lots of systems don't have real time schedulers available to them, so lock free algos make sense there and you deal with the starvation issue instead of the inversion issue.

That said, I'm no expert on real time systems and my use of lock free algos has fallen into the exact category you are talking about, which is wanting to avoid context switches in user land, but that is more a property of the OSes I use than the algorithms themselves.

mfukar · on July 18, 2016

Both lock-free and wait-free algorithms/implementations offer the same advantage, as a corollary of their definition: latency guarantees. These are central to real-time systems.

PS. The difference is that lock-free algorithms improve throughput _at the expense of single-thread latency_, and thus are more useful for satisfying system-wide throughput requirements, e.g. for a DBMS. Wait-freedom additionally guarantees freedom from starvation.

xxs · on July 17, 2016

res = last.nextDone.compareAndSet( (null,false), (myNode, false))

I fail to see how under contention the statement above is "wait-free". Having globalPhase is another contention point (although atomic adds are wait free on most hardware, nowadays).

Not ceratain if 'tid' is threadId but having AtomicReferenceArray (all notation appears to stem from java.util.concurrent) with an arbitrary length is certainly not a wait-free operation, hence the announce would fail to be wait free. Using CPU-id instead is a hard one as well unless all threads are core-bound. Morealso if the array (announce) is not fixed length the entire algorithm cannot finish in finite amount of steps.

charlestoothall · on July 17, 2016

Lock contention? Why do you think they're using locks?

nemetroid · on July 17, 2016

The first few lines of pop() are:

  mytop <- top.get()
  curr <- mytop
  while curr != sentinel do
    mark <- curr.mark.getAndSet(true)
    ...

What's keeping curr from becoming a dangling pointer if the size of the stack is bounded?

programmer_dude · on July 17, 2016

What is a wait-free stack?

wcrichton · on July 17, 2016

Wait-freedom is described in the introduction. Quote:

There are three levels of progress guarantees for non-blocking data structures. A concurrent object is:

- obstruction-free if a thread can perform an arbitrary operation on the object in a finite number of steps when it executes in isolation,

- lock-free if some thread performing an arbitrary operation on the object will complete in a finite number of steps, or

- wait-free if every thread can perform an arbitrary operation on the object in a finite number of steps.

Wait-freedom is the strongest progress guarantee; it rules out the possibility of starvation for all threads. Wait-free data structures are particularly desirable for mission critical applications that have real-time constraints, such as those used by cyber-physical systems.

krig · on July 17, 2016

From the introduction:

> In this paper, we describe an algorithm to create a wait-free stack. A concurrent data structure is said to be wait-free if each operation is guaranteed to complete within a finite number of steps. In comparison, the data structure is said to be lock-free if at any point of time, at least one operation is guaranteed to complete in a finite number of steps. Lock-free programs will not have deadlocks but can have starvation, whereas wait-free programs are starvation free.

xchip · on July 17, 2016

I love papers but I love it more when their code is in github :) thanks for sharing!

EvgeniyZh · on July 18, 2016

Yeah, I'd love to see implementation too. Even better - implementation in some lib with comparison of performance

amaks · on July 17, 2016

How often would you find a situation where lock free data queue or stack would bring huge performance gains? Usually bad performance comes from a poor choice of data structure(s), bad data locality or by locking is too coarse or too fine causing livelocks/convoys/excessive context switches etc. What I'm saying is that using lock free or wait free algorithm is not a panacea.

_bz2r · on July 17, 2016