Hacker News new | comments | show | ask | jobs | submit login
Designing a Lock-free, Wait-free Hash Map (shlomisteinberg.com)
185 points by striking 23 days ago | hide | past | web | favorite | 64 comments

For those interested in the subject: Cliff Click of Azul Systems had a presentation on his design for a hash table meant for heavy multithreading. One part I remember was that he stored the hashes of keys alongside the keys, allowing for faster resolution of the "not match" case (you compare the hashes before doing the full key comparison).


While searching for the above, I also found this: http://preshing.com/20130605/the-worlds-simplest-lock-free-h...

Nice read! Today Cliff Click's NonBlockingHashMap officially lives here https://github.com/JCTools/JCTools/blob/master/jctools-core/...

But this trashes the cache line by half. Only half buckets are now in the cache, a miss costs 50 insn. It's a typical hashtable trade-off. It pays off with bigger tables, as you have much more misses there.

The beautiful (and I think novel) aspect of his design is the underlying FSM.

FSM’s are a good way to think about any problem you’d like to address lock-free, since they map so well to CAS.

His FSM is a bit more elaborate than merely modeling a CAS. 3 states, if I recall correctly.

[edit: https://web.stanford.edu/class/ee380/Abstracts/070221_LockFr... See page 32]

Right, but you can implement atomic state transitions quite naturally with CAS.

I suggest you actually listen to Cliff explain why he did what he did before you belabor the obvious. (Those state transitios are all CASed ..)

I have no idea what you think we’re disagreeing about. I’ve watched Cliff’s presentation a handful of times over the past decade and implemented something similar (albeit simpler) myself. Yes, the particular FSM in his hash table is interesting and novel, but just as insightful to me was his argument that data structures implemented as state machines built on atomic CAS can save you much of the trouble of proving you’ve put memory barriers in the right places. “All” you have to do is demonstrate that every possible state is valid.

I was recently wondering whether Intel Transactional Synchronization Extensions(TSX) will replace lock-free patterns in latency-critical programming. However, I have seen few benchmarks on this topic.

TSX is actually relatively bare for what it offers as transactional memory. IBM's Power8 and Blue Gene are significantly robust in that area.

For numbers, here's some for Thread-Level Speculation [0], though simulated and not against TSX. Performance comparisons are at page 9.

[0] http://www.cs.cmu.edu/~tcm/tcm_papers/isca00.pdf

It'd be nice if this was a bit readable. Thin fonts and light grey is impossible for old eyes.

Reader mode in Safari (at least on iOS) renders this nicely for me automagically, and turns the grey to a nice black.

For the desktop I was going to recommend Distill Mode in Chrome (add --enable-dom-distiller to the command line/launcher, then it's available from the vertical "..." menu top right) but it didn't behave the way I remembered it. You get a nicely formatted page with all (or at least most) of the extraneous crap filtered out, but the grey remains.

Internet Explorer supposedly has a Reading Mode too, but I can't find it on my desktop.

Firefox has a reader mode too, which works perfectly also.

Oh yes, thank you. I'd forgotten all about that .. it works well. Overall it'd be nice if I didn't _have_ to do it though.

So right. I have a grease monkey script globally to inject css to make text black. Which works well except when the background is dark of course. But I find reading light text on dark background also an effort so I usually just skip that.

On Firefox you can ask the browser to disable CSS, or even just to ignore any text color defined within CSS (thus using your browser defaults).

Or just use the reader mode.

The text is dark gray on my screen (hex color #4d4d4d, which is 70% black). You might want to try adjusting your monitor settings.

How rude to someone who has difficulties to see because of age.

I certainly didn't mean to be rude...? I was trying to help OP identify and fix the problem they were having.

rude? I thought he was being helpful

Well, I have a regular MacBook pro with a retina display with no customization, and it has been well calibrated for most every webpage I have seen.

The text is larger, and thicker than usual text in a newspaper or book, and it is just as black, or even more black (measured on a calibrated display).

You might want to change your monitor, or recalibrate it, because it seems like this is an issue with your display.

Photo: https://i.imgur.com/o5bknYs.jpg (The text on the newspaper there is #574839, while the text on the linked site is #454545. The text in the newspaper is the equivalent of 10px tall, and 1px thick, the text on the site is 16px tall and 2.5px thick)

Poor contrast and unreadable font is still visible even on your screenshot, the complaints are valid.

> Poor contrast and unreadable font is still visible even on your screenshot,

That’s an artifact of the camera, not the page. On a properly calibrated screen supporting the Rec.2020 colorspace with at least 1000 nits of color, this will have more than enough contrast.

The big problem is that today, between #454545 and the background a user with a 1000 nits screen and one with a 100 nits screen will see a difference in contrast of almost an order of magnitude.

TL;DR: Either you need to calibrate your screen, need a better one, or we need a way to specify the absolute brightness that a page should be shown with in code.

I have the latest MacBook pro with a Retina display. It is a superb display, and if that needs to be calibrated at all to view a web page, something is definitely wrong with the web page.

Just making sure I understand this correctly: so at a high level you're just trading software locks for hardware locks (which are most always more granular)? If you didn't have hardware atomic support, the same "lock free" data structure would be one where locks are held for as few instructions as possible (and that exhibits the same functional characteristics around memory ordering etc.)?

> Just making sure I understand this correctly: so at a high level you're just trading software locks for hardware locks

But the use of hardware ‘locks’ here has the ‘lock-free’ property. (‘Lock-free’ and ‘locks’ do not use exactly the same definition of ‘lock’.)

Exactly. Its about blocking vs making guaranteed forward progress.

To support what dkersten said: there's a trade-off when deciding to go with lock-free or lock-based structures. Lock-based structures tend to have less synchronization instructions in total, but the inherent mutual exclusion means that usually only one thread can make progress at a time. (And the cheap locks are non-deterministic as to who gets the lock next.) Lock-free algorithms tend to have more synchronization instructions in total, but all threads tend to be able to make progress at all times.

In general, lock-based structures and algorithms are better in low contention situations, and lock-free structures and algorithms are better in high contention situations. The intuition is that in low contention (either few threads, or few interfering threads), there's little harm in forcing mutual exclusion, so it's not worth paying a bunch more synchronization instructions. But in high contention (lots of threads, or lots of interfering threads), mutual exclusion can mean that most of the threads are sitting idle waiting for the right to do some work, so it's worth it to pay a bit more overhead so the whole system can more often progress.

> Lock-based structures tend to have less synchronization instructions in total, but the inherent mutual exclusion means that usually only one thread can make progress at a time [...]. Lock-free algorithms tend to have more synchronization instructions in total, but all threads tend to be able to make progress at all times.

technically not correct. In a lock based implementation, it is possible that no thread is making progress, i.e. no guarantees. Lock free is the guarantee that at least one thread makes progress. Wait free is the guarantee that all threads might be making progress.

Technically correct because I deliberately used weasel words (tend to, usually). Refining the specifics of a post doesn't have to be couched as a contradiction!

Maybe I don't understand atomic well enough. If all threads hit an atomic compare and swap on the same address at the same time, doesn't the CPU have to make sure that only one thread is updating the value at a time? Wouldn't this manifest as slightly increased instruction latency for the other threads?

Yes, CPU would have to make sure that only one thread is updating the value and this would definitely increase latency for other threads.

"Guaranteed forward progress" is barely of any interest here. They actually mean, that in the event of a thread randomly stopping for some reason, other threads will still be able to perform atomic operations, because no thread can actually stop in the middle of an atomic operation.

Huh? So the meaning is more like "blocking vs. non-blocking" rather than "synchronized vs. non-synchronized"?

No, from what I understand, you can't even do it in software anyway due to how the caches work between cores inside the CPU.

Watching this was enlightening to say the least (and the presenter is very good): https://www.youtube.com/watch?v=ZQFzMfHIxng

Huh, that first example is comparing apples and oranges.

This is the "slow" lock-free version from there:

  std::atomic<unsigned long> sum;
  void do_work(size_t N, unsigned long* a) {
    for (size_t i = 0; i < N; ++i) sum += a[i];
It atomically adds for every iteration of the loop.

This is the "fast" mutex-based version:

  unsigned long sum(0); std::mutex M;
  void do_work(size_t N, unsigned long* a) {
    unsigned long s = 0;
    for (size_t i = 0; i < N; ++i) s += a[i];
    std::lock_guard<std::mutex> L(M); sum += s;
Here the sum is accumulated before being added to the total.

Of course the mutex-based version is faster. Try again with:

  std::atomic<unsigned long> sum;
  void do_work(size_t N, unsigned long* a) {
    unsigned long s = 0;
    for (size_t i = 0; i < N; ++i) s += a[i];
    sum += s;

I think one of the questions at the end covers this. The speaker admits the lock choice is a gimmick and reiterates that the time sink is related to thrashing on the cache line.

I haven't watched the video up to the end, but, how do you think mutexes work under the hood? They use atomic operations. So the cache line thrashing happens in both cases.

Edit: come to think of it, they only use compare-and-exchange, which I guess is lighter on the cache lines.

Finally made it through this. Very interesting talk and it confirms my understanding that lock free is not wait free and that with hardware atomic instructions you're delegating the locking to the hardware but still relying on correctly placed memory barriers to guarantee the consistency you need.

Although semi-related, a wait-free stack was discussed on HN previously[0].

[0]: https://news.ycombinator.com/item?id=12109206

> You can verify that your implementation is capable of performing lock-free atomic operations on double-machine-word primitives with assert(data_ptr.is_lock_free()).

This may be true when compiling for a 32bit process and running on a 64 bit machine, but is not true for the more common 64 bit program on x86_64 (since pointers are 64 bit.)

data_ptr is an atomic-wrapped struct of an int and a pointer. If it’s lock-free you must be running somewhere with at least DCAS.

(Maybe you’re saying the statement is true but that assertion will be false on x86_64? But CMPXCHG16B has been around for a while.)

It is false for x86_64 when compiling with gcc 7.2 or clang 4.0. Not sure you can make atomic.h use CMPXCHNG16B.

That’s very surprising. Did you specify -march? If the compilers have to support ALL x64 then they can’t use it, since it’s a 2004-ish addition.

std::atomic<something-128bit-sized> do, or at least did last time I checked, use cmpxcg16b on gcc. There were talks about changing that because of surprising behavior: as there is no 128 atomic load, even loads need to use the cas, making them significantly slower than 'normal' atomic loads; also const std::atomic<128bit> is unusable in read only pages which is an expected use case for atomics.

I do not know whether that has actually been 'fixed' though.

Ah, template<typename T>, how I've not missed thee.

For what it's worth, unless you're writing a library, you probably shouldn't be writing your own template code. Most uses of templates I see are unjustified complexity.

That said, template generics are pretty damn awesome in terms of performance/behavior guarantees

It's not the complication that bugs me: it's the runtime inefficiency! Used carelessly, C++ templates produce massive code bloat from repetitive instantiations.

Your point is a decade out of date.

Templates are a prime mechanism for achieving efficiency in C++ code and do not produce code-bloat like those early compilers.


When templates are used to remove indirection, very good efficiency results with zero-overhead, which is a major reason why C++ sort can be quicker than qsort. (http://www.geeksforgeeks.org/c-qsort-vs-c-sort/)

My point is not out of date.

You can use techniques like type erasure to reduce bloat, and compilers and linkers do combine identical code sequences in many cases, but the point remains that templates work by replicating code for each instantiation. If you blindly trust the toolchain to clean up for you (as your first link suggests), you're going to run into problems. Size problems are insidious because they don't show up until they're hard to fix.

I'm not suggesting that you never use templates, but I am insisting that you understand what they do instead of trusting the compiler to sweep everything under the rug.

> It's not the complication that bugs me: it's the runtime inefficiency! Used carelessly, C++ templates produce massive code bloat from repetitive instantiations.

You mean you actually experience slowness clearly attributable to instruction cache misses? I don't know if I've ever had that. What bugs me is the compile-time inefficiency. A lot of times the same method gets regenerated for every time in the enclosing class even though most if not all of it could just as well be in the base class without the template parameter, to only get instantiated once.

There are tricks to avoid it if you really want to. Precompiled headers, force instantiating templates you need with LTO enabled and caching. Merging source files in the build system (possible thanks to ODR). Calling the compiler with many source files at once to use compiler cache.

The remaining inefficiency is supposed to go away with the upcoming module system.

> The remaining inefficiency is supposed to go away with the upcoming module system.

That's really not the main goal of the module system. It's more about code hygiene, no matter how much people are hoping it will completely solve the issue with compile times.

The page faults are usually worse than the icache misses. Code is not free.

As somebody who’s debugged and inspected machine code generated by expanded templates I can tell you this is not really a problem. At -O2 or -O3 optimization level with GCC or Clang almost all of that get unlined, removed, remerged and just optimized away. At -O0 god help you with single stepping any template code (like unique_ptr) in gdb.

This just shows how much a leg up garbage collected languages have when it comes to standalone lock-free structures.

Sure, this is technically a wait-free hashmap implemented in C++, which is pretty cool... but it makes two atomic operations on every read. Pure reader threads accessing the same element will contend over the same cache line. One of the very desirable aspects of a lock-free concurrent hashmap is cheap reads, since read-only scenarios are so common. The proof is in the pudding: as noted at the end this map is 200% slower than a lock-based implementation for a read-heavy load, which is exactly a type of load where the lock-free map should be able to exploit full concurrency.

Java has had zero-atomic ConcurrentHashMap reads since inception, and they are fast (a couple nanos or so). I'm sure other garbage-collected languages are the same. The big thing garbage collection gives you is automatically solving the memory reclamation problem: it's no coincidence that problem is highlighted in the article and that the atomic-ops-on-read comes directly from it. Garbage collection avoids it entirely, because you never explicitly delete the removed nodes, leaving it up to the garbage collector which can use magic beyond the read of the mere mortal to free them safely at a later time.

To get a concurrent[1] hashmap with reasonable read performance, i.e., "solving" the I am only aware of the following techniques:

1) Something like quiescent-state based reclamation: basically each thread occasionally signals when it is definitely not referencing any map nodes, at which point (or some later point) the implementation can determine it is safe to clean them up. Effectively it allows many fast read calls by batching up the "ok to reclaim" notification into a single occasional quiescent checkpoint call. It introduces some delay in reclamation, and you have to have a suitable place to put the quiescent checkpoint. If it works, I think this is perhaps the fastest. See for example https://github.com/rmind/libqsbr .

2) Use some kind of dekker-like flagging to coordinate between readers and deleters. E.g., a reader sets a going_to_read_flag and a deleter sets a going_to_delete flag and then each side checks the other flag and proceeds only if it is not set. This doesn't work with "plain" writes and reads on almost any arch because nearly all (including x86) allow store-load reordering, so you need a barrier. The good news is that you can move all the cost of the barrier to the writers with an async membarrier like http://man7.org/linux/man-pages/man2/membarrier.2.html . The cost for the deleters is very high though (a syscall at least), so you have to have a heavy read load to make this work. It's not as fast as QSBR since you still need to do some extra (non-atomic) ops on the read side and you have to figure out how to handle multiple reads).

3) Never actually free the memory. Rather than using plain malloc/free for nodes, just keep recycling the same nodes over and over, so once a reader gets a node, it is always "safe" to access it. Of course, this means that nodes aren't immutable, so readers may see a different value when a node is recycled. That's probably fine since the implementation can just make a copy (return "by value") - but you have to handle reads where a node is part way through the transition, to avoid returning a split read. If your data is small (8 bytes or less) that's easy since plain reads are atomic at that level. For larger values you could use a seqlock.

4) Transactional memory stuff, e.g., Intel HTM.

Has any seen any other approaches out there for native code?

The complexity and tradeoffs involved are why I don't think we'll be seeing a concurrent map in standard C++ any time soon.


[1] Not all of these may be strictly lock-free, certainly not wait-free - but an approximation that works well in practice is often much better than the 100% authentic read-deal when it comes to this kind of thing. That compromise is why you hear much more about "lock-free" than "wait-free" even though the latter is strictly better in a theoretical sense and Herlihy proved that wait-free algorithms generally exist when lock-free ones too (and indeed provided a recipe for changing the latter into the former).

> for (int i = ; i < chunks; ++i) {

> int old_val = ;

Please tell me this is a bug in the example and not some new C++ initializer syntax.

I can't find anything about that so I guess it's a problem with the page.

I thought at first glance it might be a reference being swallowed by the browser thinking it was an html entity ( &whatever; ).

I don't see anything in the page source either though. So if it was something swallowed as an entity it didn't make it into the final source at all. Perhaps an editor expanded it automatically into nothing.

    old_val <span style="color: #333333">=</span> ;

Kind words... "a little weird to put 100 hours into something that won't be noticed by its absence" "I don't want to give people problems they didn't know they should have though so it was worth fixing."

Wrong thread.

quoted from the piece. not into making a quilt.

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | DMCA | Apply to YC | Contact