
Designing a Lock-free, Wait-free Hash Map - striking
https://shlomisteinberg.com/2015/09/28/designing-a-lock-free-wait-free-hash-map/
======
waterhouse
For those interested in the subject: Cliff Click of Azul Systems had a
presentation on his design for a hash table meant for heavy multithreading.
One part I remember was that he stored the hashes of keys alongside the keys,
allowing for faster resolution of the "not match" case (you compare the hashes
before doing the full key comparison).

[https://web.stanford.edu/class/ee380/Abstracts/070221_LockFr...](https://web.stanford.edu/class/ee380/Abstracts/070221_LockFreeHash.pdf)

While searching for the above, I also found this:
[http://preshing.com/20130605/the-worlds-simplest-lock-free-h...](http://preshing.com/20130605/the-worlds-simplest-lock-free-hash-table/)

~~~
eternalban
The beautiful (and I think novel) aspect of his design is the underlying FSM.

~~~
twoodfin
FSM’s are a good way to think about any problem you’d like to address lock-
free, since they map so well to CAS.

~~~
eternalban
His FSM is a bit more elaborate than merely modeling a CAS. 3 states, if I
recall correctly.

[edit:
[https://web.stanford.edu/class/ee380/Abstracts/070221_LockFr...](https://web.stanford.edu/class/ee380/Abstracts/070221_LockFreeHash.pdf)
See page 32]

~~~
twoodfin
Right, but you can implement atomic state transitions quite naturally with
CAS.

~~~
eternalban
I suggest you actually listen to Cliff explain why he did what he did before
you belabor the obvious. (Those state transitions are all CASed ..)

~~~
twoodfin
I have no idea what you think we’re disagreeing about. I’ve watched Cliff’s
presentation a handful of times over the past decade and implemented something
similar (albeit simpler) myself. Yes, the particular FSM in his hash table is
interesting and novel, but just as insightful to me was his argument that data
structures implemented as state machines built on atomic CAS can save you much
of the trouble of proving you’ve put memory barriers in the right places.
“All” you have to do is demonstrate that every possible state is valid.

------
htfy96
I was recently wondering whether Intel Transactional Synchronization
Extensions (TSX) will replace lock-free patterns in latency-critical
programming. However, I have seen few benchmarks on this topic.

~~~
Everlag
TSX is actually relatively bare for what it offers as transactional memory.
IBM's Power8 and Blue Gene are significantly more robust in that area.

For numbers, here's some for Thread-Level Speculation [0], though simulated
and not against TSX. Performance comparisons are at page 9.

[0]
[http://www.cs.cmu.edu/~tcm/tcm_papers/isca00.pdf](http://www.cs.cmu.edu/~tcm/tcm_papers/isca00.pdf)

------
sriram_malhar
It'd be nice if this was a bit readable. Thin fonts and light grey is
impossible for old eyes.

~~~
tomalpha
Reader mode in Safari (at least on iOS) renders this nicely for me
automagically, and turns the grey to a nice black.

For the desktop I was going to recommend Distill Mode in Chrome (add --enable-
dom-distiller to the command line/launcher, then it's available from the
vertical "..." menu top right) but it didn't behave the way I remembered it.
You get a nicely formatted page with all (or at least most) of the extraneous
crap filtered out, but the grey remains.

Internet Explorer supposedly has a Reading Mode too, but I can't find it on my
desktop.

~~~
kroltan
Firefox has a reader mode too, which also works perfectly.

------
dcow
Just making sure I understand this correctly: so at a high level you're just
trading software locks for hardware locks (which are most always more
granular)? If you didn't have hardware atomic support, the same "lock free"
data structure would be one where locks are held for as few instructions as
possible (and that exhibits the same functional characteristics around memory
ordering etc.)?

~~~
chrisseaton
> Just making sure I understand this correctly: so at a high level you're just
> trading software locks for hardware locks

But the use of hardware ‘locks’ here has the ‘lock-free’ property. (‘Lock-
free’ and ‘locks’ do not use exactly the same definition of ‘lock’.)

~~~
dkersten
Exactly. It's about blocking vs. making guaranteed forward progress.

~~~
scott_s
To support what dkersten said: there's a trade-off when deciding to go with
lock-free or lock-based structures. Lock-based structures tend to have _fewer_
synchronization instructions in total, but the inherent mutual exclusion means
that usually only one thread can make progress at a time. (And the cheap locks
are non-deterministic as to who gets the lock next.) Lock-free algorithms tend
to have _more_ synchronization instructions in total, but all threads tend to
be able to make progress at all times.

In general, lock-based structures and algorithms are better in low contention
situations, and lock-free structures and algorithms are better in high
contention situations. The intuition is that in low contention (either few
threads, or few _interfering_ threads), there's little harm in forcing mutual
exclusion, so it's not worth paying a bunch more synchronization instructions.
But in high contention (lots of threads, or lots of interfering threads),
mutual exclusion can mean that most of the threads are sitting idle waiting
for the right to do some work, so it's worth it to pay a bit more overhead so
the whole system can more often progress.

~~~
gpderetta
> Lock-based structures tend to have fewer synchronization instructions in
> total, but the inherent mutual exclusion means that usually only one thread
> can make progress at a time [...]. Lock-free algorithms tend to have more
> synchronization instructions in total, but all threads tend to be able to
> make progress at all times.

Technically not correct. In a lock-based implementation, it is possible that
no thread is making progress, i.e. there are no guarantees. Lock-free is the
guarantee that at least one thread makes progress. Wait-free is the guarantee
that every thread makes progress in a bounded number of steps.

~~~
scott_s
_Technically_ correct because I deliberately used weasel words ( _tend to_ ,
_usually_ ). Refining the specifics of a post doesn't have to be couched as a
contradiction!

------
kuwze
Although semi-related, a wait-free stack was discussed on HN previously[0].

[0]:
[https://news.ycombinator.com/item?id=12109206](https://news.ycombinator.com/item?id=12109206)

------
nimrody
> You can verify that your implementation is capable of performing lock-free
> atomic operations on double-machine-word primitives with
> assert(data_ptr.is_lock_free()).

This may be true when compiling for a 32bit process and running on a 64 bit
machine, but is not true for the more common 64 bit program on x86_64 (since
pointers are 64 bit.)

~~~
twoodfin
data_ptr is an atomic-wrapped struct of an int and a pointer. If it’s lock-
free you must be running somewhere with at least DCAS.

(Maybe you’re saying the statement is true but that assertion will be false on
x86_64? But CMPXCHG16B has been around for a while.)

~~~
nimrody
It is false for x86_64 when compiling with gcc 7.2 or clang 4.0. Not sure you
can make atomic.h use CMPXCHG16B.

~~~
twoodfin
That’s very surprising. Did you specify -march? If the compilers have to
support ALL x64 then they can’t use it, since it’s a 2004-ish addition.

~~~
gpderetta
std::atomic<something-128-bit-sized> does, or at least did last time I
checked, use cmpxchg16b on gcc. There was talk of changing that because of
surprising behavior: since there is no 128-bit atomic load, even plain loads
need to use the CAS, making them significantly slower than 'normal' atomic
loads; also, a const std::atomic<128-bit> is unusable in read-only pages,
which is an expected use case for atomics.

I do not know whether that has actually been 'fixed' though.

------
sillysaurus3
Ah, template<typename T>, how I've not missed thee.

~~~
VHRanger
For what it's worth, unless you're writing a library, you probably shouldn't
be writing your own template code. Most uses of templates I see are
unjustified complexity.

That said, template generics are pretty damn awesome in terms of
performance/behavior guarantees

~~~
quotemstr
It's not the complication that bugs me: it's the runtime inefficiency! Used
carelessly, C++ templates produce massive code bloat from repetitive
instantiations.

~~~
throwaway613834
> It's not the complication that bugs me: it's the runtime inefficiency! Used
> carelessly, C++ templates produce massive code bloat from repetitive
> instantiations.

You mean you actually experience slowness clearly attributable to instruction
cache misses? I don't know if I've ever had that. What bugs me is the
_compile-time_ inefficiency. A lot of the time the same method gets
regenerated for every instantiation of the enclosing class, even though most
if not all of it could just as well be in a base class without the template
parameter and get instantiated only once.

~~~
AstralStorm
There are tricks to avoid it if you really want to. Precompiled headers, force
instantiating templates you need with LTO enabled and caching. Merging source
files in the build system (possible thanks to ODR). Calling the compiler with
many source files at once to use compiler cache.

The remaining inefficiency is supposed to go away with the upcoming module
system.

~~~
kbwt
> The remaining inefficiency is supposed to go away with the upcoming module
> system.

That's really not the main goal of the module system. It's more about code
hygiene, no matter how much people are hoping it will completely solve the
issue with compile times.

------
BeeOnRope
This just shows how much of a leg up garbage collected languages have when it
comes to standalone lock-free structures.

Sure, this is technically a wait-free hashmap implemented in C++, which is
pretty cool... but it makes _two_ atomic operations on _every read_. Pure
reader threads accessing the same element will contend over the same cache
line. One of the very desirable aspects of a lock-free concurrent hashmap is
cheap reads, since read-only scenarios are so common. The proof is in the
pudding: as noted at the end this map is 200% slower than a lock-based
implementation for a read-heavy load, which is exactly the type of load where
the lock-free map should be able to exploit full concurrency.

Java has had zero-atomic ConcurrentHashMap reads since inception, and they are
fast (a couple nanos or so). I'm sure other garbage-collected languages are
the same. The big thing garbage collection gives you is automatically solving
the memory reclamation problem: it's no coincidence that problem is
highlighted in the article and that the atomic-ops-on-read comes directly from
it. Garbage collection avoids it entirely, because you never explicitly delete
the removed nodes, leaving it up to the garbage collector which can use magic
beyond the read of the mere mortal to free them safely at a later time.

To get a concurrent[1] hashmap with reasonable read performance, i.e.,
"solving" the reclamation problem, I am only aware of the following
techniques:

1) Something like quiescent-state based reclamation: basically each thread
occasionally signals when it is definitely not referencing any map nodes, at
which point (or some later point) the implementation can determine it is safe
to clean them up. Effectively it allows many fast read calls by batching up
the "ok to reclaim" notification into a single occasional quiescent checkpoint
call. It introduces some delay in reclamation, and you have to have a suitable
place to put the quiescent checkpoint. If it works, I think this is perhaps
the fastest. See for example
[https://github.com/rmind/libqsbr](https://github.com/rmind/libqsbr) .

2) Use some kind of dekker-like flagging to coordinate between readers and
deleters. E.g., a reader sets a going_to_read_flag and a deleter sets a
going_to_delete flag and then each side checks the other flag and proceeds
only if it is not set. This doesn't work with "plain" writes and reads on
almost any arch because nearly all (including x86) allow store-load
reordering, so you need a barrier. The good news is that you can move all the
cost of the barrier to the writers with an async membarrier like
[http://man7.org/linux/man-
pages/man2/membarrier.2.html](http://man7.org/linux/man-
pages/man2/membarrier.2.html) . The cost for the deleters is very high though
(a syscall at least), so you have to have a heavy read load to make this work.
It's not as fast as QSBR since you still need to do some extra (non-atomic)
ops on the read side, and you have to figure out how to handle multiple reads.

3) Never actually free the memory. Rather than using plain malloc/free for
nodes, just keep recycling the same nodes over and over, so once a reader gets
a node, it is always "safe" to access it. Of course, this means that nodes
aren't immutable, so readers may see a different value when a node is
recycled. That's probably fine since the implementation can just make a copy
(return "by value") - but you have to handle reads where a node is part way
through the transition, to avoid returning a split read. If your data is small
(8 bytes or less) that's easy since plain reads are atomic at that level. For
larger values you could use a seqlock.

4) Transactional memory stuff, e.g., Intel HTM.

Has anyone seen any other approaches out there for native code?

The complexity and tradeoffs involved are why I don't think we'll be seeing a
concurrent map in standard C++ any time soon.

\---

[1] Not all of these may be strictly lock-free, certainly not wait-free - but
an approximation that works well in practice is often much better than the
100% authentic real deal when it comes to this kind of thing. That compromise
is why you hear much more about "lock-free" than "wait-free", even though the
latter is strictly better in a theoretical sense and Herlihy proved that wait-
free algorithms generally exist wherever lock-free ones do (and indeed
provided a recipe for turning the latter into the former).

------
golemotron
> for (int i = ; i < chunks; ++i) {

> int old_val = ;

Please tell me this is a bug in the example and not some new C++ initializer
syntax.

~~~
DiThi
I can't find anything about that so I guess it's a problem with the page.

~~~
knome
I thought at first glance it might be a reference being swallowed by the
browser thinking it was an html entity ( &whatever; ).

I don't see anything in the page source either though. So if it was something
swallowed as an entity it didn't make it into the final source at all. Perhaps
an editor expanded it automatically into nothing.

    old_val <span style="color: #333333">=</span> ;

------
Modj
Kind words... "a little weird to put 100 hours into something that won't be
noticed by its absence" "I don't want to give people problems they didn't know
they should have though so it was worth fixing."

~~~
spicyj
Wrong thread.

~~~
Modj
quoted from the piece. not into making a quilt.

