
New Concurrent Hash Maps for C++ - adamnemecek
http://preshing.com/20160201/new-concurrent-hash-maps-for-cpp/?
======
zokier
How much overhead does std::mutex add in the single-threaded case for
std::map? Because it's kind of curious that this new map is more than twice
as fast as std::map even on a single thread.

~~~
berkut
std::map is an RB tree, not a hash map.

~~~
pjmlp
ANSI C++, sections 23.4.2 and 23.4.4, doesn't specify the implementation,
only the complexity requirements.

Any C++ implementation is free to choose its std::map implementation as long
as it meets those requirements; nowhere does the standard say that an RB tree
is required.

~~~
skystrife
libstdc++, libc++, and MSVC's STL all implement it as a red-black tree.

Even if it's not _required_ to be a red-black tree, it is _de facto_ a red-
black tree on every major compiler.

Moreover, this doesn't really change the parent's point. It _must_ be some
sort of ordered associative container, meaning that it's not going to have the
performance characteristics of a hash table.
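The ordering requirement is observable from the interface alone, whatever the backing data structure. A small sketch of what any conforming std::map must do:

```cpp
#include <cassert>
#include <map>
#include <vector>

// std::map is an ordered associative container: iteration must yield
// keys in sorted order regardless of insertion order. A plain hash
// table can't satisfy that, which is why implementations use balanced
// trees (de facto, red-black trees).
std::vector<int> iteration_order() {
    std::map<int, const char*> m;
    m[3] = "three";
    m[1] = "one";
    m[2] = "two";

    std::vector<int> keys;
    for (const auto& kv : m) keys.push_back(kv.first);
    return keys;  // {1, 2, 3}, not insertion order
}
```

std::unordered_map, by contrast, makes no iteration-order promise, which is what frees it to be a hash table with O(1) average lookups.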

~~~
pjmlp
Clang, gcc and MSVC are hardly "every major compiler", as there are many
others to choose from, especially in embedded systems, real-time OSes,
classical commercial UNIX, and mainframes.

We don't always have the luxury of choosing which compiler to use.

Relying on implementation details of the compiler or its bundled library is
the first trap when trying to write portable code across OSes and compiler
vendors.

------
eps
Uhm... The absolute Y-position of the curve on the graph is secondary to its
shape. Also, the _scalability_ discussion is largely pointless if the sample
set covers only 6 CPUs.

Intel TBB scales proportionally, which is precisely what you'd want here.
Junction starts to flatten out on the 6th CPU, which implies that it has
fundamental design issues that crop up at higher CPU counts. Chances are that
its performance not only won't scale further, but will actually drop further
down the graph.

On the other hand, the TBB code can be put through some routine optimization
(hand-coded assembly and such) to increase its performance without affecting
its linear graph shape.

~~~
vardump
Based on the information available, you might be right and you might be wrong.

Before one can say anything either way, one needs to profile. Assumptions
rarely work when it comes to extracting high runtime performance.

So unless you did profile, your comment didn't really add anything.

~~~
vog
_> So unless you did profile, your comment didn't really add anything._

The previous comment explained why and how the author's analysis is flawed.
Then it goes over the top by speculating in exactly the opposite direction.
But that doesn't reduce the quality of the first part of the comment. That
part was still valuable to me.

~~~
vardump
Ok, you have a point.

The original author's analysis is flawed: it doesn't extend to multi-socket
systems.

What we don't know is how this thing scales beyond a single socket. The
graph is not going to tell us anything about that. Profiling will.

Experience has shown me that assumptions are bad.

So now I don't assume a certain call will succeed, or even work the way I
think it does. Instead I check the return value of every call and test it
against my assumptions.

I never assume much about performance either. I might occasionally use
microbenchmarks as a hint. But the main mode of operation is measuring as big
a piece of functionality as possible, with many different-sized, but
realistic, workloads. Preferably on multiple different systems as well.

If performance is the goal, I'd advocate one trying out different concurrent
hash map implementations in the system one is building.

Never assume.

~~~
exDM69
> The graph is not going to tell us anything about that. Profiling will.

What kind of profiling would you employ for a concurrent data structure like
the hash map here? Instrumented code or sampling profilers?

I'm afraid both kinds of profiling would yield quite meaningless information,
because the individual operations are quite fast and the runtime may vary
depending on e.g. cache utilization and contention. Profiling is good for
bigger applications but it's not super useful for "primitive" operations like
hash map inserts.

If I were to optimize something like this, I'd first reach for the CPU
performance counters, trying to understand which aspect is the bottleneck.

~~~
vardump
> ... Instrumented code or sampling profilers?

A sampling profiler would be helpful, although most hash-map-related samples
would likely fall on the atomic ops it presumably uses for synchronization.
On the other hand, you'd know whether you need to optimize this in the first
place.

An instrumented profiler would yield garbage data for a lot of reasons; I
wouldn't use one, _except_ maybe over a _large_ group of hash map operations.

> ... I'd first reach for the CPU performance counters

I count CPU performance counters as profiling.

------
thrownaway2424
Interesting and impressive performance. The requirement to periodically halt
every thread seemed like a bit of a downer though. Am I reading that wrong?

~~~
lsr0
That seems to be the case; I do wish they'd made this requirement clear up
front. There is no getting around synchronised quiescence being a blocking
event, but in this case they essentially hope that either 1) you're already
using a kind of scatter-gather thread model like the mentioned game example,
which implies an iterative discrete world, or 2) the set of threads (or
tasks) interacting with the collection is simply bounded and synchronised,
and/or 3) the performance is still better in aggregate even with infrequent
world-blocking events.

~~~
tlipcon
Typically QSBR algorithms don't require blocking the world, or even blocking
any single thread. They just require each thread to periodically check in and
run a bounded amount of code which amounts to "hey, I'm not currently looking
at the map".

Some other background collector thread (which is going to actually delete
removed objects) just has to wait until it sees every mutator thread cross a
safepoint, at which point it knows that none of those threads could be hanging
onto references that have been unlinked from the data structure.
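A minimal sketch of that check-in scheme, with hypothetical names (this is not Junction's actual API, just the general QSBR idea):

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical QSBR sketch: each mutator thread periodically "checks in"
// by copying the global epoch into its own slot ("hey, I'm not currently
// looking at the map"). The reclaimer bumps the epoch when a node is
// unlinked, then frees it only once every thread has observed an epoch
// at least that new -- i.e., every thread has passed a safepoint since
// the unlink, so none can still hold a reference.
struct Qsbr {
    std::atomic<uint64_t> global_epoch{1};
    std::vector<std::atomic<uint64_t>*> slots;  // one slot per thread

    // Called by a mutator at a safepoint; bounded work, never blocks.
    void quiescent(std::atomic<uint64_t>& slot) {
        slot.store(global_epoch.load(std::memory_order_acquire),
                   std::memory_order_release);
    }

    // Called by the reclaimer when a node is unlinked; returns the
    // epoch the node was retired at.
    uint64_t retire_epoch() {
        return global_epoch.fetch_add(1, std::memory_order_acq_rel) + 1;
    }

    // True once every registered thread has checked in at or after the
    // retirement epoch, so the retired node can be freed.
    bool safe_to_free(uint64_t retired_at) const {
        for (const auto* s : slots)
            if (s->load(std::memory_order_acquire) < retired_at)
                return false;
        return true;
    }
};
```

The key property is visible in the shape of the code: readers only ever do a load and a store per check-in, and all the waiting happens on the reclaimer's side.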

I'd recommend reading some surveys of RCU and SMR algorithms if this stuff is
interesting to you.

------
fenesiistvan
I hope that the usage will be easier than for Intel TBB. If you are not using
a mainstream build tool, then Intel TBB is a headache.

------
berkut
Those results look slightly suspicious to me, in that I've seen TBB
concurrent_hash_map scale much better than that...

I guess it depends on the workload ratio...

Would have been nice to see a binned/sharded hashmap in the results as well,
I've seen pretty good scalability on those as well.

------
Const-me
Interesting indeed. Here's one more comparison, BTW:
[http://const.me/tmp/versus-atl-map.png](http://const.me/tmp/versus-atl-map.png)

