
Benchmarking C++ Sets - cpphacker
https://constantsmatter.com/posts/cpp-benchmark-container-lookup/
======
xenadu02
Always measure!

In a realistic application (with cache pressure) you may find an array
performs even better because the CPU can trivially predict and pipeline the
entire operation vs jumping around to random locations in memory.

Another approach is a B-tree with lots of nodes stored at the leaves. This can
get you the array's linear performance for small datasets, while preserving
other big-O benefits of sets/trees (eg insertion time).

As the size of your dataset grows, there are performance cliffs at the L1, L2,
and L3 cache sizes. For very large datasets, creating data structures with
predictable access patterns often trumps Big-O entirely. If following a linked
list thrashes the cache it turns out the constant factor can and does
dominate.

So again, always measure with your application on your intended hardware. Be
sure you measure with realistic loads and access patterns.

~~~
AstralStorm
std::vector is supposed to essentially optimize down to a direct array for
lookups; only insert or append differ, since those may trigger a resize.

std::array is expected to have almost identical performance characteristics.

------
forrestthewoods
These results don't make any sense and I don't believe them.

The author is generating random strings of length 64. They then generate N
random strings and search for them. The probability of finding a match is
effectively zero.

> Interestingly, up to about 75 elements, the vector actually outperforms both
> sets.

No way. The unordered_set will perform one hash, one lookup, and return false.
That is not going to be slower than SEVENTY FIVE string comparisons.

Looking at the code I _think_ that randString() for each of the N search
strings is included in the timing code. That, uh, is not good? Line 29:
[https://github.com/akshaynanavati/benchmarks/blob/master/ben...](https://github.com/akshaynanavati/benchmarks/blob/master/benchmarks/lookup_string.cpp)

I don't know how Google Benchmark works. So I wrote my own simple test. I
could not find any combination of string length and num searches where a
vector was faster than an unordered_set.

[https://pastebin.com/raw/mfBmGa5B](https://pastebin.com/raw/mfBmGa5B)

Example results:

    String length: 8  Num Elements: 75  Num Searches: 10000
    Gen Time: 1.94252  Vec Time: 4.12721  Set Time: 0.144987  Num Found: 0
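For reference, here is a minimal sketch of a fairer harness (this is not the pastebin code; `randString`, the seed, and the sizes are placeholder choices): the point is that the query strings are pre-generated, so generation stays outside the timed region.

```cpp
#include <algorithm>
#include <chrono>
#include <random>
#include <string>
#include <unordered_set>
#include <utility>
#include <vector>

// Placeholder generator, not the blog post's randString().
std::string randString(std::mt19937& rng, size_t len) {
    static const char alpha[] = "abcdefghijklmnopqrstuvwxyz";
    std::string s(len, ' ');
    for (auto& c : s) c = alpha[rng() % 26];
    return s;
}

// Times numQueries lookups against a vector and an unordered_set
// holding the same numElems strings. Returns {vectorNs, setNs}.
std::pair<long long, long long>
timeLookups(size_t numElems, size_t numQueries, size_t len) {
    std::mt19937 rng(42);
    std::vector<std::string> vec;
    for (size_t i = 0; i < numElems; ++i) vec.push_back(randString(rng, len));
    std::unordered_set<std::string> set(vec.begin(), vec.end());

    // The crucial bit: queries are generated *before* the timed region.
    std::vector<std::string> queries;
    for (size_t i = 0; i < numQueries; ++i) queries.push_back(randString(rng, len));

    using clk = std::chrono::steady_clock;
    auto t0 = clk::now();
    size_t hitsVec = 0;
    for (const auto& q : queries)
        hitsVec += std::find(vec.begin(), vec.end(), q) != vec.end();
    auto t1 = clk::now();
    size_t hitsSet = 0;
    for (const auto& q : queries) hitsSet += set.count(q);
    auto t2 = clk::now();

    // Hit counts are equal by construction; a real harness would feed them
    // through something like benchmark::DoNotOptimize to defeat dead-code
    // elimination.
    (void)hitsVec; (void)hitsSet;
    return {std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count(),
            std::chrono::duration_cast<std::chrono::nanoseconds>(t2 - t1).count()};
}
```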

~~~
AstralStorm
No it won't, not for strings this small. A 64-byte string fits in a single
cache line, so you're doing one cache-line comparison, which the compiler may
even lower to SSE or AVX vector compares. Either way it is still really fast.

I suspect the author made critical mistakes too.

If you are actually hitting matches, or doing many operations with a warmed
cache, the vector may actually win big time even compared to unordered_set,
as there may be no pointer chasing involved (all keys and all values are
already in cache). The vector scan essentially boils down to a memcmp for a
sufficiently smart compiler. Computing the hash may or may not be faster than
that.

~~~
forrestthewoods
std::vector<std::string> is a pointer-chase cache-miss every compare.

In my testing the unordered_set is faster so long as numberOfStrings >
stringLength. This is true for any string length greater than roughly 4. As
stringLength grows, the unordered_set advantage grows.

The original post says the vector is faster until 75 elements. By my test the
unordered_set is 4x faster with 64-char strings at 75 elements. At 250
numberOfStrings it's ~10x faster.

Sorted vectors + binary search are very very good though. Author said vector
faster until 20,000 elements. My testing finds the numberOfStrings >
stringLength limit is still basically true. But the unordered_set's advantage
grows only to 1.9x by 20,000 numberOfStrings.

This advantage will grow as number of matches / near-matches grows. Random
strings are not a great test case. Data sets matter!

Author is fixing a couple of bugs and will be updating their post. Curious to
see if his updated results match mine. :)

~~~
Someone
_”std::vector <std::string> is a pointer-chase cache-miss every compare”_

Nitpick: that is (almost certainly) true for strings of length 64 on any
system, but it is not necessarily true in general.

On existing systems, it isn’t true for short strings (with ‘short’ being
implementation dependent), due to “short string optimization” (See
[https://stackoverflow.com/a/10319672](https://stackoverflow.com/a/10319672),
[https://stackoverflow.com/a/28003328](https://stackoverflow.com/a/28003328)).

There’s nothing in the standard that forbids putting extra room in the base
object to accommodate larger strings (but of course, it would surprise many
users, possibly lead to stack overflows in lots of code, etc)
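One illustrative (and thoroughly implementation-dependent) way to observe SSO is to check whether the string's character buffer lives inside the string object itself; `looks_inline` is a made-up helper for this comment, not a standard facility.

```cpp
#include <string>

// Heuristic: under the short string optimization, data() points into
// the string object's own footprint; a heap-allocated buffer points
// elsewhere. Implementation-dependent, for illustration only.
bool looks_inline(const std::string& s) {
    const char* p = s.data();
    const char* obj = reinterpret_cast<const char*>(&s);
    return p >= obj && p < obj + sizeof(std::string);
}
```

On libstdc++ and libc++ this reports true for strings up to 15 and 22 characters respectively, and false for 64-character strings.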

------
chillee
Somewhat interestingly, C++ STL unordered_map is so much slower than it needs
to be:
[https://www.youtube.com/watch?v=fHNmRkzxHWs&t=47m13s](https://www.youtube.com/watch?v=fHNmRkzxHWs&t=47m13s)

Using one of the policy-based data structures instantly gets you about an
order of magnitude of speedup (for accesses).

~~~
asdfasgasdgasdg
Kinda makes you wonder why those decisions (to provide bucket iteration,
bucket iterators) were made. It's hard to picture a use for such a feature
that makes it worthy of adding constraints to a standard container. I would
love it if someone could provide some insight into the decision process that
led to this state of affairs.

~~~
yoklov
IIRC boost::unordered_map/set offered those, and I'm pretty sure the API was
taken more or less verbatim from that. I suspect some desire of compatibility
was to blame, but don't know.

Hopefully in the future mistakes of this sort will be avoided, but that might
be asking for too much. (std::optional did opt to break compatibility with
boost's version, disallowing `std::optional<T&>`, so there's some hope.)

------
shin_lao
I have a lot of issues with this blog post:

\- Compiler, c++ library, compilation options, and platform are missing

\- The author doesn't benchmark vector and unordered_map insertions with
reserve()

\- If I read correctly, the author benchmarks up to 32,768 items. This is way
too low. There should be tests with millions of entries.

\- Random strings/numbers aren't a realistic scenario and are very favourable
to unordered_map. I would use data from real-life examples, e.g. a list of
cities, a list of people, etc.

\- Ordered arrays (e.g. boost::container::flat_set) are missing from the
benchmark; they're a very powerful data structure. Why only try a sorted
vector at the end?

\- Vector outperforms unordered_set probably because of the hashing function
used.

~~~
Jyaif
Even if they had done all that, some person on the internet would still have
had "a lot of issues with the post" in order to show off how smart they are.

You have to draw the line somewhere, and this post is very interesting as it
is.

~~~
petermcneeley
The original article was updated as the results were completely incorrect. The
only time a linear search is faster is for a very tiny in-place integer
search.

------
ebikelaw
It is very rare that I have seen a use of std::set that seemed justified,
although it does happen. People just use it because it's easier to type than
unordered_set!

One way to speed up set's usually appalling performance is to give it a custom
allocator so that instead of N calls to allocate tiny tree nodes all over the
heap, you get all your tree nodes on an arena. This can dramatically speed up
short-lived sets (for example you are trying to build the unique set of some
non-unique input).
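With C++17 you can get this arena behavior from the standard library itself via `std::pmr`, without writing a custom allocator; a sketch of the unique-set-of-inputs case (names and buffer size are my own choices):

```cpp
#include <cstddef>
#include <memory_resource>
#include <set>
#include <vector>

// Build the unique set of a non-unique input using a short-lived
// std::pmr::set whose tree nodes are bump-allocated out of a local
// arena instead of N individual heap allocations.
std::size_t uniqueCount(const std::vector<int>& input) {
    char buffer[1 << 16];                                 // stack-backed arena
    std::pmr::monotonic_buffer_resource arena(buffer, sizeof(buffer));
    std::pmr::set<int> uniq(&arena);                      // nodes come from the arena
    for (int v : input) uniq.insert(v);
    return uniq.size();                                   // arena frees everything at once
}
```

If the buffer runs out, `monotonic_buffer_resource` falls back to its upstream resource, so this degrades gracefully rather than overflowing.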

------
dagenix
> In one setup, I used std::strings of length 64 and in the other I used 64
> bit numbers. The main motivation behind measuring both numbers and strings
> was that comparing 64 bit numbers takes a constant number of clock cycles

Pedantic: comparing 64 character strings is O(1).

~~~
kryptiskt
But not a constant number of clock cycles. A comparison that fails on the
first bytes can bail out early.

~~~
colanderman
To be fair, Intel's recent offerings include 512-bit registers [1].

I highly doubt std::string takes advantage of them.

[1]
[https://en.wikipedia.org/wiki/AVX-512](https://en.wikipedia.org/wiki/AVX-512)

~~~
rurban
It shouldn't, because AVX-512 forces the whole core to clock down so it
doesn't get too hot. 256 bits is the best width to use.

------
ridiculous_fish
For longish strings (length 64 qualifies), binary_search is missing an
important optimization: it ends up walking the same prefix over and over.

A better algorithm is called "radix quicksort with a three-way partition."
Rather than the classical qsort < and >= partition, you maintain three regions
of <, =, and >. Crucially, at each step, you sort based only on a single
character: the whole range via the first char, then the = region via the
second char, then that = region via the third char, etc. This allows for
super-fast comparisons without re-walking prefixes.

All of these data structures (std::binary_search, std::set, and
std::unordered_set) are quite sub-optimal. The design of unordered_set, with
its separately allocated nodes chained off each bucket, is especially
distressing.

For a much more efficient take, see LLVM's DenseSet which does NOT separately
allocate buckets. This burdens the client with finding two values to represent
empty and tombstone entries, but if perf is a concern that's much better than
additional mallocs.
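This is not DenseSet's actual code, but a minimal open-addressing sketch in the same spirit, with the two reserved sentinel values standing in for the client-supplied empty and tombstone keys (resizing omitted for brevity, so capacity must stay ahead of the insert count):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// All keys live inline in one flat array: no per-element allocation.
// kEmpty marks a never-used slot; kTombstone marks an erased one.
// A real client must guarantee neither value occurs in the data.
class FlatIntSet {
    static constexpr int32_t kEmpty = INT32_MIN;
    static constexpr int32_t kTombstone = INT32_MIN + 1;
    std::vector<int32_t> slots_;

    static std::size_t hashOf(int32_t k) {
        return static_cast<uint32_t>(k) * 2654435769u;   // Fibonacci hashing
    }

public:
    explicit FlatIntSet(std::size_t capacityPow2 = 64)
        : slots_(capacityPow2, kEmpty) {}

    bool insert(int32_t k) {
        std::size_t mask = slots_.size() - 1;
        std::size_t firstFree = SIZE_MAX;
        for (std::size_t i = hashOf(k) & mask;; i = (i + 1) & mask) {
            if (slots_[i] == k) return false;            // already present
            if (slots_[i] == kTombstone && firstFree == SIZE_MAX)
                firstFree = i;                           // remember erased slot for reuse
            if (slots_[i] == kEmpty) {
                slots_[firstFree != SIZE_MAX ? firstFree : i] = k;
                return true;
            }
        }
    }

    bool contains(int32_t k) const {
        std::size_t mask = slots_.size() - 1;
        for (std::size_t i = hashOf(k) & mask;; i = (i + 1) & mask) {
            if (slots_[i] == k) return true;
            if (slots_[i] == kEmpty) return false;       // tombstones keep the probe going
        }
    }

    bool erase(int32_t k) {
        std::size_t mask = slots_.size() - 1;
        for (std::size_t i = hashOf(k) & mask;; i = (i + 1) & mask) {
            if (slots_[i] == k) { slots_[i] = kTombstone; return true; }
            if (slots_[i] == kEmpty) return false;
        }
    }
};
```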

------
iainmerrick
It seems like many commenters here (the ones who aren’t questioning the
methodology) are drawing the lesson that you should use vector rather than
set.

From these charts, isn’t unordered_set clearly the best all-rounder? Which is
exactly as I’d expect. Am I missing something?

Sure, a simple vector might be a bit faster for 10 or 100 or 1000 elements,
but after a certain point performance is awful.

------
adzm
definitely check out flat_set too! most of the time using a sorted vector as
the backing store for these kinds of things ends up being a net win,
especially for smaller sets, but the size it starts to become a hindrance may
surprise you.

~~~
de_watcher
Flat sets are great.

By writing my own implementation (to avoid a boost dependency) I kinda
understood why they aren't standard: the iterators have different
invalidation rules, there are useful sorted-vector-specific properties to
expose, the map/multimap keys can't be const as in the other standard
associative containers, etc.

------
BenFrantzDale
Very interesting. This makes it sound like there should be a way to get
`std::lower_bound` to switch to a linear scan once it has narrowed the range
down enough. (In this specific example, "enough" is about 125 elements.)

Likewise for a hash table with separate chaining: wouldn't one way to do
better than `std::unordered_set` be to keep each chain as a `std::vector`
that is kept sorted, and then use `std::lower_bound` (or the above better
version of it) to probe the table? Those per-bucket vectors are O(1) in size,
so lookup and insertion are still O(1) on average. Worst case goes to
O(n log n), but that doesn't seem too bad.

Related: Is there a widely-used "frozen set" C++ data structure? I'm picturing
something that is constructed from a sequence but that internally is a sorted
`std::vector<T>` that provides a set-like API but that uses `std::lower_bound`
to find data. It's easy enough to write, but it seems like it should be in the
standard or at least Boost.
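It is indeed easy enough to write; a minimal sketch of such a frozen set (class and member names are invented here):

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Built once from any sequence; internally a sorted, deduplicated
// std::vector queried with std::lower_bound. Immutable after construction.
template <class T>
class FrozenSet {
    std::vector<T> data_;
public:
    explicit FrozenSet(std::vector<T> v) : data_(std::move(v)) {
        std::sort(data_.begin(), data_.end());
        data_.erase(std::unique(data_.begin(), data_.end()), data_.end());
    }
    bool contains(const T& x) const {
        auto it = std::lower_bound(data_.begin(), data_.end(), x);
        return it != data_.end() && !(x < *it);
    }
    std::size_t size() const { return data_.size(); }
};
```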

~~~
InfamousRece
There is one in boost:
[https://www.boost.org/doc/libs/1_68_0/doc/html/container/non...](https://www.boost.org/doc/libs/1_68_0/doc/html/container/non_standard_containers.html#container.non_standard_containers.flat_xxx)

~~~
BenFrantzDale
Nice!
[https://www.boost.org/doc/libs/1_50_0/doc/html/boost/contain...](https://www.boost.org/doc/libs/1_50_0/doc/html/boost/container/flat_set.html)
and
[https://www.boost.org/doc/libs/1_65_1/doc/html/boost/contain...](https://www.boost.org/doc/libs/1_65_1/doc/html/boost/container/flat_map.html)
for reference.

~~~
BenFrantzDale
Interestingly, when I dig deep into `find`, I wind up in `priv_lower_bound`
which bisects to the bitter end rather than doing a linear find at the end.
[https://www.boost.org/doc/libs/1_50_0/boost/container/detail...](https://www.boost.org/doc/libs/1_50_0/boost/container/detail/flat_tree.hpp)

------
stochastic_monk
The use of unsorted vectors for sets was covered and benchmarked in a Scott
Meyers [0] post from 2015. Facebook's folly library supports linear containers
and maps for these small sets (~100 elements, as a rule of thumb, but
measure), and I've rolled my own for specific applications.

[0] [https://scottmeyers.blogspot.com/2015/09/should-you-be-using...](https://scottmeyers.blogspot.com/2015/09/should-you-be-using-something-instead.html)

------
jlarocco
The article should say which C++ compiler and standard library they're using.
Each implementation is different, and these results may not be true for all of
them.

------
Jyaif
Great! I'd love to know what percentage of the world's std::set could be
replaced by vectors, and how many megawatts would be saved.

------
halayli
This is completely useless if you don't mention what STL library version
you're using.

The C++ standard defines the interface of the set containers, not their
implementation.

------
kccqzy
It's actually not hard to beat the performance of standard set and map
containers in specific scenarios. For example in one application I have a
treap implementation that's much faster than the standard ordered map, but it
is much less featureful (doesn't support deletion for example).

------
shaklee3
Anyone compared this to sparsehash recently?

[https://github.com/sparsehash/sparsehash](https://github.com/sparsehash/sparsehash)

I read a while ago that it beat all the competitors at almost every size.

------
petermcneeley
Are these benchmarks done with cold caches and multithreading? If not then we
are not seeing the whole picture here.

~~~
mackal
They use the Google Benchmark library. (I would read into it to answer your
actual questions)

~~~
petermcneeley
Also looks like randString generation will dominate for small arrays:

    for (size_t i = 0; i < kLookups; ++i) {
      benchmark::DoNotOptimize(std::find(vec.begin(), vec.end(), randString()));
    }

Also does that allocate a new string every call?

~~~
AstralStorm
Yes, this benchmark is relatively flawed.

Allocating a new string simulates heap fragmentation.

~~~
petermcneeley
The post was updated as the results were incorrect.

------
firethief
In the first table:

    
    
        set @ 4096: 10.80
        set @ 32768: 1.31
    

uh?

------
wmu
Comparing lookup time of unsorted vector and sorted set is unfair. The vector
should be sorted and then we might compare plain binary search against BST
performance.

~~~
recentdarkness
It is not unfair, and there is a comparison with a sorted vector as well. The
author correctly indicates why the sorted vector performs better than the
unsorted one: cache friendliness is key. And it shows that for small element
counts an array or vector performs better. It simply depends on your needs.

I would be interested in comparing a stack-based array of integers to the
sets as well.

~~~
wmu
My fault, I just skimmed through the article and missed that comparison.

