
Benchmarks of Cache-Friendly Data Structures in C++ - s3cur3
https://tylerayoung.com/2019/01/29/benchmarks-of-cache-friendly-data-structures-in-c/
======
usefulcat
> Lookup in the ArrayMap is shockingly slow at large data sizes—again, I don’t
> have a good explanation for why this is.

Probably because it's doing a binary search over a large array? I wouldn't
expect that to be particularly fast with large arrays, since large array +
binary search == lots of cache misses.

There is a way (I believe it's called the Eytzinger layout) to order the items
of the array such that for N items, the item that would be at index N/2 in a
sorted array is at index 0, then the items that would be at N/4 and 3N/4 are
at indexes 1 and 2, and so on, which of course is much more cache-friendly
when doing a binary search.
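
A minimal sketch, in case it helps (untested; assumes the input is already
sorted): an in-order walk of the implicit tree consumes the sorted input in
order, and the search then descends by child index rather than by halving a
range.

    #include <cstddef>
    #include <vector>
    
    // Lay out a sorted array in BFS order: out[0] is the root (median),
    // out[1] and out[2] its children (the quartiles), and so on.
    // Usage: std::vector<int> out(sorted.size()); fill(sorted, out);
    std::size_t fill(const std::vector<int>& sorted, std::vector<int>& out,
                     std::size_t i = 0, std::size_t k = 0) {
        if (k < out.size()) {
            i = fill(sorted, out, i, 2 * k + 1);  // left subtree
            out[k] = sorted[i++];                 // this node
            i = fill(sorted, out, i, 2 * k + 2);  // right subtree
        }
        return i;
    }
    
    // Search descends the implicit tree: children of k are 2k+1 and 2k+2,
    // so the hot top levels sit packed together at the front of the array.
    bool contains(const std::vector<int>& t, int x) {
        std::size_t k = 0;
        while (k < t.size()) {
            if (t[k] == x) return true;
            k = 2 * k + (x < t[k] ? 1 : 2);
        }
        return false;
    }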

~~~
Veedrac
The issue is not just that it's ‘slow’, but that it's frequently slower even
than std::map. My suspicion is that the issue is its use of power-of-2 sizes,
which is a terrible idea for binary search because it causes cache line
aliasing. This is discussed in a lot of detail by Paul Khuong.

https://www.pvk.ca/Blog/2012/07/30/binary-search-is-a-pathological-case-for-caches/

> Even without that hurdle, the slowdown caused by aliasing between cache
> lines when executing binary searches on vectors of (nearly-)power-of-two
> sizes is alarming. The ratio of runtimes between the classic binary search
> and the offset quaternary search is on the order of two to ten, depending on
> the test case.

The approaches there are likely to give significant speedups.
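
For intuition, here's a toy sketch of the aliasing (assuming a model 32 KiB,
8-way L1 with 64-byte lines, i.e. 64 sets): always taking the left branch of
a binary search over 2^24 ints probes the same cache set over a dozen times
in a row.

    #include <cstdio>
    
    int main() {
        // Toy cache model: ints, 64-byte lines, 64 sets.
        const long kElem = 4, kLine = 64, kSets = 64;
        long lo = 0, hi = 1L << 24;  // searching 2^24 elements
        while (lo < hi) {
            long mid = lo + (hi - lo) / 2;
            std::printf("probe index %8ld -> cache set %2ld\n",
                        mid, (mid * kElem / kLine) % kSets);
            hi = mid;  // always descend left: the worst-aliasing path
        }
    }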

~~~
kevingadd
Yeah, in my past experience, if the sizes for my B-tree were too close to
powers of 2, performance would tank. Adjusting them to be nowhere near a power
of two, while also not taking up too many cache lines, was the ticket for me.

------
saagarjha
> And unlike with vectors, I don’t know that I’ve ever come across code that
> retained map iterators…

I've done this once. The thing I was trying to do was have multiple threads
write to a std::unordered_map in such a way that each thread would write to
its allocated "bucket" and nothing else (roughly, each thread would "own"
map[thread_id] as a somewhat convoluted "thread local storage" which would
then be collected and operated on at some point in the future from the
"master" thread). It turns out that to the only way to actually do this in a
standards-compliant way is to grab an iterator pointing to map.find(thread_id)
prior to starting and use for subsequent modifications of the value, since
using the subscript operator is apparently not guaranteed to be thread safe
for associative containers.
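
Roughly, a simplified sketch of the pattern (names are made up):

    #include <thread>
    #include <unordered_map>
    #include <vector>
    
    int main() {
        std::unordered_map<int, std::vector<int>> per_thread;
        const int kThreads = 4;
    
        // Create every entry up front, single-threaded, so no rehashing
        // (and thus no iterator invalidation) can happen later.
        for (int id = 0; id < kThreads; ++id) per_thread[id];
    
        std::vector<std::thread> workers;
        for (int id = 0; id < kThreads; ++id) {
            // Grab the iterator before any concurrent access begins...
            auto it = per_thread.find(id);
            workers.emplace_back([it] {
                // ...then each thread touches only its own mapped value:
                // disjoint memory, no container mutation, no data race.
                for (int i = 0; i < 1000; ++i) it->second.push_back(i);
            });
        }
        for (auto& w : workers) w.join();
    }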

~~~
gpderetta
The subscript operator can modify the map if the key is not present, so it is
not const and thus formally not thread safe even if you are sure the key is
always present. at and find are always const, so you can use those instead; no
need to cache iterators.

~~~
saagarjha
> The subscript operator can modify the map if the key is not present, so it
> is not const and thus formally not thread safe even if you are sure the key
> is always present.

Yeah, I know why it is unsafe in general, but I don't see why the C++ standard
can't specify that operations that don't invalidate iterators are legal to
perform concurrently provided they touch disjoint memory locations (for
associative containers, subscripting would be one such operation, as long as
rehashing does not occur, as in this case).

> at and find are always const, so you can use those instead; no need to cache
> iterators.

Now that you mention it, I could have just called find every time. I guess I'm
too used to it being O(n) for other collections and avoided it somewhat
irrationally ;)

~~~
gpderetta
The wording could be changed. I guess it is just an issue of the committee
only having so much time; as there are alternatives here, getting the wording
right is not a priority, so falling back on the const/non-const thread-safety
guarantees is considered sufficient.

------
pjc50
People like to talk about C++ giving complete control over memory layout, but
there's one very important cache-relevant technique that I've not seen done
transparently: "column based storage".

That is, if you have class foo { int x, y, z; } and make an array or vector of
them, they will normally be laid out in that interleaved order. For locality,
you might want all the Xs to be together, i.e. three separate arrays of X, Y
and Z.
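
A minimal sketch of the two layouts:

    #include <vector>
    
    // Array-of-structs: x, y and z interleaved in memory.
    struct foo { int x, y, z; };
    std::vector<foo> rows;
    
    // "Column-based" struct-of-arrays: each field contiguous, so a pass
    // that only reads x touches a third as many cache lines.
    struct foo_columns {
        std::vector<int> x, y, z;
    };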

~~~
zozbot123
To be fair, the same is true of Rust. The issue is that with column-based
storage you can't have pointers/references to individual structs within the
vector. You need to use indexes.

~~~
AstralStorm
The thing is, in modern C++ you can fake a wrapper to access it like a pointer
or reference.

Of course it will not be high performance, but it can be done (e.g. the Eigen
library).

~~~
smitherfield
Why wouldn't an implementation along these lines be performant?

    
    
      #include <cstddef>
      #include <tuple>
      #include <utility>
      #include <vector>
      
      template<typename... Ts>
      class SoA : public std::tuple<std::vector<Ts>...> {
              // ...
              // Gather the i-th element of every column into a tuple of
              // references.
              template<std::size_t... Is>
              std::tuple<Ts&...> subscript(std::size_t i,
                                           std::index_sequence<Is...>) {
                      return {std::get<Is>(*this)[i]...};
              }
      public:
              // ...
              // Yields a proxy tuple of references, not a materialized struct.
              auto operator[](std::size_t i) {
                      return subscript(i, std::index_sequence_for<Ts...>{});
              }
      };

------
rwbt
std::deque is another useful 'hybrid' container: it performs close to
std::vector, or better in some cases, when you need to 'grow' the container
without reallocating memory so many times. But the only good cache-friendly
implementation seems to be Clang's libc++ (memory is allocated in 4 KiB
blocks). MSVC's deque implementation is horrible (8-byte blocks) and GCC's is
also not that great (512-byte blocks).

~~~
svantana
8 bytes??? Are you sure you don't mean 8 * sizeof(class)? That's just
ridiculous.

Regardless, it would be nice if one could specify the block size as an
argument. I suppose I could write my own allocator, but it's just too much
hassle for such a simple thing as storage.

~~~
rwbt
If I remember correctly, in the MSVC implementation, if sizeof(T) is greater
than 8 bytes then each 'block' is just one element (and blocks aren't
contiguous with each other). So effectively, the MSVC implementation is
useless.

Boost does have a deque container where you can specify the block size using
type traits, but in my experience Boost's deque wasn't such a good
implementation performance-wise.

Clang's libc++ implementation, though, is very good performance-wise and even
beats std::vector in some cases.

~~~
jstimpfle
How would you even link to the next block with just 8 bytes...

~~~
gpderetta
There is no next pointer in a deque block; the pointers to each block are in a
separate vector.
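
Schematically (a simplified sketch, not any particular implementation):

    #include <cstddef>
    #include <vector>
    
    template <typename T, std::size_t BlockElems>
    struct DequeSketch {
        std::vector<T*> block_map;  // one pointer per fixed-size block
        // ... plus begin/end offsets into the first and last blocks
        T& operator[](std::size_t i) {
            return block_map[i / BlockElems][i % BlockElems];
        }
    };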

------
LHxB
Maybe this would look different in a logarithmic plot, but it seems to me
that std::vector performs quite well, in particular for few elements. So why
would I bother with SmallVector if plain std::vector is on par in its
(supposed) prime discipline?

~~~
shereadsthenews
Vectors can be slow if you create and destroy them a lot, since they allocate.
You can work around this to some extent by providing a custom allocator, but
using something like SmallVector or absl::InlinedVector can be much faster
when N is known.
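
For example (assuming Abseil is available; llvm::SmallVector is used the same
way):

    #include "absl/container/inlined_vector.h"
    
    void hot_path() {
        // The first 16 elements live inline in the object itself, so
        // creating and destroying this costs no heap allocation at all.
        absl::InlinedVector<int, 16> scratch;
        for (int i = 0; i < 10; ++i) scratch.push_back(i * i);
    }   // trivially cheap teardown if we never spilled to the heap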

~~~
AstralStorm
Or if it's a fully static size, just use std::array. You'd be surprised how
often people use vectors for data with a known static size or static maximum
size instead.
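
E.g.:

    #include <array>
    
    // Size known at compile time: no heap allocation, no capacity
    // bookkeeping; it lives wherever the enclosing object or frame lives.
    std::array<float, 3> rgb{0.2f, 0.7f, 0.1f};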

~~~
smitherfield
Yeah, that's one of my biggest pet peeves when looking at other people's code
(along with unnecessary dynamic allocations in general). One of the reasons I
perhaps irrationally still prefer C++ to Rust is the latter's pervasive use of
dynamic arrays of known static size in its documentation, and how it makes
fixed-size arrays much less ergonomic to use than dynamic ones.

------
wmu
I'd also love to see a comparison with B-trees
(https://code.google.com/archive/p/cpp-btree/). In my tests a few years ago,
that implementation was faster than std::map, and my guess is that the reason
is better use of the cache.

~~~
jstimpfle
std::map is usually implemented as a red-black tree, which is basically a
B-tree with branching factor 2. That is less cache-friendly than higher
branching factors, so it's entirely expected to be slower than B-trees. On the
upside, it offers stable pointers, which higher branching factors cannot
offer.

------
jstimpfle
> llvm::SmallVector [...] this class preallocates some amount of data (the
> actual size of which is configurable via a template parameter) locally on
> the stack, and only performs a (slow!) heap allocation if the container
> grows beyond that size. Because malloc is slow, and traversing a pointer to
> get to your data is slow, the SSO heap storage is a double-win.

But you need to "traverse" a pointer for a stack-allocated array just as well!
So it's mainly that in some cases this class helps forgo a malloc, which I am
not sure is that much of a win, especially given that this implementation is
another class that requires more (and more complicated!) object code...

What I think might be better in many situations is allocating the dynamic
array once, up front, so that it doesn't need to be allocated each time the
function is called. The preallocated array can be a function parameter (or an
object member, for OOP weenies :>), or simply a global variable.
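
A sketch of what I mean (hypothetical names):

    #include <vector>
    
    // The scratch array outlives the function: allocated (at most) once
    // and reused on every call, instead of malloc'd and freed each time.
    void collect_squares(int n, std::vector<int>& out) {
        out.clear();  // keeps the existing capacity
        for (int i = 0; i < n; ++i) out.push_back(i * i);
    }
    
    int main() {
        std::vector<int> scratch;
        scratch.reserve(1024);              // one up-front allocation
        for (int iter = 0; iter < 1000; ++iter)
            collect_squares(100, scratch);  // no malloc on the hot path
    }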

~~~
jcelerier
> But you need to "traverse" a pointer for a stack-allocated array just as
> well!

By the time you traverse this pointer, its pointed-to memory location is
almost certainly already in your L1 CPU cache, since you are accessing the
pointer through its parent struct, while a std::vector's storage can be in any
random place in memory.

In my own code, using SmallVector judiciously makes _drastic_ performance
differences and lets me forgo an immense number of memory allocations.

~~~
jstimpfle
Yes, the cache issue is why I was suggesting to instead exercise a little more
control over where the heap-allocated array is actually located. By choosing a
lifetime for the array that is longer than that of the function using it, one
may already find that the thing is cached. Another option could be to require
a temporary array argument from the caller. Or (as someone else suggested)
delegating the problem to the GC (which I assume might be able to use "young
generation" memory).

~~~
s3cur3
In my experience, the big benefit from the SSO vector comes from making it the
default and making it easy to use. If I emailed my team to ask them to rethink
all their short-lived vectors, no one would change anything... but if I say
“prefer this SSO vector to the std one,” they can actually adopt that without
significantly changing the way they write code. It’s as much a social problem
and developer productivity problem as it is about what’s actually best.

~~~
jstimpfle
I think you could email them to rethink whether short-lived things are in
general a good idea with regard to a) performance and b)
maintainability/overseeability/debuggability/flexibility.

IMHO short-lived by default is the wrong practice, and it's mostly a
consequence of the misapplied ideology that everything should have as tiny a
lexical scope as possible (e.g. make stack variables where possible). And it's
a consequence of OOP thinking, where there isn't really a concept of "memory
you can use when you need it" but only "objects that are always constructed".
And whenever an object comes into or out of existence, work must be done to
make that transition official!

If matters were as simple as using SSO vectors by default (or almost always),
then new languages would choose SSO-optimized data structures almost
everywhere. But is it actually the case that almost all data fits in arrays of
length < 16 or so? I don't think so, and using SSO data structures produces
more complicated object code and is _slower_ in the larger cases.

------
kccqzy
How does DenseMap compare to, say, hashbrown (in Rust) or Google's SwissTable
(in C++)?

~~~
s3cur3
I don't have the know-how to write a Rust comparison (I'd happily accept a
pull request from someone who does!), but I can take a to-do to add SwissTable
benchmarks. :)

