Benchmarks of Cache-Friendly Data Structures in C++ (tylerayoung.com)
126 points by s3cur3 86 days ago | 66 comments



> Lookup in the ArrayMap is shockingly slow at large data sizes—again, I don’t have a good explanation for why this is.

Probably because it's doing a binary search over a large array? I wouldn't expect that to be particularly fast with large arrays, since large array + binary search == lots of cache misses.

There is a way (the name escapes me, unfortunately) to order the items of the array such that, for N items, the item that would be at index N/2 in a sorted array is at index 0, the items that would be at N/4 and 3N/4 are at indexes 1 and 2, and so on. That layout is of course much more cache-friendly when doing a binary search.


The issue is not just that it's ‘slow’, but that it's frequently slower even than std::map. My suspicion is that the issue is its use of power-of-2 sizes, which is a terrible idea for binary search because it causes cache line aliasing. This is discussed in a lot of detail by Paul Khuong.

https://www.pvk.ca/Blog/2012/07/30/binary-search-is-a-pathol...

> Even without that hurdle, the slowdown caused by aliasing between cache lines when executing binary searches on vectors of (nearly-)power-of-two sizes is alarming. The ratio of runtimes between the classic binary search and the offset quaternary search is on the order of two to ten, depending on the test case.

The approaches there are likely to give significant speedups.


Yeah, in my past experience, if the sizes for my B-tree were too close to powers of 2 the performance would tank. Adjusting them to be nowhere near a power of two, while also not taking up too many cache lines, was the ticket for me.


The name is Eytzinger layout.
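
A quick sketch of building it from a sorted array and searching it (my own illustration, not from the article):

  #include <cstddef>
  #include <vector>

  // Build the Eytzinger (BFS) order from a sorted array: an in-order walk of the
  // implicit tree (children of node i at 2i+1 and 2i+2) assigns the sorted values,
  // so the overall median lands at index 0, the quartile elements at indexes 1 and
  // 2, and so on. out must already be sized to sorted.size(); start with next == 0.
  void build_eytzinger(const std::vector<int>& sorted, std::vector<int>& out,
                       std::size_t& next, std::size_t node = 0) {
    if (node >= out.size()) return;
    build_eytzinger(sorted, out, next, 2 * node + 1);  // left subtree
    out[node] = sorted[next++];                        // this node
    build_eytzinger(sorted, out, next, 2 * node + 2);  // right subtree
  }

  // Binary search over the Eytzinger-ordered array: the first few probes all land
  // in the first couple of cache lines instead of being scattered across the array.
  bool eytzinger_contains(const std::vector<int>& a, int key) {
    std::size_t i = 0;
    while (i < a.size()) {
      if (a[i] == key) return true;
      i = (key < a[i]) ? 2 * i + 1 : 2 * i + 2;
    }
    return false;
  }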


Excellent, thanks!


Right... this result is one of those examples of why you don't guess, you benchmark, because my intuition is that binary search should be "really fast." Clearly though the cache misses hurt a lot more than my intuitions capture.


Any idea why it is not using B-trees?


Oh, the sorted array-backed "map" isn't using B-trees because I wanted it as a comparison point (to see how the LLVM containers fare against the old "just use a vector" advice). So the ArrayMap isn't production-ready or anything... it's intended to give a baseline against which to compare what would happen if you wrote your own dead-simple map replacement with the hope of improving perf.


> And unlike with vectors, I don’t know that I’ve ever come across code that retained map iterators…

I've done this once. The thing I was trying to do was have multiple threads write to a std::unordered_map in such a way that each thread would write to its allocated "bucket" and nothing else (roughly, each thread would "own" map[thread_id] as a somewhat convoluted "thread local storage" which would then be collected and operated on at some point in the future from the "master" thread). It turns out that the only way to actually do this in a standards-compliant way is to grab the iterator returned by map.find(thread_id) prior to starting the threads and use it for subsequent modifications of the value, since using the subscript operator is apparently not guaranteed to be thread safe for associative containers.
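
Roughly, the pattern looked something like this (a simplified sketch with made-up names, not the actual code):

  #include <thread>
  #include <unordered_map>
  #include <vector>

  int main() {
    // One pre-inserted entry per thread; grab the iterators before any thread
    // starts, so no insertion (and hence no rehash) happens later.
    std::unordered_map<int, std::vector<int>> per_thread;
    constexpr int kThreads = 4;
    std::vector<std::unordered_map<int, std::vector<int>>::iterator> slots;
    for (int t = 0; t < kThreads; ++t)
      slots.push_back(per_thread.emplace(t, std::vector<int>{}).first);

    // Each thread mutates only its own mapped value through its iterator.
    // Concurrent access to *different* elements of the same container is fine;
    // calling operator[] concurrently would not be, because it may insert.
    std::vector<std::thread> workers;
    for (int t = 0; t < kThreads; ++t)
      workers.emplace_back([it = slots[t], t] { it->second.push_back(t); });
    for (auto& w : workers) w.join();
  }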


The subscript operator can modify the map if the key is not present, so it is not const and thus formally not thread safe even if you are sure the key is always present. at() and find() are always const, so you can use those instead; no need to cache iterators.


> The subscript operator can modify the map if the key is not present, so it is not const and thus formally not thread safe even if you are sure the key is always present.

Yeah, I know why it is unsafe in general, but I don't see why the C++ standard can't specify that operations that don't invalidate iterators are legal to perform concurrently provided they touch disjoint memory locations (for associative containers, subscripting would be one such operation, as long as rehashing does not occur, as in this case).

> at() and find() are always const, so you can use those instead; no need to cache iterators.

Now that you mention it, I could have just called find every time. I guess I'm too used to it being O(n) for other collections and avoided it somewhat irrationally ;)


The wording could be changed. I guess it is just an issue of the committee only having so much time; as there are alternatives here, getting the wording right is not a priority, so the fallback to the const/non-const thread safety guarantees is considered sufficient.


If you didn't like caching the map iterator, you could have cached the pointer to the value object instead i.e. Foo* cached_value = &mymap[this_thread_id]. Although caching the map iterator sounds perfectly fine to me.


Isn't the whole point of iterators to prevent the above strategy from crashing your program when the map is moved to another region of memory?


I'm not exactly sure what you mean by "the map is moved to another region of memory", but a pointer to the value object in a map entry is safe (or unsafe) in exactly the same situations as an iterator into the map. Under the hood, the iterator is likely to be a pointer to a tree node struct that contains the key and value objects (along with pointers to other relevant nodes).


People like to talk about C++ giving complete control over memory layout, but there's one very important cache-relevant technique that I've not seen done transparently: "column based storage".

That is, if you have class foo { int x, y, z; } and make an array or vector of them, then they will normally be laid out in that order. For locality, you might want to have all the Xs together, i.e. three separate arrays of X, Y, and Z.
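
For concreteness, a rough sketch of the two layouts:

  #include <vector>

  // Array of structs: x, y, z of one element are adjacent in memory.
  struct Foo { int x, y, z; };
  std::vector<Foo> aos;

  // Struct of arrays: all the Xs are contiguous (likewise Y and Z), which is
  // friendlier to the cache (and to SIMD) when a pass only needs one field.
  struct Foos {
    std::vector<int> x, y, z;
  };
  Foos soa;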


That is called a "struct of arrays" (vs an "array of structs").

https://en.wikipedia.org/wiki/AOS_and_SOA

I don't know of any language that transparently supports it other than Jai, which isn't available yet.

https://github.com/BSVino/JaiPrimer/blob/master/JaiPrimer.md...


Haskell's "unboxed" vectors do this on a library level so `Vector (a, b)` is stored as `V_2 {-# UNPACK #-} !Int !(Vector a) !(Vector b)`. But to implement it for new types you need some boilerplate.

Alternatively, you could use Template Haskell or CPP to generate the boilerplate, though. The default instances use CPP: https://github.com/haskell/vector/blob/master/internal/unbox...


Jovial 73 has built-in support for SoA arrays.


C++ does give you the power to lay out your data members how you would like. But in this case you've just said that you want those three members to be laid out sequentially in memory. If you want to use column storage because you need to apply a specific transformation to the data, then lay it out that way.

In the vast majority of cases having classes be automatically laid out in column based storage would be a detriment and not an advantage. You would actually be fighting against the cache in those cases.

For example I've seen rendering engines go from storing data in AoS to SoA, then back, then back again, then back again all depending on the hardware and tech stack available.


There are libraries to do it almost transparently: https://github.com/crosetto/SoAvsAoS


To be fair, the same is true of Rust. The issue is that with column-based storage you can't have pointers/references to individual structs within the vector. You need to use indexes.


This is just the old array of structs vs struct of arrays trade-off, right? This is true of pretty much any programming language's low level data structures.


Tell me about it. I tried for several months to build a way around this in Rust (i.e. have a generic interface to both columnar and row-oriented layouts and "just" switch to optimized operations later).

I think it is impossible. The closest thing is to use an NDArray and pick a winner/default layout... which is row-oriented, despite my intention to be columnar-first, and later develop an alternative, as a complete rewrite, to support columnar storage.


The thing is, in modern C++ you can fake a wrapper to access it like a pointer or reference.

Of course it will not be high performance, but it can be done. (E.g. the Eigen library.)


Why wouldn't an implementation along these lines be performant?

  #include <cstddef>
  #include <tuple>
  #include <utility>
  #include <vector>

  using namespace std;

  template<typename... Ts>
  class SoA : public tuple<vector<Ts>...> {
          // ...
          template<size_t... Is>
          tuple<Ts&...> subscript(size_t i, index_sequence<Is...>) {
                  // gather the i-th element of every column into a tuple of references
                  return {get<Is>(*this)[i]...};
          }
  public:
          // ...
          auto operator[](size_t i) {
                  return subscript(i, index_sequence_for<Ts...>{});
          }
  };
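
Usage would look roughly like this (relying on the tuple base for per-column access, since the sketch above omits push_back):

  SoA<int, float> soa;
  get<0>(soa).push_back(42);      // the int column (via the tuple base)
  get<1>(soa).push_back(3.14f);   // the float column
  auto [i, f] = soa[0];           // tuple of references, one per column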


You can however do this pretty easily in your class definition in a way that is transparent to callers. The "transparent to callers" part is pretty common to any OO model, while the "pretty easily in your class definition" is I think what most people say when they mean "complete control" -- you get much higher degrees of control than you do with, say, Java, though it will be effort beyond the default case.

I use this very approach in a code base I'm working on right now. Some object members are stored in the object and some are not, but they all look like class members to the callers.


On the other hand, if you do random accesses to an index but need x, y, and z, then a struct of arrays as you propose will incur three cache misses instead of one for an array of structs. It really depends on the application which is preferable.


std::deque is another useful 'hybrid' container that performs close to std::vector, or better in some cases, when you need to 'grow' the container without reallocating memory so many times. But the only good cache-friendly implementation seems to be in Clang's libc++ (memory is allocated in 4KiB blocks). MSVC's deque implementation is horrible (8 byte blocks) and GCC's is also not that great (512 byte blocks).


8 bytes??? Are you sure you don't mean 8 * sizeof(class)? That's just ridiculous.

Regardless, it would be nice if one could specify the block size as an argument. I suppose I could write my own allocator, but it's just too much hassle for such a simple thing as storage.


If I remember correctly, in the MSVC implementation, if sizeof(T) is greater than 8 bytes then each 'block' is just one element (also, blocks aren't contiguous). So effectively, the MSVC implementation is useless.

Boost does have a deque container where you can specify the block size using type traits, but in my experience Boost's deque wasn't such a good implementation performance-wise.

But Clang's libc++ implementation is very good performance-wise and even beats std::vector in some cases.


How would you even link to the next block with just 8 bytes...


There is no next pointer in a deque block; the pointers to each block are in a separate vector.


Worse, this is not on the table to be fixed any time soon in MSVC, because the change would break ABI compatibility.

Every time someone says "just use a custom allocator," I sigh inside... it's so much overhead that almost no one does it, and even fewer do it correctly. :(


Maybe in a logarithmic plot this would look different, but it seems to me that std::vector performs quite well, in particular for few elements. So why would I bother with SmallVector if plain std::vector is on par in its (supposed) prime discipline?


Vectors can be slow if you create and destroy them a lot, since they allocate. You can work around that to some extent by providing a custom allocator, but using something like SmallVector or absl::InlinedVector can be much faster when the N is known.
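
For example, a sketch (llvm::SmallVector lives in llvm/ADT/SmallVector.h):

  #include "llvm/ADT/SmallVector.h"

  // Up to 8 ints live inline in the object itself (typically on the stack);
  // only the 9th push_back triggers a heap allocation.
  void collect(int n) {
    llvm::SmallVector<int, 8> v;
    for (int i = 0; i < n; ++i)
      v.push_back(i);   // no malloc at all while n <= 8
  }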


Or if it's a fully static size, just use std::array. You'd be surprised how often people use vectors for known static sizes or static max sizes instead.


Yeah, that's one of my biggest pet peeves when looking at other people's code (along with unnecessary dynamic allocations in general). One of the reasons I perhaps irrationally still prefer C++ to Rust is the pervasive use of dynamic arrays of known static size in the latter's documentation, and how it makes fixed-size arrays much less ergonomic to use than dynamic arrays.


Sure. Horses for courses. Array is great when you have exactly N. InlinedVector is handy when you have <N 99% of the time.


I tried this at some points where I often had to allocate dynamic memory but knew an upper bound (no static size, though). Got some nice (and easy) extra performance; thank you (and the others who replied to my comment) very much for the hint!


I'm in games, and we really do have lots of vectors of maybe 8 or so elements. The small wins add up across the entire codebase.


Because you are developing LLVM or a similar performance critical project with lots of small vectors.


I'd love to also see a comparison with B-trees (https://code.google.com/archive/p/cpp-btree/). In tests I did a few years ago, this implementation was faster than std::map. And my guess is the reason is better use of the cache.


std::map is usually implemented as a red-black tree, which is basically a B-tree with branching factor 2. This is less cache-friendly than higher branching factors, so it's entirely expected to be slower than B-trees. On the upside, it offers stable pointers, which higher branching factors cannot offer.


Hm, I'll look into it... :)


> llvm::SmallVector [...] this class preallocates some amount of data (the actual size of which is configurable via a template parameter) locally on the stack, and only performs a (slow!) heap allocation if the container grows beyond that size. Because malloc is slow, and traversing a pointer to get to your data is slow, the SSO heap storage is a double-win.

But you need to "traverse" a pointer for a stack-allocated array just as well! So it's mainly that in some cases this class helps forgo a malloc, which I am not sure is that much of a win, especially given that this implementation is another class that requires more (and more complicated!) object code...

What I think might be better in many situations is preallocating the dynamic array so that it doesn't need to be allocated each time the function is called. The preallocated array can be a function parameter (or object member, for OOP weenies :>), or simply a global variable.
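
Something like this, for instance (just a sketch):

  #include <vector>

  // The caller owns a long-lived scratch buffer; the hot function reuses its
  // capacity instead of allocating a fresh vector on every call.
  void even_values(const std::vector<int>& input, std::vector<int>& scratch) {
    scratch.clear();               // keeps the existing capacity
    for (int x : input)
      if (x % 2 == 0)
        scratch.push_back(x);      // no allocation once capacity suffices
  }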


> But you need to "traverse" a pointer for a stack-allocated array just as well!

By the time you traverse this pointer, its pointed-to memory location is almost certainly already in your L1 cpu cache since you are accessing this pointer from its parent struct, while for std::vector it can be in any random place in your memory.

In my own code, using SmallVector judiciously makes a drastic performance difference and lets me forgo an immense number of memory allocations.


Yes, the cache issue is why I was suggesting instead to just exercise a little more control over where the heap-allocated array is actually located. By choosing a longer lifetime for the array than that of the function that uses it, one might already ensure that it stays cached. Another option could be to require a temporary array argument from the caller. Or (as someone else suggested) delegate that problem to the GC (which I assume might be able to use "young generation" memory).


In my experience, the big benefit from the SSO vector comes from making it the default and making it easy to use. If I emailed my team to ask them to rethink all their short-lived vectors, no one would change anything... but if I say “prefer this SSO vector to the std one,” they can actually adopt that without significantly changing the way they write code. It’s as much a social problem and developer productivity problem as it is about what’s actually best.


I think you could email them to rethink whether short-lived things are in general a good idea with regard to a) performance and b) maintainability/overseeability/debuggability/flexibility.

IMHO short-lived by default is a wrong practice, and it's mostly a consequence of the misapplied ideology that everything should have as tiny a lexical scope as possible (e.g. make stack variables where possible). And it's a consequence of OOP thinking, where there isn't really a concept of "memory you can use when you need it" but only "objects that are always constructed". And whenever an object comes into or out of existence, work must be done to make that transition official!

If matters were as simple as using SSO vectors by default (or almost always), then new languages would choose SSO-optimized data structures almost everywhere. But is it actually the case that almost all data fits in arrays of length < 16 or so? I don't think so, and using SSO data structures causes more complicated object code and is slower in the larger cases.


> But you need to "traverse" a pointer for a stack-allocated array just as well!

No. Iterating a small stack array is much faster than iterating a heap array.

By "traverse" the author means you have to follow the pointer out to main memory to read the data. Another term they might have used is "fetch".


> Instead of "traverse" the author means you have to follow the pointer out to main memory to read the data. Another term they might have used is "fetch".

No, I got that. But stack memory is main memory as well, and it is accessed through a pointer (in this case the stack pointer) just like heap memory. That is, unless the stack memory is not really stack memory, but optimized by the compiler to be register-only, which is mostly not possible for stack-allocated arrays (when they are indexed with runtime indices, or when they are simply too large).

It's true that stack memory is (almost) always in the cache, which is why stack memory is considered fast. But couldn't that be mostly true for temporary heap storage as well?


Aside from cacheability, there is still an extra indirection. If you have a std::vector on the stack, to access the first element you need to dereference the stack pointer, at an offset, to fetch the heap pointer and then dereference it. The two loads are in a dependency chain. This is true for the first load at least; for subsequent loads the heap pointer might already be in a register.

For a small vector you still need to dereference the stack pointer to load the heap pointer/discriminant, then perform a conditional jump on it, and then, if the vector is stack-allocated, perform a further load at an offset from the stack pointer. Thanks to branch prediction, the second load does not depend on the first, so the dependency chain is shorter, which might make a difference in some scenarios.


Honestly the SmallVector case sounds more complicated. As to the dependency chain in the standard vector version, the heap pointer should easily be kept in a register. And the SmallVector has such a dependency just as well: the discriminant must be loaded and tested first. In both cases it shouldn't matter much since it's expected to be optimized to a one-time cost.


Sure, the penalty is usually only paid for the first access to the vector, unless the compiler needs to spill the heap pointer.

The branch predicated on the test is executed speculatively, so the next load does not actually depend on the discriminant load and test, and the dependency chain is therefore shorter (one load instead of two). If the code is bottlenecked by something other than dependency chain length (for example, the number of outstanding loads) it of course doesn't matter, but dependency chains are usually the first bottleneck.


This is insightful. Thanks.


Thanks. Mind you, that's the theory; I haven't actually benchmarked the use case. But in general, reducing the length of the pointer chains to traverse is often a low-hanging fruit of performance optimization.


I've just noticed that while the thing you mentioned about branch prediction might well often work out, the compiler nevertheless needs to emit additional object code for the discriminant test whenever it cannot infer whether an access is "small" or "large". It might mean a significant increase in code size. Of course I didn't measure anything, either.


It depends on the implementation. Using a discriminant plus branch is one way to do it, but another implementation is to just always use a data pointer, which points into the embedded array in the small vector case.

Now there is no branch, and in fact the element access code is identical to the std::vector case. Only a few, less frequently called paths need to be aware of the small vs. large cases, such as the resize code (which needs to not free the embedded buffer).
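
A minimal sketch of that layout (simplified; a real implementation would use raw/uninitialized storage for the embedded buffer rather than a plain array):

  #include <cstddef>

  // data_ always points at the current storage, which is the embedded buffer
  // while the vector is small, so element access never branches on small vs.
  // large; only growth/destruction need to care.
  template <typename T, std::size_t N>
  class SmallVec {
    T* data_;                          // &buf_[0] or a heap allocation
    std::size_t size_ = 0, capacity_ = N;
    T buf_[N];                         // simplified: assumes T is default-constructible
  public:
    SmallVec() : data_(buf_) {}
    T& operator[](std::size_t i) { return data_[i]; }   // same shape as std::vector access
    bool is_small() const { return data_ == buf_; }
    // push_back/grow/destructor omitted: grow() would allocate on the heap,
    // move the elements over, and must not free buf_ on that transition.
  };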

The main downsides are that you have added back the indirection in the small vector case (indirection is unconditional), and that you "waste" the 8 bytes for the data pointer in the small case, as the discriminant version can reuse that for vector data.

IMO the observed performance improvements are basically 100% due to the avoidance of malloc and, in the real world (but not this benchmark), due to better cache friendliness of on-stack data, but not because any vector functions such as element access are substantially sped up. That explains why the benefits disappear (in a relative sense) pretty quickly as the element count increases, even for the large-embedded-buffer case where all of the elements are still on the stack: the malloc benefit is (roughly) a one-time benefit per vector, not a "per access" or "per element" benefit, so as the vector grows, the relative benefit diminishes and the non-malloc operation costs come to dominate.


The thing is that a cache line is 64 bytes, and loading from main memory to cache is orders of magnitude slower than from cache to register.

In the std::vector case, it's loading 64 bytes into cache just to get the heap pointer. Then the heap pointer tells it which other 64 bytes to load.

In the SmallVector case, the 64 bytes that contain the heap pointer are likely to include the entire vector, entirely eliminating the second load.


Since it's stack memory, in both cases you can assume that the struct (whether or not it includes a small array) is already loaded in the cache. The question is: can't you ensure that the heap memory of the vector version is also in the cache if you just care a tiny little bit about data organization?


Sure, it can be the case that both the data and the vector struct are in cache (in different lines). Analyses like "stack is in cache, but heap is in main memory" are too simple to capture the subtleties of real programs. In general, for a benchmark, you can expect everything to be in cache, unless the data is large enough that it doesn't fit, which usually means that it won't fit regardless of small-vector optimization or not.

That said, purely from a memory access and cache use point of view, it is more or less strictly better to pack all the vector data (data and metadata) together as the small vector does: you'll always only bring in the one cache line containing everything. In the split stack + heap case [1], sure, both lines might be in cache, so the access time might be the same, but you still need to have both of those lines in the cache, so the footprint is bigger. So at best it can be "tied", but at some point you'll suffer misses with this strategy that you wouldn't suffer if everything were co-located. It follows directly from the general rule that you want to pack data accessed together close together.

---

[1] Of course it might be heap + heap since the std::vector itself might be allocated on the heap but it doesn't really change the analysis.


When I said main memory I meant failing to get a cache hit.

If you’re hitting the allocator (slow!) then odds are pretty good those bytes aren’t fresh in the cache. They could be if you just read them and then freed them and then the allocator handed them right back. But that’s probably not a good pattern either.

Another form of this pattern is the small string optimization. It’s exceptionally common. https://blogs.msmvps.com/gdicanio/2016/11/17/the-small-strin...


> If you’re hitting the allocator (slow!) then odds are pretty good those bytes aren’t fresh in the cache.

Incidentally, this is a great reason why GCs are useful even if you don't need the automatic memory management - a GC can allocate from memory that is always in cache, and they can do it in a couple of cycles.


How does DenseMap compare to, say, hashbrown (in Rust) or SwissTable (in C++) by Google?


I don't have the know-how to write a Rust comparison (I'd happily accept a pull request from someone who does!), but I can take a to-do to add SwissTable benchmarks. :)



